The Ontological Nature of Subject Taxonomies

Christopher A. Welty

Computer Science Dept.
Vassar College
Poughkeepsie, NY 12604-0462
Tel: (914) 437-5992
Fax: (914) 437-7498
weltyc@cs.vassar.edu

Subject-based classification is an important part of information retrieval and has a long history in libraries, where a subject taxonomy was used to determine the location of books on the shelves. We have been studying the notion of subject itself in order to determine a formal ontology of subject for a large-scale digital library card catalog system. Deep analysis reveals considerable ambiguity in the usage of subjects in existing systems and terminology, and we attempt to formalize these notions in a single framework for representing them.


Subject Areas: Subject based classification, digital libraries, description logics.


1 Introduction

Until recently, library card catalog systems have worked successfully because the amount of material referenced by the system was fairly small. Digital libraries, whether formal, as in the United States National Digital Library, or informal, as in the World Wide Web, promise billions of electronic documents and will render the existing card catalog paradigm useless. This has already begun: web users frequently get thousands or more matches on keyword searches and cannot find a useful method to prune the results further. The overwhelming amount of information has made it inaccessible, and no large-scale digital library will succeed unless this problem is addressed.

Knowledge Representation has a lot to offer the efforts to solve these problems, as part of the solution is a deeper understanding by these search engines of the available information. A critical part of clarifying this understanding is ontological analysis of the relevant concepts. We have been working for several years on a new ontology for digital library card catalog systems [Welty, 1994][Welty, 1996]. Central to this research has been the issue of subjects and their taxonomic nature, which was inspired by the success of the Yahoo! web directory.

2 Description Logics

This project has been focusing on description logics as a representation medium [Baader, et al., 1991]. The main reason for this choice is that description logics are particularly good at handling taxonomic information efficiently, and it was the success of the Yahoo! subject taxonomy that inspired this project to begin with.

In addition to handling taxonomies efficiently, description logics are excellent at allowing succinct expression of the reasons for subsumption. We want to be able not just to say what the taxonomy is, but why. For example, why is a "biography" a more specialized notion than "non-fiction"? In a description logic, we say that a "biography is non-fiction which is about a person":

(define-concept biography (and non-fiction (all about person)))

The semantics of the logic makes this a sufficient condition for membership in the biography class. That is, anything that is non-fiction and is about a person will automatically be classified as a biography.
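
The effect of such a sufficient condition can be illustrated with a small Python sketch. This is not CLASSIC and not the system described here; the Individual class, the classify function, and the sample data are all hypothetical. The sketch also assumes the listed fillers of the about role are complete (a closed role), which a real description logic would not assume by default.

```python
from dataclasses import dataclass, field


@dataclass
class Individual:
    """Hypothetical stand-in for a description logic individual."""
    name: str
    concepts: set = field(default_factory=set)  # concepts asserted at data entry
    about: list = field(default_factory=list)   # fillers of the "about" role


def is_biography(x):
    # Mirrors (and non-fiction (all about person)): x is non-fiction and every
    # filler of its about role is a person. The fillers are assumed complete,
    # so an empty list satisfies the ALL restriction vacuously.
    return "non-fiction" in x.concepts and all(
        "person" in filler.concepts for filler in x.about
    )


def classify(x):
    # Return the asserted concepts plus any concept whose sufficient
    # condition is met, even though nobody asserted it.
    inferred = set(x.concepts)
    if is_biography(x):
        inferred.add("biography")
    return inferred


hemingway = Individual("Hemingway", {"person"})
cars = Individual("cars", {"topic"})
life_story = Individual("Book-A", {"non-fiction"}, [hemingway])
car_manual = Individual("Book-B", {"non-fiction"}, [cars])
```

Here Book-A was entered only as non-fiction, yet classify(life_story) includes "biography", while classify(car_manual) does not.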

This capability is useful in several ways, but most importantly it assists in ensuring that the data entered is as accurate as possible. One of the biggest problems with the Yahoo! taxonomy is that it is populated by non-experts, who for various reasons do not always completely explore or understand the taxonomy before making their selection as to where a new entry belongs. It is very common to find something like a biography listed in the non-fiction category and not in the biography sub-category of non-fiction. This implies that a user searching for a biography who is exploiting the taxonomy to focus a search will miss that entry. Incorporating knowledge about, e.g. what makes a biography more specialized than a work of non-fiction will help to address this problem.

The purpose of this paper is not to explain description logics in general. Please see [Patel-Schneider and Swartout, 1993], and the wealth of information available via the description logic workshops [Padgham, 1997].

3 Library Card Catalogs

Classification has been a useful tool in the sciences for centuries, and this desire of the scientist to "identify and classify" was inherited by library science. In libraries, taxonomies of subjects have existed for a long time as a way to organize books.

3.1 Limits of the Existing Techniques

The notion of a subject, as discussed in Section 4, is quite complex and ambiguous. While library science has been dealing with subjects for a long time, it has not actually grappled with this issue in an epistemological way, because until recently it has had two fairly good simplifying assumptions:

1. Only books need to be classified by subject.

2. Books are only classified by subject in order to determine where they should be shelved, and a book can only be shelved in one place, therefore each book has only one subject.

These simplifying assumptions were also directly linked to the use of a card catalog system, and did not significantly change with the introduction of electronic card catalog systems. In fact, this technology only added the capability to do keyword searches of the old style cards. The underlying representation was not changed at all in order to take advantage of the possibilities: you can search by title, by author, and by subject, as you could fifty years ago.

This paradigm has worked well until recently because the scope of the searches was still rather small. Libraries have always been limited by physical space - only so many books will fit within the building. It has been the possibility of digital libraries that has finally outdated this paradigm. Searching by title, author, and very coarse-grained subjects simply won't do anymore. The Web has made it possible for anyone to be an author, and electronic data clearly opens the door to whole realms of search possibilities.

3.2 Subject Taxonomies

The taxonomic nature of subjects has always been understood in library science, yet mostly hidden from library users. Few on-line catalogs allowed users to exploit the taxonomy itself to support browsing or narrowing a search. It was not until the Yahoo! web service that general users were empowered to use the notion of specialization in narrowing the possibilities for a search. The success of Yahoo! clearly demonstrates that general users are perfectly able to understand and use taxonomic structures effectively.

Yahoo! has some problems, too. The subject taxonomy itself is highly coarse-grained, even more so than a library's. Its maintainers have not over-specified the taxonomy because they believe people adding data, who are in a rough sense advertising, will not make use of subject categories that are too specific, either because they do not understand the depth, because they do not take the time to explore the full depth, or because they themselves fear users will not find something that is overly deep in the taxonomy (believing e.g. that more people search the broader categories). This, as mentioned in Section 2, creates a further problem in which users who do make use of the full depth of the taxonomy miss information that was classified too generally (e.g. a person searching for "biographies" may miss a biography that was listed under "non-fiction").

3.3 Initial Goals

We seek to use KR&R principles to extend the library card catalog system in such a way that it can meet the needs of users of vast digital libraries. Among the principles of KR&R being applied, formal ontological analysis is considered of primary importance.

Our goals are broadly defined as:

1. To add the ability to reason with objects other than just books, such as people, places, organizations, events, etc.

2. To allow for classification of all objects by subject, such as an "AI Book" or an "AI Person".

3. To represent the relationships between objects, such as an "AI Person" who is "affiliated with" a "Communications Company".

4. To deepen the subject taxonomies to a much finer grain, such as "Book about formal ontologies".

5. To allow for multiple classification of objects under more than one subject, e.g. a book on "AI and Molecular Biology."

6. To make the subject taxonomy available to users, both in a browseable hierarchy as with Yahoo!, and as a follow-up option for searching (e.g. "Your search returned 1000 matches, you may broaden or narrow the search...").

7. To allow for the representation of fully marked up documents. This goal has numerous implications, and is not the focus of this paper, see [Ide, et al., 1997] or [Welty, et al., 1998] for more information.

The main reason for expanding the "card catalog" representation (beyond books) in this way is to support far more expressive searches, with queries such as "Publications written by people at Vassar College," "works of fiction reviewed by someone at Vassar who is interested in AI," or even "Papers on AI published at a conference sponsored by ACM." The increased expressiveness should serve to enhance the utility of the card catalog for three basic user models:

1. Users with vague notions about what the object of the search is. Library users frequently don't know the title or author of the book they are looking for, or even if it exists. For example, someone may be interested in reading about "Formal Ontology," and might search for a publication on that subject. If none were found, the user could generalize the search to, perhaps, "Knowledge Representation." The key to generalizing (or specializing) a search in this manner is access to the taxonomy, which is not the case in existing card catalog systems. Furthermore, if the user were aware that interesting research in formal ontologies was taking place at a particular institution, the user might search for "Publications on knowledge representation written by someone at Vassar College." Again, since we are dealing with a library containing potentially billions of publications, the ability to refine a search will be critical in making the information accessible.

2. Users doing scholarly research. We are working with two text-encoding groups [Welty, et al., 1998] who are in the business of marking up historical manuscripts and making them available in electronic form. Their primary customers are scholars who do not have direct access to the manuscripts (because of proximity, because the manuscript is fragile, etc.), and having the data in electronic form opens up vast new possibilities. Etymologists (people who study words) would like to make queries such as "what is the date of the first manuscript that uses the word 'cleave,'" while other researchers might ask for "Books by women authors in the 18th century containing mythological characters." It would be impossible to exaggerate the implications of this technology for scholarly research in the humanities.

3. Users with vague recollections of the object of the search. It is not uncommon for a library user to be searching for something they read or heard about at some time, but don't remember exactly what it was. The more information about each publication that can be provided, the better the chance that something the user remembers is actually recorded. For example, "The paper on KR written in the early '80s by someone at BBN," or, "A paper on planning at an AI conference in Florida." This category was, in fact, the initial motivation for this research.
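
The first user model above, in which an empty search is generalized through the taxonomy, can be sketched in Python as follows. The taxonomy fragment, the catalog entries, and all function names are hypothetical and chosen only for illustration; they are not taken from the card catalog ontology itself.

```python
# Hypothetical fragment of a subject taxonomy: child -> parent.
PARENT = {
    "Formal Ontology": "Knowledge Representation",
    "Knowledge Representation": "AI",
    "Planning": "AI",
    "AI": "Computer Science",
}

# Hypothetical catalog: publication -> subjects it is classified under.
CATALOG = {
    "Pub-1": {"Knowledge Representation"},
    "Pub-2": {"Planning"},
}


def subsumed_by(subject, ancestor):
    # True if subject equals ancestor or lies below it in the taxonomy.
    while subject is not None:
        if subject == ancestor:
            return True
        subject = PARENT.get(subject)
    return False


def search(subject):
    # Publications classified under subject or under any of its specializations.
    return sorted(
        pub for pub, subjects in CATALOG.items()
        if any(subsumed_by(s, subject) for s in subjects)
    )


def search_with_generalization(subject):
    # Walk up the taxonomy until some subject yields at least one match.
    while subject is not None:
        hits = search(subject)
        if hits:
            return subject, hits
        subject = PARENT.get(subject)
    return None, []
```

A search for "Formal Ontology" finds nothing in this toy catalog, so search_with_generalization falls back to "Knowledge Representation" and returns Pub-1; a direct search for "AI" returns both publications because both of their subjects specialize AI.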

The card catalog ontology itself is documented elsewhere [Welty, 1996], and we focus here on the specifics of the subject representation. It is important, however, to keep in mind when considering the discussion of subjects below that the key issue here, which represents dramatic improvement over existing card catalog systems, is the existence of objects other than books, such as people, places, events, organizations, etc.

4 Representing Subjects

The essence of ontological analysis is asking what something is. What, therefore, is a subject? What makes it a subject? What are the properties, necessary and sufficient? The answers to these questions reveal that the notion of a subject is very ambiguous.

4.1 Basic Usage

The first step in understanding subjects is to consider the way in which the notion is currently used. Books, for example, can be about a subject, e.g. "a book about cars". A person can be interested in a subject, e.g. "Chris is interested in cars." One does not typically say or think that a person is about a subject, nor that a book can be interested in a subject.

Books, or publications in general, may not be about a certain subject and yet be classified under that subject. A Sherlock Holmes novel is not about mystery, yet you would expect to be able to find it under that heading - it is a mystery book. This use of subjects is drastically different, as it implies an intensional interpretation. When something is about a subject, the interpretation seems extensional, as a relationship is being expressed between two objects: about(book1, cars). A book in a particular genre is normally thought of as being an instance: mystery("A Study in Scarlet"), and therefore the subject "mystery" becomes intensional. This is at odds with classifying a person's interests: it would be natural to state "Chris is interested in mystery" as interested-in(Chris, mystery), and not mystery(Chris).

Any discussion of the ontology of subjects can quickly become confused by the fact that, linguistically, any noun can be the subject of a sentence. This opens up the possibility for any noun to be a subject, whether naturally intensional or extensional. For example, "cars" is a category of objects which has instances, such as "Chris' car," and which we would naturally take as intensional. Cars can also be a subject, as in "a book about cars." Chris' car, as an instance of a car, would naturally be taken as extensional, yet it, too, could be a subject, as in, "a book about Chris' car."

4.2 Taxonomic Structure

One property of subjects that is widely recognized is that they have an inherent taxonomic structure. This is part of their appeal, and perhaps a defining characteristic. Most people understand that there are fiction and non-fiction books, that non-fiction books come in varieties such as tutorials, biographies, etc. What people are only starting to realize is that, as electronic media becomes the status quo, this hierarchy can be exploited to focus searches that would otherwise find millions of possible results for a query.

The taxonomy of subjects is not well understood below a very coarse-grained level, because existing subject classifications do not go very deep. "AI" is among the most specific subject categories modern libraries provide, yet it is an area that clearly admits a good deal more specialization.

One area where the lack of any depth in existing subject taxonomies leads to problems is in understanding the difference between e.g. "cars" as a subject and "Chris' car" as a subject. How do these two concepts fit into a taxonomy, or do they? This issue was never a problem in previous card catalog systems because there were only books, and relationships between objects were never represented. In addition, subjects were merely selected from a defined set of keywords, and the fact that a subject keyword may represent something else in the card catalog was never specified. For example, card catalog systems easily represented the fact that Hemingway is the author of several books, and also easily represented that there are books about Hemingway and books about Hemingway's books. These systems did not, however, give you any clue that "Hemingway the author" and "Hemingway the subject" were one and the same. This may seem trivial since many people know who Hemingway is, but with a digital library containing references to millions of unheard-of authors, these distinctions will be critical.

There are clearly some difficulties and ambiguities involving the notion of a subject. Taking this vague discussion into a formal ontology for subjects necessarily involves incorporation of the underlying building blocks the specification language provides, and it is here that some of these problems become more clearly defined.

4.3 Formal Ontology in a Description Logic

Description logics provide for three basic types of objects: concepts, individuals, and roles. Concepts are terminological descriptions of classes of individuals, such as "biography" in the example above. Individuals are assertional, and are considered instances of concepts. Roles are the relationships between individuals. These objects are loosely equivalent, respectively, to unary predicates, object symbols, and binary predicates in FOL [Borgida, 1996].
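
The loose correspondence with first-order logic can be pictured with a small Python sketch, in which a concept is represented by its extension (a set of individuals) and a role by a set of ordered pairs. The names and data are hypothetical, reusing the examples of Section 4.1, and the sketch ignores everything that makes a description logic more than a set of facts (definitions, subsumption, open-world reasoning).

```python
# Individuals play the part of object symbols: plain names.
chris, book1, cars = "Chris", "Book-1", "cars"

# A concept behaves like a unary predicate, given here by its extension.
PERSON = {chris}
BOOK = {book1}

# A role behaves like a binary predicate: a set of ordered pairs of individuals.
ABOUT = {(book1, cars)}
INTERESTED_IN = {(chris, cars)}


def holds(predicate, *args):
    # Truth of a unary predicate (one argument) or a binary one (two arguments).
    key = args[0] if len(args) == 1 else args
    return key in predicate
```

With these definitions, holds(PERSON, chris) and holds(ABOUT, book1, cars) are true, while holds(ABOUT, chris, cars) is false. Note that both positions of a role pair are filled by individuals, never by concepts, which is the restriction that matters in the discussion that follows.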

Figure 1

To begin with, our basic card catalog representation calls for concepts such as PERSON, PUBLICATION, BOOK, ORGANIZATION, EVENT, etc. Instances of these concepts, such as "Hemingway," and "The Old Man and the Sea" (shown in Figure 1), are individuals in the description logic, and are the objects to be classified by subject.

Figure 2

It is the nature of description logics that subsumption reasoning is only computed for taxonomies expressed at the terminological level. This means that any taxonomy which will exploit subsumption must be a taxonomy of concepts, and therefore subjects must be concepts, as shown in Figure 2. The question then becomes, how do we represent the fact that some individual, such as a book or a person, is to be classified under a particular subject?

This question brings to light three major problems with a taxonomy of subject concepts:

1. Is a book about AI an individual of the concept "AI"?

2. What happens to the notions of "about" and "interested in" mentioned in Section 4.1, when subjects are concepts?

3. Is there any way to represent specific extensional objects (individuals) as subjects?

Figure 3

The first problem may seem to some a little nit-picky, but it just doesn't "sound right" for a book to be an instance of a subject, the way "Chris" is an instance of a "Person" (see Figure 3).

Figure 4

The second problem is a little more serious. The notions of "a book about AI" and "a person interested in AI" do not seem compatible with an approach in which subjects are concepts. The most natural representation of "about" and "interested-in" would be for the subject to be an individual, in order that it could be predicated by a role (see Figure 4). Again, concepts cannot be role fillers because a role is a binary predicate and a concept is a unary predicate; therefore, using the "about" and "interested-in" roles would not be possible with a taxonomy of subject concepts.

The third problem is also serious. If specific instances, e.g. "Chris" can be a subject (for example, "a web page about Chris"), we have inconsistent use of subjects as concepts in some places and individuals in others.

4.4 Alternative Ontologies

We considered five different approaches to representing subject taxonomies in a description logic, each of which has characteristics that attempt to deal with the ambiguous nature of the notion.

4.4.1 Object-Based Subject Taxonomies

One slightly different approach to the taxonomy of subjects shown in Figure 2, is to change the name of the subject concepts to something more suitable, such as AI-Book. This does make more sense, but requires a duplicate subject taxonomy be created for each thing that can be classified by subject, e.g. there would be an entire subject taxonomy for people that included the "AI-Person" concept, and also hierarchies for companies, labs, events (like conferences), etc. In addition, these subject concepts are all connected in some way, e.g. the concepts "AI Book" and "AI Person" are related, yet there would be no way, in general, to say "x Book" and "x Person" and "x Company", etc., are all related because they all have to do with x as a subject.

4.4.2 Concept-Individual Pairs

Another approach is to have every subject be represented by a concept-individual pair (note that this is not a "meta-individual" [Brachman, et al., 1991]). At first glance, this seems to solve the first two problems: objects are no longer instances of their subjects, and the individual part of a represented subject can be predicated. It creates a problem, however, because the subject information is not passed down the hierarchy as one would expect.

Figure 5

We would expect, for example, a query for "Books about Computer Science" to return a list of all books about computer science, and also any books about any of the subjects below computer science in the taxonomy, including books about AI. In order to achieve this, either objects must be restricted to be about only one subject (which is not desirable), or the description logic would require the SOME operator [Patel-Schneider and Swartout, 1993] (which CLASSIC does not provide). With the SOME operator, this query would be:

(AND BOOK (SOME ABOUT CompSci))

The problem would then become that, while some of the subjects of an object would appear as role values, the "inherited" ones would not. In the example shown in Figure 5, ai-ind and physics-ind are the fillers for the about role on Book-1, however compsci-ind is not, even though CompSci should be considered a subject. Queries, therefore, would need to take this into account and remain mostly terminological, which in turn makes query formulation inconsistent. Again, the query above for "books about computer science," reads, "Books that are about an instance of Computer Science" (a loose translation), however the query "Books about Person-10" would be:

(AND BOOK (FILLS ABOUT PERSON-10))

This reads, "Books about Person-10." The inconsistency is that the latter case is not terminological, and the former is.

Another problem with this approach is that it is not entirely clear what things are. The instance parts of the subjects (e.g. ai-ind) are instances of subject and so perhaps they are the subjects themselves, but what are the concept parts? Are they simply taxonomic placeholders or do they have some meaning beyond that?

This representation would be quite a bit less efficient than the first approach in focusing on objects in a certain subject area. Despite these drawbacks, this is a possible solution, given the SOME operator.
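
The difference between the two query forms can be made concrete with a small Python sketch. The data follows the description of Figure 5 (Book-1 is about ai-ind and physics-ind), but the dictionaries, the function names, and the treatment of SOME and FILLS as simple lookups are illustrative assumptions, not the behavior of any particular description logic system.

```python
# Concept parts of the subject pairs, as a taxonomy: child -> parent.
PARENT = {"AI": "CompSci", "CompSci": "Subject", "Physics": "Subject"}

# Individual parts of the subject pairs: individual -> its concept.
INSTANCE_OF = {"ai-ind": "AI", "compsci-ind": "CompSci", "physics-ind": "Physics"}

# Fillers of the about role; every key is assumed to be a BOOK.
ABOUT = {"Book-1": {"ai-ind", "physics-ind"}}


def subsumes(general, specific):
    # True if general equals specific or is one of its ancestors.
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False


def some_about(concept):
    # Sketch of (AND BOOK (SOME ABOUT concept)): a terminological query that
    # matches any book with at least one about filler belonging to concept.
    return sorted(
        book for book, fillers in ABOUT.items()
        if any(subsumes(concept, INSTANCE_OF[f]) for f in fillers)
    )


def fills_about(individual):
    # Sketch of (AND BOOK (FILLS ABOUT individual)): matches only books
    # that list this exact individual as a filler.
    return sorted(book for book, fillers in ABOUT.items() if individual in fillers)
```

In this sketch some_about("CompSci") finds Book-1 through ai-ind, whereas fills_about("compsci-ind") finds nothing, because compsci-ind was never entered as a filler. The "inherited" subject is visible only to the terminological form of the query.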

4.4.3 Subject-Based Instantiation

Figure 6

Another alternative is to alter the syntax and semantics of description logics to include a special relationship between individuals and their subject concepts. Currently, description logics define only one link between individuals and concepts, the "instance of" link. This link is part of the syntax of the language, and has nothing to do with roles, which can only link individuals to each other. A special "has-subject-class" link would require fairly extensive modifications to the language, since operationally it would behave precisely as the instance-of link with respect to subsumption, yet would need its own record keeping and so forth within the implementation.

This approach lacks the ability to deal with the third problem above (individuals as subjects), and like the others pushes a lot of interpretation onto the user interface. We are not currently considering it due to the complexity of the implementation changes required. Other groups may be trying this approach [Lambrix, et al., 1997].

4.4.4 Subject Things

A fourth approach to consider is only a slight modification of the first one proposed: keep the single taxonomy of subjects, but name each concept with the subject followed by "thing," e.g. AI-Thing. In other words, "a book about AI" becomes something which is a book and an AI Thing, and a query for that would be:

(AND BOOK AI-THING)

A person interested in AI is a person and an AI Thing, and a query would be:

(AND PERSON AI-THING)

This solves the first and second problems, again by slightly altering the way subjects are considered and pushing any interpretation onto the user interface. It does not deal with the third problem of individuals as subjects at all.

This approach also allows the succinct expression of rules for propagating subject information. For example, it makes sense to say that "A person who writes a book about AI is interested in AI." This rule can be expressed with this representation because there is a single concept for the subject of AI (unlike the Object-Based Subject Taxonomies approach), all subjects are concepts (unlike the Concept-Individual Pairs approach), and can be expressed using the existing syntax (unlike the Subject-Based Instantiation approach, in which it isn't clear how rules would be expressed).

This is the approach currently in use because it requires no changes to the underlying language, while still providing most of the desired functionality and efficiency (modulo a knowledgeable user interface).
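
A rough Python sketch of this approach, including the propagation rule "a person who writes a book about AI is interested in AI," is given below. The individuals, the COMPSCI-THING parent concept, the author_of role, and the function names are hypothetical; the rule is applied procedurally here, whereas in the description logic it would be stated declaratively.

```python
# Hypothetical taxonomy of subject "things": child -> parent.
PARENT = {"AI-THING": "COMPSCI-THING"}

# Concepts asserted for each individual.
concepts = {
    "Book-10": {"BOOK", "AI-THING"},
    "Person-10": {"PERSON"},
    "Person-11": {"PERSON", "AI-THING"},
}

# Hypothetical author-of role: person -> books written.
author_of = {"Person-10": {"Book-10"}}


def with_ancestors(concept):
    # The concept itself plus everything above it in the taxonomy.
    result = set()
    while concept is not None:
        result.add(concept)
        concept = PARENT.get(concept)
    return result


def instance_of(individual, concept):
    return any(concept in with_ancestors(c) for c in concepts[individual])


def query(*conjuncts):
    # Sketch of (AND c1 c2 ...): individuals that are instances of every conjunct.
    return sorted(i for i in concepts if all(instance_of(i, c) for c in conjuncts))


def propagate_author_interest(subject_thing):
    # Rule: a person who writes a book that is a subject_thing
    # is also classified as a subject_thing.
    for person, books in author_of.items():
        if any(instance_of(b, "BOOK") and instance_of(b, subject_thing) for b in books):
            concepts[person].add(subject_thing)
```

Before the rule is applied, query("PERSON", "AI-THING") returns only Person-11; after propagate_author_interest("AI-THING") it also returns Person-10. The same single subject concept serves both the book and the person, which is what makes the rule easy to state.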

4.4.5 Subjects as Instances

The final approach to consider is a rather major shift in thinking from the previous approaches, though there are some similarities to Concept-Individual Pairs. What led us to this approach is thinking about whether a subject exists or not. That is, when we say a particular book is about some subject, do we mean that subject exists in the same way the book does?

The use of subjects as objects which can be predicated (such as about(Book-1,AI)), implies that a subject does exist and should be represented in a description logic as an individual. This gives us the problem of not being able to use a taxonomy of subjects.

We think, in this approach, of subjects as individuals which correspond not to concepts in a taxonomy as in the Concept-Individual Pairs approach, but to each item to be classified by subject. That is, for each individual that has some sort of subject classification, there is an individual which represents the subject of that item. The subject taxonomy is represented as concepts (as in Figure 2), and each subject individual is classified under concepts in that taxonomy. The objects (books, people, etc.) themselves are not classified directly by subject.

Figure 7

A simple example of classification in this manner is shown in Figure 7. In this example, Person-1 is Ernest Hemingway, Book-2 is The Old Man and the Sea, Book-1 represents a biography of Ernest Hemingway, and Person-2 is Jeffrey Meyers, the author of the biography. Each book has an associated genre and a subject that it is about. The subject and the genre are always individuals, and each will be an instance of all the concepts that are its subjects, making each subject instance potentially unique, since it represents the aggregation of all the subject categories.

The interpretation of the blank subject instances in Figure 7 is that they are things which represent the subjects (and genres) of their associated books.

This approach seems to address all the problems that have been mentioned in this paper. It does not require the SOME operator, uses a consistent representation, and takes advantage of subsumption reasoning. We are only now beginning to actually apply the idea to our card-catalog system. At this time we have only preliminary results, which are promising. This subject ontology seems to continue to offer new benefits as we experiment more.
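
One possible reading of this approach is sketched below in Python. Each classified object is linked by a role to a single anonymous subject individual, and that individual is an instance of every applicable concept in the subject taxonomy. The object names, the subject individuals subj-1 and subj-2, and the query function are hypothetical, and the sketch does not reproduce Figure 7.

```python
# Subject taxonomy as concepts: child -> parent.
PARENT = {"AI": "CompSci", "CompSci": "Subject", "Physics": "Subject"}

# One subject individual per classified object; each is an instance of
# all the subject concepts that apply to that object.
subject_concepts = {"subj-1": {"AI", "Physics"}, "subj-2": {"AI"}}

# Roles link objects to their subject individuals (individual to individual).
about = {"Book-X": "subj-1"}
interested_in = {"Person-X": "subj-2"}


def with_ancestors(concept):
    # The concept itself plus everything above it in the taxonomy.
    result = set()
    while concept is not None:
        result.add(concept)
        concept = PARENT.get(concept)
    return result


def is_instance(subject_individual, concept):
    # Subsumption does the work: an instance of AI is also an instance of CompSci.
    return any(concept in with_ancestors(c) for c in subject_concepts[subject_individual])


def objects_with_subject(role, concept):
    # Objects whose single subject individual falls under concept.
    return sorted(obj for obj, subj in role.items() if is_instance(subj, concept))
```

In this sketch objects_with_subject(about, "CompSci") returns Book-X even though only AI and Physics were asserted, and the same function answers "people interested in CompSci" by passing the interested_in role instead. The roles always hold between individuals, and the taxonomy is used only to classify the subject individuals.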

5 Conclusion

Ontological analysis requires deep consideration of the nature of the domain being analyzed. We have been studying what makes something a subject, and how objects in the library domain can be classified by subjects in a consistent way.

Our subject ontology is still in development, yet it represents a significant step in a long effort to formalize this notion with a useful taxonomic structure.

One can easily criticize these efforts in terms of scale. Knowledge representation systems have still not come close to dealing with the scale of the systems we hope to replace - systems that are themselves dwarfed by the scale of digital libraries. We believe this to be a technology issue, however. Experts are now predicting the end to disk-drives as a storage medium, with 64 gigabyte flash memory cards roughly five years away [Newton, 1997]. When it is possible to store that much information in a high-speed random access structure, the use of classification mechanisms will be absolutely essential in assisting users to find the information they are looking for.

References

[Baader, et al., 1991] Baader, F., Bürkert, H., Heinsohn, J., Hollunder, B., Müller, J., Nebel, B., Nutt, W., and Profitlich, H. Terminological Knowledge Representation: A Proposal for a Terminological Logic. DFKI Technical Memo TM-90-04. May, 1991.

[Borgida, 1996] Borgida, A. On the Relative Expressiveness of Description Logics and Predicate Logics. The Artificial Intelligence Journal. To appear.

[Brachman, et al., 1991] Brachman, R., McGuinness, D., Patel-Schneider, P., Borgida, A. and Resnick, L. Living with CLASSIC: When and How to Use a KL-ONE-Like Language. Principles of Semantic Networks. Morgan Kaufmann. Pp. 401-456. May, 1991.

[Ide, et al., 1997] Ide, N., McGraw, T., and Welty, C. Representing TEI Documents in the CLASSIC Knowledge Representation System. Proceedings of the Tenth Workshop of the Text-Encoding Initiative. November, 1997.

[Lambrix, et al., 1997] Lambrix, P., Shahmehri, N., and Wahllöf, N. Dwebic: An Intelligent Search Engine based on Default Description Logics. Proceedings of DL-97, The International Workshop on Description Logics. September, 1997.

[Newton, 1997] Newton, R. The Revolution in Electronic Design Automation: Implications for Automated Software Engineering. Keynote Address, 1997 International Conference on Automated Software Engineering. IEEE Computer Society Press. November, 1997.

[Padgham, 1997] Padgham, P. The Description Logic Home Page. Available at http://www.dl.kr.org/dl.

[Patel-Schneider and Swartout, 1993] Patel-Schneider, P., and Swartout, B. Description Logic Knowledge Representation System Specification. From the KRSS Group of the ARPA Knowledge Sharing Effort. November, 1993. Available at http://www-db.research.bell-labs.com/user/pfps/krss-spec.ps

[Welty, 1994] Welty, Chris. Knowledge Representation for Intelligent Information Retrieval. Proceedings of the CAIA-94 Workshop on Intelligent Access to Digital Libraries. March, 1994.

[Welty, 1996] Welty, Chris. Intelligent Assistance for Navigating the Web. Proceedings of the 1996 Florida AI Research Symposium. May, 1996.

[Welty, et al., 1998] Welty, C., and Ide, N. Knowledge Representation for Text Markup. The International Journal of Computers and the Humanities. To appear.