Computer Science Dept.
Vassar College
Poughkeepsie, NY 12604-0462
Tel: (914) 437-5992
Fax: (914) 437-7498
weltyc@cs.vassar.edu
Subject-based classification is an important part of information retrieval and has a long history in libraries, where a subject taxonomy was used to determine the location of books on the shelves. We have been studying the notion of subject itself in order to determine a formal ontology of subject for a large-scale digital library card catalog system. Deep analysis reveals considerable ambiguity in how subjects are used in existing systems and terminology, and we attempt to formalize these notions in a single representational framework.
Subject Areas: Subject-based classification, digital libraries, description logics.
Knowledge Representation has a lot to offer the efforts to solve these problems, as part of the solution is a deeper understanding by these search engines of the available information. A critical part of clarifying this understanding is ontological analysis of the relevant concepts. We have been working for several years on a new ontology for digital library card catalog systems [Welty, 1994][Welty, 1996]. Central to this research has been the issue of subjects and their taxonomic nature, which was inspired by the success of the Yahoo! web directory.
In addition to handling taxonomies efficiently, description logics are excellent at allowing succinct expression of the reasons for subsumption. We want to be able not just to say what the taxonomy is, but why. For example, why is a "biography" a more specialized notion than "non-fiction"? In a description logic, we say that a "biography is non-fiction which is about a person":
(define-concept biography (and non-fiction (all about person)))
The semantics of the logic makes this a sufficient condition for membership in the biography class. That is, anything that is non-fiction and is about a person will automatically be classified as a biography.
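The sufficient-condition reading can be illustrated outside any description logic system. The Python sketch below is only an illustration under stated assumptions: the `Item` class and the `classify` function are hypothetical names, not part of any description logic implementation, and the list of `about` fillers is assumed to be complete (closed), so that "all fillers are persons" can be checked directly.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A catalog individual (hypothetical structure, for illustration only)."""
    name: str
    concepts: set = field(default_factory=set)   # asserted concept names
    about: list = field(default_factory=list)    # fillers of the "about" role


def is_person(thing):
    return "person" in thing.concepts


def classify(item):
    """Return the asserted concepts, plus 'biography' when the definition
    (and non-fiction (all about person)) is satisfied."""
    derived = set(item.concepts)
    # Assumption: item.about lists every filler, so the ALL restriction
    # can be tested by inspecting the list.
    if "non-fiction" in derived and all(is_person(t) for t in item.about):
        derived.add("biography")
    return derived


hemingway = Item("Hemingway", {"person"})
cars = Item("cars", {"topic"})
life = Item("a life of Hemingway", {"non-fiction"}, [hemingway])
manual = Item("a car repair manual", {"non-fiction"}, [cars])

print(sorted(classify(life)))
print(sorted(classify(manual)))
```

The item entered only as non-fiction is recognized as a biography without the person entering it having to find that sub-category, which is the data-entry benefit discussed next.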
This capability is useful in several ways, but most importantly it assists in ensuring that the data entered is as accurate as possible. One of the biggest problems with the Yahoo! taxonomy is that it is populated by non-experts, who for various reasons do not always completely explore or understand the taxonomy before making their selection as to where a new entry belongs. It is very common to find something like a biography listed in the non-fiction category and not in the biography sub-category of non-fiction. This implies that a user searching for a biography who is exploiting the taxonomy to focus a search will miss that entry. Incorporating knowledge about, e.g. what makes a biography more specialized than a work of non-fiction will help to address this problem.
The purpose of this paper is not to explain description logics in general. Please see [Patel-Schneider and Swartout, 1993], and the wealth of information available via the description logic workshops [Padgham, 1997].
This paradigm has worked well until recently because the scope of the searches was still rather small. Libraries have always been limited by physical space - only so many books will fit within the building. It has been the possibility of digital libraries that has finally outdated this paradigm. Searching by title, author, and very coarse-grained subjects simply won't do anymore. The Web has made it possible for anyone to be an author, and electronic data clearly opens the door to whole realms of search possibilities.
Yahoo! has some problems, too. The subject taxonomy itself is highly coarse-grained, even more so than a library's. They have not over-specified the taxonomy because they believe people adding data, who are in a rough sense advertising, will not make use of subject categories that are too specific, either because they do not understand the depth, because they do not take the time to explore the full depth, or because they themselves fear users will not find something that is overly deep in the taxonomy (believing e.g. that more people search the broader categories). This, as mentioned in Section 2, creates a further problem in which users who do make use of the full depth of the taxonomy miss information that was classified too generally (e.g. a person searching for "biographies" may miss a biography that was listed under "non-fiction").
Our goals are broadly defined as:
Books, or publications in general, may not be about a certain subject and yet be classified under that subject. A Sherlock Holmes novel is not about mystery, yet you would expect to be able to find it under that heading - it is a mystery book. This use of subjects is drastically different, as it implies an intensional interpretation. When something is about a subject, the interpretation seems extensional, as a relationship is being expressed between two objects: about(book1, cars). A book in a particular genre is normally thought of as being an instance: mystery("A Study in Scarlet"), and therefore the subject "mystery" becomes intensional. This is at odds with classifying a person's interests: it would be natural to state "Chris is interested in mystery" as interested-in(Chris, mystery), and not as mystery(Chris).
Any discussion of the ontology of subjects can quickly become confused by the fact that, linguistically, any noun can be the subject of a sentence. This opens up the possibility for any noun to be a subject, whether naturally intensional or extensional. For example, "cars" is a category of objects which has instances, such as "Chris' car," and which we would naturally take as intensional. Cars can also be a subject, as in "a book about cars." Chris' car, as an instance of a car, would naturally be taken as extensional, yet it, too, could be a subject, as in, "a book about Chris' car."
The taxonomy of subjects is not well understood below a very coarse-grained level, because existing subject classifications do not go very deep. "AI" is among the most specific subject categories modern libraries provide, yet it is an area that clearly has quite a bit more specialization.
One area where the lack of any depth in existing subject taxonomies leads to problems is in understanding the difference between e.g. "cars" as a subject and "Chris' car" as a subject. How do these two concepts fit into a taxonomy, or do they? This issue was never a problem in previous card catalog systems because there were only books, and relationships between objects were never represented. In addition, subjects were merely selected from a defined set of keywords, and the fact that a subject keyword may represent something else in the card catalog was never specified. For example, card catalog systems easily represented the fact that Hemingway is the author of several books, and also easily represented that there are books about Hemingway and books about Hemingway's books. These systems did not, however, give you any clue that "Hemingway the author" and "Hemingway the subject" were one and the same. This may seem trivial since many people know who Hemingway is, but with a digital library containing references to millions of unheard-of authors, these distinctions will be critical.
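The Hemingway example can be made concrete with a small sketch. The Python below is an illustration only, with hypothetical class and function names and placeholder titles: when the author role and the about role are filled by the same individual rather than by matching keywords, the link between "Hemingway the author" and "Hemingway the subject" can be followed mechanically.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)          # compare by identity: same object, same individual
class Individual:
    label: str


@dataclass(eq=False)
class Book:
    title: str
    authors: list = field(default_factory=list)
    about: list = field(default_factory=list)


hemingway = Individual("Ernest Hemingway")
novel = Book("The Old Man and the Sea", authors=[hemingway])
bio = Book("a biography of Hemingway", about=[hemingway])   # placeholder title
catalog = [novel, bio]


def books_about_authors_of(book, catalog):
    """Books whose 'about' role is filled by one of the given book's authors."""
    return [b for b in catalog
            if any(a is s for a in book.authors for s in b.about)]


print([b.title for b in books_about_authors_of(novel, catalog)])
```

A keyword-based catalog could hold the string "Hemingway" in both places, but nothing in it would state that the two strings denote one person.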
There are clearly some difficulties and ambiguities involving the notion of a subject. Taking this vague discussion into a formal ontology for subjects necessarily involves incorporation of the underlying building blocks the specification language provides, and it is here that some of these problems become more clearly defined.


This question brings to light three major problems with a taxonomy of subject concepts:


The third problem is also serious. If specific instances, e.g. "Chris", can be a subject (for example, "a web page about Chris"), we have inconsistent use of subjects: as concepts in some places and as individuals in others.

(AND BOOK (SOME ABOUT CompSci))
The problem would then become that, while some of the subjects of an object would appear as role values, the "inherited" ones would not. In the example shown in Figure 5, ai-ind and physics-ind are the fillers for the about role on Book-1, however compsci-ind is not, even though CompSci should be considered a subject. Queries, therefore, would need to take this into account and remain mostly terminological, which in turn makes query formulation inconsistent. Again, the query above for "books about computer science" reads, "Books that are about an instance of Computer Science" (a loose translation). The query "Books about Person-10", however, would be:
(AND BOOK (FILLS ABOUT PERSON-10))
This reads, "Books about Person-10." The inconsistency is that the latter query is not terminological, while the former is.
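The difference between the two query forms can be sketched in a few lines of Python. This is a rough illustration, not an implementation of any description logic system: the dictionaries and the functions `some_about` and `fills_about` are hypothetical, and Book-2 (about Person-10) is an assumed example added so that the second query has an answer.

```python
# Subject taxonomy, child concept -> parent concept (assumed fragment).
PARENT = {"AI": "CompSci", "CompSci": "Subject", "Physics": "Subject"}

# Concept-individual pairs: each subject concept has one paired individual.
INSTANCE_OF = {"ai-ind": "AI", "compsci-ind": "CompSci", "physics-ind": "Physics"}

# Asserted fillers of the "about" role.
ABOUT = {"Book-1": {"ai-ind", "physics-ind"}, "Book-2": {"Person-10"}}


def subsumed_by(concept, ancestor):
    """True if concept equals ancestor or lies below it in the taxonomy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def some_about(concept):
    """Terminological query, in the spirit of (AND BOOK (SOME ABOUT concept))."""
    return {book for book, fillers in ABOUT.items()
            if any(subsumed_by(INSTANCE_OF.get(f), concept) for f in fillers)}


def fills_about(individual):
    """Filler query, in the spirit of (AND BOOK (FILLS ABOUT individual))."""
    return {book for book, fillers in ABOUT.items() if individual in fillers}


print(some_about("CompSci"))        # finds Book-1 through the AI concept
print(fills_about("compsci-ind"))   # empty: compsci-ind is not a filler
print(fills_about("Person-10"))
```

The "inherited" subject is reachable only through the concept-level query, while a subject that is an ordinary individual is reachable only through the filler query, so the user interface has to know which form to generate.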
Another problem with this approach is that it is not entirely clear what things are. The instance parts of the subjects (e.g. ai-ind) are instances of subject and so perhaps they are the subjects themselves, but what are the concept parts? Are they simply taxonomic placeholders or do they have some meaning beyond that?
This representation would be quite a bit less efficient than the first approach in focusing on objects in a certain subject area. Despite these drawbacks, this is a possible solution, given the SOME operator.

This approach lacks the ability to deal with the third problem above (individuals as subjects), and like the others pushes a lot of interpretation onto the user interface. We are not currently considering it due to the complexity of the implementation changes required. Other groups may be trying this approach [Lambrix, et al., 1997].
AI-Thing. In other words, "a book about AI" becomes something which is a book and an AI Thing, and a query for that would be:
(AND BOOK AI-THING)
A person interested in AI is a person and an AI Thing, and a query would be:
(AND PERSON AI-THING)
This solves the first and second problems, again by slightly altering the way subjects are considered and pushing any interpretation onto the user interface. It does not deal with the third problem of individuals as subjects at all.
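A minimal Python sketch of this representation follows. It is illustrative only: the parent concept COMPSCI-THING, the individuals, and the function names are assumptions introduced for the example and do not come from the system described here.

```python
# Subject concepts arranged in a taxonomy, child -> parent.
# COMPSCI-THING is a hypothetical parent added for illustration.
PARENT = {"AI-THING": "COMPSCI-THING"}

# Asserted concepts per individual (assumed data).
ASSERTED = {
    "Book-1": {"BOOK", "AI-THING"},
    "Book-2": {"BOOK"},
    "Chris": {"PERSON", "AI-THING"},
}


def with_ancestors(concepts):
    """Close a set of concepts upward under the taxonomy."""
    closed = set()
    for c in concepts:
        while c is not None and c not in closed:
            closed.add(c)
            c = PARENT.get(c)
    return closed


def query_and(*required):
    """Individuals that are instances of every required concept."""
    return {ind for ind, cs in ASSERTED.items()
            if set(required) <= with_ancestors(cs)}


print(query_and("BOOK", "AI-THING"))
print(query_and("PERSON", "AI-THING"))
print(query_and("BOOK", "COMPSCI-THING"))   # found via subsumption
```

Both the book query and the person query take the same conjunctive form, and a query on a broader subject picks up items classified under a narrower one.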
This approach also allows the succinct expression of rules for propagating subject information. For example, it makes sense to say that "A person who writes a book about AI is interested in AI." This rule can be expressed with this representation because there is a single concept for the subject of AI (unlike the Object-Based Subject Taxonomies approach), all subjects are concepts (unlike the Concept-Individual Pairs approach), and can be expressed using the existing syntax (unlike the Subject-Based Instantiation approach, in which it isn't clear how rules would be expressed).
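The propagation rule quoted above can be sketched as follows. This Python is a hedged illustration rather than the rule syntax of any description logic system: the data, the `author_of` mapping, and the `propagate` function are hypothetical names chosen for the example.

```python
# Concepts that count as subjects (assumed set).
SUBJECTS = {"AI-THING"}

# Asserted concepts per individual (assumed data).
concepts = {
    "Person-1": {"PERSON"},
    "Person-2": {"PERSON"},
    "Book-1": {"BOOK", "AI-THING"},
    "Book-2": {"BOOK"},
}

# Who wrote what (assumed data).
author_of = {"Person-1": ["Book-1"], "Person-2": ["Book-2"]}


def propagate(concepts, author_of, subjects):
    """Apply: a person who writes a book under subject S is also under S.
    Returns a new mapping; the input is left unchanged."""
    result = {ind: set(cs) for ind, cs in concepts.items()}
    for person, books in author_of.items():
        if "PERSON" not in result[person]:
            continue
        for book in books:
            if "BOOK" in result[book]:
                # Copy only the subject concepts from the book to its author.
                result[person] |= result[book] & subjects
    return result


derived = propagate(concepts, author_of, SUBJECTS)
print(sorted(derived["Person-1"]))
print(sorted(derived["Person-2"]))
```

Because every subject is a single concept, the rule needs no special cases: it copies subject concepts from one individual to another.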
This is the approach currently in use because it requires no changes to the underlying language, while still providing most of the desired functionality and efficiency (modulo a knowledgeable user interface).
The use of subjects as objects which can be predicated (such as about(Book-1,AI)), implies that a subject does exist and should be represented in a description logic as an individual. This gives us the problem of not being able to use a taxonomy of subjects.
We think, in this approach, of subjects as individuals which correspond not to concepts in a taxonomy as in the Concept-Individual Pairs approach, but to each item to be classified by subject. That is, for each individual that has some sort of subject classification, there is an individual which represents the subject of that item. The subject taxonomy is represented as concepts (as in Figure 2), and each subject individual is classified under concepts in that taxonomy. The objects (books, people, etc.) themselves are not classified directly by subject.

Person-1 is Ernest Hemingway, Book-2 is The Old Man and the Sea, Book-1 represents a biography of Ernest Hemingway, and Person-2 is Jeffrey Meyers, the author of the biography. Each book has an associated genre and a subject that it is about. The subject and the genre are always individuals, and each is an instance of all the concepts that are its subjects, making each subject instance potentially unique, since it represents the aggregation of all the subject categories. The interpretation of the blank subject instances in Figure 7 is that they are things which represent the subjects (and genres) of their associated books.
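The arrangement can be sketched in Python as below. This is an illustration under loudly stated assumptions: the taxonomy concepts (Authors, People, Fishing, Recreation), the subject individual names, and the function names are all hypothetical and are not taken from Figure 7; only the idea of one subject individual per classified item comes from the text.

```python
# Subject taxonomy as concepts, child -> parent (hypothetical fragment).
PARENT = {"Authors": "People", "Fishing": "Recreation"}

# One subject individual per classified item (the blank nodes).
SUBJECT_OF = {"Book-1": "subj-1", "Book-2": "subj-2"}

# Each subject individual is an instance of the concepts describing its item.
INSTANCE_OF = {"subj-1": {"Authors"}, "subj-2": {"Fishing"}}


def with_ancestors(concepts):
    """Close a set of concepts upward under the taxonomy."""
    closed = set()
    for c in concepts:
        while c is not None and c not in closed:
            closed.add(c)
            c = PARENT.get(c)
    return closed


def items_about(concept):
    """Items whose subject individual is an instance of the concept."""
    return {item for item, subj in SUBJECT_OF.items()
            if concept in with_ancestors(INSTANCE_OF[subj])}


print(items_about("People"))       # Book-1, through Authors
print(items_about("Fishing"))
```

The books themselves carry no subject concepts; every query goes through the single subject individual attached to each item, and subsumption over the taxonomy does the rest.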
This approach seems to address all the problems that have been mentioned in this paper. It does not require the SOME operator, uses a consistent representation, and takes advantage of subsumption reasoning. We are only now beginning to actually apply the idea to our card-catalog system. At this time we have only preliminary results, which are promising. This subject ontology seems to continue to offer new benefits as we experiment more.
Our subject ontology is still in development, yet it represents a significant step in a long effort to formalize this notion with a useful taxonomic structure.
One can easily criticize these efforts in terms of scale. Knowledge representation systems have still not come close to dealing with the scale of the systems we hope to replace - systems that are themselves dwarfed by the scale of digital libraries. We believe this to be a technology issue, however. Experts are now predicting the end to disk-drives as a storage medium, with 64 gigabyte flash memory cards roughly five years away [Newton, 1997]. When it is possible to store that much information in a high-speed random access structure, the use of classification mechanisms will be absolutely essential in assisting users to find the information they are looking for.
[Borgida, 1996] Borgida, A. On the Relative Expressiveness of Description Logics and Predicate Logics. The Artificial Intelligence Journal. To appear.
[Brachman, et al., 1991] Brachman, R., McGuinness, D., Patel-Schneider, P., Borgida, A. and Resnick, L. Living with CLASSIC: When and How to Use a KL-ONE-Like Language. Principles of Semantic Networks. Morgan Kaufmann. Pp. 401-456. May, 1991.
[Ide, et al., 1997] Ide, N., McGraw, T., and Welty, C. Representing TEI Documents in the CLASSIC Knowledge Representation System. Proceedings of the Tenth Workshop of the Text-Encoding Initiative. November, 1997.
[Lambrix, et al., 1997] Lambrix, P., Shahmehri, N., and Wahllöf, N. Dwebic: An Intelligent Search Engine based on Default Description Logics. Proceedings of DL-97, The International Workshop on Description Logics. September, 1997.
[Newton, 1997] Newton, R. The Revolution in Electronic Design Automation: Implications for Automated Software Engineering. Keynote Address, 1997 International Conference on Automated Software Engineering. IEEE Computer Society Press. November, 1997.
[Padgham, 1997] Padgham, P. The Description Logic Home Page. Available at http://www.dl.kr.org/dl.
[Patel-Schneider and Swartout, 1993] Patel-Schneider, P., and Swartout, B. Description Logic Knowledge Representation System Specification. From the KRSS Group of the ARPA Knowledge Sharing Effort. November, 1993. Available at http://www-db.research.bell-labs.com/user/pfps/krss-spec.ps
[Welty, 1994] Welty, Chris. Knowledge Representation for Intelligent Information Retrieval. Proceedings of the CAIA-94 Workshop on Intelligent Access to Digital Libraries. March, 1994.
[Welty, 1996] Welty, Chris. Intelligent Assistance for Navigating the Web. Proceedings of the 1996 Florida AI Research Symposium. May, 1996.
[Welty, et al., 1998] Welty, C., and Ide, N. Knowledge Representation for Text Markup. The International Journal of Computers and the Humanities. To appear.