Toward a Unified Docuverse:
Standardizing Document Markup and Access without Procrustean Bargains

Submitted to ASIS 97

Nancy M. Ide
Department of Computer Science
Vassar College
Poughkeepsie, New York 12604-0520
ide@cs.vassar.edu
tel: +1 (914) 437-5988
fax: +1 (914) 437-7498

C. M. Sperberg-McQueen
Computer Center M/C 135
University of Illinois at Chicago
Chicago, Illinois 60612-7352
u35395@uicvm.uic.edu
tel: +1 (312) 413-0317
fax: +1 (312) 996-6834

Abstract

One reason to form a collection of any type is to provide simpler access, both physical and intellectual, to the material collected. In the case of digital collections of text, provision of a simple unified user interface faces several challenges, among them an unabating tension between intellectual adequacy and simplicity of access. Access is simpler if all texts are encoded consistently and can be searched using the same simple textual model. But serious work with texts often requires painstaking attention to the unusual, anomalous, and unique features of a text, which often cannot be captured at all using a simple monolithic markup scheme like HTML. Some texts may be worth marking up in much more detail than is possible for the entire collection.

How can we allow cross-collection searching and display while preserving, not falsifying, the variation in our texts and the resultant variation in their markup?

First, markup schemes can be used which allow for significant variation within a common framework; the best known of such schemes is that of the Text Encoding Initiative, an international cooperative project to develop and disseminate guidelines for the encoding and interchange of textual materials for research.

Second, what cannot be unified in the markup of each individual text can be unified at the user interface, by a more intelligent and flexible design of user models and interfaces. Several distinct tasks arise here:

defining relatively simple user models or `document architectures' which can successfully be imposed on the texts in a collection, to allow unified access to the entire collection in terms of simple concepts like volume/page/line or work/chapter/paragraph, or in terms of the full detailed markup of individual texts. Some existing markup schemes provide a useful basis for such models. There is no need to limit a collection to a single textual model; indeed, there are reasons to provide more than one model.
defining methods to specify transductions from the actual document markup into the various user models or architectures; under what circumstances will such transductions be possible, and when won't they be?
building indices and search interfaces to support each model; it is neither necessary nor desirable to support only a single search interface: it is probably better to have several.

The paper discusses the current state of the art for these problems and points to outstanding questions which remain to be resolved.

1 The Problem

In developing electronic document collections for research or other uses, it is clear that the electronic representations should usefully represent the content and structure of the documents represented. It is equally clear that a collection should provide simple, consistent, access to all documents in the collection, through the same user interface.

Unfortunately, these two goals conflict, and no method has yet been found to achieve them both fully at the same time.

It is easy to provide a simple, consistent user interface to the documents in a collection if the documents themselves have a simple, consistent structure which is in turn simply and consistently represented in the electronic form. The simplest interface may require no structure in the document at all: many programs can search for an arbitrary word form, or an arbitrary string of characters, in any file, without knowledge of the file's internal structure, if any. More useful queries are possible, however, if the logical structure of the document is visible to the search engine. In a collection of electronic abstracts, for example, it's easy to search for a given author's name, a date of publication within a specified range, a specific keyword supplied by the author or abstracting service, or for any word in the abstract. Such specific searches are possible, of course, only if the search engine is capable of distinguishing the author's name from the title of the paper, and each from the publication details and the text of the abstract. When each abstract has the same set of searchable fields, a single search can be performed across the entire collection, or across several collections, of abstracts.

When the full text of documents is included in the collection, a new range of searches becomes possible: we can search not just for words in the abstract, but patterns of words appearing in the same paragraph, or section, or sub-section -- assuming, of course, that the search engine pays attention to the boundaries of such units, and that they are made explicit in the text in some manner, e.g. through the tags of some encoding scheme based on the Standard Generalized Markup Language (SGML).[1] If the documents are tagged in sufficient detail, and the search engine is sufficiently attentive to the markup, a tremendous range of searches can be made possible: searches for individual word forms or strings of characters; for a given word in any of its grammatical forms; for a word or phrase occurring in a particular structural context; for a word occurring in proximity to some other word or phrase. For example:

Find the word-form buyout.
Find any occurrence of the verb phrase buy out. This should retrieve also the forms buying out, bought out, etc. In English there are several more possible forms, and for other languages there may be still more for both verbs and nouns.
Find a book with a title containing the phrase Syntax and a chapter title containinng the phrase programming languages. The context restrictions here are critical: a search for a book with the phrase programming languages in the title and a chapter entitled Syntax would retrieve a very different set of works.
Find occurrences of the word court within two sentences (or paragraphs, or other unit) of lawyer, sue, witness, or affidavit.
Display every date in the first heading within a section in chapter 3
Find every instance of Winston occurring within the same paragraph of any form of the verb hate, anywhere in part II.
Find any occurrence of the word physician, including variant archaic and obsolete spellings
Find all occurrences of bear used as a noun, or only as a plural noun, or only as a verb, etc.
Find all occurrences of the following syntactic pattern: verb phrase containing a main verb followed by a noun phrase followed by a verb in the infinitive form.
Retrieve the word or sentence in text A (in language X) which corresponds to a given word or sentence in the parallel text B (in language Y).

Given texts suitably analysed and marked up, such queries are well within the bounds of current technical possibility. They pose several problems, however, for the development of a simple, consistent interface to all the texts in a collection. It is difficult to make the more elaborate and powerful searches available within a user interface simple enough to be useful to casual or naive users, in part because the more elaborate searches not only allow the user to exploit the logical structure of a text, as recorded in the markup, but seem to require a detailed knowledge of that markup.

If a collection is sufficiently homogeneous, users may find such detailed knowledge worth acquiring, and easier to acquire. The user interface may also exploit the consistencies of text structure and markup. In collections of useful, realistic size, however, texts are highly variable. A wholly satisfactory representation of their structure and content will be equally variable. In the first place, texts of different genres may have structural units of quite different types, at some or all levels: paragraph, chapter, section, act, scene, stanza, canto, dictionary entry, etc. Literary texts, and other materials studied by humanists, are not only structurally different but often far more structurally complex than many of the text types (technical articles, etc.) which have so far received most attention in document-handling research. These complex structures, with their many different kinds of structural units, need to be made explicit in the electronic text in order that the search engine can use them in handling queries, the display routines can use them to guide the proper display of the text, and so on. At present the only plausible method of representing structural units in a text is to tag them using some system of markup.

Document collections containing texts in prose, verse, and drama are not at all rare: the well-known ARTFL database at the University of Chicago may be the best known, but any library electronic-text center will have a similar mix of genres. Language corpora, from the Brown and Lancaster-Oslo-Bergen corpora of the 1960s to contemporary projects like the British National Corpus, the European Corpus Initiative, and the Linguistic Data Consortium, sometimes avoid verse and drama, but they almost invariably contain sizable quantities of newspaper material, which has a similarly specialized structure. Even a single work may harbor a variety of genres and text structures within itself: the novel Moby Dick, for example, contains a dictionary entry, and a chapter in the form of a scene in a play, in addition to chapters of prose more common in novels. Orwell's 1984 contains diary entries, newspaper excerpts, and poetry. Such mixing of genres is not at all rare in literary texts. Non-literary works -- such as newspaper articles -- sometimes exhibit a similar complexity (e.g. a sudden shift from reportage to interview transcribed in dramatic style).

In the second place, electronic texts will vary in their form because some disciplines routinely subject their materials to much more complex kinds of analysis than do others. A wholly satisfactory electronic collection should support whatever type of research a user of the collection needs to perform; this will typically mean making the researcher's analysis or interpretation of some aspects of the text accessible to the search-and-retrieval system by means of markup.[2] Even texts of the same genre may have been encoded by different people, for very different purposes: a historical linguist's encoding of an eighteenth-century novel may differ markedly from a literary scholars, and a social historian might produce an electronic encoding of the novel different from either of the others. Even if all the texts were produced by the same individuals, cost or other extraneous factors may cause some texts to be marked up more intensively than others.

In the future, most collections will contain heterogenous data, including texts of different genres, both simpler and more complex, texts marked up for very different purposes, and texts marked up at varying levels of detail. Is it possible to allow simple cross-collection searching and display while preserving, not falsifying, the variation in our texts and the resultant variation in their markup?

The searches described above make extensive use of the markup structure and thus require that the details of the markup be exposed to the search engine. In many systems, the markup is exposed to the user as well: this is the approach taken by DynaText, the SGML-based browser developed by the then Electronic Book Technologies (now part of Inso Corporation); Pat, the search engine developed at the Waterloo Centre for the New Oxford English Dictionary and now sold by Open Text; mtsgmlql, the SGML query language developed in the European corpus project Multext; and many other markup-aware search engines. The user must have a deep knowledge of the exact element types defined in the document type definition, their use in the document, and the structure of the document.

Exposing the markup to the user, however, can lead to serious complication of the user interface. The user may not know or understand the markup system. Encoders may or may not have marked occurrences of particular element types consistently or exhaustively. And for reasons described above, different texts in a collection may contain very different kinds of markup.

Such variation is a natural phenomenon. But exposing that variation to the user of the collection will tend to make the user interface complex, confusing, and unsatisfactory: if the user can search for word cooccurrences within a specified context (e.g. to find texts in which the words love and hate--or programming language and syntax--occur within the same sentence), it will be frustrating and confusing if ninety-nine novels have `chapters' and one has identical-looking structures which go by the name episodes.

Most existing large- and medium-sized corpora and collections adopt a simple method of dealing with the complexity and confusion caused by variation in the texts and their markup: they impose restrictions on the markup that may appear in the text, most usually by forbidding any text to have any structural units which are not also possessed by all of the other texts in the collection. Failing that, they achieve a similar effect by refusing to mark those structural units, thus making them invisible to the search interface, and for all practical purposes non-existent. This solution makes the heterogeneity of texts in the collection inaccessible by defining it out of existence and by fitting the texts into a Procrustean bed of simplified and simple-minded markup. Given the needs outlined above, it is increasingly imperative to find a better way.

For some users, simply exposing the markup structure and letting them frame their searches accordingly is that better way. But one main reason to create a collection of any type in the first place is to provide simpler access, both physical and intellectual, to the material collected. Where no uniformity is imposed, it's hard to see how the collection can fulfill its function of making the materials easier to use than they would be if not collected. We therefore need a way to provide uniformity and simplicity without falsifying the data.

In the sections that follow, we discuss two possible approaches to a solution to this problem. The first approach focuses on the markup itself, and provides for a standard, consistent markup scheme which could be applied to any text. However, unlike the markup schemes imposed by existing systems, which sacrifice variability for the sake of consistency, a usable standard scheme would provide a common framework for encoding even variable texts, variable analyses, and variable levels of markup. The second approach places no restrictions on the markup scheme itself, but instead provides a coherent user interface to the texts by projecting user models of text markup which may differ from the actual underlying markup. We consider the viability of each approach and suggest a framework for the future development of adequately powerful and usable searching capabilities.

2 Standardizing Markup

One way to address, or at least ameliorate, the problem described above is to encode the texts of a collection in using a common markup scheme. Ideally, this scheme should be simple enough that the user can understand it with very little investment of time and effort. Therefore, one possible solution is to dictate that very document has a common, familiar structure, such as book, chapter, verse; or chapter, paragraph, sentence; or volume, page, line, etc. But the need for simplicity is at odds with the need for variability and richness of markup, leading potentially to the following problems:

distortions of actual text structure and content, due to the need to force, for example, a text where page and line information is irrelevant (possibly because it never existed in printed or displayed form) to contain divisions based on these categories, thus rendering the markup semantics meaningless;
inability to include complex markup such as that providing detailed linguistic information etc.;
generally, a lack of support for multiple viewsof a document, which enable the user to view the same document differently for different purposes--say, as a logical structure, a physical structure, and a linguistic structure.

Therefore, a usable standard would necessarily provide markup that is richer, more descriptive, and more flexible.

The introduction of the ISO standard SGML[3] in 1986 provided a mechanism for defining richer markup schemes. SGML is not a markup scheme itself, but rather a metalanguage for defining a textual markup system (much as BNF is a metalanguage for describing programming language syntax). It has now become standard practice to describe a marked-up text using Document Type Definitions (DTDs) expressed in SGML. Any SGML document consists of two parts: the DTD providing a set of element names or generic identifiers associated with corresponding element type definitions which show the generic structure of the marked up document; and the document instance itself. It is well known that DTDs can be formalized by means of Extended Context Free Grammars (ECFGs).[4] Thus SGML allows for the definition of a wide variety of text structures, specifically, the set of structures definable using the ECFG formalism, which are fundamentally hierarchical. Furthermore, because element names are user-defined, the number of distinct DTDs is virtually infinite.

Several DTDs intended for wide applicability have been developed by different organizations: HTML, the markup language of the World Wide Web; ISO 12083, originally developed by the Association of American Publishers Electronic Manuscript Project; and the TEI DTDs, developed by the Text Encoding Initiative to support the creation and interchange of electronic texts for research purposes. Other DTDs (such as the Docbook DTD developed by the Davenport Group, primarily for software documentation, the CALS DTD developed by the U.S. Department of Defense for technical documentation of military systems, or the Encoded Archival Description DTD developed by the Berkeley Finding Aids project and now maintained by the Society of American Archivists) are extensively used within specific industries or user communities, but are not (as far as we know) used to provide a uniform frame of reference in a document collection intended for general use.

HTML can be processed directly by a large number of browsers and editors, and enjoys a user community larger than any other SGML DTD. But as a tag set for general-purpose documents, it has a number of shortcomings.[5] In order to keep HTML simple enough for quick learning, its designers have been forced to keep it too simple to do justice to texts of even moderate complexity. Even so obvious a query as find the names of the authors of this document has no simple formulation in HTML terms, because HTML provides no element type for marking the names of authors.[6]

The DTD defined by ISO 12083 allows much richer markup than the HTML DTD, but is designed very specifically for the preparation of journal articles and books for modern publication. It does not cater to literary texts, nor for the scholarly or technical analysis of texts. It appears not to be a plausible candidate for the encoding of literary texts in verse or dramatic form, or for collections which must support the needs of textual researchers.

The Text Encoding Initiative DTD covers a much broader range of text types and research interests than do the HTML or ISO 12083 DTDs. It was developed over a period of several years by an international team of more than 100 researchers and scholars, and is specifically intended to provide a single, coherent markup scheme that will accommodate texts of widely varying types and annotations germane to a broad range of disciplines and applications, including natural language processing, information retrieval, hypertext, electronic publishing, various forms of literary and historical analysis, lexicography, etc. It includes encoding conventions for written and spoken texts in any genre or text type, without restriction on form or content, and covers both continuous materials (running text) and discontinuous materials such as dictionaries and linguistic corpora.

The TEI DTD defines a number of tag sets which can be used in virtually arbitrary combination, and specifies an informal semantics for the 400-odd elements that comprise them. The ability to `mix and match' tag sets in the TEI DTD is the basis of its breadth and flexibility. A header tag set is required and allows for the definition of a full bibliographic description for the electronic text, including its sources, encoding system, and revision history.[7] For encoding the text proper, the encoder chooses from among a set of base tag sets depending on the text type; there are base tag sets for prose, verse, drama, transcribed speech, dictionaries, and terminological databases. The TEI also provides core tag sets which define a set of discipline-independent textual features present in almost any text, including tags for elements such as

paragraphs
segmentation
lists, glossaries and indexes
highlighted phrases, emphasis, foreign words, titles etc.
quoted speech, quotation, terms, glosses, cited phrases etc.
names, numbers and measures, dates and times, etc.
correction, regularization and normalization; additions, deletions and omissions, etc.
simple links and cross references
passages of verse or drama
simple bibliographic citations
simple or complex referencing systems

Optional additional tag sets may also be selected by the encoder; these provide tags for special application areas such as alignment and linkage of text segments to form hypertexts; a wide range of other analytic elements and attributes; a tag set for detailed manuscript transcription and another for the recording of an electronic variorum modelled on the traditional critical apparatus; tag sets for the detailed encoding of names and dates; abstractions such as networks, graphs or trees; mathematical formulae and tables etc. Finally, the TEI DTD also provides several optional auxiliary tag sets, such as a tag set based on feature structure notation for the encoding of entirely abstract interpretations of a text, either in parallel or embedded within it.

The TEI DTD provides a single, coherent scheme sufficiently rich and detailed to describe a wide variety of text types and features. At the same time, because it aims at maximal generality, the TEI DTD provides common `generic' mechanisms for some common features; for example, the TEI's <div> tag, which can be nested to any level, provides common structural terminology which simplifies cross-text structural searching. Similarly, core tags provide standard terminology for low- level features found in all kinds of texts. However, for the purposes of inter- amd intra-text searching, the sheer breadth of the TEI DTD leads to some difficulties:

In the interests of flexibility, the TEI necessarily provides multiple solutions for the same encoding problems. This opens the door for (potentially substantial) encoding differences and variations in encoding style which are problematic for searching. For example, structural divisions in text can be marked using nested <div> tags or with a similar set of tags which bear numbers to indicate their nesting level: <div1>, <div2>, etc. Also, there are several alternative mechanisms by which part of speech for each word in a corpus could be tagged (feature structures, entities, etc.).
The need for generality is at odds with the need for descriptive naming and specificity. For example, the names of structural divisions vary widely in different genres, and therefore the TEI provides tags specifying generic structural divisions. The specific type of structural division is indicated in an SGML attribute. So, for verse, hierarchical divisions may be marked as <div1 type=canto>, <div2 type=stanza>, etc. In other cases, where there is consensus on more descriptive structural divisions, more specific tags can be used: for example, in dictionaries, divisions are explicitly named with <entry>, <homograph>, <sense>. Obviously, one does not want to give up the more descriptive dictionary tags, which are better for retrieval purposes; but the generic <div> tags are necessary in situations where structural divisions may differ widely even within a given text type. The resulting inconsistency, however, causes problems for cross-text searching.
To accommodate unforeseen encoding needs, the TEI provides several mechanisms to enable easy DTD modification and extension. However, if users take extensive advantage of the TEI's mechanisms for modification and extension to develop customized encoding formats, the result will be an increasing number of TEI `dialects' or sub-schemes which will in fact defeat the intention of providing a single markup standard.
The textual model implicit in the TEI DTD is probably already too complex to count as a really simple model suitable for use in simple user interfaces.

In sum, while the TEI provides a markup scheme that could serve as a basis of a search system, there continue to be problems that must be addressed. In the following section, we discuss an alternative solution that could take advantage of the TEI's flexibility and at the same time enable a common and potentially simple user interface for searching.

3 Standardizing Access

We have argued that it is not feasible or desirable to unify a collection by using precisely the same markup each individual text. It is, however, both feasible and desirable to unify the collection at the user interface, by taking a different approach to the design of the user interface.

In discussing user interfaces, it is useful to distinguish the socalled `user model' of the program and the data from the actual data structures used in the program. The user interface typically presents a somewhat simpler model of activity than the internal data structures. In the context of a search and retrieval system, the query interface may provide a model of text different from, and usually simpler than, the one embodied in the text's actual markup. A simple model can provide a simple interface, without forcing the actual markup into a Procrustean bed.

It is also sometimes useful to provide more than one user interface to the same basic functionality.[8] A simple textual model can be used in a user interface for casual or naive users; a more complex model in an interface for those engaged in intensive study of the text; a separate user interface can provide direct access to the underlying markup in cases where that is desirable, without cluttering or confusing the basic interface.

To provide useful access to texts in a simple, powerful way, several tasks must be performed; further research is needed in all of these areas.

Useful models of text must be defined, and used to design user interfaces.
The underlying markup of texts in a collection must be translated into the markup required by the user models, either explicitly, before indexing, or (much more likely to be useful) virtually, at indexing time or at query-processing time.
The behavior of indexers and query processors must be redefined.

The following sections describe, briefly, the current state of play in each of these areas and sketch some areas that would benefit from further attention.

3.1 Common Document Models

The most obvious sources of simple user models or `document architectures' are current systems which impose the same structures on all the texts in a collection, in order to allow unified access to the entire collection in terms of that model. The ARRAS (Archival Search and Retrieval) system developed by John B. Smith, for example, provides two simple hierarchical structures for all texts, and allows the user to perform sophisticated proximity searches in terms of each. One structure reflects the physical organization of the source text: the text is divided into pages, the pages into lines, and the lines into words. The other structure reflects the formal or logical organization of the text: the text is divided into chapters, the chapters into paragraphs, the paragraphs into sentences, and the sentences into words.

These simple structures allow the user to formulate quite sophisticated searches, but the logical model of text is quite simple, even when compared to a simple SGML DTD like that of HTML. HTML currently has about fifty SGML element types; the ARRAS structures can be defined in terms of the following DTDs, together containing just eight element types:

 
<!DOCTYPE physical [
<!--* DTD for the physical organization of the work *-->
<!ELEMENT physical (page*)   >
<!ELEMENT page     (line*)   >
<!ELEMENT line     (word*)   >
<!ELEMENT word     (#PCDATA) >
]>
<!DOCTYPE logical [
<!--* DTD for the logical organization of the work *-->
<!ELEMENT logical  (chapter*)  >
<!ELEMENT chapter  (para*)     >
<!ELEMENT para     (sentence*) >
<!ELEMENT sentence (word*)     >
<!ELEMENT word     (#PCDATA)   >
]>

The DTDs just given serve, first of all, to express formally the nature of the user model. They will not typically be used in actual markup of the text; instead, a user interface can present these structures as a user model of the text by allowing the user to formulate queries in terms of these elements, as if the text were marked up using these two DTDs, even when in reality it uses a different DTD. The elements of the actual markup must be mapped into the elements of the virtual DTD (sometimes called an architecture) at indexing time or query-processing time. How this might happen in practice is described briefly in later sections.

For many texts, particularly technical documents, the logical structure just given might usefully be made more complex, by allowing chapters to contain sections and subsections, and possibly yet further subdivisions, and by distinguishing titles of sections from the paragraphs within the sections. Front and back matter might also usefully be distinguished, and a <misc> structure added to handle things that are not plausibly called paragraphs. The DTD for such a richer model might look like this:

 
<!DOCTYPE logical [
<!--* DTD for the logical organization of the work *-->
<!ELEMENT logical  (front?, chapter*, back?)          >
<!ELEMENT front    (head | para | misc | section)*    >
<!ELEMENT back     (head | para | misc | section)*    >
<!ELEMENT chapter  (head*, (misc | para)*, section*)  >
<!ELEMENT section  (head*, (misc | para)*, subsect*)  >
<!ELEMENT subsect  (head*, (misc | para)*, subsub*)   >
<!ELEMENT subsub   (head*, (misc | para)*)            >
<!ELEMENT head     (word*)                            >
<!ELEMENT para     (sentence*)                        >
<!ELEMENT misc     (sentence | word)*                 >
<!ELEMENT sentence (word*)                            >
<!ELEMENT word     (#PCDATA)   >
]>

For some texts, particularly electronic versions of annotated scholarly editions, it is desirable to distinguish authorial text from editorial annotation, and in some cases, from authorial notes. This could be done simply by mapping such material into the <misc> architectural form of the preceding DTD, or by introducing two elements for authorial and editorial annotation, respectively.

Where a document has been analysed linguistically, the search engine can provide access to the linguistic annotation by adding attributes to the text model for <sentence> and <word>. For phrase-structure analysis of syntax, an intermediate <phrase> element may be added, thus:

 
<!ELEMENT sentence (phrase*)           >
<!ATTLIST sentence
          ID       ID        #REQUIRED
          type     CDATA     'simple'  >
<!ELEMENT phrase   (phrase | word)*    >
<!ATTLIST phrase
          type     CDATA     #REQUIRED
          function CDATA     #IMPLIED  >
<!ELEMENT word     (#PCDATA)           >
<!ATTLIST word
          cat      CDATA     #REQUIRED
          baseform CDATA     #IMPLIED  >
]>

The type attribute on <sentence> would allow the user to search for sentence of particular types (e.g. simple, compound, complex, or compound/complex, to borrow an old grammar-school set of categories). The attributes on <phrase> allow both a structural label (e.g. NP or noun phrase) and a functional label (e.g. direct object). Words are marked, in this virtual DTD, with their part of speech (cat) and dictionary form (baseform). The actual markup of linguistic categories might be rather different (and usually will be); this DTD simplifies the model of annotation in order to make it more readily searchable.

To provide good access to an entire collection, we believe that it will be useful to provide several simple text models like those just described, as well as more complicated models that expose more of -- or all of -- the markup actually present in the document. Some models will be simple enough that they can be imposed, with little strain or distortion, on every document in a collection; for example, a collection of ballads might be treated as though each ballad in the collection were a <chapter> and each stanza a <para>, in order to make it possible to search across prose and verse works simultaneously. Sometimes, the fit will involve a certain amount of distortion (what counts as a `paragraph' in Hamlet or Beowulf?). How well a given text model fits the works in the collection will vary from collection to collection; the more homogeneous the collection, the more specific and useful the simplest user model can become.

Sometimes the textual model simply cannot be applied to every text in a collection: a text for which no linguistic analysis has been provided cannot usefully be searched for part-of-speech information. A text in which proper nouns are not marked cannot usefully be searched for all references to people, places, or organizations. The user interface will need to make clear to the user what texts in the collection are available through each interface.

The notion of providing more than one user interface for a text has analogues in the well-established practice, in linguistics and computer science, of providing more than one formal description of the same sentence or document. For example, the CLASSIC knowledge representation system, which uses the basic knowledge-representation notion of taxonomies to represent documents, has been outfitted with an interface for HTML documents, which is similar in some ways to the technique we are proposing here.[9]

Very simple textual models such as those presented in this section may frequently be difficult to apply to texts in which multiple structural organizations are visible. As a simple example: the metrical structure of Hamlet and the dramaturgical structure overlap in complex ways, and the play within the play adds a further level of complication.[10] Separating the actual markup of the text from the model of the text presented by the user interface, however, may provide a useful way of solving the problems allowing users to search texts using any of the available logical structures. Further work is needed on this topic.

3.2 Projecting Documents onto Common Models

To present marked-up documents using simplified textual models of the sort described in the previous section, the organizer of the collection must specify how the actual document markup is mapped onto the simplified markup of the various user models or architectures. This will require detailed understanding of the meaning of the markup in the actual texts, as well as a clear grasp of the model into which it is being mapped. In some cases, this will present a serious challenge, since SGML itself provides no methods of formally specifying the meaning of the markup defined by a DTD: in specifying a mapping, it will be necessary to have recourse to documentation for the DTD.[11]

Several methods already exist for describing the relationship between two SGML DTDs, or rather for describing how to transform a document tagged using one DTD into a document using another DTD:

architectural forms, as defined in the international standard HyTime.[12] Architectural forms allow a very simple mapping in which each element in the source document is simply renamed using a generic identifier in the target DTD; elements without equivalents are typically renamed to a generic <dontcare> element type.
the Tree Transformation Language defined by the Document Style Semantics and Specification Language (DSSSL).[13] This language allows the specification of arbitrarily complex tree-transformation operations, using a combination of primitives defined for the purpose and a declarative subset of the Scheme programming language.
various commercial and public-domain systems for specifying SGML-to-SGML conversions; perhaps the best known are OmniMark (from Omnimark, Inc.), Balise (from AIS-Berger Levreault), CoST (Copenhagen SGML Tool, public domain), sgml.pm (a set of Perl routines for processing the output of James Clark's public-domain SGML parsers sgmls and nsgmls; sgml.pm was written by David Megginson).

Not all of these systems are widely implemented, but in practice, it will be highly undesirable to perform an actual SGML-to-SGML transformation for each user model, prior to indexing. The maintenance and synchronization problems alone are sufficient reason to prefer systems that allow the indexing or query-processing system to behave as if such a transformation had been carried out. The specification of the required mapping may affect the indexing of the document, in which case any change to the user-interface model of text will require reindexing all documents presented through that interface. In lucky cases, however, the user models will be so defined that the documents can be indexed solely with reference to their actual markup, and the user model will affect only the processing of queries.

Mathematically, the mapping from actual markup to user model is a projection. Because any SGML document is described by a context-free grammar, its structure can be represented as syntax trees constructed according to that grammar; the tree structure is determined by the document markup in the document instance. Therefore, the projection may be defined purely in terms of tree operations which project the tree defining the structure of the document instance onto that defining the architectural model. The projection will only work without human intervention if the architecture we are projecting to has at most the same information as the fully marked up text.

The projection will be simplest if it is restricted to

dropping entire subtrees
dropping attributes of a node (an element)
renaming nodes (supplying a new generic identifier for an attribute)
selecting from one of several concurrent markup hierachies

The architectural forms of HyTime, and the SGML link type declaration suffice for transformations of this type.

In many cases, however, more complex transformations will be required:

calculating new generic identifiers or attributes based on the existing ones; e.g., changing <list type=ordered> elements to <ol>, <list type=unordered> to <ul>, etc.
dropping a node and promoting its children one level (this has the effect of simply ignoring the start- and end-tags for an element)
restructuring the tree (e.g. building a physical-structure view out of the empty <pb> elements used in the TEI DTD to mark page breaks, or creating a single virtual element out of the fragments pointed to by a TEI <join> element)
building a view from external annotation (e.g. a TEI <linkgrp> containing linguistic annotation of the text, possibly more than one analysis for each word or sentence)

Such mappings will require more sophisticated specification than is possible with architectural forms or link type declarations. Finding the best way to specify such mappings so as to guide indexing or query processing remains a topic for further research.

There are still other tree manipulations documented in the computer science literature on data structures and language theory, which can be performed without disturbing the ability to project one tree onto another. They raise the difficult and important question of defining the notion of equivalence for DTDs.[14]

Work is needed to show how indexing of SGML documents must be modified to account for the user models to be provided; it seems likely that indexing will need to record not only the generic identifiers and attribute values of open elements, when indexing words and phrases, but also take account of

links to or from analytic or interpretive information such as linguistic analysis stored externally
links to information in the TEI header of a document (e.g. demographic information describing the speakers in recorded conversations, or the various manuscripts and scribal hands in text-critical editions
the nearest occurrences of page-break and similar `milestone' elements to the left and right of the words being indexed
some semantic information (such as that <note> and similar elements contain annotation, the contents of which should not be counted in proximity searches involving the text being annotated, or that some elements logically imply a word break, while others do not)

With a sufficiently powerful index on a well-marked-up text, it should be possible to translate queries expressed in terms of the user model directly into queries against the native index. For example, a search for the phrases programming language and syntax on the same page might be rendered, in the user model roughly as find "programming language" within (<page> containing "syntax". If the text itself is encoded using the TEI DTD, an equivalent back-end query might be find "programming language" and "syntax" such that no <pb> occurs between them. The query find <word> with baseform='buy out' and cat='V' might take various forms depending on which of the various techniques for linguistic annotation had been used in the actual markup. One might be something like find <w> with (lemma pointing at <fs> with type='entry' containing <string> containing "buy out") and pointed at by (<fs> with type='cat' containing <sym> containing "V").

4 Problems for Future Work

By separating the conceptual model of text presented by the user interface from the model used to encode the text itself, it is possible to provide multiple interfaces to the same texts, some simple and suitable for convenient cross-collection searching, others more complex and suitable for detailed analysis of a particular text or group of texts. It is no longer necessary to force texts into a Procrustean bed in order to provide a consistent search interface for them, nor necessary to expose all of the details of the text markup in order to allow complex searches.

To put this idea into practice, further work will be needed in several areas:

the specification of textual models which are sufficiently general to apply to virtually all texts, simple and specific enough to be easily comprehended by the user, and sufficiently powerful to allow the formulation of useful searches
languages for specifying the mapping from existing DTDs onto such textual models
actual specifications for DTD/model pairs
techniques of indexing SGML documents which capture more information about the structural position of elements, and the relations between linked elements, than is common today
techniques for translating queries from a user model to the underlying markup

Apart from its intrinsic interest, such work will make it significantly easier to use heterogeneous collections, and to use electronic texts for purposes not foreseen by the original encoders.

Notes

[1] In what follows, we will generally assume that the markup language in use is defined using SGML. Other markup systems exist, and our observations apply (mutatis mutandis) to them, but none matches SGML-based systems either in expressive power or in simplicity of processing.

[2] Some readers will object at this point that such interpretive markup has no place cluttering up the texts in a generally accessible collection. That such markup need not interfere with the ability of other researchers to view the text in their own terms, is a consequence of our proposals here. The analytic or interpretive markup of the text can, in any case, be stored separate from the base text being analysed; such separation can be motivated by technical considerations, as well as by a questionable notion of interpretive hygiene.

[3] ISO 8879: Information Processing--Text and Office Systems-- Standard Generalized Markup Language (SGML), Geneva: International Standards Organization (1986).

[4] See D. Wood, "Standard Generalized Markup Language: Mathematical and Philosophical Issues," in Computer Science Today: Recent Trends and Developments, Lecture Notes in Computer Science No. 1000, ed. J. van Leeuwen (New York: Springer, 1995), pp. 344-365.

[5] For a discussion of the inadequacies of HTML for representing varied and complex documents, see Barnard, D., Burnard, L., DeRose, S., Durand, D., Sperberg-McQueen, C.M., Lessons for the World Wide Web from the Text Encoding Initiative, in World Wide Web Journal, Issue One: Proceedings 4th International Conference on the World Wide Web, (Boston, 1995), 349-357.

[6] The recent introduction of the <meta> tag, to be placed inside the HTML header, provides a glimmer of hope here; wide adoption of standard values for the attributes of this element would dramatically improve the quality of information which Web search engines could provide. The so-called Dublin Core set of metadata elements is the best known and most promising proposal for such standard attribute values.

[7] The TEI header is based on the International Standard Bibliographic Description format for computer files; it is intended to elicit from the encoder or information provider all the information required to allow a qualified cataloguer to create a full catalogue record using the Anglo-American Cataloguing Rules or whatever cataloguing rules are in local use. The header is not intended as a substitute for AACR2 cataloguing records.

[8] Nathanial S. Borenstein, for example, recommends this approach as a method of managing evolution of the user interface:

One technique is to provide several alternative user-interface `flavors.' The user of a given piece of software might be given three user-interface flavors to choose from, which would differ substantially enough to tell them apart and to attract different kinds of users to each. [...] the main benefit of the `flavors' approach to user interfaces is that it provides a graceful evolutionary path for the program's actual user interface. [...] In this way, users at least have control over the timing of changes to their interfaces.

Nathaniel S. Borenstein, Programming as if People Mattered: Friendly Programs, Software Engineering, and Other Noble Delusions (Princeton: Princeton University Press, 1991), p. 47.

[9] CLASSIC is described in Chris Welty, "An HTML Interface for Classic," Proceedings of the 1996 International Workshop on Description Logics (New York: AAAI Press, 1996).

[10] See, for discussion, David Barnard et al., "SGML-Based Markup for Literary Texts," Computing and the Humanities 22 (1988): 265-276, and Barnard et al., "Hierarchical Encoding of Text: Technical Problems and SGML Solutions," Computing and the Humanities 29 (1995): 211-231.

[11] The quality of DTD documentation is thus a critical consideration, when choosing a DTD for use in creating a collection of electronic texts.

[12] ISO 10744, Hypermedia / Time-based Structuring Language: HyTime (Geneva: ISO, 1993).

[13] ISO 10179, Document Style Semantics and Specification Language (DSSSL) (Geneva: ISO, 1996).

[14] See, among others, Darrell Raymond, F. Tompa, and D. Wood, " From Data Implementation to Data Model: Meta-Semantic Issues in the evolution of SGML," Computer Standards and Interfaces 1995; and D. Calvanese, G. de Giacomo, and M. Lenzerini, "Representing SGML Documents in Description Logics," Proceedings of the 1996 International Workshop on Description Logics (New York: AAAI Press, 1996), pp. 102-106. There has been background work in this area but still little progress in fully understanding the computational properties of equivalences and the development of usable systems based on this work.

Toward a Unified Docuverse:Standardizing Document Markup and Access without Procrustean Bargains