Corpus Encoding Standard - Document CES 1. Part 1. Version 1.3.1 Last modified 2 February 1999

Part 1

General Principles

1.1. Definitions
1.2. Interchange vs. local processing
1.3. Levels of standardization
1.4. Types of information
1.5. Criteria
1.6. Customization of the TEI

1.1. Definitions

This section gives the sense in which we use the terms in this report.

A text is a piece of human language communication in the broader sense, that one has reason to consider as a whole.

A representation of a text is a material transcription of a text. It can be a paper printing, an electronic form, an audio recording, etc.

An interpretation of a text, or collection of texts, is any information added to a text. There are degrees of interpretation: at one end of the continuum is interpretation for which there exist widely accepted criteria, such as the labeling of a part of a text as a title or paragraph. At the other end is interpretation of a more debatable or subjective nature, such as the provision of linguistic annotation or the identification of the presence of a given topic in some part of a text.

A text plus its interpretation itself satisfies the definition of text, and therefore can be considered as a new text. For example, the editor's annotations of an original manuscript can be viewed as a part of a new text comprising the edited version. Similarly, a text plus part of speech annotation can itself be seen as a new text.

Encoding is any means of making explicit some interpretation of a text or collection of texts.

Marking up is one possible means of encoding texts, by interspersing sequences of

markup or tags, which represent the interpretation of segments of text; and
content, which consists of segments of the represented text.

A markup scheme is a triple consisting of

a character set;
a syntax (rules defining what constitutes a well-formed marked-up text);
a semantics (rules defining what constitutes a valid marked-up text in some universe of interpretation).

Syntactic rules define:

legal markup
legal content
legal ways of interspersing markup and content.

Syntactically well-formed texts are not necessarily semantically valid. For example, the sequence <word>New York</word> might be syntactically well-formed in some markup scheme, but this does not ensure that "New York" is a word in a given universe of interpretation. Note that in some universes of interpretation, the interpretation of "New York" as a word would be valid, while in others it may not. Semantic rules define which syntactically well-formed texts are valid in different universes of interpretation.
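The distinction between well-formedness and validity can be illustrated with a minimal sketch. It assumes an XML parser from the Python standard library as a stand-in for an SGML parser, and two hypothetical universes of interpretation ("orthographic" and "lexical") whose rules are invented here purely for illustration; neither is defined by the CES.

```python
import xml.etree.ElementTree as ET

# Hypothetical universes of interpretation (assumptions for this sketch):
# each is a rule deciding whether the content of a <word> element is a word.
UNIVERSES = {
    "orthographic": lambda s: " " not in s,  # a word contains no spaces
    "lexical": lambda s: True,               # multi-word units count as words
}

def is_well_formed(fragment: str) -> bool:
    """Syntactic check: does the fragment parse at all?"""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

def is_valid(fragment: str, universe: str) -> bool:
    """Semantic check: is every <word> a word in the given universe?"""
    if not is_well_formed(fragment):
        return False
    rule = UNIVERSES[universe]
    root = ET.fromstring(fragment)
    # iter() also visits the root element itself when its tag matches.
    return all(rule(w.text or "") for w in root.iter("word"))

sample = "<word>New York</word>"
print(is_well_formed(sample))            # True
print(is_valid(sample, "lexical"))       # True
print(is_valid(sample, "orthographic"))  # False
```

The same well-formed fragment is accepted in one universe and rejected in the other, which is the sense in which semantic rules are relative to a universe of interpretation.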

A markup metalanguage is a set of rules that formally describe the form of the syntactic rules of a markup scheme.

Semantic rules are very often not formalized in a markup scheme. In many cases semantics relies on common knowledge about, for example, what constitutes a title or a chapter. This leaves room for confusion in the way markup is applied.

1.2. Interchange vs. local processing

There are two general uses for a markup scheme:

local processing, including data capture, as well as various applications such as text editing and formatting; search and retrieval; linguistic, semantic, metrical, etc. analysis; text collation; etc.
data interchange between individuals or sites.

The CES is intended for data interchange. A standard for interchange is desirable, so that only translation between a single markup scheme and a local format is required to use an externally-acquired text, rather than pair-wise translation between all possible local formats. A standard for local processing is not possible or even desirable at this point, given the wide range of application domains and platforms.
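The saving can be made concrete with a small calculation. The sketch below assumes that every translation is a separate one-way converter; the function names are illustrative only.

```python
def pairwise_converters(n: int) -> int:
    """One-way converters needed so each of n local formats reaches every other."""
    return n * (n - 1)

def interchange_converters(n: int) -> int:
    """One-way converters needed when each local format maps to and from
    a single interchange format."""
    return 2 * n

for n in (3, 5, 10):
    print(n, pairwise_converters(n), interchange_converters(n))
# 3 6 6
# 5 20 10
# 10 90 20
```

Under this assumption the two approaches cost the same for three formats, and the interchange format requires fewer converters for any larger number of formats.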

An interchange standard must necessarily be domain, application, and platform-independent, and therefore maximally general. Ideally, it must be as expressive as any local format in order to enable translation of all local formats into the interchange format with no loss of information. Therefore, before an interchange standard can be developed, it is necessary to identify a set of common text categories and a common text model. The existence of a standard set of categories and a common text model will contribute to the convergence of existing local practices and provide a framework for the development of local processing formats in the future.

1.3. Levels of standardization

We distinguish three levels of text standardization. The successive levels are increasingly prescriptive in terms of the markup conventions that must be used to conform to that level of standardization. Each level requires standardization at the preceding level as a prior condition. In addition, the three levels of standardization are interdependent; that is, decisions at one level will affect what can be done at the next level.

Because each of the three levels imposes increasing uniformity of encoding, the data become more and more reusable as standardization is tightened. At the same time, the application areas within which the data are reusable typically become more and more restricted. Thus, there is a trade-off between generality and reusability as the level of standardization is increased.

1.3.1. Metalanguage level

Standardization at the metalanguage level regulates the form of the syntactic rules and the basic mechanisms of markup schemes. It does not specify the markup itself (tag names, allowable sequences of tags, etc.).

SGML (ISO 8879:1986; see also Goldfarb, 1990; Bryan, 1988; and van Herwijnen, 1991) is unique in that it is a standard at the metalanguage level only. The SGML reference concrete syntax defines the forms of tags (including internal attributes), the base character set, naming rules, reserved words, allowable features (e.g., omission of end tags), etc. It does not define actual tag names or rules for their use in a marked-up text.

Using the SGML Document Type Definition (DTD) mechanism, the user can define tag names and "document models" which specify the relations among tags. This constitutes the syntactic level (see below).

1.3.2. Syntactic level

Standardization at the metalanguage level does not fully achieve the goal of universal document interchange, since it is possible, for example, to have entirely different document structures and markup even though the texts are encoded using the same metalanguage specifications.

A more powerful way to standardize texts is to specify precise tag names and syntactic rules for using the tags (i.e., the context(s) in which they can legally appear), as well as constraints on content (for example, by specifying that a tag can be associated with numeric data only). Most familiar markup schemes operate at this level. SGML documents are standardized at this level if they have common DTDs.

Conformance to a syntactic standard can be checked by parsing; that is, there are formal means to verify that a given text follows the markup syntax rules.

1.3.3. Semantic level

Standardization at the syntactic level does not guarantee that markup has been consistently applied with the same interpretation. For example, even if a tag such as <word> appears in a legal syntactic context in a given interchanged text, it is possible that the sender and receiver do not have the same understanding of the content marked by that tag. This impairs immediate reusability of the data, since, for example, even a simple word count or the content of a lexicon created from the text could vary considerably depending on the definition of a word.
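The effect on a simple word count can be shown with a short sketch. The two definitions of "word" used here are assumptions chosen for illustration; neither is the definition adopted by the CES.

```python
import re

text = "Dr. Smith didn't visit New York in mid-July."

def words_whitespace(s: str) -> list:
    """Definition A: a word is any maximal run of non-space characters."""
    return s.split()

def words_alphabetic(s: str) -> list:
    """Definition B: a word is any maximal run of ASCII letters."""
    return re.findall(r"[A-Za-z]+", s)

print(len(words_whitespace(text)))  # 8
print(len(words_alphabetic(text)))  # 10
```

A sender using definition A and a receiver assuming definition B would both regard each <word> element as legally placed, yet they would disagree on the word count and on the entries of any lexicon derived from the text.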

Markup semantics are typically informal, usually relying on the user to apply a given tag appropriately. For example, a tag such as <title> is likely to be used to mark those things which humans more or less agree upon to be titles. This kind of semantics is typically specified in accompanying user manuals; TEI P3 is an extensive example of the specification of tag semantics at this level.

Standardization at this level requires more precise definitions and constraints on the content of markup. The CES aims to standardize at the semantic level for those elements most relevant to language engineering applications, in particular, linguistic elements. Although it is not always possible to provide precise formal specifications for markup semantics, we attempt to identify a definition or set of definitions for linguistic elements that best serve the needs of language engineering applications.

1.4. Types of information

We distinguish three broad categories of information which are of direct relevance for the encoding of corpora for use in language engineering applications.

Documentation

This includes global information about the text, its content, and its encoding. This type of markup corresponds roughly to the TEI header. For example:

bibliographic description of the document;
documentation of character sets and entities;
description of encoding conventions;

etc.

Primary data

Within the primary data, we can distinguish two types of information that may be encoded:

Gross structure
This includes universal text elements down to the level of paragraph, which is the smallest unit that can be identified language-independently; for example:
- structural units of text, such as volume, chapter, etc., down to the level of paragraph; also footnotes, titles, headings, tables, figures, etc.;
- features of typography and layout, for previously printed texts: e.g., list item markers;
- non-textual information (graphics, etc.).
etc.
Sub-paragraph structures
This includes elements appearing at the sub-paragraph level which are usually signalled (sometimes ambiguously) by typography in the text and which are language dependent; for example:
- orthographic sentences, quotations;
- orthographic words;
- abbreviations, names, dates, highlighted words;
etc.

Linguistic annotation

This type of information enriches the text with the results of some linguistic analyses; most often in language engineering applications, such analysis is at the sub-paragraph level. For example:

morphological information;
syntactic information (e.g., part of speech, parser output);
alignment of parallel texts;
prosody markup;

etc.

1.5. Criteria

1.5.1. Adequate coverage

An obvious criterion for the CES is that it enable marking those features and properties of texts that are required for language engineering applications. This means that the set of features must be extensive enough to serve at least a large percentage of corpus encoding needs. At the same time, it is desirable that the scheme does not include a vast array of unnecessary or peripheral elements or encoding options. This is important for the following reasons:

a simpler scheme is easier to understand and use;
for a corpus encoding standard to be effective, it should disallow, where possible, multiple different ways to encode the same phenomenon, but rather should allow the one which is best suited to the application;
similarly, it should not allow encoding options which are not appropriate for this application.

Therefore, the CES has been designed to include a small but adequate set of elements for corpus-based work. In some instances, this has meant including only specific TEI elements where more general tags exist; and in other cases, the reverse is true. In each case, the choices are made on the basis of what is required for corpus-based language engineering research and applications.

1.5.2. Consistency

An encoding scheme should be built around consistent principles to determine what kind of objects are tags, what kind of objects are attributes, what kind of object(s) appear as tag content, etc. A well-thought-out system with strong principles (for example, tags for structural and logical pieces, attributes for properties, etc.) ensures the intellectual integrity and coherence of the encoding scheme and provides a basis for those who modify or extend it. Conversely, the lack of such a principled basis leads to practical problems in processing an encoded text (for example, in validation, search and retrieval, etc.), since different encoding styles can be mixed within the same document. Consistency is also essential to facilitate the mapping of the SGML-encoded text into other formats, for example, database formats.

For more discussion and examples, see excerpt from MUL/EAG-CES 3: Corpus Encoding Standard: Background and Principles.

1.5.3. Recoverability

When a text is encoded from a printed or electronic source (typesetter's tapes, etc.) the ability to recover the source text from the encoded version--that is, to distinguish what was in the source from the markup and potential additional information--is often desirable. There are a number of different ways to define what is to be recovered from a source text (e.g., a facsimile of a particular printed version of a text, layout, typography, etc.). For many purposes (comparison and validation between the source and the encoded text, operations such as word counts, search, concordance generation, linguistic analysis, etc.), it is sufficient to recover the sequence of characters constituting the text, independent of any typographic representation.

Recovery is an algorithmic process and should be kept as simple as possible, since complex algorithms are likely to introduce errors. Therefore, an encoding scheme should be designed around a set of principles intended to make recovery possible with simple algorithms. Processes such as tag removal and simple mappings are more straightforward and less error-prone than, say, algorithms which require rearranging the sequence of elements, or which are context-dependent. In order to provide a coherent and explicit set of recovery principles, various recovery algorithms and related encoding principles need to be worked out, taking into account such things as the role and nature of mappings (of tags to typography, of normalized characters and spellings to the original, etc.), the encoding of rendition characters and rendition text, definitions and separability of the source and annotation (such as linguistic annotation, notes, etc.), linkage of different views or versions of a text, etc.
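Recovery by tag removal, the simplest case mentioned above, might be sketched as follows. The sketch assumes an XML parser from the Python standard library as a stand-in for an SGML parser; the element names <s> and <name> are illustrative only.

```python
import xml.etree.ElementTree as ET

def recover_text(encoded: str) -> str:
    """Recover the character sequence of the source by discarding all tags."""
    root = ET.fromstring(encoded)
    # itertext() yields element text and tail text in document order.
    return "".join(root.itertext())

source = "The Times reported it."
encoded = "<s>The <name>Times</name> reported it.</s>"
print(recover_text(encoded) == source)  # True
```

This works only because the encoding neither reorders nor rewrites the content. Once characters are normalized, or material is moved or added inside element content, recovery needs explicit mappings in addition to tag removal.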

1.5.4. Validatability

Validation is the process by which software checks that the markup in a document conforms to the structural specifications given in a DTD. SGML validation software checks that tags have legal names, are properly nested, appear in the correct order, contain all required tags, etc.; that attributes appear when and only when they should, have legal values; etc.
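The kinds of checks listed above can be sketched in a few lines. This is not an SGML validator: the content model below is a hypothetical Python dictionary standing in for a DTD, the element and attribute names are invented, and the order of child elements is deliberately not checked.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model standing in for a DTD: for each element, the
# children it may contain, the children it must contain, and its attributes.
MODEL = {
    "chapter": {"children": {"title", "p"}, "required": {"title"}, "attrs": {"n"}},
    "title": {"children": set(), "required": set(), "attrs": set()},
    "p": {"children": set(), "required": set(), "attrs": set()},
}

def validate(elem, errors=None):
    """Collect violations of MODEL in elem and all of its descendants."""
    errors = [] if errors is None else errors
    rule = MODEL.get(elem.tag)
    if rule is None:
        errors.append(f"unknown element <{elem.tag}>")
        return errors
    for name in elem.attrib:
        if name not in rule["attrs"]:
            errors.append(f"illegal attribute {name!r} on <{elem.tag}>")
    child_tags = [child.tag for child in elem]
    for tag in child_tags:
        if tag not in rule["children"]:
            errors.append(f"<{tag}> not allowed in <{elem.tag}>")
    for tag in sorted(rule["required"]):
        if tag not in child_tags:
            errors.append(f"<{elem.tag}> lacks required <{tag}>")
    for child in elem:
        validate(child, errors)
    return errors

good = ET.fromstring("<chapter n='1'><title>T</title><p>x</p></chapter>")
bad = ET.fromstring("<chapter id='1'><p>x</p></chapter>")
print(validate(good))  # []
print(validate(bad))
```

For the second document the checker reports an illegal attribute and a missing required element, which is the kind of error that validation is meant to trap during data capture.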

The ability to validate is important because it enables trapping errors during data capture. It also enables ensuring that the encoded text corresponds to the model given in the DTD, thus providing a possible means by which the adequacy of the model itself can be verified.

There is a tension between the generality of an encoding scheme and the ability to validate. Over-generative DTDs allow many tag sequences which, for any given text, are not valid. In addition, the use of abstract, general tags also constrains the ability to validate; for example, the use of a general tag such as <div> to mark hierarchical divisions of a text (corresponding, for example, to book, chapter, section, etc.) disallows constraints on what can appear within a given text division, making it impossible to ensure that tighter structural constraints for a given book are observed (e.g., that titles do not appear within chapters, or that a paragraph does not appear outside the chapter level, etc.).

1.5.5. Capturability

Data capture involves

capture of the text itself, either by hand or via OCR, acquisition of word processor output, typesetter tapes, etc.; we assume that by-hand capture is not very likely for applications, although it is not excluded.
addition of markup. Fully automatic markup is rarely possible; markup is typically achieved either by hand or semi-automatically, via format translators, annotation programs such as POS taggers, etc.

The kind of markup that is added to a text directly affects the costs of capture. Some kinds of markup can be very costly, if, for example, no program can accomplish it automatically or if markup programs leave so many ambiguities that a large amount of post-editing is required. Capturability is an important concern when defining minimum requirements for conformance to a standard because corpora often consist of millions of words of text, making hand marking and substantial post-editing too costly to be practical. Capturability has important repercussions for the design of the encoding scheme:

The scheme should accommodate the various levels of analysis of the text and provide markup for both very crude element designation (which can be much less costly to achieve) as well as more precise tagging. For example, markup indicating that a word or any arbitrary segment appears in italics already exists in many texts (such as typesetter's tapes), and it is therefore virtually cost-free to mark it as such; to determine more precisely what the italics mean can be much more costly, since italics can indicate any number of things (title, caption, quotation, emphasis, foreign word, term, etc.). Similarly, it may be cheap and sufficient for many applications to make only a gross distinction between the main text (to which one may want to restrict linguistic analysis, e.g.) and auxiliary text (titles, division headers, captions, tables, footnotes, bibliographic references, etc.).
The scheme should be refinable, by providing tags at various levels of specificity together with a taxonomy identifying the hierarchical relations among them. For example, a word marked in italics could later be further analyzed and identified as a highlighted word, and later more precisely marked as a term, and still later further identified as a foreign term, etc.
Minimum requirements for conformance to the standard must be made in view of the costs of capture. Minimum requirements cannot include tagging that is actually or even potentially costly. For example, requiring that italics are disambiguated to the lowest level of the hierarchy would result in high costs for data capture since it requires substantial hand intervention. Even seemingly simple tagging, such as tagging paragraphs, can be costly depending on the input, if, for example, line breaks are not differentiated from paragraph breaks (as in electronic mail, etc.).
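The refinability requirement in the list above can be sketched as a small taxonomy in which each tag points to its more general parent. The tag names are hypothetical labels for the italics example, not CES element names.

```python
# Hypothetical taxonomy for the italics example: each tag maps to the more
# general tag it refines (None marks the crudest level).
PARENT = {
    "foreignTerm": "term",
    "term": "highlighted",
    "highlighted": "italic",
    "italic": None,
}

def ancestors(tag):
    """Return tag followed by each more general tag, most specific first."""
    chain = []
    while tag is not None:
        chain.append(tag)
        tag = PARENT[tag]
    return chain

def is_refinement(specific, general):
    """True if `specific` is `general` itself or a refinement of it."""
    return general in ancestors(specific)

print(ancestors("foreignTerm"))
# ['foreignTerm', 'term', 'highlighted', 'italic']
print(is_refinement("term", "italic"))  # True
print(is_refinement("italic", "term"))  # False
```

With such a taxonomy, a text captured cheaply at the "italic" level remains usable after parts of it are refined, since a query for the crude category can still match every more specific tag beneath it.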

1.5.6. Processability

The CES must take processing needs into account, such as the overhead of using SGML mechanisms (e.g., entity replacement, use of optional features), as well as concerns such as the ability to (efficiently) select texts according to user-specified criteria. It must also weigh the need for special mechanisms, such as inter-textual pointers and linkage of related texts or other sub-corpus segments, against the constraints their use may incur (e.g., the use of inter-textual pointers may demand that the entire corpus be available at all times for processing).

A related concern is mappability to internal representation schemes that may be used for local processing or special applications. Although ideally an encoding standard would serve both the needs of interchange and local processing (which will be eventually ensured by the coordinated development of a standard encoding format and specifications for tool development), for the near term it is likely that researchers will continue to use local formats. In addition, existing commercially available SGML software is scarce and typically expensive, and is therefore not widely available to the research community. Although mappability to local formats cannot be a driving criterion for encoding design, where possible it can be taken into account.

1.5.7. Extensibility

Absolute completeness of any markup scheme is impossible to achieve. Therefore, it is essential that any encoding scheme be extensible.

As mentioned above, there is a tension between validatability and generality. If the goal of validatability is served, DTDs will be more restrictive, and the need for extensibility will be even greater. Therefore, it is essential that systematic means for extension of the scheme are developed, which will ensure that extensions are made in a controlled and predictable way.

1.5.8. Compactness

SGML is often criticized for its verbosity, since document size can be dramatically increased by the addition of SGML tags. This is a particular concern for annotated corpora, where each word (and possibly each morpheme) can be marked for part of speech and/or other information, often increasing file size by a factor of 10 or more. This can cause problems for various kinds of processing (e.g., retrieval) as well as for interchange, since in the state of the art it is still often problematic to transfer large files over data networks. However, the costs and difficulties of handling large files are being reduced every day, so compactness is not necessarily an overriding concern for CES design, but may be taken into account as a secondary criterion.
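The growth in size can be measured directly. The sketch below assumes a hypothetical <w> element with a pos attribute wrapped around every token; both names, and the part-of-speech labels, are illustrative only.

```python
def expansion_factor(raw: str, annotated: str) -> float:
    """Ratio of annotated size to raw size, measured in UTF-8 bytes."""
    return len(annotated.encode("utf-8")) / len(raw.encode("utf-8"))

# Hypothetical tokens with part-of-speech labels.
tokens = [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]

raw = " ".join(word for word, _ in tokens)
annotated = " ".join(f'<w pos="{pos}">{word}</w>' for word, pos in tokens)

print(annotated)
print(round(expansion_factor(raw, annotated), 1))
```

Even this single attribute per word multiplies the size of the text several times over; richer annotation (lemma, morphological features, identifiers for linking) adds to the factor accordingly.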

There are several possible means to reduce the number of characters added to a text when markup is introduced:

tag minimization, e.g., start- and end-tag omission, short start- and end-tags, minimization of attribute values, etc.;
SGML entities used in place of any string, possibly including markup;
DATATAG feature, which allows a certain character to be interpreted as the end tag of an element;
non-SGML notations, involving the use of private, less verbose non-SGML schemes within tags or as attribute values.

For more discussion of these mechanisms and examples, see excerpt from MUL/EAG-CES 3: Background and context for the development of a Corpus Encoding Standard.

Each of these methods for markup reduction has drawbacks. For example, tag minimization can cause problems for some users without sophisticated software; the use of entities results in considerable processing overhead; some features such as DATATAG are not implemented in all SGML processors; private notations require the use of special software for processing, etc.

The CES makes recommendations concerning minimization based on

the degree to which and the circumstances under which markup minimization is important for corpus encoding; and
an assessment of the advantages and drawbacks of the various minimization methods for reusability within the corpus research community.

1.5.9. Readability

There are two points of view concerning readability. One assumes that the text will be captured, displayed, or in general dealt with using processing software which could make the markup either invisible or human-readable; therefore, readability need not be a concern. However, it can be argued that such software is not readily available, or that no software will ever answer all the user's needs. Therefore, there will always be a need for dealing directly with the encoded text.

Note that readability is related to compactness in two ways, in part dependent upon the object to be read: i.e., the original text or the text plus markup. When minimization techniques are used to reduce or eliminate markup in an encoded text, readability of the original text is likely to be enhanced. Minimization may even facilitate the readability of the text plus markup in some cases. On the other hand, when the object to be read includes text plus markup, in many cases minimization techniques will decrease readability.

In general, readability is a secondary concern among encoding criteria, to be aimed at only when other concerns are adequately addressed.

1.6. Customization of the TEI

The CES is conformant to the TEI Guidelines for Electronic Text Encoding and Interchange (referred to as "TEI-P3" or the "TEI Guidelines") developed by the TEI. The CES is instantiated using the TEI.2 DTD and the TEI customization mechanisms.

At present, the CES provides three different TEI customizations, each instantiated using the TEI.2 DTD and the appropriate TEI customization files, for use with different documents:

documents containing a primary data encoding, ranging from texts with gross structural markup only to texts heavily and consistently marked for elements of relevance for language engineering;
documents containing morphosyntactic annotation of the primary data, which is hyperlinked to that data;
documents containing links indicating alignment between two documents.

For convenience, we also provide a version of each of these three TEI instantiations as a stand-alone DTD, together with a means to browse the element tree as a hypertext document.

Because the TEI Guidelines are intended to cover a wide range of applications, they offer means to encode a vast array of elements. In addition, because they are intended to be maximally flexible, they often provide several ways to encode the same phenomenon. Therefore, via the TEI customization mechanisms, the CES limits the TEI scheme in order to:

include only the sub-set of the TEI tagset relevant for corpus-based work;
make choices among encoding options, with an eye toward satisfying the criteria outlined in section 1.5, above.

The TEI scheme is not complete; many areas relevant to language engineering applications are not covered. In addition, there are areas the TEI is not intended to cover, such as precise specifications for many kinds of tag content. Therefore, the CES also uses the TEI customization mechanisms to specify:

extensions to the TEI Guidelines to serve needs of language engineering.
precise values for some attributes.
required/recommended/optional elements to be marked.
detailed semantics for elements relevant to language engineering.

We constrain or simplify the TEI specifications as appropriate to serve the principles outlined in section 1.5, primarily in terms of element content, which is substantially simplified in the CES. Depending on the particular needs for encoding corpora, we constrain or extend legal and required attributes and attribute values specified by the TEI.

We adopt the TEI use of element and attribute classes, implemented using SGML parameter entities. However, these element classes are simplified, forming a shallow hierarchy with no overlaps among classes.

We do not rename TEI elements except where confusion may arise; also, three TEI-specific names are renamed to reflect their use in the CES (i.e., <TEI.2> becomes <cesDoc>, <teiCorpus.2> becomes <cesCorpus> and <teiHeader> becomes <cesHeader>).
