Corpus Encoding Standard - Document CES 1. Part 4. Version 1.3. Last modified 18 June 1996.


  Part 4

  Encoding Primary Data



4.0. Overview

For the foreseeable future, most texts to be encoded will already exist in electronic form. Such texts are referred to as legacy data. The vast majority of these documents were originally intended for print and therefore already contain markup in the form of typesetter codes, word processing formats, etc., primarily related to visual presentation.

The goal of encoding for corpus linguistics is to describe text structure that is linguistically relevant and mark objects relevant to analysis. Thus, for the purposes of corpus work in language engineering applications, a text (prior to linguistic annotation) is a set of linguistic objects, comprising at least

The text seen as a printed or displayed object, including fonts, layout, etc., and the text seen as a collection of linguistic objects represent two different views of the text. Some of the components of one of these views correspond to components of the other, while others do not. Therefore, the process of preparing a corpus originally existing as legacy data involves

This process is potentially very costly, depending on how directly presentational categories map onto distinct linguistic categories, and on how much additional markup is desired for elements that are not marked in the original or are not easily distinguishable by typography.

Because of the potential cost, data preparation is often accomplished by taking the data through a series of transformations, each of which raises the information level to some extent. The final state models the richest possible information state. The transformation process cannot be completely deterministic, since raising the information level often involves deciding which of several possible candidates a given tag maps to, as well as adding structural information that is not present or not fully explicit in the previous state. Therefore, the transformation process is neither fully automatic nor entirely cost-free. However, it is possible to minimize transformation costs from one information state to the next higher one.
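
The non-deterministic step described above can be illustrated with a minimal sketch in Python. It is not part of the CES: the mapping table, the category names (emphasis, foreign, title, heading) and the Span type are hypothetical and chosen only for illustration. A rendition code with exactly one candidate is resolved automatically; a code with several candidates (or none) is flagged for a human decision.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from a presentational code to the linguistic
# categories it might stand for. The names are illustrative assumptions,
# not categories defined by the CES.
CANDIDATES = {
    "italic": ["emphasis", "foreign", "title"],
    "bold": ["heading"],
}


@dataclass
class Span:
    text: str
    rendition: str                  # presentational code found in the legacy data
    category: Optional[str] = None  # linguistic category, once decided
    needs_review: bool = False      # True when no automatic decision was possible


def raise_information_level(spans):
    """Resolve rendition codes with a single candidate; flag the rest."""
    for span in spans:
        options = CANDIDATES.get(span.rendition, [])
        if len(options) == 1:
            span.category = options[0]
        else:
            span.needs_review = True
    return spans


spans = raise_information_level([
    Span("Chapter 1", "bold"),
    Span("Zeitgeist", "italic"),
])
for span in spans:
    print(span.text, span.category, span.needs_review)
```

In a sketch of this kind, the cost of moving to the next information state is roughly proportional to the number of spans flagged for review, which is why a close mapping between presentational and linguistic categories keeps the process cheap.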

The CES provides a TEI-conformant DTD that can be used in such a process for encoding primary data. It has been designed to represent the text at any of the various stages of information transformation (i.e., translating existing markup into relevant, increasingly information-rich categories). Encoding the text at the first (minimum required) stage can often be accomplished by automatic means and may be nearly cost-free. Users of the CES can encode their texts to conform to intermediate stages, aiming toward a rich representation of relevant linguistic information, depending on cost considerations, application needs, etc.

4.1. Levels of encoding for primary data

For the encoding of primary data, the CES identifies three levels of encoding:

Level 1
This is the minimum encoding level required for CES conformance, requiring markup for gross document structure (major text divisions), down to the level of the paragraph, conformant to the cesDoc DTD.

Level 2
This level requires that paragraph-level elements be correctly marked and that, where possible, the function of rendition information at the sub-paragraph level be determined and the corresponding elements marked accordingly.

Level 3
This is the most restrictive and refined level of markup for primary data. It places additional constraints on the encoding of s-units and quoted dialogue, and demands more sub-paragraph-level tagging.

The following sections provide precise criteria for conformance to each level.
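
As a rough illustration of how a Level 1 representation might be produced by automatic means, the following Python sketch wraps blank-line-separated blocks of plain text in paragraph elements inside a cesDoc skeleton. It is a sketch under stated assumptions, not a conformance tool: the element names text, body and p are assumed here, the header is left empty, no version attributes are set, and the output is not validated against the cesDoc DTD.

```python
import re
import xml.etree.ElementTree as ET


def to_level1_skeleton(raw_text):
    """Wrap blank-line-separated blocks of plain text in <p> elements."""
    doc = ET.Element("cesDoc")
    ET.SubElement(doc, "cesHeader")           # header content omitted in this sketch
    text = ET.SubElement(doc, "text")         # assumed wrapper element
    body = ET.SubElement(text, "body")        # assumed wrapper element
    for block in re.split(r"\n\s*\n", raw_text.strip()):
        if not block.strip():
            continue                          # skip empty input
        para = ET.SubElement(body, "p")       # assumed paragraph element
        para.text = " ".join(block.split())   # collapse line breaks inside a paragraph
    return doc


doc = to_level1_skeleton("First paragraph,\nstill first.\n\nSecond paragraph.")
print(ET.tostring(doc, encoding="unicode"))
```

Major text divisions are not detected here; legacy data that marks them (for example with typesetter codes for headings) would need an additional, source-specific step.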

4.2. Level 1 conformance

4.2.1. Requirements

                     <cesDoc version="3.9">
                       <cesHeader version="2.0"> ... </cesHeader>
                            <div> [optional]

4.2.2. Recommendations

4.2.3. Requirements for documents adapted from legacy data

4.2.4. Recommendations for documents adapted from legacy data

4.3. Level 2 conformance

4.3.1. Requirements

4.3.2. Recommendations

4.4. Level 3 conformance

Conformance to this level demands
