Corpus Encoding Standard - Document CES 1. Part 2. Version 1.1. Last modified 1 April 1996.


  Part 2

  Recommendations common
  to all documents



2.1. Metalanguage recommendations

The CES constitutes a TEI-conformant application of SGML (ISO 8879). CES documents may be parsed using any SGML parser.

2.1.1. Tag Syntax

All elements in a document are delimited by tags. There are two forms of tag: a start-tag, marking the beginning of an element, and an end-tag, marking its end.

The CES uses the "reference concrete syntax" of SGML, which specifies that tags are delimited by the characters "<" and ">" and contain the name of the element (its gi, for generic identifier). In end-tags, the gi is preceded by "/". The gi may consist of upper- and lower-case letters and the digits 0-9.
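
The tag syntax described above can be illustrated with a minimal Python sketch. It is not part of the CES: the function name `parse_tags` is hypothetical, the pattern assumes that a gi begins with a letter, and attributes are deliberately ignored.

```python
import re

# A start-tag is "<gi>", an end-tag is "</gi>".
# Assumption: the gi starts with a letter; attributes are not handled.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")

def parse_tags(text):
    """Return (kind, gi) pairs for every attribute-free tag found in text."""
    return [("end" if slash else "start", gi) for slash, gi in TAG.findall(text)]

print(parse_tags("<p>Some <hi>text</hi></p>"))
# [('start', 'p'), ('start', 'hi'), ('end', 'hi'), ('end', 'p')]
```

A real SGML parser does considerably more (attributes, declarations, entity expansion); the sketch only shows how the delimiters and the gi fit together.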

2.1.2. Element names

The CES adopts the strategy of the TEI application of SGML by extending the legal length of names from 8 to 32 characters. Case is not significant in tag or attribute names. However, we recommend the following conventions, which follow the TEI:

2.1.3. TEI Metalanguage extensions

To encode the complexity and wide range of texts it treats, the TEI has significantly extended its metalanguage-level specification beyond what is offered by SGML. For instance, the TEI provides additional mechanisms for:

All of these extensions are adopted in the CES.

2.1.4. Tag Minimization

SGML permits various kinds of minimization, or abbreviatory conventions. The TEI interchange format prohibits the use of most minimization techniques (e.g., short references, omission of generic identifiers in start and end tags) allowed in ISO 8879. The CES adopts the TEI prohibition against the use of minimization techniques in general:

2.2. Character sets

A universal character set (UCS) that will cover all languages is under development by ISO and the Unicode Consortium. The results of the work so far on this character set have been approved as the Universal Multiple-Octet Coded Character Set standard, ISO/IEC 10646-1. UCS will likely be the accepted encoding standard for characters in the future.

UCS encodes each character in four bytes, thus providing a single character set to encode all the world's languages.
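
As a small illustration of the four-byte form, the following Python sketch prints the encoded length and bytes of a few characters. It assumes that Python's `utf-32-be` codec is an acceptable stand-in for the four-byte UCS representation; the function name `ucs4_bytes` is hypothetical.

```python
# Assumption: Python's "utf-32-be" codec stands in for the four-byte UCS form.
def ucs4_bytes(text):
    """Encode text with four bytes per character, most significant byte first."""
    return text.encode("utf-32-be")

for ch in ("a", "\u00e9", "\u0416"):  # Latin a, e with acute, Cyrillic Zhe
    encoded = ucs4_bytes(ch)
    print(repr(ch), len(encoded), encoded.hex())
```

Every character occupies exactly four bytes, whichever script it belongs to.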


Although there is little doubt that this standard will eventually become the basis for character representation, its full specification and implementation are far enough away that, for present purposes, a temporary solution is needed.

For corpora intended for use in language engineering applications, much interchange will be accomplished via CD-ROM or ftp. Ftp allows binary interchange and can be used to safely transmit any 8-bit character set. Moreover, data interchange is becoming increasingly reliable, due to major international efforts towards standardization such as the Internet effort. For example, TCP/IP and many network applications (e.g., ftp, WWW, etc.) are "8-bit clean". In addition, recent standards have been proposed to guarantee delivery by automatically packing and unpacking data as required:

Even when these standards are not yet implemented, files can be safely transferred by using universally available encoding programs such as 'uuencode'.

Therefore, we recommend that all data be distributed using the recommendations below for character sets. In the case of blind interchange, data should be encoded using 'uuencode'.
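
The effect of uuencoding can be sketched in Python with the standard `binascii` module. This is only an illustration under stated assumptions: the helper names `uu_lines` and `uu_decode` are hypothetical, and the sketch produces the encoded body lines only, without the "begin" and "end" framing that the 'uuencode' program adds.

```python
import binascii

def uu_lines(data):
    """Encode bytes as uuencoded body lines (at most 45 input bytes per line)."""
    return [binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)]

def uu_decode(lines):
    """Decode a list of uuencoded body lines back to the original bytes."""
    return b"".join(binascii.a2b_uu(line) for line in lines)

original = "na\u00efve caf\u00e9".encode("iso8859_1")  # 8-bit data
lines = uu_lines(original)

# The encoded form uses 7-bit characters only and round-trips exactly.
print(all(byte < 128 for line in lines for byte in line))
print(uu_decode(lines) == original)
```

Because the encoded lines contain only 7-bit characters, they survive channels that are not 8-bit clean.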

Our recommendation has the merit of being reasonably compatible with UCS, thus facilitating future migration to that standard.

The CES recommendations have been adopted by the EAGLES Tool subgroup for its Guidelines for Linguistic Software Development--see especially Part 1-1: Characters.

2.2.1. ISO 8859-X

The CES recommends the use of the ISO 8859-X series for all the following scripts: Arabic, Cyrillic, Greek, Hebrew, Latin.

The following is a rough list of the languages accommodated in the ISO 8859 series. See also the graphic representation of the code tables.

ISO-8859-1 - Latin 1
Western Europe and Americas: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.

ISO-8859-2 - Latin 2
Latin-written Slavic and Central European languages: Czech, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovene.

ISO-8859-3 - Latin 3
Esperanto, Galician, Maltese, and Turkish.

ISO-8859-4 - Latin 4
Scandinavia/Baltic (mostly covered by 8859-1 also): Estonian, Latvian, and Lithuanian. It is an incomplete predecessor of Latin 6.

ISO-8859-5 - Cyrillic
Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian.

ISO-8859-6 - Arabic
Non-accented Arabic.

ISO-8859-7 - Modern Greek

ISO-8859-8 - Hebrew
Non-accented Hebrew.

ISO-8859-9 - Latin 5
Same as 8859-1, except that the Icelandic letters are replaced by Turkish ones.

ISO-8859-10 - Latin 6
Lappish/Nordic/Eskimo languages: adds the last Inuit (Greenlandic) and Sami (Lappish) letters that were missing in Latin 4, so as to cover the entire Nordic area.

A list of characters used by a large number of languages is provided in "Characters and character sets for various languages " (Alvestrand, 1995).

See also "ISO 8859-1 National Character Set FAQ" (Gschwind, 1995).
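
The choice among parts of the series listed above can be sketched in Python as a search for the first part able to encode a given text. The codec names are Python's own, not CES identifiers, and the function name `first_fitting_part` and the order of preference are assumptions made for illustration.

```python
# Candidate parts in an assumed order of preference (Python codec names).
PARTS = ["iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7"]

def first_fitting_part(text, parts=PARTS):
    """Return the first ISO 8859 codec that can encode text, or None."""
    for name in parts:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        return name
    return None

print(first_fitting_part("fran\u00e7ais"))   # iso8859_1
print(first_fitting_part("\u010desk\u00fd")) # iso8859_2
print(first_fitting_part("\u0440\u0443\u0441\u0441\u043a\u0438\u0439"))  # iso8859_5
```

A result of None indicates a text that none of the listed parts can represent.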

Shortcomings of the ISO 8859 series

The ISO 8859 series lacks the Dutch ligature ij, the French ligature oe, and German-style quotation marks (low opening, high closing), as well as several other characters.

There are also Bulgarian and Ukrainian characters missing from ISO 8859-5.
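
Such gaps can be detected mechanically. The following Python sketch lists the characters of a text that a given part cannot encode; the function name `missing_from` is hypothetical and the sample string was made up for illustration.

```python
def missing_from(text, codec="iso8859_1"):
    """Return the characters of text that the given codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

# oe ligature, ij ligature, and German low/high quotation marks
sample = "c\u0153ur, \u0133s, \u201eZitat\u201c"
print(missing_from(sample))
```

Characters reported in this way are candidates for representation by entity references.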

2.2.2. Languages not covered by the ISO 8859 series


The recommendations above do not provide for Asian languages, including Chinese, Japanese, and Korean. Independent standards have been developed for these languages. The CES specifications for these cases are under development.

If it is necessary to encode a text in a language not covered by the ISO 8859-X series, it is required to use

It is also required that the character set used is fully documented in the header providing the encoding description for the corpus; see the description of <wsdUsage>.

Note that the TEI provides several pre-defined Writing System Declarations, including:

2.2.3. Entities for odd characters

Characters not available in the character set that has been selected for the document as a whole must be represented by entity references, which take the form of an ampersand (&) followed by a mnemonic for the character, and terminated by a semicolon (;) where this is necessary to resolve ambiguity. All entities used in a document must be declared in the DTD.
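
The form of an entity reference can be illustrated with a minimal Python sketch that expands references from a lookup table. The table `ENTITIES` is a hypothetical excerpt written out by hand; in practice the names and replacement values come from the entity sets declared in the DTD. The pattern is also a simplification, since it only allows letters and digits in names.

```python
import re

# Hypothetical excerpt of an entity table, for illustration only.
ENTITIES = {
    "amp": "&",
    "lt": "<",
    "mdash": "\u2014",
    "pound": "\u00a3",
    "eacute": "\u00e9",
}

# An ampersand, a mnemonic name, and an optional terminating semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")

def expand_entities(text, table=ENTITIES):
    """Replace known entity references; leave unknown ones untouched."""
    def replace(match):
        return table.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(replace, text)

print(expand_entities("caf&eacute; costs &pound;5"))
```

References that are not in the table are left as they are, so that an undeclared entity remains visible instead of silently disappearing.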

We recommend the use of ISO entities. Standard public entity names can be declared by a reference to a standard public entity, e.g.,

<!ENTITY % ISOLat1 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN">

<!ENTITY % ISOLat2 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 2//EN">

<!ENTITY % ISOGrk1 PUBLIC "ISO 8879-1986//ENTITIES Greek Letters//EN">

<!ENTITY % ISOGrk2 PUBLIC "ISO 8879-1986//ENTITIES Monotoniko Greek//EN">

<!ENTITY % ISOCyr1 PUBLIC "ISO 8879-1986//ENTITIES Russian Cyrillic//EN">

<!ENTITY % ISOCyr2 PUBLIC "ISO 8879-1986//ENTITIES Non-Russian Cyrillic//EN">


Many of the characters that commonly need to be represented are included in the ISO entity sets ISOpub and ISOnum. These sets include, for example, entities for the special characters "&" and "<", which are part of the SGML markup syntax and cannot be entered literally in an SGML document. They also contain entities such as "&mdash;" (for the dash the width of an "m"), "&pound;" (for the pound sterling sign), etc. The ISOpub and ISOnum entity sets are declared as follows:

<!ENTITY % ISOPUB PUBLIC "ISO 8879-1986//ENTITIES Publishing//EN">

<!ENTITY % ISONUM PUBLIC "ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN">

Note that these entity sets are declared in all the CES DTDs.

If no standard entity name exists or a standard entity is to be renamed, normal SGML syntax can be used to declare an appropriate entity, as follows:

<!ENTITY foo '[unprintable]'> <!-- weird character -->

Declarations of entities and entity sets not already included in the DTD for the document are added at the top of the encoded document, in the declaration subset of the document type declaration, as in this example:
       <!doctype cesDoc PUBLIC "-//CES//DTD//cesDoc//EN" [
       <!ENTITY igcy     "i`"    --=small i grave, Cyrillic--    >
       <!ENTITY Igcy     "I`"    --=capital I grave, Cyrillic--  >
       <!ENTITY % ISOcyr1  PUBLIC
            "ISO 8879-1986//ENTITIES Russian Cyrillic//EN"       >
       <!ENTITY % ISOcyr2  PUBLIC
            "ISO 8879-1986//ENTITIES Non-Russian Cyrillic//EN"   >
       ]>
       <cesDoc version="3.9">...


2.2.4. Shifting among character sets

When different character sets are mixed in a single document, three alternative methods can be used (possibly in conjunction):

These implicit methods are useful when there is a systematic mapping between tags and character sets (e.g., a list of words in one character set, with their translations in another).

The CES provides global lang and wsd attributes, as well as appropriate mechanisms to document correspondences between languages or tags with particular character sets in the CES header.
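
The idea of tying a character set to a language value can be sketched in Python as follows. The mapping `CHARSET_FOR_LANG` and the function `decode_segments` are hypothetical; the sketch does not model the CES header or the actual syntax of the lang and wsd attributes, only the principle that each tagged segment is decoded with the character set documented for it.

```python
# Hypothetical correspondence between lang values and character sets
# (Python codec names); the CES header format is not modelled here.
CHARSET_FOR_LANG = {
    "en": "iso8859_1",
    "fr": "iso8859_1",
    "ru": "iso8859_5",
    "el": "iso8859_7",
}

def decode_segments(segments, charsets=CHARSET_FOR_LANG):
    """Decode (lang, raw_bytes) pairs using the charset tied to each lang."""
    return [(lang, raw.decode(charsets[lang])) for lang, raw in segments]

# A word list in one character set with translations in two others.
segments = [
    ("en", b"house"),
    ("ru", "\u0434\u043e\u043c".encode("iso8859_5")),
    ("el", "\u03c3\u03c0\u03af\u03c4\u03b9".encode("iso8859_7")),
]
for lang, text in decode_segments(segments):
    print(lang, text)
```

The same byte values mean different characters in different parts of ISO 8859, which is why the correspondence has to be documented explicitly.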

Note that the language tagging mechanism will still be valid with UCS. "Unicode characters do not specify the language of the text they represent; that is, they are completely language neutral. If the language of a character or character string must be known to accomplish a particular type of process (e.g. language sensitive collation), then a higher-level protocol must be used to specify the language." [from Unicode's "Basic Principles"].

2.2.5. International Phonetic Alphabet


The TEI provides a pre-defined Writing System Declaration (WSD) for transcribing the International Phonetic Alphabet. This is distributed by the TEI both as an SGML entity set and as a TEI Writing System Declaration documenting the entity set:

-//TEI P3: 1994//ENTITIES International Phonetic Alphabet//EN

The CES recommends using the SGML entities and providing the TEI WSD (with reference to it in the <wsdUsage> element in the header) when the IPA system is used in a document.
