XML Corpus Encoding Standard  Document XCES 0.2 Last Modified 7 May 2002



XCES
Corpus Encoding Standard for XML



SCHEMAS

Beta 0.2






XCES Schema Overview and Download

The XCES consists of eight schemas. The xcesGlobal and xcesLink schemas do not declare any elements and are imported/included by the other schemas. The eight schemas are:

xcesDoc.xsd : Encoding conventions for level 1 XCES documents.
xcesAna.xsd : Encoding conventions for annotated data.
xcesAlign.xsd : Encoding conventions for aligned data.
xcesWord.xsd : Extends xcesDoc to provide word level tags for stand-off annotation.
xcesSpoken.xsd : Extends xcesDoc to provide tags for encoding spoken data.
xcesHeader.xsd : The XCES header used by all XCES documents.
xcesGlobal.xsd : Global group and type definitions.
xcesLink.xsd : XLink attribute definitions used in xcesAna.xsd and xcesAlign.xsd.
: Used to import the xlink namespace.

Download all files: xces-schema-0_2.zip

The XCES schemas were generated automatically from the XCES DTDs using XML Spy and then extensively modified by hand.


Validation


Usage

xcesDoc resources :

        <?xml version="1.0"?>
        <cesCorpus xmlns="http://www.xml-ces.org/schema" 
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                  xsi:schemaLocation="http://www.xml-ces.org/schema
                                      http://www.cs.vassar.edu/XCES/schema/xcesDoc.xsd" 
                  version="1.0">
		...
        </cesCorpus>
xcesAna resources : 
        <?xml version="1.0"?>
        <cesAna xmlns="http://www.xml-ces.org/schema" 
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                  xsi:schemaLocation="http://www.xml-ces.org/schema
                                      http://www.cs.vassar.edu/XCES/schema/xcesAna.xsd" 
                  version="1.0">
		...
        </cesAna>
xcesAlign resources : 
        <?xml version="1.0"?>
        <cesAlign xmlns="http://www.xml-ces.org/schema" 
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                  xsi:schemaLocation="http://www.xml-ces.org/schema
                                      http://www.cs.vassar.edu/XCES/schema/xcesAlign.xsd" 
                  version="1.0">
		...
        </cesAlign>
xcesWord resources : 
        <?xml version="1.0"?>
        <cesWord xmlns="http://www.xml-ces.org/schema" 
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                  xsi:schemaLocation="http://www.xml-ces.org/schema
                                      http://www.cs.vassar.edu/XCES/schema/xcesWord.xsd" 
                  version="1.0">
		...
        </cesWord>
xcesSpoken resources : 
        <?xml version="1.0"?>
        <cesSpoken xmlns="http://www.xml-ces.org/schema" 
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                  xsi:schemaLocation="http://www.xml-ces.org/schema
                                      http://www.cs.vassar.edu/XCES/schema/xcesSpoken.xsd" 
                  version="1.0">
		...
        </cesSpoken>
xcesHeader resources :
        <?xml version="1.0"?>
        <cesHeader xmlns="http://www.xml-ces.org/schema" 
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                  xsi:schemaLocation="http://www.xml-ces.org/schema
                                      http://www.cs.vassar.edu/XCES/schema/xcesHeader.xsd" 
                  version="1.0">
		...
        </cesHeader>


XCES Header

The XCES header is described in its own schema, making it possible to create standalone header files. The header can be stored in the document in a <cesHeader> element, as in the CES, or stored externally, with an <xcesHeader> element used to link to the header file. Although it is not defined here, it is possible to define a new document type that consists of a sequence of headers in one file (a headerbase) and to use XPointer expressions to locate the fragment containing the desired header.

Headers.XML : <?xml version="1.0"?>
<cesHeaders>
<cesHeader id="h1"> ... </cesHeader>
<cesHeader id="h2"> ... </cesHeader>
<cesHeader id="h3"> ... </cesHeader>
<cesHeader id="h4"> ... </cesHeader>
</cesHeaders>
Corpus.XML : <?xml version="1.0"?>
             <cesCorpus>
                 <cesHeader>
                     ...
                 </cesHeader>
                 <cesDoc>
                     <xcesHeader xlink:href="Headers.XML#h1"/>
                     <text> ... </text>
                 </cesDoc>
                 <cesDoc>
                     <xcesHeader xlink:href="Headers.XML#h2"/>
                     <text> ... </text>
                 </cesDoc>
                 ...
             </cesCorpus>
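
The headerbase lookup described above can be sketched in Python. This is a minimal illustration, not part of the XCES: the function name find_header and the placeholder header text are assumptions, the sample omits namespaces as the example above does, and only a bare-name fragment such as #h1 is handled (not a full xpointer() expression).

```python
import xml.etree.ElementTree as ET

# Hypothetical headerbase; the header contents are placeholders.
HEADERBASE = """<cesHeaders>
  <cesHeader id="h1">first header</cesHeader>
  <cesHeader id="h2">second header</cesHeader>
</cesHeaders>"""


def find_header(headerbase_xml, href):
    """Return the <cesHeader> whose id equals the fragment of href, or None."""
    # "Headers.XML#h2" -> fragment "h2"; the file part is ignored in this sketch.
    _, _, fragment = href.partition("#")
    root = ET.fromstring(headerbase_xml)
    for header in root.iter("cesHeader"):
        if header.get("id") == fragment:
            return header
    return None


header = find_header(HEADERBASE, "Headers.XML#h2")
print(header.text)
```

A real processor would first load the file named before the # and would need a full XPointer implementation for anything beyond simple ID fragments.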

New Elements

The following elements have been added to the <profileDesc> element in the <cesHeader>.

<particDesc> (i.e. participation description)

Description :Describes the identifiable speakers, voices or other participants in a linguistic interaction.
XPath cesHeader/profileDesc/particDesc
Attributes
Name Type Default
xces:globalAtts    
xces:declarable    
Content Model (person | personGrp)+ particLinks?
Example
<particDesc>
   <person id="p1" sex="f" age="42">Female informant, well educated, born 
   in Boston US, 12 Jan 1950, of unknown occupation.  Speaks English 
   fluently.</person>
   <person id="p2" sex="m" age="43"/>
   <particLinks>
      <relation active="p1 p2" desc="spouse"/>
   </particLinks>
</particDesc>

 

<person>

Description :Describes an individual participant in a linguistic interaction.
XPath cesHeader/profileDesc/particDesc/person
Attributes
Name Type Meaning
xces:globalAtts    
role xs:string the role of this participant in the interaction
sex (m | f | u ) male, female, unknown
age xs:string the age group to which the participant belongs
Content Model Character data
Example
<person id="p1" sex="f" age="42">Female informant, well educated, born 
   in Boston US, 12 Jan 1950, of unknown occupation.  Speaks English 
   fluently.
</person>

 

<personGrp>

Description :Describes a group of individuals treated as a single entity for analytical reasons.
XPath cesHeader/profileDesc/particDesc/personGrp
Attributes
Name Type Meaning
xces:globalAtts    
role xs:string the role of this group in the interaction
sex (m | f | x | u ) male, female, mixed, unknown
age xs:string the age group of the participants
size xs:string the size or approximate size of the group.
Content Model Character data
Example
<particDesc>
   <person id="p1" sex="f" age="42">Female informant, well educated, born 
   in Boston US, 12 Jan 1950, of unknown occupation.  Speaks English 
   fluently.</person>
   <person id="p2" sex="m" age="43"/>
   <particLinks>
      <relation active="p1 p2" desc="spouse"/>
   </particLinks>
</particDesc>

 

<particLinks> (i.e. participation relationships)

Description :Describes the relationships or social links existing amongst participants in an interaction.
XPath cesHeader/profileDesc/particDesc/particLinks
Attributes
Name Type Meaning
xces:globalAtts    
Content Model relation+
Example
<particLinks>
   <relation desc="parent" active="p1 p2" passive="p3 p4" mutual="n"/>
   <relation desc="spouse" active="p1 p2"/>
   <relation type="social" desc="employer" active="p1" 
             passive="p3 p5 p6 p7" mutual="n"/>
</particLinks>

 

<relation> (i.e. relationship)

Description :Describes any kind of relationship between a specified group of participants.
XPath cesHeader/profileDesc/particDesc/particLinks/relation
Attributes
Name Type Meaning
xces:globalAtts   
type xs:string categorizes the relationship in some respect
desc xs:string a brief description of the relationship
active xs:IDREFS the active participants in a non-mutual relationship
passive xs:IDREFS the passive participants in a non-mutual relationship
mutual (y | n) indicates if the relation holds equally amongst the participants
Content Model Empty
Example
<relation type="social" desc="supervisor" active="p1"
          passive="p2 p3 p4" mutual="n"/>
<relation type="personal" desc="friends" active="p2 p3 p4" mutual="y"/> 
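
Because active and passive are lists of ID references, each token in them is expected to name a participant declared in the same <particDesc>. The following Python sketch checks that; it is an illustration only, and the function name unresolved_participants and the sample data (including the deliberately undeclared p3) are assumptions rather than part of the XCES.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: p3 is referenced but never declared.
PARTIC_DESC = """<particDesc>
  <person id="p1" sex="f" age="42"/>
  <person id="p2" sex="m" age="43"/>
  <particLinks>
    <relation desc="spouse" active="p1 p2" mutual="y"/>
    <relation desc="parent" active="p1 p2" passive="p3" mutual="n"/>
  </particLinks>
</particDesc>"""


def unresolved_participants(partic_desc_xml):
    """Return the ids used in active/passive that no person or personGrp declares."""
    root = ET.fromstring(partic_desc_xml)
    declared = {
        el.get("id")
        for el in root
        if el.tag in ("person", "personGrp") and el.get("id")
    }
    referenced = set()
    for relation in root.iter("relation"):
        for attr in ("active", "passive"):
            # IDREFS values are whitespace-separated lists of ids.
            referenced.update(relation.get(attr, "").split())
    return referenced - declared


print(unresolved_participants(PARTIC_DESC))
```

A validating schema processor already rejects IDREFS that match no ID anywhere in the document; this check is narrower, since it only accepts ids that belong to participants.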

 

<settingDesc> (i.e. setting description)

Description :Describes the setting or settings within which a language interaction takes place.
XPath cesHeader/profileDesc/settingDesc
Attributes
Name Type Meaning
xces:globalAtts    
xces:declarable    
Content Model setting+
Example
<settingDesc>
   <setting>Texts recorded in the Canadian Parliament building in Ottawa, 
   between April and November 1988.</setting>
</settingDesc>

 

<setting>

Description :Describes one particular setting in which a language interaction takes place.
XPath cesHeader/profileDesc/settingDesc/setting
Attributes
Name Type Meaning
xces:globalAtts    
who xs:IDREFS the identifiers of the participants in this setting.
Content Model (name | time | locale | activity)*
Mixed true
Example
<setting who="p1 p2 p3">
   <name>New York City</name>
   <time>1989</time>
   <locale>on a park bench</locale>
   <activity>feeding birds</activity>
</setting>

 

<name> (i.e. name or proper noun)

Description Contains a proper noun or noun phrase
XPath cesHeader/profileDesc/settingDesc/setting/name
Attributes
Name Type Meaning
xces:globalAtts    
Content Model Character data.
Mixed true
Example
<name>New York City</name>

 

<time>

Description A phrase containing the time of day in any form.
XPath cesHeader/profileDesc/settingDesc/setting/time
Attributes
Name Type Meaning
xces:globalAtts    
type xs:string legal values are (am | pm | 24 hour | descriptive)
zone xs:string a time zone or place name
value xs:string a word or phrase which might be helpful in evaluating the temporal expression.
Content Model Character data.
Example
<setting>
On a park bench at approximately <time value="1145">a quarter to twelve</time>. </setting>

 

<locale>

Description A brief informal description of the nature of a place.
XPath cesHeader/profileDesc/settingDesc/setting/locale
Attributes
Name Type Meaning
xces:globalAtts    
Content Model Character data.
Example
<setting>
On <locale>a park bench</locale> at approximately a quarter to twelve. </setting>

 


XCES Base Types

Almost all elements in the XCES have an associated simple or complex type. The only exceptions are the root elements in each schema document. All elements, attributes, and types have been placed in the namespace http://www.xml-ces.org/schema. It is recommended, but not required, that http://www.xml-ces.org/schema be made the default namespace for XCES documents. For the remainder of this document it will be assumed that the prefix xces: refers to the namespace http://www.xml-ces.org/schema.

Attribute Groups

The CES DTDs use an ENTITY definition to represent the set of attributes that belong to the class a.global.

    <!ENTITY % a.global '
               id        ID #IMPLIED
               n         CDATA #IMPLIED
               lang      IDREF #IMPLIED
               xml:lang  CDATA #IMPLIED'>

In the XCES the a.global entity has been replaced with the attribute group xces:a.global defined in xcesGlobal.xsd.

    <xs:attributeGroup name="a.global">
        <xs:attribute name="id" type="xs:ID"/>
        <xs:attribute name="n" type="xs:string"/>
        <xs:attribute name="lang" type="xs:IDREF"/>
        <xs:attribute ref="xml:lang"/>
    </xs:attributeGroup>

Each of the top-level schemas (xcesAlign, xcesAna, and xcesDoc) extends the set of global attributes by adding attributes specific to that type of document. The attributes added are:

Schema     Attribute Group Name  Attributes Added  Attribute Meaning
xcesAlign  xces:a.align          wsd               Character encoding used.
xcesAna    xces:a.ana            type              Provides more precise information about the element's function or role.
                                 wsd               Character encoding used.
xcesDoc    xces:a.text           rend              Rendering information about the original version.
                                 wsd               Character encoding used.

Element Groups

The cesDoc DTD makes use of entities to represent element classes similar to the TEI element classes. In the xcesDoc.xsd schema these are represented by element groups. For example, the entities:

    <!ENTITY % m.token 'abbr | date | num |measure |
                       name | term | time |'>
    <!ENTITY % m.phrase '%m.token; foreign | mentioned |
                       distinct | title | hi | list |
                       corr | gap | reg | ptr | ref'>
    <!ENTITY % phrase.seq '#PCDATA | %m.phrase;'>

are replaced by the element groups xces:m.token, xces:m.common, and xces:phrase.seq.

    <xs:group name="m.token">
        <xs:choice>
            <xs:element name="abbr" type="xces:abbrType"/>
            <xs:element name="date" type="xces:dateType"/>
            <xs:element name="num" type="xces:numType"/>
            <xs:element name="measure" type="xces:measureType"/>
            <xs:element name="name" type="xces:nameType"/>
            <xs:element name="term" type="xces:termType"/>
            <xs:element name="time" type="xces:timeType"/>
        </xs:choice>
    </xs:group>
    <xs:group name="m.common">
        <xs:choice>
            <xs:element name="list" type="xces:listType"/>
            <xs:element name="corr" type="xces:corrType"/>
            <xs:element name="gap" type="xces:gapType"/>
            <xs:element name="reg" type="xces:regType"/>
            <xs:element name="ptr" type="xces:ptrType"/>
            <xs:element name="ref" type="xces:refType"/>
        </xs:choice>
    </xs:group>

    <xs:group name="phrase.seq">
        <xs:choice>
            <xs:group ref="xces:m.token"/>
            <xs:group ref="xces:m.common"/>
            <xs:element name="foreign" type="xces:foreignType"/>
            <xs:element name="mentioned" type="xces:mentionedType"/>
            <xs:element name="distinct" type="xces:distinctType"/>
            <xs:element name="title" type="xces:titleType"/>
            <xs:element name="hi" type="xces:hiType"/>
        </xs:choice>
    </xs:group>

In addition to the above groups, element groups have also been defined that correspond to the following CES entities:

String Types

The xcesGlobal.xsd schema defines the string type xces:class.string, which extends xs:string by adding the global attributes xces:a.global. xces:class.string is then used as the base type when defining other string types. All types that extend xces:class.string have a String suffix, e.g. xces:annotationString and xces:creationString.


XCES Linking

XLink attributes and XPointer expressions are used in the XCES to represent links between documents. XPointers can express points and ranges in an XML document whether or not elements in the document contain IDs. However, not all CES linking elements have been converted to XLink links. For example, the <cesAlign> element contains fromDoc, toDoc, fromLocation, and toLocation attributes that specify the target documents being aligned. Modeling these attributes with XLink would require adding four new elements. These attributes therefore remain in the XCES as links, but XPointers should be used to specify the targets. For example:

    <cesAlign fromDoc="corpus/english/text1.xml"
              toDoc="corpus/spanish/text1.xml"
              fromLocation="#xpointer(id('p1')/range-to(id('p5')))"
              toLocation="#xpointer(id('p1')/range-to(id('p6')))"
              ...>

Or:

    <cesAlign fromDoc="corpus/english/text1.xml#xpointer(id('p1')/range-to(id('p5')))"
              toDoc="corpus/spanish/text1.xml#xpointer(id('p1')/range-to(id('p6')))"
              ...>

At this time XPointer has not been made a final Recommendation by the W3C, and as a result the XCES may need to be changed in the future.

Linking in cesAna Documents

There are four linking elements in an XCES annotation document used to indicate a range being annotated: <chunk>, <tok>, <s>, and <par>. In the CES these elements contain the attributes from and to that are used as links. The <chunk> element also contains a doc attribute used as a link. In the XCES these elements are now simple links (xlink:type="simple") and use the xlink:href attribute with an XPointer expression to express the range. For example:

 TEXT.CES :  
   <chunk doc="/corpus/en/text1.ces" from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\25">
      <tok from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\5"/>
      ...
   </chunk>

Becomes:

TEXT.XCES:
   <chunk xml:base="/corpus/en/text1.ces" 
          xlink:href="#xpointer(string-range(/2/1/1/1/2/1, '', 1, 25))">
      <tok xlink:href="#xpointer(string-range(/2/1/1/1/2/1, '', 1, 5))"/>
      ...
   </chunk>
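
The conversion shown above can be sketched as a small Python function. This is an illustration under stated assumptions, not a definitive converter: the function name ces_range_to_xpointer is hypothetical, both locations are assumed to lie in the same element, and the CES offsets are assumed to be 1-based and inclusive, so that the fourth string-range argument (a length) is the end offset minus the start offset plus one.

```python
def ces_range_to_xpointer(from_loc, to_loc):
    """Convert CES from/to locations (path\\offset) to an XPointer string-range."""
    from_path, _, from_offset = from_loc.partition("\\")
    to_path, _, to_offset = to_loc.partition("\\")
    if from_path != to_path:
        raise ValueError("this sketch only handles ranges within one element")
    start = int(from_offset)
    # Assumption: offsets are inclusive, so the length counts both end points.
    length = int(to_offset) - start + 1
    # "2.1.1.1.2.1" becomes the child sequence "/2/1/1/1/2/1".
    child_sequence = "/" + from_path.replace(".", "/")
    return f"#xpointer(string-range({child_sequence}, '', {start}, {length}))"


print(ces_range_to_xpointer("2.1.1.1.2.1\\1", "2.1.1.1.2.1\\25"))
```

For the <chunk> in the example, this reproduces the xlink:href value shown in TEXT.XCES.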

 

The <cesAna> and <chunkList> elements also contain the xml:base attribute so the common portions of the file's location can be specified at a higher level. For example:

    <cesAna xml:base="http://www.xml-ces.org/" ...>
        ...
        <chunkList xml:base="corpus/en/">
            <chunk xml:base="text1.xces"
                   xlink:href="#xpointer(id('p1s1')/range-to(id('p1s8')))">
                <tok xlink:href="#xpointer(id('p1s1w1')/range-to(id('p1s1w2')))"/>
                ...
            </chunk>
            <chunk xml:base="text2.xces"
                   xlink:href="#xpointer(id('p1s9')/range-to(id('p2s10')))">
                <tok xlink:href="#p1s9w1"/>
                ...
            </chunk>
            ...
        </chunkList>
        ...
    </cesAna>
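
How the nested xml:base values combine with each xlink:href can be sketched in Python using the standard library. This is a minimal illustration: the function name resolve_links and the trimmed sample document are assumptions, and the sample declares the xlink prefix explicitly so that it parses on its own.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Trimmed, hypothetical version of the example above.
SAMPLE = """<cesAna xmlns:xlink="http://www.w3.org/1999/xlink"
        xml:base="http://www.xml-ces.org/">
  <chunkList xml:base="corpus/en/">
    <chunk xml:base="text1.xces"
           xlink:href="#xpointer(id('p1s1')/range-to(id('p1s8')))">
      <tok xlink:href="#p1s1w1"/>
    </chunk>
  </chunkList>
</cesAna>"""


def resolve_links(element, base=""):
    """Yield (tag, absolute href) pairs, applying inherited xml:base values."""
    # Each xml:base is resolved relative to the base inherited from the parent.
    base = urljoin(base, element.get(XML_BASE, ""))
    href = element.get(XLINK_HREF)
    if href is not None:
        yield element.tag, urljoin(base, href)
    for child in element:
        yield from resolve_links(child, base)


for tag, target in resolve_links(ET.fromstring(SAMPLE)):
    print(tag, target)
```

Both links resolve against http://www.xml-ces.org/corpus/en/text1.xces, with the XPointer kept as the fragment identifier.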

Linking in cesAlign Documents

In the CES the <link> element uses the targets or xtargets attribute to specify a semi-colon delimited list of fragments being aligned. In the XCES the <link> element has been changed to an XLink extended link (xlink:type="extended") that contains a sequence of <align> elements (xlink:type="locator") used to identify the fragments being aligned.

Example:

  DOC1: <s id="p1s1">According to our survey, 1988 sales of
        mineral water and soft drinks were much higher than in 1987, reflecting
        the growing popularity of these products.</s> 
        <s id="p1s2">Cola drink manufacturers in particular achieved above- 
        average growth rates.</s>

 <!-- ... -->

  DOC2: <s id="p1s1">Quant aux eaux minérales et aux limonades, elles
        rencontrent toujours plus d'adeptes.</s>
        <s id="p1s2">En effet, notre sondage fait ressortir des ventes 
        nettement supérieures à celles de 1987, pour les boissons 
        à base de cola notamment.</s>

  CES ALIGN DOC:
        <linkGrp targType="s"> 
           <link xtargets="p1s1 ; p1s1">
           <link xtargets="p1s2 ; p1s2"> 
        </linkGrp> 

Becomes:

  XCES ALIGN DOC:
        <linkGrp targType="s">
           <link>
              <align xlink:href="#p1s1"/>
              <align xlink:href="#p1s1"/>
           </link>
           <link>
              <align xlink:href="#p1s2"/>
              <align xlink:href="#p1s2"/>
           </link>
        </linkGrp>
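
The conversion from a CES xtargets value to an XCES <link> can be sketched in Python. This is an illustration only: the function name xtargets_to_link is hypothetical, each target is assumed to be a single ID, and the xlink:type attributes are left out on the assumption that the schema supplies them, as in the example above.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def xtargets_to_link(xtargets):
    """Build a <link> element with one <align> child per semi-colon separated target."""
    link = ET.Element("link")
    for target in xtargets.split(";"):
        align = ET.SubElement(link, "align")
        # Each CES target ID becomes a fragment-only xlink:href.
        align.set(f"{{{XLINK_NS}}}href", "#" + target.strip())
    return link


link = xtargets_to_link("p1s1 ; p1s1")
print(ET.tostring(link, encoding="unicode"))
```

The order of the targets is preserved, which matters because the position of each <align> identifies the document it refers to.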

 

The order of the <align> elements within a <link> element is significant. Unless otherwise specified the order is assumed to match the ordering of <translation> elements in the header. If a different ordering is required the attribute n in the <translation> element and the attribute n in the <align> element can be used to explicitly link an <align> element with a specific translation. Many-to-one alignments and many-to-many alignments can be represented by providing a range for the XPointer expression. N-to-zero alignments can be indicated by omitting one or more of the <align> elements and using the n attribute to specify which translations any remaining <align> elements refer to. Alternatively, the href attribute can be set to #xces:undefined to indicate that there is no translation for that fragment in that language.

Example:

header.xml:
 
       <cesHeader version="2.3">
          ...
          <translations>
             <translation trans.loc="text-fr.xml" xml:lang="fr" wsd="ISO8859-1" n="1"/>
             <translation trans.loc="text-en.xml" xml:lang="en" wsd="ISO8859-1" n="2"/>
             <translation trans.loc="text-ro.xml" xml:lang="ro" wsd="ISO8859-1" n="3"/>
             <translation trans.loc="text-cz.xml" xml:lang="cz" wsd="ISO8859-1" n="4"/>
          </translations>
       </cesHeader>

align.xml:

    <cesAlign type="sent" version="1.6">
       <cesHeader xlink:href="header.xml"/>
       <linkList>
          <!-- sentence alignments -->
          <linkGrp domains="d1 d1 d1 d1" targType="s">
             <link>
                <!-- Same ordering as translation elements [fr,en,ro,cz] -->
                <align xlink:href="#s1"/>
                <align xlink:href="#s1"/>
                <align xlink:href="#s1"/>
                <align xlink:href="#s1"/>
             </link>
             <link>
                <!-- Reverse order [cz,ro,en,fr] -->
                <align n="4" xlink:href="#s2"/>
                <align n="3" xlink:href="#s2"/>
                <align n="2" xlink:href="#s2"/>
                <align n="1" xlink:href="#s2"/>
             </link>
             <link>
                <!-- No English translation [3ro,2cz,1fr]-->
                <align n="3" xlink:href="#xpointer(id('s3')/range-to(id('s5')))"/>
                <align n="4" xlink:href="#xpointer(id('s3')/range-to(id('s4')))"/>
                <align n="1" xlink:href="#s3"/>
             </link>
             <link>
                <!-- 3rd align is fr, the rest are taken in order of translation [1en,1ro,2fr,0cz] -->
                <align xlink:href="#s3"/>
                <align xlink:href="#s4"/>
                <align n="1" xlink:href="#xpointer(id('s4')/range-to(id('s5')))"/>
                <align xlink:href="#xces:undefined"/>
             </link>
             ...
          </linkGrp>
       </linkList>
    </cesAlign>
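
One reading of the ordering rules illustrated above can be sketched in Python. This is an interpretation, not a normative algorithm: an <align> with an n attribute is matched to the translation with the same n, the remaining <align> elements are matched in header order to the translations not explicitly claimed, and #xces:undefined is treated as "no translation". The function name assign_aligns and the tuple-based inputs are assumptions made for the sketch.

```python
UNDEFINED = "#xces:undefined"


def assign_aligns(translations, aligns):
    """Map each translation language to the href aligned with it, or None.

    translations: list of (n, lang) pairs in header order.
    aligns: list of (n or None, href) pairs in document order.
    """
    by_n = dict(translations)
    explicit = {n for n, _ in aligns if n is not None}
    # Translations not claimed by an explicit n, still in header order.
    unclaimed = [lang for n, lang in translations if n not in explicit]
    result = {lang: None for _, lang in translations}
    for n, href in aligns:
        if n is not None:
            lang = by_n[n]
        elif unclaimed:
            lang = unclaimed.pop(0)
        else:
            raise ValueError("more <align> elements than translations")
        result[lang] = None if href == UNDEFINED else href
    return result


TRANSLATIONS = [("1", "fr"), ("2", "en"), ("3", "ro"), ("4", "cz")]
FOURTH_LINK = [
    (None, "#s3"),
    (None, "#s4"),
    ("1", "#xpointer(id('s4')/range-to(id('s5')))"),
    (None, UNDEFINED),
]
print(assign_aligns(TRANSLATIONS, FOURTH_LINK))
```

Applied to the fourth <link> in align.xml, this assigns #s3 to English, #s4 to Romanian, the range to French, and nothing to Czech, which agrees with the comment in that example.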

Example: Linking With Stand-off Annotation

Frequently the data that is to be aligned or annotated is not marked up in a suitable format: for example, when sentence alignment is provided for target documents that are marked only to the paragraph level, or when annotation is stored separately to allow for multiple parallel annotations of the same phenomenon. The following provides a simple example of stand-off annotation.

The overall architecture of the documents is as follows:

 

These files are meant as examples only. The French translation was performed at http://babelfish.altavista.com. Some words have been purposely tokenized incorrectly (e.g. C'est is marked as one word so that it aligns with two English words).

Download zip archive with all example files


Acknowledgements

We would like to thank Altova GmbH and Altova Inc. for providing their XML Spy Suite software to be used in the development of the XCES, in the context of the American National Corpus project.

Questions/comments to Nancy Ide ide@cs.vassar.edu or Keith Suderman suderman@cs.vassar.edu