Representation of Linguistic Corpora
Department of Computer Science
Poughkeepsie, New York 12601 USA
tel : (+1) 914 437 5988
fax : (+1) 914 437 7498
e-mail : firstname.lastname@example.org
Laboratoire Parole et Langage
CNRS & Université de Provence
29, Avenue Robert Schuman, 13621 Aix-en-Provence Cedex 1, France
tel : (+33) 42 95 36 34
fax : (+33) 42 59 50 96
This project is intended to provide a theoretical background and
coherent methodologies for the representation, access, and manipulation of
corpora intended for use in corpus-based natural language processing (NLP)
research. The project builds on and continues a program of
collaborative research, established in 1988, between Vassar College's
Department of Computer Science and the Laboratoire Parole et Langage (LP&L) of the Centre National de la Recherche Scientifique
(CNRS) in Aix-en-Provence, France. The work is
carried out in the context of the European projects MULTEXT, MULTEXT-EAST, and EAGLES (in particular, the EAGLES Text Representation subgroup),
supported under the European Commission LRE program and coordinated by
LP&L. The work undertaken at Vassar College is supported
by a grant from the National Science Foundation (NSF RUI grant
The increasing interest in the use of large-scale textual resources for NLP
research has led to the rapid proliferation of both massive amounts of textual
data and text-handling tools. Much of the currently available data is
annotated using ad hoc formats, most of which are entirely inconsistent
with one another, and almost none of which has been developed on the basis of
a sound model of text and text categories or in view of any serious
consideration of the needs of corpus-based NLP research. Similarly, and for related reasons,
there is an enormous redundancy in the functionality of much existing
corpus-handling software (part-of-speech taggers, statistics-gathering tools,
etc.), because the same systems must be re-invented over and
over again to accommodate specific input and output formats and platforms.
Because such software is typically instantiated in large, unbreakable systems,
the ability to modify it and re-use relevant pieces in other applications is
severely limited. Again, the lack of a principled basis for text software design
is the cause of this redundancy and limited reusability.
Our goal is to develop a sound basis and methodology for corpus encoding, as
well as for the design of corpus-handling tools. There is an obvious
dependency between the two, which demands that they be developed hand-in-hand.
The task involves: (1) analysis of the needs of corpus-based NLP research,
both in terms of the kinds and degree of annotation required and the requirements
for efficient processing, accessibility, etc.; (2) analysis of general
properties and configuration of corpora, analysis of relevant structural and
logical features of component text types, and the design of encoding mechanisms
that can represent all required elements and features while accommodating the
requirements determined in (1); and (3) specifications for text software design,
coordinated with (2), with the aim of avoiding redundancy and maximizing
modifiability, extendability, and reusability.
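
As a minimal sketch of aim (3), and only under stated assumptions, the idea of small reusable tools that share one data representation might look as follows. The `Token` class, the `run_pipeline` function, and the toy tools are hypothetical names chosen for illustration; they are not part of the CES or of any existing corpus-handling software.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


# Hypothetical shared representation: every tool reads and writes Token
# objects, so no tool needs its own input or output format.
@dataclass
class Token:
    form: str
    annotations: Dict[str, str] = field(default_factory=dict)


# A tool is any function from a token list to a token list.
Tool = Callable[[List[Token]], List[Token]]


def tokenize(text: str) -> List[Token]:
    """Naive whitespace tokenizer (illustrative only)."""
    return [Token(form=word) for word in text.split()]


def tag_capitalization(tokens: List[Token]) -> List[Token]:
    """Toy annotator standing in for a real tagger."""
    for token in tokens:
        is_upper = token.form[:1].isupper()
        token.annotations["case"] = "upper" if is_upper else "lower"
    return tokens


def count_forms(tokens: Iterable[Token]) -> Counter:
    """Toy statistics-gathering step over the same representation."""
    return Counter(token.form.lower() for token in tokens)


def run_pipeline(tokens: List[Token], tools: List[Tool]) -> List[Token]:
    """Apply each tool in order; tools can be added, removed, or swapped."""
    for tool in tools:
        tokens = tool(tokens)
    return tokens


tokens = run_pipeline(tokenize("The corpus and the tools"), [tag_capitalization])
counts = count_forms(tokens)
```

Because each step depends only on the shared representation, a different annotator can replace `tag_capitalization` without changing the tokenizer or the counting step.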
Currently, the focus of work within the project is the development of a Corpus Encoding Standard (CES) optimally suited for use in language
engineering, intended to serve as a widely accepted set of encoding standards for
corpus-based work in natural language processing applications. The CES is an application of SGML (ISO 8879:1986, Information Processing--Text
and Office Systems--Standard Generalized Markup
Language) compliant with the specifications of the TEI Guidelines
for Electronic Text Encoding and Interchange of the
Text Encoding Initiative. The CES specifies a minimal
encoding level that corpora must achieve to be considered standardized in terms
of descriptive representation (marking of structural and typographic
information) as well as general architecture (so as to be maximally suited for
use in a text database). It also provides
encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora.
The CES is being developed in a bottom-up fashion, starting with minimal specifications and expanding on the basis of feedback from its use and input from the research community in general. Comments and discussion on any aspect of the CES are invited and encouraged. The most recent draft of the CES is available at
The document is also available by ftp as a tar file.
Bourbeau, L., Pinard, F. (1995).
Normalisation et internationalisation:
Inventaire et prospective des normes clefs pour le traitement informatique du
français. Progiciels BPI. Montréal.
Bryan, M. (1988). SGML: An Author's Guide, Addison-Wesley Publishing
Company, New York.
Burnard, L. (1995).
Encoding for Information Interchange-- An Introduction to the Text Encoding
Initiative, TEI Document no TEI J31, Oxford University Computing
Coombs, J.H., Renear, A.H., and DeRose, S.J. (1987). Markup systems and the
future of scholarly text processing. Communications of the ACM, 30, 11, 933-
Cover, R. (1994).
SGML Web Page.
DeRose, S.J., Durand, D.G. (1994). Making HyperMedia Work: A User's Guide
to HyTime. Kluwer Academic Publishers,
Goldfarb, C.F. (1990).
The SGML Handbook, Clarendon Press, Oxford.
Ide, N. et al.
Version of December 1995.
Ide, N. Encoding standards for large text resources.
Proceedings of the 15th
International Conference on Computational Linguistics, COLING'94, Kyoto,
Japan (1994), 574-78.
Ide, N., Véronis, J.
MULTEXT: Multilingual Text Tools and Corpora.
Proceedings of the 15th International Conference on Computational
Linguistics, COLING'94, Kyoto, Japan, (1994) 588-92.
Ide, N., Véronis, J. What next after the Text Encoding Initiative? The
need for text software. ACH Newsletter, Winter (1993), 1-3.
Ide, N., Véronis, J. (Eds.) (1995a).
The Text Encoding Initiative:
Background and Context. Kluwer Academic
342p. [reprinted from triple special issue of Computers and the
Humanities, 29, no 1/2/3, with an original bibliography]
ISO 8879 (1986).
Information Processing--Text and Office Systems--Standard
Generalized Markup Language (SGML), ISO, Geneva.
ISO/IEC DIS 10744 (1992).
Hypermedia/Time-based Document Structuring Language
(HyTime), ISO, Geneva.
Kimber, W. Eliot (1995).
Practical Hypermedia: An Introduction to HyTime. Charles F. Goldfarb Series
On Open Information Management. Prentice-Hall Professional Technical Reference,
New York; approximately 250 pages. ISBN: 0-13-309899-0.
Newcomb, Steven R., Kipp, Neill A., Newcomb, Victoria T. (1991).
Hypermedia/Time-based Document Structuring Language.
Communications of the Association for
Computing Machinery 34/11 (November 1991) 67-83.
Sperberg-McQueen, C.M., Burnard, L. (Eds.) (1994).
Guidelines for Electronic Text Encoding and Interchange, Text Encoding Initiative,
Chicago and Oxford.
van Herwijnen, E. (1991). Practical SGML,
Kluwer Academic Publishers, Boston
[2nd edition, 1994].