XMELLT

Cross-lingual Multi-word Expression Lexicons for Language Technology

Multilingual Information Access and Management

International Research Co-operation

Department of Computer Science
Vassar College

International Computer Science Institute
University of California, Berkeley

Department of Computer Science
New York University

Computing Research Laboratory
New Mexico State University

Contents

1 Background and overview

2 Project description

3 Methodology and work plan

3.1 Project management
3.2 Survey of existing resources
3.3 Development of pilot lexicons
3.4 Development of preliminary specifications for representing information about multi-word expressions
3.5 Development of preliminary specifications for the structuring and encoding of multi-lingual, multi-word expression lexicons
3.6 Exploration of techniques for automatic acquisition

4 Work plan

5 Summary

6 References

1 Background and overview

The importance of the lexicon for natural language applications is beyond question: literally hundreds of lexicons have been constructed to support such applications as information retrieval, summarization, extraction, and machine translation. As a result, there has been extensive work over the last several years to provide to the natural language processing (NLP) community:

lexical syntactic resources for natural language processing, such as COMLEX Syntax, EDR, and PAROLE, which offer detailed information on the syntactic properties of individual lexical items
lexical semantic resources for natural language processing, such as WordNet, EuroWordNet, SIMPLE, and EDR, including some cross-lingual information

In addition, there have been some efforts to integrate lexical syntactic and semantic information, in part by capturing argument correspondences between different constructs, and in part by capturing correlations between word senses and argument structure. Among such efforts are:

the Explanatory Combinatorial Dictionary of Mel'cuk
the NOMLEX dictionary of Macleod, which captures relations between nominalizations and verbal constructs in English
the catalog of verbal argument alternation for English by Levin
the database of collocations derived from the Robert/Collins EN/FR dictionary and enhanced with lexical functions by Fontenelle
the PAROLE/SIMPLE lexicons, where semantic subcategorisation is encoded, with linking of semantic and syntactic arguments
the Berkeley FrameNet database, which is providing linked semantic-syntactic valences for a large sample of English words and which is beginning to develop a sample bilingual (German and English) version
the Mikrokosmos English, Spanish and Chinese lexicons, which include subcategorization and semantic structures

Most lexicons are devoted to a single language and/or a particular approach dictated by the needs of particular systems, and may incorporate, for example, primarily syntactic information with minimal semantic information. Standardization efforts, notably the EAGLES standards for morpho-syntactic information and lexical semantics, have been established to remedy this situation by aiming at reusability of resources, and have made considerable headway in defining standards that can serve the natural language processing community. However, we are still far from the goal of developing universally available, standard lexicons that are not only reusable, but which meet the critical demand recognized by this joint EU/NSF call: to provide an infrastructure that will support truly multi-lingual, internationally accessible natural language processing applications.

This project is intended to address what we see as the next critical step in the development of resources to support pivotal NLP applications. Efforts supported by the EU, the NSF, and other national and international agencies are underway to develop reusable and widely available multi-lingual lexicons for individual words. This speaks to the recognition by the research community and funders alike that such resources are not only vital to support both mono- and multi-lingual NLP applications, but also that the state of the art has advanced to a point where their development is feasible. This project takes this effort to its next logical step, complementing and expanding on on-going efforts to tackle, in a systematic and coordinated way, the next critical need in lexicon development. The project we propose is focussed, and it involves, as it must, the international research and development community. As such it is directly in line with the aim of the programs on both sides of the Atlantic: to make appropriate headway in developments to support the international information infrastructure.

Although several existing lexicons include multi-word phrases whose meaning cannot be accounted for as a regular composition of the meanings of their constituents, no concentrated effort has been directed at developing flexible, multi-lingual lexicons for multi-word expressions that will be usable for a variety of natural language processing applications. The multi-word lexicons of many successful older machine translation systems such as SYSTRAN are in fact little more than Translation Memories: they do not provide any interface among single-word entries, multi-word entries, and syntax and semantics. However, as is well known in the translation community, cross-lingual correspondences cannot be represented as pairings of individual lexical items in the vast majority of instances: for instance, certain semantic features (such as temporal features) may be expressed inflectionally in one language but phrasally (e.g., as nominalizations + support verbs) in another.

Multi-word constructions are extremely frequent in language, comprising perhaps 30% of the lexical stock; consider, for example, the frequency of verbs with separate particles in English. Given their importance for NLP, we believe that work on multi-lingual, multi-word lexicons is seriously needed in order to lay the groundwork for the next generation of multi-lingual lexical resources. The time is ripe for this effort: the NLP community on both sides of the Atlantic has explored in depth the means to encode syntactic information, on the one hand, and semantic information, on the other. Standards such as those developed in EAGLES have involved extensive consideration of the demands of accommodating different languages and multiple languages. And, as various NLP applications have been developed, we have begun to understand in depth the processing needs that will support them. Intellectual and practical advances in the past several years put us in a position to deal with the more complex issues of multi-word expressions and, in particular, their cross-lingual correspondences.

Recognizing this, we propose a planning project that will investigate the potential to develop multi-lingual, multi-word expression lexicons incorporating both syntactic and semantic information. The specific aim of the project is to define the dimensions of a core international infrastructure that can support the creation of such a lexicon incorporating both morpho-syntactic and semantic information, and which will in turn provide a base for building pivotal natural language applications aiming at management of, and universal access to, the vast quantity of information that is becoming available each day via the World Wide Web. In particular, our aims are:

To establish uniform (or inter-translatable) standards for describing multi-word lexical entries at the levels of syntax, morpho-syntax, and lexical semantics;

To identify realistic objectives for the representation of phrasal expressions in a multi-lingual lexicon, in terms of size, scope, existing usable input, additional input and links to other lexical resources, and information types still to be acquired;

To determine the type and dimensions of the information that will best serve the needs of critical NLP applications such as information extraction and retrieval, summarization, and translation;

To specify an overall architecture for a joint software and lingware development project, and explore the possibilities for identifying lexicon structure and a representation formalism for the lexicon that will best suit the demands of this architecture;

To specify the outline of a collaborative project to acquire and represent multi-word lexical entries for multiple languages, in terms of workpackages, infrastructure, prospective partners, data providers, providers of requirements, etc.; as well as cost, timeframes, cost/benefit relationships, etc.

To explore the feasibility and dimensions of the eventual project by creating a small number of multi-word entries for support verbs, including syntactic and morpho-syntactic information;

To explore the possibilities of recognizing and acquiring a repertory of multi-word lexical units from corpora by means of partial parsing, statistics, etc.

The project brings together a group of partners whom we believe to be the central figures in the field, all with extensive experience in the areas of lexicon development, representation, and use:

USA:

International Computer Science Institute: developers of FrameNet, a project for discovering and supporting valence descriptions of English major-class words
New Mexico State University: leader in development of multi-lingual lexicons and applications
New York University: developers of the COMLEX and NOMLEX syntactic lexicons
Vassar College: leader in the development of standards for representing lexical/linguistic information, participant in development of multi-lingual lexicons for western and eastern European languages.

Europe:

Istituto di Linguistica Computazionale, CNR, Pisa: longtime leader in the development of lexicons and standards for their representation
Stuttgart: developer of large-scale computational lexicons and the lead partner in the VerbMobil sub-project on transfer, involving bilingual lexicons for German/English, German/Japanese, and English/Japanese translation
LexiQuest (industrial partner): developer of multi-lingual natural language processing applications, including summarization, retrieval, etc.

The project will build on their combined expertise and the resources they have developed to achieve the goals specified above. The industrial partner will serve as an active observer, providing input on how the results can be used in multi-lingual applications as well as input concerning scalability and maintainability of such a resource in an industrial context.

The exploration of the possibility of developing multi-lingual, multi-word expression lexicons will serve as a basis upon which such resources can be developed. Multi-word expressions are crucially needed for a wide range of NLP tasks and applications, including natural language analysis and generation, machine translation, information retrieval and extraction, and word sense disambiguation, to mention only a few. For all tasks requiring text understanding, even partial understanding, identification and processing of multi-word expressions would make the whole analysis process easier and more accurate if carried out as a first step, before syntactic and semantic analysis takes place. The need is even more crucial for text generation: accurate text generation cannot be performed without a multi-word lexicon (see, for instance, the case of the generation of support verb constructions). This area of lexicon development is even more critical for supporting multi-lingual NLP applications.

2 Project description

The importance and role of multi-word expressions in the description and processing of natural language has long been recognized. Although large computational lexicons containing both syntactic and semantic information have begun to appear, they lack information about multi-word expressions. So far, multi-word information has typically been relegated to the marginal role of idiosyncratic lexical information, or has been addressed in terms of specific types of word combinations only (see, for instance, the NOMLEX dictionary, which is focussed on nouns, or FRAMENET, which treats only verbs).

No systematic effort has been made to accommodate multi-word expressions within a comprehensive model covering their wide and complex typology. In fact, under the generic umbrella of multi-word expressions there lies a variety of semi-pre-constructed phrases where the combination of words is — more or less — tightly bound, referred to in the literature as ‘idioms’, ‘compounds’, ‘collocations’, ‘word co-occurrence patterns’, etc. This project aims at taking a broader view of multi-word expressions and proposes an innovative lexical encoding model intended to accommodate the full typology of multi-word expressions.

Care will be taken in encoding the range of variation admitted in the surface realization of multi-word expressions, as well as the semantic transparency or opaqueness of each expression. In particular, the following aspects will be carefully considered:

possibility of internal variation, in terms of: i) insertion, deletion, replacement of modifiers; ii) type or deletion of determiners; iii) possible replacement of the words in the expression (e.g. a grandi tratti/linee ‘generally’);
subcategorisation properties of the expression as a whole;
presence of idiosyncratic constraints on the inflection of the component words;
meaning (non-)compositionality.
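
As an illustration only, the four aspects listed above could be recorded in a lexical entry along the following lines. This is a minimal sketch and not the encoding model the project will define: the class names `Slot` and `MWEEntry`, all field names, and the `surface_forms` helper are hypothetical, and the example entry simply reuses the Italian expression a grandi tratti/linee cited above.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List


@dataclass
class Slot:
    """One word position in a multi-word expression (hypothetical structure)."""
    alternatives: List[str]        # admissible words, e.g. ["tratti", "linee"]
    optional: bool = False         # True if the word may be deleted (e.g. a determiner)
    inflection: Dict[str, str] = field(default_factory=dict)  # idiosyncratic constraints


@dataclass
class MWEEntry:
    """Hypothetical entry covering the four aspects listed above."""
    language: str
    slots: List[Slot]
    gloss: str
    allows_modifier_insertion: bool = False                 # internal variation
    subcat_frame: List[str] = field(default_factory=list)   # expression as a whole
    compositional: bool = False                             # meaning (non-)compositionality

    def surface_forms(self) -> List[str]:
        """Enumerate the surface variants licensed by the slots."""
        choices = []
        for slot in self.slots:
            options = list(slot.alternatives)
            if slot.optional:
                options.append("")  # the empty string stands for deletion
            choices.append(options)
        return [" ".join(w for w in combo if w) for combo in product(*choices)]


entry = MWEEntry(
    language="IT",
    slots=[
        Slot(["a"]),
        Slot(["grandi"], inflection={"number": "plural"}),
        Slot(["tratti", "linee"], inflection={"number": "plural"}),
    ],
    gloss="generally",
)
print(entry.surface_forms())
```

Keeping lexical replacement as a list of alternatives per position, rather than as separate entries, is one possible design choice; it keeps the variants of a single expression together, at the cost of over-generating when alternatives in different positions are not freely combinable.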

The proposed encoding model for multi-word expressions will be compatible and integrated with other lexical encoding models for standard lexicons; e.g. the resulting multi-word expressions model should be easily accommodated within standard lexical encoding models such as EAGLES, which already make provision for the encoding of word co-occurrence information, or within lexicons like PAROLE, SIMPLE, or NOMLEX which represent de facto standards of lexical encoding.

The development of individual multi-word expression lexicons covering individual languages is not the only innovative area of the project. Another crucial aspect is the identification of the parameters for generalizing language-specific experience to multi-lingual applications. In particular, the focus will be on the linking of lexicons of individual languages, involving both individual words and multi-word expressions, or different types of multi-word expressions (compare the English noun-noun pattern with the noun + prepositional phrase pattern of Romance languages).

Our work in this project will lay the ground for a large-scale project to develop this increasingly needed resource. The need for multi-lingual multi-word expression lexicons makes the demand for automatic methods for acquisition of multi-word expressions particularly acute. Since it is widely acknowledged that current printed dictionaries do not contain information about multi-word expressions in a coherent and exhaustive way, acquisition from textual corpora is essential. We will start from a survey of different methods and technologies for lexical acquisition from unrestricted texts (including statistical and rule-based approaches), and, on this basis, a robust and unified approach to the acquisition of the complex typology of multi-word information will be sketched. This sketch will serve as the foundation for acquisition work within the large-scale project which should follow from this preparatory action.

We will undertake the following activities:

1. Assessment of currently existing lexical resources for multi-word expressions, and standards for their representation. This work will be based on, among others, input from the following sources:

the Explanatory Combinatorial Dictionaries (Mel'cuk et al.);
the NOMLEX dictionary for English (Macleod);
the database of collocations derived from the Robert/Collins EN/FR dictionary and enhanced with lexical functions based on ECD categories (Fontenelle);
the database of noun-verb collocation candidates of German extracted from newspaper text (Heid);
the lexicons of syntactic subcategorization of nouns and verbs at IMS Stuttgart (Eckle-Kohler) and at PH Erfurt (Boas: nominalizations of verbs and their prepositional complements, including a broad semantic classification);
the Mikrokosmos English, Spanish and Chinese lexicons that include detailed subcategorization information linked to semantic structures represented as under-specified Text-Meaning Representation expressions (Viegas, Nirenburg);
the database of valence patterns being created through the Berkeley FRAMENET project (Fillmore).

2. Outline of a strategy to harmonize and build upon existing resources, in order to merge syntactic and semantic information in such a way as to be maximally usable for multi-lingual NLP applications.
3. Outline of the fundamentals for representing information about multi-word expressions to make it maximally flexible and reusable. This involves two aspects:
Specification of the kinds of linguistic information required in the resources;
Specification of a data architecture and encoding formats that will enable harmonization of the resources and, most importantly, appropriate linkage among multi-lingual entries, with the aim of maximizing the efficiency of processing and retrieval.
4. Identification of key NLP technologies and their precise needs for coping with phrasal lexical entries.
5. Identification of the relevant partners to be involved in such an effort, representing a range of expertise and languages (including languages from a variety of language families).

To achieve items 2 and 3, we propose to explore the potential by accomplishing the following:

adding lexical information on support verb constructions to 50 nouns drawn from the NOMLEX and PAROLE/SIMPLE lexicons, for EN, DE, IT, and FR. This will allow us to identify the relationships between noun readings selected by support verbs and noun readings identified independently, on semantic grounds. If both classifications correlate, semi-automatic collocation extraction and classification could be used as an efficient tool to prepare semantic tagging. The experience gained from this exercise, coupled with input from application developers, will feed into a preliminary specification for the design and encoding of multi-lingual, multi-word expression lexicons.

creating lexical entries for 50 N-N constructs in EN, drawn from the PAROLE/SIMPLE lexicons, and the corresponding constructs in IT, FR, and DE. For FR and IT, the correspondence with the EN constructs is not direct; typically, the corresponding items in IT and FR involve N-PP expressions instead. For DE, there is often a single lexical item corresponding to EN N-N constructs; however, in most cases these are compounds requiring sub-analysis. By attempting not only to represent the monolingual information but also to establish appropriate linkage among entries in the four project languages, we will have the opportunity to consider the requirements for this sort of more complex cross-lingual mapping.
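
To make the cross-lingual mapping problem concrete, the following minimal sketch links one EN N-N construct to N-PP expressions in IT and FR and to a DE compound with its sub-analysis. Everything here is an assumption made for illustration: the `Construct` class, the `links` table, the `correspondences` function, and the example "coffee cup" (which is not drawn from the PAROLE/SIMPLE lexicons) are hypothetical and do not anticipate the specifications the project will produce.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Construct:
    """A construct in one language (hypothetical structure)."""
    language: str
    form: str
    pattern: str              # "N-N", "N-PP" or "compound"
    parts: Tuple[str, ...]    # component words, or the sub-analysis of a compound


# One illustrative link record: constructs grouped under a shared identifier.
links: Dict[str, List[Construct]] = {
    "coffee_cup": [
        Construct("EN", "coffee cup", "N-N", ("coffee", "cup")),
        Construct("IT", "tazza da caffè", "N-PP", ("tazza", "da", "caffè")),
        Construct("FR", "tasse à café", "N-PP", ("tasse", "à", "café")),
        Construct("DE", "Kaffeetasse", "compound", ("Kaffee", "Tasse")),
    ],
}


def correspondences(form: str, source: str, target: str) -> List[Construct]:
    """Return the target-language constructs linked to a source-language form."""
    results: List[Construct] = []
    for group in links.values():
        if any(c.language == source and c.form == form for c in group):
            results.extend(c for c in group if c.language == target)
    return results


for lang in ("IT", "FR", "DE"):
    for c in correspondences("coffee cup", "EN", lang):
        print(lang, c.form, c.pattern, c.parts)
```

Grouping entries under a shared identifier, rather than storing pairwise translation links, is only one option; it keeps the number of links linear in the number of languages, but it assumes that the linked constructs are fully equivalent, which will not hold in general.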

The remainder of the items will be accomplished by a coordination effort, which will seek to identify and contact major players in the development of multi-word expression lexicons and applications that use them or may potentially use them. A sketch of the dimensions of the project, including the scope and nature of the work involved, projection of cost, timeframes, cost/benefit relationships, etc., will be the major result of the planning project.

The overall plan of work will be one of incremental refinement. An initial survey of existing resources will be used as the basis for beginning the creation of the sample lexicon entries themselves. A first meeting of the consortium members in month 2 of the project will assess existing resources and lay the ground to initiate the creation of the sample entries. As this effort goes on, constant electronic contact among partners will ensure that problems are discussed as a group and solutions and further developments are carried out uniformly among the creators of the lexicon entries. In month 10 of the project, another consortium meeting will be held in which the progress will be assessed and the precise specifications for representing the entries will be finalized.

During the course of the year, the principal partners (Pisa and Vassar) will be in contact with potential partners on both sides of the Atlantic for a larger effort, both in academia and industry. Near the end of the project, a workshop involving not only current partners, but also potential partners for the larger project, will be held to lay the ground for the final report.

Although handling additional languages is outside the scope of this planning project, prior work of the participants on lexicons for other languages, especially non-European languages such as Japanese, Chinese, and Persian, will be brought into consideration as the work in this planning project develops. In this way, we will develop all specifications with an eye toward expansion to additional languages.

3 Methodology and work plan

The project is envisaged for one year, with the overall aim of laying the ground for development of a large-scale, multi-lingual lexicon of multi-word expressions. Work will be distributed among the seven partners (4 US partners, 3 EU partners) as described in the following sections. Note that this proposal includes budget and other information for the US partners only; budget and other relevant information for the EU partners, which has already been submitted to the European Commission, is provided in the EU version of the proposal. The EU proposal exactly as submitted to the European Commission is appended to this proposal.

The project will comprise the following activities:

3.1 Project management

Project management will involve the following major activities:

Coordination of work and input of the various partners, by establishing and maintaining flow of communication, disseminating information among partners, etc.
Organization of two meetings among partners in the project, one at the two-month mark and one at the ten-month mark, aimed at coordinating the preparation of the lexicon entries and harmonizing them for the final specifications
Identification and contact with potential partners
Organization of a workshop among current and potential partners, near the end of the project
Preparation of the final project report, which, drawing on results of the work within the project, provides a blueprint of the large-scale project under preparation and identifies the potential participants

Because of the international nature of this project, two managing partners have been identified, one in Europe, and one in the U.S.

3.2 Survey of existing resources

A preliminary step in achieving the overall aim of the project (the development of specifications for multi-word expression lexicons) is to survey and examine existing resources. A compendium of these resources will be developed, providing specific characteristics of each, including type of information, encoding structure, availability, etc. The previous work on surveying syntactic and semantic lexicons and their content compiled by various groups within EAGLES will serve as a basis for this effort.

We envision a meeting of the entire consortium early in the project, in which partners will assess this information and lay the ground for development of specifications, using the report as a basis.

The work undertaken here will serve as a basis for the creation of multi-word expression lexicon entries, and will feed into the development of specifications for information about multi-word expressions (see 3.4, below).

3.3 Development of pilot lexicons

This entails, first, the identification of 50 noun entries from the NOMLEX lexicon and the PAROLE/SIMPLE lexicons, and the identification of 50 parallel entries for the same words in existing lexicons in DE, IT, and FR. Support verb entries for the 50 nouns across the four project languages will then be created.

In parallel, 50 N-N constructs in EN will be identified in the PAROLE/SIMPLE lexicons and corresponding constructs for IT, DE, and FR will be determined. Entries for the constructs in all four languages will be created.

Responsibility for creating the entries for each language is assigned to the following partners:

Italian: Pisa
German: Stuttgart
French: LexiQuest, New Mexico State University
English: New York University

The work will be undertaken in a step-wise fashion, with feedback and interaction among the involved partners at frequent intervals to ensure compatibility.

This work will feed into the development of specifications for information about multi-word expressions.

3.4 Development of preliminary specifications for representing information about multi-word expressions

The development of specifications for multi-word expression lexicons is obviously closely connected to the previous two activities, both of which will feed into this effort.

This work involves the specification of the kinds of linguistic information required in the resources, including

Syntactic sub-categorization of the verbal and the nominal part of the multi-word expressions, as well as of the expression as a whole (considered as a complex predicate);
Morpho-syntax of the nominal group (determination, possibilities of having qualificative adjective phrases and/or relative clauses, etc.); this information seems to correlate with semantic information about referential availability of the nominal component of the multi-word expression;
Semantic relationships between collocations and non-collocational quasi-synonymous expressions, including, among others, a description of aspect, causativity, etc.;
Syntactic relationships between collocations and non-collocational quasi-synonyms (esp. verbs or adjectives): argument structure, linking, inheritance of arguments from verbs to nominalizations in collocational use;
Other types of lexical semantic and syntactic information;
Additional information needed to distinguish closely related (single-word and multi-word) expressions intra-lingually and to establish simple and efficient links between objects from different languages, with a view to translation and multilingual IR.

The overall plan of work in this workpackage is one of step-wise refinement, beginning with building on the survey of existing resources (3.2), and incorporating input and feedback from the ongoing development of the pilot lexicons (3.3). We will similarly incorporate input from the industrial partner concerning application needs.

A second all-consortium meeting will take place at the ten-month mark, prior to the finalization of the specifications, to enable the partners to harmonize entries in the four languages and finalize the specifications.

3.5 Development of preliminary specifications for the structuring and encoding of multi-lingual, multi-word expression lexicons

The development of representation schemes for multi-lingual, multi-word expression lexicons has two components:

Specification of the linguistic information required in the resources,
Specification of a harmonized data architecture and encoding formats

The first of these is undertaken in the work described in 3.4, above. The two tasks are intimately related, since the demands of what is to be represented and how to represent it bear upon one another. The developers of the sample multi-word expression lexicon entries must examine the information structure of the entries and the lexicons, and it must then be worked out how they can be mapped to an intelligently designed encoding format. This involves, for example, reconciling differences of information structure and preferred mappings, as well as consideration of the formatting and processing needs of applications that will use the information in the lexicons.

All formats will be developed to be harmonized with EAGLES specifications.

Like the work described in 3.4, the overall plan of work is one of step-wise refinement, incrementally incorporating input and feedback from the work ongoing in 3.4, together with input from the industrial partner, concerning application needs. It will also rely in a broader way on the results of the survey of existing resources, which can provide a starting point for development of a common scheme. In support of this, we envision one or two meetings of representatives of Vassar and NMSU, the principal partners undertaking this task, with the industrial partner, in order to familiarize the developers of the encoding formats with their applications and determine specific application needs.

3.6 Exploration of techniques for automatic acquisition

The object of this task is to accomplish the following:

Survey of current approaches and techniques to automatic acquisition from textual corpora of multi-word expressions, ranging from idiomatic expressions to support verb constructions and collocational patterns;

Design of a complex architecture for the acquisition and semi-automatic classification of the wide typology of multi-word expressions to be implemented and developed within the large-scale project which should follow from this preparatory action.

This task is divided into two six-month phases spanning the duration of the project:

Phase 1 (months 1-6) will be concerned with a survey of the following:
acquisition techniques and typology of multi-word expressions (this survey of acquisition techniques should parallel the typology of multi-word expressions delineated in 3.4); e.g.,

noun/verb-collocations (i.e. subject/verb- and verb/complement-patterns) are typically acquired via partial parsing (either by means of regular grammars, or through the use of stochastic parsing), on top of which an acquisition module operates; this is the approach followed in the EC SPARKLE project for different languages (Italian, English and German), with different versions of shallow parsing and subsequent filtering (statistical or analogy-based);
noun-noun compounds are generally acquired by means of statistical techniques based on the notion that the joint probability of occurrence of two or more words in a collocation is higher than the product of their independent probabilities.

acquisition of specific grammatical (i.e. morphological, syntactic, semantic) properties of identified multi-word expressions, also with respect to different domains: e.g.

internal variability;
subcategorisation and/or selectional preferences of the expressions as a whole;
particular constraints on the constituting words (e.g. number, gender).

Phase 2 (months 7-12) will involve the design of a complex architecture for the acquisition of multi-word expressions, taking into account
the typology of multi-word expressions to be acquired;
the different grammatical properties of multi-word expressions in different languages (namely, Italian, French, German and English);
a preliminary semi-automatic classification of acquired expressions;
the different linguistic resources and tools available at the different sites for the shallow parsing and acquisition tasks.
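
The statistical criterion cited under Phase 1 for noun-noun compounds (joint probability of co-occurrence higher than the product of the independent probabilities) is equivalent to requiring a positive pointwise mutual information score. The following is a minimal sketch of that criterion over adjacent word pairs, not a description of any technique the project has adopted. The function names `pmi_scores` and `candidates`, the `min_count` and `threshold` parameters, and the toy token sequence are all assumptions introduced for illustration; a realistic system would also restrict the pairs to noun-noun sequences using part-of-speech information, which this sketch omits.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def pmi_scores(tokens: List[str]) -> Dict[Pair, float]:
    """Pointwise mutual information, log2(P(x,y) / (P(x) * P(y))), per adjacent pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores: Dict[Pair, float] = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def candidates(tokens: List[str], min_count: int = 2, threshold: float = 0.0) -> List[Pair]:
    """Pairs whose joint probability exceeds the product of the independent ones.

    A score above 0 means P(x,y) > P(x) * P(y). The min_count filter is an
    added assumption: very rare pairs receive unreliably high scores.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = pmi_scores(tokens)
    kept = [p for p, s in scores.items() if s > threshold and bigrams[p] >= min_count]
    return sorted(kept, key=lambda p: scores[p], reverse=True)


# Toy token sequence, invented for this example.
tokens = "the coffee cup and the tea cup sat near the coffee cup on the table".split()
for pair in candidates(tokens):
    print(pair, round(pmi_scores(tokens)[pair], 3))
```

On a corpus this small nearly every attested pair scores above zero, including determiner-noun pairs, which is why frequency thresholds and the shallow parsing or filtering steps discussed above would be needed in practice.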

4 Work plan

START | END | PARTNER(S) | ACTIVITY | RESULT
0 | 12 | Vassar (US), Pisa (EU) | Management | Final report summarizing all work in the project, and overviewing the large-scale project
0 | 2 | Pisa, ICSI, NYU, NMSU, Stuttgart | Survey of existing resources | Report
2 | 12 | NYU, ICSI, NMSU, Pisa, Stuttgart | Development of pilot multi-word expression lexicon entries | 4 pilot lexicons
2 | 12 | Pisa, LexiQuest, NYU, ICSI, NMSU, Stuttgart, Vassar | Development of content specifications for multi-word expression lexicons | Report
2 | 12 | Vassar, NMSU | Development of representation schemes for multi-lingual, multi-word expression lexicons | Report
0 | 6 | Pisa, NMSU, Stuttgart, Vassar | Exploration of techniques for automatic acquisition: Phase 1 | Report
6 | 12 | Pisa, NMSU, Stuttgart, Vassar | Exploration of techniques for automatic acquisition: Phase 2 | Report

5 Summary

This project is intended to lay the ground for development of a large-scale, multi-lingual, multi-word expression lexicon to support natural language processing applications. It is the stated aim of this bilateral call to "further the knowledge required to build information systems that operate in multiple languages" and to "accelerate the development of new applications required by citizens and businesses in the global information society and enable their uptake in various contexts". Clearly, the development of NLP technologies is a key element in achieving these goals. It has been explicitly stated in assessment workshops involving the international research community that this development demands immediate attention to the creation and availability of large-scale, multi-lingual resources to support them. Our proposed project is intended to directly address this need. The project also addresses another aim of this call: to develop standards for the encoding of multilingual language knowledge. All of our work will be undertaken with full regard for existing language engineering standards such as those of EAGLES and is intended to ultimately both complement and contribute to them.

We fully recognize the massive effort required to develop resources of the scale and scope we eventually hope to see, and we appreciate that undertakings of this size should not begin until the groundwork has been carefully laid. Although individual projects have provided a basis for developing a large-scale, multi-lingual, multi-word expression lexicon, exploratory work is still needed to establish the exact methodology for representing and merging multi-lingual information of this kind. To this end, we have gathered a group of partners whom we believe to be central figures in the field, all with extensive experience in lexicon development, representation, and use, and best qualified to determine the shape of an eventual large-scale effort.

References

Erjavec, Tomaz, Nancy Ide, Vladimir Petkevic, and Jean Véronis (1996). Multext-East: Multilingual Text, Tools and Corpora for Central and Eastern European Languages. Proceedings of the First TELRI European Seminar, 87-98.

Ide, Nancy (1998). Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora. Proceedings of the First International Language Resources and Evaluation Conference, Granada, Spain, 463-70. Full documentation available at <http://www.cs.vassar.edu/CES/>.

Ide, Nancy (1998). Encoding Linguistic Corpora. Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, 9-17.

Ide, Nancy and Jean Véronis (1993). Modelling lexical information. In Hockey, Susan and Nancy Ide (Eds.). Research in Humanities Computing 4. Oxford University Press, 193-206.

Ide, Nancy and Jean Véronis (1994). MULTEXT: Multilingual Text Tools and Corpora. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, 588-92.

Ide, Nancy and Jean Véronis (1995). Encoding dictionaries. In Ide, Nancy and Jean Véronis (Eds.). The Text Encoding Initiative: Background and Context. Dordrecht: Kluwer Academic Publishers, 167-80.

Ide, Nancy and Greg Priest-Dorman (1996). Corpus Encoding Standard. EAGLES Technical Report, available from Istituto di Linguistica Computazionale, Pisa, Italy, 100p. Also available at <http://www.cs.vassar.edu/CES/>.

Ide, Nancy, Jacques Le Maître and Jean Véronis (1995). Outline of a Model for Lexical Databases. Current Issues in Computational Linguistics: In Honour of Don Walker. Linguistica Computazionale IX, X (Pisa, 1995), 283-320. [Reprinted from Information Processing and Management, 29, 2, 159-186.]

Le Maître, Jacques, Nancy Ide and Jean Véronis (1994). Modélisation et interrogation de bases de données lexicales. Ingénierie des systèmes d'information, 2, 1, 57-82.

Tufis, Dan, Nancy Ide and Tomaz Erjavec (1998). Standardized Specifications, Development and Assessment of Large Morpho-syntactic Resources for Six Central and Eastern European Languages. Proceedings of the First International Language Resources and Evaluation Conference, Granada, Spain, 233-40.

Véronis, Jean and Nancy Ide (1996a). Considerations for Linguistic Software Reusability. Available at <http://www.lpl.univ-aix.fr/projects/multext/LSD/LSD1.html>.

Véronis, Jean and Nancy Ide (1995b). Guidelines for Linguistic Software Development. Available at <http://www.lpl.univ-aix.fr/projects/multext/LSD/LSD2.html>.
