XMELLT

Cross-lingual Multi-word Expression Lexicons for Language Technology
 
 
 
 
 
 
 
 

Multilingual Information Access and Management

International Research Co-operation



 
 
 
 

Department of Computer Science
Vassar College

International Computer Science Institute
University of California, Berkeley

Department of Computer Science
New York University

Computing Research Laboratory
New Mexico State University


 
 





Contents

1 Background and overview

2 Project description

3 Methodology and work plan

3.1 Project management

3.2 Survey of existing resources

3.3 Development of pilot lexicons

3.4 Development of preliminary specifications for representing information about multi-word expressions

3.5 Development of preliminary specifications for the structuring and encoding of multi-lingual, multi-word expression lexicons

3.6 Exploration of techniques for automatic acquisition

4 Work plan

5 Summary

6 References



Background and overview

The importance of the lexicon for natural language applications is beyond question: literally hundreds of lexicons have been constructed to support such applications as information retrieval, summarization, extraction, and machine translation. As a result, there has been extensive work over the last several years to provide to the natural language processing (NLP) community:

Most lexicons are devoted to a single language and/or a particular approach dictated by the needs of particular systems, and may incorporate, for example, primarily syntactic information with minimal semantic information. Standardization efforts, notably the EAGLES standards for morpho-syntactic information and lexical semantics, have been established to remedy this situation by aiming at reusability of resources, and have made considerable headway in defining standards that can serve the natural language processing community. However, we are still far from the goal of developing universally available, standard lexicons that are not only reusable, but which meet the critical demand recognized by this joint EU/NSF call: to provide an infrastructure that will support truly multi-lingual, internationally accessible natural language processing applications.

This project is intended to address what we see as the next critical step in the development of resources to support pivotal NLP applications. Efforts supported by the EU, the NSF, and other national and international agencies are underway to develop reusable and widely available multi-lingual lexicons for individual words. This speaks to the recognition by the research community and funders alike that such resources are not only vital to support both mono- and multi-lingual NLP applications, but also that the state of the art has advanced to a point where their development is feasible. This project takes this effort to its next logical step, complementing and expanding on on-going efforts to tackle, in a systematic and coordinated way, the next critical need in lexicon development. The project we propose is focussed, and it involves, as it must, the international research and development community. As such it is directly in line with the aim of the programs on both sides of the Atlantic: to make appropriate headway in developments to support the international information infrastructure.

Although several existing lexicons include multi-word phrases whose meaning cannot be accounted for as a regular composition of the meanings of the constituents, no concentrated effort has been directed at developing flexible, multi-lingual lexicons for multi-word expressions that will be usable for a variety of natural language processing applications. The multi-word lexicons of many successful older machine translation systems such as SYSTRAN are in fact little more than Translation Memories: they do not provide any interface among single-word entries, multi-word entries, and syntax and semantics. However, as is well known in the translation community, cross-lingual correspondences cannot be represented as pairings of individual lexical items in the vast majority of instances: for instance, certain semantic features (such as temporal features) may be expressed inflectionally in one language but phrasally (e.g., as nominalizations + support verbs) in another.

Multi-word constructions are extremely frequent in language, comprising perhaps 30% of the lexical stock; consider, for example, the frequency of verbs with separate particles in English. Given their importance for NLP, we believe that work on multi-lingual, multi-word lexicons is seriously needed in order to lay the groundwork for the next generation of multi-lingual lexical resources. The time is ripe for this effort: the NLP community on both sides of the Atlantic has explored in depth the means to encode syntactic information, on the one hand, and semantic information, on the other. Standards such as those developed in EAGLES have involved extensive consideration of the demands of accommodating different languages and multiple languages. And, as various NLP applications have been developed, we have begun to understand in depth the processing needs that will support them. Intellectual and practical advances in the past several years put us in a position to deal with the more complex issues of multi-word expressions and, in particular, their cross-lingual correspondences.

Recognizing this, we propose a planning project that will investigate the potential to develop multi-lingual, multi-word expression lexicons incorporating both syntactic and semantic information. The specific aim of the project is to define the dimensions of a core international infrastructure that can support the creation of such a lexicon incorporating both morpho-syntactic and semantic information, and which will in turn provide a base for building pivotal natural language applications aiming at management of, and universal access to, the vast quantity of information that is becoming available each day via the World Wide Web. In particular, our aims are:

The project brings together a group of partners whom we believe to be the central figures in the field, all with extensive experience in the areas of lexicon development, representation, and use:

USA:

Europe:

The project will build on their combined expertise and the resources they have developed to achieve the goals specified above. The industrial partner will serve as an active observer, providing input on how the results can be used in multi-lingual applications as well as input concerning scalability and maintainability of such a resource in an industrial context.

The exploration of the possibility of developing multi-lingual, multi-word expression lexicons will serve as a basis upon which such resources can be developed. Multi-word expressions are crucially needed for a wide range of NLP tasks and applications, including natural language analysis and generation, machine translation, information retrieval and extraction, and word sense disambiguation, to mention only a few. For all tasks requiring text understanding, even when this is partial, identification and processing of multi-word expressions would make the whole analysis process easier and more accurate if carried out as a first step, before syntactic and semantic analysis takes place. The need is even more crucial for text generation: accurate text generation cannot be performed without a multi-word lexicon (see, for instance, the case of the generation of support verb constructions). This area of lexicon development is even more critical to support multi-lingual NLP applications.
 

Project description

The importance and role of multi-word expressions in the description and processing of natural language has long been recognized. Although large computational lexicons containing both syntactic and semantic information have begun to appear, they lack information about multi-word expressions. So far, multi-word information has typically been relegated to the marginal role of idiosyncratic lexical information, or has been addressed in terms of specific types of word combinations only (see, for instance, the NOMLEX dictionary, which is focussed on nouns, or FRAMENET, which treats only verbs).

No systematic effort has been made to accommodate multi-word expressions within a comprehensive model covering their wide and complex typology. In fact, under the generic umbrella of multi-word expressions there lies a variety of semi-pre-constructed phrases where the combination of words is — more or less — tightly bound, referred to in the literature as ‘idioms’, ‘compounds’, ‘collocations’, ‘word co-occurrence patterns’, etc. This project aims at taking a broader view of multi-word expressions and proposes an innovative lexical encoding model intended to accommodate the full typology of multi-word expressions.
 
 

Care will be taken in encoding both the range of variation admitted in the surface realization of multi-word expressions and the semantic transparency or opaqueness of each expression. In particular, the following aspects will be carefully considered:

The proposed encoding model for multi-word expressions will be compatible and integrated with other lexical encoding models for standard lexicons; e.g. the resulting multi-word expressions model should be easily accommodated within standard lexical encoding models such as EAGLES, which already make provision for the encoding of word co-occurrence information, or within lexicons like PAROLE, SIMPLE, or NOMLEX which represent de facto standards of lexical encoding.

The development of individual multi-word expression lexicons covering individual languages is not the only innovative area of the project. Another crucial aspect is the identification of the parameters for generalizing language-specific experience to multi-lingual applications. In particular, the focus will be on the linking of lexicons of individual languages, involving both individual words and multi-word expressions, or different types of multi-word expressions (see the English noun-noun pattern vs the noun-prepositional_phrase of Romance languages).

Our work in this project will lay the ground for a large-scale project to develop this increasingly needed resource. The need for multi-lingual multi-word expression lexicons makes the demand for automatic methods for acquisition of multi-word expressions particularly acute. Since it is widely acknowledged that current printed dictionaries do not contain information about multi-word expressions in a coherent and exhaustive way, acquisition from textual corpora is essential. We will start from a survey of different methods and technologies for lexical acquisition from unrestricted texts (including statistical and rule-based approaches), and, on this basis, a robust and unified approach to the acquisition of the complex typology of multi-word information will be sketched. This sketch will serve as the foundation for acquisition work within the large-scale project which should follow from this preparatory action.
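
As one illustration of the statistical end of the approaches to be surveyed, the sketch below ranks adjacent word pairs in a tokenized corpus by pointwise mutual information (PMI), a common association measure for collocation candidates. It is a minimal sketch under stated assumptions, not the acquisition method the project will adopt: the function name pmi_bigrams, the min_count threshold, and the toy corpus are all hypothetical, and real acquisition would also have to handle discontinuous expressions such as verbs with separated particles.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information.

    Hypothetical helper for illustration only; it considers
    contiguous two-word candidates and nothing else.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scored = []
    for (w1, w2), count in bigrams.items():
        # Rare pairs give unreliable PMI estimates, so skip them.
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))

    # Highest association first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy corpus (assumption): repeated support-verb pattern "make a decision".
corpus = (
    "she will make a decision today . they make a decision quickly . "
    "he will make a cake . a decision was taken"
).split()

candidates = pmi_bigrams(corpus, min_count=2)
for pair, score in candidates:
    print(" ".join(pair), round(score, 2))
```

A frequency threshold is used here because PMI is known to overrate pairs seen only once; a rule-based filter on part-of-speech patterns could be combined with such a score, which is the kind of unified approach the survey is meant to evaluate.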

We will undertake the following activities:

  1. Assessment of currently existing lexical resources for multi-word expressions, and standards for their representation. This work will be based on, among others, input from the following sources:
  2. Outline of a strategy to harmonize and build upon existing resources, in order to merge syntactic and semantic information in such a way as to be maximally usable for multi-lingual NLP applications.
  3. Outline of the fundamentals for representing information about multi-word expressions to make it maximally flexible and reusable. This involves two aspects:
     a. Specification of the kinds of linguistic information required in the resources;
     b. Specification of a data architecture and encoding formats that will enable harmonization of the resources and, most importantly, appropriate linkage among multi-lingual entries, with the aim of maximizing the efficiency of processing and retrieval.
  4. Identification of key NLP technologies and their precise needs for coping with phrasal lexical entries.
  5. Identification of the relevant partners to be involved in such an effort, representing a range of expertise and languages (including languages from a variety of language families).
To achieve items 2 and 3, we propose to explore the potential by accomplishing the following:

Although handling additional languages is outside the scope of this planning project, prior work of the participants in the projects on lexicons for, especially, non-Indo-European languages such as Japanese, Chinese, and Persian, will be brought into consideration as the work in this planning project develops. In this way, we will develop all specifications with an eye toward expansion to additional languages.
 

Methodology and work plan

The project is envisaged for one year, with the overall aim of laying the ground for development of a large-scale, multi-lingual lexicon of multi-word expressions. Work will be distributed among the seven partners (4 US partners, 3 EU partners) as described in the following sections. Note that this proposal includes budget and other information for the US partners only; budget and other relevant information for the EU partners, which has already been submitted to the European Commission, is provided in the EU version of the proposal. The EU proposal exactly as submitted to the European Commission is appended to this proposal.

The project will comprise the following activities:
 

  • Project management

    Project management will involve the following major activities:

    Because of the international nature of this project, two managing partners have been identified, one in Europe, and one in the U.S.
     
  • Survey of existing resources

    A preliminary step in achieving the overall aim of the project (the development of specifications for multi-word expression lexicons) is to survey and examine existing resources. A compendium of these resources will be developed, providing specific characteristics of each, including type of information, encoding structure, availability, etc. The previous work on surveying syntactic and semantic lexicons and their content compiled by various groups within EAGLES will serve as a basis for this effort.

    We envision a meeting of the entire consortium early in the project, in which partners will assess this information and lay the groundwork for developing specifications, using the report as a basis.

    The work undertaken here will serve as a basis for the creation of multi-word expression lexicon entries, and will feed into the development of specifications for information about multi-word expressions (see 3.4, below).
     

  • Development of pilot lexicons

    This entails, first, the identification of 50 noun entries from the NOMLEX lexicon and the PAROLE/SIMPLE lexicons, and the identification of 50 parallel entries for the same words in existing lexicons in DE, IT, and FR. Support verb entries for the 50 nouns across the four project languages will then be created.

    In parallel, 50 N-N constructs in EN will be identified in the PAROLE/SIMPLE lexicons and corresponding constructs for IT, DE, and FR will be determined. Entries for the constructs in all four languages will be created.

    Responsibility for creating the entries for each language will be assigned to the following partners:

    Italian: Pisa
    German: Stuttgart
    French: LexiQuest, New Mexico State University
    English: New York University

    The work will be undertaken in a step-wise fashion, with feedback and interaction among the involved partners at frequent intervals to ensure compatibility.

    This work will feed into the development of specifications for information about multi-word expressions.
     

  • Development of preliminary specifications for representing information about multi-word expressions

    The development of specifications for multi-word expression lexicons is obviously closely connected to the previous two activities, both of which will feed into this effort.

    This work involves the specification of the kinds of linguistic information required in the resources, including

    The overall plan of work in this workpackage is one of step-wise refinement, beginning with building on the survey of existing resources (3.2), and incorporating input and feedback from the ongoing development of the pilot lexicons (3.3). We will similarly incorporate input from the industrial partner concerning application needs.

    A second all-consortium meeting will take place at the 10-month mark, before the specifications are finalized, to enable the partners to harmonize entries in the four languages and agree on the final specifications.
     

  • Development of preliminary specifications for the structuring and encoding of multi-lingual, multi-word expression lexicons

    The development of representation schemes for multi-lingual, multi-word expression lexicons has two components:

    The first of these is undertaken in the work described in 3.4, above. The two tasks are intimately related, since the demands of what is to be represented and how to represent it bear upon one another. The developers of the sample multi-word expression lexicon entries must examine the information structure of the entries and the lexicons, and it must then be worked out how they can be mapped to an intelligently designed encoding format. This involves, for example, reconciling differences of information structure and preferred mappings, as well as consideration of the formatting and processing needs of applications that will use the information in the lexicons.

    All formats will be developed to be harmonized with EAGLES specifications.
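
    To make the relationship between information structure and cross-lingual linkage more concrete, the following is a minimal sketch of one possible in-memory data architecture. It is an illustration under stated assumptions and not an EAGLES format or the specification this project will produce: the class names Entry and MultilingualLexicon, all field names, the pattern labels, and the sample entries are hypothetical. It shows single-word and multi-word entries stored uniformly, with symmetric links across languages, so that an English noun-noun construct can correspond to a Romance noun-prepositional-phrase construct or to a single German compound.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """One lexicon entry, single-word or multi-word (hypothetical schema)."""
    entry_id: str
    lang: str      # language code as used in this proposal: EN, IT, DE, FR
    form: str      # citation form
    pattern: str   # syntactic pattern label (assumed labels, e.g. "N-N")

    @property
    def is_multiword(self) -> bool:
        # Simplifying assumption: whitespace separates the component words.
        return len(self.form.split()) > 1


@dataclass
class MultilingualLexicon:
    """Entries for several languages plus symmetric cross-lingual links."""
    entries: dict = field(default_factory=dict)
    links: set = field(default_factory=set)

    def add(self, entry: Entry) -> None:
        self.entries[entry.entry_id] = entry

    def link(self, id_a: str, id_b: str) -> None:
        if id_a == id_b:
            raise ValueError("an entry cannot be linked to itself")
        for entry_id in (id_a, id_b):
            if entry_id not in self.entries:
                raise KeyError(entry_id)
        # A frozenset makes the correspondence direction-independent.
        self.links.add(frozenset((id_a, id_b)))

    def correspondences(self, entry_id: str, lang: str) -> list:
        """Return entries in `lang` linked to `entry_id`, sorted by id."""
        found = []
        for pair in self.links:
            if entry_id in pair:
                (other_id,) = pair - {entry_id}
                other = self.entries[other_id]
                if other.lang == lang:
                    found.append(other)
        return sorted(found, key=lambda e: e.entry_id)


lexicon = MultilingualLexicon()
lexicon.add(Entry("en-1", "EN", "orange juice", "N-N"))
lexicon.add(Entry("it-1", "IT", "succo d'arancia", "N-PP"))
lexicon.add(Entry("fr-1", "FR", "jus d'orange", "N-PP"))
lexicon.add(Entry("de-1", "DE", "Orangensaft", "N"))
for target in ("it-1", "fr-1", "de-1"):
    lexicon.link("en-1", target)

for code in ("IT", "FR", "DE"):
    print(code, [e.form for e in lexicon.correspondences("en-1", code)])
```

    Keeping links separate from entries is one design choice among several; it lets an entry correspond to targets of a different syntactic pattern, or to a single word, without changing the entry structure itself.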

    Like the work described in 3.4, the overall plan of work is one of step-wise refinement, incrementally incorporating input and feedback from the work ongoing in 3.4, together with input from the industrial partner, concerning application needs. It will also rely in a broader way on the results of the survey of existing resources, which can provide a starting point for development of a common scheme. In support of this, we envision one or two meetings of representatives of Vassar and NMSU, the principal partners undertaking this task, with the industrial partner, in order to familiarize the developers of the encoding formats with their applications and determine specific application needs.
     

  • Exploration of techniques for automatic acquisition

    The object of this task is to accomplish the following:

    This task is divided into two six-month phases spanning the duration of the project:
     
     

    Work Plan
     


    START (month) | END (month) | PARTNER(S) | ACTIVITY | RESULT
    0 | 12 | Vassar (US), Pisa (EU) | Management | Final report summarizing all work in the project, and overviewing the large-scale project
    0 | 2 | Pisa, ICSI, NYU, NMSU, Stuttgart | Survey of existing resources | Report
    2 | 12 | NYU, ICSI, NMSU, Pisa, Stuttgart | Development of pilot multi-word expression lexicon entries | 4 pilot lexicons
    2 | 12 | Pisa, LexiQuest, NYU, ICSI, NMSU, Stuttgart, Vassar | Development of content specifications for multi-word expression lexicons | Report
    2 | 12 | Vassar, NMSU | Development of representation schemes for multi-lingual, multi-word expression lexicons | Report
    0 | 6 | Pisa, NMSU, Stuttgart, Vassar | Exploration of techniques for automatic acquisition, Phase 1 | Report
    6 | 12 | Pisa, NMSU, Stuttgart, Vassar | Exploration of techniques for automatic acquisition, Phase 2 | Report

     
     
     

    Summary

    This project is intended to lay the ground for development of a large-scale, multi-lingual, multi-word expression lexicon to support natural language processing applications. It is the stated aim of this bilateral call to "further the knowledge required to build information systems that operate in multiple languages" and to "accelerate the development of new applications required by citizens and businesses in the global information society and enable their uptake in various contexts". Clearly, the development of NLP technologies is a key element in achieving these goals. It has been explicitly stated in assessment workshops involving the international research community that this development demands immediate attention to the creation and availability of large-scale, multi-lingual resources to support these technologies. Our proposed project is intended to directly address this need. The project also addresses another aim of this call: to develop standards for the encoding of multilingual language knowledge. All of our work will be undertaken with full regard for existing language engineering standards such as those of EAGLES and is intended to ultimately both complement and contribute to them.

    We fully recognize the massive amount of effort required to develop resources of the scale and scope we eventually hope to see, and we also appreciate that undertakings of this size should not begin until the groundwork is carefully laid. Although individual projects have provided a basis for developing a large-scale multi-lingual multi-word expression lexicon, exploratory work must be done to establish the exact methodology for representing and merging multi-lingual information of this kind. To this end, we have gathered together a group of partners whom we believe to be the central figures in the field, all of them with extensive experience in the areas of lexicon development, representation, and use, and whom we believe to be best qualified to determine the shape of an eventual large-scale effort.
     
     
     

    References

    Erjavec Tomaz, Nancy Ide, Vladimir Petkevic, and Jean Véronis (1996) Multext-East: Multilingual Text, Tools and Corpora for Central and Eastern European Languages. Proceedings of the First TELRI European Seminar, 87-98.

    Ide Nancy (1998). Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora. Proceedings of the First International Language Resources and Evaluation Conference, Granada, Spain, 463-70. Full documentation available at <http://www.cs.vassar.edu/CES/>.

    Ide, Nancy (1998). Encoding Linguistic Corpora. Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, 9-17.

    Ide Nancy and Jean Véronis (1993). Modelling lexical information. In Hockey Susan and Nancy Ide (Eds.). Research in Humanities Computing 4, Oxford University Press, 193-206.

    Ide Nancy and Jean Véronis (1994). MULTEXT: Multilingual Text Tools and Corpora. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, 588-92.

    Ide Nancy and Jean Véronis (1995). Encoding dictionaries. In Ide Nancy and Jean Véronis (Eds.). The Text Encoding Initiative: Background and Context. Dordrecht: Kluwer Academic Publishers, 167-80.

    Ide Nancy and Greg Priest-Dorman (1996) Corpus Encoding Standard. EAGLES Technical Report, available from Istituto di Linguistica Computazionale, Pisa, Italy, 100p. Also available at http://www.cs.vassar.edu/CES/.

    Ide Nancy, Jacques Le Maître and Jean Véronis (1995). Outline of a Model for Lexical Databases. Current Issues in Computational Linguistics: In Honour of Don Walker. Linguistica Computazionale IX, X (Pisa, 1995), 283-320. [reprinted from Information Processing and Management., 29, 2, 159-186]

    Le Maître, Jacques, Nancy Ide and Jean Véronis (1994) Modélisation et interrogation de bases de données lexicales. Ingénierie des systèmes d'informations, 2, 1, 57-82.

    Tufis Dan, Nancy Ide and Tomaz Erjavec (1998) Standardized Specifications, Development and Assessment of Large Morpho-syntactic Resources for Six Central and Eastern European Languages. Proceedings of the First International Language Resources and Evaluation Conference, Granada, Spain, 233-40.

    Véronis, Jean and Nancy Ide (1996a) Considerations for Linguistic Software Reusability. Available at <http://www.lpl.univ-aix.fr/projects/multext/LSD/LSD1.html>.

    Véronis, Jean and Nancy Ide (1995b) Guidelines for Linguistic Software Development. Available at <http://www.lpl.univ-aix.fr/projects/multext/LSD/LSD2.html>.