KLUWER ACADEMIC PUBLISHERS

TEXT, SPEECH AND LANGUAGE TECHNOLOGY
BOOK SERIES

Series Editors:
Nancy Ide, Vassar College, USA
Jean Véronis, Université de Provence, France


VOLUME 11


Natural Language Processing Using Very Large Corpora


Edited by

Susan Armstrong
ISSCO, University of Geneva
Kenneth W. Church
AT&T Research Laboratories
Pierre Isabelle
Xerox Research Centre Europe
Sandra Manzi
ISSCO, University of Geneva
Evelyne Tzoukermann
Lucent, Bell Laboratories
David Yarowsky
Johns Hopkins University


The 1990s have been an exciting time for researchers working with large collections of text. Text is available like never before. It was not all that long ago that researchers referred to the Brown Corpus as a `large' corpus. The Brown Corpus, a `mere' million words collected at Brown University in the 1960s, is about the same size as a dozen novels, the complete works of William Shakespeare, the Bible, a collegiate dictionary or a week of a newswire service. Today, one can easily surf the web and download millions of words in no time at all.

What can we do with all this data? It is better to do something simple than nothing at all. Researchers working with large corpora are using essentially brute-force methods to make progress on some of the hardest problems in natural language processing, including part-of-speech tagging, word sense disambiguation, parsing, machine translation, information retrieval, and discourse analysis. They are overcoming the so-called knowledge-acquisition bottleneck by processing vast quantities of data, more text than anyone could possibly read in a lifetime, and estimating all sorts of `central and typical' facts that any speaker of the language would be expected to know, e.g. word frequencies, word associations and typical predicate-argument relations.

Much of this work has been reported at a series of annual meetings, known as the Workshop on Very Large Corpora (WVLC) and related meetings sponsored by ACL/SIGDAT (Association for Computational Linguistics' special interest group on data). Subsequent meetings have been held in Asia (1994, 1997), America (1995, 1996, 1997) and Europe (1995, 1996). The papers in this book represent much of the best of the first three years of this workshop/conference as selected by a competitive review process.


Contents:

Introduction.
Implementation and Evaluation of a German HMM for POS Disambiguation; H. Feldweg.
Improvements in Part-of-Speech Tagging with an Application to German; H. Schmid.
Unsupervised Learning of Disambiguation Rules for Part-of-Speech Tagging; E. Brill, M. Pop.
Tagging French without Lexical Probabilities - Combining Linguistic Knowledge and Statistical Learning; E. Tzoukermann, et al.
Example-Based Sense Tagging of Running Chinese Text; X. Tong, et al.
Disambiguating Noun Groupings with Respect to WordNet Senses; P. Resnik.
A Comparison of Corpus-based Techniques for Restoring Accents in Spanish and French Text; D. Yarowsky.
Beyond Word N-Grams; F. Pereira, et al.
Statistical Augmentation of a Chinese Machine-Readable Dictionary; P. Fung, D. Wu.
Text Chunking Using Transformation-based Learning; L. Ramshaw, M.P. Marcus.
Prepositional Phrase Attachment through a Backed-off Model; M. Collins, J. Brooks.
On the Unsupervised Induction of Phrase-Structure Grammars; C. de Marcken.
Robust Bilingual Word Alignment for Machine Aided Translation; I. Dagan, et al.
Iterative Alignment of Syntactic Structures for a Bilingual Corpus; R. Grishman.
Trainable Coarse Bilingual Grammars for Parallel Text Bracketing; D. Wu.
Comparative Discourse Analysis of Parallel Texts; P. van der Eijk.
Comparing the Retrieval Performance of English and Japanese Text Databases; H. Fujii, W.B. Croft.
Inverse Document Frequency (IDF): A Measure of Deviations from Poisson; K. Church, W. Gale.

List of Authors.
Subject Index.


ORDER INFORMATION

Natural Language Processing Using Very Large Corpora
Edited by Susan Armstrong, Kenneth W. Church, Pierre Isabelle, Sandra Manzi, Evelyne Tzoukermann, David Yarowsky

TEXT, SPEECH AND LANGUAGE TECHNOLOGY Series, Volume 11
Kluwer Academic Publishers, Dordrecht

Hardbound, ISBN 0-7923-6055-9
November 1999, 324 pp.

For customers in Mexico, USA, Canada and Latin America:
Kluwer Academic Publishers
Order Department
P.O. Box 358
Accord Station
Hingham, MA 02018-0358
U.S.A.
Tel : 617 871 6600
Fax : 617 871 6528
Email : kluwer@wkap.com

Rest of the world:
Kluwer Academic Publishers Group
Order Department
P.O. Box 322
3300 AH Dordrecht
The Netherlands
Tel : +31 78 6392392
Fax : +31 78 6546474
Email : services@wkap.nl




Last modification : 3/18/00