Title: | Multext-East CES1: Nineteen Eighty-Four, English |
Responsibility: |
Nancy Ide |
Modified ECI tags of first chapter to conform to CES
Added or modified some sub-paragraph level tagging.
|
|
Responsibility: |
Tomaz Erjavec |
Modified full ECI Orwell to conform to CES V3.15
|
|
Responsibility: |
Greg Priest-Dorman |
Modified Tomaz Erjavec's full Orwell to conform to CES V3.21
Checked and modified markup for correctness down to the paragraph level
|
|
Responsibility: |
Greg Priest-Dorman |
Added tagging of sentences in paragraphs using MtSeg and English resources.
|
|
Edition: | MTE Final Release |
Extent: | 104302 words 928772 bytes -
WordCount represents the number of words in this text exclusive
of tags and header information. ByteCount reflects the approximate
size of the file containing the doctype and cesDoc element including
all text, tags and header information.
|
Publication: |
Distributor: | Vassar College Computer Science Department |
| 124 Raymond Avenue, Poughkeepsie, New York, USA 12604 |
Published: | October 1st, 1997 |
Availability: | restricted -
Available for research purposes upon receipt of signed agreement
|
|
Source: |
Title: |
The European Corpus Initiative
Multilingual Corpus 1:
1984 by George Orwell (English)
|
Responsibility: |
Association for Computational Linguistics |
Converted from OTA's DTD to ECI DTD
|
|
Publication: |
Distributor: | ACL |
| ACL |
Published: | 1994 |
Availability: | restricted -
Available for research purposes upon receipt of signed agreement
|
|
Source: |
Title: | Orwell's 1984: electronic edition |
Responsibility: |
Oxford Text Archive |
The four versions of Orwell's 1984 in the OTA were all prepared by the
OUCS KDEM service in 1985 for Dr David C Bennett of the School of
Oriental and African Studies at London University. The texts here
have not been encoded or proofread in any way since they were produced
(other than the English text, which was converted to an SGML-like
encoding by John Price-Wilkin, and subsequently automatically
converted to conform to the OTA's DTD by myself and Alan Morrison. The
other languages were converted to TEI-conformant SGML by the ECI
project 1993.) -- LB, Nov 1992
|
|
Edition: |
Public Domain TEI edition prepared at the Oxford Text Archive
|
Publication: |
Distributor: | Oxford Text Archive |
|
Oxford University Computing Service
13 Banbury Road
Oxford OX2 6NN UK
archive@ox.ac.uk
|
Published: | 19 Nov 1992 |
Availability: | restricted -
Freely available for non-commercial use provided that this header is
included in its entirety with any copy distributed
|
|
Source: |
Nineteen Eighty-Four
1949; reprinted 1961
New American Library
New York
|
|
|
Project: |
This English version of Orwell's 1984 is encoded conformant to level 1
specifications of the Corpus Encoding Standard for the MULTEXT-EAST
project. The English is to serve as the base for the parallel corpus,
which will include aligned versions of the text in Romanian,
Bulgarian, Estonian, Slovenian, Czech, and Hungarian.
|
Editorial: |
Conformance: | Corpus Encoding Standard, Version 2.0 (level 1) |
Correction: | medium, silent |
Quotation: |
Rendition attribute values on Q, QUOTE, MENTIONED and TERM tags are
adapted from ISOpub and ISOnum standard entity set names when used.
If the rend attribute is omitted in the markup, the rendition on the
first set of Q, QUOTE, MENTIONED or TERM tags is "PRE lsquo POST
rsquo" and the rendition on a Q, MENTIONED or TERM tag nested in a Q or
QUOTE tag is "PRE ldquo POST rdquo"
(marks=none, form=nonstd) |
Segmentation: |
Marked up to the level of paragraph: P, QUOTE plus marking of
sub-paragraph element Q. Some marking of particular sub-paragraph
elements: NAME, DATE, ABBR, MENTIONED, DISTINCT, FOREIGN.
|
Hyphenation: |
No end-of-line hyphenation present in the ECI original.
|
|
Tags: | |
| abbr | 38 |
Abbreviations are marked only within marked names. Other abbreviations
are not marked.
|
| body | 1 | |
| date | 40 |
All dates which contain one or more digits (the characters 0-9) are
marked, including dates specifying day/month/year and dates consisting
only of a year. No attempt was made to identify or mark dates in other forms.
|
| distinct | 1 | |
| div | 28 | |
| foreign | 39 |
The Newspeak words "thoughtcrime" and "doublethink" are consistently
marked as FOREIGN when they do not appear inside some other tag whose
lang attribute already provides the language information. Latin and
French words are also marked.
|
| head | 1 | |
| hi | 103 |
The highlighting tag is used to mark words and phrases which were
typographically distinguished in the printed version of the text, and
for which no other more precise tag is applicable. In most of these
cases, such highlighting signifies emphasis.
|
| item | 4 | |
| l | 32 | |
| list | 1 | |
| mentioned | 261 |
Rendition information has not been systematically retained. When no
rendition information is provided, rendering is generally in italics in
the 1949 Harcourt, Brace and World edition of Nineteen Eighty-Four. The
original electronic version contained rendition information inconsistent
with the 1949 Harcourt edition. |
| name | 1744 |
Frequently occurring names of people, places, organizations,
products, languages, and events are marked. If a name is marked, every
occurrence of that name is marked.
For person names in the genitive, the English genitive suffix "'s" is
left outside the NAME tag. For other names, only those occurrences which
function as stand-alone proper nouns are marked; adjectival uses (e.g.,
"Newspeak words") are not marked.
|
| note | 2 | |
| num | 52 |
Anything containing one or more digits (the characters 0-9) that is
not part of a date, and all Roman numerals, are marked as a
number. In cases where a ratio is expressed (per cent, per thousand),
the entire phrase (e.g., "10 per cent") is marked as a number.
|
| p | 1286 | |
| poem | 10 | |
| ptr | 2 | |
| q | 2209 |
The Q tag is used to mark quoted dialogue. The attribute
"type=indirect" is used when attributed speech is marked
typographically in the printed text (e.g., "I know you," he seemed to
say). The attribute "type=written" is used in those cases where
Winston's writing in his diary is represented as quoted thought (e.g.,
"If there is hope," he wrote, "it lies in the Proles."). If no "rend"
attribute is provided on the Q tag, the value is assumed to be "PRE
ldquo" on the first Q in a series of Qs within the same P unbroken by
#PCDATA and "POST rdquo" on the last Q in the series. The attribute
"broken=yes" is used when no sentence-terminating punctuation (either
inside the Q itself or in the intervening text between two Qs) appears
between two dialogue fragments by the same speaker.
|
| quote | 35 |
QUOTE marks quotations from outside sources, including extensive
quotations from Winston's diary and Goldstein's treatise.
|
| s | 6701 |
S tags were inserted automatically at the locations (character
offsets) provided by MtSeg version 1.3.1, using the English resource
files, and were then cleaned up by hand.
|
| text | 1 | |
| title | 46 |
Rendition information has not been systematically retained. The
original electronic version contained rendition information inconsistent
with the 1949 Harcourt, Brace and World edition. |
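The per-element counts in the Tags table can be cross-checked against the marked-up text. The following is only a minimal sketch, assuming the text uses ordinary SGML-style start tags and that element names should be compared case-insensitively; the function name `count_start_tags` and the sample string are hypothetical and are not part of the MULTEXT-East distribution.

```python
import re
from collections import Counter

# Matches a start tag such as <p>, <q type="indirect"> or <NAME>.
# End tags (</p>) and declarations (<!DOCTYPE ...>) are not matched,
# because the character after '<' must be a letter.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)(?:\s[^<>]*)?>")


def count_start_tags(markup: str) -> Counter:
    """Count start tags per element name, ignoring case."""
    return Counter(m.group(1).lower() for m in START_TAG.finditer(markup))


if __name__ == "__main__":
    # Invented sample, for illustration only.
    sample = (
        '<p><s>He said <q>Hello</q>.</s> '
        '<s><NAME type="person">Winston</NAME> left.</s></p>'
    )
    for name, count in sorted(count_start_tags(sample).items()):
        print(name, count)
```

Comparing the resulting counts with the tagUsage figures would show whether the table was updated after the last markup change.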
9/5/96 | Tomaž Erjavec, IJS |
- Corrected chapter 1 (esp. the header) for CES V2 conformance
-
with a spelling checker, corrected a number of original OCR typos:
I instead of l, rn instead of m
- inserted Qs
- inserted some missing apostrophes
- changed '. . .' to '...', ' !' to '!', ' ?' to '?'
-
changed a number of GIs, as CES does not support ECI ones:
EMPH to HI
MENTION to MENTIONED and removed punctuation on single words therein
GLOSS to TERM (best I could come up with, without losing distinction)
|
14/5/96 | Tomaž Erjavec, IJS |
- Deleted apostrophes from chapter 2 and onwards
- Changed some TERM into FOREIGN
|
14/7/96 | Greg Priest-Dorman |
- Changed dashes to entity mdash (not complete)
- Added additional q tags where appropriate
- Added quote tags
- Changed q tags to quote tags where appropriate
- All quotation marks replaced with markup
- Replaced q tags with mentioned tags where appropriate
- Standardized the markup of poems in the text
-
Marked broken Q tags as such (linking of broken Q tags with
next and prev attributes is not yet done)
|
15/09/96 | Greg Priest-Dorman |
- linked broken Q tags with "prev" and "next" attributes
-
all occurrences of "..." and ". . ." have been replaced with the
ISO_8879:1986 Publishing entity "hellip"
-
changes of P and QUOTE tags since version .3 logged in file
p.and.quote.changes, available on request
- names tagged with NAME as stated above in tagUsage "gi=name"
-
quoted text tagged as stated above in tagUsage "gi=q" and tagUsage "gi=quote"
-
dates and numbers tagged as stated above in tagUsage "gi=num" and
tagUsage "gi=date"
-
abbreviations are tagged as stated above in tagUsage "gi=abbr"
-
OCR errors have been corrected when found; most noticeably, the "p"
at the beginning of "Party" was usually incorrectly in lower case.
-
"rend" if added has been checked against the 1949 Harcourt,
Brace & World, Inc. edition of Nineteen Eighty-Four
|
15/01/97 | Greg Priest-Dorman |
- Changed IDs, PREV and NEXT attributes using "1984en" to "Oen"
-
Fixed tagging error in Part 1 Chapter 4 QUOTE 2
(see mte1984-en.ces.V1.1.CHANGES) and reduced tagUsage for P by 2
- fixed some typos in the header
- replaced any tab (^I) characters in the text (there was one)
- reformatted the text for readability and consistency
- updated BYTECOUNT
|
03/03/97 | Greg Priest-Dorman |
-
Corrected markup: marked broken Qs part 1 chapter 8 paragraph 3
(pointed out by O. Csaba).
-
Corrected markup: in Part 1 chapter 4, in the list of Newspeak quotes
from the Times, part of the last list item was outside the list; it
is now inside (pointed out by T. Erjavec)
- Corrected punctuation error: in Part 1 chapter 4, the Newspeak quote
that ends "fullwise upsub antefiling" occurs twice. In the printed
edition it is followed by a period, so I added the period.
|
30/04/97 | Greg Priest-Dorman |
- inserted S tags in the locations given by MtSeg
-
inserted Q and HI tags where necessary as a result of S tag insertion
|
12/05/97 | Greg Priest-Dorman |
-
Corrected several tagging errors pointed out by T. Erjavec and V. Petkevic
- modified header to comply with T. Erjavec's header style
- updated tagUsage
- removed blank lines
|
14/05/97 | Greg Priest-Dorman |
- added Ss to two Newspeak paragraphs to aid in alignment
- updated tagUsage
|
19/05/97 | Greg Priest-Dorman |
-
Corrected several tagging errors pointed out by T. Erjavec
-
Corrected several typos in the text pointed out by T. Erjavec and V. Petkevic
- updated tagUsage
|
20/06/97 | Greg Priest-Dorman |
-
Corrected several tagging errors pointed out by Vladimir Petkevic
where a sentence boundary was inserted 2 characters ahead of where
it should have been.
|
1997-12-14 | Tomaz Erjavec |
- Corrected several errors in the header
|
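Several of the revisions logged above are mechanical punctuation clean-ups: replacing "..." and ". . ." with the "hellip" entity, and removing the space before "!" and "?". The sketch below illustrates that kind of normalization. It is an assumption about how such a pass could be written, not the procedure the editors actually used, and the function name `normalize_punctuation` is hypothetical.

```python
import re


def normalize_punctuation(text: str) -> str:
    """Apply two clean-ups of the kind described in the revision log."""
    # Replace "..." and ". . ." with the ISO publishing entity hellip.
    text = re.sub(r"\.\s?\.\s?\.", "&hellip;", text)
    # Remove a single space before an exclamation or question mark.
    text = re.sub(r" ([!?])", r"\1", text)
    return text


if __name__ == "__main__":
    # Invented sample, for illustration only.
    print(normalize_punctuation("Wait . . . what ?"))
```

A pass like this would still need manual review, since a run of three periods is not always an ellipsis in the printed text.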