Project: |
This English version of Orwell's 1984 is encoded in conformance with the
level 1 specifications of the Corpus Encoding Standard for the MULTEXT-EAST
project. The English text is to serve as the base for the parallel corpus,
which will include aligned versions of the text in Romanian,
Bulgarian, Estonian, Slovenian, Czech, and Hungarian.
|
Editorial: |
Conformance: | Corpus Encoding Standard, Version 2.0 (level 1) |
Correction: | medium, silent |
Quotation: |
Rendition attribute values on Q, QUOTE, MENTIONED and TERM tags are
adapted from ISOpub and ISOnum standard entity set names when used.
If the rend attribute is omitted in the markup, the rendition on the
first set of Q, QUOTE, MENTIONED or TERM tags is "PRE lsquo POST
rsquo" and the rendition on a Q, MENTIONED or TERM tag nested in a Q or
QUOTE tag is "PRE ldquo POST rdquo"
(marks=none, form=nonstd). |
Segmentation: |
Marked up to the level of paragraph: P, QUOTE plus marking of
sub-paragraph element Q. Some marking of particular sub-paragraph
elements: NAME, DATE, ABBR, MENTIONED, DISTINCT, FOREIGN.
|
Hyphenation: |
No end-of-line hyphenation present in the ECI original.
|
|
Tags: | |
| abbr | 38 |
Abbreviations are marked only within marked names. Other abbreviations
are not marked.
|
| body | 1 | |
| date | 40 |
All dates which contain one or more digits (the characters 0-9) are
marked, including dates specifying day/month/year and dates consisting
only of a year. No attempt was made to identify or mark dates in other forms.
|
| distinct | 1 | |
| div | 28 | |
| foreign | 39 |
The Newspeak words "thoughtcrime" and "doublethink" are consistently
marked as FOREIGN when they do not appear inside some other tag whose
lang attribute provides the language information. Latin and French
words are also marked.
|
| head | 1 | |
| hi | 103 |
The highlighting tag is used to mark words and phrases which were
typographically distinguished in the printed version of the text, and
for which no other more precise tag is applicable. In most of these
cases, such highlighting signifies emphasis.
|
| item | 4 | |
| l | 32 | |
| list | 1 | |
| mentioned | 261 | Rendition information has not been
systematically retained. When no rendition information is provided,
rendering is generally in italics in the 1949 Harcourt, Brace and World
edition of Nineteen Eighty-Four. The original
electronic version contained rendition information inconsistent with
the 1949 Harcourt edition. |
| name | 1744 |
Frequently occurring names of people, places, organizations,
products, languages, and events are marked. If a name is marked, every
occurrence of that name is marked.
When a person name appears in the genitive, the English genitive
suffix "'s" is left outside the markup. For other names, only those
occurrences which function as stand-alone proper nouns are marked;
adjectival uses (e.g., "Newspeak words") are not marked.
|
| note | 2 | |
| num | 52 |
Anything containing one or more digits (the characters 0-9) that is
not part of a date, as well as every roman numeral, is marked as a
number. In cases where a ratio is expressed (per cent, per thousand),
the entire phrase (e.g., "10 per cent") is marked as a number.
|
| p | 1286 | |
| poem | 10 | |
| ptr | 2 | |
| q | 2209 |
The Q tag is used to mark quoted dialogue. The attribute
"type=indirect" is used when attributed speech is marked
typographically in the printed text (e.g., "I know you," he seemed to
say). The attribute "type=written" is used in those cases where
Winston's writing in his diary is represented as quoted thought (e.g.,
"If there is hope," he wrote, "it lies in the Proles."). If no "rend"
attribute is provided on the Q tag, the value is assumed to be "PRE
ldquo" on the first Q in a series of Qs within the same P unbroken by
#PCDATA and "POST rdquo" on the last Q in the series. The attribute
"broken=yes" is used when no sentence terminating punctuation (either
inside the Q itself or in the intervening text between two Qs) appears
between two dialogue fragments by the same speaker.
|
| quote | 35 |
QUOTE marks quotations from outside sources, including extensive
quotations from Winston's diary and Goldstein's treatise.
|
| s | 6701 |
S tags were inserted automatically at the locations (character
offsets) provided by MTSeg version 1.3.1 using the English resource
files, and were then cleaned up by hand.
|
| text | 1 | |
| title | 46 | Rendition information has not been
systematically retained. The original
electronic version contained rendition information inconsistent with
the 1949 Harcourt, Brace and World edition. |
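
The default-rendition rule stated in the q entry above can be sketched in Python. This is only an illustrative reading of that rule, not part of the corpus tooling: the function name default_q_renditions is hypothetical, only direct Q children of a P are considered, and two points are assumptions not stated in the entry, namely that a non-Q element between two Qs ends a series and that a Q standing alone in its series receives both the opening and the closing value.

```python
import xml.etree.ElementTree as ET


def default_q_renditions(p):
    """Return (q_element, rend) pairs for the direct Q children of one P.

    A series is a run of Q elements with no non-whitespace text between
    them. A Q that already carries a rend attribute keeps it unchanged.
    """
    runs, current = [], []
    for child in p:
        if child.tag.lower() != "q":
            # Assumption: any other element ends the current series.
            if current:
                runs.append(current)
                current = []
            continue
        current.append(child)
        # Non-whitespace text after this Q (#PCDATA) ends the series.
        if (child.tail or "").strip():
            runs.append(current)
            current = []
    if current:
        runs.append(current)

    result = []
    for run in runs:
        for i, q in enumerate(run):
            if "rend" in q.attrib:
                result.append((q, q.attrib["rend"]))
                continue
            parts = []
            if i == 0:
                parts.append("PRE ldquo")    # first Q in the series
            if i == len(run) - 1:
                parts.append("POST rdquo")   # last Q in the series
            result.append((q, " ".join(parts)))
    return result


# Two adjacent Qs form one series; the text "he said." starts a new one.
example = ET.fromstring("<p><q>A</q> <q>B</q> he said. <q>C</q></p>")
rends = [rend for _, rend in default_q_renditions(example)]
# rends == ["PRE ldquo", "POST rdquo", "PRE ldquo POST rdquo"]
```

A Q in the middle of a longer series receives an empty value under this reading, since the entry assigns defaults only to the first and last Q.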