CES header

Created by: NMI
Current Status: update
Created date: 1995-05-10
Updated date: 1997-12-14

File description

Title: Multext-East CES1: Nineteen Eighty-Four, English
Responsibility:
Nancy Ide
  • Modified ECI tags of first chapter to conform to CES. Added or modified some sub-paragraph level tagging.
  • Responsibility:
    Tomaz Erjavec
  • Modified full ECI Orwell to conform to CES V3.15
  • Responsibility:
    Greg Priest-Dorman
  • Modified Tomaz Erjavec's full Orwell to conform to CES V3.21. Checked and modified markup for correctness down to the paragraph level.
  • Responsibility:
    Greg Priest-Dorman
  • Added tagging of sentences in paragraphs using MtSeg and English resources.
  • Edition: MTE Final Release
    Extent: 104302 words
    928772 bytes
     -  WordCount represents the number of words in this text exclusive of tags and header information. ByteCount reflects the approximate size of the file containing the doctype and cesDoc element including all text, tags and header information.
    Publication:
    Distributor: Vassar College Computer Science Department
     124 Raymond Avenue, Poughkeepsie, New York, USA 12604
    Published: October 1st, 1997
    Availability: restricted - Available for research purposes upon receipt of signed agreement
    Source:
    Title: The European Corpus Initiative Multilingual Corpus 1: 1984 by George Orwell (English)
    Responsibility:
    Association for Computational Linguistics
  • Converted from OTA's DTD to ECI DTD
  • Publication:
    Distributor: ACL
     ACL
    Published: 1994
    Availability: restricted - Available for research purposes upon receipt of signed agreement
    Source:
    Title: Orwell's 1984: electronic edition
    Responsibility:
    Oxford Text Archive
  • The four versions of Orwell's 1984 in the OTA were all prepared by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental and African Studies at London University. The texts here have not been encoded or proofread in any way since they were produced (other than the English text, which was converted to an SGML-like encoding by John Price-Wilkin, and subsequently automatically converted to conform to the OTA's DTD by myself and Alan Morrison). The other languages were converted to TEI-conformant SGML by the ECI project in 1993. -- LB, Nov 1992
  • Edition: Public Domain TEI edition prepared at the Oxford Text Archive
    Publication:
    Distributor: Oxford Text Archive
      Oxford University Computing Service 13 Banbury Road Oxford OX2 6NN UK archive@ox.ac.uk
    Published: 19 Nov 1992
    Availability: restricted - Freely available for non-commercial use provided that this header is included in its entirety with any copy distributed
    Source: Nineteen Eighty-Four, 1949; reprinted 1961, New American Library, New York

    Encoding description

    Project: This English version of Orwell's 1984 is encoded in conformance with the level 1 specifications of the Corpus Encoding Standard for the MULTEXT-EAST project. The English is to serve as the base for the parallel corpus, which will include aligned versions of the text in Romanian, Bulgarian, Estonian, Slovenian, Czech, and Hungarian.
    Editorial:
    Conformance: Corpus Encoding Standard, Version 2.0 (level 1)
    Correction: medium, silent
    Quotation: Rendition attribute values on Q, QUOTE, MENTIONED and TERM tags are adapted from ISOpub and ISOnum standard entity set names when used. If the rend attribute is omitted in the markup, the rendition on the first set of Q, QUOTE, MENTIONED or TERM tags is "PRE lsquo POST rsquo", and the rendition on a Q, MENTIONED or TERM tag nested in a Q or QUOTE tag is "PRE ldquo POST rdquo" (marks=none, form=nonstd)
    Segmentation: Marked up to the level of paragraph: P, QUOTE plus marking of sub-paragraph element Q. Some marking of particular sub-paragraph elements: NAME, DATE, ABBR, MENTIONED, DISTINCT, FOREIGN.
    Hyphenation: No end-of-line hyphenation present in the ECI original.
    Tags:  
     abbr (38): Abbreviations are marked only within marked names. Other abbreviations are not marked.
     body (1)
     date (40): All dates which contain one or more digits (the characters 0-9) are marked, including dates specifying day/month/year and dates consisting only of a year. No attempt was made to identify or mark dates in other forms.
     distinct (1)
     div (28)
     foreign (39): The Newspeak words "thoughtcrime" and "doublethink" are consistently marked as FOREIGN when they do not appear in some other tag where the lang attribute provides the language information. Latin and French words are also marked.
     head (1)
     hi (103): The highlighting tag is used to mark words and phrases which were typographically distinguished in the printed version of the text, and for which no other more precise tag is applicable. In most of these cases, such highlighting signifies emphasis.
     item (4)
     l (32)
     list (1)
     mentioned (261): Rendition information has not been systematically retained. When no rendition information is provided, rendering is generally in italics in the 1949 Harcourt, Brace and World edition of Nineteen Eighty-Four. The original electronic version contained rendition information inconsistent with the 1949 Harcourt edition.
     name (1744): Frequently occurring names of people, places, organizations, products, languages, and events are marked. If a name is marked, every occurrence of that name is marked. Person names in the genitive are not marked to include the English genitive suffix "'s". For other names, only those occurrences which function as stand-alone proper nouns are marked; adjectival uses (e.g., "Newspeak words") are not marked.
     note (2)
     num (52): Anything containing one or more digits (the characters 0-9) that is not part of a date, and all Roman numerals, are marked as a number. In cases where a ratio is expressed (per cent, per thousand), the entire phrase (e.g., "10 per cent") is marked as a number.
     p (1286)
     poem (10)
     ptr (2)
     q (2209): The Q tag is used to mark quoted dialogue. The attribute "type=indirect" is used when attributed speech is marked typographically in the printed text (e.g., "I know you," he seemed to say). The attribute "type=written" is used in those cases where Winston's writing in his diary is represented as quoted thought (e.g., "If there is hope," he wrote, "it lies in the Proles."). If no "rend" attribute is provided on the Q tag, the value is assumed to be "PRE ldquo" on the first Q in a series of Qs within the same P unbroken by #PCDATA and "POST rdquo" on the last Q in the series. The attribute "broken=yes" is used when no sentence-terminating punctuation (either inside the Q itself or in the intervening text between two Qs) appears between two dialogue fragments by the same speaker.
     quote (35): QUOTE marks quotations from outside sources, including extensive quotations from Winston's diary and Goldstein's treatise.
     s (6701): S tags have been inserted automatically at the locations (character offsets) provided by MtSeg version 1.3.1 using the English resource files, and then cleaned up by hand.
     text (1)
     title (46): Rendition information has not been systematically retained. The original electronic version contained rendition information inconsistent with the 1949 Harcourt, Brace and World edition.
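
The genitive rule in the name entry above (the suffix "'s" stays outside the NAME tag) can be illustrated with a small Python sketch. This is not the procedure used to prepare the corpus: the function name `mark_person_name` is hypothetical, and the bare `<name>` tag without attributes is an assumption made for illustration.

```python
import re


def mark_person_name(text: str, name: str) -> str:
    """Wrap every stand-alone occurrence of `name` in a <name> tag.

    The trailing word boundary ends the match before an apostrophe,
    so a genitive "'s" is left outside the tag.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    return pattern.sub(lambda m: "<name>" + m.group(0) + "</name>", text)


if __name__ == "__main__":
    print(mark_person_name("Winston's diary lay open.", "Winston"))
```

Because every occurrence of a marked name is tagged, the substitution is applied globally rather than to the first match only.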

    Profile description

    Languages:  
     Newspeak ns none 
     Newspeak official jargon ns-jg none 
     British Cockney English en-ck none 
    WSD:  
      ISO Latin-1 character set for Western European languages  ISO8859-1  

    Revision description

    9/5/96 Tomaž Erjavec, IJS
    • Corrected chapter 1 (esp. the header) to conform to CES V2
    • used a spelling checker to correct a number of original OCR typos: I instead of l, rn instead of m
    • inserted Qs
    • inserted some missing apostrophes
    • changed '. . .' to '...', ' !' to '!', ' ?' to '?'
    • changed a number of GIs, as CES does not support the ECI ones: EMPH to HI; MENTION to MENTIONED (and removed punctuation on single words therein); GLOSS to TERM (best I could come up with without losing the distinction)
    14/5/96 Tomaž Erjavec, IJS
    • Deleted apostrophes from chapter 2 and onwards
    • Changed some TERM into FOREIGN
    14/7/96 Greg Priest-Dorman
    • Changed dashes to entity mdash (not complete)
    • Added additional q tags where appropriate
    • Added quote tags
    • Changed q tags to quote tags where appropriate
    • All quotation marks replaced with markup
    • Replaced q tags with mentioned tags where appropriate
    • Standardized the markup of poems in the text
    • Marked broken Q tags as such (linking of broken Q tags with next and prev attributes is not yet done)
    15/09/96 Greg Priest-Dorman
    • linked broken Q tags with "prev" and "next" attributes
    • all occurrences of "..." and ". . ." have been replaced with the ISO_8879:1986 Publishing entity "hellip"
    • changes of P and QUOTE tags since version .3 logged in file p.and.quote.changes, available on request
    • names tagged with NAME as stated above in tagUsage "gi=name"
    • quoted text tagged as stated above in tagUsage "gi=q" and tagUsage "gi=quote"
    • dates and numbers tagged as stated above in tagUsage "gi=num" and tagUsage "gi=date"
    • abbreviations are tagged as stated above in tagUsage "gi=abbr"
    • OCR errors have been corrected when found; most noticeably, the "p" at the beginning of "Party" was usually incorrectly in lower case.
    • "rend" if added has been checked against the 1949 Harcourt, Brace & World, Inc. edition of Nineteen Eighty-Four
    15/01/97 Greg Priest-Dorman
    • Changed IDs, PREV and NEXT attributes using "1984en" to "Oen"
    • Fixed tagging error in Part 1 Chapter 4 QUOTE 2 (see mte1984-en.ces.V1.1.CHANGES) and reduced tagUsage for P by 2
    • fixed some typos in the header
    • replaced any tab (^I) characters in the text (there was one)
    • reformatted the text for readability and consistency
    • updated BYTECOUNT
    03/03/97 Greg Priest-Dorman
    • Corrected markup: marked broken Qs part 1 chapter 8 paragraph 3 (pointed out by O. Csaba).
    • Corrected markup: Part 1 chapter 4, in the list of Newspeak quotes from the Times, part of the last list item was not in the list; it is now (pointed out by T. Erjavec)
    • Corrected punctuation error: Part 1 chapter 4, on two occasions the Newspeak quote which ends "fullwise upsub antefiling" occurs. In the printed edition this is followed by a period, so I added the period.
    30/04/97 Greg Priest-Dorman
    • inserted S tags in the locations given by MtSeg
    • inserted Q and HI tags where necessary as a result of S tag insertion
    12/05/97 Greg Priest-Dorman
    • Corrected several tagging errors pointed out by T. Erjavec and V. Petkevic
    • modified header to comply with T. Erjavec's header style
    • updated tagUsage
    • removed blank lines
    14/05/97 Greg Priest-Dorman
    • added Ss to two Newspeak paragraphs to aid in alignment
    • updated tagUsage
    19/05/97 Greg Priest-Dorman
    • Corrected several tagging errors pointed out by T. Erjavec
    • Corrected several typos in the text pointed out by T. Erjavec and V. Petkevic
    • updated tagUsage
    20/06/97 Greg Priest-Dorman
    • Corrected several tagging errors pointed out by Vladimir Petkevic where a sentence boundary was inserted 2 characters ahead of where it should have been.
    1997-12-14 Tomaz Erjavec
    • Corrected several errors in the header
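
The punctuation normalizations recorded in the 9/5/96 and 15/09/96 entries above can be sketched in Python as follows. This is an illustrative sketch only, not the tooling actually used for the corpus: the function name `normalize_punctuation` is hypothetical, and writing the entity as the SGML reference `&hellip;` is an assumption about how it appears in the text.

```python
import re


def normalize_punctuation(text: str) -> str:
    """Apply the punctuation changes logged in the revision description."""
    # 9/5/96: '. . .' becomes '...'
    text = text.replace(". . .", "...")
    # 9/5/96: ' !' becomes '!' and ' ?' becomes '?'
    text = re.sub(r" ([!?])", r"\1", text)
    # 15/09/96: every ellipsis becomes the "hellip" entity
    # (assumed here to be written as the reference &hellip;)
    text = text.replace("...", "&hellip;")
    return text


if __name__ == "__main__":
    print(normalize_punctuation("Wait . . . what ?"))
```

Collapsing the spaced form first means a single final replacement covers both spellings of the ellipsis named in the 15/09/96 entry.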