TEI and the American Memory Project at the Library of Congress

Deborah A. Lapeyre and Tommie Usdin
ATLIS Consulting Group

Project Overview

American Memory is a pilot program of the Library of Congress, designed to make selected collections widely available in electronic form. The Library intends to provide this information via computer networks in the future, but is using CD-ROM as a distribution medium for a five-year pilot (1990-1994).

American Memory includes text, image, motion, and audio collections. The text collections include two distinct types of information: archival text materials and published materials. Archival collections are generally unpublished materials, and will be provided as searchable texts with facsimile images of each page. The published books will be produced as searchable texts, with images of any illustrations.

The American Memory project has been designed to provide continued document availability in both the near term and the extended future. During the design and planning of this system, the project personnel were looking for the ability to provide machine-readable documents in a neutral format that could:

o Not restrict document access, but be usable with a variety of search and display engines on various hardware platforms;
o Be usable well into the foreseeable future;
o Facilitate both display and printing of a simplified presentation of the material (not a faithful reproduction of the appearance of the original); and
o Facilitate searching and information retrieval.

Selection of SGML: Through the Back Door

The Library of Congress made SGML an option in an American Memory data conversion procurement. This contract was primarily for scanning, conversion to ASCII, and tagging several of the more fragile collections to be included in American Memory. Bids were accepted for systems using SGML or the generic coding scheme the project had used in prototype testing.
Systems Integration Group proposed that by using SGML, the Library could get not only a better, more flexible text resource but also a high-quality text resource at a lower price than it would with non-SGML generic coding. This is because there is commercial off-the-shelf software that can validate the tags and the tag relationships of SGML documents. If SGML were not used, either custom software would be needed to do this validation, or a lower level of tag validation would be performed.

Decision Favoring SGML over Generic Coding or HTML

American Memory preferred the SGML option because it was an accepted standard, ensuring longer-term viability of the converted text, and because it significantly increased the project's options in commercial off-the-shelf retrieval systems. Not only is there an increasing number of SGML-based retrieval systems available, but they are often available for relatively standard and relatively inexpensive hardware platforms. SGML is readily translatable to HTML for access across the World Wide Web using even low-end browsers. Higher-end browsers can display the SGML directly. Thus SGML supports the goal of making the American Memory materials as widely available as possible. Use of SGML also fits the Library's philosophy of using standards whenever practical.

Although any generic coding scheme can be used to achieve these advantages to a degree, SGML offers several distinct advantages:

o SGML adds intelligence to documents in the explicit identification of document structure. The knowledge of document structure can be used to reformat the document for printing or display, relate parts of the document to each other or to other documents, and provide context to improve the precision of information retrieval requests.
o SGML parsing provides a critical level of error checking.
Parsing can, for example, find small errors in format coding (such as turning on italics and never turning them off) that degrade the quality of the information display. Verifying that all tags are keyed correctly and used in expected patterns can prevent search system and formatting failures.
o SGML is more flexible than other generic coding systems. It is very easy to convert from an SGML tagging scheme to another encoding scheme, for example, that of a desktop publishing system for printing or formatted display.

American Memory Design Goals and Objectives

The American Memory project is creating a corpus of searchable electronic text to be used by researchers in various on-line searching and CD-ROM prototypes, using many different search engines. It was necessary that the search and retrieval system supply, at a minimum, full-text searching and the ability to print and display both scanned image files and reasonable approximations (not faithful copies) of the converted text.

[***LeeEllen, we need to name the search engines you have used here *** and brief capabilities]

SGML tags have been used to indicate the boundaries of documents and delineate structures within the text. The SGML tagging is descriptive, not interpretive, in nature. The tagging must be rich enough to provide the ability to:

o Format the text for printouts;
o Link to images of pages or portions of pages;
o Limit searches to a particular text structure or exclude the contents of a particular structure from a search (for example, find a particular word anywhere it occurs in a heading, not merely within the text of a paragraph); and
o Identify elements in the data that are never displayed to the searcher, but whose content may be searched (e.g., the content of table cells).

SGML Application Standards

The decision to use SGML immediately prompted the question of what flavor of SGML to use.
The choice of SGML merely means that elements will be defined and described in a Document Type Definition (DTD) according to the rules of ISO 8879. The international SGML standard specifies how to describe which SGML features have been selected and how to specify a tag set, but does not provide a tag set or set of template tag relationships.

Many organizations choose to use an SGML "application standard", such as ISO 12083 or the TEI. There is nothing in the application standard that contradicts the international standard, but it is far more specific. An application standard exists because a community of people (for example, the semiconductor industry, the automobile manufacturers, or the airline and aircraft industries) have discussed their mutual and individual requirements and made compromises on elements and data relationships all can use. The decision to model an application on an existing application standard implies the restriction of the resultant DTD to a particular way of organizing element relationships, an agreement to use certain models for the content of elements, and at least a partial agreement on tag sets and naming conventions.

AAP Application Standard (ISO 12083)

The Library's first choice in SGML application standards was ANSI/NISO Z39.59-1988, commonly known as the AAP standard (since then revised as ISO 12083). The AAP standard was the first SGML application standard developed, and had as one of its goals the encoding of manuscripts for publication. The Library of Congress was heavily involved in the development of the AAP standard and placed particular emphasis on the bibliographic portion of the model. In 1988 it was approved by the American National Standards Institute as ANSI/NISO Standard Z39.59-1988. In 1994, following major revision, it was published as an International Organization for Standardization (ISO) standard, ISO 12083.
Because of its wide endorsement by standards organizations, many members of the library community felt that the "AAP" standard should be used for the American Memory implementation. At the beginning of the American Memory document analysis the Library felt that unless there was a very good reason not to adhere to the AAP standard, it wanted to use it.

The original goals of the AAP standard were to allow authors to create manuscripts in machine-readable form that would be publisher independent, and useful to the library community. The AAP standard provides predefined tags for some commonly occurring manuscript elements and specifies how to extend the set. It provides a mechanism for extending the structures defined to include additional elements through an extensive set of SGML parameter entities.

The architecture of the AAP standard is based on the assumption that for any application, the structure of the documents to be included will be known. The standard is quite prescriptive, providing a way to enforce style standards, many required elements, and a specific sequence of document components. This prescriptive ability is very valuable in situations in which documents are being authored, edited, or modified. It provides a mechanism for enforcing document standards, and ensuring that required parts of documents are present and in the correct sequence. However, it is simply not appropriate to try to define archival materials with a prescriptive structure; the documents simply will not match the rules. It is futile to pretend that the documents will match the rules, and destructive to alter them to match.

TEI Application Standard

At the same time the American Memory Project was analyzing its document encoding requirements, the Text Encoding Initiative was being developed by a consortium of archivists, historians, and computational linguists.
The TEI DTDs were being designed to permit the encoding of any textual document into a form that is system and interpretation independent, so that multiple analyses can be done of the same machine-readable text.

The basic premise of the TEI application standard is that whatever occurs in the document should be identified, so that it is accessible. The TEI standard imposes few sequence rules and includes very few required structures (there are a few in the header, which identifies the file, and the header itself is required). For example, while it is possible to identify the portion of a text that identifies a conference, there is no reason to expect that in a historical text the conference number would be followed by the conference name, the conference date, conference place, and sponsor, in that order; that sequence is a rule imposed by the AAP standard.

During the American Memory document analysis, library staff and the SGML consultants were frequently amazed by the close fit between the requirements of American Memory and the TEI. There are portions of the TEI guidelines that could be quotations from the workshop discussions. For example, after a lengthy discussion in which we learned that in some of these materials Tables of Contents and Indices were as likely to occur at the front as at the back, we read in the TEI guidelines that:

"Conventions vary as to which elements are grouped as back matter and which as front. For example, some books place the table of contents at the front, and others at the back. Even title pages may appear at the back of a book as well as the front. The content model for <back> and <front> are therefore identical."

Selection of the TEI Model

TEI Guidelines were selected as the more appropriate SGML application standard for the American Memory project to emulate because there was no way to predict what information would appear in the document, nor in what sequence, and it was not acceptable to re-arrange a historical text to meet the needs of a coding or retrieval system.
Therefore, the Document Analysis Group decided:

o The American Memory DTD would follow the model of the TEI, with the intent that movement to full compliance will be relatively simple once the TEI DTDs are complete. The TEI DTDs have been designed in a deliberately modular fashion, including many parameter entities, because it is the intent of the TEI Board that each site put together, from the DTD fragments, the document components needed to describe its collection.
o All structures would follow the TEI models as much as possible. Deviations from the TEI structures will be clearly identified, both in a report and as comments within the DTD.
o A subset of the TEI header was to be included in the DTD. All required TEI header elements are present in the American Memory documents, even though, at least in the original implementation, none of the optional components were used.
o TEI naming conventions would be used where known, particularly for parameter entities. This is in the spirit of the TEI, which has lists of items that local groups are expected to change. The TEI genre distinctions were not preserved; all text is tagged as prose in the American Memory project.

Design and Analysis Decisions

The American Memory DTD was developed based on a five-day document analysis session. Participants included people familiar with a wide variety of the materials to be included in American Memory as well as the team responsible for the design and development of the electronic product. Participants were responsible for making decisions on what should and should not be identified in the tagged file, and what the various elements should be named. SGML consultants produced the final Document Type Definition and SGML Tag Library.

Design Constraints

A major design constraint was that there was a limited budget. As much of the tagging as possible would be automated, and a commercial document conversion company would do all of the manual tagging.
This meant that the people making judgments would be neither librarians nor subject experts. It is reasonable to assume that many of them would have limited literacy; keying is often done offshore or by prison inmates. Thus it was deemed essential to minimize the amount of judgment to be imposed during the conversion process. The Library chose not to identify structures that would be difficult to identify correctly, preferring to leave the full burden on the searcher rather than provide inaccurate coding. Among the other compromises made because of the conversion process:

o Dates are recorded as given in the document, with NO attempt to standardize content or form;
o Names of persons, places, and historical periods are NOT identified, tagged, or verified against an authority file;
o The keyers do not determine document boundaries or identification (Library staff determine and communicate what constitutes a single document); and
o TEI Header information (such as title) is provided by Library staff. Title, author, and publication information as it appears in a document are tagged as generic identification information.

An additional major constraint was that at the time the American Memory DTD was developed the TEI guidelines were incomplete. The DTD could not, of course, comply with a standard that did not yet exist. It was the hope of the developers that, by adhering to as much of the TEI application as was applicable and available, and trying to stick to the spirit of the TEI application as we understood it, conversion to a fully compliant application would be relatively easy in the future.

Analysis Decisions

In any documentation project, there are some critical design decisions. We have already discussed the most important of these: selection of an SGML application standard. In the case of American Memory, there were some additional decisions, equally important to the ultimate usability of the documents.
These include:

o Definition of the text of a document;
o The definition of a document and inclusion of bibliographic information;
o Use of the scanned images; and
o Table handling.

Definition of Text

It would be possible to say that all letters and numbers on the documents should be considered part of the "text", and included in the searchable text file of the document. The advantages of this approach are that nothing will be skipped and that it requires the least amount of judgment on the part of the document conversion staff. Disadvantages of this approach are, however, significant:

o A lot of redundant text (such as letterheads that appear on every page) would be included, increasing costs and file size with very little value;
o Unneeded text would interrupt the primary text, impairing searching (for example, a letterhead could occur in the middle of a paragraph or even a word); and
o Extracting all of the text from graphics and complex text such as advertisements is quite time-consuming, and thus expensive.

As a compromise, the document analysis team decided that all text within the "body" of the page should be tagged and made searchable. Such text will include:

o The original page number(s);
o Any handwritten, stamped, perforated, or otherwise interpolated additional text, notes, or annotations;
o The text of a letterhead, the first time each appears; and
o The text of advertising, unless specific instructions are given to exclude advertisements.

They decided that the searchable text would NOT include:

o Text within an illustration or image (even if it can be read);
o Running heads and feet (such as telephone-style ears in books); and
o Text in a letterhead, except the first time each appears.

Document Identification and Inclusion of Bibliographic Information

The definition of where a document begins and another document ends is a subject for endless discussion. If three "books" are bound into one cover, is this one document or three?
How are the pages in a pile of notes separated into documents, or is the entire pile a document? What about scrapbooks or clipping files in archival material?

Of even more concern to the Library is the identification of each document. Librarians spend years learning to catalog documents, and have endless discussions about the correct cataloging of non-published and archival materials such as much of the American Memory collection. Many of the documents to be included in American Memory had already been cataloged by the Library, and most of the others were cataloged as part of a group of documents. Full MARC records (the Library's machine-readable catalog record) exist for many of these documents. It seemed wasteful for the data conversion staff to re-catalog the documents, and totally unrealistic to expect them to identify the documents in a way that corresponds to the judgments made by the collection staff.

For this reason also, it was decided to utilize very little of the complex bibliographic structure in the TEI header. The keyers and coders are not trained catalogers, and would not have access to the existing cataloging information. Even if the existing cataloging were available, there is not a simple mapping between the structure of a MARC record and the bibliographic information in the TEI header.

In order to make maximum use of existing cataloging resources, the Library staff decided that it was far more practical for them to make document identification and boundary decisions than for them to try to write guidelines that would allow data conversion staff to make those decisions. For each document, the Library of Congress has prepared a target page on which they provide a variety of identification and control information, and record some additional instructions to the keying staff.
The target sheets:

o Identify the start and type of the document;
o Assign a unique document ID number, which will be used to link to the library cataloging record;
o Provide the document title; and
o Determine the level of coding, for example whether or not to tag the full text of any advertising.

Link to Scanned Images

Much of the material to be included in the American Memory collections is fragile and valuable. In order to protect these materials, they were scanned by a contractor at the Library, and the scanned images were used for further processing. Handling of the materials was strictly limited, to reduce damage to the collections. The images can be provided for display whenever appropriate.

To make the display of the page images possible, a set of page-related information was inserted at the beginning of the text for each page in the text file. This includes the ID of the image of the page, and any page numbers that appear on that page. In many manuscripts, and some books, there are multiple numbers on a physical page. A page may have a "-2-" in the upper right header, and a "Page 14" in a footer. In such cases, both page numbers were captured, and no judgment was made as to which was the "primary" or "correct" page number. There is also no need for the page numbers to be unique (which in real life they often are not) because pages are uniquely identified by the ID of the scanned images.

These images will be used to display any non-text information in the documents, to display tabular information, and to provide a true representation of the original artifact (in the case of archival materials).

Tables

In the SGML world, there seems to be more philosophical discussion of table encoding than of any other single subject. There is a wide variety of table-handling models, none of which is fully satisfactory for all applications.
The only agreement is that in many types of documents, particularly non-fiction documents, there is a great deal of critical information that is available only in tabular form.

For the purposes of American Memory the analysis group decided that there were two major requirements for encoding of tables:

o It is important to be able to do text retrieval on the words in the tables; and
o Users need to be able to see the tables as they were published.

Simple retrieval on the words and phrases in a cell was deemed sufficient; there was no need to support searches such as "find every table where the word 'Massachusetts' appears in one row and the words 'New York' appear in the subsequent row". It was also observed that the archival materials included a wide variety of table layouts, including not only rectangular tables with rows and columns, but also circular and spiral tables.

The existence of the scanned images made a very simple and cost-effective method available for encoding tables. The text of the tables would be made available for searching, but the display version of each table would be its scanned image, NOT a reconstructed table based on the text file. Only the text of the tables, including cell integrity, was captured in the text file. No formatting or structural information above the level of the cell was encoded about the tables because, while tables would be retrieved by the words in their cells, only their images would be displayed (not the searchable text). That is, each cell was identified as being in the table, and the words in the cell were identified, but no information on the size, shape, or location of the cell was captured. This meant that coding the tables became no more difficult or costly than coding much of the text, and the number of codes per table was reduced by an estimated factor of three to eight (depending on the complexity of the table).
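
The table treatment described in this section can be illustrated with a short sketch. It is not the Library's software and it does not use the American Memory DTD: the tag names "table" and "cell", the "image" attribute, the sample values, and both function names are assumptions made only for illustration. The sketch keeps nothing but the text of each cell and the ID of the scanned image, searches for a word or phrase within a single cell, and returns the image to display rather than a rebuilt table.

```python
import re

# Hypothetical tagged fragment. The tag names, the "image" attribute, and
# the cell values are illustrative assumptions, not American Memory markup.
SAMPLE = (
    '<table image="img0042">'
    "<cell>State</cell><cell>Capital</cell>"
    "<cell>Massachusetts</cell><cell>Boston</cell>"
    "</table>"
    '<table image="img0057">'
    "<cell>New York</cell><cell>Albany</cell>"
    "</table>"
)

TABLE_RE = re.compile(r'<table image="([^"]+)">(.*?)</table>', re.S)
CELL_RE = re.compile(r"<cell>(.*?)</cell>", re.S)


def extract_cells(tagged_text):
    """Return (image_id, cell_text) pairs.

    Only cell boundaries and cell text are kept; nothing about rows,
    columns, size, or position is recorded.
    """
    cells = []
    for image_id, body in TABLE_RE.findall(tagged_text):
        for cell_text in CELL_RE.findall(body):
            # Collapse runs of whitespace so phrases match reliably.
            cells.append((image_id, " ".join(cell_text.split())))
    return cells


def find_table_images(cells, phrase):
    """Return image IDs of tables with at least one cell containing the phrase."""
    needle = " ".join(phrase.lower().split())
    if not needle:
        return []
    hits = []
    for image_id, text in cells:
        if needle in text.lower() and image_id not in hits:
            hits.append(image_id)
    return hits
```

Calling find_table_images(extract_cells(SAMPLE), "New York") identifies the scanned image of the second sample table. A query that would need to relate one cell to another, such as a word in one row and a phrase in the next, cannot be answered from this data, which matches the decision not to support such searches.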

The American Memory DTD

The American Memory DTD is a relatively simple implementation of the TEI concept. Among the simplifications made are:

1. TEI genre distinctions have been ignored. American Memory documents are prose.
2. Nearly every content model in the TEI was simplified to remove optional elements that were not relevant to American Memory.
3. No distinction was made between numbered and unnumbered divisions of text.

Added Elements

Elements were added to identify structures of particular interest in these collections, including:

o Handwritten text in an otherwise typewritten or typeset document;
o Text stamped, embossed, or perforated on the documents;
o Advertisements;
o Tables;
o Control or Scanning Page Number;
o Page Number as Printed on the Page;
o Blank Page Marker;
o Library of Congress Catalog Number;
o Collection Name; and
o Copyright Information.

Modifications Needed to Make American Memory TEI-Compliant

The American Memory DTD in Current Use

The American Memory DTD is in use to capture a variety of materials, and to re-tag some documents that were previously tagged using a non-SGML generic tagging scheme. The DTD has proven useful for tagging a variety of texts.

American Memory has digitized a variety of Library of Congress collections. The project is currently interested in talking to potential partners who may be interested in publishing some of these collections.

While there is no easy way to measure the relative accuracy or retrieval system precision using SGML as compared to non-SGML encoding, the SGML option and the selection of the TEI model for American Memory seem to be working well.

The Future of the American Memory

[This section will probably never be written. LeeEllen's talk will incorporate this material, along with her lessons learned, implications for the library community, etc.]