Using the ANC with Xaira
Setting up Xaira and the ANC is a relatively straightforward task. It may look like there are a large number of steps involved,
but each step is essentially the same:
- Run the Indextools program and perform the tasks on the Tools menu in (almost) the same order they appear on the menu.
- Accept the default values (mostly) and click Ok.
- Wait.
These instructions only describe how to set up an elementary index for the ANC. Xaira supports much more complex indexing capabilities. Please refer to the Xaira documentation for more details.
Requirements
It is assumed that you have unpacked the ANC files somewhere onto
your hard drive. We will refer to this location as the ANC home directory from now
on. At a minimum you will need the "merged" directory and the XML
files respStmt.xml and publicationStmt.xml in the ANC home directory.
It is also assumed that you have the latest available build of Xaira installed (v1.10 as of this writing).
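Before starting, it can help to confirm that the ANC home directory has the layout described above. The following is a minimal Python sketch and is not part of the ANC or Xaira distributions; the function name missing_anc_requirements is an illustrative assumption, and only the three items named in the requirements are checked.

```python
from pathlib import Path

# Items the requirements above say must exist in the ANC home directory.
REQUIRED_DIRS = ["merged"]
REQUIRED_FILES = ["respStmt.xml", "publicationStmt.xml"]


def missing_anc_requirements(anc_home):
    """Return the names of required items that are absent from anc_home.

    The function name and return convention are illustrative only.
    """
    home = Path(anc_home)
    # Directories first, then files, in the order listed above.
    missing = [name for name in REQUIRED_DIRS if not (home / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (home / name).is_file()]
    return missing
```

Calling missing_anc_requirements with the path to your own ANC home directory returns an empty list when all three items are present, and otherwise lists the ones that still need to be put in place.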
Steps
- Download the ANCXaira.zip
file. The zip archive contains the following files:
- BulkXsltW.exe : a program for preprocessing the ANC files.
- anc.xsl : an XSLT style sheet used to preprocess the ANC.
- bib.xsl.txt : a text file containing the XSL to be pasted into Xaira when creating the bibliography.
- nyt-headers-fixed.zip : a zip file containing 14 updated/repaired headers for the NY Times files.
Unzip the ANCXaira.zip file to your ANC home directory. Unzip the nyt-headers-fixed.zip into the ANC home directory
as well. The structure of the files in the nyt-headers-fixed zip file is the same as the directory structure of the
merged\nytimes directory so the headers will unzip to the proper locations if unpacked into the ANC home directory.
- Create a new directory called "texts" in your ANC home directory.
- Start the BulkXsltW program. If the BulkXsltW program is in the ANC home directory you can accept the default values
and simply click the "Transform" button. Otherwise enter the following:
- For the input directory browse to the ANC home directory and select the "merged" directory.
- For the XSL file select the "anc.xsl" file you installed above.
- For the output directory select the "texts" directory you created above.
- Leave the rest of the fields as they are.
- Click the "Transform" button.
- Wait. (It takes approximately 15 minutes on a 1.5 GHz machine.)
Quit the BulkXsltW program when it completes.
Note: The BulkXsltW program is a Java application bundled as an executable and
requires that you have Java 1.4 or later installed on your system. Since Xaira is a
Windows application, I have only provided a Windows executable. I can provide executables for other platforms, a Java Jar file,
or the Java source code if needed.
- Start Xaira's index tools program.
- Select "New" from the "File" menu. You should see the message "(new) New Corpus" in Xaira's window.
- Select "Parameter file..." from the "Tools" menu.
- Enter anything you want as the name.
- Click the Browse button for the "Root" and select your ANC home directory.
- Click the Default button.
- Click the "Advanced" button.
- Make sure the "XML Validation" check box is not checked.
- Click OK.
- Click OK again.
- Select "File list..." from the "Tools" menu. Click the
"Generate" button. There should be 11405 files in the new file list. Click Ok.
- Select "Make Header" from the "Tools" menu. Click "Ok" if you are asked if you want to make a new header. Go for coffee.
- It's probably a good idea to save your work at this point. Select "File -> Save" or click the "Save" icon on the toolbar.
- Select "Make Bibliography". Replace the contents of the window with the following:
<!-- Replace the select path by a path from the root of the document
     to the bibliography. -->
<xsl:template match="/" xmlns:x="http://www.xces.org/schema/2003">
  <xaira:bibliography>
    <xsl:apply-templates select="//x:monogr" mode="xces"/>
  </xaira:bibliography>
</xsl:template>
<xsl:template match="*" mode="xces">
  <xsl:element name="{local-name()}">
    <xsl:apply-templates select="@*" mode="xces"/>
    <xsl:apply-templates mode="xces"/>
  </xsl:element>
</xsl:template>
<xsl:template match="@*|text()" mode="xces">
  <xsl:copy-of select="."/>
</xsl:template>
<!-- Make sure nothing else produces output -->
<xsl:template match="text()"/>
Click the "Ok" button and go for another coffee.
Note: You can cut and paste the above from the bib.xsl.txt file.
- Save the corpus again.
- Select "Special tags..." from the "Tools" menu. Select "Word break" from the combo box and "tok" in the Tags list. Click Ok.
- Select "Additional keys..." from the Tools menu and add the following two keys:
- For part of speech
- Name : POS
- Description : Part of speech
- Element : tok
- Attribute : msd
- Proc : Use value
- For lemmata
- Name : Lemma
- Description : Lemma
- Element : tok
- Attribute : base
- Proc : Use value
- Lemma scheme : checked
Click Ok to close the Additional keys dialog.
- Select "Indexer -> Run" from the "Tools" menu. Wait. Try to time this so you can select Run, turn off the lights, and
go home for the night.
- Select "XCorpus file..." from the "Tools" menu. Click the OK button.
- Close IndexTools.
You should now be able to open and query the ANC with Xaira.
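If you need to repeat the setup, the file preparation at the start of the steps (unpacking ANCXaira.zip, unpacking nyt-headers-fixed.zip, and creating the "texts" directory) could be scripted instead of done by hand. The following Python sketch is one possible approach and is not part of the ANC or Xaira distributions. The function name prepare_anc_home is hypothetical, and the sketch assumes that nyt-headers-fixed.zip sits at the top level of ANCXaira.zip, which the instructions do not state explicitly. The BulkXsltW and IndexTools steps still have to be run from their own windows.

```python
import zipfile
from pathlib import Path


def prepare_anc_home(anc_home, ancxaira_zip):
    """Unpack both archives into anc_home and create the texts directory.

    Hypothetical helper: the name and signature are illustrative only.
    """
    home = Path(anc_home)

    # Unpack ANCXaira.zip into the ANC home directory.
    with zipfile.ZipFile(ancxaira_zip) as archive:
        archive.extractall(home)

    # Unpack the repaired NY Times headers into the ANC home directory too.
    # Assumption: nyt-headers-fixed.zip was at the top level of ANCXaira.zip,
    # so it is now directly inside the ANC home directory.
    with zipfile.ZipFile(home / "nyt-headers-fixed.zip") as headers:
        headers.extractall(home)

    # Create the output directory used later by the BulkXsltW transform.
    texts = home / "texts"
    texts.mkdir(exist_ok=True)
    return texts
```

Because the headers archive mirrors the merged\nytimes directory structure, extracting it into the ANC home directory overwrites the old headers in place, just as the manual unzip does.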
Copyright 2002-2004 American National Corpus Project. All rights reserved.