CMPU 102 - Assignment 9
Short Document Concordance

Assigned: Monday, Apr 28

Due: Monday, May 5


A Concordance is a listing of all of the words appearing in a short document together with the line number(s) on which they appear.

In this assignment you will:

  1. Implement a Concordance using the classes HashMap and PriorityQueue from the JCF (Java Collections Framework).
  2. Complete the classes Concordance and WordRecord as outlined in the starter project.
  3. Gain experience using classes from the JCF.
  4. Scan through the input file line by line, extract each word from each line, construct a WordRecord, and put it in the HashMap.
  5. After all of the words in the document have been added to the Concordance, enter the contents of the HashMap into a PriorityQueue, then perform successive removes of WordRecord objects from the PriorityQueue to write their contents to an output file.
  6. Use the submit102 script from the Linux prompt to hand in your project.

Summary: We have provided the starter project with the three classes you will need. You will complete the classes WordRecord and Concordance as instructed in the comments.  You will submit your project and hand in a hard copy of the code and input and output files.  

Download the starter project

  1. Click here to download the starter project: Assign9.zip
  2. Save it in your cs102 directory.
  3. Unzip the file.
  4. You should be able to open this as an existing project in NetBeans (described next).

Launch NetBeans and Open Project

  1. From NetBeans, go to the File menu and select "Open Project" -- or click on the third button from the left on the button bar.
  2. Navigate to your cs102 folder and select the folder named "Assign9" 
  3. The "Open Project Folder" button should be enabled. Click it to open the starter project.
  4. The classes you need to modify are:  WordRecord and Concordance.  

The Concordance Class

The instance variables for this class are:


where concord will contain the words and their occurrences in the input text file.  You will implement the following methods: 

In makeConcordance you will declare and initialize a Scanner to read the input file, textFile.  You must wrap a Scanner around a File attached to the fileName.  With this scanner you will repeatedly read a line from the file and increment the line number count.  Inside this outer loop you will use a StringTokenizer to extract each word from the line (a String).  When you create your StringTokenizer you will supply two parameters: the String obtained by the scanner, and a second String listing the delimiters.  The delimiters are the punctuation characters plus the space, tab, and newline characters.  Specifying them prevents punctuation from being attached to the beginning or end of a word.  You may declare a delimiter string as follows: 

String delims = ",.?;:-!)(\"\n\t ";   
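As a quick check of what this delimiter string does, here is a minimal sketch that tokenizes one line. The class name TokenizeDemo and the method tokenize are made up for illustration and are not part of the starter project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; not part of the starter project.
class TokenizeDemo {
    // The delimiter string given in the assignment text.
    static final String DELIMS = ",.?;:-!)(\"\n\t ";

    // Returns the words found in one line, with the delimiters stripped.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, DELIMS);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Hello, world, Is, this, a, test]
        System.out.println(tokenize("Hello, world! (Is this: a test?)"));
    }
}
```

Note that the apostrophe is not in the delimiter string, so a word such as don't comes back as a single token.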

Use the StringTokenizer to extract each word from the input line.  Next, determine whether the word (the key) is contained in the HashMap.  If it is not present, form a WordRecord object and put it in the HashMap.  If the word is already present, add the current lineNumber to the list of line numbers on which the word appears.  This is done by first removing the WordRecord from the HashMap, adding the new line number to the object retrieved, and then putting it back in the HashMap.
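The read-and-update loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: SimpleRecord is a hypothetical stand-in for the starter project's WordRecord, and the Scanner reads from a String here rather than from a File so that the example runs on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;

// Hypothetical demo class; not part of the starter project.
class BuildMapDemo {
    // Hypothetical stand-in for WordRecord: a word plus its line numbers.
    static class SimpleRecord {
        final String word;
        final List<Integer> lines = new ArrayList<>();

        SimpleRecord(String word, int firstLine) {
            this.word = word;
            lines.add(firstLine);
        }
    }

    static HashMap<String, SimpleRecord> build(Scanner in) {
        String delims = ",.?;:-!)(\"\n\t ";
        HashMap<String, SimpleRecord> concord = new HashMap<>();
        int lineNumber = 0;
        while (in.hasNextLine()) {
            String line = in.nextLine();
            lineNumber++;
            StringTokenizer st = new StringTokenizer(line, delims);
            while (st.hasMoreTokens()) {
                String word = st.nextToken();
                if (!concord.containsKey(word)) {
                    // First occurrence: create a new record for this word.
                    concord.put(word, new SimpleRecord(word, lineNumber));
                } else {
                    // Remove, update, and put back, as the text describes.
                    // This stand-in does not skip repeated line numbers.
                    SimpleRecord rec = concord.remove(word);
                    rec.lines.add(lineNumber);
                    concord.put(word, rec);
                }
            }
        }
        return concord;
    }

    public static void main(String[] args) {
        // Assumption: a String source instead of the assignment's File.
        Scanner in = new Scanner("to be or\nnot to be");
        HashMap<String, SimpleRecord> concord = build(in);
        System.out.println(concord.get("to").lines); // prints [1, 2]
        System.out.println(concord.get("or").lines); // prints [1]
    }
}
```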

When all of the lines from the input file have been read and all of the words extracted and added to the HashMap, the method writeConcordance is called.  You will need to create a PrintStream that wraps a FileOutputStream attached to a File with the fileName supplied as a parameter to this method.  You will construct a PriorityQueue<WordRecord> and add the Collection obtained from the HashMap.  While this priority queue is not empty, successively remove each WordRecord and write the word and its list of line numbers to the output file.    
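Here is a rough sketch of the output step, under some assumptions. Entry is a hypothetical stand-in for WordRecord that keeps its line numbers as a ready-made String, the method name writeAll is made up, and the demo writes to a temporary file instead of the file named on the command line.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Collection;
import java.util.HashMap;
import java.util.PriorityQueue;

// Hypothetical demo class; not part of the starter project.
class WriteDemo {
    // Hypothetical stand-in for WordRecord, ordered alphabetically by word.
    static class Entry implements Comparable<Entry> {
        final String word;
        final String lines;

        Entry(String word, String lines) {
            this.word = word;
            this.lines = lines;
        }

        @Override
        public int compareTo(Entry other) {
            return word.compareTo(other.word);
        }
    }

    // Loads the records into a priority queue and prints them in order.
    static void writeAll(Collection<Entry> records, PrintStream out) {
        PriorityQueue<Entry> pq = new PriorityQueue<>();
        pq.addAll(records);
        while (!pq.isEmpty()) {
            Entry e = pq.remove(); // smallest remaining word
            out.println(e.word + ": " + e.lines);
        }
    }

    public static void main(String[] args) throws IOException {
        HashMap<String, Entry> concord = new HashMap<>();
        concord.put("be", new Entry("be", "1, 2,"));
        concord.put("a", new Entry("a", "4, 7,"));
        concord.put("and", new Entry("and", "4, 5,"));

        // Assumption: a temporary file stands in for the real output file.
        File outFile = File.createTempFile("Concordance", ".txt");
        try (PrintStream out = new PrintStream(new FileOutputStream(outFile))) {
            writeAll(concord.values(), out);
        }
        System.out.println("Wrote " + outFile.getPath());
    }
}
```

The HashMap stores its entries in no particular order. The alphabetical order of the output comes entirely from the priority queue, which uses compareTo to decide which record to remove next.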

 General note: The online Java API is your friend. Consult it to determine which methods to use and how to use them correctly.

 The WordRecord Class

Class WordRecord implements the Comparable interface.  A WordRecord has a word (String) and an ArrayList<Integer> (for holding line numbers) as instance variables.  This class has the following methods: 


All of the methods are very straightforward.  Your implementation of toString should produce a string containing the word and a list of the line numbers on which it appears.  When adding a new line number, you should check that it is not the same as the last line number that was added.  You should not record duplicate line numbers when the word appears more than once on the same line.
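To illustrate the duplicate check and the ordering, here is one possible sketch. The class name WordRecordSketch, the method name addLine, and the exact toString format are assumptions made for this example; follow the names and comments in the starter project for your actual WordRecord.

```java
import java.util.ArrayList;

// Hypothetical sketch; the real WordRecord is outlined in the starter project.
class WordRecordSketch implements Comparable<WordRecordSketch> {
    private final String word;
    private final ArrayList<Integer> lines = new ArrayList<>();

    // A record is created when a word is first seen, so it starts with one line.
    WordRecordSketch(String word, int lineNumber) {
        this.word = word;
        lines.add(lineNumber);
    }

    // Adds a line number unless it equals the most recently added one.
    void addLine(int lineNumber) {
        int last = lines.get(lines.size() - 1);
        if (last != lineNumber) {
            lines.add(lineNumber);
        }
    }

    // Alphabetical order by word, which a PriorityQueue uses for removal order.
    @Override
    public int compareTo(WordRecordSketch other) {
        return word.compareTo(other.word);
    }

    // Assumed format: the word, a colon, a tab, then "n, " for each line number.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(word).append(":\t");
        for (int n : lines) {
            sb.append(n).append(", ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        WordRecordSketch rec = new WordRecordSketch("be", 1);
        rec.addLine(1); // same line again: ignored
        rec.addLine(2);
        System.out.println(rec); // prints be:, a tab, then 1, 2,
    }
}
```

Comparing only against the last entry is enough because line numbers arrive in increasing order as the file is read.  Also note that String.compareTo is case-sensitive, so capitalized words sort ahead of lowercase ones unless the words are normalized first.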

The Main Class

The main class has only a main method, which has been implemented for you.  This method gets the name of the input and output files from the command line (args[0] and args[1]).  Then it creates a Concordance object and, inside a try block, makes a concordance and then writes it to the output file.  

Complete the partially implemented classes as detailed above.  Follow instructions included with the comments to implement each method. 

Good luck!

 

When you complete your program and run it, a sample fragment of the output (in file Concordance.txt) should look like the listing below.  The input text is from the beginning paragraphs of David Copperfield.   

 

a:    4, 7, 12, 16,

acquainted:       9,

all: 11, 17,

and: 4, 5, 6, 7, 10, 18,

any: 8,

anybody:    2, 18,

as:   3, 11,

at:   4, 17,

attaching: 11,

baby:       17,

be:   1, 2, 9, 18,

because:    13,

becoming:   9,

been:       3, 18,

before:     8,

began:      5,

begin:      2,

beginning: 3,

…… … …


Submitting your solution

From a terminal window, type the following commands:

cd
cd cs102 
submit102 Assign9