Assignment 1: Words and Sentences

Assigned Wednesday, 10 September
Due Wednesday, 17 September, 1:30 p.m.

Contents

Set up

Task: Download and extract asmt1.zip.

This assignment uses the click Python module for processing command-line arguments, so make sure to install it before continuing:

$ pip install -r requirements.txt

Future assignments, like most Python projects, will also use a requirements.txt file to specify any dependencies; you’ll run the same command to install them.

This also installs pytest, a testing framework. The starter code includes test functions that you can run to check your implementation:

$ pytest tokenizer.py
$ pytest segmenter.py

These tests will help verify that your functions work correctly. To run all tests at once, simply use pytest without any arguments.
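
For reference, a pytest test is just a function whose name starts with test_ and whose body makes assert statements; the tests in the starter code follow this shape (the example below is only an illustration, not one of the actual tests):

def test_lowercasing():
    # pytest reports a failure if any assert in the function is false
    assert "Alice".lower() == "alice"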

Part 1: Tokenization and word frequencies 35 points

1.1: Alice in Wordland

In class on Monday, we looked at some copies of Edgar Allan Poe’s writing made available by Project Gutenberg. For this assignment, we’ll start by looking at another book they provide, Lewis Carroll’s Alice’s Adventures in Wonderland, which you’ll find in gutenberg/carroll-alice.txt.

Task: In the main function of tokenizer.py, read in the text of the book. (Specify the filename directly rather than making it a command-line argument.)

Task: Find the 20 most frequent words in the book by calling get_words (with lower=True) on the text and running words_by_frequency on the resulting list of words.
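
If you’re unsure how to structure this, the body of main might look roughly like the sketch below. It assumes words_by_frequency returns (word, count) pairs sorted from most to least frequent; check the starter code to confirm its exact return value.

def main():
    # Read the full text of the book (the encoding matches the Gutenberg files).
    with open("gutenberg/carroll-alice.txt", encoding="latin1") as f:
        text = f.read()
    words = get_words(text, lower=True)
    # Print the 20 most frequent words and their counts.
    for word, count in words_by_frequency(words)[:20]:
        print(word, count)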

Check-in

If everything’s working, the counts you get should be:

the     1603
and      766
to       706
a        614
she      518

And, a bit further down, you should see that alice occurs 221 times.

However, the word alice actually appears 398 times in the text! So, why isn’t this what the code finds? (Recall the issues we discussed in class!) If you’re not sure, examine the data to see if you can figure it out before continuing.

1.2: Punctuation 10 points

There’s a deficiency in the provided get_words function! When we calculate word frequencies, we probably don’t care whether the word was adjacent to a punctuation mark. For example, the word hatter appears in the text 57 times, but count_words says it appears only 24 times. Because it sometimes appears next to a punctuation mark, those instances are being counted separately:

hatter      24
hatter.     13
hatter,     10
hatter:      6
hatters      1
hatter's     1
hatter;      1
hatter.'     1

Our get_words function would be better if it separated punctuation from words. We can accomplish this by using the re.split function.

Below is a small example demonstrating how the string .split() method behaves on a short text, compared with re.split() using parentheses around \W to capture the separators:

Examples starting with >>> show Python in interactive mode. To load your definitions and then interact with them, run python -i tokenizer.py.
>>> text = '"Oh no, no," said the little Fly, "to ask me is in vain."'
>>> text.split()
['"Oh', 'no,', 'no,"', 'said', 'the', 'little', 'Fly,', '"to', 'ask', 'me', 'is', 'in', 'vain."']
>>> re.split(r"(\W)", text)
['', '"', 'Oh', ' ', 'no', ',', '', ' ', 'no', ',', '', '"', '', ' ', 'said', ' ', 'the', ' ', 'little', ' ', 'Fly', ',', '', ' ', '', '"', 'to', ' ', 'ask', ' ', 'me', ' ', 'is', ' ', 'in', ' ', 'vain', '.', '', '"', '']

(While in class we used re.findall to specify what a token looks like, here we’re specifying what the boundary between tokens looks like – namely, a \W, any character that isn’t a word character. The parentheses around \W are crucial – they tell re.split to include the separators in the result, as described in the Python documentation for re.split.)

Note that this isn’t quite what we want, but we’re getting closer, since we’ve separated the punctuation from the words.

Task: Using the above example as a guide, write and test a function called tokenize that takes a string as an input and returns a list of words and punctuation, but not extraneous spaces and empty strings.

Like get_words, tokenize should take an optional argument lower that defaults to False to determine whether the string should be case-normalized before separating the words.

Hint: You don’t need to modify the re.split() line; just remove the strings that only contain whitespace (or nothing at all).

>>> tokenize(text, lower=True)
['"', 'oh', 'no', ',', 'no', ',', '"', 'said', 'the', 'little', 'fly', ',', '"', 'to', 'ask', 'me', 'is', 'in', 'vain', '.', '"']
>>> print(" ".join(tokenize(text, lower=True)))
" oh no , no , " said the little fly , " to ask me is in vain . "

Checking in

Use your tokenize function in conjunction with the words_by_frequency function to list the five most frequent tokens in carroll-alice.txt. You should get this:

' (single quote) 2871
, (comma) 2418
the 1642
. (period) 988
and 872

1.3: Getting words 10 points

Task: Write a function called filter_nonwords that takes a list of strings as input and returns a new list of strings that excludes anything that isn’t entirely alphabetic. Use the string .isalpha() method to determine whether a string consists only of alphabetic characters.

>>> text = '"Oh no, no," said the little Fly, "to ask me is in vain."'
>>> tokens = tokenize(text, lower=True)
>>> filter_nonwords(tokens)
['oh', 'no', 'no', 'said', 'the', 'little', 'fly', 'to', 'ask', 'me', 'is', 'in', 'vain']
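
A minimal sketch, assuming the input is a list of string tokens:

def filter_nonwords(tokens):
    # str.isalpha() is False for punctuation, digits, and mixed strings.
    return [token for token in tokens if token.isalpha()]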

Checking in

Use this function to list the top five most frequent words in carroll-alice.txt. Confirm that you get the following before moving on:

the     1642
and      872
to       729
a        632
it       595

1.4: Most frequent words 5 points

Task: Iterate through all of the files in the gutenberg directory that end in .txt (that is, skip the README) and print the top five words for each.

To get a list of all the files in a directory, use Python’s built-in pathlib module:

from pathlib import Path

for book in Path("gutenberg").iterdir():
    if book.suffix != ".txt":   # skip the README
        continue
    with open(book, encoding="latin1") as f:
        ...

1.5: Token analysis questions 10 points

Task: Answer the following questions in your assignment writeup, which should be submitted as a PDF on Gradescope along with your Python files. Feel free to use LaTeX, Microsoft Word, or whatever you find convenient, as long as you can export to PDF.

  1. Most frequent word: Loop through all the files in the gutenberg directory that end in .txt. Is “the” always the most common word? If not, what are some other words that show up as the most frequent word?
  2. Impact of lowercasing: If you don’t lowercase all the words before you count them, how does this result change, if at all?

Part 2: Sentence segmentation 55 points

While tokenization divides a text into a list of word tokens, sentence segmentation divides it into a list of sentences. In this part of the assignment, you will write a simple sentence segmenter.

The brown directory includes three text files taken from the Brown Corpus:

  • editorial.txt,
  • fiction.txt, and
  • lore.txt.

The files do not indicate where one sentence ends and the next begins. In the data set you are working with, sentences can only end with one of five characters: period, colon, semicolon, exclamation point, and question mark.

However, there is a catch: not every period represents the end of a sentence. There are many abbreviations (U.S.A., Dr., Mon., etc., etc.) that can appear in the middle of a sentence, where their periods do not indicate the end of a sentence. There are also many examples where a colon is not the end of a sentence. The other three punctuation marks are nearly always genuine ends of sentences. (Yes, even semicolons!)

For each of the above files, I have also provided a file containing the token index (counting from 0) of each actual sentence-ending location:

  • editorial-eos.txt,
  • fiction-eos.txt, and
  • lore-eos.txt

Your job is to write a sentence segmenter and to output the predicted token index of each sentence boundary.

2.1: Getting started

The provided segmenter.py has some starter code and can be run from the command line. When it’s called from the command line, it takes one required argument and two optional arguments:

$ ./segmenter.py --help
Usage: segmenter.py [OPTIONS]

  Segments text into sentences.

Options:
  -t, --textfile FILENAME         Unlabeled text file.  [required]
  -y, --hypothesis-file FILENAME  Write hypothesized boundaries to FILENAME
                                  (default: stdout).
  --help                          Show this message and exit.

Make sure you understand how this code uses the click module to process command-line arguments. The most confusing part may be that click can open files for you. For instance, the --textfile argument in the segmenter code has the type click.File("r"), meaning that the resulting parameter textfile will be an already-open file object in read mode. Since the file is already open, you should not call open() on it!
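
Concretely, the option declarations in the starter code look roughly like this (the exact help strings and function name may differ; in click, a File type with default "-" means standard input or output):

import click

@click.command()
@click.option("-t", "--textfile", type=click.File("r"), required=True,
              help="Unlabeled text file.")
@click.option("-y", "--hypothesis-file", type=click.File("w"), default="-",
              help="Write hypothesized boundaries to FILENAME (default: stdout).")
def main(textfile, hypothesis_file):
    text = textfile.read()   # textfile is already open; do not call open() on it
    ...

if __name__ == "__main__":
    main()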

The segmenter.py starter code imports the tokenize function from the last section.

Checking in

Confirm that you can pass in brown/editorial.txt as the textfile argument, and that running your code from the previous part on it splits it into a list of 63,333 tokens.

Note: Do not filter out punctuation – these tokens will be exactly the ones we want to consider as potential sentence boundaries!

2.2: Baseline segmenter 10 points

The starter code contains a function called baseline_segmenter that takes a list of tokens as its only argument. It returns a list of tokenized sentences – that is, a list of lists of tokens, with one list per sentence.

>>> baseline_segmenter(tokenize("I am Sam. Sam I am."))
[['I', 'am', 'Sam', '.'], ['Sam', 'I', 'am', '.']]

Remember that every sentence in our data set ends with one of the five tokens (".", ":", ";", "!", and "?"). Since it’s just a baseline approach, baseline_segmenter predicts that every instance of one of these characters is the end of a sentence.
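
Conceptually, the provided baseline does something along these lines (read the actual starter code; this is just a sketch):

def baseline_segmenter(tokens):
    sentences = []
    current = []
    for token in tokens:
        current.append(token)
        # Treat every possible sentence-ending token as an actual boundary.
        if token in {".", ":", ";", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        # Keep any trailing tokens that follow the last boundary.
        sentences.append(current)
    return sentences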

Task: Fill in the function write_sentence_boundaries. This function takes two arguments: a list of lists of strings (like the one returned by baseline_segmenter) and a writable file object, which may be an open file or sys.stdout.

You will need to loop through all of the sentences in the document. For each sentence, write the index of its last token – counted across the whole document, not within the sentence – to the file object on its own line.
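
A minimal sketch (the parameter names here are placeholders; use whatever the starter code declares). The key point is that the index keeps counting across the entire document rather than restarting at each sentence:

def write_sentence_boundaries(sentences, outfile):
    token_index = -1
    for sentence in sentences:
        # Advance past this sentence; token_index is now its last token's index.
        token_index += len(sentence)
        outfile.write(f"{token_index}\n")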

Checking in

Confirm that when you run baseline_segmenter on the tokens from editorial.txt, it predicts 3278 sentence boundaries, and that the first five predicted boundaries are at tokens 22, 54, 74, 99, and 131.

2.3: Evaluating the segmenter 5 points

To evaluate your system, I am providing you a program called evaluate.py that compares your hypothesized sentence boundaries with the ground-truth boundaries. This program will report to you the true positives, true negatives, false positives, and false negatives, along with some other metrics (precision, recall, F-measure), which may be new to you. You can run evaluate.py with the --help option to see all of the command-line options that it supports.

Task: If you didn’t do so for the checkpoint above, fill in the main function in segmenter.py to read the text from textfile, tokenize it, predict the sentence boundaries using baseline_segmenter, and write the boundaries to hypothesis_file.

A sample run, first writing the output of the baseline segmenter to the file editorial.hyp, would be:

$ ./segmenter.py -t brown/editorial.txt -y editorial.hyp
$ ./evaluate.py -d brown/ -c editorial -y editorial.hyp

Checking in

Confirm that you get the following before moving on:

TP:    2719	FN:       0
FP:     559	TN:   60055

PRECISION: 82.95%	RECALL: 100.00%	F: 90.68%

(A quick aside: this is a good case for why we like the F1 score when reasoning about acceptable trade-offs between precision and recall. If only 25% of the punctuation marks we retrieved were true sentence boundaries, we would still get a recall of 100%, a precision of 25%, and an F1 of 2PR/(P+R) = 2(0.25)(1.00)/1.25 = 40% – much lower than the arithmetic average of 62.5%. Because F1 is the harmonic mean, the further either precision or recall falls from 1, the more it drags the score down.)

2.4: Improving the segmenter 30 points

Now it’s time to improve the baseline sentence segmenter. Every time we encounter one of the punctuation marks that can end a sentence, we predict that it actually is ending a sentence. Therefore, we don’t have any false negatives! However, we have quite a few false positives.

Task: Fill in the improved_segmenter function to do a better job identifying sentence boundaries, and switch segmenter.py to use improved_segmenter.

You can see the type of tokens that your system is mis-characterizing by setting the verbosity of evaluate.py to something greater than 0. Setting it to 1 will print out all of the false positives and false negatives so you can work to improve your improved_segmenter function.

To test your segmenter, I will run it on a hidden text you haven’t seen. It may not make sense to spend a lot of time fixing obscure cases that show up in the three provided texts, since those cases may never appear in the hidden text. It is important, though, to handle cases that occur multiple times, and even some that appear only once if you suspect they could appear in the hidden text. Write your code to be as general as possible so that it works well on the hidden text without producing too many false positives.

Be sure to test your segmenter on all the provided categories of text, not just “editorial”!

2.5: Segmentation questions 10 points

Task: Answer the following questions in your assignment writeup.

  1. Describe (using the metrics from the evaluation script) the performance of your final segmenter.
  2. Describe at least three things that your final segmenter does better than the baseline segmenter. What cases are you most proud of catching in your segmenter? Include specific examples that are handled well.
  3. Describe at least three places where your segmenter still makes mistakes. Include specific examples where your segmenter makes the wrong decision. If you had another week to work on this, what would you add? If you had the rest of the semester to work on it, what would you do?

Submitting the assignment

Submit your tokenizer.py, segmenter.py, and assignment write-up on Gradescope. (Don’t submit the texts or any other files.)

Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.