Assigned | Wednesday, 10 September |
Due | Wednesday, 17 September, 1:30 p.m. |
Set up
Task: Download and extract asmt1.zip.
This assignment uses the click Python module for processing command-line arguments, so make sure to install it before continuing:
$ pip install -r requirements.txt
Like most Python projects, future assignments will also use a requirements.txt file to specify any dependencies; you’ll run the same command to install them.
This also installs pytest, a testing framework. The starter code includes test functions that you can run to check your implementation:
$ pytest tokenizer.py
$ pytest segmenter.py
These tests will help verify that your functions work correctly. To run all tests at once, simply use pytest without any arguments.
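If you haven’t used pytest before: a test is just a function whose name starts with test_ and that makes assert statements. As a purely hypothetical illustration (the actual tests in the starter code will differ, and this assumes get_words is defined in tokenizer.py), a test might look like:

from tokenizer import get_words

def test_get_words_lower():
    # With lower=True, every word that comes back should be lowercase.
    for word in get_words("The Cat sat", lower=True):
        assert word == word.lower()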
Part 1: Tokenization and word frequencies 35 points
1.1: Alice in Wordland
In class on Monday, we looked at some copies of Edgar Allan Poe’s writing made available by Project Gutenberg. For this assignment, we’ll start by looking at another book they provide, Lewis Carroll’s Alice’s Adventures in Wonderland, which you’ll find in gutenberg/carroll-alice.txt.
Task: In the main function of tokenizer.py, read in the text of the book. (Specify the filename directly rather than making it a command-line argument.)
Task: Find the 20 most frequent words in the book by calling get_words (with lower=True) on the text and running words_by_frequency on the resulting list of words.
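For reference, a minimal sketch of what this step might look like is below. It assumes (check the starter code) that words_by_frequency returns (word, count) pairs sorted from most to least frequent; the filename and encoding follow the pathlib example later in this assignment.

def main():
    # Read the whole book as one string.
    with open("gutenberg/carroll-alice.txt", encoding="latin1") as f:
        text = f.read()
    words = get_words(text, lower=True)
    # Print the 20 most frequent words with their counts.
    for word, count in words_by_frequency(words)[:20]:
        print(word, count)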
Check-in
If everything’s working, the counts you get should be:
the | 1603 |
and | 766 |
to | 706 |
a | 614 |
she | 518 |
… | … |
And, a bit further down, you should see that alice occurs 221 times. However, the word alice actually appears 398 times in the text! So why isn’t that what the code finds? (Recall the issues we discussed in class!) If you’re not sure, examine the data to see if you can figure it out before continuing.
1.2: Punctuation 10 points
There’s a deficiency in the provided get_words function! When we calculate word frequencies, we probably don’t care whether the word was adjacent to a punctuation mark. For example, the word hatter appears in the text 57 times, but count_words says it appears only 24 times. Because it sometimes appears next to a punctuation mark, those instances are being counted separately:
hatter | 24 |
hatter. | 13 |
hatter, | 10 |
hatter: | 6 |
hatters | 1 |
hatter's | 1 |
hatter; | 1 |
hatter.' | 1 |
Our get_words function would be better if it separated punctuation from words. We can accomplish this by using the re.split function.
Below is a small example that demonstrates how the string .split() method works on a small text and compares it to using re.split() with parentheses around \W to capture the separators. You can follow along in the interpreter by running python -i tokenizer.py.
>>> text = '"Oh no, no," said the little Fly, "to ask me is in vain."'
>>> text.split()
['"Oh', 'no,', 'no,"', 'said', 'the', 'little', 'Fly,', '"to', 'ask', 'me', 'is', 'in', 'vain."']
>>> re.split(r"(\W)", text)
['', '"', 'Oh', ' ', 'no', ',', '', ' ', 'no', ',', '', '"', '', ' ', 'said', ' ', 'the', ' ', 'little', ' ', 'Fly', ',', '', ' ', '', '"', 'to', ' ', 'ask', ' ', 'me', ' ', 'is', ' ', 'in', ' ', 'vain', '.', '', '"', '']
(While in class we used re.findall to specify what a token looked like, now we’re specifying what the boundary between tokens looks like: namely, a \W, something that isn’t a word-like character. The parentheses around \W are crucial: they tell re.split to include the separators in the result, as described in the Python documentation.)
Note that this isn’t quite what we want, but we’re getting closer, since we’ve separated the punctuation from the words.
Task: Using the above example as a guide, write and test a function called tokenize that takes a string as input and returns a list of words and punctuation, but no extraneous spaces or empty strings.
Like get_words, tokenize should take an optional argument lower that defaults to False and determines whether the string should be case-normalized before separating the words.
Hint: You don’t need to modify the re.split() line; just remove the strings that contain only whitespace (or nothing at all).
>>> tokenize(text, lower=True)
['"', 'oh', 'no', ',', 'no', ',', '"', 'said', 'the', 'little', 'fly', ',', '"', 'to', 'ask', 'me', 'is', 'in', 'vain', '.', '"']
>>> print(" ".join(tokenize(text, lower=True)))
" oh no , no , " said the little fly , " to ask me is in vain . "
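For reference, one possible shape for tokenize is sketched below. It follows the hint above and is not the only correct implementation.

import re

def tokenize(text, lower=False):
    """Split text into words and punctuation, dropping whitespace and empty strings."""
    if lower:
        text = text.lower()
    # The capturing parentheses make re.split keep the separators.
    pieces = re.split(r"(\W)", text)
    # Discard pieces that are empty or contain only whitespace.
    return [piece for piece in pieces if piece.strip()]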
Checking in
Use your tokenize function in conjunction with the words_by_frequency function to list the top five most frequent words in carroll-alice.txt. You should get this:
' (single quote) | 2871 |
, (comma) | 2418 |
the | 1642 |
. (period) | 988 |
and | 872 |
1.3: Getting words 10 points
Task: Write a function called filter_nonwords that takes a list of strings as input and returns a new list of strings that excludes anything that isn’t entirely alphabetic. Use the string .isalpha() method to determine whether a string consists only of alphabetic characters.
>>> text = '"Oh no, no," said the little Fly, "to ask me is in vain."'
>>> tokens = tokenize(text, lower=True)
>>> filter_nonwords(tokens)
['oh', 'no', 'no', 'said', 'the', 'little', 'fly', 'to', 'ask', 'me', 'is', 'in', 'vain']
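A minimal version of this function (a sketch; yours may differ) can be a single list comprehension:

def filter_nonwords(tokens):
    """Keep only the tokens made up entirely of alphabetic characters."""
    return [token for token in tokens if token.isalpha()]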
Checking in
Use this function to list the top five most frequent words in carroll-alice.txt. Confirm that you get the following before moving on:
the | 1642 |
and | 872 |
to | 729 |
a | 632 |
it | 595 |
1.4: Most frequent words 5 points
Task: Iterate through all of the files in the gutenberg directory that end in .txt (that is, skip the README) and print the top five words for each.
To get a list of all the files in a directory, use Python’s built-in pathlib module:
from pathlib import Path

for book in Path("gutenberg").iterdir():
    with open(book, encoding="latin1") as f:
        ...
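If you’d rather not check each filename yourself, Path.glob with the pattern "*.txt" yields only the files that match, which takes care of skipping the README. For example:

from pathlib import Path

for book in Path("gutenberg").glob("*.txt"):   # only the .txt files
    with open(book, encoding="latin1") as f:
        text = f.read()
        # ... tokenize, filter, and count as in the previous parts ...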
1.5: Token analysis questions 10 points
Task: Answer the following questions in your assignment writeup, which should be submitted as a PDF on Gradescope along with your Python files. Feel free to use LaTeX, Microsoft Word, or whatever you find convenient, as long as you can export to PDF.
- Most frequent word: Loop through all the files in the gutenberg directory that end in .txt. Is the always the most common word? If not, what are some other words that show up as the most frequent word?
- Impact of lowercasing: If you don’t lowercase all the words before you count them, how does this result change, if at all?
Part 2: Sentence segmentation 55 points
While tokenization divides a text into a list of word tokens, sentence segmentation divides it into a list of sentences. In this part of the assignment, you will write a simple sentence segmenter.
The brown directory includes three text files taken from the Brown Corpus: editorial.txt, fiction.txt, and lore.txt.
The files do not indicate where one sentence ends and the next begins. In the data set you are working with, sentences can only end with one of five characters: period, colon, semicolon, exclamation point, and question mark.
However, there is a catch: not every period represents the end of a sentence. There are many abbreviations (U.S.A., Dr., Mon., etc., etc.) that can appear in the middle of a sentence, where the periods do not indicate the end of a sentence. There are also many examples where a colon is not the end of the sentence. The other three punctuation marks are all nearly unambiguously the ends of a sentence. (Yes, even semicolons!)
For each of the above files, I have also provided a file containing the token index (counting from 0) of each of the actual locations of the ends of sentences: editorial-eos.txt, fiction-eos.txt, and lore-eos.txt.
2.1: Getting started
The provided segmenter.py has some starter code and can be run from the command line. When it’s called from the command line, it takes one required argument and two optional arguments:
$ ./segmenter.py --help
Usage: segmenter.py [OPTIONS]

  Segments text into sentences.

Options:
  -t, --textfile FILENAME         Unlabeled text file.  [required]
  -y, --hypothesis-file FILENAME  Write hypothesized boundaries to FILENAME
                                  (default: stdout).
  --help                          Show this message and exit.
Make sure you understand how this code uses the click module to process command-line arguments. The most confusing part of this may be that click is able to open files directly for you. For instance, the --textfile argument given in the segmenter code includes the type click.File("r"), meaning that the resulting parameter textfile will be a pointer to the text file in “read” mode. Since this file is already open, you should not call open() on it!
The segmenter.py starter code imports the tokenize function from the last section.
Checking in
Confirm that you can pass in brown/editorial.txt as the textfile argument, and that running your code from the previous part on it splits it into a list of 63,333 tokens.
Note: Do not filter out punctuation – these tokens will be exactly the ones we want to consider as potential sentence boundaries!
2.2: Baseline segmenter 10 points
The starter code contains a function called baseline_segmenter that takes a list of tokens as its only argument. It returns a list of tokenized sentences; that is, a list of lists of words, with one list per sentence.
>>> baseline_segmenter(tokenize("I am Sam. Sam I am."))
[['I', 'am', 'Sam', '.'], ['Sam', 'I', 'am', '.']]
Remember that every sentence in our data set ends with one of the five tokens (".", ":", ";", "!", and "?"). Since it’s just a baseline approach, baseline_segmenter predicts that every instance of one of these characters is the end of a sentence.
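The starter code already provides this function, but conceptually it amounts to something like the following sketch (your copy may be written differently):

def baseline_segmenter(tokens):
    """Predict that every '.', ':', ';', '!', and '?' ends a sentence."""
    sentences = []
    current = []
    for token in tokens:
        current.append(token)
        if token in {".", ":", ";", "!", "?"}:
            sentences.append(current)
            current = []
    if current:              # tokens after the last boundary, if any
        sentences.append(current)
    return sentences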
Task: Fill in the function write_sentence_boundaries. This function takes two arguments: a list of lists of strings (like the one returned by baseline_segmenter) and a writeable pointer to either an open file or sys.stdout.
You will need to loop through all of the sentences in the document. For each sentence, you will want to write the index of the last token in the sentence to the filepointer on a new line.
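In other words, keep a running token index as you go. A sketch of the idea (assuming the arguments described above):

def write_sentence_boundaries(sentences, outfile):
    """Write the index of each sentence's final token, one index per line."""
    index = -1
    for sentence in sentences:
        index += len(sentence)          # index of this sentence's last token
        print(index, file=outfile)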
Checking in
Confirm that when you run baseline_segmenter on the tokens from editorial.txt, it predicts 3278 sentence boundaries, and that the first five predicted boundaries are at tokens 22, 54, 74, 99, and 131.
2.3: Evaluating the segmenter 5 points
To evaluate your system, I am providing you a program called evaluate.py that compares your hypothesized sentence boundaries with the ground-truth boundaries. This program will report to you the true positives, true negatives, false positives, and false negatives, along with some other metrics (precision, recall, F-measure), which may be new to you. You can run evaluate.py with the --help option to see all of the command-line options that it supports.
Task: If you didn’t do so for the checkpoint above, fill in the main function in segmenter.py to read the text from textfile, tokenize it, predict the sentence boundaries using baseline_segmenter, and write the boundaries to hypothesis_file.
A sample run, first writing the output of the baseline segmenter to the file editorial.hyp, would be:
$ ./segmenter.py -t brown/editorial.txt -y editorial.hyp
$ ./evaluate.py -d brown/ -c editorial -y editorial.hyp
Checking in
Confirm that you get the following before moving on:
TP: 2719
FN: 0
FP: 559
TN: 60055
PRECISION: 82.95%
RECALL: 100.00%
F: 90.68%
(A quick aside: this is a good case for why we like to think about F1 score to help reason about acceptable tradeoffs of precision and recall. If only 25% of the punctuation marks we retrieve were true sentence boundaries, we would get recall of 100% still, precision of 25%, and an F1 of 40% – much lower than the arithmetic average of 62.5% we would get from precision and recall. The further either precision or recall gets from 1, the more it affects the F1 score.)
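(For reference, F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R). With P = 0.25 and R = 1.00, that gives 2(0.25)(1.00) / 1.25 = 0.40, i.e. 40%.)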
2.4: Improving the segmenter 30 points
Now it’s time to improve the baseline sentence segmenter. Every time we encounter one of the punctuation marks that can end a sentence, we predict that it actually is ending a sentence. Therefore, we don’t have any false negatives! However, we have quite a few false positives.
Task: Fill in the improved_segmenter function to do a better job identifying sentence boundaries, and switch segmenter.py to use improved_segmenter.
You can see the types of tokens that your system is mis-characterizing by setting the verbosity of evaluate.py to something greater than 0. Setting it to 1 will print out all of the false positives and false negatives so you can work to improve your improved_segmenter function.
To test your segmenter, I will run it on a hidden text you haven’t seen yet. It may not make sense to spend a lot of time fixing obscure cases that show up in the three texts I am providing, since those cases may never appear in the hidden text. But it is important to handle cases that occur multiple times, and even some cases that appear only once if you suspect they could appear in the hidden text. Write your code to be as general as possible so that it works well on the hidden text without producing too many false positives.
Be sure to test your segmenter on all the provided categories of text, not just “editorial”!
2.5: Segmentation questions 10 points
Task: Answer the following questions in your assignment writeup.
- Describe (using the metrics from the evaluation script) the performance of your final segmenter.
- Describe at least three things that your final segmenter does better than the baseline segmenter. What cases are you most proud of catching in your segmenter? Include specific examples that are handled well.
- Describe at least three places where your segmenter still makes mistakes. Include specific examples where your segmenter makes the wrong decision. If you had another week to work on this, what would you add? If you had the rest of the semester to work on it, what would you do?
Submitting the assignment
Submit your tokenizer.py, segmenter.py, and assignment write-up on Gradescope. (Don’t submit the texts or any other files.)
Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.