Assigned | Wednesday, 17 September |
Due | Wednesday, 24 September, 1:30 p.m. |
Set up
Task: Download and extract `asmt2.zip`.
This week, we’ll use the spaCy Python library, which is designed to make it easy to use pre-trained models to analyze very large sets of data.
Task: Install the required dependencies by running:
    $ pip install -r requirements.txt
    $ python3 -m spacy download en_core_web_sm
This installs both `pytest` (for testing) and spaCy (for text processing). After installation, run `import spacy` in Python to verify that the installation worked.
The starter code (`lm.py`) is based on example solutions to the in-class exercise from Wednesday. It loads a spaCy pipeline and includes functions to generate unigrams and bigrams based on spaCy’s tokenization of a text.

While in class we loaded just one text, *Emma*, for this assignment we’ll train our model on both *Emma* and *Persuasion*. To do this, we’ve provided a function `make_docs` that takes a list of file names and produces a single spaCy `Doc` with all the tokens from those files.
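If you want a quick sanity check that the pieces fit together, a usage sketch might look like the following. The file names here are hypothetical placeholders; use whichever files are actually included in `asmt2.zip`.

```python
# A minimal usage sketch; the file names below are hypothetical placeholders.
from lm import make_docs, get_bigrams

train_doc = make_docs(["austen-emma.txt", "austen-persuasion.txt"])
print(len(train_doc))               # total number of tokens in the combined Doc
print(get_bigrams(train_doc)[:3])   # the first few bigrams from the starter helper
```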
Part 1: Language modeling with n-grams
1.1: Trigrams
Task: Write `get_trigrams` that takes a spaCy `Doc` and returns a list of all trigrams in order.

Feel free to base this function on the provided `get_bigrams`! If you want to abstract the commonalities of `get_unigrams`, `get_bigrams`, and `get_trigrams` into a helper function, that’s also a good idea, though not required.
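For instance, a minimal sketch might look like the following. It assumes the starter helpers build tuples of lowercased token text; mirror whatever `get_bigrams` actually does in `lm.py`.

```python
def get_trigrams(doc):
    """Return a list of all (w1, w2, w3) trigrams in the Doc, in order."""
    # Assumption: the starter helpers use lowercased token text.
    tokens = [token.text.lower() for token in doc]
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
```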
1.2: N-gram counts
Now we need to store counts for all of the n-grams in our text.
Task: Write a function `get_ngram_counts` that takes your training `Doc`, uses your `get_unigrams`, `get_bigrams`, and `get_trigrams` functions to count the n-grams in the training text, and returns separate `Counter`s for the unigrams, bigrams, and trigrams.
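Assuming each helper returns a flat list of n-grams, this can be a thin wrapper around `collections.Counter`:

```python
from collections import Counter

def get_ngram_counts(doc):
    """Count the unigrams, bigrams, and trigrams in the training Doc."""
    # Assumption: get_unigrams/get_bigrams/get_trigrams each return a list of n-grams.
    unigram_counts = Counter(get_unigrams(doc))
    bigram_counts = Counter(get_bigrams(doc))
    trigram_counts = Counter(get_trigrams(doc))
    return unigram_counts, bigram_counts, trigram_counts
```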
1.3: Next word probability
Now we have all the pieces to calculate the probability of a word given the two words that come before it.
Task: Write a function called `calc_ngram_prob`. It should take a particular trigram, the bigram counts, and the trigram counts, and it should return the maximum likelihood estimate (MLE) of the probability of that trigram.
As a reminder, the formula for the MLE probability of a trigram is:
$$ P(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)} $$
For numerical stability, wrap the division in a call to `math.log` to convert it from a probability to a log probability. (Note: The reported check-in results below use natural log.) Return negative infinity (`-math.inf`) for n-grams that aren’t in the given `Counter`s.

Important: Do not call `math.log(0)`. If either the trigram count or its conditioning bigram count is 0, return `-math.inf` immediately instead of taking a log.
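Putting those rules together, a sketch of the function might look like this (assuming the counts are `Counter`s, so missing keys return 0):

```python
import math

def calc_ngram_prob(trigram, bigram_counts, trigram_counts):
    """Return the MLE log probability of the trigram, or -inf if it is unseen."""
    bigram = trigram[:2]                     # the conditioning context (w1, w2)
    tri_count = trigram_counts[trigram]      # Counter returns 0 for missing keys
    bi_count = bigram_counts[bigram]
    if tri_count == 0 or bi_count == 0:
        return -math.inf                     # never call math.log(0)
    return math.log(tri_count / bi_count)    # natural log, matching the check-in
```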
Check in
You can check that it’s working with these test trigrams:
    >>> calc_ngram_prob(("the", "day", "was"), bigram_freqs, trigram_freqs)
    -3.2386784521643803
    >>> calc_ngram_prob(("hey", "groovy", "trigram"), bigram_freqs, trigram_freqs)
    -inf
(Sadly, Jane Austen never even uses the unigram `groovy`.)
Part 2: Text generation
2.1: Possible next words
Task: Write a function `get_possible_next_words` that takes a sequence of text (as a string), the bigram counts, and the trigram counts and returns a list of (word, probability) pairs for the possible next words.

This function should use your `calc_ngram_prob` function to calculate the log probabilities of each of the candidates. The results don’t need to be sorted; we’ll do that when we use them in the next function.
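One way to sketch this is to scan the trigram counts for trigrams whose first two words match the end of the context. The whitespace split below is an assumption; running the string through the same spaCy pipeline would match the training tokenization more closely.

```python
def get_possible_next_words(text, bigram_counts, trigram_counts):
    """Return (word, log_probability) pairs for every candidate next word."""
    # Assumption: lowercasing + whitespace split approximates the training tokenization.
    context = tuple(text.lower().split()[-2:])
    candidates = []
    for trigram in trigram_counts:
        if trigram[:2] == context:
            log_prob = calc_ngram_prob(trigram, bigram_counts, trigram_counts)
            candidates.append((trigram[2], log_prob))
    return candidates
```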
2.2: Most likely next word
Task: Now write a function `predict_next_word` that uses `get_possible_next_words` to find the most likely next word for a sequence of text. Like `get_possible_next_words`, it should take as arguments the sequence of text (as a string), the bigram counts, and the trigram counts. It should return the most likely next word. If there is no possible continuation, it should return `"UNK"`.
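A minimal sketch, assuming `get_possible_next_words` returns (word, log probability) pairs:

```python
def predict_next_word(text, bigram_counts, trigram_counts):
    """Return the single most likely next word, or "UNK" if there is none."""
    candidates = get_possible_next_words(text, bigram_counts, trigram_counts)
    if not candidates:
        return "UNK"
    best_word, _ = max(candidates, key=lambda pair: pair[1])  # highest log probability wins
    return best_word
```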
Check in
The most likely next word following `an agreeable` should be `manner` (with log probability −1.5, or 22%).
2.3: Generating text
We can now generate text based on the books that our program “read”!
Task: Write a function called `generate_text`. It should take as arguments a string representing the text to complete; *n*, the number of words to generate; the bigram counts; and the trigram counts. It should return the original text with *n* words added to it.

It’s possible you will be given an input where there are no known continuations. In that case, your code can add the token `"UNK"` for unknown.

Note: Apply this rule at every generation step. If the current context has no known continuation, append `"UNK"`. The next step uses the updated context (which may again yield `"UNK"`).
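A sketch of the generation loop, assuming words are joined back onto the context with single spaces:

```python
def generate_text(text, n, bigram_counts, trigram_counts):
    """Append n predicted words to the given text, one at a time."""
    for _ in range(n):
        next_word = predict_next_word(text, bigram_counts, trigram_counts)
        text = text + " " + next_word   # may append "UNK" when there is no continuation
    return text
```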
2.4: Generating more text
It’s boring to always generate the most likely completion; let’s make our text generation less predictable!
Task: Write a function called `sample_next_word`. It should be like `predict_next_word`, except that it returns a word from the set of candidates produced by `get_possible_next_words`, sampled according to its probability. Like `predict_next_word` and `generate_text`, `sample_next_word` should return `"UNK"` when there is no valid continuation.
Hint: You can use the `random.choices()` function to help you sample! But be aware that it expects non-negative weights, so before you pass the probabilities as weights, you’ll need to turn them back into normal probabilities using the `math.exp` function.

Edge case: If all candidate log probabilities are `-math.inf` (so all weights are 0 after `math.exp`), return `"UNK"` instead of calling `random.choices`.
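Following that hint, a sketch might look like this:

```python
import math
import random

def sample_next_word(text, bigram_counts, trigram_counts):
    """Sample the next word according to its probability, or return "UNK"."""
    candidates = get_possible_next_words(text, bigram_counts, trigram_counts)
    if not candidates:
        return "UNK"
    words = [word for word, _ in candidates]
    weights = [math.exp(log_prob) for _, log_prob in candidates]  # math.exp(-inf) is 0.0
    if sum(weights) == 0:
        return "UNK"            # every candidate had -inf log probability
    return random.choices(words, weights=weights, k=1)[0]
```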
Task: Next, augment your `generate_text` with an optional `mode` parameter. We’ll use `mode` to signal whether we want the most likely next word (`mode="top"`) or a word sampled according to its probability (`mode="random"`). If `mode` is omitted, treat it as `"top"` (choose the most likely next word).
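One way to sketch the updated signature, with `"top"` as the default:

```python
def generate_text(text, n, bigram_counts, trigram_counts, mode="top"):
    """Append n words to the text, either greedily ("top") or by sampling ("random")."""
    for _ in range(n):
        if mode == "random":
            next_word = sample_next_word(text, bigram_counts, trigram_counts)
        else:  # "top", including when mode is omitted
            next_word = predict_next_word(text, bigram_counts, trigram_counts)
        text = text + " " + next_word
    return text
```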
2.5: Questions
Task: Make a PDF with your answers to the following questions.
- Try running `generate_text` (in random mode) to add 10 tokens for three text contexts of your choice. What are the results?
- How does the output look? If you’re familiar with Austen’s writing, does it sound like her? Does it sound like any writer who knows English? What issues do you see?
Part 3: Smoothing and evaluation
This week we talked about the problem of sparsity, or how to handle zeros in language modeling. For example, if you were to train a unigram language model on the fiction category of the Brown corpus and then try to calculate the probability of generating the editorial category, you’d end up with a 0 probability, because there are words that occur in the editorial category that simply don’t appear in the fiction category.
In this part of the assignment, you’ll grapple with this problem in the context of using n-gram language models to analyze textual similarity.
3.1: Calculating text likelihood
Here’s one way we can use our n-gram language model to quantify the similarity between texts: Given a language model trained on one dataset, we can calculate the perplexity of another dataset. (We also use perplexity to evaluate language models.)
Recall that perplexity is the inverse probability of the dataset, normalized by the number of words:
$$ PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} $$
Task: Write a function called `calc_text_perplexity`. It should take a spaCy `Doc`, a `Counter` of bigrams, and a `Counter` of trigrams, and it should return the perplexity of the document.
To avoid numerical stability issues, you should normalize by the length of the document before exponentiating the sum of log probabilities.
Hint: Remember that our code calculates log probabilities, so you add them rather than multiplying. Likewise, exponents become multiplication.
Convention for short documents:

- If the document contains no trigrams, define its perplexity to be `1.0` (empty product).
- For documents containing trigrams with zero probability (unseen trigrams without smoothing), the perplexity should be `float("inf")`. (The `1.0` return value is only for documents too short to have any trigrams.)
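Under those conventions, a sketch might look like the following. Here "the length of the document" is taken to be the token count of the `Doc` (`len(doc)`); if your numbers don't match the check-in, the number of trigrams is the other reasonable normalizer to try.

```python
import math

def calc_text_perplexity(doc, bigram_counts, trigram_counts):
    """Return the perplexity of the Doc under the trigram model."""
    trigrams = get_trigrams(doc)
    if not trigrams:
        return 1.0                      # too short to contain any trigrams
    total_log_prob = 0.0
    for trigram in trigrams:
        total_log_prob += calc_ngram_prob(trigram, bigram_counts, trigram_counts)
    if total_log_prob == -math.inf:
        return float("inf")             # an unseen trigram with no smoothing
    # PP = exp(-(1/N) * sum of log probabilities); normalize before exponentiating
    return math.exp(-total_log_prob / len(doc))
```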
Check in
Run your `calc_text_perplexity` on the training text. Your perplexity should be around 6.7.

Run your `calc_text_perplexity` on the provided `austen-sense.txt` file (*Sense and Sensibility*). What do you observe?
3.2: Handling out-of-vocabulary words
The issue with calculating perplexity on texts outside of our training data is that they may contain trigrams that were not observed during training. The resulting perplexity score doesn’t make any sense: your script may terminate with an error, or calculate an extremely low perplexity, because dividing 1 by negative infinity returns zero.
We will add support for out-of-vocabulary words using a very simple technique: Laplace (add-1) smoothing. For unseen trigrams in our test data, we “pretend” we saw them once during training. We also adjust the counts for observed n-grams so that probability mass is distributed sensibly.
Task: Write a function called `add1_smoothing`. Your function should take a test `Doc`, a `Counter` of bigram frequencies, and a `Counter` of trigram frequencies. It should process the document and update the n-gram `Counter`s to account for n-grams in the `Doc` that weren’t observed during training.
Specifically, you should:
- Increment all of the existing trigram counts by 1.
- For each trigram in the document that you didn't see when training, give it a count of 1.
- For every trigram, including the ones you just added, increase the count for the bigram it starts with by 1. (See the sketch after this list.)
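A sketch of those three steps, modifying the `Counter`s in place:

```python
def add1_smoothing(test_doc, bigram_counts, trigram_counts):
    """Apply add-1 smoothing for the trigrams of test_doc, updating the counts in place."""
    # 1. Increment every existing trigram count by 1.
    for trigram in list(trigram_counts):
        trigram_counts[trigram] += 1
    # 2. Give each test-document trigram that was unseen in training a count of 1.
    for trigram in get_trigrams(test_doc):
        if trigram_counts[trigram] == 0:
            trigram_counts[trigram] = 1
    # 3. For every trigram (old and new), bump the count of its starting bigram by 1.
    for trigram in trigram_counts:
        bigram_counts[trigram[:2]] += 1
```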
Task: Use your function to smooth the bigram and trigram counts from our training text with `austen-sense.txt`.
Check in
If you smooth the n-grams from our original
training Doc
with austen-sense.txt
and compare
the counts for the first four bigrams for the same text before and after
smoothing, you should observe counts like the following:
Bigram | Original | Smoothed |
---|---|---|
`("[", "emma")` | 1 | 2 |
`("emma", "by")` | 1 | 2 |
`("by", "jane")` | 3 | 5 |
`("jane", "austen")` | 2 | 5 |
Notice that each of these bigrams increased by at least 1, because they were already present in training. Some increased further because they were also the starting bigram of trigrams in the test text.
If you check the last four bigrams, you will see new bigrams, which were not observed in the training data:
Bigram | Original | Smoothed |
---|---|---|
`("or", "producing")` | 0 | 1 |
`("producing", "coolness")` | 0 | 1 |
`("coolness", "between")` | 0 | 1 |
`("husbands", ".")` | 0 | 1 |
These new bigrams were not observed in training, so they start at 0 and then appear once as the starting bigram of a test trigram, giving them count = 1 after smoothing.
3.3: Questions
Now that we have added smoothing, you can do some analysis of the Project Gutenberg data.
Task: Make a PDF with your answers to the following questions.
- Of the texts that are not written by Austen, which is the most similar to the Austen training text? (You may want to use `iterdir()` like we did on Assignment 1 to iterate through the provided files.)
- There are three authors with three texts apiece in the dataset: Austen, Chesterton, and Shakespeare. For each author, train your n-gram model using two of the texts per author, and test on the held-out third text. Which author’s writing is the most consistent? Your writeup should include details about which texts were used to train and test.
Submitting the assignment
Submit your `lm.py` and your answers to the questions in 2.5 and 3.3 on Gradescope. (Don’t submit the texts or any other files.)
Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.