Assignment 2: Language Models

Assigned Wednesday, 17 September
Due Wednesday, 24 September, 1:30 p.m.

Contents

Set up

Task: Download and extract asmt2.zip.

This week, we’ll use the spaCy Python library, which is designed to make it easy to use pre-trained models to analyze very large sets of data.

Task: Install the required dependencies by running:

$ pip install -r requirements.txt
$ python3 -m spacy download en_core_web_sm

This installs both pytest (for testing) and spaCy (for text processing). After installation, run import spacy in Python to verify the installation worked.
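
You can also confirm that the en_core_web_sm model downloaded correctly by loading it in the same Python session; if the following runs without an error, both steps worked:

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")   # raises OSError if the model wasn't downloaded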

The starter code (lm.py) is based on example solutions to the in-class exercise from Wednesday. It loads a spaCy pipeline and includes functions to generate unigrams and bigrams based on spaCy’s tokenization of a text.

While in class we just loaded one text, Emma, for this assignment we’ll train our model on both Emma and Persuasion. To do this, we’ve provided a function make_docs that takes a list of file names and produces a single spaCy Doc with all the tokens from those files.
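
You don’t need to write make_docs yourself. Purely for intuition, here is a sketch of how such a helper could work in spaCy 3, assuming a module-level nlp pipeline; this is not necessarily how the provided version in lm.py is implemented:

from pathlib import Path
from spacy.tokens import Doc

def make_docs(filenames):
    """Run each file through the pipeline and merge the results into one Doc."""
    docs = [nlp(Path(name).read_text(encoding="utf-8")) for name in filenames]
    return Doc.from_docs(docs)   # Doc.from_docs is available in spaCy 3.x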

Part 1: Language modeling with n-grams

1.1: Trigrams

Task: Write get_trigrams that takes a spaCy Doc and returns a list of all trigrams in order.

Feel free to base this function on the provided get_bigrams! If you want to abstract the commonalities of get_unigrams, get_bigrams, and get_trigrams into a helper function, that’s also a good idea, though not required.
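
For example, if your get_bigrams builds a list of lowercased token strings and zips it with offset copies of itself, the trigram version might look like the sketch below; adjust the token attribute to match whatever the starter code uses:

def get_trigrams(doc):
    """Return every consecutive (w1, w2, w3) triple in the Doc, in order."""
    words = [token.text.lower() for token in doc]   # assumption: lowercased text, as in the check-ins below
    return list(zip(words, words[1:], words[2:]))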

1.2: N-gram counts

Now we need to store counts for all of the n-grams in our text.

Task: Write a function get_ngram_counts that takes your training Doc, uses your get_unigrams, get_bigrams, and get_trigrams functions to count the n-grams in the training text, and returns separate Counters for the unigrams, bigrams, and trigrams.
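
A minimal sketch, assuming each helper returns a list of n-grams:

from collections import Counter

def get_ngram_counts(doc):
    """Count the unigrams, bigrams, and trigrams in the training Doc."""
    unigram_counts = Counter(get_unigrams(doc))
    bigram_counts = Counter(get_bigrams(doc))
    trigram_counts = Counter(get_trigrams(doc))
    return unigram_counts, bigram_counts, trigram_counts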

1.3: Next word probability

Now we have all the pieces to calculate the probability of a word given the two words that come before it.

Task: Write a function called calc_ngram_prob. It should take a particular trigram, the bigram counts, and the trigram counts, and it should return the maximum likelihood estimation (MLE) of the probability of that trigram.

As a reminder, the formula for the MLE probability of a trigram is:

P(w_n \mid w_{n-2} w_{n-1}) = \frac{C(w_{n-2} w_{n-1} w_n)}{C(w_{n-2} w_{n-1})}

For numerical stability, wrap the division term in a call to math.log to convert it from a probability to a log probability. (Note: The reported check-in results below use natural log.)

Return negative infinity (-math.inf) for n-grams that aren’t in the given Counter.

Important: Do not call math.log(0). If either the trigram count or its conditioning bigram count is 0, return -math.inf immediately instead of taking a log.
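
Putting the formula and the zero-count rule together, one possible sketch (assuming the trigram is a 3-tuple of strings, as in the check-in below):

import math

def calc_ngram_prob(trigram, bigram_counts, trigram_counts):
    """Log MLE probability of the trigram's last word given its first two."""
    context = trigram[:2]   # the conditioning bigram (w_{n-2}, w_{n-1})
    if trigram_counts[trigram] == 0 or bigram_counts[context] == 0:
        return -math.inf    # never call math.log(0)
    return math.log(trigram_counts[trigram] / bigram_counts[context])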

Check in

You can check that it’s working with these test trigrams:

>>> calc_ngram_prob(("the", "day", "was"), bigram_freqs, trigram_freqs)
-3.2386784521643803
>>> calc_ngram_prob(("hey", "groovy", "trigram"), bigram_freqs, trigram_freqs)
-inf

(Sadly, Jane Austen never even uses the unigram groovy.)

Part 2: Text generation

2.1: Possible next words

Task: Write a function get_possible_next_words that takes a sequence of text (as a string), the bigram counts, and the trigram counts and returns a list of (word, probability) pairs for the possible next words.

This function should use your calc_ngram_prob function to calculate the log probability of each candidate. The results don’t need to be sorted; we’ll do that when we use them in the next function.
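
One way to sketch this, assuming the last two tokens of the input string form the conditioning bigram and that candidate words come from the trigram Counter; the module-level nlp pipeline and the lowercasing are assumptions that should match the rest of your code:

def get_possible_next_words(text, bigram_counts, trigram_counts):
    """Return (word, log probability) pairs for every word seen after the context."""
    tokens = [token.text.lower() for token in nlp(text)]
    context = tuple(tokens[-2:])   # the last two words condition the next one
    candidates = []
    # Simple (not fast) approach: scan every trigram type for a matching context.
    for trigram in trigram_counts:
        if trigram[:2] == context:
            candidates.append((trigram[2], calc_ngram_prob(trigram, bigram_counts, trigram_counts)))
    return candidates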

2.2: Most likely next word

Task: Now write a function predict_next_word that uses get_possible_next_words to find the most likely next word for a sequence of text. Like get_possible_next_words, it should take as arguments the sequence of text (as a string), the bigram counts, and the trigram counts. It should return the most likely next word. If there is no possible continuation, it should return "UNK".
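
A sketch of the argmax step, building on the previous function:

def predict_next_word(text, bigram_counts, trigram_counts):
    """Return the single most likely next word, or "UNK" if there are no candidates."""
    candidates = get_possible_next_words(text, bigram_counts, trigram_counts)
    if not candidates:
        return "UNK"
    best_word, _ = max(candidates, key=lambda pair: pair[1])
    return best_word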

Check in

The most likely next word following an agreeable should be manner (with a log probability of about −1.5, i.e., a probability of about 22%).

2.3: Generating text

We can now generate text based on the books that our program “read”!

Task: Write a function called generate_text. It should take as arguments a string representing the text to complete; n, the number of words to generate; the bigram counts; and the trigram counts. It should return the original text with n words added to it.

It’s possible you will be given an input where there are no known continuations. In that case, your code can add the token "UNK" for unknown.

Note: Apply this rule at every generation step. If the current context has no known continuation, append "UNK". The next step uses the updated context (which may again yield "UNK").
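
A sketch of the generation loop; joining with single spaces is a simplification (it puts a space before punctuation):

def generate_text(text, n, bigram_counts, trigram_counts):
    """Append n predicted words to the given text, one word at a time."""
    for _ in range(n):
        # predict_next_word returns "UNK" when there is no known continuation.
        next_word = predict_next_word(text, bigram_counts, trigram_counts)
        text = text + " " + next_word   # the updated text becomes the context for the next step
    return text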

2.4: Generating more text

It’s boring to always generate the most likely completion; let’s make our text generation less predictable!

Task: Write a function called sample_next_word. It should be like predict_next_word, except that it returns a word from the set of candidates produced by get_possible_next_words, sampled according to its probability. Like predict_next_word and generate_text, sample_next_word should return "UNK" when there is no valid continuation.

Hint: You can use the random.choices() function to help you sample! But be aware that it expects non-negative weights, so before you pass the probabilities as weights, you’ll need to turn them back to normal probabilities using the math.exp function.

Edge case: If all candidate log probabilities are -math.inf (so all weights are 0 after math.exp), return "UNK" instead of calling random.choices.
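
A sketch combining the hint and the edge case:

import math
import random

def sample_next_word(text, bigram_counts, trigram_counts):
    """Sample the next word in proportion to its probability, or return "UNK"."""
    candidates = get_possible_next_words(text, bigram_counts, trigram_counts)
    if not candidates:
        return "UNK"
    words = [word for word, _ in candidates]
    weights = [math.exp(log_prob) for _, log_prob in candidates]   # back to regular probabilities
    if sum(weights) == 0:   # every candidate had log probability -inf
        return "UNK"
    return random.choices(words, weights=weights, k=1)[0]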

Task: Next, augment your generate_text with an optional mode parameter. We’ll use mode to signal whether we want the most likely next word (mode="top") or a word sampled according to its probability (mode="random"). If mode is omitted, treat it as "top" (choose the most likely next word).
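
Relative to the earlier generate_text sketch, only the word-selection step changes; for example:

def generate_text(text, n, bigram_counts, trigram_counts, mode="top"):
    """Append n words, chosen by argmax ("top") or by sampling ("random")."""
    for _ in range(n):
        if mode == "random":
            next_word = sample_next_word(text, bigram_counts, trigram_counts)
        else:   # default: "top"
            next_word = predict_next_word(text, bigram_counts, trigram_counts)
        text = text + " " + next_word
    return text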

2.5: Questions

Task: Make a PDF with your answers to the following questions.

  1. Try running generate_text (in random mode) to add 10 tokens for three text contexts of your choice. What are the results?
  2. How does the output look? If you’re familiar with Austen’s writing, does it sound like her? Does it sound like any writer who knows English? What issues do you see?

Part 3: Smoothing and evaluation

This week we talked about the problem of sparsity, or how to handle zeros in language modeling. For example, if you were to train a unigram language model on the fiction category of the Brown corpus and then try to calculate the probability of generating the editorial category, you’d end up with a probability of 0, because some words that occur in the editorial category simply don’t appear in the fiction category.

In this part of the assignment, you’ll grapple with this problem in the context of using n-gram language models to analyze textual similarity.

3.1: Calculating text likelihood

Here’s one way we can use our n-gram language model to quantify the similarity between texts: Given a language model trained on one dataset, we can calculate the perplexity of another dataset. (We also use perplexity to evaluate language models.)

Recall that perplexity is the inverse probability of the dataset, normalized by the number of words:

PP(W) = \left( \prod_{n=1}^{N} P(w_n \mid w_{n-2} w_{n-1}) \right)^{-1/N}

Task: Write a function called calc_text_perplexity. It should take a spaCy Doc, a Counter of bigrams, and a Counter of trigrams, and it should return the perplexity of the document.

To avoid numerical stability issues, you should normalize by the length of the document before exponentiating the sum of log probabilities.

Hint: Remember that our code calculates log probabilities, so you add them rather than multiply. Likewise, raising to the −1/N power becomes multiplying the summed log probabilities by −1/N before exponentiating.

Convention for short documents:

  • If the document contains no trigrams, define its perplexity to be 1.0 (empty product).
  • For documents containing trigrams with zero probability (unseen trigrams without smoothing), the perplexity should be float("inf"). (The 1.0 return value is only for documents too short to have any trigrams.)
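
A sketch in log space, applying both conventions. Here N is taken to be the number of trigrams in the document; if your result doesn’t match the check-in below, your course materials may define N as the token count instead:

import math

def calc_text_perplexity(doc, bigram_counts, trigram_counts):
    """Perplexity of the Doc under the trigram model, computed via log probabilities."""
    trigrams = get_trigrams(doc)
    if not trigrams:
        return 1.0   # convention: empty product
    total_log_prob = 0.0
    for trigram in trigrams:
        log_prob = calc_ngram_prob(trigram, bigram_counts, trigram_counts)
        if log_prob == -math.inf:
            return float("inf")   # an unseen trigram and no smoothing
        total_log_prob += log_prob
    return math.exp(-total_log_prob / len(trigrams))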

Check in

Run your calc_text_perplexity on the training text. Your perplexity should be around 6.7.

Run your calc_text_perplexity on the provided austen-sense.txt file (Sense and Sensibility). What do you observe?

3.2: Handling out-of-vocabulary words

The issue with calculating perplexity on texts outside of our training data is that they may contain trigrams that were not observed during training. The resulting perplexity score doesn’t make any sense: depending on how you handle the zeros, your script may terminate with an error or report a perplexity of zero or infinity, because each unseen trigram contributes a log probability of negative infinity.

We will add support for out-of-vocabulary words using a very simple technique: Laplace (add-1) smoothing. For unseen trigrams in our test data, we “pretend” we saw them once during training. We also adjust the counts for observed n-grams so that probability mass is distributed sensibly.

Task: Write a function called add1_smoothing. Your function should take a test Doc, a Counter of bigram frequencies, and a Counter of trigram frequencies. It should process the document and update the n-gram Counters to account for n-grams in the Doc that weren’t observed during training.

Specifically, you should:

  1. Increment all of the existing trigram counts by 1.
  2. For each trigram in the document that you didn't see when training, give it a count of 1.
  3. For every trigram – including the ones you just added – increase the count for the bigram it starts with by 1.
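
A sketch that follows the three steps literally, updating the Counters in place. The interpretation of step 3 here is that every trigram type in the updated Counter credits its starting bigram exactly once, which is consistent with the check-in counts below:

def add1_smoothing(doc, bigram_counts, trigram_counts):
    """Add-1 smooth the Counters in place to cover trigrams from the test Doc."""
    # Step 1: every trigram already seen in training gets one extra count.
    for trigram in trigram_counts:
        trigram_counts[trigram] += 1
    # Step 2: trigrams in the test Doc that weren't seen in training start at 1.
    for trigram in get_trigrams(doc):
        if trigram not in trigram_counts:
            trigram_counts[trigram] = 1
    # Step 3: every trigram now in the Counter credits its conditioning bigram once.
    for trigram in trigram_counts:
        bigram_counts[trigram[:2]] += 1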

Task: Use your function to smooth the bigram and trigram counts from our training text with austen-sense.txt.
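
For example (the Counter names follow the earlier check-in; adjust the file path to wherever austen-sense.txt lives in the assignment folder):

test_doc = make_docs(["austen-sense.txt"])
add1_smoothing(test_doc, bigram_freqs, trigram_freqs)   # updates the Counters in place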

Check in

If you smooth the n-grams from our original training Doc with austen-sense.txt and compare the counts of the first four bigrams before and after smoothing, you should observe counts like the following:

Bigram               Original   Smoothed
("[", "emma")            1          2
("emma", "by")           1          2
("by", "jane")           3          5
("jane", "austen")       2          5

Notice that each of these bigrams increased by at least 1, because they were already present in training. Some increased further because they also start trigrams that appear in the test text.

If you check the last four bigrams, you will see new bigrams, which were not observed in the training data:

Bigram                      Original   Smoothed
("or", "producing")             0          1
("producing", "coolness")       0          1
("coolness", "between")         0          1
("husbands", ".")               0          1

These new bigrams were not observed in training, so they start at 0 and then appear once as the starting bigram of a test trigram, giving them count = 1 after smoothing.

3.3: Questions

Now that we have added smoothing, you can do some analysis of the Project Gutenberg data.

Task: Make a PDF with your answers to the following questions.

  1. Of the texts that are not written by Austen, which is the most similar to the Austen training text? (You may want to use iterdir() like we did on Assignment 1 to iterate through the provided files; a sketch of such a loop follows these questions.)
  2. There are three authors with three texts apiece in the dataset: Austen, Chesterton, and Shakespeare. For each author, train your n-gram model using two of the texts per author, and test on the held-out third text. Which author’s writing is the most consistent? Your writeup should include details about which texts were used to train and test.
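
For question 1, one possible shape for the loop, assuming the texts sit in a directory named gutenberg (adjust the path to match the assignment files) and using copies of the training counts so each comparison starts from the same unsmoothed model:

from collections import Counter
from pathlib import Path

for path in Path("gutenberg").iterdir():
    if path.suffix != ".txt":
        continue
    test_doc = make_docs([str(path)])
    bigrams, trigrams = Counter(bigram_freqs), Counter(trigram_freqs)   # fresh copies of the training counts
    add1_smoothing(test_doc, bigrams, trigrams)
    print(path.name, calc_text_perplexity(test_doc, bigrams, trigrams))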

Submitting the assignment

Submit your lm.py and your answers to the questions in 2.5 and 3.3 on Gradescope. (Don’t submit the texts or any other files.)

Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.