Assignment 4: Learning and Exploring Word Embeddings

Assigned Wednesday, 1 October
Due Wednesday, 8 October, 1:30 p.m.

In this assignment, you will explore the use of a word embedding model to capture semantic and syntactic similarities between words based on their co-occurrence patterns.

Set up

For this assignment, we will use gensim, a Python library that supports vector space models. You will need to install it, as we’ve previously done for spacy and scikit-learn:

$ pip3 install gensim

For this assignment there are no starter files, but there are some code snippets below to help you.

Part 1: Training a Word2vec model

In this part, you will use the gensim library to train a Word2vec model on a text corpus. You will also learn how to perform some basic operations on the word vectors.

To learn a reasonable word embedding model requires a significant amount of text. The best models are trained on massive amounts of text, and we’ll look at one of these in Part 2. But, to start, we’ll try learning our own model from a single book.

We’ll use the longest book from the Project Gutenberg selection we used in Assignment 2, namely the 1611 King James Version of the Bible. (This choice is not meant to advocate for or against anyone’s religious beliefs. For our purposes, it’s simply a long document that has some interesting properties!)

You may already have the text sitting around from Assignment 2, but we can also just download it again:

Task: Use the requests module to download the text:

import requests

URL = "https://www.cs.vassar.edu/~cs366/data/bible-kjv.txt"

bible = requests.get(URL).text

Task: Prepare the text to give to gensim, dividing it into a list of sentences, with each sentence divided into a list of tokens.

At this point, you should feel very familiar with how to do that, using spaCy. You can implement it yourself or feel free to use this variant of the code we’ve used before:

from spacy.lang.en import English

nlp = English(pipeline=[])
nlp.add_pipe("sentencizer")

def get_sentences(text: str) -> list[list[str]]:
    """Split the specified text into sentences, consisting of text tokens."""

    sents = []

    # We process the text in chunks by paragraph, ensuring that a sentence
    # never crosses a paragraph boundary:
    for para in text.split("\n\n"):
        doc = nlp(para.replace("\n", " "))
        for sent in doc.sents:
            tokens = [
                token.text.lower().strip()
                for token in sent
                if not token.is_space
            ]
            sents.append(tokens)

    return sents

Task: Train the Word2vec model on the corpus using the gensim.models.Word2Vec class. You can use the default parameters except:

  • set the number of iterations (epochs) to 25
  • set the context window to 2

Store the model in a variable called bible_model.

Note: The higher the number of iterations, the longer it will take to run. Feel free to set it to a lower number while you’re working on the code, but set it to 25 before you answer any questions.
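
If you’d like a starting point, here is a minimal sketch. It assumes the gensim 4.x parameter names (window, epochs) and the get_sentences function defined above:

from gensim.models import Word2Vec

sentences = get_sentences(bible)

# Default parameters, except for the context window and the number of epochs:
bible_model = Word2Vec(sentences, window=2, epochs=25)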

Now we can work with the trained vectors, which are stored in the wv attribute of the model object:

bible_vecs = bible_model.wv

And we can check whether the embeddings capture basic similarities we might expect in the source text:

Task: Use the most_similar method of the vectors to find the 10 most similar words to the following words:

  • garden
  • woman
  • well
  • cast

Print the words and their similarity scores.

The exact output format is up to you but, as an example, if you were looking at the word money you might print:

Similar to money:
- 0.59 food
- 0.58 victuals
- 0.56 meat
- 0.54 cup
- 0.52 wine
- 0.52 price
- 0.50 sack
- 0.50 vineyard
- 0.50 staff
- 0.50 goods
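
One way to produce output in that shape (a sketch only; the loop and format string are just one option):

for word in ["garden", "woman", "well", "cast"]:
    print(f"Similar to {word}:")
    for neighbor, score in bible_vecs.most_similar(word, topn=10):
        print(f"- {score:.2f} {neighbor}")
    print()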

Task: Build another model from the KJV sentences with the same settings but the context window set to 10 tokens.

Update your code to print the same lists of similar words for this new model as well.

Once again, the output format is up to you, but it could look like this to make it easy to compare:

Similar to money:
  window=2           window=10
- 0.59 food          0.65 sacks
- 0.58 victuals      0.61 price
- 0.56 meat          0.60 sack
- 0.54 cup           0.55 meat
- 0.52 wine          0.53 chest
- 0.52 price         0.52 tribute
- 0.50 sack          0.51 vineyard
- 0.50 vineyard      0.50 bottle
- 0.50 staff         0.49 food
- 0.50 goods         0.48 goods
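
One rough way to build the second model and print the two lists side by side (a sketch; the name bible_model_10 and the column widths are arbitrary choices):

bible_model_10 = Word2Vec(sentences, window=10, epochs=25)

for word in ["garden", "woman", "well", "cast"]:
    print(f"Similar to {word}:")
    print(f"  {'window=2':<18} window=10")
    pairs = zip(
        bible_model.wv.most_similar(word, topn=10),
        bible_model_10.wv.most_similar(word, topn=10),
    )
    for (w2, s2), (w10, s10) in pairs:
        print(f"- {s2:.2f} {w2:<13} {s10:.2f} {w10}")
    print()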

Task: Briefly describe (in a comment) the kind of changes you observe. Why do you think this happens?

Part 2: Exploring a large Word2vec model

For a high-quality distributional vector model, we want to learn the embeddings from a lot of text – more than is practical for everyone to do. Thankfully, researchers who build these models often distribute them for others to study and use.

One popular model is Mikolov et al.’s original Word2vec model trained on approx. 100 billion words of the (proprietary) Google News dataset. The gensim library can download it for us:

import gensim.downloader as gensim_api

gnews = gensim_api.load("word2vec-google-news-300")

Notes:

  • If you’re working on the CS Department systems, you don’t need to download this model! Instead, before the import gensim.downloader line, add

    import os
    os.environ["GENSIM_DATA_DIR"] = "/data/366/word2vec"
    
    and then the load method should use the copy that’s already there!
  • This is a big model, so it may take a few minutes to download. It will be stored in a gensim-data directory in your home directory, e.g., on macOS, /Users/yourname/gensim-data. You may want to delete it when you’re done with the assignment.

Because this model was trained on so much more data than our Bible model, we hope the results will have better coverage and look more reasonable. Let’s take a look!

Task: Find at least one word and one multi-word expression for which the list of most_similar words looks appropriate. Print the lists of similar words.

Note: For multi-word expressions, join the words with underscores rather than spaces.
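
For example, a query for a single word and for an underscore-joined phrase might look like this (the specific terms here are just illustrative guesses, not required choices):

for term in ["volcano", "New_York"]:
    print(f"Similar to {term}:")
    for neighbor, score in gnews.most_similar(term, topn=10):
        print(f"- {score:.2f} {neighbor}")
    print()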

Task: Find at least two words (or multi-word phrases) for which the list of most_similar words includes something strange. Print the lists of similar words.

This might take some searching, but it can be done! Consider some less common words or phrases you could try.

The Word2vec model is known for its ability to solve analogies through simple vector arithmetic. For example, king − man + woman results in a vector close to queen.
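
In gensim, analogy queries are expressed through most_similar’s positive and negative word lists. A minimal sketch of the classic example (topn=5 is just a choice):

# king - man + woman: positive words are added, negative words subtracted.
for neighbor, score in gnews.most_similar(positive=["king", "woman"], negative=["man"], topn=5):
    print(f"{score:.2f} {neighbor}")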

Task: Use the most_similar method to find at least two original analogies (that is, not using examples seen in class or discussed in the readings) that yield a good result.

But it’s not all sunshine and queens…

Task: Find at least two analogies that fail to return the expected results.

(Some of these will be quite amusingly bad!)

Submitting the assignment

Submit your code on Gradescope. (Don’t submit the texts or any other files.)

Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.