Assigned: Wednesday, 1 October
Due: Wednesday, 8 October, 1:30 p.m.
In this assignment, you will explore the use of a word embedding model to capture semantic and syntactic similarities between words based on their co-occurrence patterns.
Set up
For this assignment, we will use gensim, a Python library that supports vector space models. You will need to install it, as we’ve previously done for spacy and scikit-learn:
$ pip3 install gensim
For this assignment there are no starter files, but there are some code snippets below to help you.
Part 1: Training a Word2vec model
In this part, you will use the gensim library to train a Word2vec model on a text corpus. You will also learn how to perform some basic operations on the word vectors.
To learn a reasonable word embedding model requires a significant amount of text. The best models are trained on massive amounts of text, and we’ll look at one of these in Part 2. But, to start, we’ll try learning our own model from a single book.
We’ll try using the longest book from the Project Gutenberg selection we used on Assignment 2, namely the 1611 King James Version of the Bible. (This choice is not meant to advocate either for or against anyone’s religious beliefs. For our purposes, it’s simply a long document that has some interesting properties!)
You may already have the text sitting around from Assignment 2, but we can also just download it from the web:
Task:
Use the requests module to download the text:
import requests

URL = "https://www.cs.vassar.edu/~cs366/data/bible-kjv.txt"
bible = requests.get(URL).text
Task:
Prepare the text to give to gensim, dividing it into a list of sentences, with each sentence divided into a list of tokens.
At this point, you should feel very familiar with how to do that, using spaCy. You can implement it yourself or feel free to use this variant of the code we’ve used before:
from spacy.lang.en import English

nlp = English(pipeline=[])
nlp.add_pipe("sentencizer")


def get_sentences(text: str) -> list[list[str]]:
    """Split the specified text into sentences, consisting of text tokens."""
    sents = []
    # We process the text in chunks by paragraph, ensuring that a sentence
    # never crosses a paragraph boundary:
    for para in text.split("\n\n"):
        doc = nlp(para.replace("\n", " "))
        for sent in doc.sents:
            tokens = [
                token.text.lower().strip()
                for token in sent
                if not token.is_space
            ]
            sents.append(tokens)
    return sents
Task:
Train the Word2vec model on the corpus using the gensim.models.Word2Vec class. You can use the default parameters except:
- set the number of iterations (epochs) to 25
- set the context window to 2
Store the model in a variable called bible_model.
Note: The higher the number of iterations, the longer it will take to run. Feel free to set it to a lower number while you’re working on the code, but set it to 25 before you answer any questions.
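If it helps, here’s a rough sketch of what the training step might look like (assuming gensim 4.x, where the relevant keyword arguments are window and epochs; older gensim versions call the latter iter):

from gensim.models import Word2Vec

# Assumes `bible` holds the downloaded text and get_sentences() is defined as above.
sentences = get_sentences(bible)

# Context window of 2 and 25 passes over the corpus; everything else stays at the defaults.
bible_model = Word2Vec(sentences, window=2, epochs=25)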
Now we can work with the trained vectors, which are in the wv element of the model object:
bible_vecs = bible_model.wv
And we can check whether the embeddings capture basic similarities we might expect in the source text:
Task:
Use the most_similar method of the vectors to find the 10 most similar words to the following words:
- garden
- woman
- well
- cast
Print the words and their similarity scores.
The exact output format’s up to you, but, for an example, if you were looking at the word money you might print:
Similar to money:
 - 0.59 food
 - 0.58 victuals
 - 0.56 meat
 - 0.54 cup
 - 0.52 wine
 - 0.52 price
 - 0.50 sack
 - 0.50 vineyard
 - 0.50 staff
 - 0.50 goods
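One possible way to produce output along those lines (just a sketch; most_similar returns a list of (word, score) pairs, highest similarity first):

for word in ["garden", "woman", "well", "cast"]:
    print(f"Similar to {word}:")
    for neighbor, score in bible_vecs.most_similar(word, topn=10):
        print(f" - {score:.2f} {neighbor}")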
Task: Build another model from the KJV sentences with the same settings but the context window set to 10 tokens.
Update your code to print the same lists of similar words, including this new model.
Once again, the output format is up to you, but it could look like this to make it easy to compare:
Similar to money:
          2               10
 - 0.59 food         0.65 sacks
 - 0.58 victuals     0.61 price
 - 0.56 meat         0.60 sack
 - 0.54 cup          0.55 meat
 - 0.52 wine         0.53 chest
 - 0.52 price        0.52 tribute
 - 0.50 sack         0.51 vineyard
 - 0.50 vineyard     0.50 bottle
 - 0.50 staff        0.49 food
 - 0.50 goods        0.48 goods
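Here’s one way such a side-by-side comparison could be printed (a sketch; the names bible_model_10 and bible_vecs_10 are purely illustrative):

# Hypothetical second model: same corpus and epochs, but a 10-token context window.
bible_model_10 = Word2Vec(sentences, window=10, epochs=25)
bible_vecs_10 = bible_model_10.wv

for word in ["garden", "woman", "well", "cast"]:
    print(f"Similar to {word}:")
    print(f"   {'2':<18}10")
    pairs = zip(bible_vecs.most_similar(word, topn=10),
                bible_vecs_10.most_similar(word, topn=10))
    for (w2, s2), (w10, s10) in pairs:
        print(f" - {s2:.2f} {w2:<13} {s10:.2f} {w10}")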
Task: Briefly describe (in a comment) the kind of changes you observe. Why do you think this happens?
Part 2: Exploring a large Word2vec model
For a high-quality distributional vector model, we want to learn the embeddings from a lot of text – far more than it’s practical for each of us to train on ourselves. Thankfully, researchers who build these models often distribute them for others to study and use.
One popular model is Mikolov et al.’s original Word2vec model trained on approx. 100 billion words of the (proprietary) Google News dataset. The gensim library can download it for us:
import gensim.downloader as gensim_api

gnews = gensim_api.load("word2vec-google-news-300")
Notes:

- If you’re working on the CS Department systems, you don’t need to download this model! Instead, before the import gensim.downloader line, add

      import os
      os.environ["GENSIM_DATA_DIR"] = "/data/366/word2vec"

  and then the load method should use the copy that’s already there!

- This is a big model, so it may take a few minutes to download. It will be stored in a gensim-data directory in your home directory, e.g., on macOS, /Users/yourname/gensim-data. You may want to delete it when you’re done with the assignment.
Because this model was trained on so much more data than our Bible model, we hope the results will have better coverage and look more reasonable. Let’s take a look!
Task:
Find at least one word and one multi-word expression for which the list of most_similar words looks appropriate. Print the lists of similar words.
Note: For multi-word expressions, join the words with underscores rather than spaces.
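Queries against this model work just as they did with the Bible model; for instance (the words below are only placeholders; pick your own, and note that most_similar will raise a KeyError for anything not in the model’s vocabulary):

# Single word (placeholder example):
for neighbor, score in gnews.most_similar("violin", topn=10):
    print(f" - {score:.2f} {neighbor}")

# Multi-word expression, joined with underscores (placeholder example):
for neighbor, score in gnews.most_similar("New_York", topn=10):
    print(f" - {score:.2f} {neighbor}")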
Task:
Find at least two words (or multi-word phrases) for which the list of most_similar words includes something strange. Print the lists of similar words.
This might take some searching, but it can be done! Consider some less common words or phrases you could try.
The Word2vec model is known for its ability to solve analogies through simple vector arithmetic. For example, king − man + woman results in a vector close to queen.
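In gensim, an analogy like this can be posed with the positive and negative arguments of most_similar; the king/queen example above might look like:

# king - man + woman: "king" and "woman" contribute positively, "man" negatively.
for neighbor, score in gnews.most_similar(positive=["king", "woman"],
                                          negative=["man"], topn=5):
    print(f" - {score:.2f} {neighbor}")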
Task:
Use the most_similar method to find at least two original analogies (that is, not using examples seen in class or discussed in the readings) that yield a good result.
But it’s not all sunshine and queens…
Task: Find at least two analogies that fail to return the expected results.
(Some of these will be quite amusingly bad!)
Submitting the assignment
Submit your code on Gradescope. (Don’t submit the texts or any other files.)
Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.