| Assigned | Wednesday, 8 October |
| Due |
In this assignment, we will build a neural network model to classify songs by three musical artists: Beyoncé, Drake, and Taylor Swift.
For this assignment, we will use PyTorch to implement a neural network classifier similar to the one we wrote in class for predicting the language of a book title. But rather than count individual letters or words for the features, we will use a pre-trained neural network called DistilBERT to calculate word embeddings.
DistilBERT is a more efficient version of a popular contextual language model called BERT (which we’ll discuss in class soon!). As many machine learning practitioners do, you can use DistilBERT as a black box for now.
Set up
Task: Download and unzip the starter code and data for the assignment.
Task: Install the PyTorch and transformers Python packages:
$ pip3 install -r requirements.txt
Try running the starter file. The first time you do so, it will download the DistilBERT model that will be used.
Part 1: Song lyric data
We have scraped the lyrics from each of the artists’ studio albums (not including some of their most recent albums; sorry, Showgirl…) into CSV files and split them into two datasets: one for training and one for testing.
These CSV files don’t have a header row, but the columns are:
- Artist name
- Album name
- Track name
- One line of the lyrics
- Line number
- Year
(The only ones we’ll worry about are the artist name and the line of lyrics.)
1.1: Loading data
The input to our model will be a vector of numbers. For training and evaluation, we also need to provide a list of labels – the correct classes for each lyric. The starter code includes a hard-coded dictionary mapping the artist names to class numbers.
Task:
Fill in the function make_data that takes as input the
name of one of the dataset CSV files and the label
dictionary. It should read in the contents of the file and return a list
of the lyrics (as strings) and a list of the corresponding labels (as
the class numbers given in the dictionary).
Hint: Make your life easier; use the Python csv module to split each line into the columns. It knows to account for commas that are inside a column (e.g., in Honestly, Nevermind) vs. those that separate columns.
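To see why the csv module helps here, the following is a minimal sketch on two made-up rows (not taken from the dataset) that follow the column order listed above. The track names and lyric text are placeholders, and this is not the full make_data function.

```python
import csv
import io

# Two hypothetical rows in the column order listed above. The album name in
# the first row contains a comma, so it is quoted in the file.
line = 'Drake,"Honestly, Nevermind",Example Track,An example lyric line,1,2022'
sample = io.StringIO(
    line + "\n"
    "Taylor Swift,Example Album,Another Track,Another example line,2,2020\n"
)

rows = list(csv.reader(sample))

# csv.reader keeps six columns per row, even with the embedded comma.
for artist, album, track, lyric, line_number, year in rows:
    print(artist, "|", album, "|", lyric)

# A naive split on commas breaks the album name into two pieces instead.
naive = line.split(",")
print(len(rows[0]), "columns with csv;", len(naive), "pieces with str.split")
```

In make_data, the same reader can wrap an open file object instead of the in-memory string used here.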
Task:
When you’ve filled in the function, uncomment the
calls to make_data in main to read the
training and testing data and the print statements below
them.
Check in
If everything’s working right, the first entry
in train_data should be 'She thought
she was killing that shit, I told her, "Go harder"' and the first
entry in train_labels should be 0
(Beyoncé).
(As a matter of course, we won’t peek at test_data
or test_labels!)
1.2: Preprocessing data for DistilBERT
We mapped the labels to appropriate numeric values, and now we need to do
the same for the lyrics. For this purpose, we’ll use DistilBERT’s
own tokenizer, which maps character sequences to token indices. The
starter code loads the DistilBERT tokenizer with the
name tokenizer.
We’ll use a maximum length MAX_LENGTH, defaulting
to 10 tokens, so that we can control how much memory the model consumes. Lyrics
that are longer than the maximum length will be truncated; lyrics that
are shorter will be padded.
Task:
Fill in the function prep_bert_data to preprocess your data for
DistilBERT. It should take in a list of lyrics and call the tokenizer
on each of them, returning a list of tensors,
each with dtype=torch.long. You
don’t need to use the return_tensors parameter; you
can just call torch.tensor on each list, like we did in
class for the letter-count features.
Use the right arguments to truncate and pad the songs
to the maximum length. (See the parameters in the
documentation
for calling a PreTrainedTokenizer, which is the parent
class of DistilBertTokenizer.)
The tokenizer will return a dictionary with both the input IDs and an attention mask. For this assignment, you can safely ignore the attention mask and just select the first element in the input IDs (even though you may see a warning message saying to use the attention mask).
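To make the effect of truncation and padding concrete, here is a small pure-Python sketch. The helper pad_or_truncate is hypothetical: it only imitates what the tokenizer produces for a single sequence and is not a replacement for calling the tokenizer. The IDs 0, 101, and 102 are the [PAD], [CLS], and [SEP] tokens in distilbert-base-uncased. The last two IDs in the long example are arbitrary values added for illustration.

```python
MAX_LENGTH = 10  # same default as in the assignment
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special tokens in distilbert-base-uncased


def pad_or_truncate(word_ids, max_length=MAX_LENGTH):
    """Hypothetical helper imitating tokenizer padding and truncation."""
    body = word_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    return ids + [PAD_ID] * (max_length - len(ids))


# A short input is padded with zeros up to MAX_LENGTH.
short = pad_or_truncate([2016, 2245])

# A long input is cut off so that only MAX_LENGTH IDs remain.
long = pad_or_truncate([2016, 2245, 2016, 2001, 4288, 2008, 4485, 1010, 1045, 2409])

print(short)
print(long)
```

The long example reproduces the ten IDs shown in the check-in for this section, which is why that tensor ends in 102 even though the lyric itself continues.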
Task:
Uncomment the lines in main
calling prep_bert_data to generate the training and testing
features.
Check in
If everything’s working right, the first entry
in train_feats (encoding the lyrics we saw above) should
be
tensor([ 101, 2016, 2245, 2016, 2001, 4288, 2008, 4485, 1010, 102])
Part 2: Neural network classification
In class we saw how to use PyTorch to construct a neural network regression model with two hidden layers and ReLU activation functions. The model for this assignment will be a bit larger and take longer to train.
The starter code is designed to periodically save model checkpoints. Similar to when you play a videogame, these checkpoints ensure that if your program crashes or you kill it, you can reload the model from the file and continue training where you left off.
Once you have a model that you are happy with, you will be able to load it from the checkpoint for inference (prediction), without retraining it.
Task:
Uncomment the code in main through the call
to make_or_restore_model. The code will automatically create a
directory called ckpt where checkpoints will be saved. The code
in make_or_restore_model checks if there is a saved
checkpoint; if not, it initializes the model from scratch.
Note: If you want to retrain your model from
scratch (e.g., because something has gone wrong), you will need to
delete the contents of the ckpt directory.
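The general save-and-restore pattern behind checkpointing can be sketched as follows. This is an illustration only, not the starter code's make_or_restore_model: the tiny nn.Linear model, the file name model.pt, and the temporary directory are all stand-ins chosen for the example.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 3)  # stand-in for the assignment's NN class

with tempfile.TemporaryDirectory() as ckpt_dir:
    path = os.path.join(ckpt_dir, "model.pt")  # hypothetical file name

    # Save only the parameters (the "state dict"), not the whole object.
    torch.save(model.state_dict(), path)

    # Later: build a fresh model and restore the parameters if a file exists.
    restored = nn.Linear(4, 3)
    if os.path.exists(path):
        restored.load_state_dict(torch.load(path))

same = torch.equal(model.weight, restored.weight)
print("weights restored:", same)
```

Because a restored model picks up wherever the saved parameters left off, removing the saved files is what forces a fresh start.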
2.1: Incorporating embeddings
Task:
Modify the model NN so it feeds your input through a
pretrained DistilBERT model. You will need to modify the model
definition in two places:
- First, in __init__, you will want to add the DistilBERT model that’s loaded near the top of the file.
- Next, in forward, you will need to feed the input into DistilBERT. DistilBERT returns multiple outputs; we want only the last_hidden_state. The documentation used to give a nice example that shows how to do this, which they removed for some reason. It was:

    from transformers import AutoTokenizer, DistilBertModel
    import torch

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")

    inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
    outputs = model(**inputs)  # ** is turning the inputs dictionary into named arguments to model()

    last_hidden_states = outputs.last_hidden_state
2.2: Flattening
The output from DistilBERT is two-dimensional:
- One dimension is the number of tokens.
- The other is the number of units in the last hidden layer of the DistilBERT model: 768.
To feed the output of DistilBERT into a softmax layer for classification, we will need to flatten it into one dimension. After all, we want a prediction per song lyric, not per word in the song lyric.
Task: Insert
an nn.Flatten
layer after the DistilBERT layer and before the nn.Linear
layer.
You will also need to adjust the input dimensions to
the nn.Linear output_layer. The input
dimensions should be the number of tokens (MAX_LENGTH, passed
in to the constructor as the n_features argument) times the number of DistilBERT units (768).
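The shape arithmetic can be checked without loading DistilBERT at all. The sketch below uses a random tensor as a stand-in for last_hidden_state and assumes a batch of 4 lyrics and 3 output classes; the variable names are illustrative rather than taken from the starter code.

```python
import torch
from torch import nn

MAX_LENGTH = 10    # tokens per lyric
HIDDEN_SIZE = 768  # DistilBERT units per token
N_CLASSES = 3      # one class per artist

# Random stand-in for last_hidden_state for a batch of 4 lyrics.
fake_hidden = torch.randn(4, MAX_LENGTH, HIDDEN_SIZE)

flatten = nn.Flatten()  # keeps the batch dimension, merges the rest
output_layer = nn.Linear(MAX_LENGTH * HIDDEN_SIZE, N_CLASSES)

flat = flatten(fake_hidden)
scores = output_layer(flat)

print(flat.shape)    # one row of 10 * 768 values per lyric
print(scores.shape)  # one score per class per lyric
```

If the input size of the linear layer does not equal MAX_LENGTH times 768, PyTorch raises a shape mismatch error on the forward pass, which is a quick way to spot a wrong dimension.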
Check in
You have now implemented a regression classifier on DistilBERT embeddings. Before adding any more layers, check to make sure that your current version runs.
2.3: Going deep
The starter code just gives a simple regression model, equivalent to a feedforward neural network with no hidden layers. We can do better!
Task:
Replace the existing linear layers (layer1 and
layer2 from the starter code) with four nn.Linear layers, each followed by
an nn.ReLU layer, to make your model more powerful.
We didn’t see it in class, but you may find it useful to use
the nn.Sequential container in your __init__ to
group together layers.
The first hidden layer should take the number of dimensions of the DistilBERT embeddings as input (768), and output half as many dimensions. Each subsequent layer should reduce the dimensionality by half.
You should apply the nn.Flatten layer after the hidden
layers, but before the softmax layer.
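One possible reading of these instructions, sketched on a random stand-in for the DistilBERT output, is shown below. The names hidden and head, the batch size of 4, and the use of nn.LogSoftmax (chosen to pair with nn.NLLLoss) are assumptions for illustration; your starter code may organize the layers differently.

```python
import torch
from torch import nn

MAX_LENGTH, HIDDEN_SIZE, N_CLASSES = 10, 768, 3

# Four hidden layers, each halving the dimensionality: 768 -> 384 -> 192 -> 96 -> 48.
hidden = nn.Sequential(
    nn.Linear(768, 384), nn.ReLU(),
    nn.Linear(384, 192), nn.ReLU(),
    nn.Linear(192, 96), nn.ReLU(),
    nn.Linear(96, 48), nn.ReLU(),
)

# Flatten after the hidden layers, then map to log-probabilities per class.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(MAX_LENGTH * 48, N_CLASSES),
    nn.LogSoftmax(dim=1),
)

# Random stand-in for last_hidden_state for a batch of 4 lyrics.
fake_hidden = torch.randn(4, MAX_LENGTH, HIDDEN_SIZE)
log_probs = head(hidden(fake_hidden))
print(log_probs.shape)
```

Note that nn.Linear acts on the last dimension, so the hidden layers process each token's 768-dimensional embedding separately; flattening afterwards means the final layer sees MAX_LENGTH times 48 inputs rather than MAX_LENGTH times 768.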
Part 3: Evaluation and class imbalance
When we train the model, we evaluate the overall model performance. But our lyric dataset suffers from class imbalance: there are twice as many Taylor Swift songs as Beyoncé songs.
3.1: Performance by class
Task:
Write a function called print_performance_by_class that
prints a summary of accuracy by category. Your function should take two
parameters: a list of labels (such as test_labels) and a list
of model predictions (as tensors).
Your function should print a summary like the one below.
Accuracy by Category:
Category 0: 0.0
Category 1: 0.0
Category 2: 1.0
Hint: You can use the test() function
as an example of how to retrieve the model predictions and compare them
to the correct labels!
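The bookkeeping for per-class accuracy can be sketched on toy data as follows. The labels and predictions here are made-up plain integers; in the assignment the predictions are tensors, so you would first convert each one to a class number (for example with .item()).

```python
from collections import Counter

# Toy data: six examples across three classes (made up for illustration).
labels = [0, 0, 1, 2, 2, 2]
predictions = [0, 1, 1, 2, 2, 0]

correct = Counter()  # correct predictions per true class
total = Counter()    # examples per true class
for label, pred in zip(labels, predictions):
    total[label] += 1
    if pred == label:
        correct[label] += 1

accuracy = {c: correct[c] / total[c] for c in sorted(total)}

print("Accuracy by Category:")
for c, acc in accuracy.items():
    print(f"Category {c}: {acc}")
```

Grouping by the true label, rather than by the predicted label, is what makes this a per-class accuracy: it answers how often lyrics by each artist are classified correctly.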
3.2: Question
Task:
Uncomment the call to print_performance_by_class
in main and then answer this question in a plain-text file
or PDF:
- How well does your model perform? Report its accuracy for all three classes.
3.3: Sampling predictions
It can be useful to manually inspect some of the model’s predictions.
Task:
Write a function called sample_and_print_predictions that
takes four arguments: a dataset, the set of features for the dataset,
the set of labels for the dataset, and the model. It should randomly
sample 10 indices within the dataset and retrieve model predictions for
those 10 lyrics.
For each of the sampled lyrics, it should print the lyric, its class, and the predicted class, e.g.,
Lyrics: You wear the same jewels that I gave you - Class: Taylor Swift - Prediction: Taylor Swift
Lyrics: I'm swervin' on that, swervin'-swervin' on that - Class: Beyoncé - Prediction: Drake
(You don’t need to use this exact format, but it’ll be easier to interpret if you show the actual artist names rather than the label numbers.)
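The sampling and printing steps can be sketched on placeholder data as shown below. The lyrics, labels, and predictions are made up, and the class_names mapping is an assumption: only 0 (Beyoncé) is confirmed by the check-in above, so use the dictionary from the starter code for the real mapping. In the assignment, the predictions would come from running the model on the sampled features.

```python
import random

# Assumed mapping; only 0 -> Beyoncé is confirmed by the handout.
class_names = {0: "Beyoncé", 1: "Drake", 2: "Taylor Swift"}

# Placeholder dataset of 20 entries with made-up labels and predictions.
data = [f"example lyric {i}" for i in range(20)]
labels = [i % 3 for i in range(20)]
predictions = [(i + 1) % 3 if i % 4 == 0 else i % 3 for i in range(20)]

random.seed(0)  # fixed seed so the sketch is repeatable
indices = random.sample(range(len(data)), k=10)  # 10 distinct indices

for i in indices:
    print(
        f"Lyrics: {data[i]} - Class: {class_names[labels[i]]}"
        f" - Prediction: {class_names[predictions[i]]}"
    )
```

Sampling indices rather than the lyrics themselves keeps the dataset, features, and labels aligned, since the same index selects the matching entry from each list.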
3.4: Adjusting for class imbalance
As mentioned, a major challenge of our song lyric dataset is class imbalance: it contains twice as many Taylor Swift song lyrics as Beyoncé or Drake. But we can mitigate this imbalance problem by weighting less frequent classes more heavily in the loss function.
Task:
Calculate the number of lyrics by artist in the training dataset. Add
class weights as an argument to the loss
function, nn.NLLLoss. You can do this by passing in a
tensor that maps labels to weights, as described in
the loss
function documentation. (You might find it easiest to make a
dictionary and then convert it to a tensor.)
Hint: A good weighting scheme is to weight each class inversely to its frequency in the training data.
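As a rough sketch of inverse-frequency weighting, the example below uses four toy labels in which class 2 appears twice as often as the others. The specific formula (total count divided by the number of classes times the class count) is one common choice, not necessarily the one intended by the assignment, and the random log-probabilities stand in for model output.

```python
from collections import Counter

import torch
from torch import nn

train_labels = [0, 1, 2, 2]  # toy labels: class 2 is twice as frequent
counts = Counter(train_labels)
n_classes = len(counts)

# Weight each class inversely to its frequency; index i holds the weight of class i.
weights = torch.tensor(
    [len(train_labels) / (n_classes * counts[c]) for c in range(n_classes)],
    dtype=torch.float,
)
loss_fn = nn.NLLLoss(weight=weights)

# Random log-probabilities standing in for the model's output.
log_probs = torch.log_softmax(torch.randn(len(train_labels), n_classes), dim=1)
loss = loss_fn(log_probs, torch.tensor(train_labels))

print(weights)
print(loss.item())
```

With these weights, a mistake on a rarer class contributes twice as much to the loss as a mistake on the most frequent class, which counteracts the model's incentive to always predict the majority class.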
3.5: Question
Task: Answer this question in the same write-up you started for 3.2:
- How well does your model perform now? Report its accuracy for all three classes.
Submitting the assignment
Submit your code and your write-up for Part 3.2 and Part 3.5 on Gradescope. (Don’t submit the data or other files.)
Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.