Vassar CMPU 366: Assignment 5

Assigned	Wednesday, 8 October
Due	~~Friday, 17~~ Sunday, 19 October, 11:59 p.m.

In this assignment, we will build a neural network model to classify songs by three musical artists: Beyoncé, Drake, and Taylor Swift.

For this assignment, we will use PyTorch to implement a neural network classifier similar to the one we wrote in class for predicting the language of a book title. But rather than count individual letters or words for the features, we will use a pre-trained neural network called DistilBERT to calculate word embeddings.

DistilBERT is a more efficient version of a popular contextual language model called BERT (which we’ll discuss in class soon!). As many machine learning practitioners do, you can use DistilBERT as a black box for now.

Set up
Part 1: Song lyric data
- 1.1: Loading data
- 1.2: Preprocessing data for DistilBERT
Part 2: Neural network classification
Part 3: Evaluation and class imbalance
Submitting the assignment


Set up

Task: Download and unzip the starter code and data for the assignment.

Task: Install the PyTorch and transformers Python packages:

$ pip3 install -r requirements.txt

Try running the starter file. The first time you do so, it will download the DistilBERT model that will be used.

Part 1: Song lyric data

We have scraped the lyrics from each of the artists’ studio albums (not including some of their most recent albums; sorry, Showgirl…) into CSV files and split them into two datasets: one for training and one for testing.

These CSV files don’t have a header row, but the columns are:

Artist name
Album name
Track name
One line of the lyrics
Line number
Year

(The only ones we’ll worry about are the artist name and the line of lyrics.)

1.1: Loading data

The input to our model will be a vector of numbers. For training and evaluation, we also need to provide a list of labels – the correct classes for each lyric. The starter code includes a hard-coded dictionary mapping the artist names to class numbers.

Task: Fill in the function make_data that takes as input the name of one of the dataset CSV files and the label dictionary. It should read in the contents of the file and return a list of the lyrics (as strings) and a list of the corresponding labels (as the class numbers given in the dictionary).

Hint: Make your life easier; use the Python csv module to split each line into the columns. It knows to account for commas that are inside a column (e.g., in Honestly, Nevermind) vs those that separate columns.
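
To see why the csv module helps, here is a small sketch that compares it with a naive split on commas. The two rows are made up for illustration (only the column order follows the list above), and the data is read from an in-memory string rather than from one of the assignment's files.

```python
import csv
import io

# Two made-up rows in the column order described above. The quoted album
# title contains a comma that is NOT a column separator.
line1 = 'Drake,"Honestly, Nevermind",Example Track,An example lyric line,1,2022'
line2 = 'Taylor Swift,Example Album,Another Track,"Another line, with a comma",2,2020'
sample = io.StringIO(line1 + "\n" + line2 + "\n")

# A naive split breaks the album title into two pieces (7 fields, not 6).
naive_fields = line1.split(",")

# csv.reader respects the quotes and yields one list of 6 strings per row.
rows = list(csv.reader(sample))

for row in rows:
    artist, lyric = row[0], row[3]
    print(artist, "|", lyric)
```

The same csv.reader call works on a file object returned by open, which is usually opened with newline="" as the csv documentation recommends.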

Task: When you’ve filled in the function, uncomment the calls to make_data in main to read the training and testing data and the print statements below them.

Check in

If everything’s working right, the first entry in train_data should be 'She thought she was killing that shit, I told her, "Go harder"' and the first entry in train_labels should be 0 (Beyoncé).

(As a matter of course, we won’t peek at test_data or test_labels!)

1.2: Preprocessing data for DistilBERT

We mapped the labels to appropriate numeric values, and now we need to do the same for the lyrics. For this purpose, we’ll use DistilBERT’s own tokenizer, which maps character sequences to token indices. The starter code loads the DistilBERT tokenizer with the name tokenizer.

We’ll use a maximum length MAX_LENGTH, defaulting to 10 tokens, so that we can control how much memory the model consumes. Lyrics that are longer than the maximum length will be truncated; lyrics that are shorter will be padded.

Task: Fill in the function prep_bert_data to preprocess your data for DistilBERT. It should take in a list of lyrics and call the tokenizer on each of them, returning a list of tensors, each with dtype=torch.long. You don’t need to use the return_tensors parameter; you can just call torch.tensor on each list, like we did in class for the letter-count features.

Use the right arguments to truncate and pad the songs to the maximum length. (See the parameters in the documentation for calling a PreTrainedTokenizer, which is the parent class of DistilBertTokenizer.)

The tokenizer will return a dictionary with both the input IDs and an attention mask. For this assignment, you can safely ignore the attention mask and just select the first element in the input IDs (even though you may see a warning message saying to use the attention mask).
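
To make truncation and padding concrete, here is a hand-rolled sketch of what they do to a list of token IDs. It is not the tokenizer's implementation, and the helper pad_or_truncate is hypothetical. The special-token IDs are assumptions based on the standard BERT uncased vocabulary (101 for [CLS], 102 for [SEP], 0 for [PAD]); the 9999 values are placeholders for tokens that get cut off.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token IDs


def pad_or_truncate(word_ids, max_length):
    """Return exactly max_length IDs: [CLS] + words + [SEP] + padding."""
    room = max_length - 2  # reserve two slots for [CLS] and [SEP]
    ids = [CLS_ID] + word_ids[:room] + [SEP_ID]
    return ids + [PAD_ID] * (max_length - len(ids))


# A short input is padded on the right up to max_length.
short = pad_or_truncate([7592, 2088], max_length=10)

# A long input loses everything past the first 8 word tokens.
long_ids = [2016, 2245, 2016, 2001, 4288, 2008, 4485, 1010, 9999, 9999]
long = pad_or_truncate(long_ids, max_length=10)

print(short)
print(long)
```

Either way the result has exactly MAX_LENGTH entries, which is what lets every lyric be fed to a model with a fixed input size.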

Task: Uncomment the lines in main calling prep_bert_data to generate the training and testing features.

Check in

If everything’s working right, the first entry in train_feats (encoding the lyrics we saw above) should be

tensor([ 101, 2016, 2245, 2016, 2001, 4288, 2008, 4485, 1010,  102])

Part 2: Neural network classification

In class we saw how to use PyTorch to construct a neural network regression model with two hidden layers and ReLU activation functions. The model for this assignment will be a bit larger and take longer to train.

The starter code is designed to periodically save model checkpoints. Similar to when you play a videogame, these checkpoints ensure that if your program crashes or you kill it, you can reload the model from the file and continue training where you left off.

Once you have a model that you are happy with, you will be able to load it from the checkpoint for inference (prediction), without retraining it.

Task: Uncomment the code in main through the call to make_or_restore_model. The code will automatically create a directory called ckpt where checkpoints will be saved. The code in make_or_restore_model checks if there is a saved checkpoint; if not, it initializes the model from scratch.

Note: If you want to retrain your model from scratch (e.g., because something has gone wrong), you will need to delete the contents of the ckpt directory.
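
For intuition about what a checkpoint is, here is a minimal sketch of saving and restoring a model's parameters in PyTorch. The starter code may well do this differently (for example, by also saving optimizer state); the tiny nn.Linear model and the file name example.pt are stand-ins chosen for illustration.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 3)  # stand-in for the real model

with tempfile.TemporaryDirectory() as ckpt_dir:
    path = os.path.join(ckpt_dir, "example.pt")  # hypothetical file name

    # Save only the learned parameters (the "state dict").
    torch.save(model.state_dict(), path)

    # Later: build a model with the same architecture, then load the
    # saved parameters into it.
    restored = nn.Linear(4, 3)
    restored.load_state_dict(torch.load(path))

# The restored model has exactly the same weights and biases.
same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```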

2.1: Incorporating embeddings

Task: Modify the model NN so it feeds your input through a pretrained DistilBERT model. You will need to modify the model definition in two places:

First, in __init__, you will want to add the DistilBERT model that’s loaded near the top of the file.

Next, in forward, you will need to feed the input into DistilBERT. DistilBERT returns multiple outputs; we want only the last_hidden_state. The documentation used to give a nice example that shows how to do this, which they removed for some reason. It was:

from transformers import AutoTokenizer, DistilBertModel
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)  # ** is turning the inputs dictionary into named arguments to model()

last_hidden_states = outputs.last_hidden_state

2.2: Flattening

The output from DistilBERT is two-dimensional:

One dimension is the number of tokens.
The other is the number of units in the last hidden layer of the DistilBERT model: 768.

To feed the output of DistilBERT into a softmax layer for classification, we will need to flatten it into one dimension. After all, we want a prediction per song lyric, not per word in the song lyric.

Task: Insert an nn.Flatten layer after the DistilBERT layer and before the nn.Linear layer.

You will also need to adjust the input dimensions to the nn.Linear output_layer. The input dimensions should be the number of tokens (MAX_LENGTH, passed in to the constructor as the n_features argument) times the number of DistilBERT units (768).
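
The shape arithmetic can be checked without loading DistilBERT at all. The sketch below uses random numbers as a stand-in for the DistilBERT output, and it assumes the input arrives in batches, so there is an extra leading batch dimension that nn.Flatten leaves alone by default.

```python
import torch
from torch import nn

MAX_LENGTH = 10  # tokens per lyric, as in the assignment
HIDDEN = 768     # DistilBERT units per token
N_CLASSES = 3    # Beyoncé, Drake, Taylor Swift

# Random numbers standing in for DistilBERT output on a batch of 4 lyrics.
fake_bert_output = torch.randn(4, MAX_LENGTH, HIDDEN)

flatten = nn.Flatten()  # keeps dim 0 (the batch), merges all the rest
output_layer = nn.Linear(MAX_LENGTH * HIDDEN, N_CLASSES)

flat = flatten(fake_bert_output)  # shape (4, 7680)
scores = output_layer(flat)       # shape (4, 3): one row per lyric

print(flat.shape, scores.shape)
```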

Check in

You have now implemented a regression classifier on DistilBERT embeddings. Before adding any more layers, check to make sure that your current version runs.

2.3: Going deep

The starter code just gives a simple regression model, equivalent to a feedforward neural network with no hidden layers. We can do better!

Task: Replace the existing linear layers (layer1 and layer2 from the starter code) with four nn.Linear layers, each followed by an nn.ReLU layer, to make your model more powerful.

We didn’t see it in class, but you may find it useful to use the nn.Sequential container in your __init__ to group together layers.

The first hidden layer should take the number of dimensions of the DistilBERT embeddings as input (768), and output half as many dimensions. Each subsequent layer should reduce the dimensionality by half.

You should apply the nn.Flatten layer after the hidden layers, but before the softmax layer.
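
To see how the dimensions flow through such a stack, here is one possible sketch, again with random numbers in place of DistilBERT output. It is not the starter code's structure: the variable names are hypothetical, and ending with nn.LogSoftmax is an assumption made because Part 3 refers to nn.NLLLoss, which expects log-probabilities.

```python
import torch
from torch import nn

MAX_LENGTH, HIDDEN, N_CLASSES = 10, 768, 3

# Four Linear + ReLU pairs, each halving the width:
# 768 -> 384 -> 192 -> 96 -> 48
layers = []
width = HIDDEN
for _ in range(4):
    layers += [nn.Linear(width, width // 2), nn.ReLU()]
    width //= 2

head = nn.Sequential(
    *layers,                                   # applied to each token's vector
    nn.Flatten(),                              # (batch, 10, 48) -> (batch, 480)
    nn.Linear(MAX_LENGTH * width, N_CLASSES),  # one score per class
    nn.LogSoftmax(dim=1),                      # log-probabilities per lyric
)

fake_bert_output = torch.randn(2, MAX_LENGTH, HIDDEN)
out = head(fake_bert_output)
print(out.shape)
```

Because nn.Linear acts on the last dimension only, the hidden layers process each token's vector independently; flattening afterwards means the final layer sees 10 × 48 = 480 inputs rather than 10 × 768.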

Part 3: Evaluation and class imbalance

When we train the model, we evaluate the overall model performance. But our lyric dataset suffers from class imbalance: there are twice as many Taylor Swift songs as Beyoncé songs.

3.1: Performance by class

Task: Write a function called print_performance_by_class that prints a summary of accuracy by category. Your function should take two parameters: a list of labels (such as test_labels) and a list of model predictions (as tensors).

Your function should print a summary like the one below.

Accuracy by Category:
Category 0: 0.0
Category 1: 0.0
Category 2: 1.0

Hint: You can use the test() function as an example of how to retrieve the model predictions and compare them to the correct labels!
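
The bookkeeping behind a per-class accuracy is small enough to sketch on toy data. The helper accuracy_by_class below is hypothetical, and it assumes the labels and predictions are plain Python integers; tensor predictions would first need converting (for example, with int or .item()).

```python
from collections import Counter


def accuracy_by_class(labels, predictions):
    """Map each class to the fraction of its examples predicted correctly."""
    total = Counter(labels)
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)
    return {c: correct[c] / total[c] for c in sorted(total)}


# Made-up labels and predictions for eight examples.
labels = [0, 0, 1, 1, 2, 2, 2, 2]
predictions = [0, 2, 2, 2, 2, 2, 2, 1]

acc = accuracy_by_class(labels, predictions)

print("Accuracy by Category:")
for category, value in acc.items():
    print(f"Category {category}: {value}")
```

On this toy data the overall accuracy is 4/8, yet class 1 is never predicted correctly, which is exactly the kind of detail an overall number hides.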

3.2: Question

Task: Uncomment the call to print_performance_by_class in main and then answer this question in a plain-text file or PDF:

How well does your model perform? Report its accuracy for all three classes.

3.3: Sampling predictions

It can be useful to manually inspect some of the model’s predictions.

Task: Write a function called sample_and_print_predictions that takes four arguments: a dataset, the set of features for the dataset, the set of labels for the dataset, and the model. It should randomly sample 10 indices within the dataset and retrieve model predictions for those 10 lyrics.

For each of the sampled lyrics, it should print the lyric, its class, and the predicted class, e.g.,

Lyrics: You wear the same jewels that I gave you
- Class: Taylor Swift
- Prediction: Taylor Swift

Lyrics: I'm swervin' on that, swervin'-swervin' on that
- Class: Beyoncé
- Prediction: Drake

(You don’t need to use this exact format, but it’ll be easier to interpret if you show the actual artist names rather than the label numbers.)
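
Drawing the random indices is the only slightly unusual step, so here is a sketch of it on made-up data. Everything in it is a placeholder: the lyrics are dummy strings, the predictions come from an imaginary model that always guesses class 2, and the names for classes 1 and 2 are assumed (the text above only confirms that 0 is Beyoncé).

```python
import random

# Assumed mapping from class numbers to artist names.
names = {0: "Beyoncé", 1: "Drake", 2: "Taylor Swift"}

lyrics = [f"made-up lyric {i}" for i in range(50)]
labels = [i % 3 for i in range(50)]
predictions = [2] * 50  # imaginary model that always predicts class 2

# random.sample draws without replacement, so the 10 indices are distinct.
# Seeding a private generator makes the sketch repeatable.
rng = random.Random(0)
indices = rng.sample(range(len(lyrics)), 10)

for i in indices:
    print(f"Lyrics: {lyrics[i]}")
    print(f"- Class: {names[labels[i]]}")
    print(f"- Prediction: {names[predictions[i]]}")
    print()
```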

3.4: Adjusting for class imbalance

As mentioned, a major challenge of our song lyric dataset is class imbalance: it contains twice as many Taylor Swift song lyrics as Beyoncé or Drake. But we can mitigate this imbalance problem by weighting less frequent classes more heavily in the loss function.

Task: Calculate the number of lyrics by artist in the training dataset. Add class weights as an argument to the loss function, nn.NLLLoss. You can do this by passing in a tensor that maps labels to weights, as described in the loss function documentation. (You might find it easiest to make a dictionary and then convert it to a tensor.)

Hint: A good weighting scheme is to weight each class inversely to its frequency in the training data.
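
One common variant of inverse-frequency weighting can be sketched as follows. The labels are made up, and the particular scaling (total count divided by the number of classes times the class count) is an assumption; any weights proportional to 1/count express the same idea. The weight argument of nn.NLLLoss takes a one-dimensional tensor with one entry per class.

```python
from collections import Counter

import torch
from torch import nn

train_labels = [0, 1, 2, 2, 0, 2, 1, 2]  # made-up: class 2 is twice as common
counts = Counter(train_labels)            # class -> number of examples
n_classes = len(counts)

# weight_c = N / (n_classes * count_c). With this scaling, a perfectly
# balanced dataset would give every class a weight of 1.0.
weights = torch.tensor(
    [len(train_labels) / (n_classes * counts[c]) for c in range(n_classes)],
    dtype=torch.float,
)

loss_fn = nn.NLLLoss(weight=weights)

# Try it on random log-probabilities for the eight examples.
log_probs = torch.log_softmax(torch.randn(len(train_labels), n_classes), dim=1)
loss = loss_fn(log_probs, torch.tensor(train_labels))

print(weights, loss.item())
```

Here the two rarer classes end up with twice the weight of the most common one, so a mistake on one of their examples costs the model twice as much.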

3.5: Question

Task: Answer this question in the same write-up you started for 3.2:

How well does your model perform now? Report its accuracy for all three classes.

Submitting the assignment

Submit your code and your write-up for Part 3.2 and Part 3.5 on Gradescope. (Don’t submit the data or other files.)

Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.