Assigned: Wednesday, 24 September
Due: Wednesday, 1 October, 1:30 p.m.
Can you guess which author wrote the following lines: Jane Austen or Herman Melville?
I never met with a disposition more truly amiable.
But Queequeg, do you see, was a creature in the transition stage – neither caterpillar nor butterfly.
I’ll bet you can – but can we make a computer do it?
Let’s find out! We will build a classifier which, given a sentence as input, predicts which of the two authors wrote it. For our data, we will be using Austen’s Emma and Melville’s Moby Dick, digitized by Project Gutenberg. For a real experiment, we’d prefer more data and might use every available text by each of the authors. Restricting ourselves to just one book by each author keeps this assignment more manageable.
Set up
Task: Download and extract asmt3.zip.
For this assignment, we’ll continue using spaCy but will also need another important Python library, scikit-learn. You can install it by running this command:
$ pip3 install scikit-learn
Although the library is called scikit-learn, when we load components from it, we use the short name sklearn, e.g.,
from sklearn.linear_model import LogisticRegression
Part 1: Developing a logistic regression classifier
In this part, you will be developing a logistic regression classifier, starting with the code in whosaid.py.
Task: The starter code loads each text as a list of sentences. From these lists of sentences, build two new lists of (sentence, author) pairs, while filtering out sentences that are too short to be useful (1–2 words long, e.g., Chapter XIII). Join these lists into a single list called sents.
After this, the code will print how many Austen, Melville, and total sentences there are.
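For example, here is a minimal sketch of one way to build sents. It assumes the starter code exposes the two books’ sentences as lists of spaCy spans named austen_sents and melville_sents – those names are guesses for illustration, not necessarily the starter code’s actual identifiers.

# Sketch only: austen_sents and melville_sents are assumed names for the
# lists of spaCy sentence spans loaded by the starter code.
# Counting only word tokens filters out 1-2 word "sentences" such as chapter headings.
austen_pairs = [(sent.text, "austen")
                for sent in austen_sents
                if sum(tok.is_alpha for tok in sent) > 2]
melville_pairs = [(sent.text, "melville")
                  for sent in melville_sents
                  if sum(tok.is_alpha for tok in sent) > 2]

sents = austen_pairs + melville_pairs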
Task: Shuffle the labeled sentence list and partition the sentences into three data sets:
- testing set: the first 1 000 sentences,
- development test set: the next 1 000 sentences, and
- training set: the rest
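If you haven’t done this kind of slicing before, here is a sketch (the names test, devtest, and train are just suggestions):

import random

random.shuffle(sents)        # shuffle the (sentence, author) pairs in place

test = sents[:1000]          # testing set: the first 1 000 sentences
devtest = sents[1000:2000]   # development test set: the next 1 000
train = sents[2000:]         # training set: everything else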
Before we can give these sentences to a classifier, we need to turn them into features:
Task: Use CountVectorizer to convert the three data sets into their corresponding bag-of-words feature representations, as matrices that can be provided to a scikit-learn classifier for training and prediction. The names beginning with X_ should be the features, while the names beginning with y_ should be lists of the corresponding labels (authors).
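One way to sketch this, continuing with the test, devtest, and train lists from the previous step (again, illustrative names):

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

# Split each (sentence, author) list into sentences and labels.
train_sents, y_train = zip(*train)
devtest_sents, y_devtest = zip(*devtest)
test_sents, y_test = zip(*test)

# Learn the vocabulary from the training sentences only, then reuse it
# to build matching feature matrices for the other two sets.
X_train = count_vect.fit_transform(train_sents)
X_devtest = count_vect.transform(devtest_sents)
X_test = count_vect.transform(test_sents)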
Now we have our training features and labels, so we’re ready to train the classifier:
Task: Train a LogisticRegression classifier on X_train and y_train.
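For example (the name whosaid matches the classifier used in the predict_proba example later in this assignment; max_iter is optional, just to avoid convergence warnings):

from sklearn.linear_model import LogisticRegression

whosaid = LogisticRegression(max_iter=1000)
whosaid.fit(X_train, y_train)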
Task: To evaluate the classifier’s performance, compute the F1 score (and other metrics) using classification_report, looking at the predictions made for the test data.
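A sketch, assuming the X_test and y_test names from above:

from sklearn.metrics import classification_report

y_pred = whosaid.predict(X_test)
print(classification_report(y_test, y_pred))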
Before you go on, it’s worth considering if the performance matches what you’d expect a computer to be able to do just by looking at the presence of particular words. Is it higher than you’d expect? Lower? Why might that be?
Let’s take a look at some of the sentences we get right and wrong. We should not look at the test data, but we can use our development test set – data that we didn’t train on, but which we’re not using for our “real” evaluation.
Task: From the development test set, create four subsets based on the real author vs the classifier’s prediction.
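One possible sketch, reusing the devtest names from the earlier steps (the bucket names are up to you):

devtest_pred = whosaid.predict(X_devtest)

# Four buckets: real author crossed with the classifier's prediction.
austen_as_austen, austen_as_melville = [], []
melville_as_melville, melville_as_austen = [], []

for sent, real, pred in zip(devtest_sents, y_devtest, devtest_pred):
    if real == "austen" and pred == "austen":
        austen_as_austen.append(sent)
    elif real == "austen":
        austen_as_melville.append(sent)
    elif pred == "melville":
        melville_as_melville.append(sent)
    else:
        melville_as_austen.append(sent)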
Now we can sample random sentences from each category:
Task: Print sample correct and incorrect predictions from the four sub-divided sets.
You may want to run this a few times to see different examples – or you can change the code to print more than one sample per set.
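A minimal version using random.choice with the buckets sketched above:

buckets = [("Austen predicted as Austen", austen_as_austen),
           ("Austen predicted as Melville", austen_as_melville),
           ("Melville predicted as Melville", melville_as_melville),
           ("Melville predicted as Austen", melville_as_austen)]

for name, bucket in buckets:
    if bucket:  # a bucket could conceivably be empty
        print(name + ":", random.choice(bucket))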
Task: Uncomment the call to show_most_informative_features, which prints out the 40 most informative features for each author, based on the classifier’s learned weights.
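The helper is already written for you, but it helps to know where the ranking comes from: with two classes, LogisticRegression learns one weight per vocabulary word, and strongly negative weights favor the first entry of whosaid.classes_ while strongly positive weights favor the second. A rough sketch of that idea (not necessarily how the provided function is implemented):

import numpy as np

feature_names = count_vect.get_feature_names_out()
weights = whosaid.coef_[0]      # one learned weight per vocabulary word
order = np.argsort(weights)     # most negative first, most positive last

# Negative weights pull predictions toward classes_[0], positive toward classes_[1].
print(whosaid.classes_[0], [feature_names[i] for i in order[:40]])
print(whosaid.classes_[1], [feature_names[i] for i in order[-40:]])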
That concludes your classifier development. Note that every time you run your script you will get slightly different performance scores: that’s because you are randomly shuffling the data set every time, resulting in a different partition into training, development test, and test sets.
That brings us to one last step I want you to take:
Task: Replace your sentence-shuffling command with the following line:
random.Random(10).shuffle(sents)
and re-run the script.
This shuffles the sentence list with a fixed random seed. The sentences are still randomly shuffled, but they end up in the same order for everyone, which in turn leads to identical classifier models for everyone! This effectively freezes the model and gives us shared reference points for the next part of the homework, which centers on analysis.
Part 2: Analysis and write-up
In this part, you will explore the classifier you trained in Part 1 in order to gain an understanding of its inner workings.
Task: Answer the following questions in a PDF or plain-text document. While you’ll write some code to find the answers, your analysis is the focus here, not the numbers or calculations.
Be sure to number your answers so it’s clear which question you’re answering.
Features
Examine the list of the most informative features. Do you notice any patterns? Any surprising entries? (Your answer should be about one paragraph.)
Main character names
You may be thinking that the classifier is getting a lot of help from the main character names such as Emma, Ahab and Queequeg. Let’s see how well it does without them.
Conveniently, CountVectorizer allows you to specify a stop list – a list of words we don’t want to include in the features. Try specifying the provided main_characters list and re-run the script. How is the new classifier’s performance? Did it degrade as much as you expected? Why do you think that is? How is the top feature list affected?
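For reference, the stop list is passed through CountVectorizer’s stop_words parameter, e.g. (main_characters is the list provided by the starter code):

count_vect = CountVectorizer(stop_words=main_characters)

Keep in mind that CountVectorizer lowercases text by default, so the names in the stop list only take effect if they are lowercase too.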
When you’re done answering this question, switch back to including the character names. For the rest of the questions, use this original setting.
Trying out sentences
Next we can see what happens if we test the classifier on sentences that aren’t from Emma or Moby Dick!
Test the classifier on the two sentences below. Sent1 is by Jane Austen, but taken from Persuasion. Sent2 is from Alice’s Adventures in Wonderland by Lewis Carroll.
- Sent1: Anne was to leave them on the morrow, an event which they all dreaded.
- Sent2: So Alice began telling them her adventures from the time when she first saw the White Rabbit.
Hint: Before you can give the sentences to your classifier’s .predict method, you’ll have to tokenize them and generate the features, like we did for the sentences in Emma and Moby Dick. You may want to test that this code is working by giving some obvious test sentences like “Emma said hello to Emma” and “I love whaling in my boat.”
What label did the classifier give to Sent1 and Sent2? Did it match your expectation?
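For the mechanics, here is a sketch of the prediction step, assuming plain strings go straight into count_vect (if your pipeline pre-tokenizes with spaCy before vectorizing, do the same to these sentences):

sent1 = "Anne was to leave them on the morrow, an event which they all dreaded."
sent2 = "So Alice began telling them her adventures from the time when she first saw the White Rabbit."

# Build features with the vectorizer fitted on the training data,
# then ask the trained classifier for labels.
new_feats = count_vect.transform([sent1, sent2])
print(whosaid.predict(new_feats))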
Label probabilities for a sentence
Labeling judgments aside, how likely does your model think it is that Sent1 is Austen? That is essentially P(austen | Sent1). To find out, we need to use the .predict_proba method instead of the usual .predict. Because this method returns the probability for each class, you need to know which is which. The code below demonstrates how to find the probability estimates assigned to either label for the sentence Hello, world:
>>> hellofeats = count_vect.transform(["Hello, world"])
>>> whosaid.predict_proba(hellofeats)
array([[0.33251418, 0.66748582]])
>>> whosaid.classes_
array(['austen', 'melville'], dtype='<U8')
That is, whosaid thinks it’s 33% likely to be Austen and 67% likely to be Melville.
- (a) Try it with Sent1. What is P(austen | Sent1)? That is, given Sent1, how likely is it to be Austen? What is P(melville | Sent1)?
- (b) How about Sent2, P(austen | Sent2) and P(melville | Sent2)?
- (c) From (a) and (b), how confident is your classifier about these classifications? Does that match what you expected?
Submitting the assignment
Submit your code and write-up on Gradescope. (Don’t submit the texts or any other files.)
Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.