Assignment 3: Who Said It?

Assigned Wednesday, 24 September
Due Wednesday, 1 October, 1:30 p.m.

Can you guess which author wrote the following lines: Jane Austen or Herman Melville?

I never met with a disposition more truly amiable.

But Queequeg, do you see, was a creature in the transition stage – neither caterpillar nor butterfly.

I’ll bet you can – but can we make a computer do it?

Let’s find out! We will build a classifier which, given a sentence as input, predicts which of the two authors wrote it. For our data, we will be using Austen’s Emma and Melville’s Moby Dick, digitized by Project Gutenberg. For a real experiment, we’d prefer more data and might use every available text by each of the authors. Restricting ourselves to just one book by each author keeps this assignment more manageable.


Set up

Task: Download and extract asmt3.zip.

For this assignment, we’ll continue using spaCy but will also need another important Python library, scikit-learn. You can install it by running this command:

$ pip3 install scikit-learn

Although the library is called scikit-learn, when we load components from it, we use the short name sklearn, e.g.,

from sklearn.linear_model import LogisticRegression

Part 1: Developing a logistic regression classifier

In this part, you will be developing a logistic regression classifier, starting with the code in whosaid.py.

Task: The starter code loads each text as a list of sentences. From these lists of sentences, build two new lists of (sentence, author) pairs, while filtering out sentences that are too short to be useful (1–2 words long, e.g., Chapter XIII). Join these lists into a single list called sents.

After this, the code will print how many Austen, Melville, and total sentences there are.
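If you’re unsure how to structure this, here is one possible shape for the step. The names austen_sents and melville_sents are placeholders for whatever the starter code calls its two sentence lists, and the sketch assumes each sentence is a plain string (if yours are spaCy spans, use their text or token count instead):

# Placeholder names: your starter code's sentence lists may be called something else.
austen_pairs = [(sent, "austen") for sent in austen_sents
                if len(sent.split()) > 2]    # skip 1-2 word "sentences" like "Chapter XIII"
melville_pairs = [(sent, "melville") for sent in melville_sents
                  if len(sent.split()) > 2]
sents = austen_pairs + melville_pairs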

Task: Shuffle the labeled sentence list and partition the sentences into three data sets:

  • testing set: the first 1,000 sentences,
  • development test set: the next 1,000 sentences, and
  • training set: the rest
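One way the shuffle and split might look, assuming sents is the combined list from the previous task (the set names below are illustrative, not required):

import random

random.shuffle(sents)             # you will swap this for a seeded shuffle at the end of Part 1
test_sents = sents[:1000]         # first 1,000 sentences
devtest_sents = sents[1000:2000]  # next 1,000 sentences
train_sents = sents[2000:]        # everything else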

Before we can give these sentences to a classifier, we need to turn them into features:

Task: Use CountVectorizer to convert the three data sets into their corresponding bag-of-words feature representations, as matrices that can be provided to a scikit-learn classifier for training and prediction.

The variables whose names begin with X_ should hold the feature matrices, while those beginning with y_ should hold lists of the corresponding labels (authors).
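Here is a sketch of what that might look like, reusing the names from the previous steps. The key point is that the vocabulary is learned (fit) only on the training sentences, and the development test and test sets are merely transformed with it:

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

# Learn the vocabulary from the training sentences and build the training matrix.
# (If your sentences are spaCy objects rather than strings, pass their text instead.)
X_train = count_vect.fit_transform([sent for sent, author in train_sents])
y_train = [author for sent, author in train_sents]

# Reuse the same vocabulary for the development test and test sets.
X_devtest = count_vect.transform([sent for sent, author in devtest_sents])
y_devtest = [author for sent, author in devtest_sents]
X_test = count_vect.transform([sent for sent, author in test_sents])
y_test = [author for sent, author in test_sents]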

Now we have our training features and labels, so we’re ready to train the classifier:

Task: Train a LogisticRegression classifier on X_train and y_train.

Task: To evaluate the classifier’s performance, compute the F1 score (and other metrics) using classification_report, looking at the predictions made for the test data.
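For example, training and evaluation might look like this, assuming the names above (the classifier is called whosaid here to match the predict_proba example in Part 2):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# A larger max_iter avoids convergence warnings on sparse bag-of-words features.
whosaid = LogisticRegression(max_iter=1000)
whosaid.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = whosaid.predict(X_test)
print(classification_report(y_test, y_pred))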

Before you go on, it’s worth considering if the performance matches what you’d expect a computer to be able to do just by looking at the presence of particular words. Is it higher than you’d expect? Lower? Why might that be?

Let’s take a look at some of the sentences we get right and wrong. We should not look at the test data, but we can use our development test set – data that we didn’t train on, but which we’re not using for our “real” evaluation.

Task: From the development test set, create four subsets based on the combination of the true author and the classifier’s predicted author (Austen predicted as Austen, Austen predicted as Melville, and so on).

Now we can sample random sentences from each category:

Task: Print sample correct and incorrect predictions from each of the four subsets.
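One possible sketch for these two tasks, again assuming the names introduced above:

import random

# Predict authors for the development test sentences.
devtest_pred = whosaid.predict(X_devtest)

# Four subsets, keyed by (true author, predicted author).
subsets = {("austen", "austen"): [], ("austen", "melville"): [],
           ("melville", "melville"): [], ("melville", "austen"): []}
for (sent, gold), pred in zip(devtest_sents, devtest_pred):
    subsets[(gold, pred)].append(sent)

# Print one random example from each subset.
for (gold, pred), group in subsets.items():
    if group:
        print(f"{gold} predicted as {pred}: {random.choice(group)}")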

You may want to run this a few times to see different examples – or you can change the code to print more than one sample per set.

Task: Uncomment the call to show_most_informative_features, which prints out the 40 most informative features for each author, based on the classifier’s learned weights.
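The function is provided in the starter code, but it helps to know roughly how it works. For a binary LogisticRegression, the learned weights live in coef_[0], one weight per vocabulary entry: strongly negative weights pull predictions toward the first class in classes_ (here austen), strongly positive weights toward the second (melville). A sketch of the idea, not necessarily identical to the provided function:

def show_most_informative_features(vectorizer, classifier, n=40):
    # Pair each vocabulary item with its learned weight and sort by weight.
    weights = sorted(zip(classifier.coef_[0], vectorizer.get_feature_names_out()))
    # The n most negative weights favor classes_[0]; the n most positive favor classes_[1].
    for (w_a, feat_a), (w_m, feat_m) in zip(weights[:n], reversed(weights[-n:])):
        print(f"{w_a:8.3f} {feat_a:20s} {w_m:8.3f} {feat_m}")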

That concludes your classifier development. Note that every time you run your script you will get slightly different performance scores: that’s because you are randomly shuffling the data set every time, resulting in a different partition into training, development test, and test sets.

That brings us to one last step I want you to take:

Task: Replace your sentence-shuffling command with the following line:

random.Random(10).shuffle(sents)

and re-run the script.

This shuffles the sentence list based on a fixed random seed. The result is a list of sentences that have been randomly shuffled but nevertheless in the same sequential order for all of us, which will then lead to identical classifier models for everyone! This effectively freezes the model, and it allows us to share the same reference points for the next part of the homework, which centers on analysis.

Part 2: Analysis and write-up

In this part, you will explore the classifier you trained in Part 1 in order to gain an understanding of its inner workings.

Task: Answer the following questions in a PDF or plain-text document. While you’ll write some code to find the answers, your analysis is the focus here, not the numbers or calculations.

Be sure to number your answers so it’s clear which question you’re answering.

  1. Features

    Examine the list of the most informative features. Do you notice any patterns? Any surprising entries? (Your answer should be about one paragraph.)

  2. Main character names

    You may be thinking that the classifier is getting a lot of help from the main character names such as Emma, Ahab and Queequeg. Let’s see how well it does without them.

    Conveniently, CountVectorizer allows you to specify a stop list (via its stop_words parameter) – a list of words we don’t want to include in the features. Try specifying the provided main_characters list and re-run the script.

    How is the new classifier’s performance? Did it degrade as much as you expected? Why do you think that is? How is the top feature list affected?

    When you’re done answering this question, switch back to including the character names. For the rest of the questions, use this original setting.

  3. Trying out sentences

    Next we can see what happens if we test the classifier on sentences that aren’t from Emma or Moby Dick!

    Test the classifier on the two sentences below. Sent1 is by Jane Austen, but taken from Persuasion. Sent2 is from Alice’s Adventures in Wonderland by Lewis Carroll.

    Sent1: Anne was to leave them on the morrow, an event which they all dreaded.
    Sent2: So Alice began telling them her adventures from the time when she first saw the White Rabbit.

    Hint: Before you can give the sentences to your classifier’s .predict method, you’ll have to tokenize them and generate the features, as we did for the sentences in Emma and Moby Dick. You may want to test that this code is working by giving some obvious test sentences like "Emma said hello to Emma" and "I love whaling in my boat."

    What label did the classifier give to Sent1 and Sent2? Did it match your expectation?

  4. Label probabilities for a sentence

    Labeling judgments aside, how likely does your model think it is that Sent1 is Austen? That is essentially P(austen | Sent1). To find out, we need to use the .predict_proba method instead of the usual .predict.

    Because this method returns the probability for each class, you need to know which is which. The code below demonstrates how to find the probability estimates assigned to either label for the sentence Hello, world:

    >>> hellofeats = count_vect.transform(["Hello, world"])
    >>> whosaid.predict_proba(hellofeats)
    array([[0.33251418, 0.66748582]])
    >>> whosaid.classes_
    array(['austen', 'melville'], dtype='<U8')
    

    That is, whosaid thinks it’s 33% likely to be Austen and 67% likely to be Melville.

    1. Try it with Sent1. What is P(austen | Sent1)? That is, given Sent1, how likely is it to be Austen? What is P(melville | Sent1)?
    2. How about Sent2, P(austen | Sent2) and P(melville | Sent2)?
    3. From (a) and (b), how confident is your classifier about these classifications? Does that match what you expected?

Submitting the assignment

Submit your code and write-up on Gradescope. (Don’t submit the texts or any other files.)

Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.