Assignment 6: Decision Trees

Assigned: Thursday, 27 October
Due: Wednesday, 2 November, 11:59 p.m.

This week we learned about trees, which can be used to represent data such as ancestry (a family tree) or the spread of (mis)information through a social network. But trees can also be used for designing programs that classify other data – this is the machine-learning model called a decision tree.

You can think of a decision tree as a series of questions that lead to an answer, e.g.,

Source: Creately

In this case, the questions lead to a possible classification of an animal. You may have also seen more serious examples, such as this decision tree about whether someone should be considered a close contact for Covid-19:

Source: Jessica Gould, “You Have Questions About NYC’s New School Quarantine Rules. We Have Answers.”, Gothamist, 2021

This assignment uses very simple notions of what machine learning does, but it’s enough to give you the core idea: We have data on what has happened in the past, and we use statistics about it to make decisions about new situations. (If you take a machine learning class, you will learn how to handle more complex notions of similarity between old and new cases, and more nuanced algorithms for deciding what to do based on the data.)

This is our last Pyret assignment, and it pulls together many of the concepts we’ve studied.

The medical workers in a town where malaria risk is high want to distribute insecticide-treated bed nets, which help prevent the spread of the disease. Since they have a limited budget, they can only distribute nets to the people who are most at risk. To make these decisions, they want to use machine learning on data about people in the town, including

  • the distance of their home from a lake,
  • their age, and
  • whether they are pregnant.

Living close to a lake is associated with increased risk of malaria, and children and pregnant people both have less immunity to malaria and are therefore more susceptible.

To help them build this system, the medical workers have data about people from a similar town, including whether those people contracted malaria.

As we’ve done in the past, we’ll load this data from a Google spreadsheet. Note that the data has been divided into two sheets: train-data, which we’ll use in order to learn who to give bed nets to, and test-data, which we’ll use to evaluate how well the resulting system does at making these decisions.

  • Do not change the name of the file or the functions.
  • Remember to test your functions thoroughly and to write your code clearly. Your work will be graded on clarity and thorough testing, as well as on whether it works correctly.
  • Remember your resources:

In this first part, we’ll look at the structure of our decision trees and complete the assembly of a decision tree. We will use a decision tree to examine a row of the table and to decide whether to give that person a net.

Task: Open the starter file in CPO and select FileSave a copy.

The code first loads the train-data and test-data tables from the spreadsheet. Note that it uses num-sanitizer to change the Boolean values into numbers – instead of true indicating someone is pregnant, now it’s 1. This will be useful later!

Now let’s consider the data definition for a decision tree:

data DecisionTree:
  | decision(answer :: Boolean)
  | question(
      column :: String,
      threshold :: Number,
      less-than-branch :: DecisionTree,
      greater-eq-branch :: DecisionTree)
end

This says that a decision tree is either

  1. a decision whose answer is true or false

  2. a question: How does the value in the specified column compare to some threshold? If it’s less than the threshold, we should follow the first branch of the tree. Otherwise, we should follow the second.

So, to give a simple example, a DecisionTree for whether to wear a coat based on the temperature might be

question("temperature", 55,
  decision(true),
  decision(false))

That is, if it’s less than 55 °F, wear a coat. If not, don’t!

Let’s make a first decision tree for whether to give a bed net to prevent malaria.

Task: Define net-dt to represent this DecisionTree:

Run your code, then examine net-dt in the interactions pane, clicking on the different parts. It should look like this:

Task: Starting with the provided template (dt-fun), define a function

fun give-net(dt :: DecisionTree, r :: Row) -> Boolean:
  ...
end

that uses the given decision tree to decide whether the person in row r should get a net.

  • If dt is a decision, then give-net should return its value (true or false).
  • If dt is a question, then give-net should compare the value in the given column to the threshold, and then recursively examine either
    • the less-than-branch (if the column value is less than the threshold) or
    • the greater-eq-branch (if the column value is greater than or equal to the threshold).

To test your function (in a where block), you should define a small example table with values that lead to each of the five decision nodes in the tree.

In Part 1, we built and used a decision tree to decide who should get bed nets, but the thresholds in that decision tree were arbitrary. Now we’ll build a better tree by learning appropriate thresholds from our training data.

Here’s a way we could learn what distance from a lake is close enough that someone should get a net:

  • We can filter the table to get a table of people who got malaria and a table of people who didn’t.

  • Then we can compute the average distance from a lake for the people in each table.

  • An appropriate threshold would then be halfway between these two averages.

Task: Use a similar process to compute three new thresholds based on train-data:

  • age-thresh, halfway between the average age for non-pregnant people with malaria and non-pregnant people without malaria,
  • dist-thresh-young, halfway between the average dist-from-lake of non-pregnant people younger than age-thresh, with/without malaria, and
  • dist-thresh-old, halfway between the average dist-from-lake of non-pregnant people who are at least as old as age-thresh, with/without malaria.

Notice that to find these three thresholds, you’ll be repeating the same steps, just with different tables and columns. You can do these three steps manually to compute each threshold, but I recommend writing a function to do it for you:

fun find-separator(col :: String, t :: Table) -> Number:
  doc: "Return a number halfway between the average values of column `col` among people who got malaria and the average values of column `col` among people who did not get malaria"
  ...
end

If you write find-separator, you don’t need to give test cases for it.

Task: Next, build a new DecisionTree, net-dt-learned, that is the same as net-dt, except that it uses these new threshold values.

Great! Or is it?

To know if the DecisionTree we learned using train-data is any better than the one we started with, we should evaluate them both on a different set of data – namely, our test-data.

Task: Make a table called results1 by adding a new column called predicted to the test-data table. This new column should have the results of running the give-net function with DecisionTree net-dt on each row.

Hint: Use a lambda expression so you can provide both arguments to give-net.

Task: Make a table called results2, where the predicted column is computed using net-dt-learned.

There are many ways to evaluate predictions, but the most common is called an F1 score, which balances our desire for high precision (only giving nets to those who actually need them) with our desire for high recall (give nets to everyone who needs one).

The following function will compute the F1 score for a table, using the value in the contracted-malaria column and in the new predicted column:

fun f1-score(t :: Table) -> Number:
  doc: "Return the F1 score for a table of data with predictions"

  true-pos = filter-with(t, lam(r): r["contracted-malaria"] and r["predicted"] end).length()
  false-pos = filter-with(t, lam(r): not(r["contracted-malaria"]) and r["predicted"] end).length()
  false-neg = filter-with(t, lam(r): r["contracted-malaria"] and not(r["predicted"]) end).length()

  true-pos / (true-pos + (1/2 * (false-pos + false-neg)))
end

Task: Run f1-score on results1 and on results2. Which has a higher score?

  1. Download your file (FileDownload) and ensure it’s named asmt06.arr.

  2. Upload your assignment on Gradescope.

Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.

This assignment contains material adapted from Phyllis Frankl.