Assignment 4: Decision Trees

Assigned Thursday, 15 February
Due Wednesday, 21 February, 11:59 p.m.

Contents

Introduction

This week we learned about trees, which can be used to represent data such as ancestry (a family tree) or the spread of (mis)information through a social network. But trees can also be used for designing programs that classify other data – this is the machine learning model called a decision tree.

You can think of a decision tree as a series of questions that lead to an answer, e.g.,

Source: Creately

In this case, the questions lead to a possible classification of an animal. You may have also seen more serious examples, such as this decision tree about whether someone should be considered a close contact for Covid-19:

Source: Jessica Gould, You Have Questions About NYC’s New School Quarantine Rules. We Have Answers., Gothamist, 2021

This assignment uses very simple notions of what machine learning does, but it’s enough to give you the core idea: We have data on what has happened in the past, and we use statistics about it to make decisions about new situations. If you take a machine learning class, you will learn how to handle more complex notions of similarity between old and new cases, and more nuanced algorithms for deciding what to do based on the data.

This is our last Pyret assignment, and it pulls together many of the concepts we’ve studied.

Problem and data

The medical workers in a town where malaria risk is high want to distribute insecticide-treated bed nets, which help prevent the spread of the disease. Since they have a limited budget, they can only distribute nets to the people who are most at risk. To make these decisions, they want to use machine learning on data about people in the town, including

  • the distance of their home from a lake,
  • their age, and
  • whether they are pregnant.

Living close to a lake is associated with increased risk of malaria, and children and pregnant people both have less immunity to malaria and are therefore more susceptible.

To help them build this system, the medical workers have data about people from a similar town, including whether those people contracted malaria.

As we’ve done in the past, we’ll load this data from a Google spreadsheet. Note that the data has been divided into two sheets: train-data, which we’ll use in order to learn who to give bed nets to, and test-data, which we’ll use to evaluate how well the resulting system does at making these decisions.

Guidelines

  • 4 pointsRemember to test your functions thoroughly and to write your code clearly. Your work will be graded on clarity and thorough testing, as well as on whether it works correctly.
  • To ensure the autograder works, be careful about the names of the functions and the order of their arguments.
  • Remember your resources:

Part 1: Making and using a decision tree

In this first part, we’ll look at the structure of our decision trees and complete the assembly of a decision tree. We will use a decision tree to examine a row of the table and to decide whether to give that person a net.

Task: Open the starter file in CPO and select FileSave a copy.

The code first loads the train-data and test-data tables from the spreadsheet. It uses num-sanitizer to change the Boolean values into numbers – instead of true indicating someone is pregnant, now it’s 1. This will be useful later!

Now let’s consider the data definition for a decision tree:

data DecisionTree:
  | decision(result :: Boolean)
  | question(
      column :: String,
      threshold :: Number,
      less-than-branch :: DecisionTree,
      greater-eq-branch :: DecisionTree)
end

This says that a decision tree is either

  1. a decision whose result is true or false

  2. a question: How does the value in the specified column compare to some threshold? If it’s less than the threshold, we should follow the first branch of the tree. Otherwise, we should follow the second.

So, to give a simple example, a DecisionTree for whether to wear a coat based on the temperature might be

question("temperature", 55,
  decision(true),
  decision(false))

That is, if it’s less than 55 ° F, wear a coat. If not, don’t!

Exercise 1

Let’s make a first decision tree for whether to give a bed net to prevent malaria.

4 points

Task: Define net-dt to represent this DecisionTree:

Run your code, then examine net-dt in the interactions pane, clicking on the different parts. It should look like this:

Exercise 2

4 points

Task: Starting with the provided template (dt-fun), define a function

fun give-net(dt :: DecisionTree, r :: Row) -> Boolean:
  ...
end

that uses the given decision tree to decide whether the person in row r should get a net.

  • If dt is a decision, then give-net should return its value (true or false).
  • If dt is a question, then give-net should compare the value in the given column to the threshold, and then recursively examine either
    • the less-than-branch (if the column value is less than the threshold) or
    • the greater-eq-branch (if the column value is greater than or equal to the threshold).

To test your function (in a where block), you should define a small example table with values that lead to each of the five decision nodes in the tree.

Part 2: Learning a decision tree from data

In Part 1, we built and used a decision tree to decide who should get bed nets, but the thresholds in that decision tree were arbitrary. Now we’ll build a better tree by learning appropriate thresholds from our training data.

Exercise 3

Here’s a way we could learn what distance from a lake is close enough that someone should get a net:

  • We can filter the table to get a table of people who got malaria and a table of people who didn’t.
  • Then we can compute the average distance from a lake for the people in each table.
  • An appropriate threshold would then be halfway between these two averages.
4 points

Task: Use a similar process to compute three new thresholds based on train-data:

  • age-thresh, halfway between the average age for non-pregnant people with malaria and non-pregnant people without malaria,
  • dist-thresh-young, halfway between the average dist-from-lake of non-pregnant people younger than age-thresh, with/without malaria, and
  • dist-thresh-old, halfway between the average dist-from-lake of non-pregnant people who are at least as old as age-thresh, with/without malaria.

Notice that to find these three thresholds, you’ll be repeating the same steps, just with different tables and columns. You can do these three steps manually to compute each threshold, but I recommend writing a function to do it for you: If you write find-separator, you don’t need to give test cases for it.

fun find-separator(col :: String, t :: Table) -> Number:
  doc: "Return a number halfway between the average values of column `col` among people who got malaria and the average values of column `col` among people who did not get malaria"`
  ...
end

Exercise 4 (coach-free)

4 points

Task: Next, build a new DecisionTree named net-dt-learned that is the same as net-dt, except that it uses these new threshold values.

Great! Or is it?

Challenge exercise (coach-free)

To know if the DecisionTree we learned using train-data is any better than the one we started with, we should evaluate them both on a different set of data – namely, our test-data.

2 points

Task: Starting with the test-data table, make two new tables, results1 and results2, each of which has a column called predicted, which contains the results of running give-net on each row.

  • For results1, you should call give-net with the DecisionTree net-dt from Part 1, and
  • For results2, you should call give-net with the DecisionTree net-dt-learned from Part 2.

Hint: Use a lambda (lam) expression so you can provide both arguments to give-net!

There are many ways to evaluate predictions, but the most common is called an F1 score, which balances our desire for high precision (only giving nets to those who actually need them) with our desire for high recall (give nets to everyone who needs one).

The following function will compute the F1 score for a table, using the value in the contracted-malaria column and in the new predicted column:

fun f1-score(t :: Table) -> Number:
  doc: "Return the F1 score for a table of data with predictions"
  true-pos = filter-with(t, lam(r): r["contracted-malaria"] and r["predicted"] end).length()
  false-pos = filter-with(t, lam(r): not(r["contracted-malaria"]) and r["predicted"] end).length()
  false-neg = filter-with(t, lam(r): r["contracted-malaria"] and not(r["predicted"]) end).length()
  true-pos / (true-pos + (1/2 * (false-pos + false-neg)))
end

Task: Run f1-score on results1 and on results2 and observe: Which has a higher score? This tells you whether we actually improved the classifier by learning it from the data!

Submitting the assignment

  1. Download your file (FileDownload) and ensure it’s named asmt04.arr.
  2. Upload your assignment on Gradescope.

Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.