# Assignment 4: Decision Trees

 Assigned Thursday, 15 February Due Wednesday, 21 February, 11:59 p.m.

## Introduction

This week we learned about trees, which can be used to represent data such as ancestry (a family tree) or the spread of (mis)information through a social network. But trees can also be used for designing programs that classify other data – this is the machine learning model called a decision tree.

You can think of a decision tree as a series of questions that lead to an answer, e.g.,

In this case, the questions lead to a possible classification of an animal. You may have also seen more serious examples, such as this decision tree about whether someone should be considered a close contact for Covid-19:

This assignment uses very simple notions of what machine learning does, but it’s enough to give you the core idea: We have data on what has happened in the past, and we use statistics about it to make decisions about new situations. If you take a machine learning class, you will learn how to handle more complex notions of similarity between old and new cases, and more nuanced algorithms for deciding what to do based on the data.

This is our last Pyret assignment, and it pulls together many of the concepts we’ve studied.

## Problem and data

The medical workers in a town where malaria risk is high want to distribute insecticide-treated bed nets, which help prevent the spread of the disease. Since they have a limited budget, they can only distribute nets to the people who are most at risk. To make these decisions, they want to use machine learning on data about people in the town, including

• the distance of their home from a lake,
• their age, and
• whether they are pregnant.

Living close to a lake is associated with increased risk of malaria, and children and pregnant people both have less immunity to malaria and are therefore more susceptible.

To help them build this system, the medical workers have data about people from a similar town, including whether those people contracted malaria.

As we’ve done in the past, we’ll load this data from a Google spreadsheet. Note that the data has been divided into two sheets: `train-data`, which we’ll use in order to learn who to give bed nets to, and `test-data`, which we’ll use to evaluate how well the resulting system does at making these decisions.

## Guidelines

• 4 pointsRemember to test your functions thoroughly and to write your code clearly. Your work will be graded on clarity and thorough testing, as well as on whether it works correctly.
• To ensure the autograder works, be careful about the names of the functions and the order of their arguments.

## Part 1: Making and using a decision tree

In this first part, we’ll look at the structure of our decision trees and complete the assembly of a decision tree. We will use a decision tree to examine a row of the table and to decide whether to give that person a net.

Task: Open the starter file in CPO and select FileSave a copy.

The code first loads the `train-data` and `test-data` tables from the spreadsheet. It uses `num-sanitizer` to change the Boolean values into numbers – instead of `true` indicating someone is pregnant, now it’s `1`. This will be useful later!

Now let’s consider the data definition for a decision tree:

```data DecisionTree:
| decision(result :: Boolean)
| question(
column :: String,
threshold :: Number,
less-than-branch :: DecisionTree,
greater-eq-branch :: DecisionTree)
end
```

This says that a decision tree is either

1. a `decision` whose `result` is `true` or `false`

2. a `question`: How does the value in the specified `column` compare to some `threshold`? If it’s less than the threshold, we should follow the first branch of the tree. Otherwise, we should follow the second.

So, to give a simple example, a `DecisionTree` for whether to wear a coat based on the temperature might be

```question("temperature", 55,
decision(true),
decision(false))
```

That is, if it’s less than 55 ° F, wear a coat. If not, don’t!

### Exercise 1

Let’s make a first decision tree for whether to give a bed net to prevent malaria.

4 points

Task: Define `net-dt` to represent this `DecisionTree`:

Run your code, then examine `net-dt` in the interactions pane, clicking on the different parts. It should look like this:

### Exercise 2

4 points

Task: Starting with the provided template (`dt-fun`), define a function

```fun give-net(dt :: DecisionTree, r :: Row) -> Boolean:
...
end
```

that uses the given decision tree to decide whether the person in row `r` should get a net.

• If `dt` is a `decision`, then `give-net` should return its value (`true` or `false`).
• If `dt` is a `question`, then `give-net` should compare the value in the given column to the threshold, and then recursively examine either
• the `less-than-branch` (if the column value is less than the threshold) or
• the `greater-eq-branch` (if the column value is greater than or equal to the threshold).

To test your function (in a `where` block), you should define a small example table with values that lead to each of the five `decision` nodes in the tree.

## Part 2: Learning a decision tree from data

In Part 1, we built and used a decision tree to decide who should get bed nets, but the thresholds in that decision tree were arbitrary. Now we’ll build a better tree by learning appropriate thresholds from our training data.

### Exercise 3

Here’s a way we could learn what distance from a lake is close enough that someone should get a net:

• We can filter the table to get a table of people who got malaria and a table of people who didn’t.
• Then we can compute the average distance from a lake for the people in each table.
• An appropriate threshold would then be halfway between these two averages.
4 points

Task: Use a similar process to compute three new thresholds based on `train-data`:

• `age-thresh`, halfway between the average age for non-pregnant people with malaria and non-pregnant people without malaria,
• `dist-thresh-young`, halfway between the average `dist-from-lake` of non-pregnant people younger than `age-thresh`, with/without malaria, and
• `dist-thresh-old`, halfway between the average `dist-from-lake` of non-pregnant people who are at least as old as `age-thresh`, with/without malaria.

Notice that to find these three thresholds, you’ll be repeating the same steps, just with different tables and columns. You can do these three steps manually to compute each threshold, but I recommend writing a function to do it for you: If you write `find-separator`, you don’t need to give test cases for it.

```fun find-separator(col :: String, t :: Table) -> Number:
doc: "Return a number halfway between the average values of column `col` among people who got malaria and the average values of column `col` among people who did not get malaria"`
...
end
```

### Exercise 4 (coach-free)

4 points

Task: Next, build a new `DecisionTree` named `net-dt-learned` that is the same as `net-dt`, except that it uses these new threshold values.

Great! Or is it?

### Challenge exercise (coach-free)

To know if the `DecisionTree` we learned using `train-data` is any better than the one we started with, we should evaluate them both on a different set of data – namely, our `test-data`.

2 points

Task: Starting with the `test-data` table, make two new tables, `results1` and `results2`, each of which has a column called `predicted`, which contains the results of running `give-net` on each row.

• For `results1`, you should call `give-net` with the `DecisionTree` `net-dt` from Part 1, and
• For `results2`, you should call `give-net` with the `DecisionTree` `net-dt-learned` from Part 2.

Hint: Use a lambda (`lam`) expression so you can provide both arguments to `give-net`!

There are many ways to evaluate predictions, but the most common is called an F1 score, which balances our desire for high precision (only giving nets to those who actually need them) with our desire for high recall (give nets to everyone who needs one).

The following function will compute the F1 score for a table, using the value in the `contracted-malaria` column and in the new `predicted` column:

```fun f1-score(t :: Table) -> Number:
doc: "Return the F1 score for a table of data with predictions"
true-pos = filter-with(t, lam(r): r["contracted-malaria"] and r["predicted"] end).length()
false-pos = filter-with(t, lam(r): not(r["contracted-malaria"]) and r["predicted"] end).length()
false-neg = filter-with(t, lam(r): r["contracted-malaria"] and not(r["predicted"]) end).length()
true-pos / (true-pos + (1/2 * (false-pos + false-neg)))
end
```

Task: Run `f1-score` on `results1` and on `results2` and observe: Which has a higher score? This tells you whether we actually improved the classifier by learning it from the data!

## Submitting the assignment

1. Download your file (FileDownload) and ensure it’s named `asmt04.arr`.