Assigned | Thursday, 15 February |
Due | Wednesday, 21 February, 11:59 p.m. |
Contents
- Introduction
- Problem and data
- Guidelines
- Part 1: Making and using a decision tree
- Part 2: Learning a decision tree from data
- Submitting the assignment
Introduction
This week we learned about trees, which can be used to represent
data such as ancestry (a family tree) or the spread of (mis)information
through a social network. But trees can also be used for designing programs
that classify other data – this is the machine learning model called
a decision tree.
You can think of a decision tree as a series of questions that lead to
an answer, e.g.,
In this case, the questions lead to a possible classification of an
animal. You may have also seen more serious examples, such as this
decision tree about whether someone should be considered a close contact
for Covid-19:
This assignment uses very simple notions of what machine learning
does, but it’s enough to give you the core idea: We have data on what
has happened in the past, and we use statistics about it to make decisions
about new situations.
If you take a machine learning class, you will learn
how to handle more complex notions of similarity between old and new
cases, and more nuanced algorithms for deciding what to do based on the
data.
This is our last Pyret assignment, and it pulls together many of the
concepts we’ve studied.
Problem and data
The medical workers in a town where malaria risk is high want to
distribute insecticide-treated
bed nets, which help prevent the spread of the disease. Since they
have a limited budget, they can only distribute nets to the people who
are most at risk. To make these decisions, they want to use machine
learning on data about people in the town, including
- the distance of their home from a lake,
- their age, and
- whether they are pregnant.
Living close to a lake is associated with increased risk of malaria, and
children and pregnant people both have less immunity to malaria and are
therefore more susceptible.
To help them build this system, the medical workers have data about
people from a similar town, including whether those people contracted
malaria.
As we’ve done in the past, we’ll load this data
from a
Google spreadsheet. Note that the data has been divided into two
sheets: train-data
, which we’ll use in order to learn
who to give bed nets to, and test-data
, which we’ll
use to evaluate how well the resulting system does at making these
decisions.
Guidelines
- 4 pointsRemember to test your
functions thoroughly and to write your code clearly. Your
work will be graded on clarity and thorough testing, as
well as on whether it works correctly.
- To ensure the autograder works, be careful about the names of the
functions and the order of their arguments.
- Remember your resources:
- Table documentation
- Pyret testing design and code clarity
guide
- Coaches
- Your instructor
Part 1: Making and using a decision tree
In this first part, we’ll look at the structure of our decision
trees and complete the assembly of a decision tree. We will use a decision
tree to examine a row of the table and to decide whether to give that
person a net.
Task:
Open
the starter
file in CPO and select File → Save a
copy.
The code first loads the train-data
and test-data
tables from the spreadsheet. It
uses num-sanitizer
to change the Boolean values into
numbers – instead of true
indicating
someone is pregnant, now it’s 1
. This
will be useful later!
Now let’s consider the data definition for a decision tree:
data DecisionTree:
| decision(result :: Boolean)
| question(
column :: String,
threshold :: Number,
less-than-branch :: DecisionTree,
greater-eq-branch :: DecisionTree)
end
This says that a decision tree is either
a decision
whose result
is true
or false
a question
: How does the value in the
specified column
compare to some threshold
? If
it’s less than the threshold, we should follow the first branch of
the tree. Otherwise, we should follow the second.
So, to give a simple example, a DecisionTree
for whether to
wear a coat based on the temperature might be
question("temperature", 55,
decision(true),
decision(false))
That is, if it’s less than 55 ° F, wear a coat. If
not, don’t!
Exercise 1
Let’s make a first decision tree for whether to give a bed net to
prevent malaria.
Run your code, then examine net-dt
in the interactions pane,
clicking on the different parts. It should look like this:
Exercise 2
4 points
Task:
Starting with the provided template (dt-fun
), define a
function
fun give-net(dt :: DecisionTree, r :: Row) -> Boolean:
...
end
that uses the given decision tree to decide whether the person in
row r
should get a net.
- If
dt
is a decision
,
then give-net
should return its value
(true
or false
).
- If
dt
is a question
,
then give-net
should compare the value in the given column
to the threshold, and then recursively examine either
- the
less-than-branch
(if the column value is less than
the threshold) or
- the
greater-eq-branch
(if the column value is greater
than or equal to the threshold).
To test your function (in a where
block), you should
define a small example table with values that lead to each of the
five decision
nodes in the tree.
Part 2: Learning a decision tree from data
In Part 1, we built and used a decision tree to decide who should
get bed nets, but the thresholds in that decision tree were arbitrary.
Now we’ll build a better tree by learning appropriate thresholds
from our training data.
Exercise 3
Here’s a way we could learn what distance from a lake is close
enough that someone should get a net:
- We can filter the table to get a table of people who got malaria and
a table of people who didn’t.
- Then we can compute the average distance from a lake for the people
in each table.
- An appropriate threshold would then be halfway between these two
averages.
4 points
Task:
Use a similar process to compute three new thresholds based
on train-data
:
age-thresh
,
halfway between the average age for non-pregnant people with malaria
and non-pregnant people without malaria,
dist-thresh-young
,
halfway between the average dist-from-lake
of non-pregnant
people younger than age-thresh
, with/without malaria,
and
dist-thresh-old
,
halfway between the average dist-from-lake
of non-pregnant
people who are at least as old as age-thresh
, with/without
malaria.
Notice that to find these three thresholds, you’ll be repeating
the same steps, just with different tables and columns. You can do these
three steps manually to compute each threshold, but I recommend writing
a function to do it for you: If you
write find-separator
, you don’t need to give test
cases for it.
fun find-separator(col :: String, t :: Table) -> Number:
doc: "Return a number halfway between the average values of column `col` among people who got malaria and the average values of column `col` among people who did not get malaria"`
...
end
Exercise 4 (coach-free)
4 points
Task:
Next, build a new DecisionTree
named net-dt-learned
that is the same
as net-dt
, except that it uses these new threshold values.
Great! Or is it?
Challenge
exercise (coach-free)
To know if the DecisionTree
we learned using
train-data
is any better than the one we started with, we should
evaluate them both on a different set of data – namely, our
test-data
.
2 points
Task:
Starting with the test-data
table, make two new
tables, results1
and results2
,
each of which has a column called predicted
, which contains
the results of running give-net
on each row.
- For
results1
, you should call give-net
with the DecisionTree
net-dt
from Part 1,
and
- For
results2
, you should call give-net
with the DecisionTree
net-dt-learned
from
Part 2.
Hint: Use a lambda
(lam
) expression so you can provide
both arguments to give-net
!
There are many ways to evaluate predictions, but the most common is
called an F1 score,
which balances our desire for high precision (only giving nets
to those who actually need them) with our desire for
high recall (give nets to everyone who needs one).
The following function will compute the F1 score for a table, using
the value in the contracted-malaria
column and in the
new predicted
column:
fun f1-score(t :: Table) -> Number:
doc: "Return the F1 score for a table of data with predictions"
true-pos = filter-with(t, lam(r): r["contracted-malaria"] and r["predicted"] end).length()
false-pos = filter-with(t, lam(r): not(r["contracted-malaria"]) and r["predicted"] end).length()
false-neg = filter-with(t, lam(r): r["contracted-malaria"] and not(r["predicted"]) end).length()
true-pos / (true-pos + (1/2 * (false-pos + false-neg)))
end
Task:
Run f1-score
on results1
and
on results2
and observe: Which has a higher score? This
tells you whether we actually improved the classifier by learning it
from the data!
Submitting the assignment
- Download your file (File → Download) and ensure
it’s named
asmt04.arr
.
- Upload your assignment on
Gradescope.
Note: You can submit as many times as you want before the
deadline. Only your latest submission will be graded.
- Introduction
- Problem and data
- Guidelines
- Part 1: Making and using a decision tree
- Part 2: Learning a decision tree from data
- Submitting the assignment
Introduction
This week we learned about trees, which can be used to represent data such as ancestry (a family tree) or the spread of (mis)information through a social network. But trees can also be used for designing programs that classify other data – this is the machine learning model called a decision tree.
You can think of a decision tree as a series of questions that lead to an answer, e.g.,
In this case, the questions lead to a possible classification of an animal. You may have also seen more serious examples, such as this decision tree about whether someone should be considered a close contact for Covid-19:
This assignment uses very simple notions of what machine learning does, but it’s enough to give you the core idea: We have data on what has happened in the past, and we use statistics about it to make decisions about new situations. If you take a machine learning class, you will learn how to handle more complex notions of similarity between old and new cases, and more nuanced algorithms for deciding what to do based on the data.
This is our last Pyret assignment, and it pulls together many of the concepts we’ve studied.
Problem and data
The medical workers in a town where malaria risk is high want to distribute insecticide-treated bed nets, which help prevent the spread of the disease. Since they have a limited budget, they can only distribute nets to the people who are most at risk. To make these decisions, they want to use machine learning on data about people in the town, including
- the distance of their home from a lake,
- their age, and
- whether they are pregnant.
Living close to a lake is associated with increased risk of malaria, and children and pregnant people both have less immunity to malaria and are therefore more susceptible.
To help them build this system, the medical workers have data about people from a similar town, including whether those people contracted malaria.
As we’ve done in the past, we’ll load this data
from a
Google spreadsheet. Note that the data has been divided into two
sheets: train-data
, which we’ll use in order to learn
who to give bed nets to, and test-data
, which we’ll
use to evaluate how well the resulting system does at making these
decisions.
Guidelines
- 4 pointsRemember to test your functions thoroughly and to write your code clearly. Your work will be graded on clarity and thorough testing, as well as on whether it works correctly.
- To ensure the autograder works, be careful about the names of the functions and the order of their arguments.
- Remember your resources:
- Table documentation
- Pyret testing design and code clarity guide
- Coaches
- Your instructor
Part 1: Making and using a decision tree
In this first part, we’ll look at the structure of our decision trees and complete the assembly of a decision tree. We will use a decision tree to examine a row of the table and to decide whether to give that person a net.
Task: Open the starter file in CPO and select File → Save a copy.
The code first loads the train-data
and test-data
tables from the spreadsheet. It
uses num-sanitizer
to change the Boolean values into
numbers – instead of true
indicating
someone is pregnant, now it’s 1
. This
will be useful later!
Now let’s consider the data definition for a decision tree:
data DecisionTree: | decision(result :: Boolean) | question( column :: String, threshold :: Number, less-than-branch :: DecisionTree, greater-eq-branch :: DecisionTree) end
This says that a decision tree is either
a
decision
whoseresult
istrue
orfalse
a
question
: How does the value in the specifiedcolumn
compare to somethreshold
? If it’s less than the threshold, we should follow the first branch of the tree. Otherwise, we should follow the second.
So, to give a simple example, a DecisionTree
for whether to
wear a coat based on the temperature might be
question("temperature", 55, decision(true), decision(false))
That is, if it’s less than 55 ° F, wear a coat. If not, don’t!
Exercise 1
Let’s make a first decision tree for whether to give a bed net to prevent malaria.
Run your code, then examine net-dt
in the interactions pane,
clicking on the different parts. It should look like this:
Exercise 2
Task:
Starting with the provided template (dt-fun
), define a
function
fun give-net(dt :: DecisionTree, r :: Row) -> Boolean: ... end
that uses the given decision tree to decide whether the person in
row r
should get a net.
- If
dt
is adecision
, thengive-net
should return its value (true
orfalse
). - If
dt
is aquestion
, thengive-net
should compare the value in the given column to the threshold, and then recursively examine either- the
less-than-branch
(if the column value is less than the threshold) or - the
greater-eq-branch
(if the column value is greater than or equal to the threshold).
- the
To test your function (in a where
block), you should
define a small example table with values that lead to each of the
five decision
nodes in the tree.
Part 2: Learning a decision tree from data
In Part 1, we built and used a decision tree to decide who should get bed nets, but the thresholds in that decision tree were arbitrary. Now we’ll build a better tree by learning appropriate thresholds from our training data.
Exercise 3
Here’s a way we could learn what distance from a lake is close enough that someone should get a net:
- We can filter the table to get a table of people who got malaria and a table of people who didn’t.
- Then we can compute the average distance from a lake for the people in each table.
- An appropriate threshold would then be halfway between these two averages.
Task:
Use a similar process to compute three new thresholds based
on train-data
:
age-thresh
, halfway between the average age for non-pregnant people with malaria and non-pregnant people without malaria,dist-thresh-young
, halfway between the averagedist-from-lake
of non-pregnant people younger thanage-thresh
, with/without malaria, anddist-thresh-old
, halfway between the averagedist-from-lake
of non-pregnant people who are at least as old asage-thresh
, with/without malaria.
Notice that to find these three thresholds, you’ll be repeating
the same steps, just with different tables and columns. You can do these
three steps manually to compute each threshold, but I recommend writing
a function to do it for you: If you
write find-separator
, you don’t need to give test
cases for it.
fun find-separator(col :: String, t :: Table) -> Number: doc: "Return a number halfway between the average values of column `col` among people who got malaria and the average values of column `col` among people who did not get malaria"` ... end
Exercise 4 (coach-free)
Task:
Next, build a new DecisionTree
named net-dt-learned
that is the same
as net-dt
, except that it uses these new threshold values.
Great! Or is it?
Challenge exercise (coach-free)
To know if the DecisionTree
we learned using
train-data
is any better than the one we started with, we should
evaluate them both on a different set of data – namely, our
test-data
.
Task:
Starting with the test-data
table, make two new
tables, results1
and results2
,
each of which has a column called predicted
, which contains
the results of running give-net
on each row.
- For
results1
, you should callgive-net
with theDecisionTree
net-dt
from Part 1, and - For
results2
, you should callgive-net
with theDecisionTree
net-dt-learned
from Part 2.
Hint: Use a lambda
(lam
) expression so you can provide
both arguments to give-net
!
There are many ways to evaluate predictions, but the most common is called an F1 score, which balances our desire for high precision (only giving nets to those who actually need them) with our desire for high recall (give nets to everyone who needs one).
The following function will compute the F1 score for a table, using
the value in the contracted-malaria
column and in the
new predicted
column:
fun f1-score(t :: Table) -> Number: doc: "Return the F1 score for a table of data with predictions" true-pos = filter-with(t, lam(r): r["contracted-malaria"] and r["predicted"] end).length() false-pos = filter-with(t, lam(r): not(r["contracted-malaria"]) and r["predicted"] end).length() false-neg = filter-with(t, lam(r): r["contracted-malaria"] and not(r["predicted"]) end).length() true-pos / (true-pos + (1/2 * (false-pos + false-neg))) end
Task:
Run f1-score
on results1
and
on results2
and observe: Which has a higher score? This
tells you whether we actually improved the classifier by learning it
from the data!
Submitting the assignment
- Download your file (File → Download) and ensure
it’s named
asmt04.arr
. - Upload your assignment on Gradescope.
Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.