The purpose of this lab is to give you practice:
- extracting rows and columns from a table,
- writing and testing helper functions,
- filtering data with
filter-with
, - adding a column to a table,
- summarizing columns, and
- visualizing relationships.
The purpose of this lab is to give you practice:
filter-with
,Today’s lab – and all future labs, unless otherwise noted – can be completed working in a pair, if you would like to do so.
FiveThirtyEight conducted a survey in which tens of thousands of people were asked to choose between two candies. The data is analyzed in the article “The Ultimate Halloween Candy Power Ranking”, which is worth a read after the lab. From the responses, they compiled a data set with
Number
attributes like the winning percentage, relative
price, and sugar percentage, and alsoBoolean
attributes such as whether the candy has
chocolate, is fruity, has caramel, or is hard.In this lab, you’ll be looking at the relationships between these columns.
To get started, copy this code for loading a Pyret table from a Google Sheet and paste it at the top of your program, in the definitions pane:
include gdrive-sheets include shared-gdrive("dcic-2021", "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep") ssid = "1YoN-Z8T5DqNEmV_hpR3i6-yP5Poe9gdYwbP9nQhQlBw" data-sheet = load-spreadsheet(ssid) candy-data = load-table: name, chocolate, fruity, caramel, nutty, nougat, crisped-rice, hard, bar, pluribus, sugar-percent, price-percent, win-percent source: data-sheet.sheet-by-name("candy-data", true) end
Save this code in a file named lab03.arr
and press
the Run button to process the code in the definitions pane.
Then type candy-data
in the interactions pane to see the
data as a table.
Note: For this lab, you’ll want to refer to the Pyret Tables Documentation (and not the built-in Pyret documentation).
For each exercise, separate it from the next one by copy-and-pasting this line:
# -----------------------------------
Let’s use the power of filtering to learn more from our candy data.
We want to know which candies have the most sugar!
Task:
First you’ll first need to write a function that takes a table Row
as input and returns a Boolean that answers the question “is the
sugar-percent
of the Row greater than 75%?”
We’ve started the function for you:
fun over-75-percent-sugar(r :: Row) -> Boolean: doc: "Returns true if the sugar-percent column in a row is over 75%" ... where: over-75-percent-sugar(candy-data.row-n(0)) is false over-75-percent-sugar(candy-data.row-n(4)) is true end
Note the examples in the where
block to check that the
function works correctly! Often when we’re working with a big table
of data, we’ll define a second, smaller table just for testing,
since our real data set might change. For this lab, the data set is
“frozen” – it’s not changing – so it’s
okay to use it for testing your functions.
Task: Next, write an expression (not a function) to filter our table to only include the candy with more than 75% sugar.
You may want to review the filter-with
function we saw in class. The first input to filter-with
is
the Table to be filtered and second input is the name of the predicate
function (which you just wrote!) That’s right, as you hopefully remember, functions
can be passed as inputs to other functions! If you’ve got
everything right, your expression will evaluate to a table with 15 rows in
it.
Now that we the candies that will satisfy our sweet tooth, let’s consider the high end of the market.
Task:
Write an expression in the definitions pane that produces a table
containing only the candies where price-percent
is greater
than 90%.
This requires you to write another helper function. As we demonstrated
above, be sure to test this function with a where
block –
and to do so for each of the subsequent functions you write.
The result should be a table with eight rows in it.
How many of the candies have chocolate?
Task: Write an expression (in the definitions pane) that outputs the number of chocolate candies.
You can get the length of a table by writing
⟨table⟩.length()
, where
⟨table⟩ is the name of a table or an expression
that evaluates to a table.
If your filter is correct, you’ll find a table with 37 chocolate candies in it.
Of the candies that have chocolate, what proportion also have caramel?
Task: Write an expression in the definitions pane that outputs the proportion of chocolate candies with caramel.
If you got the right answer, you’ll see that 10/37 or about 27% of the chocolate candies also have caramel.
Checkpoint: When you reach this point in the lab, raise your hand and go through your solutions with a coach or the instructor before continuing.
Task:
Build a column of Boolean values that indicates whether a candy is
fruity and hard, but not a pluribus
(i.e., multiple candies in a
packet, like Skittles or M&Ms).Hint: Take a look at the
build-column
function.
Task: Write an expression in the definitions pane that uses this new column to compute how many candies meet this condition.
Task: Write an expression in the definitions pane to compute which candy has the highest winning percentage out of those candies for which your new column is true.
Hint: This requires the use of the
order-by
function, in addition to
.row-n
and
filter-with
.
Task: Write an expression in the definitions pane to compute the average winning percentage for the candies for which this column is true.
Hint: This requires the use of the
mean
function.
What’s the relationship between sugar and winning percentage? Do these two attributes seem correlated? One way to gain intuition on this is to create a scatterplot that puts one attribute on each axis.
Task:
Look at the documentation for the scatter-plot
function. Try generating a scatterplot of sugar vs winning percentage.
We recommend putting sugar on the x-axis since our hypothesis is that it affects a candy’s winning percentage.
Task: Write an expression in the definitions pane that creates the scatterplot, and, in a comment, summarize what, if anything, you think it shows.
This lab has mostly been about getting you comfortable working with tabular data and practicing some common operators on tables. It also gets you thinking about our course’s focus on data: What patterns of manipulating data do we often use in computations? How does the organization of our data impact our ability to answer these questions?
Here, we see that filtering, ordering, and summarizing data are some of the key operations. So far we’ve only looked at these operations on tabular data, but these same building blocks will arise many times through this course. When you have a computation to perform around data, you should start by thinking through what combinations of filtering, sorting and summarizing will help you compute your answer.
lab03.arr
file to the Lab 3 assignment
on Gradescope.
If you worked with a partner, you should submit a single copy of the lab
with both your names – see
these instructions.