Lab 3: Candy analysis

16 September 2022

The purpose of this lab is to give you practice:

  • extracting rows and columns from a table,
  • writing and testing helper functions,
  • filtering data with filter-with,
  • adding a column to a table,
  • summarizing columns, and
  • visualizing relationships.

This lab can be completed in pairs!

If you choose to work in a pair, you’ll make a single code file which you’ll upload to Gradescope with both your names.

As you work through the lab, take turns “driving” and “navigating”. That is, for a while you type in CPO while your partner is reading the assignment and then you trade.


FiveThirtyEight conducted a survey in which tens of thousands of people were asked to choose between two candies. From the responses, they compiled a data set with

  • Number attributes like the winning percentage, relative price, and sugar percentage, and also
  • Boolean attributes such as whether the candy has chocolate, is fruity, has caramel, or is hard.

This data is analyzed in the article “The Ultimate Halloween Candy Power Ranking”, which is worth a read after the lab.

In this lab, you’ll be looking at the relationships between these columns.

To get started, include this code at the top of your program, in the definitions pane:

include gdrive-sheets

include shared-gdrive("dcic-2021",
  "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep")


ssid = "1YoN-Z8T5DqNEmV_hpR3i6-yP5Poe9gdYwbP9nQhQlBw"
data-sheet = load-spreadsheet(ssid)

candy-data =
  load-table:
    name, chocolate, fruity, caramel, nutty, nougat, crisped-rice,
    hard, bar, pluribus, sugar-percent, price-percent, win-percent
    source: data-sheet.sheet-by-name("candy-data", true)
  end

This code loads a Pyret table from a Google Sheet. Press the Run button to process the code in the definitions pane. Then type candy-data in the interactions pane to see the data as a table.

Note: For this lab, you’ll want to refer to the Pyret Tables Documentation (and not the built-in Pyret documentation).

Let’s use the power of filtering to learn more from our candy data.

Suggestion: For each section, separate it from the next part by copy-and-pasting this line:

# -----------------------------------

We want to know which candies have the most sugar!

Task: First you’ll first need to write a function that takes a table Row as input and returns a Boolean that answers the question “is the sugar-percent of the Row greater than 75%?”. We’ve started the function for you:

fun over-75-percent-sugar(r :: Row) -> Boolean:
  doc: "Return true if the sugar-percent column in a row is over 75%"
  ...
where:
  over-75-percent-sugar(candy-data.row-n(0)) is false
  over-75-percent-sugar(candy-data.row-n(4)) is true
end

Note the examples in the where block to check that the function works correctly! Often when we’re working with a big table of data, we’ll define a second, smaller table just for testing, since our real data set might change. For this lab, the data set is frozen and it’s okay to use it for testing your functions.

Task: Next, write an expression (not a function) to filter our table to only include the candy with more than 75% sugar.

You may want to review the filter-with function we saw in class. The first input to filter-with is the Table to be filtered and second input is the name of the predicate function (which you just wrote!)

(That’s right, as you hopefully remember, functions can be passed as inputs to other functions!)

If you’ve got everything right, your expression will evaluate to a table with 15 rows in it.

Now that we the candies that will satisfy our sweet tooth, let’s consider the high end of the market.

Task: Write an expression in the definitions pane that produces a table containing only the candies where price-percent is greater than 90%.

This requires you to write another helper function. As we demonstrated above, be sure to test this function with a where block – and to do so for each of the subsequent functions you write.

The result should be a table with eight rows in it.

How many of the candies have chocolate?

Task: Write an expression (in the definitions pane) that outputs the number of chocolate candies.

Hint: You can get the length of a table by writing ⟨table⟩.length(), where ⟨table⟩ is the name of a table or an expression that evaluates to a table.

If your filter is correct, you’ll find a table with 37 chocolate candies in it.

Of the candies that have chocolate, what proportion also have caramel?

Task: Write an expression in the definitions pane that outputs the proportion of chocolate candies with caramel.

If you got the right answer, you’ll see that 10/37 or about 27% of the chocolate candies also have caramel.


Checkpoint: Call over a coach once you reach this point and talk over your code with them.


Task: Build a column of Boolean values that indicates whether a candy is fruity and hard, but not a pluribus (i.e., multiple candies in a packet, like Skittles or M&Ms).

Hint: Take a look at the build-column function. Call over a coach if you want help using this.

Task: Write an expression in the definitions pane that uses this new column to compute how many candies meet this condition.

Task: Write an expression in the definitions pane to compute which of the candies for which your new column is true has the highest winning percentage.

Hint: This requires the use of the order-by function, in addition to .row-n and filter-with.

Task: Write an expression in the definitions pane to compute the average winning percentage for the candies for which this column is true.

Hint: This requires the use of the mean function.

On Assignment 3 – and in class next week – we’ll be drawing various plots to visualize tabular data. If you have time at the end of lab, we encourage you to try visualizing the candy data we used in this lab. Here’s a question to look at:

What’s the relationship between sugar and winning percentage? Do these two attributes seem correlated? One way to gain an intuition on this is to create a scatterplot that puts one attribute on each axis.

We haven’t made plots in class yet, but at this point you’re getting good at reading the documentation and trying things out!

Task: Look at the documentation for the scatter-plot function. Try to figure out how to use it to generate a scatterplot of sugar vs winning percentage.

It doesn’t matter which variable goes on which axis. If you get stuck, ask for help!

Task: Write an expression in the definitions pane that creates the scatterplot, and, in a comment, summarize what, if anything, you think it shows.

On Assignment 3, you’ll be using other visualizations, but they’re created in much the same way!

This lab has mostly been about getting you comfortable working with tabular data and practicing some common operators on tables. It also gets you thinking about our course’s focus on data: What patterns of manipulating data do we often use in computations? How does the organization of our data impact our ability to answer these questions?

Here, we see that filtering, ordering, and summarizing data are some of the key operations. So far we’ve only looked at these operations on tabular data, but these same building blocks will arise many times through this course. When you have a computation to perform around data, you should start by thinking through what combinations of filtering, sorting and summarizing will help you compute your answer.

  • When you’ve completed the exercises, show your code to your instructor or one of the coaches.
  • Then upload your lab03.arr file to the Lab 3 assignment on Gradescope.

This lab includes material adapted from Kathi Fisler and colleagues at Brown University.