Lab 3: Candy Analysis

2 February 2024

Today’s lab

The purpose of this lab is to give you practice:

  • extracting rows and columns from a table,
  • writing and testing helper functions,
  • filtering data with filter-with,
  • adding a column to a table,
  • summarizing columns, and
  • visualizing relationships.

Today’s lab – and all future labs, unless otherwise noted – can be completed working in a pair, if you would like to do so.

Contents

Getting started

FiveThirtyEight conducted a survey in which tens of thousands of people were asked to choose between two candies. The data is analyzed in the article “The Ultimate Halloween Candy Power Ranking”, which is worth a read after the lab. From the responses, they compiled a data set with

  • Number attributes like the winning percentage, relative price, and sugar percentage, and also
  • Boolean attributes such as whether the candy has chocolate, is fruity, has caramel, or is hard.

In this lab, you’ll be looking at the relationships between these columns.

To get started, copy this code for loading a Pyret table from a Google Sheet and paste it at the top of your program, in the definitions pane:

include gdrive-sheets

include shared-gdrive("dcic-2021",
  "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep")

ssid = "1YoN-Z8T5DqNEmV_hpR3i6-yP5Poe9gdYwbP9nQhQlBw"
data-sheet = load-spreadsheet(ssid)

candy-data =
  load-table:
    name, chocolate, fruity, caramel, nutty, nougat, crisped-rice,
    hard, bar, pluribus, sugar-percent, price-percent, win-percent
    source: data-sheet.sheet-by-name("candy-data", true)
  end

Save this code in a file named lab03.arr and press the Run button to process the code in the definitions pane. Then type candy-data in the interactions pane to see the data as a table.

Note: For this lab, you’ll want to refer to the Pyret Tables Documentation (and not the built-in Pyret documentation).

For each exercise, separate it from the next one by copy-and-pasting this line:

# -----------------------------------

Part 1: Filtering

Let’s use the power of filtering to learn more from our candy data.

Exercise 1.1: Sugar rush

We want to know which candies have the most sugar!

Task: First you’ll first need to write a function that takes a table Row as input and returns a Boolean that answers the question “is the sugar-percent of the Row greater than 75%?”

We’ve started the function for you:

fun over-75-percent-sugar(r :: Row) -> Boolean:
  doc: "Returns true if the sugar-percent column in a row is over 75%"
  ...
where:
  over-75-percent-sugar(candy-data.row-n(0)) is false
  over-75-percent-sugar(candy-data.row-n(4)) is true
end

Note the examples in the where block to check that the function works correctly! Often when we’re working with a big table of data, we’ll define a second, smaller table just for testing, since our real data set might change. For this lab, the data set is “frozen” – it’s not changing – so it’s okay to use it for testing your functions.

Task: Next, write an expression (not a function) to filter our table to only include the candy with more than 75% sugar.

You may want to review the filter-with function we saw in class. The first input to filter-with is the Table to be filtered and second input is the name of the predicate function (which you just wrote!) That’s right, as you hopefully remember, functions can be passed as inputs to other functions! If you’ve got everything right, your expression will evaluate to a table with 15 rows in it.

Exercise 1.2: Pricey

Now that we the candies that will satisfy our sweet tooth, let’s consider the high end of the market.

Task: Write an expression in the definitions pane that produces a table containing only the candies where price-percent is greater than 90%.

This requires you to write another helper function. As we demonstrated above, be sure to test this function with a where block – and to do so for each of the subsequent functions you write.

The result should be a table with eight rows in it.

Exercise 1.3: Chocolate

How many of the candies have chocolate?

Task: Write an expression (in the definitions pane) that outputs the number of chocolate candies.

You can get the length of a table by writing ⟨table⟩.length(), where ⟨table⟩ is the name of a table or an expression that evaluates to a table.

If your filter is correct, you’ll find a table with 37 chocolate candies in it.

Exercise 1.4: Chocolate and caramel

Of the candies that have chocolate, what proportion also have caramel?

Task: Write an expression in the definitions pane that outputs the proportion of chocolate candies with caramel.

If you got the right answer, you’ll see that 10/37 or about 27% of the chocolate candies also have caramel.

Checkpoint: When you reach this point in the lab, raise your hand and go through your solutions with a coach or the instructor before continuing.

Part 2: Building columns and analyzing them

Exercise 2.1: New column

Task: Build a column of Boolean values that indicates whether a candy is fruity and hard, but not a pluribus (i.e., multiple candies in a packet, like Skittles or M&Ms).Hint: Take a look at the build-column function.

Task: Write an expression in the definitions pane that uses this new column to compute how many candies meet this condition.

Exercise 2.2: Maximum

Task: Write an expression in the definitions pane to compute which candy has the highest winning percentage out of those candies for which your new column is true.

Hint: This requires the use of the order-by function, in addition to .row-n and filter-with.

Exercise 2.3: Mean

Task: Write an expression in the definitions pane to compute the average winning percentage for the candies for which this column is true.

Hint: This requires the use of the mean function.

Part 3: Scatterplot

What’s the relationship between sugar and winning percentage? Do these two attributes seem correlated? One way to gain intuition on this is to create a scatterplot that puts one attribute on each axis.

Task: Look at the documentation for the scatter-plot function. Try generating a scatterplot of sugar vs winning percentage.

We recommend putting sugar on the x-axis since our hypothesis is that it affects a candy’s winning percentage.

Task: Write an expression in the definitions pane that creates the scatterplot, and, in a comment, summarize what, if anything, you think it shows.

Takeaways

This lab has mostly been about getting you comfortable working with tabular data and practicing some common operators on tables. It also gets you thinking about our course’s focus on data: What patterns of manipulating data do we often use in computations? How does the organization of our data impact our ability to answer these questions?

Here, we see that filtering, ordering, and summarizing data are some of the key operations. So far we’ve only looked at these operations on tabular data, but these same building blocks will arise many times through this course. When you have a computation to perform around data, you should start by thinking through what combinations of filtering, sorting and summarizing will help you compute your answer.

Submitting the lab

  • When you’ve completed the exercises, show your code to your instructor or one of the coaches.
  • Then upload your lab03.arr file to the Lab 3 assignment on Gradescope. If you worked with a partner, you should submit a single copy of the lab with both your names – see these instructions.