Lab 4: Cleaning Table Data

23 September 2022

The objective of this lab is to gain familiarity with sanitizing, visualizing, and gaining insights from data.

Up until now, the data that you worked with in lab and on assignments has been sanitized for you – that is, you could expect Pyret to process the data without any extra work.

For this lab, we’ll analyze the data from the survey that you and the other students taking 101 filled out earlier this week. Since the data is being gathered directly from Google Forms submissions, there exists data that either isn’t valid or isn’t useful, and you need to find ways to effectively sort through such cases so that you can gather the insights you need.

Task: Go to the response spreadsheet and make a copy of it (by selecting FileMake a Copy).

This copy of the spreadsheet is yours to tweak manually. There are some values that Pyret will not allow you to read in, which you will have to correct by hand directly in the spreadsheet.

Once you’ve made a copy of the spreadsheet, you need to share the spreadsheet so Pyret can import it.

Task: Press the green Share button in the upper right hand corner of the spreadsheet, then click on Get link and select Anyone with the link from the dropdown.

Now we’re ready to import the spreadsheet into a Pyret program.

Task: Add the following code at the top of your program:

include gdrive-sheets
include data-source

include shared-gdrive("dcic-2021",
  "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep")


ssid = ""  # Put the ID of your spreadsheet here
data-sheet = load-spreadsheet(ssid)

student-data =
  load-table:
    timestamp, house, stem-level, sleep-hours,
    schoolwork-hours, student-athlete, extracurricular-hours
    source: data-sheet.sheet-by-name("Responses", true)
    sanitize student-athlete using string-sanitizer
  end

student-data

Task: Put the ID of your copy of the spreadsheet in your code above.

Note: Your ssid will look like the highlighted portion of the spreadsheet URL:

https://docs.google.com/spreadsheets/d/10dXMqd5nyg20_-2rdtHLkEfnWxXDUhvw4j5_We4ssek/edit

But your ssid will be different; don’t use this one!

To make sure the spreadsheet loaded correctly, check that you see the student-data table when you run the program. If you are not able to see your table, call your instructor or a coach over to help you troubleshoot.

Now that we have the data collected, we need to make sure that the data is clean and processed before we analyze it.

Have a look at your student-data table. Notice that everything is represented as a string, even the numeric columns. Also notice the values in the last two columns are in the format some(x) or none.

In order to turn these values into the data types we all know and love – String, Number, etc. – we need to use the sanitize functions when we load the table. See the textbook for an example of how to do this and the Pyret documentation for a list of available sanitizers.

Hint: You’ll want to use at least num-sanitizer and string-sanitizer.

Task: Make a new table called student-data-sanitized, which is the same as student-data but using the sanitizers.

When you try to sanitize the extracurricular-hours column, Pyret should report one or more errors due to values that can’t be converted to numbers. To fix these sanitizer errors, you’ll have to edit the spreadsheet.

When you need to make edits to the data, it’s much better to edit a copy of the data, leaving the original data unchanged. So, that’s what we’ll do:

Task: Copy our data into a new sheet in the spreadsheet by clicking on the small arrow in the Responses tab in the bottom left of the spreadsheet window and then select Duplicate.

Task: Once you’ve duplicated the sheet, rename it to “Cleaned Data”.

Now when you load the spreadsheet to make student-data-sanitized, your source line will look like this:

source: data-sheet.sheet-by-name("Cleaned Data", true)

Task: Make edits to the “Cleaned Data” sheet of your spreadsheet until your data passes the sanitizer. If the data for the cell that’s giving an error makes no sense for that cell, it’s best to remove it. However, if there is a clear way clean up the data, you should edit the data for that cell.

Task: Answer the following question with a block comment (#| ... |#) in your code. Make the first line of your comment Question 1.1 to make it easy for the coach to find your answer.

Question 1.1: What type of edits did you need to make to the data pass the sanitizer? How did you handle cells that were obviously not correct? How did you handle the other types of errors? Take a look at the “Cleaned Data” sheet in this copy of the spreadsheet, which contains our edits to the table. Do you agree with our edits? Did you do anything different?

Now that we have our sanitized table, we can now work on normalizing the data in our columns. In the "house" column, students were asked to select which one of the nine houses they live in. For students who don’t live in one of those houses, they selected “other” and listed where they live.

Task: Use transform-column to normalize this column such that if the answer is not one of the nine Vassar houses, we should change the response to "OTHER".

For reference, the nine houses listed on the form were:

  • "Main Building (1861)"
  • "Strong House (1893)"
  • "Raymond House (1897)"
  • "Lathrop House (1901)"
  • "Davison House (1902)"
  • "Jewett House (1907)"
  • "Josselyn House (1912)"
  • "Cushing House (1927)"
  • "Noyes House (1958)"

Check yourself: How do you know you did the normalization correctly? You can cross check your data by using the count function on your new table. You should only see the nine houses represented in the count table, along with "OTHER". (If there were no responses for a particular house, it won’t show up – this is fine.)

Use the table you created in Exercise 1.2 as your starting point for this exercise.

To normalize the "student-athlete" column, we’re going to change it from a String to a Boolean:

Task: Make a new table student-data-cleaned, where the "student-athlete" column is true for any response that has "yes" (in upper, lower, or mixed case) anywhere in the string and false otherwise.

Before you begin, take a moment to write down the tasks that you need to do for this computation. You’ll need to make use of a couple of string functions to do this transformation. Also remember that the transform-column can return a different type from its input type.

Now that the data has been tidied up, we can start asking questions about it.

Task: Use the count function to answer the following questions:

  • What percentage of the survey responders are student-athletes?
  • Which house has the most survey students living there?

You don’t need to come up with an expression that computes the answer directly; just write your answer in a comment.

Task: Show a scatter-plot which has the "stem-level" on the x-axis and "schoolwork-hours" on the y-axis. (Refer to the table documentation for plot functions.) Do you notice any correlation between the points?

Task: Show a histogram of the "sleep-hours" column with a bin-width of 1.

Task: Ask and answer at least one more question about the data. Discuss why you wanted to know the answer to that question.

Congratulations! You have reached the end of lab. Here is an optional exercise in case you are looking for a challenge:

Task: Write a function percent-true that takes a table and column name as input and returns the percent of rows that are true for the column specified.

fun percent-true(t :: Table, col :: String) -> Number:
  doc: "Return the percentage of rows that are true in column 'col'"
  ...
end

What’s neat about this function is it will work on any table that has a column of type Boolean!

Task: Use this helper function to find the percentage of survey responders who are student-athletes. Check to see if it’s the same answer you got for Exercise 2.1.

  • When you’ve completed the exercises, show your code to your instructor or one of the coaches.

  • Then upload your lab04.arr file to the Lab 4 assignment on Gradescope.

When you’re done with the lab, check out these spurious correlations. It’s entertaining, we promise!

This lab includes material adapted from Kathi Fisler and colleagues at Brown University and Jason Waterman at Vassar College.