Lab 4: Cleaning Table Data

The objective of this lab is to gain familiarity with sanitizing, visualizing, and gaining insights from data.

Up until now, the data that you worked with had been sanitized for you (i.e. you could expect Pyret to process the data without any extra work). The data in this lab, however, is being ported directly from Google Forms submissions, so there exists data that either isn’t valid or isn’t useful, and you need to find ways to effectively sort through such cases so that you can gather the insights you need. For this lab we'll analyze the data from the form you and the other students filled out earlier this week.

Go to the response spreadsheet, make a copy of it (select “Make a Copy” under the “File” tab), and import it into your program.

This copy of the spreadsheet is yours to tweak manually. There are some values that Pyret will not allow you to read in, which you will have to correct by hand directly in the spreadsheet.

Once you've made a copy of the spreadsheet, you need to share the spreadsheet so Pyret can import it. Press the green "Share" button in the upper right hand corner of the spreadsheet, then click on "Get link" and select "Anyone with the link" from the dropdown.

To import the spreadsheet, include the following code at the top of your program. You will put the ssid of your copy of the spreadsheet in your code below.

Note: The highlighted portion of the spreadsheet URL is what your ssid looks like:

https://docs.google.com/spreadsheets/d/`1PCNs1fcNCiEepbDMc5u9Edar-EZjSqUwf-Z41i33fkM`/edit

Your ssid will be different, so don't use the one above.

include shared-gdrive("dcic-2021",
  "1wyQZj_L0qqV9Ekgr9au6RX2iqt2Ga8Ep")
include gdrive-sheets

# spreadsheet id from Google Sheets
ssid = "" # Put the ssid of your spreadsheet here
data-sheet = load-spreadsheet(ssid)
student-data =
  load-table: timestamp, house, stem-level, sleep-hours, schoolwork-hours, 
  student-athlete, extracurricular-hours
  source: data-sheet.sheet-by-name("Form Responses 1", true)
  end
student-data

See if you can view the student-data table to make sure the spreadsheet loaded correctly. If you are not able to see your table, call your instructor or a coach over to help you troubleshoot.

Now that we have the data collected, we need to make sure that the data is clean and processed before analysis.

Once you import your student-data table, have a look at it. Notice that everything is represented as a string, even the numeric columns. Also notice the values in the last two columns are in the format some(x) or none. In order to turn these values into the data types we all know and love (String, Num, etc.) we need to use the sanitize functions upon import. See the texbook for an example of how to do this and the manual for a list of available sanitizers. Hint: you'll want to use at least the num-sanitizer and the string-sanitizer. Call your new table student-data-sanitized.

When you run the sanitizer, your spreadsheet should give an error when it tries to sanitize the "extracurricular-hours" column. If it doesn't, look at which sanitizer you are using with the column. Since we are asking for hours, we should be using the num-sanitizer for this column. To fix these sanitizer errors, you'll have to edit the spreadsheet. If the data for the cell that is giving an error makes no sense for that cell, it's best to remove it. However, if there is a clear way clean up the data, you can edit the data for that cell.

When you need to make edits to the data, it's much better to edit a copy of the data, leaving the original data in place. We'll do this by copying our data into a new sheet in the spreadsheet. Click on the small arrow in the "Form Responses 1" tab in the bottom left of the spreadsheet and then select "Duplicate".

Once you've duplicated the sheet, rename it to "Cleaned Data".

Now when you load the spreadsheet, your source line will look like this:

source: data-sheet.sheet-by-name("Cleaned Data", true)

Make edits to the "Cleaned Data" sheet of your spreadsheet until your data passes the sanitizer and answer the following question with a block comment in your code. Make the first line of your comment Question 1.1 to make it easy for the coach to find your answer.

Question 1.1: What type of edits did you need to make to the data pass the sanitizer? How did you handle cells that were obviously not correct? How did you handle the other types of errors? Take a look at "Data" sheet, which contains our edits to the table. Do you agree with our edits? Did you do anything different?

Now that we have our sanitized table, we can now work on normalizing the data in our columns. In the "house" column, students were asked to select which one of the nine houses they live in. For students who don't live in one of those houses, they selected "other" and listed where they live. What we would like to do is normalize this colum such that if the answer is not one of the nine Vassar houses, we should change the response to "OTHER". The nine houses listed on the form were:

  • Main
  • Strong
  • Raymond
  • Davison
  • Lathrop
  • Jewett
  • Josselyn
  • Cushing
  • Noyes

Hint: you may find the transform-column function useful.

Check Yourself: How do you know you did the normalization correctly? You can cross check your data by using the count function on your new table. You should see all 9 houses represented in the count table, and the value for "OTHER" should be 5.

Use the table that created in part 1.2 as your starting point for this section.

To normalize this column, we are going to transform the "student-athlete" column from a String to a Boolean. To do this, we'd like to take any response that has "yes" (in either upper or lower case) in any part of the string as true and false otherwise. Before you begin, take a moment to write down the tasks that you need to do for this computation. You'll need to make use of a couple of string functions to do this transformation. Also remember that the transform-column can return a different type from its input type.

Give your new table the name student-data-cleaned. We'll use this table in Part2.

Now that the data has been tidied up, we can start asking questions about it.

Both of these questions can answered by using count. You don't need to come up with an expression that computes the answer directly.

What percentage of the survey responders are student-athletes?

Which house has the most survey students living there?

Show a scatter-plot which has the "stem-level" on the x-axis and "schoolwork-hours" on the y-axis. Do you notice any correlation between the points?

Show a histogram of the "sleep-level" column with a bin-width of 1.

Ask and answer at least one more question about the data. Discuss why you wanted to know the answer to that question.

Congratulations! You have reached the end of lab. Here is an optional exercise in case you are looking for a challenge. You are not required to do this exercise.

Write a helper function percent-true which takes a table and column name as input and returns a the percent of rows that are true for the column specified. What is neat about this helper function, is it will work on any table that has a column of type Boolean. Use this helper function to find the percentage of survey responders who are student-athletes. Check to see if it's the same answer you got for part 2.1.

fun percent-true(t :: Table, col :: String) -> Number:
  doc: "returns the percentage of rows that in column 'col' that are true"
  ...
end

When you've complete the exercises, show your code to your instructor or one of the coaches.

Then, upload your lab04.arr file to the Lab 4 assignment on Gradescope.

If you have extra time, check out these spurious correlations. It’s entertaining, we promise!

This lab includes material adapted from Kathi Fisler and colleagues at Brown University.