Assignment 4: Famous People

Assigned: Thursday, 22 September
Due: Wednesday, 28 September, 11:59 p.m.

The Pantheon 1.0 data set contains information about 11,340 famous individuals, based on articles in international versions of Wikipedia.

For this assignment, we have taken a random sample of 1000 rows (that is, information about 1000 people) and have selected a few of the most interesting columns.

Task: Load the table from this Google spreadsheet.

You can refer to Assignment 3 or the recent class slides and labs for the required include statements and code for loading a table from a Google spreadsheet. Use the same column names that are in the spreadsheet.

For the table functions, refer to this page rather than the Pyret documentation.

Before we try to work with this data, we should think about what the valid contents of the fields could be. It turns out that this can be pretty hard, and many programmers and companies get it wrong in practice!

Task: Read this short article on falsehoods programmers believe about names.

Task: Read this article on gender storage in databases.

Task: Answer the following questions in a multi-line comment (#| ... |#) at the top of your program before moving on to Part 2.

  1. Take a look at the Pantheon 1.0 data. Based on the two articles you’ve just read, what are two assumptions that this table seems to reflect? Provide an example from the data set of each assumption you name and explain why the assumption is harmful or doesn’t always hold true.

  2. What is one other assumption – not about names or gender – that the data set makes? Provide an example and explain why the assumption is harmful or doesn’t always hold true.

Task: Add a new column, "first-name". This should contain a substring of the value in the "name" column, stopping before the first space (" "). Call the resulting table people-names.

Hint: See the string-index-of and string-substring functions! Note what string-index-of returns when the substring isn’t found. What names could that happen for? Consider what would be an appropriate value for the "first-name" column in this case.
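To get a feel for these two functions before using them on the table, it can help to try them on a few strings in isolation. The check block below is only a sketch: the strings are made up and do not come from the Pantheon data, and it does not build the "first-name" column for you.

```pyret
check "string-index-of and string-substring on made-up strings":
  # Indices start at 0, so the space in "data science" is at index 4
  string-index-of("data science", " ") is 4
  # The search string can be longer than one character
  string-index-of("data science", "sci") is 5
  # No space anywhere in "pyret"
  string-index-of("pyret", " ") is -1
  # The start index is included, the end index is excluded
  string-substring("data science", 5, 12) is "science"
end
```

Notice that the third example does not produce a usable index, so any code that passes the result straight to string-substring needs to handle that case first.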

Task: In a comment, identify two (or more) names in the table for which this approach doesn't produce the person's actual first name.

Task: The table contains a "birthstate" column, but this is only relevant for people born in the United States or other places that are divided into states. Use transform-column to change missing values to "NA", standing for “not applicable” or “not available”. Call the resulting table people-states.

Task: Filter the people-states table to only include people whose "country" is the United States. (Be careful – the capitalization in this column is inconsistent!) Call the resulting table people-us-states.
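One common way to compare strings while ignoring capitalization is to convert both sides to the same case first. The sketch below uses Pyret's built-in string-to-lower on made-up values (not taken from the data set); whether this is the best fit for the "country" column depends on what the values in your table actually look like.

```pyret
check "case-insensitive comparison on made-up strings":
  string-to-lower("CANADA") is "canada"
  # A plain comparison treats different capitalizations as different strings
  ("Canada" == "CANADA") is false
  # Lowercasing both sides first makes the comparison ignore case
  (string-to-lower("Canada") == string-to-lower("CANADA")) is true
end
```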

Task: Count how many times each state occurs in the people-us-states table using the count function.

Task: Visualize the distribution of states you just counted by making a pie chart!

The table contains a "birthyear" column. Because the famous individuals lived throughout recorded history, some of the years are BCE – Before the Common Era. These are represented as negative numbers, which is accurate, but confusing to see!

Task: Transform this column into strings like "1970 CE" and "367 BCE". Call the resulting table people-years.

Hint: Beware that not all of the entries in this column are proper numbers, which is why the column is loaded as strings!
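Because the column holds strings, one option is Pyret's built-in string-to-number, which returns an Option: some(n) when the string is a number and none when it is not. The helper below is a hypothetical illustration (the name describe-year and the sample inputs are assumptions, not part of the assignment). It only shows how to tell the two cases apart, not how to format the year.

```pyret
fun describe-year(raw :: String) -> String:
  doc: "Hypothetical helper: reports whether raw can be read as a number"
  cases (Option) string-to-number(raw):
    | some(n) => "number"
    | none => "not a number"
  end
where:
  describe-year("1970") is "number"
  describe-year("Unknown") is "not a number"
end
```

In your own function, the some branch is where you have an actual number to work with, and the none branch is where you decide what string makes sense for an entry that isn't a number.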

As with previous assignments, you are expected to follow good Pyret style, including writing docstrings and examples for each function. Review the Testing and Style Guidelines and ask questions if anything’s unclear!

  1. Download your file (File → Download) and ensure it’s named asmt04.arr.

  2. Upload your assignment on Gradescope.

Note: You can submit as many times as you want before the deadline. Only your latest submission will be graded.

Part of this assignment is adapted from Kathi Fisler and colleagues at Brown University.