Mini-project

Assigned Thursday, 4 April
Part 1 due Wednesday, 10 April, 11:59 p.m.
Part 2 due Wednesday, 17 April, 11:59 p.m.

Introduction

During this class, we have seen some of the many kinds of data that we can write computer programs to clean, analyze, and visualize. This mini-project gives you the opportunity to practice these skills by working with a dataset of interest to you.

Projects may be done individually, but we recommend working in a group of two.

Your project will have two deliverables:

Planning notebook: 25%
Final notebook: 75%

If you work in a group, you will submit a single copy of each deliverable with all your names – see these instructions.

Part 1: Planning

Before you spend too much time working on your mini-project, we want to make sure you’ve identified an appropriate dataset and thought about the questions you’d like to ask.

There are two ways you can approach the mini-project:

  1. You can start by thinking about questions you’d like to investigate, along the lines of these from a course at Northeastern. Once you’ve formulated a question, you can look for datasets that would let you answer it.
  2. Alternatively, you can start by exploring tabular datasets that are out there, like those in the archive of the Data is Plural newsletter or those listed by Miriam Posner. Once you’ve found a dataset that matches your interests, you can think about what kind of questions you could ask about it.

It’s fine to choose a dataset that has more – or more complex – information than you’ll use. The bigger risk is choosing data that is too small, simple, or tidy, which won’t give you enough to work with. (Also beware of choosing a dataset whose creators have already answered the exact questions you would ask!)

You’re required to turn in a notebook that contains:

  1. A text cell containing a brief (1-to-3-sentence) description of the dataset you’re interested in using.
  2. Code cells that demonstrate successfully loading the dataset from a CSV file, TSV file, or spreadsheet into a datascience module Table.
  3. A text cell briefly listing the preprocessing the data requires: Are there missing values? Inconsistent formats? Other issues?

    This doesn’t need to be comprehensive, but we want to see that you’ve looked at the data and thought about the issues you’ll face.

  4. A text cell listing 2–4 questions you want to explore with this data.

    This list isn’t binding – you can report on different questions in your final submission – but we want to see that you’ve thought about what you want to look into.

    These should be questions you’re actually interested in answering. If you can’t think of anything you want to know about the data, then you should choose a different dataset!
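
The loading step and the data check in items 2 and 3 might look roughly like the sketch below. Everything named here is an assumption for illustration: the helper names, the sample rows, and the list of placeholder strings are made up, and your own file will differ. `Table.read_table` is the datascience method for reading delimited files; it passes extra arguments such as `sep` on to pandas.

```python
import csv
import io


def guess_separator(filename):
    """Pick a delimiter from the file extension: tab for .tsv, comma otherwise."""
    return "\t" if filename.lower().endswith(".tsv") else ","


def count_missing(rows, placeholders=("", "NA", "N/A", "?")):
    """Count, per column, the cells that are blank or hold a placeholder string."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if column is None:  # extra cells beyond the header row; ignored here
                continue
            counts.setdefault(column, 0)
            if value is None or value.strip() in placeholders:
                counts[column] += 1
    return counts


def load_table(filename):
    """Load a CSV or TSV file into a datascience Table (hypothetical helper)."""
    # Imported here so the helpers above also work where datascience is not installed.
    from datascience import Table
    return Table.read_table(filename, sep=guess_separator(filename))


# Made-up sample data standing in for a real file.
SAMPLE = "city,population,founded\nAlpha,1200,1850\nBeta,,1902\nGamma,850,N/A\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = count_missing(rows)
print(missing)
```

In a notebook you would call `load_table` with the name of your own uploaded file and display the result. The per-column counts give you something concrete to report in the text cell for item 3.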

Part 2: Final

Your final report will again be a Colab notebook, mixing text, code, and results.

Divide it into the following sections:

  1. Problem statement
    Describe the question(s) you are going to ask and the data you’re using to answer them.
  2. Data processing
    Load the dataset and perform appropriate clean-up by selecting and/or renaming columns, filling in or filtering missing values, transforming values to be consistent or easier to analyze, etc. Briefly describe in text cells what your code in the code cells is doing.
  3. Analysis
    Answer the question(s) you posed, including at least two different kinds of visualizations (scatterplots, line graphs, bar charts, maps, etc.). For full credit, the questions should require a significant amount of analysis, such as building a new column, creating a new table, and so on – not just drawing a scatterplot for two columns in the data.
  4. References
    Cite any resources you used. At a minimum, this should include where you got the dataset. (If you consulted a generative AI system for help at any stage, you should acknowledge that in this section as well, including a link to a transcript.)
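
As one illustration of the kind of work the data processing and analysis sections call for, the sketch below makes a messy numeric column consistent and then builds a new column from two existing ones. The function name, the sample values, and the choice of a density calculation are all hypothetical; substitute whatever fits your data.

```python
import math


def parse_number(text):
    """Turn strings such as '1,200', ' 850 ' or '$7.50' into floats.

    Values that cannot be parsed (for example 'N/A' or an empty string)
    become NaN so that they are easy to find and filter later.
    """
    cleaned = str(text).strip().replace(",", "").lstrip("$")
    try:
        return float(cleaned)
    except ValueError:
        return math.nan


# Made-up raw values, as they might appear in two columns of a messy file.
raw_population = ["1,200", " 850 ", "N/A"]
raw_area = ["4.0", "2.5", "3.0"]

# Data processing: convert both columns to consistent numeric values.
population = [parse_number(p) for p in raw_population]
area = [parse_number(a) for a in raw_area]

# Analysis: build a new column from the two cleaned columns.
# A NaN in either input carries through to the result.
density = [p / a for p, a in zip(population, area)]
print(density)
```

With a datascience Table, a function like `parse_number` can be applied to a column with `Table.apply`, and the resulting array attached as a new column with `with_column`. Describe each of these steps in a text cell next to the code.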