Mini-project

Assigned Thursday, 6 April
Part 1 due Wednesday, 12 April, 11:59 p.m.
Part 2 due Wednesday, 19 April, 11:59 p.m.

Introduction

During this class, we have seen some of the many kinds of data that we can write computer programs to clean, analyze, and visualize. This mini-project gives you the opportunity to practice these skills working with a dataset of interest to you.

Projects may be done in groups of 1–3 students, although I strongly recommend working in a group of two.

Your project will have two deliverables:

Planning notebook: 25%
Final notebook: 75%

If you work in a group, you will submit a single copy of each deliverable with all your names – see these instructions.

Part 1: Planning

Due 12 April, 11:59 p.m.

Before you spend too much time working on your mini-project, we want to make sure you’ve identified an appropriate dataset and thought about the questions you’d like to ask.

There are two ways you can approach the mini-project:

  1. You can start by thinking about questions you’d like to investigate, along the lines of these from a course at Northeastern. Once you’ve formulated a question, you can look for datasets that would let you answer it.
  2. Or you can start by exploring tabular datasets that are out there, like those in the archive of the Data is Plural newsletter or those listed by Miriam Posner. Once you’ve found a dataset that matches your interests, you can think about what kind of questions you could ask about it.

It’s fine to choose a dataset that has more information than you’ll use. Beware of choosing data that is too small, simple, or tidy because you won't have enough to work with. (Also beware of choosing a dataset whose creators have already answered the exact questions you would ask!)

You’re required to turn in a notebook that contains:

  1. A text cell containing a brief (1-to-3-sentence) description of the dataset you’re interested in using.
  2. Code cells that demonstrate successfully loading the dataset from a CSV file, TSV file, or spreadsheet into a datascience module Table.
  3. A text cell briefly listing the preprocessing the data requires. Are there missing values? Inconsistent formats? Other issues?

    This doesn’t need to be comprehensive, but we want to see that you’ve looked at the data and thought about the issues you’ll face.

  4. A text cell listing 2–4 questions you want to explore with this data.

    This list isn’t binding – you can report on different questions in your final submission – but we want to see that you’ve thought about what you want to look into.

    These should be questions you’re actually interested in. If you can’t think of anything you want to know about the data, then you should probably choose a different dataset!
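
As an illustration of the kind of inspection item 3 asks for, here is a minimal sketch that counts missing and non-numeric values per column. It uses only Python's standard library and a small made-up CSV; `SAMPLE_CSV`, its column names, the `MISSING_MARKERS` set, and the `audit` helper are all hypothetical. In your notebook you would run similar checks on the Table you loaded for item 2.

```python
import csv
import io

# Hypothetical sample data standing in for a real dataset.
SAMPLE_CSV = """city,year,population
Springfield,2020,167000
Shelbyville,2020,
springfield,2021,"168,500"
Ogdenville,,n/a
"""

# Assumed spellings of "no value"; adjust these for your own data.
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}


def is_missing(value):
    """True if the cell is absent or holds one of the assumed missing markers."""
    return value is None or value.strip().lower() in MISSING_MARKERS


def is_number(value):
    """True if the cell can be parsed as a number without any clean-up."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def audit(rows, numeric_columns):
    """Return {column: {"missing": n, "non_numeric": n}} for the given rows."""
    report = {}
    for row in rows:
        for column, value in row.items():
            counts = report.setdefault(column, {"missing": 0, "non_numeric": 0})
            if is_missing(value):
                counts["missing"] += 1
            elif column in numeric_columns and not is_number(value):
                counts["non_numeric"] += 1
    return report


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = audit(rows, numeric_columns={"year", "population"})
for column, counts in report.items():
    print(column, counts)
```

A report like this translates directly into the text cell: which columns have gaps, and which hold values (such as numbers written with thousands separators) that need converting before analysis.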

Part 2: Final

Due 19 April, 11:59 p.m.

Your final report will again be a Colab notebook, mixing text, code, and results.

Divide it into the following sections:

  1. Problem statement
    Describe the question(s) you are going to ask and the data you're using to answer them.
  2. Data processing
    Load the dataset and perform any appropriate clean-up, selecting and/or renaming columns, filling in or filtering missing values, transforming values, etc. Briefly describe what your code is doing in text cells.
  3. Analysis
    Answer the question(s) you posed, including at least two different kinds of visualizations (scatterplots, line graphs, bar charts, maps, etc.).
  4. References
    Cite any resources you used. At a minimum, this should include the source of your dataset.
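
The clean-up steps named under Data processing (renaming columns, filtering missing values, transforming values) might look like the following minimal sketch. It uses only the standard library and made-up rows; the column names, the `RENAMES` mapping, and the `clean` helper are hypothetical, and in your notebook you would do the equivalent with Table operations.

```python
# Hypothetical raw rows, as they might come out of a messy CSV file.
raw_rows = [
    {"City Name": " Springfield ", "Pop.": "167,000"},
    {"City Name": "shelbyville", "Pop.": ""},
    {"City Name": "Ogdenville", "Pop.": "12500"},
]

# Map the original column names to shorter, consistent ones.
RENAMES = {"City Name": "city", "Pop.": "population"}


def clean(rows):
    """Rename columns, drop rows with a missing population, and normalize values."""
    cleaned = []
    for row in rows:
        renamed = {RENAMES.get(key, key): value for key, value in row.items()}
        if not renamed["population"].strip():
            continue  # filter out rows whose population is missing
        cleaned.append({
            # Trim stray whitespace and use consistent capitalization.
            "city": renamed["city"].strip().title(),
            # Remove thousands separators, then convert the text to an integer.
            "population": int(renamed["population"].replace(",", "")),
        })
    return cleaned


cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```

Dropping rows is only one way to handle missing values; whichever choice you make, say so in the accompanying text cell.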