Assigned | Thursday, 4 April |
Part 1 due | Wednesday, 10 April, 11:59 p.m. |
Part 2 due | Wednesday, 17 April, 11:59 p.m. |
Introduction
During this class, we have seen some of the many kinds of data that
we can write computer programs to clean, analyze, and visualize. This
mini-project gives you the opportunity to practice these skills working
with a dataset of interest to you.
Projects may be done individually, but we recommend working in a
group of two.
Your project will have two deliverables:
Planning notebook 25%
Final notebook 75%
If you work in a group, you will submit a single copy of each
deliverable with all your names – see these instructions.
Part 1: Planning
Before you spend too much time working on your mini-project, we want
to make sure you’ve identified an appropriate dataset and thought about
the questions you’d like to ask.
There are two ways you can approach the mini-project:
- You can start by thinking about questions you’d like to investigate,
along the lines of these from a course at Northeastern. Once you’ve
formulated a question, you can look for datasets that would let you
answer it.
- Alternatively, you can start by exploring tabular datasets that are
out there, like those in the archive of the Data is Plural newsletter
or those listed by Miriam Posner. Once you’ve found a dataset that
matches your interests, you can think about what kind of questions you
could ask about it.
It’s fine to choose a dataset that has more – or more complex
– information than you’ll use. Instead, beware of choosing data
that is too small, simple, or tidy, because you won’t have enough to work
with. (Also beware of choosing a dataset whose creators have already
answered the exact questions you would ask!)
You’re required to turn in a notebook that contains:
- A text cell containing a brief (1-to-3-sentence) description of the
dataset you’re interested in using.
- Code cells that demonstrate successfully loading the dataset from
a CSV file, TSV file, or spreadsheet into a datascience module Table.
- A text cell briefly listing the preprocessing the data requires: Are
there missing values? Inconsistent formats? Other issues?
This doesn’t need to be comprehensive, but we want to see that you’ve
looked at the data and thought about the issues you’ll face.
- A text cell listing 2–4 questions you want to explore with this
data.
This list isn’t binding – you can report on different questions in
your final submission – but we want to see that you’ve thought about
what you want to look into.
These should be questions you’re actually interested in answering. If
you can’t think of anything you want to know about the data, then you
should choose a different dataset!
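As a rough illustration of the kind of inspection the preprocessing cell could be based on, here is a minimal sketch that counts missing values and flags inconsistently formatted numbers. The sample data, the column names, and the helper functions are all invented for this example, and it uses only Python’s standard library so that it runs anywhere; in your notebook you would load your own file into a datascience module Table instead (for example with Table.read_table) and inspect that.

```python
import csv
import io

# Hypothetical sample standing in for your dataset file; every column name
# and value here is invented for illustration.
SAMPLE_CSV = """title,year,price
Widget,2019,$4.50
Gadget,,3.25
Doohickey,2021,
Thing,n/a,$7.00
"""

# Strings that we assume (for this sketch) mean "no value was recorded".
MISSING_MARKERS = {"", "n/a", "na", "null", "none", "-"}


def is_missing(value):
    """Return True if a raw cell value looks like a missing value."""
    return (value or "").strip().lower() in MISSING_MARKERS


def count_missing(rows, columns):
    """Return {column: number of rows whose value looks missing}."""
    return {col: sum(is_missing(row.get(col)) for row in rows) for col in columns}


def non_numeric_values(rows, column):
    """Return the non-missing values in a column that do not parse as numbers."""
    problems = []
    for row in rows:
        value = row.get(column)
        if is_missing(value):
            continue
        try:
            float(value)
        except ValueError:
            problems.append(value)
    return problems


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
columns = reader.fieldnames

missing = count_missing(rows, columns)
print(len(rows), "rows; missing values per column:", missing)
print("price values needing clean-up:", non_numeric_values(rows, "price"))
```

A summary like this is enough to write the text cell: here it would say that the year column has gaps and that some prices carry a currency symbol while others do not.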
Part 2: Final
Your final report will again be a Colab notebook, mixing text, code,
and results.
Divide it into the following sections:
- Problem statement
Describe what question(s) you are going to ask and the data you’re
using to answer them.
- Data processing
Load the dataset and perform appropriate clean-up by selecting and/or
renaming columns, filling in or filtering missing values, transforming
values to be consistent or easier to analyze, etc. Briefly describe in
text cells what your code in the code cells is doing.
- Analysis
Answer the question(s) you posed, including at least two different kinds
of visualizations (scatterplots, line graphs, bar charts, maps,
etc.). For full credit, the questions should require a significant
amount of analysis, such as building a new column, creating a new
table, and so on – not just drawing a scatterplot for two columns
in the data.
- References
Cite any resources you used. At a minimum, this should include where you
got the dataset from. (If you consulted a generative AI system for
help at any stage, you should acknowledge that in this section as
well, including a link to a transcript.)
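To make the data processing and analysis sections above more concrete, here is a minimal sketch of the kind of work they describe: making values consistent, filtering out missing ones, building a new column, and creating a new summary table. The rows, column names, and the clean_price helper are hypothetical, and the sketch uses plain Python lists and dictionaries rather than a datascience module Table, so treat it as an outline of the steps rather than the code you would submit.

```python
from collections import defaultdict

# Hypothetical raw rows, as they might look right after loading; the column
# names and values are invented for illustration.
raw_rows = [
    {"title": "Widget", "year": "1994", "price": "$4.50"},
    {"title": "Gadget", "year": "1998", "price": "3.50"},
    {"title": "Doohickey", "year": "2003", "price": ""},
    {"title": "Thing", "year": "2007", "price": "$7.00"},
    {"title": "Gizmo", "year": "2001", "price": "$5.00"},
]


def clean_price(text):
    """Turn '$4.50' or '4.50' into 4.5; return None when the value is missing."""
    text = text.strip().lstrip("$")
    return float(text) if text else None


# Data processing: make values consistent, then filter out rows with no price.
cleaned = []
for row in raw_rows:
    price = clean_price(row["price"])
    if price is None:
        continue
    year = int(row["year"])
    cleaned.append({
        "title": row["title"],
        "year": year,
        "price": price,
        "decade": year // 10 * 10,  # new column derived from an existing one
    })

# Analysis: build a new summary table with the mean price per decade.
prices_by_decade = defaultdict(list)
for row in cleaned:
    prices_by_decade[row["decade"]].append(row["price"])

summary = [
    {"decade": decade, "mean_price": sum(prices) / len(prices)}
    for decade, prices in sorted(prices_by_decade.items())
]

for row in summary:
    print(row["decade"], round(row["mean_price"], 2))
```

A summary table like this one is the sort of result that lends itself to a bar chart, while the cleaned rows could feed a scatterplot of price against year, which would give the two different kinds of visualization.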