Project advice

The most important criterion for choosing a topic is that it genuinely excites you! Start by thinking about whether there are particular languages, genres/domains, linguistic phenomena, or machine learning models that you’d like the chance to explore.

It’s tricky to come up with a project from scratch, particularly when you don’t have a lot of time. If you’re short on starting ideas, I’d recommend looking at the shared SemEval tasks (including previous years), browsing through the textbook, and maybe using the search box in the ACL Anthology.

As you develop an idea, here are some things to keep in mind:

Make your research question clear and concrete.

Getting a good research question is a subtle thing. Borrowing terms from this helpful advice post, you want it to be

  • clear to an audience of your classmates,
  • focused enough that you will be able to address it with one or two narrow experiments,
  • concise enough to state within the first few sentences of a paragraph,
  • complex enough that the answer isn’t immediately evident, and
  • arguable, in the sense that it should be possible to provide evidence for or against an answer to your research question.
Pick a clean, ready-to-use dataset.

Dataset processing takes a long time! You’ll probably need to do some no matter what, but it’ll be easier if you plan to use a dataset that’s already prepared for processing.

Good sources of these datasets include existing shared tasks (for instance, SemEval and CoNLL tasks and the GLUE benchmark) and datasets that have seen lots of NLP processing in the past (e.g., this list of ten corpora).

There are also some sites that are easier to get data from, like anything related to Wikimedia (e.g., Wikipedia), or StackExchange (e.g., StackOverflow). You can also check sites like Kaggle to see if they have anything available.

If the process to acquire the dataset for your project takes more than 24 hours or costs money, it’s probably not a good option for this class. (But feel free to ask me about it first.)
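If you’re unsure whether a dataset clears that bar, a quick back-of-envelope estimate helps. Here’s a minimal sketch (the sizes and bandwidth are made-up placeholders, not numbers from any particular dataset):

```python
# Back-of-envelope check: roughly how long would acquiring a dataset take?
# All numbers below are hypothetical placeholders.

def download_hours(dataset_gb: float, bandwidth_mbps: float) -> float:
    """Rough download time in hours for `dataset_gb` gigabytes
    over a `bandwidth_mbps` megabit-per-second connection."""
    megabits = dataset_gb * 8 * 1000  # GB -> megabits
    seconds = megabits / bandwidth_mbps
    return seconds / 3600

# e.g., a 500 GB crawl on a 100 Mbps connection:
print(f"{download_hours(500, 100):.1f} hours")  # ≈ 11.1 hours: under a day
```

If the estimate is anywhere near 24 hours, budget extra slack for retries and rate limits, or pick a smaller dataset.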

Keep things narrow.

It’s okay if your project doesn’t create a new dataset, model, and evaluation all in one fell swoop! The narrower the scope of what you’re doing, the easier it will be to provide evidence that you did it well. For instance, you could

  • perform a replication study (e.g., take existing code for an experiment and check that it’s doing what it says, plus break down their results a bit more),
  • take a large unstructured dataset, curate it into one that helps answer a more specific question, and show that a simple model works on it,
  • try to get a good result on an existing shared task (or analyze its text to see which parts are “easy” or “hard”), or
  • create a new metric/evaluation and show it does something interesting.
For anything you’re planning to do, consider the scale of the data.

While it’s possible for the department to set you up with additional resources (including disk space, parallel processing, or GPUs), this takes time to arrange and may require you to learn new tools. The bigger the dataset you’re working with, the earlier you need to start so that you have enough time to run your processing.
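One cheap way to gauge scale before committing: time your pipeline on a small sample and extrapolate to the full dataset. A minimal sketch (the processing step here is a toy stand-in, not a real pipeline):

```python
import time

def estimate_total_hours(process, sample, total_items: int) -> float:
    """Run `process` on each item in `sample`, measure the average
    per-item time, and scale it up to `total_items`."""
    start = time.perf_counter()
    for item in sample:
        process(item)
    per_item = (time.perf_counter() - start) / len(sample)
    return per_item * total_items / 3600

# Toy stand-in for a real processing step (e.g., tokenization):
docs = ["Some sample text to process."] * 100
hours = estimate_total_hours(lambda d: d.lower().split(), docs, 10_000_000)
print(f"Projected: {hours:.2f} hours for 10M documents")
```

If the projection runs to days, that’s a sign to shrink the dataset, simplify the processing, or ask about extra resources early.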