Data Science for Linguists 2019

Course home for LING 1340/2340


Homework 3: Machine Learning with the ETS Corpus

This homework carries a total of 80 points.

For this homework, we will apply machine learning methods to the ETS Corpus of Non-Native Written English data, using Python’s sklearn library.

Data set

Repo, files and folders

Goals

Let’s build machine learning models for the following three tasks.

  1. L1 identification: Given a response, predict the first language (L1) of the writer
  2. Topic (= prompt) identification: Given a response, predict the prompt for which it was written
    • This task closely resembles task 1, with one caveat: the topics are not evenly distributed across the responses.
  3. Score prediction: Given a response, predict its score
    • The targets (‘low’, ‘intermediate’, ‘high’) are ordinal, not nominal, categories. The question then becomes: should you use regression models or classification models?
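
To make the regression-or-classification question concrete, here is a minimal sketch of both options on toy data. The example texts, the integer encoding (low = 0, intermediate = 1, high = 2), and the choice of `LogisticRegression` and `Ridge` are all assumptions made for illustration; none of them is prescribed by the assignment.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.pipeline import make_pipeline

# Toy stand-ins for responses and score labels (NOT taken from the ETS corpus).
texts = [
    "i think is good study",
    "he go school every day",
    "students should study many subjects",
    "people enjoy travel with a guide",
    "a broad education prepares students for uncertain careers",
    "specialization rewards those who pursue depth over breadth",
]
labels = ["low", "low", "intermediate", "intermediate", "high", "high"]

# Assumed ordinal encoding: low < intermediate < high.
SCORE_TO_INT = {"low": 0, "intermediate": 1, "high": 2}
INT_TO_SCORE = {v: k for k, v in SCORE_TO_INT.items()}
y_ordinal = np.array([SCORE_TO_INT[s] for s in labels])

# Option 1: classification treats the three labels as unordered categories.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
clf_pred = clf.predict(texts)

# Option 2: regression keeps the ordering; round and clip back to a label.
reg = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
reg.fit(texts, y_ordinal)
raw = reg.predict(texts)
reg_pred = [INT_TO_SCORE[int(i)] for i in np.clip(np.rint(raw), 0, 2)]
```

A regression model can be penalized less for predicting ‘intermediate’ when the truth is ‘high’ than for predicting ‘low’, whereas a plain classifier counts both as equally wrong. Which behavior suits the task better is part of what you are asked to think about.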

The three tasks have much in common, but each also has fundamental differences, noted above, that require special attention.
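
Because tasks 1 and 2 are both text classification problems, the same sklearn pipeline shape can serve as a starting point for either. The following is only a sketch: the toy responses, the three-letter L1 codes, and the choice of `CountVectorizer` with `MultinomialNB` are illustrative assumptions, not the required setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy responses with made-up L1 labels (NOT taken from the ETS corpus).
train_texts = [
    "i am agree with this statement",
    "i am agree that young people enjoy life",
    "in my opinion the advertisement make products look better",
    "in my opinion the students must to learn facts",
]
train_l1 = ["AAA", "AAA", "BBB", "BBB"]  # hypothetical L1 codes

# Bag-of-words features feeding a Naive Bayes classifier.
l1_model = make_pipeline(CountVectorizer(), MultinomialNB())
l1_model.fit(train_texts, train_l1)

# Predict the L1 of unseen responses.
new_texts = ["i am agree with you", "in my opinion it is true"]
predicted_l1 = l1_model.predict(new_texts)
```

For topic identification, the same pipeline could be fit with prompt labels in place of `train_l1`.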

Evaluation:

  1. L1 identification: Remember that the data set comes with its own training-test-development partition. Use this split in your evaluation.
  2. Topic (= prompt) identification: Use your own training-test split in an 80-20 ratio, with random seed 0.
  3. Score prediction: Try 5-fold cross-validation.
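
The last two evaluation setups map onto sklearn's `train_test_split` and `cross_val_score`. Below is a minimal sketch on generated toy data; the texts, the labels `P1` and `P2`, and the model are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data: 20 short "responses" with two made-up labels.
texts = [f"students should study many subjects number {i}" for i in range(10)]
texts += [f"young people enjoy life and travel number {i}" for i in range(10)]
labels = ["P1"] * 10 + ["P2"] * 10

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Task 2 style: 80-20 training-test split with random seed 0.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0
)
model.fit(X_train, y_train)
holdout_accuracy = model.score(X_test, y_test)

# Task 3 style: 5-fold cross-validation; returns one score per fold.
cv_scores = cross_val_score(model, texts, labels, cv=5)
mean_cv_score = cv_scores.mean()
```

Since the topic distribution is uneven, passing `stratify=labels` to `train_test_split` is one option for keeping class proportions similar in both partitions. That is a suggestion, not something the assignment requires. For task 1, the corpus's own partition would be used in place of `train_test_split`.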

Plots:

Machine learning algorithms:

Features:

BONUS POINTS:

Your report, aka Jupyter Notebook file

It should meet the following requirements:

Submission

A big change: don’t create a pull request before the homework deadline. (Reason: the moment you create a pull request, the entire content of your request becomes visible on the upstream repo.) Instead, just make sure to push your final, “submission-ready” version to your fork, before the 3:30pm deadline. After the deadline passes, go ahead and submit your pull request. You can do this at the beginning of the class too.