Data Science for Linguists 2021

Course home for LING 1340/2340

Homework 3: Machine Learning with the ETS Corpus

This homework carries a total of 80 points.

For this homework, we will apply machine learning methods to the ETS Corpus of Non-Native Written English, using Python’s scikit-learn (sklearn) library.

Dataset

Repo, files and folders

Goals

Let’s build machine learning models for the following three tasks.

  1. L1 identification: Given a response, predict the first language (L1) of the writer
  2. Topic (= prompt) identification: Given a response, predict the prompt for which it was written
    • This task is similar to task 1 above, with one caveat: the topics are not evenly distributed.
  3. Score prediction: Given a response, predict its score
    • The targets (‘low’, ‘intermediate’, ‘high’) are ordinal, not nominal, categories. The question then becomes whether to use regression models or classification models.

The three tasks have a lot in common, but they also differ in the fundamental ways noted above, and those differences require special attention.
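To make the regression-versus-classification question for score prediction concrete, here is a minimal sketch of both options. The six texts are invented stand-ins, not ETS data, and the choices of `TfidfVectorizer`, `LogisticRegression`, and `Ridge` are assumptions made only for illustration; the assignment does not prescribe them.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge

# Toy stand-in responses and scores (hypothetical, NOT from the ETS corpus).
texts = [
    "i like travel because is fun",
    "travel is good for people",
    "young people enjoy life more than older people do",
    "i agree that advertisements make products seem better",
    "although opinions differ, broad knowledge offers considerable advantages",
    "successful people take risks rather than repeating familiar routines",
]
scores = ["low", "low", "intermediate", "intermediate", "high", "high"]

# Ordinal encoding preserves the order low < intermediate < high.
ORDER = ["low", "intermediate", "high"]
label_to_rank = {label: rank for rank, label in enumerate(ORDER)}
y_rank = np.array([label_to_rank[s] for s in scores])

X = TfidfVectorizer().fit_transform(texts)

# Option A: classification treats the three labels as unordered classes.
clf = LogisticRegression(max_iter=1000).fit(X, scores)
clf_pred = clf.predict(X)

# Option B: regression predicts a number on the 0-2 scale, which is then
# rounded and clipped back to the nearest valid label.
reg = Ridge(alpha=1.0).fit(X, y_rank)
reg_rank = np.clip(np.rint(reg.predict(X)), 0, len(ORDER) - 1).astype(int)
reg_pred = [ORDER[r] for r in reg_rank]
```

A classifier that confuses ‘low’ with ‘high’ is penalized the same as one that confuses ‘low’ with ‘intermediate’, whereas the regression setup treats the first error as larger. Which behavior suits the task is the judgment call this homework asks you to make.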

Evaluation:

  1. L1 identification: Recall that the data set comes with its own training/development/test partition. Use this split in your evaluation.
  2. Topic (= prompt) identification: Use your own training-test split, in an 80-20 ratio. Use random seed 0.
  3. Score prediction: Try 5-fold cross-validation.
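The second and third evaluation setups could be sketched as follows. The data here is synthetic and generated on the spot, with the labels `P1` and `P2` and the word lists made up for the example, and the `CountVectorizer` plus `MultinomialNB` pipeline is an assumed placeholder model. The same toy labels are reused for the cross-validation part purely to keep the sketch short.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (hypothetical; NOT the ETS corpus).
rng = random.Random(0)
vocab = {
    "P1": ["travel", "guide", "tour", "group", "trip"],
    "P2": ["young", "older", "enjoy", "life", "people"],
}
topics = ["P1"] * 30 + ["P2"] * 20  # deliberately uneven label counts
texts = [" ".join(rng.choices(vocab[t], k=8)) for t in topics]

# Bundling the vectorizer with the classifier means the vocabulary is
# learned from training data only, both in the split and in each fold.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Task 2 style: a single 80-20 training-test split with random seed 0.
X_train, X_test, y_train, y_test = train_test_split(
    texts, topics, test_size=0.2, random_state=0
)
model.fit(X_train, y_train)
split_accuracy = model.score(X_test, y_test)

# Task 3 style: 5-fold cross-validation, one accuracy value per fold.
fold_scores = cross_val_score(model, texts, topics, cv=5)
mean_cv_accuracy = fold_scores.mean()
```

For the L1 identification task, no such split should be generated: the corpus's own partition determines which responses go into training, development, and test.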

Plots:

Machine learning algorithms:

Features:

BONUS POINTS:

Your report, aka Jupyter Notebook file

It should meet the following requirements:

Submission

Your private repository DS4Ling-Homework, which you have already shared with the instructors, counts as your submission. I will grade your last commit made on or before the deadline. If you would rather have a later version graded (with a late penalty), let me know. After the deadline has passed, we will also share our JNB files via Class-Exercise-Repo, as we did for HW2.