Homework 3: Machine Learning with the ETS Corpus
This homework carries a total of 80 points.
For this homework, we will apply machine learning methods to the ETS Corpus of Non-Native Written English data, using Python’s sklearn
library.
Dataset
- You should be fairly familiar with this corpus by now. It was published by the LDC (Linguistic Data Consortium). I have acquired a license on behalf of all Pitt faculty and students. I am distributing the data through the privately held `Licensed-Datasets` repo of our class GitHub organization.
- You should have already cloned the repo as your local repository. There's no need to fork.
- As with the previous homework, keep the corpus data files in their original directory location and original form.
- NEW to HW3: you should work with a saved pickle file. See next section.
Repo, files and folders
- Work in your own private repository `DS4Ling-Homework`, which already contains your previous work on the ETS data. Create a new Jupyter Notebook file for this homework.
- As before, the ETS corpus data files should remain in their original directory location. Do not move or copy them into this repo.
- You shouldn't have to start from the original ETS data files and go through the basic text processing steps every single time! We already completed these steps as part of HW2. So, assuming you have the processed dataframes saved as pickle files at the end of HW2, you should simply unpickle the two dataframes `essay_df` and `prompt_df`, and then go from there (see the sketch at the end of this section). See Na-Rae's HW2 KEY script for how it's done, then modify your HW2 to produce the pickle files. The files should be in `DS4Ling-Homework`, alongside your JNB files.
- Should you be committing these pickle files? Absolutely not! Make sure to update your `.gitignore` file to ignore pickle (`.pkl`) and CSV (`.csv`) files.
- While working on your code, you should be frequently committing and pushing to your repo. Important!
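For reference, the pickle round trip is just two pandas calls per dataframe. A minimal sketch (the `.pkl` file names here are only suggestions):

```python
import pandas as pd

# End of HW2: save the processed dataframes to disk.
essay_df.to_pickle('essay_df.pkl')
prompt_df.to_pickle('prompt_df.pkl')

# Top of this HW3 notebook: load them back and pick up where you left off.
essay_df = pd.read_pickle('essay_df.pkl')
prompt_df = pd.read_pickle('prompt_df.pkl')
```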
Goals
Let’s build machine learning models for the following three tasks.
- L1 identification: Given a response, predict the first language (L1) of the writer
- Topic (= prompt) identification: Given a response, predict the prompt for which it was written
- This task is quite similar to Task 1 above, with one caveat: the topic distribution is not even.
- Score prediction: Given a response, predict its score
- The targets (‘low’, ‘intermediate’, ‘high’) are ordinal, not nominal, categories. The question then becomes: Regression models or classification models?
The three tasks have a lot in common, but they also come with some fundamental differences that require special attention, which are noted above.
Evaluation:
- L1 identification: Remember that the dataset itself came with its own training-development-test partition. Use this split in your evaluation.
- Topic (= prompt) identification: Use your own training-test split, in 80-20 ratio. Use random seed 0.
- Score prediction: Try 5-fold cross validation.
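For Tasks 2 and 3, the relevant `sklearn` utilities look roughly like this; `X`, `y`, and `clf` stand in for your own feature matrix, target labels, and chosen model:

```python
from sklearn.model_selection import train_test_split, cross_val_score

# Task 2: your own 80-20 split, with random seed 0.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Task 3: 5-fold cross validation; returns one score per fold.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())
```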
Plots:
- Confusion matrices in `seaborn`'s "heat map" format are an obvious choice. You should use them.
- In addition, try at least one other visualization method.
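One way to produce the heat map, assuming `y_test` and `y_pred` hold your gold and predicted labels:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

labels = sorted(set(y_test))          # fix a consistent label order
cm = confusion_matrix(y_test, y_pred, labels=labels)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted')
plt.ylabel('Gold')
plt.show()
```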
Machine learning algorithms:
- Try at least two different machine learning algorithms/models for each task.
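Looping over candidate models keeps the comparison tidy. In the sketch below, `MultinomialNB` and `LogisticRegression` are just two common picks for text classification (use whichever models you prefer), and `X_train` etc. come from your own split:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Fit each candidate model and report its accuracy on the test set.
for clf in [MultinomialNB(), LogisticRegression(max_iter=1000)]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```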
Features:
- The “bag-of-words” approach is the most basic one. You should try it.
- Try additional features. Text-based features that go beyond individual words could be interesting. How about features that are not based on text: for example, will L1 be a helpful predictor for the prompt? How about the token count for the score level?
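Picking up the bag-of-words bullet above: a minimal sketch, assuming the essay texts live in a column named `text` (adjust to your own column name):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words: each essay becomes a (sparse) vector of word counts.
vec = CountVectorizer()
X = vec.fit_transform(essay_df['text'])
print(X.shape)   # (number of essays, vocabulary size)
```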
BONUS POINTS:
- Try dimensionality reduction through PCA (Principal Component Analysis) or SVD (Singular Value Decomposition).
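If you go this route, note that `TruncatedSVD` works directly on the sparse bag-of-words matrix, whereas `PCA` requires a dense one. A rough sketch (`n_components=100` is an arbitrary starting point):

```python
from sklearn.decomposition import TruncatedSVD

# Reduce the sparse feature matrix X to 100 dimensions.
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)
print(svd.explained_variance_ratio_.sum())   # variance retained
```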
Your report, aka Jupyter Notebook file
It should meet the following requirements:
- At the top of your Jupyter Notebook should be a markdown cell with the homework title (use `#`), your name, email, and date.
- It should have these five section headers (use `##`):
- Load the data, recap basic stats
- Task 1: L1 identification
- Task 2: prompt identification
- Task 3: score prediction
- Summary analysis
- Those are the top-level sections (use `##`), but feel free to add your own sub-sections and other markdown cells as you see fit.
- Don’t forget to make it presentable! Keep in mind that your JNB file is a written project report with embedded Python code. Don’t just produce evaluation scores: your own analysis and interpretation of the numbers are the end goal.
Submission
Your private repository `DS4Ling-Homework` counts as your submission; you have already shared it with the instructors. I will be grading your very last commit on/before the deadline. If you want a later version as your submission instead (with a late penalty), let me know. After the deadline has passed, we will also share our JNB files via `Class-Exercise-Repo`, like we did for HW2.