Data Science for Linguists 2019

Course home for LING 1340/2340

Homework 4: Supercomputing the Yelp Dataset

This homework carries a total of 80 points.

In class, we switched to a more proper venue for the Yelp data: the CRC’s h2p supercomputer cluster. We already gave To-do #12 a supercomputing treatment. For this homework, we will take it one step further: machine learning. What does it take to unleash machine learning on “big data”? Let’s find out.

For this homework, in your home directory on CRC, create a folder called hw4_yelp/ and keep all your work in here. This will be your submission.

Plan of attack

Wrestling with big data can get hairy without some pre-planning. Separate your work into two stages:

  1. A TRIAL RUN on a tiny slice of data: 10,000 reviews
    • The “FOO” principle still applies: start small. You don’t want to discover a bug in your script 30 minutes into a big job! In your hw4_yelp/ directory, create a file named review_10k.json as a mini version of review.json, containing the first 10,000 lines, and work with it.
    • PLATFORM: Remote Jupyter Notebook through CRC hub. Make sure to choose SMP - 1 core, 3 hours when spawning!
    • This is when you explore your data, experiment with your models, visualize: essentially the whole gamut of data-sciencing.
    • The goal here is for you to find a good model and some parameters for a benchmark, and also to develop the code you will later use on a lot more data. The miniature data size and the Jupyter Notebook interface make this process far more manageable.
    • To make Part 2 quicker, you should write your code here to make use of GridSearchCV and pipelines.
    • That said, don’t go overboard with many different classifier models. You can even focus on a single classifier model with different parameters.
  2. A FULL(ish)-SCALE RUN
    • This you don’t want to do through Jupyter Notebook! You will need a proper slurm job submission.
    • PLATFORM therefore should be: Command-line python script through slurm job submission.
    • Your Python script can be exported from your Jupyter Notebook via the command jupyter nbconvert --to script your_notebook.ipynb. This produces your_notebook.py. Streamline it by removing code blocks that are mainly there for exploration. Visualization likewise is not useful here (unless you save plots to a file).
    • To begin, you should train and test ONE classifier, using GridSearchCV. Basically this means passing in lists of length 1 for each of the parameters you’re trying to tune.
    • Once you’ve done that, you can try using the same classifier with different parameters, but make sure to take advantage of parallel processing. You should utilize the n_jobs parameter in GridSearchCV, and the --ntasks SBATCH configuration. You’ll notice that as you increase n_jobs, things will run faster, but you will consume more memory. You can increase runtime and memory limits with #SBATCH --time and --mem, respectively.
    • Still, the entire set of 6.7 million reviews is a LOT. We recommend you start with the first 1 or 2 million lines. It would be nice to successfully train/test on the entire set, but your best try is fine for this homework. You might find your jobs taking a few hours to run…
    • You can print any output and capture it in a file by specifying the appropriate filename in the job script with #SBATCH --output=<output filename>.
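
The GridSearchCV-plus-pipeline setup described in the plan could be sketched roughly as follows. This is only an illustration under assumptions: the TF-IDF vectorizer, the logistic regression classifier, the parameter names in the grid, the helper name build_search, and the toy data are all placeholders chosen for the example, not choices prescribed by the homework. The grid uses lists of length 1, so exactly one parameter combination is trained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def build_search(param_grid, n_jobs=1, cv=3):
    """Wrap a TF-IDF + logistic regression pipeline in a grid search.

    The vectorizer and classifier are illustrative choices (assumptions),
    not the ones required by the assignment.
    """
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    return GridSearchCV(pipe, param_grid, cv=cv, n_jobs=n_jobs)


# Lists of length 1: the grid holds exactly one parameter combination.
single_grid = {"tfidf__ngram_range": [(1, 1)], "clf__C": [1.0]}

# Toy stand-in data (made up), only so the sketch runs end to end.
texts = ["great food", "loved it", "awful service", "terrible meal"] * 5
labels = ["pos", "pos", "neg", "neg"] * 5

search = build_search(single_grid, n_jobs=1)
search.fit(texts, labels)
print(search.best_params_)
print(len(search.cv_results_["params"]))  # number of combinations tried
```

To compare several settings later, the same function can take longer lists in the grid and a larger n_jobs value; the step names before the double underscore ("tfidf", "clf") must match the names given in the Pipeline.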

Task: Classifying negative vs. positive reviews

As for the task itself, we’ll keep it simple: negative vs. positive review classification. It’s essentially a binary classification problem and a type of sentiment analysis.
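
One possible way to prepare the data for this task is sketched below. It rests on assumptions that should be checked against the dataset documentation: that review.json holds one JSON object per line, that each object has "stars" and "text" fields, and that 1-2 stars count as negative and 4-5 stars as positive, with 3-star reviews left out. The thresholds and the helper names star_to_label and load_labeled are hypothetical, picked for illustration rather than given by the assignment.

```python
import io
import json
from itertools import islice


def star_to_label(stars, neg_max=2, pos_min=4):
    """Map a star rating to 'neg', 'pos', or None (middle ratings skipped).

    The cutoffs are an assumption, not part of the assignment.
    """
    if stars <= neg_max:
        return "neg"
    if stars >= pos_min:
        return "pos"
    return None


def load_labeled(lines, limit=None):
    """Read up to `limit` JSON lines; return parallel lists of texts and labels."""
    texts, labels = [], []
    for line in islice(lines, limit):
        review = json.loads(line)
        label = star_to_label(review["stars"])
        if label is not None:
            texts.append(review["text"])
            labels.append(label)
    return texts, labels


# Tiny in-memory stand-in for review.json (made-up records, not real data).
sample = io.StringIO(
    '{"stars": 5.0, "text": "Loved it."}\n'
    '{"stars": 1.0, "text": "Never again."}\n'
    '{"stars": 3.0, "text": "It was okay."}\n'
)
texts, labels = load_labeled(sample, limit=10)
print(labels)
```

With a real file, the same function can be given an open file handle in place of the StringIO object, for example with limit=10000 for the trial run. Note that limit counts lines read, so the number of labeled reviews returned may be smaller.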

Summary report as README.md

That’s a lot of moving parts and files, so create a summary document that rounds up everything plus a bit of further analysis. In your README.md file:

  1. List the content of your submission, that is, the files in your hw4_yelp/ directory. Give a very short description of each file’s content/purpose.
  2. Your “full-scale run” produced some numbers which will need interpretation and analysis. Do so here in its own separate section.
  3. A section on further analysis. Look closely at the documentation for the dataset. What are some other analysis questions you might be able to investigate with your machine learning skills? What else would you find interesting to predict, and which attributes might serve as useful features?

Some things to keep in mind:

Submission

We have read access to your hw4_yelp/ directory, so that’s your submission. It should include: