Data Science for Linguists 2021

Course home for LING 1340/2340

Homework 4: Supercomputing the Yelp Dataset

This homework carries a total of 80 points.

In class, we switched to a more proper venue for the Yelp data: the CRC’s h2p supercomputer cluster. We already gave To-do #12 a supercomputing treatment. For this homework, we will take it one step further: machine learning. What does it take to unleash machine learning on “big data”? Let’s find out.

For this homework, in your home directory on CRC, create a folder called hw4_yelp/ and keep all your work in here. This will be your submission.

Plan of attack

Wrestling with big data can get hairy without some pre-planning. Separate your work into two stages:

  1. A TRIAL RUN on a tiny slice of data: 10,000 reviews
    • The “FOO” principle still applies: start small. You don’t want to discover a bug in your script 30 minutes into a big job! In your hw4_yelp/ directory, create a file named review_10k.json as a mini version of yelp_academic_dataset_review.json, containing only 10,000 lines, and work with it.
    • PLATFORM: Remote Jupyter Notebook through CRC’s JupyterHub. Make sure to choose SMP - 1 core, 3 hours when spawning!
    • This is when you explore your data, experiment with your models, visualize: essentially the whole gamut of data-sciencing.
    • The goal here is for you to find a good model and some parameters for a benchmark, and also to develop the code you will later use on a lot more data. The miniature data size and the Jupyter Notebook interface make this process far more manageable.
    • To make stage 2 (the full-scale run) quicker, you should write your code here to make use of GridSearchCV and pipelines.
    • That said, don’t go overboard with many different classifier models. You can even focus on a single classifier model with different parameters.
  2. A FULL(ish)-SCALE RUN
    • This you don’t want to do through Jupyter Notebook! You will need a proper slurm job submission.
    • PLATFORM therefore should be: Command-line python script through slurm job submission.
    • Your Python script can be exported from your Jupyter Notebook:
      • First, load the appropriate Python module if you haven’t already: module load python/3.7.0
      • Then: jupyter nbconvert --to script your_notebook.ipynb. This produces your_notebook.py.
    • Streamline your script by removing code blocks that are mainly there for exploration. Visualization likewise is not useful here (unless you save out plots into a file).
    • To begin, train and test just ONE parameter combination using GridSearchCV. Basically this means passing in lists of length 1 for each of the parameters you’re trying to tune.
    • Once you’ve done that, you can try using the same model with a larger # of parameter combos, but make sure to take advantage of parallel processing. You should utilize the n_jobs parameter in GridSearchCV (your python script), and the #SBATCH --cpus-per-task=n setting (your Slurm script). You’ll notice that as you increase the number n things will run faster, but you will consume more memory (verify through seff <job-id>).
    • If your job is getting killed because of the time limit, add this line to your Slurm script: #SBATCH --time=180, which will give you 3 hours (180 minutes) of runtime instead of the default 60 minutes. (Caveat: requesting a larger time block may result in a longer wait time in the job queue.) But what if 3, 4, or even 5 hours aren’t enough? Before requesting even more, you should first look into options for decreasing the time complexity of your code.
    • Still, the entire set of 8.6 million reviews is a LOT. We recommend you start with the initial 1 or 2 million lines, and go up to 4 million only after that. It would be nice to successfully train/test with the entire set, but your best try is fine for this homework. You might find your jobs taking a few hours to run…
    • If you are using the full 8.6 million reviews (yay!), do NOT make a copy in your working directory. Instead, your script should read directly from the original file /bgfs/ling2340_2021s/shared_data/yelp_dataset_2021/yelp_academic_dataset_review.json.
    • BTW, if you need to cancel your job, use: scancel <job-id>.
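
The "ONE parameter combination" step and the n_jobs setting described above could be sketched roughly as follows. This is only an illustration under assumptions, not the required solution: the TF-IDF plus logistic regression pipeline, the helper name build_search, the parameter values, and the toy reviews are all made up for the example. It also assumes that Slurm exposes the value of --cpus-per-task through the SLURM_CPUS_PER_TASK environment variable, and falls back to 1 core when that variable is not set (for example, inside a notebook).

```python
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def build_search(param_grid, n_jobs=None, cv=3):
    """Wrap a text-classification pipeline in GridSearchCV (hypothetical helper)."""
    if n_jobs is None:
        # Assumption: match the core count requested via #SBATCH --cpus-per-task.
        n_jobs = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),                # illustrative feature extractor
        ("clf", LogisticRegression(max_iter=1000)),  # illustrative classifier
    ])
    return GridSearchCV(pipe, param_grid, cv=cv, n_jobs=n_jobs, scoring="accuracy")


# One parameter combination: every list has length 1.
single_combo = {"tfidf__min_df": [1], "clf__C": [1.0]}

# Toy stand-in data; the real script would load review text and labels instead.
texts = [
    "great food and friendly staff", "loved it, will come back",
    "amazing service and tasty dishes", "best pizza in town",
    "terrible service and cold food", "awful, never again",
    "rude staff and long wait", "worst burger I have had",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

search = build_search(single_combo, cv=2)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Growing the search later only means lengthening the lists in the grid, for instance {"tfidf__min_df": [1, 5], "clf__C": [0.1, 1.0, 10.0]}, while the rest of the script stays the same.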

Task: Classifying negative vs. positive reviews

As for the task itself, we’ll keep it simple: negative vs. positive review classification. It’s essentially a binary classification problem and a type of sentiment analysis.
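
One possible way to turn the reviews into a binary problem is sketched below. It assumes each line of the review file is a JSON object with "stars" and "text" fields, and it makes an illustrative labeling choice that this assignment does not prescribe: 4 to 5 stars count as positive, 1 to 2 stars as negative, and 3-star reviews are skipped. The function name load_labeled_reviews and the n_lines argument are hypothetical; n_lines simply limits reading to the first lines of the file, so the same code works for a trial slice and for a larger run.

```python
import json
import os
import tempfile
from itertools import islice


def load_labeled_reviews(path, n_lines=None):
    """Read up to n_lines JSON lines and return (texts, labels).

    Assumed labeling: stars > 3 -> 1 (positive), stars < 3 -> 0 (negative),
    and 3-star reviews are dropped as neutral.
    """
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        # islice stops after n_lines lines; None means read the whole file.
        for line in islice(f, n_lines):
            review = json.loads(line)
            stars = review["stars"]
            if stars == 3:
                continue
            texts.append(review["text"])
            labels.append(1 if stars > 3 else 0)
    return texts, labels


# Tiny made-up file standing in for the real review data.
demo_reviews = [
    {"stars": 5.0, "text": "Fantastic tacos."},
    {"stars": 1.0, "text": "Cold fries and a long wait."},
    {"stars": 3.0, "text": "It was okay."},
    {"stars": 4.0, "text": "Solid coffee, nice staff."},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    for r in demo_reviews:
        tmp.write(json.dumps(r) + "\n")
    demo_path = tmp.name

first_three = load_labeled_reviews(demo_path, n_lines=3)
everything = load_labeled_reviews(demo_path)
os.remove(demo_path)
print(first_three[1], everything[1])
```

Reading line by line keeps memory use modest, since only the text and label of each kept review are stored rather than the full JSON records.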

Summary report as README.md

That’s a lot of moving parts and files, so create a summary document that rounds up everything plus a bit of further analysis. In your README.md file:

  1. List the content of your submission, that is, the files in your hw4_yelp/ directory. Give a very short description of the content/purpose.
  2. Your “full(ish)-scale run” produced some numbers which will need interpretation and analysis. Do so here in its own separate section.
  3. A section on further analysis. Look closely at the documentation for the dataset. What are some other analysis questions you might be able to investigate with your machine learning skills? What else would you find interesting to predict, and which attributes might serve as useful features?
  4. Lastly, summarize the computing resource aspect of your work: data size, parallel jobs utilized, memory usage, wall-time, etc. Also, any thoughts on big data and supercomputing you might have.

Some things to keep in mind:

Submission

We have read access to your hw4_yelp/ directory, so that’s your submission. It should include: