Data Science for Linguists 2025

Course home for LING 1340/2340


Homework 4: Supercomputing the Yelp Dataset

This homework carries a total of 50 points.

In class, we switched to a more proper venue for the Yelp data: the CRC’s supercomputer cluster. We already gave To-do #15 a supercomputing treatment. For this homework, we will take it one step further: machine learning. What does it take to unleash machine learning on “big data”? Let’s find out.

For this homework, create a folder called hw4_yelp/ in your home directory on CRC and keep all your work in it. This folder will be your submission.

Remember: you can find the Yelp review data file under the shared data folder: /ix1/ling2340_2025s/shared_data/yelp_dataset_2021/yelp_academic_dataset_review.json.

Plan of attack

Wrestling with big data can get hairy without some pre-planning. Separate your work into two stages:

Stage 1. A TRIAL RUN on a tiny slice of data: 10,000 reviews

Stage 2. A FULL(ish)-SCALE RUN
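
For the Stage 1 trial run, one way to grab the 10,000-review slice without loading the whole file is sketched below. It assumes the review file holds one JSON object per line. The function name read_first_reviews is made up for illustration and is not part of any provided starter code.

```python
import json
from itertools import islice

# Path as given in this assignment.
REVIEW_PATH = (
    "/ix1/ling2340_2025s/shared_data/yelp_dataset_2021/"
    "yelp_academic_dataset_review.json"
)


def read_first_reviews(path, n=10_000):
    """Return the first n reviews as a list of dicts.

    Assumes one JSON object per line (an assumption about the file layout).
    Only the first n lines are read and parsed, so the rest of the file
    never enters memory.
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n)]


# Example use on CRC (hypothetical):
# trial = read_first_reviews(REVIEW_PATH)
```

For Stage 2, the same function can be reused with a larger n, which keeps the trial and full(ish) runs on the same code path.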

Task: Classifying negative vs. positive reviews

As for the task itself, we’ll keep it simple: negative vs. positive review classification. It’s essentially a binary classification problem and a type of sentiment analysis.
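
To turn star ratings into two classes, you need a labeling rule. The sketch below is one possibility, not the required one: it assumes each review dict has "stars" and "text" keys, treats 1-2 stars as negative and 4-5 stars as positive, and leaves 3-star reviews out. The cutoffs and the names label_review and make_dataset are illustrative assumptions.

```python
def label_review(review, neg_max=2, pos_min=4):
    """Map a review dict to 'neg', 'pos', or None (excluded).

    The cutoffs are illustrative assumptions, not part of the assignment.
    """
    stars = review["stars"]
    if stars <= neg_max:
        return "neg"
    if stars >= pos_min:
        return "pos"
    return None  # middle ratings are left out of the binary task


def make_dataset(reviews):
    """Return parallel lists (texts, labels), skipping excluded reviews."""
    texts, labels = [], []
    for review in reviews:
        label = label_review(review)
        if label is not None:
            texts.append(review["text"])
            labels.append(label)
    return texts, labels
```

Whatever rule you settle on, check the resulting class balance before training, since it affects how the accuracy numbers should be read.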

Summary report as README.md

That’s a lot of moving parts and files, so create a summary document that rounds up everything plus a bit of further analysis. In your README.md file:

  1. List the contents of your submission, that is, the files in your hw4_yelp/ directory. Give each file a very short description of its content or purpose.
  2. Your “full(ish)-scale run” produced some numbers which will need interpretation and analysis. Do so here in its own separate section.
  3. A section on further analysis. What are some other analysis questions you might be able to investigate with your machine learning skills? What else would you find interesting to predict, and which attributes might serve as useful features?
  4. Lastly, summarize the computing resource aspect of your work: data size, parallel jobs utilized (if any), memory usage, wall-time, etc. Also, any thoughts on big data and supercomputing you might have.
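
For the wall-time and memory figures in item 4, one option is to measure them inside your Python script. The sketch below uses only the standard library and assumes a Linux system, where ru_maxrss is reported in kilobytes. The helper name run_with_usage is made up for illustration.

```python
import resource
import time


def run_with_usage(func, *args, **kwargs):
    """Call func and return (result, wall_seconds, peak_rss_mb).

    On Linux, ru_maxrss is in kilobytes and reflects the peak resident
    memory of the whole process so far, not only this call.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    wall = time.perf_counter() - start
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, wall, peak_kb / 1024


# Example use (hypothetical training function and arguments):
# model, seconds, peak_mb = run_with_usage(train_model, texts, labels)
# print(f"wall-time: {seconds:.1f} s, peak memory: {peak_mb:.0f} MB")
```

These numbers cover only the Python process itself. If your job spawns parallel workers, their usage is not included here.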

Some things to keep in mind:

Submission

We have read access to your hw4_yelp/ directory, so that’s your submission. It should include: