Daily To-do Assignments

To-do #1

Due 1/10 (Fri), 12:45pm (15 minutes before class)

The goal of this To-do is to get you started with Git. To that end, complete my LSA 2019 tutorial Part 1 “Intro to Git”, linked under the “Git” section of the Learning Resources page. Detailed instructions:

Are you completely new to Git? Then, my written tutorial alone won’t be enough to help you wrap your head around what’s going on. I highly recommend watching the videos from this LinkedIn Learning course. It’s much easier when someone shows you while explaining.
- LinkedIn Learning is free for Pitt: make sure to log-in with your Pitt ID.
- If you’re short on time, watch through Chapter 6 which will give you enough foundations, then turn to my tutorial.
Have you used Git before but aren’t very familiar? You should also start with the LinkedIn videos!
When you’re done with my tutorial, produce three screenshots of your console screen as proof of completion, that shows output of the three commands:
- git config --global --list
- git log
- ls -la (make sure you are in your languages folder)

SUBMISSION: On Canvas. Upload your screenshot files through the To-do1 submission link.

To-do #2

Due 1/13 (Mon), 12:45pm

The Internet is full of published linguistic data sets. Let’s data-surf! Instructions:

Go out and find two linguistic data sets you like. One should be a corpus, the other should be some other format. They must be free and downloadable in full. Make sure they are linguistic data sets, meaning designed specifically for linguistic inquiries or NLP engineering purposes.
You might want to start with various bookmark sites listed in the following Learning Resources sections: Linguistic Data, Open Access, Data Publishing and Corpus Linguistics. But don’t be constrained by them.
Download the data sets and poke around. Open up a file or two to take a peek. (No need to do this in Python: Save that for HW1.)
In a text file named datasets_yourname.txt, (note the .txt extension), make note of:
- The name of the data resource
- The author(s)
- The URL of the download page
- Its makeup: size, type of language, format, etc.
- License: whether it comes with one, and if so what kind?
- Anything else noteworthy about the data. A sentence or two will do.
If you are comfortable with markdown, make an .md file instead of a text file.

Git/GitHub submission instructions:

If you haven’t already, fork Class-Exercise-Repo from our class GitHub org. Then, clone your fork onto your laptop. Details are on today’s slides.
Inside, you will find two directories: activity1/ and todo2/. activity1 is for a practice run (instructions on this slide), which I highly recommend.
When you’re ready, copy or move your text file document into todo2/. Make sure it’s named something like datasets_yourname.txt so it won’t conflict with some other student’s.
Do the usual local git routine, which ends with committing. Then push to your own GitHub fork.
Confirm that your GitHub fork has your file.

If you already know about pull requests, go ahead and create one as the last step. We will go over the mechanics of pulling and merging in our next class.

SUBMISSION: That’s it! Your forked GitHub repository counts as your submission.

To-do #3

Learn about the numpy library. Between the Python Data Science Handbook chapter and the DataCamp tutorial, the DataCamp intro should be more accessible, so I recommend it for this To-do. Create your own study notes as a Jupyter Notebook file entitled numpy_notes_yourname.ipynb. Include examples from the DataCamp tutorial, explanations, etc.

SUBMISSION: Your file should be in the todo3/ directory of the Class-Exercise-Repo. Make sure your fork is up-to-date. Push to your GitHub fork, and create a pull request for me.

To-do #4

Study the pandas library (through the Python Data Science Handbook and/or the DataCamp tutorials). pandas is a big topic with lots to learn: aim for about 2/3. While doing so, try it out on TWO spreadsheet (.csv, .tsv, etc.) files:

The first file should be your choice. You can get one from this CSV Files archive, or make up your own. Keep it super small and simple at 5-100 rows. This is supposed to be a toy dataset that helps you learn pandas, and contending with a large volume of data at the same time will only hinder your learning process.
The second one should be billboard_lyrics_1964-2015.csv by Kaylin Pavlik, from her project ‘50 Years of Pop Music’. (Note: you might need to specify ISO8859 encoding when opening it.)

Don’t change the filename of any downloaded CSV files or edit them in any way – important! Also, no need to put them in a data/ folder, place them right inside todo4/ next to your Jupyter Notebook file. Name your Jupyter Notebook file pandas_notes_yourname.ipynb.

SUBMISSION: Your files should be in the todo4/ directory of Class-Exercise-Repo. Commit and push all three files (including your data files!) to your GitHub fork, and create a pull request for me.

To-do #5

This one is a continuation of To-do #4: work further on your pandas study notes. You may create a new JNB file, or you can expand the existing one. Also: try out a spreadsheet submitted by a classmate. You are welcome to view the classmate’s notebook to see what they did with it. (How to find out who submitted what? Git/GitHub history of course.) Give them a shout-out.

A head-up: When you view your classmate’s JNB, if you do so by opening up your local copy in Jupyter Notebook (as opposed to viewing an online copy on github.com), Anaconda Jupyter will insert a new invisible timestamp into the local JNB file. This means you would have altered the file without typing anything, which will lead to git nudging you to commit the change, which you shouldn’t! If this happens, you should restore the older version of the file using git checkout.

SUBMISSION: We’ll stick with the todo4/ directory in Class-Exercise-Repo. Push to your GitHub fork, and create a pull request for me.

To-do #6

Plotting time! matplotlib and seaborn are popular Python libraries for plot graphs and visualization. The goal of this To-do is practice them using the “English” data:

Data: english.csv, in the todo6 folder of “Class-Exercise-Repo”
Data from Balota, D.A., Cortese, M.J. and Pilotti, M. (1999) Visual lexical decision latencies for 2906 words.
- Official page: http://psychnet.wustl.edu/coglab/?page_id=275 (NOT corpora!)
- Download link dead, says “do not distribute data second hand”… :-(
- Distributed as part of the languageR package: https://www.rdocumentation.org/packages/languageR/versions/1.5.0
- Column descriptions available here: https://www.rdocumentation.org/packages/languageR/versions/1.5.0/topics/english

Your Jupyter Notebook study notes should be named plot_notes_yourname.ipynb.

SUBMISSION: Your files should be in the todo6/ directory of Class-Exercise-Repo. Commit and push to your GitHub fork, and create a pull request for me.

To-do #7

What have the previous students of LING 1340/2340 accomplished? What do finished projects look like? Let’s have you explore their past projects. Details:

GitHub wise, we’ll brave a new frontier: editing a shared file! We will collect our notes in this doc: past_project_critiques.md. To minimize potential conflicts and streamline clean-up efforts, I’ve already charted out everyone’s sections. Find your name in there.
Pick two projects from four previous years: 2024 spring, 2023 spring, 2022 spring, and 2021 spring.
Your critique should consist of: (1) one thing you thought was done well, (2) one avenue for improvement, and (3) one thing you learned.
As you pull from upstream, your file may end up being in conflict. It’s your job to edit the markdown file and get it to a tidy state before your submission. Be careful not to trample on your classmate’s work! You can learn about how to resolve conflicts here.

SUBMISSION: As usual, push to your fork and create a pull request. Make sure your team’s markdown file is in good shape!

To-do #8

Let’s dig into the issues of copyright and license in language data. We’ll then pool our discussion questions together in a shared markdown document.

Review the topics of linguistic data, open access, and data publishing, focusing in particular on:

Dr. Lauren Collister’s 2022 article “Copyright and Sharing Linguistic Data” for the Open Handbook of Linguistic Data Management and the web site “Copyright and Intellectual Property Toolkit” at Pitt Library.
Also, watch her recorded presentation on the topic (75 minutes), available via Pitt’s Panopto.

Think of a discussion question or two on the topic: it could be something general, or it could be a specific question relating to your own data use case. Add yours to this shared markdown file.

SUBMISSION: As usual, push to your fork and create a pull request. Like last time, make sure the shared markdown file is in good shape!

To-do #9

Did you know there’s a linguistic data project run in our department? There IS, and it’s about… Pittsburghese! Archive of Pittsburgh Language and Speech (APLS) is a project by Dr. Dan Villarreal and his student RAs aimed at curating a sociolinguistic data resource for Pittsburgh English. Let’s explore this project.

Start with the APLS documentation site here: https://djvill.github.io/APLS/
- Don’t skip the Terms of use page… important!
Sign in to APLS and explore the data. Log-in credentials can be found on the shared markdown document.
Time to take collective notes. In the shared markdown file in the todo9/ folder, enter three items:
1. A comment from a DATA USER’s point of view. Your impression, what you discovered, etc.
2. A comment about the DEVELOPMENT side of this project.
3. A question for Maya Asher, a veteran APLS project member, who will give a guest presentation on Friday.

SUBMISSION: As usual, push to your fork and create a pull request.

To-do #10

We have a special theme: data format treasure hunt! Let’s compile a roster of common formats used for linguistic data. I want everyone to contribute TWO examples from this list:

CSV/TSV, XML, JSON, CoNLL-U, Penn Treebank, TextGrid, CHAT, other (project-specific format).

Some ground rules to ensure a variety in our collection:

A maximum of TWO entries per format. For example, if XML already has two entries, you should find something else.
A maximum of TWO examples from the same source. So, if GUM was referenced twice already, pick a different resource.

This means you will have to watch the current state of this shared document to make sure your contribution does not duplicate a classmate’s. Check GitHub for the latest update, and also – important – any outstanding pull requests from your classmates that I haven’t processed yet.

This time around, the shared markdown document is organized per data format. Log your two entries under the appropriate format headers. Details:

Provide a link to the project home.
If you could find annotation guidelines/documentation, provide a link.
If a specific data file’s URL can be linked, do so.
Make note of what type of linguistic annotation it contains: POS, syntactic bracketing, named entity, etc.
For “other” type, specify what format it is. If it’s an ad-hoc, project-specific format, add a brief explanation of how the format is structured.

SUBMISSION: As usual, push to your fork and create a pull request. Watch for potential conflicts: make sure the shared markdown file is in good shape.

To-do #11

Let’s learn about web scraping. It is in fact a vast topic which requires learning about the very building blocks of web sites (HTML, CSS, etc.). DataCamp has a whole course devoted to it (Web Scraping in Python), but for now, let’s all just dip our toes.

Try out the “Web Scraping with BeautifulSoup” tutorial posted in the Web and Social Media Mining section of our learning resources page. Try out a web page of your own choice! Name your Jupyter Notebook bs4_web_scraping_YOURNAME.ipynb, which should be in the todo11 folder of our Class-Exercise-Repo.

Fair warning: not all web sites are easily scrapable, and some are even configured to detect and block web scraping queries. If you are stumped, shoot for web sites with simple functionality.

And a note on licensing: normally, if you’re going to scrape a significant portion of a web site you should absolutely pay close attention to the site’s TOS, but downloading a web page or two for learning purposes should be within the bounds of fair use.

SUBMISSION: As usual, push to your fork and create a pull request.

To-do #12

Let’s get started with sklearn, a popular machine learning library. We’ll try sentiment classification on movie reviews. Follow this tutorial in your own Jupyter Notebook file. Feel free to explore and make changes as you see fit. If you haven’t already, review the Python Data Science Handbook chapters to give yourself a good grounding. If you want to get a serious start on ML learning: watch DataCamp tutorials Supervised Learning with scikit-learn, and NLP Fundamentals in Python.

Students who took LING 1330: compare sklearn’s Naive Bayes with the NLTK’s treatment and include a blurb on your impressions and questions. (You don’t have to run NLTK’s code, unless you want to!)

SUBMISSION: Your jupyter notebook file should be in the todo12 folder of Class-Exercise-Repo. As usual, push to your fork and create a pull request.

To-do #13

What has everyone been up to? Let’s take a look – it’s a “visit your classmates” day!

First off, prepare your own “Guestbook” file. It’s already been created in the Class-Lounge repo, but you should edit it so that:
- It has your project title and a link to your repo, and your name
- And a bit of personalization if you want, like a greeting.
Now visit your classmates’ projects! We will go by the order of first names, which can be found right in the directory. You should visit two people after you (Ashley: Claire and Jana, Qidu: Sara and Ashley, etc.)
Take a look around, and write on their guestbook. (You don’t have to wait until it’s prepped.) Like the previous “past projects” To-do, your entry should consist of: one thing you thought was done well, one avenue for improvement or suggestion, and one thing you learned.

SUBMISSION: Since Class-Lounge is a fully collaborative repo, there is no formal submission process.

To-do #14

Let’s poke at big data. Well, big-ish – how about 7 million restaurant reviews? The Yelp Open Dataset was from Yelp’s Dataset Challenge some years ago, where Yelp made their huge review dataset available for academic groups that participate in a data mining competition. Challenge accepted! Before we begin:

You will need a strong and stable internet connection and at least 16GB of free disk storage space.
Provision enough time. Downloading the dataset alone may take 25 minutes or longer.

Mode of operation

After downloading the dataset, you should operate exclusively in a command-line environment, utilizing unix tools.
I am supplying general instructions below, but you will have to fill in the blanks between steps, such as cd-ing into the right directory, invoking your Anaconda Python and finding the right file argument. Make sure to refer to the last lecture slides!!

Step 1: Preparation, exploration

Let’s download this beast and poke around.

Download the JSON portion of the data. (We don’t need the photos.)
Move the downloaded zip archive file into your Documents/Data_Science directory. You might want to create a new folder there to store the data files. Unzip the file, which should create a new folder.
From this point on, operate exclusively in command line.
The data file yelp_dataset.tar is in the .tar format. Look it up if you are not familiar. Untar it using tar -xvf. I will extract 5 json files along with a PDF document.
Using various unix commands (ls -laFh, head, tail, wc -l, etc.), find out: how big are the json files? What do the contents look like? How many reviews are there?
How many reviews use the word ‘horrible’? Find out through grep and wc -l. Take a look at the first few through head | less. Do they seem to have high or low stars?
How many reviews use the word ‘scrumptious’? Do they seem to have high stars this time?

Step 2: A stab at processing

How much processing can our own puny personal computer handle? Let’s find out.

First, take stock of your computer hardware: disk space, memory, processor, and how old it is.
Create a Python script file: process_reviews.py. Content below. You can use nano, or you could use your favorite editor (atom, notepad++) provided that you launch the application through command line.

import pandas as pd
import sys
from collections import Counter

filename = sys.argv[1]

df = pd.read_json(filename, lines=True, encoding='utf-8')
print(df.head(5))

wtoks = ' '.join(df['text']).split()
wfreq = Counter(wtoks)
print(wfreq.most_common(20))

We are NOT going to run this on the whole review.json file! Start small by creating a tiny version consisting of the first 10 lines, named FOO.json, using head and >.
Then, run process_reviews.py on FOO.json. Note that the json file should be supplied as command-line argument to the Python script, so your command will look something like below.
- python process_reviews.py FOO.json
Confirm it ran successfully. If it didn’t, examine your python and path setting and find a workaround. Consult the last lecture slides, pages 5-7.
Next, re-create FOO.json with incrementally larger total # of lines and re-run the Python script. The point is to find out how much data your system can reasonably handle. Could that be 1,000 lines? 100,000?
While running this experiment, closely monitor the process on your machine. Windows users should use Task Manager, and Mac users should use Activity Monitor.
Finally, write up a short summary on this shared markdown file in Class-Lounge. A few sentences will do. How was your laptop’s handling of this data set? What sorts of resources would it take to successfully process it in its entirety and through more computationally demanding processes? Any other observations?

SUBMISSION: Your entry on this shared MD file. Make sure to properly resolve conflicts (if any)!

To-do #15

Trying out CRC, with bigger data + better code!

Warm-up

Following the H2P user guide, log into your CRC account.
Try out the To-do #14 redux I demonstrated in class, using 1 million review data.

A note about seff’s reporting on memory use. You might see the memory usage showing up as 0.00 MB instead of 9GB. Turns out, for short jobs, its reporting is not reliable. System monitoring is done on 30-second intervals, so jobs shorter than 30 seconds may completely slip through between polling.

Take 1: Bigger data

Let’s go bigger! Reprise the experiment using 4 million reviews this time.
Files involved: review_4mil.json (newly created), process_reviews.py (same Python script), todo14.sh (slurm script), todo14.out (newly generated output file).
Pay attention to the --mem-per-cpu option. Does your job need larger memory?
After the job is done, study the job stats via seff job-id command.

Take 2: Better code

Let’s improve our Python code this time. Create a new script file process_reviews_eff.py with the following. The code produces the same results, but structured differently.

import pandas as pd
import sys
from collections import Counter

filename = sys.argv[1]

df_chunks = pd.read_json(filename, chunksize=10000, lines=True, encoding='utf-8')

wfreq = Counter()

for chunk in df_chunks:
    for text in chunk['text']:
        wfreq.update(text.split())

print(wfreq.most_common(20))

Keep the data size the same: 4 million reviews.
Create a new slurm script todo15.sh that runs this new Python script, with the output file name todo15.out.
After the job is done, study the job stats via seff job-id command. Night and day! What about this new Python code led to this much improvement in efficiency? Give it some thought, we’ll discuss in class.

Take 3: EVEN BIGGER data and better code (optional, ONLY IF you’re curious!)

Run the new, improved Python script on all of review data, which consists of 8.6 million reviews total.
When the job is done, again check the stats. How long did it take to run, and how much RAM did the process use?

SUBMISSION: Your files on CRC are your submission. I have read access to them.

To-do #16

Visit your classmates, round 2.

First, maintain your own guestbook. Respond to your classmates’ visit logs.
- No need to post a lengthy response – you are not starting a debate here! You can think of it as something akin to Facebook’s “like”, just a small acknowledgment or a thank you.
Now visit your classmates’ projects, two of ‘em. Like before we go by the order of first names, which can be found right in the directory. You should visit two people after your previous visits.
- Take a look around, and write on their guestbook. Like the previous visit, your entry should consist of: one thing you thought was done well, one avenue for improvement or suggestion, and one thing you learned.

SUBMISSION: Since Class-Lounge is a fully collaborative repo, there is no formal submission process.

To-do #17

Visit your classmates, round 3. You know what to do!