Course home for LING 1340/2340
Due 1/11 (Wed), 9:45am
The goal of this To-do is to get you started with Git. To that end, complete my LSA 2019 tutorial Part 1 “Intro to Git”, linked under the “Git” section of the Learning Resources page. Detailed instructions:
git config --global --list
git log
ls -la (make sure you are in your languages folder)
SUBMISSION: On Canvas. Upload your screenshot files through the To-do1 submission link.
Due 1/23 (Fri), 9:45am
The Internet is full of published linguistic data sets. Let’s data-surf! Instructions:
datasets_yourname.txt (note the .txt extension), make note of:
.md file instead of a text file.
Git/GitHub submission instructions:
Class-Exercise-Repo from our class GitHub org. Then, clone your fork onto your laptop. Details are on today’s slides.
todo2/ directory.
datasets_yourname.txt so it won’t conflict with some other student’s.
SUBMISSION: That’s it! Your forked GitHub repository counts as your submission.
Learn about the numpy library: study the Python Data Science Handbook and/or the DataCamp tutorial. While doing so, create your own study notes as a Jupyter Notebook file entitled numpy_notes_yourname.ipynb. Include examples, explanations, etc. Replicating DataCamp’s examples is also an option. You are essentially creating your own reference material.
SUBMISSION: Your file should be in the todo3/ directory of the Class-Exercise-Repo. Make sure your fork is up-to-date. Push to your GitHub fork, and create a pull request for me.
Study the pandas library (through the Python Data Science Handbook and/or the DataCamp tutorials). pandas is a big topic with lots to learn: aim to cover about half of it. While doing so, try it out on TWO spreadsheet (.csv, .tsv, etc.) files:
billboard_lyrics_1964-2015.csv by Kaylin Pavlik, from her project ‘50 Years of Pop Music’. (Note: you might need to specify ISO8859 encoding when opening.)
Don’t change the filename of any downloaded CSV files or edit them in any way – important! Name your Jupyter Notebook file pandas_notes_yourname.ipynb.
SUBMISSION: Your files should be in the todo4/ directory of Class-Exercise-Repo. Commit and push all three files to your GitHub fork, and create a pull request for me.
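The encoding note in this To-do can be illustrated with a small sketch. It uses only the standard library and a made-up two-row sample (not the real Billboard file) to show why a file saved as ISO-8859-1 fails to decode as UTF-8 but reads cleanly once the encoding is named explicitly. The file name and rows are assumptions for illustration; pandas’ read_csv takes an encoding argument that serves the same purpose.

```python
import csv
import os
import tempfile

# Hypothetical sample rows, NOT taken from the actual Billboard CSV.
rows = [["Song", "Artist"], ["Déjà Vu", "Beyoncé"]]

# Write the sample to a temporary file encoded as ISO-8859-1 (Latin-1).
path = os.path.join(tempfile.mkdtemp(), "sample_latin1.csv")
with open(path, "w", encoding="iso-8859-1", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading it back as UTF-8 fails: the accented bytes are not valid UTF-8.
try:
    with open(path, encoding="utf-8", newline="") as f:
        list(csv.reader(f))
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Naming the right encoding recovers the original text.
with open(path, encoding="iso-8859-1", newline="") as f:
    decoded = list(csv.reader(f))

print(utf8_ok)
print(decoded[1])
```

If a downloaded CSV raises a UnicodeDecodeError on opening, trying an explicit encoding like this is usually the first thing to check.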
This one is a continuation of To-do #4: work further on your pandas study notes. You may create a new Jupyter Notebook file, or you can expand the existing one. Also: try out a spreadsheet submitted by a classmate. You are welcome to view the classmate’s notebook to see what they did with it. (How to find out who submitted what? Git/GitHub history of course.) Give them a shout-out.
SUBMISSION: We’ll stick to the todo4/ directory in Class-Exercise-Repo. Push to your GitHub fork, and create a pull request for me.
Plotting time! matplotlib and seaborn are popular Python libraries for plotting and visualization. The goal of this To-do is to practice them using the “English” data:
english.csv, in the todo6 folder of “Class-Exercise-Repo”
languageR package: https://www.rdocumentation.org/packages/languageR/versions/1.5.0
Your Jupyter Notebook study notes should be named plot_notes_yourname.ipynb.
SUBMISSION: Your files should be in the todo6/ directory of Class-Exercise-Repo. Commit and push to your GitHub fork, and create a pull request for me.
What have the previous students of LING 1340/2340 accomplished? What do finished projects look like? Let’s have you explore their past projects. Details:
SUBMISSION: As usual, push to your fork and create a pull request. Make sure your team’s markdown file is in good shape!
Due earlier at 9am!!
Let’s dig into the issues of copyright and license in language data. We’ll then pool our questions together for Dr. Lauren Collister.
Review the topics of linguistic data, open access, and data publishing, focusing in particular on her 2022 article for the Open Handbook of Linguistic Data Management and the “Copyright and Intellectual Property Toolkit”. Then watch her guest presentation from a previous class; her slides can be found here.
Think of a question or two on the topic, and add yours along with your name to this Word document posted on our MS Teams forum. Dr. Collister will join our class on Friday to answer them.
SUBMISSION: The shared MS Word document is your submission.
Let’s learn about web scraping. It is in fact a vast topic which requires learning about the very building blocks of web sites (HTML, CSS, etc.). DataCamp has a whole course devoted to it (Web Scraping in Python), but for now, let’s all just dip our toes.
Try out the “Web Scraping with BeautifulSoup” tutorial posted in the Web and Social Media Mining section of our learning resources page. Try out a web page of your own choice! Name your Jupyter Notebook bs4_web_scraping_YOURNAME.ipynb, which should be in the todo9 folder of our Class-Exercise-Repo.
SUBMISSION: As usual, push to your fork and create a pull request.
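To get a feel for the HTML building blocks mentioned above before starting the tutorial, here is a minimal sketch. It deliberately uses Python’s built-in html.parser rather than BeautifulSoup, so it is a stand-in and not the tutorial’s method. The sample page and the LinkCollector class are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# A made-up page, standing in for HTML downloaded from a real site.
sample_html = """
<html><body>
  <h1>Made-up page</h1>
  <p><a href="/corpora">Corpora</a> and <a href="/tools">Tools</a></p>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

BeautifulSoup wraps this kind of tag-by-tag processing in a friendlier search interface, which is what the tutorial covers.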
With AI and natural language technologies making big waves, computational semantics is enjoying renewed popularity. One of the well-known projects is Abstract Meaning Representation (AMR), a formalism for semantic representation of English sentences. The project home page is found here: https://amr.isi.edu/index.html
It might initially look cryptic, but you might see similarities to the PropBank we learned in LING 1330. Your job: give yourself a crash course to learn as much as you can about AMR. Details:
(AMR_notes_YOURNAME.md) present the AMR instances along with your explanation of what’s going on.
(c / collect-01
   :ARG0 (h / he)
   :ARG1 (b / butterfly)
   :mode interrogative)
SUBMISSION: Your note should go into the todo10 folder of Class-Exercise-Repo. As usual, push to your fork and create a pull request.
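One way to start reading the AMR instance above is to pull out its pieces mechanically. The sketch below uses two regular expressions to list the nodes (variable / concept pairs) and the role labels. It is not a real AMR parser and ignores nesting; it is only meant to make the notation less cryptic.

```python
import re

# The AMR instance from this To-do, as a plain string.
amr = """(c / collect-01
   :ARG0 (h / he)
   :ARG1 (b / butterfly)
   :mode interrogative)"""

# Each "(variable / concept" pair introduces one node of the graph.
nodes = re.findall(r"\((\w+)\s*/\s*([\w-]+)", amr)

# Each ":label" names a relation (or attribute) attached to a node.
roles = re.findall(r":([\w-]+)", amr)

print(nodes)
print(roles)
```

Here collect-01 is the root concept, with he and butterfly filling its ARG0 and ARG1 roles, much like numbered arguments in PropBank.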
Let’s try sentiment analysis on movie reviews. Follow this tutorial in your own Jupyter Notebook file. Feel free to explore and make changes as you see fit. If you haven’t already, review the Python Data Science Handbook chapters to give yourself a good grounding. If you want to get a serious start on machine learning: watch DataCamp tutorials Supervised Learning with scikit-learn, and NLP Fundamentals in Python.
Students who took LING 1330: compare sklearn’s Naive Bayes with NLTK’s treatment and include a blurb on your impressions and questions. (You don’t have to run NLTK’s code, unless you want to!)
SUBMISSION: Your Jupyter Notebook file should be in the todo11 folder of Class-Exercise-Repo. As usual, push to your fork and create a pull request.
What has everyone been up to? Let’s take a look – it’s a “visit your classmates” day!
Class-Lounge repo, but you should edit it so that:
SUBMISSION: Since Class-Lounge is a fully collaborative repo, there is no formal submission process.
Let’s poke at big data. Well, big-ish – how about 7 million restaurant reviews? The Yelp DataSet Challenge has been going strong for 10+ years now, where Yelp make their huge review dataset available for academic groups that participate in a data mining competition. Challenge accepted! Before we begin:
Let’s download this beast and poke around.
Documents/Data_Science directory. You might want to create a new folder there for the data files.
.tar format. Look it up if you are not familiar. Untar it using tar -xvf. It will extract 5 json files along with a PDF document.
(ls -laFh, head, tail, wc -l, etc.), find out: how big are the json files? What do the contents look like? How many reviews are there?
grep and wc -l. Take a look at the first few through head | less. Do they seem to have high or low stars?
How much processing can our own puny personal computer handle? Let’s find out.
process_reviews.py. Content below. You can use nano, or you could use your favorite editor (atom, notepad++) provided that you launch the application through command line.
import pandas as pd
import sys
from collections import Counter
filename = sys.argv[1]
df = pd.read_json(filename, lines=True, encoding='utf-8')
print(df.head(5))
wtoks = ' '.join(df['text']).split()
wfreq = Counter(wtoks)
print(wfreq.most_common(20))
review.json file! Start small by creating a tiny version consisting of the first 10 lines, named FOO.json, using head and >.
process_reviews.py on FOO.json. Note that the json file should be supplied as a command-line argument to the Python script, so your command will look something like below.
python process_reviews.py FOO.json
FOO.json with incrementally larger total # of lines and re-run the Python script. The point is to find out how much data your system can reasonably handle. Could that be 1,000 lines? 100,000?
Class-Lounge. A few sentences will do. How was your laptop’s handling of this data set? What sorts of resources would it take to successfully process it in its entirety and through more computationally demanding processes? Any other observations?
SUBMISSION: Your entry on this shared MD file. Make sure to properly resolve conflicts (if any)!
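The "first N lines" step above is done with head and > in the To-do. For anyone curious how the same idea looks in Python, here is a minimal sketch. The helper write_first_lines and the stand-in file fake_review.json are hypothetical, and the demo data is made up so the block runs without the Yelp download.

```python
import itertools
import os
import tempfile


def write_first_lines(src_path, dst_path, n):
    """Copy the first n lines of src_path to dst_path; return lines written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n lines, so the rest of the file is never read.
        for line in itertools.islice(src, n):
            dst.write(line)
            written += 1
    return written


# Build a made-up 25-line stand-in for review.json (one JSON object per line).
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "fake_review.json")
with open(src, "w", encoding="utf-8") as f:
    for i in range(25):
        f.write('{"stars": 5, "text": "review number %d"}\n' % i)

# Re-create FOO.json at increasing sizes, as in the To-do.
dst = os.path.join(tmpdir, "FOO.json")
counts = [write_first_lines(src, dst, n) for n in (10, 20, 40)]
print(counts)
```

Asking for more lines than the file contains simply copies the whole file. In the actual exercise, each version of FOO.json would then be passed to process_reviews.py.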
Trying out CRC, with bigger data + better code!
review_4mil.json (newly created), process_reviews.py (same Python script), todo13.sh (new Slurm script), todo13.out (newly generated output file).
seff job-id command.
process_reviews_eff.py with the following. The code produces the same results, but is structured differently.
import pandas as pd
import sys
from collections import Counter
filename = sys.argv[1]
df_chunks = pd.read_json(filename, chunksize=10000, lines=True, encoding='utf-8')
wfreq = Counter()
# Update the running word counts one chunk (10,000 reviews) at a time
for chunk in df_chunks:
    for text in chunk['text']:
        wfreq.update(text.split())
print(wfreq.most_common(20))
todo13.sh to run this new script.
seff job-id command. Night and day! What about this new Python code led to this much improvement in efficiency? Give it some thought; we’ll discuss in class.
SUBMISSION: Your files on CRC are your submission. I have read access to them.
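As food for thought before the class discussion, the chunked pattern in process_reviews_eff.py can also be sketched with the standard library alone, reading one line at a time instead of 10,000. This is a variant for comparison, not the script used on CRC. It assumes each line of the file is a JSON object with a text field, as the pandas calls above imply. The function name and the two tiny reviews are made up.

```python
import json
import os
import tempfile
from collections import Counter


def count_words_streaming(path):
    """Count whitespace-separated tokens in each review's 'text' field."""
    wfreq = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:              # only one line is held in memory at a time
            if not line.strip():
                continue            # skip blank lines
            review = json.loads(line)
            wfreq.update(review["text"].split())
    return wfreq


# Made-up two-review file standing in for the Yelp data.
path = os.path.join(tempfile.mkdtemp(), "tiny_reviews.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"stars": 5, "text": "great food great service"}) + "\n")
    f.write(json.dumps({"stars": 2, "text": "slow service"}) + "\n")

wfreq = count_words_streaming(path)
print(wfreq.most_common(2))
```

Compare what each version of the script has to keep in memory at any given moment.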
Visit your classmates, round 2.
SUBMISSION: Since Class-Lounge is a fully collaborative repo, there is no formal submission process.
Visit your classmates, round 3. You know what to do!
Visit your classmates, last round! You have 3 classmates you haven’t visited yet. You can visit 2, or if you are inclined, visit all 3.