Course home for LING 1340/2340
Due 1/10 (Th), 3:30pm
The Internet is full of published linguistic data sets. Let’s data-surf! Instructions: in a text file (`.txt` extension), make note of:
You may submit a `.md` file instead of a text file.
SUBMISSION: Upload your text file to the To-do1 submission link on CourseWeb.
Due 1/17 (Th), 3:30pm
Learn about the `numpy` library: study the Python Data Science Handbook and/or the DataCamp tutorial. While doing so, create your own study notes as a Jupyter Notebook file entitled `numpy_notes_yourname.ipynb`. Include examples, explanations, etc. Replicating DataCamp’s examples is also something you could do. You are essentially creating your own reference material.
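To give a flavor of the kind of cells such reference notes might contain, here is a minimal numpy sketch (array creation, vectorized arithmetic, broadcasting, and boolean masking):

```python
import numpy as np

# Arrays from Python lists and from constructors
a = np.array([1, 2, 3])
zeros = np.zeros((2, 3))

# Vectorized arithmetic: element-wise, no explicit loop
doubled = a * 2              # array([2, 4, 6])

# Broadcasting: the (3,) array stretches across each row of the (2, 3) array
grid = zeros + a             # both rows become [1., 2., 3.]

# Boolean masking selects elements that satisfy a condition
big = a[a > 1]               # array([2, 3])

print(doubled, grid.shape, big)
```

Your own notes will of course go further; the point is to explain each idiom in your own words next to a working example.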
SUBMISSION: Your file should be in the `todo2/` directory of `Class-Exercise-Repo`. Make sure it’s configured for the “upstream” remote and your fork is up to date. Push to your GitHub fork, and create a pull request for me.
Due 1/22 (Tue)
Study the `pandas` library (through the Python Data Science Handbook and/or the DataCamp tutorials). `pandas` is a big topic with lots to learn: aim for about 1/2. While doing so, try it out on TWO spreadsheet (.csv, .tsv, etc.) files:
- `billboard_lyrics_1964-2015.csv` by Kaylin Pavlik, from her project ‘50 Years of Pop Music’. (Note: you might need to specify ISO-8859 encoding.)

Name your Jupyter Notebook file `pandas_notes_yourname.ipynb`. Don’t change the filename of any downloaded CSV files or edit them in any way.
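To sketch the encoding issue, here is a made-up two-row CSV in place of the real file (the column names are hypothetical); the real file is read the same way, with `encoding=` passed explicitly:

```python
import io

import pandas as pd

# ISO-8859-1 (Latin-1) bytes: reading these with the default UTF-8 codec
# raises UnicodeDecodeError, so the encoding must be spelled out.
raw = 'Song,Artist\nCafé,Beyoncé\n'.encode('ISO-8859-1')
df = pd.read_csv(io.BytesIO(raw), encoding='ISO-8859-1')
print(df)

# The real thing, same idea:
# df = pd.read_csv('billboard_lyrics_1964-2015.csv', encoding='ISO-8859-1')
```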
SUBMISSION: Your files should be in the `todo3/` directory of `Class-Exercise-Repo`. Commit and push all three files to your GitHub fork, and create a pull request for me.
Due 1/24 (Thu)
This one is a continuation of To-do #3: work further on your `pandas` study notes. You may create a new JNB file, or you can expand the existing one. Also: try out a spreadsheet submitted by a classmate. You are welcome to view the classmate’s notebook to see what they did with it. (How to find out who submitted what? Git/GitHub history, of course.) Give them a shout-out.
SUBMISSION: We’ll stick to the `todo3/` directory in `Class-Exercise-Repo`. Push to your GitHub fork, and create a pull request for me.
Due 2/5 (Tue), 3:30pm
For this To-do, refer back to the edited version of `english.csv` from class Activity 3. Add a markdown cell to your Jupyter Notebook file for Activity 3, clearly labeling the beginning of To-do #5.
This time we’ll look at the response times for the naming task (`RTnaming`). The equipment that Balota et al. used to gather this naming data was voice-activated. As such, the acoustic properties of a word’s initial segment may have affected the time it took to register a response. Let’s figure out whether it did.
`Voice` specifies whether a word’s initial phoneme was voiced or voiceless. Make a boxplot of the distribution of reaction times across voiced and voiceless phonemes, grouped by subject age.
SUBMISSION: Submit a pull request including your updated JNB file.
Due 2/12 (Tue)
The Gries & Newman article cites many famous corpora and corpus resources. Let’s round them all up in a single spot, complete with web links. We will collaborate on a shared document called `corpora_tools_list.md`. The `Class-Plaza` repo belongs to all of us: we are all listed as collaborators. Your job is to fill out the three tables: add at least one entry to each table. Make sure you are not duplicating someone else’s entry. Because everyone is editing the same document, you may run into a conflict while trying to push. Make sure you have read and understood this tutorial on Git conflicts and resolve accordingly.
SUBMISSION: There is no formal submission process, because this one does not involve issuing a pull request or anything like that. I will check the repo later to see that you have indeed made your contribution.
Due 2/14 (Thu)
Let’s pool our questions together for Dr. Lauren Collister, who will be our guest speaker on Thursday.
Review the topics of linguistic data, open access, and data publishing, focusing in particular on these three resources: Data Management Plans for Linguistic Research, Kitzes (2018), and the Copyright and Intellectual Property Toolkit.
Think of a question for Lauren, and add yours along with your name to the `questions_collister.md` file in our `Class-Plaza` repo.
SUBMISSION: Push your commit directly to the `Class-Plaza` repo. Make sure you don’t trample on someone else’s contribution. If there is a conflict, it is your job to resolve it.
Due 2/19 (Tue)
Let’s try Twitter mining! On a tiny scale, that is. The blog post Data Analysis using Twitter presents an easy-to-follow, step-by-step tutorial, so follow along with it.
First, you will need to install the `tweepy` library. If `which pip` does not show your Anaconda version of pip, it means you cannot simply run `pip install tweepy`. You will instead have to specify the complete path for Anaconda’s pip. So, find your Anaconda installation path, and install Tweepy like so: `/c/ProgramData/Anaconda3/Scripts/pip install tweepy`. Your path may instead look something like `/c/Users/your-user-name/Anaconda3/...`. You should use TAB completion while typing out the path. You can also run `which -a python`; the `-a` flag shows all python executables found in your path.
Notes on using `tweepy`:
SUBMISSION: We are switching back to `Class-Exercise-Repo`; use the `todo8/` folder. Your Jupyter Notebook file should have your name in the file name. Push to your fork and create a pull request. Make sure you have redacted your personal API keys!
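One simple way to keep the keys redacted is to load them from a separate file that you list in `.gitignore`; a sketch, where the file name `keys.json` is just an example (the file is written out here only so the snippet is self-contained; in practice you create it by hand and never commit it):

```python
import json

# Demo setup only: in real use, create keys.json by hand, keep it out of
# your notebook, and add it to .gitignore so it never reaches GitHub.
creds = {'consumer_key': 'XXXX', 'consumer_secret': 'XXXX',
         'access_token': 'XXXX', 'access_token_secret': 'XXXX'}
with open('keys.json', 'w') as f:
    json.dump(creds, f)

# In the notebook, load the keys instead of pasting them into a cell:
with open('keys.json') as f:
    keys = json.load(f)

# Then hand them to tweepy, e.g.:
# auth = tweepy.OAuthHandler(keys['consumer_key'], keys['consumer_secret'])
# auth.set_access_token(keys['access_token'], keys['access_token_secret'])
print(sorted(keys))
```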
Due 2/26 (Tue)
Let’s try sentiment analysis on movie reviews. Follow this tutorial in your own Jupyter Notebook file. Feel free to explore and make changes as you see fit. If you haven’t already, review the Python Data Science Handbook chapters to give yourself a good grounding. Also: watch the DataCamp tutorials Supervised Learning with scikit-learn and NLP Fundamentals in Python.
Students who took LING 1330 (=everyone): compare sklearn’s Naive Bayes with NLTK’s treatment and include a blurb on your impression. (You don’t have to run NLTK’s code, unless you want to!)
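To give a feel for sklearn’s side of that comparison, here is a minimal Naive Bayes sketch on four toy reviews (not the tutorial’s actual movie-review pipeline): documents become count vectors, and the classifier is fit on those, rather than on hand-built feature dictionaries as in NLTK.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the movie review data
train_docs = ['a wonderful moving film', 'brilliant acting and plot',
              'dull and predictable', 'a terrible boring mess']
train_y = ['pos', 'pos', 'neg', 'neg']

# Turn documents into bag-of-words count vectors
vec = CountVectorizer()
X = vec.fit_transform(train_docs)

# Multinomial Naive Bayes over the counts
clf = MultinomialNB()
clf.fit(X, train_y)

pred = clf.predict(vec.transform(['wonderful brilliant film']))
print(pred[0])   # 'pos' on this toy data
```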
SUBMISSION: Your Jupyter Notebook file should be in the `todo9/` folder of `Class-Exercise-Repo`. As usual, push to your fork and create a pull request.
Due 2/28 (Thu)
What have the previous students of LING 1340/2340 accomplished? What do finished projects look like? Let’s have you explore their past projects. Details:
Record your critiques in `todo10_past_project_critiques.md` in `Class-Plaza`.
SUBMISSION: Push your commit directly to the `Class-Plaza` repo. Make sure you don’t trample on someone else’s contribution. If there is a conflict, it is your job to resolve it.
Due 3/7 (Thu)
What has everyone been up to? Let’s take a look – it’s a “visit your classmates” day!
The file to edit is in the `Class-Plaza` repo, but you should edit it so that:
SUBMISSION: Since `Class-Plaza` is a fully collaborative repo, there is no formal submission process.
Due 3/21 (Thu)
Let’s poke at big data. Well, big-ish. The Yelp Dataset Challenge is now in its 13th round: Yelp has made its huge review dataset available to academic groups that participate in a data mining competition. Challenge accepted! Before we begin: create `yelp_tryout_yourname.md` in the `todo12/` directory of `Class-Exercise-Repo`. Let’s download this beast and poke around.
- Download the data into your `Documents/Data_Science` directory. You might want to create a new folder there for the data files.
- The data comes in `.tar` format. Look it up if you are not familiar. Untar it using `tar -xvf`. It will extract 6 json files along with some PDF documents.
- Using command-line tools (`ls -laFh`, `head`, `tail`, `wc -l`, etc.), find out: how big are the json files? What do the contents look like? How many reviews are there?
- Try `grep` and `wc -l`. Take a look at the first few reviews through `head | less`. Do they seem to have high or low stars?

How much processing can our own puny personal computer handle? Let’s find out.
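The shell counts above can be cross-checked in Python by streaming the file line by line, which never loads the whole file into memory; this sketch uses a three-line stand-in file so it runs anywhere:

```python
import json

# Build a tiny stand-in for review.json: one JSON object per line
sample_reviews = [
    {'stars': 5, 'text': 'Great pierogies'},
    {'stars': 1, 'text': 'Never again'},
    {'stars': 5, 'text': 'Loved it'},
]
with open('FOO_sample.json', 'w') as f:
    for r in sample_reviews:
        f.write(json.dumps(r) + '\n')

# Stream line by line, tallying as we go (the analogue of wc -l / grep)
n_reviews = 0
n_fivestar = 0
with open('FOO_sample.json') as f:
    for line in f:
        review = json.loads(line)
        n_reviews += 1
        if review['stars'] == 5:
            n_fivestar += 1
print(n_reviews, n_fivestar)   # 3 2
```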
Create a file `process_reviews.py`; content below. You can use nano, or you could use your favorite editor (Atom, Notepad++) provided that you launch the application through the command line.

```python
import sys
from collections import Counter

import pandas as pd

# The json file to process comes in as a command-line argument
filename = sys.argv[1]

# Yelp's review file has one JSON object per line, hence lines=True
df = pd.read_json(filename, lines=True, encoding='utf-8')
print(df.head(5))

# Join all review texts into one string, split into word tokens, count
wtoks = ' '.join(df['text']).split()
wfreq = Counter(wtoks)
print(wfreq.most_common(20))
```
- Don’t run it on the entire `review.json` file! Start small by creating a tiny version consisting of the first 10 lines, named `FOO.json`, using `head` and `>`.
- Run `process_reviews.py` on `FOO.json`. Note that the json file should be supplied as a command-line argument to the Python script. Confirm it runs successfully.
- Re-create `FOO.json` with an incrementally larger total # of lines and re-run the Python script. The point is to find out how much data your system can reasonably handle. Could that be 1,000 lines? 100,000?
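The `head`/`>` step and the scaling loop can also be sketched in Python; here a small fake file (`review_demo.json`, 25 throwaway lines) stands in for the real multi-gigabyte `review.json`:

```python
import json

# Fake stand-in for review.json (never generate or load the real one
# whole like this): 25 one-line JSON records
with open('review_demo.json', 'w') as f:
    for i in range(25):
        f.write(json.dumps({'stars': 3, 'text': 'review %d' % i}) + '\n')

# Python equivalent of:  head -10 review_demo.json > FOO.json
N = 10
with open('review_demo.json') as src, open('FOO.json', 'w') as out:
    for i, line in enumerate(src):
        if i >= N:
            break
        out.write(line)

# Bump N to 100, 1000, ... and time `python process_reviews.py FOO.json`
# at each size to see where your machine starts to struggle.
print(sum(1 for _ in open('FOO.json')))   # 10
```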
. A paragraph will do. How was your laptop’s handling of this data set? What sorts of resources would it take to successfully process it in its entirety and through more computationally demanding processes? Any other observations?SUBMISSION: Your markdown file should be in the todo12
directory in Class-Exercise-Repo
. As usual, push to your fork and create a pull request.
Due 3/26 (Tue)
It’s “visit your classmates” day, round 2!
SUBMISSION: Since `Class-Plaza` is a fully collaborative repo, there is no formal submission process.
Due 4/2 (Tue)
Visit your classmates, round 3.
Due 4/4 (Thu), 3:30pm
This To-do is a (re)introduction to Praat, everyone’s favorite idiosyncratic phonetics data analysis tool, as well as to the TIMIT Corpus, which is located in the Licensed Data Sets repo.
Download and install the newest version of Praat. (Praat changes often and drastically, so this really is important to do.) Also note that the full Praat “manual” is on this site, but it is not well organized.
Read the documentation for the TIMIT corpus (Licensed-Data-Sets/TIMIT Acoustic-Phonetic Continuous Speech Corpus/timit/TIMIT/README.DOC).
From within Praat, open the files associated with SA1 for speaker FCJF0 (/TIMIT/TRAIN/DR1/FCJF0/SA1.*). Note that one file cannot be opened by Praat, and another will kick up a warning.
From the Praat Objects window, select the two TextGrid objects and Merge them. Then select the resulting TextGrid “merged” and the Sound “SA1” and View & Edit them.
SUBMISSION: Upload a markdown file with your observations, and with any questions you have, to the `to-do15/` directory in `Class-Exercise-Repo`.
Due 4/11 (Thu)
4th and final round of “visit your classmates”. Also the last To-do!