Homework 1: Explore Linguistic Data

This homework carries a total of 50 points.

For this homework, let’s have you explore a linguistic data set using the Python skills you already have.

Data set

You can process any publicly available linguistic data set of your choice.
You already took a look at two data sets for To-do 1: you may pick one of them, or you may choose something else entirely.
Many of you are familiar with NLTK’s corpora. Do NOT use NLTK’s pre-loaded corpora such as nltk.corpus.brown and nltk.corpus.inaugural.
You may, however, download a zipped archive through this page and process the files as you would any other corpus. You might find the “1.9 Loading your own Corpus” section useful.

GitHub dry run with To-do 1

Let’s have you practice forking before getting on with the homework. Joey and I demonstrated this in class: details here.
If you haven’t already, fork Class-Exercise-Repo from our class GitHub org. Then, clone your fork onto your laptop.
Inside, you will find a todo1/ directory.
Find the text file you previously submitted for To-do 1, and copy it into that directory. Make sure it’s named something like todo1_yourname.txt, so it won’t conflict with another student’s file.
Do the usual local git routine, which ends with committing. Then push to your own GitHub fork.
Then, create a pull request for me. I’ll respond as soon as I can!

Repo, files and folders

Start by forking this HW1-Repo GitHub repository.
Once you have your own fork in your GitHub account, clone it onto your laptop. DO NOT DIRECTLY CLONE MY ORIGINAL REPO WITHOUT FORKING FIRST.
In the directory, you will find a folder with your name, say narae/. Your code, as a Jupyter notebook file (xxx.ipynb), should go into that directory.
There is a file named your_file_here.txt. I put it there simply because git does not track empty directories. If you know about git rm, use the command to delete it; otherwise, leave it there.
You can name the notebook file whatever you want, but remember – no spaces; use underscore _ instead.
In your personal directory, create a new directory called data, all in lowercase. All your data files should go into this directory. No need to change data file names to remove spaces; in fact, do NOT modify the downloaded files at all.
In your Python code, use relative paths when referencing data files. That is, use something like open("data/corpusfile1.txt") instead of the full path.
The repo is already configured, via the .gitignore file in the root, to ignore the data/ directory and its contents, which means your data files are excluded from your submission. Jupyter’s auto-save files are also ignored.
While working on your Jupyter Notebook, you should be frequently committing and pushing to your fork.
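
As a rough sketch of the relative-path convention described above, something like the following could work in a notebook saved in your personal directory. The helper names list_data_files and read_text, the *.txt pattern, and the UTF-8 encoding are illustrative assumptions, not requirements; adjust them to match your data set.

```python
from pathlib import Path

# Relative to the notebook's working directory (your personal folder),
# so the same code runs on any machine after cloning the repo.
DATA_DIR = Path("data")


def list_data_files(data_dir=DATA_DIR, pattern="*.txt"):
    """Return the files in data_dir matching pattern, sorted by name.

    Returns an empty list if the directory does not exist yet.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return []
    return sorted(data_dir.glob(pattern))


def read_text(path, encoding="utf-8"):
    """Read one file into a string. The encoding is an assumption."""
    with open(path, encoding=encoding) as f:
        return f.read()


for path in list_data_files():
    # File names with spaces need no renaming; pathlib handles them as-is.
    print(path.name, len(read_text(path)), "characters")
```

Because the path is relative, the code does not depend on where the repo was cloned, and the downloaded files stay unmodified.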

Your code

Your Python code should be written as a Jupyter Notebook. If you are not familiar with it, watch the Lynda.com tutorial.
At the top of your Jupyter notebook should be a markdown cell with the following information:
- Your name, email and date
- Info on your data set. The name, author(s), download URL, etc. Basically what you reported back in To-do 1.
The second cell, another markdown cell, should contain a self-assessment:
- A summary of what your code does and how you addressed your “discovery” question.
- A future wish: something that you would have liked to be able to do with this data set but do not know how at the moment.
Your code should achieve the following:
- Open the data files and read in the data.
- Print out some representative snippets of the data. Don’t flash the entire thing – just enough to get a sense of the content.
- Print out some basic stats, such as the total number of data points. For corpora, this could be the number of text files, sentences, word tokens, etc.
- Additionally, you should make one discovery. Pick a question you think you can address with reasonable effort, and explore the data for an answer.
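
To make the checklist above a bit more concrete, here is a minimal sketch using only the standard library. The toy texts dictionary stands in for data you would read from your own files, and the regex tokenizer, the function names, and the "most frequent words" question are all assumptions chosen for illustration; your data set and discovery question will likely call for something different.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a crude regex tokenizer)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def basic_stats(texts):
    """Compute simple corpus-level counts from a {filename: text} dict."""
    tokens = [tok for text in texts.values() for tok in tokenize(text)]
    return {
        "files": len(texts),
        "tokens": len(tokens),
        "types": len(set(tokens)),
    }


def snippet(text, n_chars=200):
    """Return the first n_chars characters, with whitespace collapsed."""
    return " ".join(text.split())[:n_chars]


# Toy stand-in for text read from data files; replace with your own corpus.
texts = {
    "a.txt": "The cat sat on the mat. The cat slept.",
    "b.txt": "A dog barked at the cat.",
}

# Representative snippets: a short preview of each file, not the whole thing.
for name, text in texts.items():
    print(name, "->", snippet(text, 30))

# Basic stats: number of files, word tokens, and distinct word types.
print(basic_stats(texts))

# One possible "discovery" question: which words are most frequent?
counts = Counter(tok for text in texts.values() for tok in tokenize(text))
print(counts.most_common(2))
```

In a notebook, each of these steps would typically sit in its own cell, with a markdown cell before it explaining what the step shows.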
See if you can find a way to use the upcoming NumPy library in your data processing. Aggregation functions such as numpy.sum() and numpy.mean() are easy choices. This part is optional.
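
For the optional NumPy part, a small sketch along these lines might be enough. The sentence_lengths values are made-up numbers for illustration; in your notebook they would be computed from your own data.

```python
import numpy as np

# Hypothetical sentence lengths in word tokens (made-up values);
# in practice, compute these from your own corpus.
sentence_lengths = np.array([6, 3, 6, 12, 9])

total_tokens = np.sum(sentence_lengths)   # total number of tokens
mean_length = np.mean(sentence_lengths)   # average sentence length
longest = np.max(sentence_lengths)        # length of the longest sentence

print("total tokens:", total_tokens)
print("mean sentence length:", mean_length)
print("longest sentence:", longest)
```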
Don’t forget to make use of markdown cells for organization, explanation and notes. Use comments as you see fit.

Submission

When you think you have the final version of your work, “Restart & Run All” your Jupyter notebook one last time, so the output is all orderly and tidy. Save the notebook.
After your local Git routine, push the HW1 repo to your own GitHub fork one last time. Check your GitHub fork to make sure the notebook file is there and everything looks OK. Your data files shouldn’t be there.
Finally, create a pull request for me.

This is a form of rolling submission. I will not wait until the homework deadline before I start processing pull requests and merging in your contributions. That means you will be able to view other students’ submissions before the deadline. You should feel free to do so.