Data Science for Linguists 2019

Course home for
LING 1340/2340


Homework 2: Process the ETS Corpus

This homework carries a total of 80 points.

For this homework, we will explore and process the ETS Corpus of Non-Native Written English, using Python's numpy and pandas libraries, which we recently learned.

Data set

Repo, files and folders

Beware: if you git-delete the dummy file through git rm BEFORE creating your notebook file in there, git will see the directory as empty and hide it from your view. (That's the reason I included the dummy file in the first place.) No need to panic. You can undo it by typing in git checkout HEAD -- the_dummy_file.txt (substituting the dummy file's actual name). This essentially throws away that last change you made in the repo. After that, run git status to confirm you now have a clean slate.

Goals

The first goal of this work is Basic Data Processing, which involves processing of CSV files and text files. You should build two DataFrame objects: essay_df and prompt_df.

Building essay_df (see screenshot):

  1. Start with index.csv and build a DataFrame named essay_df. Specifics:
    • Rename the 'Language' column to 'L1', which is more specific terminology.
    • Likewise, change 'Score Level' to 'Proficiency'. Single-word column names are handier. (Why?)
  2. Augment the essay_df DataFrame with an additional column named 'Partition' with three appropriate values: ‘TS’ for testing, ‘TR’ for training, and ‘DV’ for development.
    • You can get this information from the three additional CSV files, which split the data into three partitions: training, testing, and development sets.
  3. Create a new column called 'Text', which stores each essay's content as a string. You will need to find a way to read in the essay texts, which are stored as individual files. (See the sketch after this list.)
  4. As you work on EDA and linguistic analysis, you may create additional columns to hold new data points. I’ll leave these up to you. Initially, the DataFrame should look like the screenshot above.
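
To make the above concrete, here is a minimal sketch of one way to assemble essay_df. The partition CSV names, the essay-file folder, and the 'Filename' column are assumptions about the corpus layout; adjust them to match the actual files you find in the data:

```python
import pandas as pd

# Build the base DataFrame from the master index and rename the two columns.
essay_df = pd.read_csv('index.csv')
essay_df = essay_df.rename(columns={'Language': 'L1', 'Score Level': 'Proficiency'})

# Map each essay to its partition. The CSV file names below are placeholders
# for the three partition files that ship with the corpus.
for fname, label in {'index-training.csv': 'TR',
                     'index-testing.csv': 'TS',
                     'index-development.csv': 'DV'}.items():
    part = pd.read_csv(fname)
    essay_df.loc[essay_df['Filename'].isin(part['Filename']), 'Partition'] = label

# Read each essay's content into a 'Text' column. The folder path is a guess;
# check where the individual essay files actually live.
def read_essay(filename):
    with open('responses/original/' + filename, encoding='utf-8') as f:
        return f.read()

essay_df['Text'] = essay_df['Filename'].apply(read_essay)
essay_df.head()
```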

Building prompt_df (see screenshot):
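
The idea here is a small DataFrame pairing each prompt ID with its full text. A hedged sketch, assuming the prompts are stored as individual text files in a 'prompts/' folder (adjust the path and naming to the actual corpus layout):

```python
import glob
import os
import pandas as pd

# Collect each prompt file into a row: prompt ID from the file name, plus text.
rows = []
for path in sorted(glob.glob('prompts/*.txt')):
    prompt_id = os.path.splitext(os.path.basename(path))[0]
    with open(path, encoding='utf-8') as f:
        rows.append({'Prompt': prompt_id, 'Text': f.read()})

prompt_df = pd.DataFrame(rows)
prompt_df
```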


The second goal is Exploratory Data Analysis (EDA).

  1. Read up on the documentation to gain an understanding of the data set. There is a README file, a PDF document, and the LDC publication page. What is the purpose of this data, what sort of information is included, and how is it organized?
  2. Then, explore the data to confirm its content. For example, the PDF document contains tables illustrating the make-up of the data and various data points. Don’t take their word for it! You should find a way to confirm and demonstrate these data points through your code. (See the sketch after this list.)
  3. Visualization: Try out at least one plot/graph.
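
As one example, a couple of pandas calls will verify the distribution tables in the PDF, and DataFrame's built-in plotting gives a quick graph. A minimal sketch, assuming the essay_df built earlier:

```python
import matplotlib.pyplot as plt

# Essay counts per L1 -- these should match the figures reported in the PDF.
print(essay_df['L1'].value_counts())

# Cross-tabulate L1 against proficiency level.
counts = pd.crosstab(essay_df['L1'], essay_df['Proficiency'])
print(counts)

# One possible visualization: essays per L1, broken down by proficiency.
counts.plot(kind='bar', stacked=True)
plt.ylabel('Number of essays')
plt.tight_layout()
plt.show()
```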

The third and last goal is Linguistic Analysis. In particular, we want to be able to highlight quantitative differences between three learner groups: low, medium and high levels of English proficiency. Explore the following:

  1. Text length: Do the groups as a whole write longer or shorter responses?
    → Can be measured through average response length in number of words
  2. Syntactic complexity: Are the sentences simple and short, or are they complex and long?
    → Can be measured through average sentence length
  3. Lexical diversity: Are the same words repeated throughout, or do the essays feature more diverse vocabulary?
    → Can be measured through type-token ratio (with a caveat!)
  4. Vocabulary level: Do the essays use more of the common, everyday words, or do they use more sophisticated and technical words?
    → a. Can be measured through average word length (common words tend to be shorter), and
    → b. Can be measured against published lists of the most frequent English words. (A sketch of measures 1 through 4a follows this list.)
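
Here is a minimal sketch of measures 1 through 4a, assuming the essay_df from earlier and NLTK's tokenizers (the punkt models must be downloaded once via nltk.download('punkt')):

```python
import nltk

# Tokenize each essay once, into words and into sentences.
essay_df['Tokens'] = essay_df['Text'].apply(nltk.word_tokenize)
essay_df['Sents'] = essay_df['Text'].apply(nltk.sent_tokenize)

# 1. Text length: number of word tokens.
essay_df['NumTokens'] = essay_df['Tokens'].apply(len)

# 2. Syntactic complexity: average sentence length in tokens.
essay_df['AvgSentLen'] = essay_df['NumTokens'] / essay_df['Sents'].apply(len)

# 3. Lexical diversity: type-token ratio. The caveat: TTR shrinks as texts
#    grow longer, so interpret it alongside text length (or compute it over
#    a fixed-size slice of each essay).
essay_df['TTR'] = essay_df['Tokens'].apply(lambda toks: len(set(toks)) / len(toks))

# 4a. Vocabulary level: average word length in characters.
essay_df['AvgWordLen'] = essay_df['Tokens'].apply(
    lambda toks: sum(len(t) for t in toks) / len(toks))

# Compare group means across the three proficiency levels.
essay_df.groupby('Proficiency')[['NumTokens', 'AvgSentLen', 'TTR', 'AvgWordLen']].mean()
```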

There are five measurements in total. Choose three if you are relatively new to programming; experienced programmers should work on four or all five. Importantly, you must employ statistical tests such as the t-test or one-way ANOVA to demonstrate whether or not a difference is significant, as shown in the sketch below.
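
For the significance testing, scipy.stats covers both options. A minimal sketch, using the 'NumTokens' column from the snippet above:

```python
from scipy import stats

# Split one measurement into the proficiency groups.
groups = [grp['NumTokens'] for _, grp in essay_df.groupby('Proficiency')]

# One-way ANOVA across all three groups: does at least one group mean differ?
f_stat, p_val = stats.f_oneway(*groups)
print(f'ANOVA: F = {f_stat:.3f}, p = {p_val:.4g}')

# A t-test compares a specific pair of groups, e.g., the first two.
t_stat, p_val = stats.ttest_ind(groups[0], groups[1])
print(f't-test: t = {t_stat:.3f}, p = {p_val:.4g}')
```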

Additional pointers on analysis: If you haven’t taken LING 1330 “Introduction to Computational Linguistics”, you might want to learn about NLTK’s core text processing functions from these slides. Additionally, if you are going for 4b, which is the most involved task, you might want to consult this homework or this one.

Your report, aka Jupyter Notebook file

Submission

  1. When you think you are finished, “Restart & Run All” your Jupyter notebook one last time, so the output is all orderly and tidy. Save the file.
  2. After the usual local Git routine, push to your own GitHub fork one last time. Check it to make sure your file is there and everything looks OK.
  3. Finally, create a pull request for me.

This is NOT rolling submission. The original repo is private, meaning only our class members have read access. Your fork will automatically become private, for your access only. I will wait until the assignment deadline to merge in your pull requests, which means your homework will essentially be visible to yourself only until I process your pull request.

ADDENDUM: Turns out there was a loophole. A pull request and its content, even those from a private fork, are visible to everyone who has read access to the original upstream repo. We are changing the submission method starting from HW3.