Data Science for Linguists 2022

Course home for LING 1340/2340

Term Project Guidelines

Each student will work on a project of their own choice and design over the course of the semester, culminating in a class presentation followed by a final project delivery. The goal of this project is to make a linguistic discovery through application of data-intensive methods.

Components

A project consists of three main components: data, analysis, and presentation.

A. Data

Start with found data. Many linguistics research projects begin with a targeted data collection effort – field work, surveys, elicitation, human subjects, and more. But the underlying assumption of data science is that data exists in the wild, and it is up to a data scientist to harness it. True to this assumption, we will have you start with data that is found in the wild, be it published data sets, corpora, or social media streams.

Add value. You should not, however, be content with data as it is packaged and presented to you. In many cases, your data will need a lot of work – sourcing, cleaning up, and reorganizing. In other cases, you may be dealing with published data that’s more or less ready for analysis. You are, then, expected to add value: augmenting, annotating and leveraging multiple data sets are all potential avenues.

Follow best data practices. Throughout this semester, we will be learning about best data practices, both emerging and firmly established in data science circles. Make sure your own data efforts and the output are in compliance.

B. Analysis

Linguistic analysis. You will have designed your data with a research question in mind. Your data should make a suitable empirical basis for your linguistic inquiry; your research question should be properly motivated and addressed in a theoretically and methodologically sound manner. Your interpretations of the findings should likewise be rigorously supported by your data. Even with meticulous preparation, however, your data in the end may not prove fruitful ground for your original research question. Pivoting is therefore allowed up to a certain point; whether or not this move is ultimately successful, reasons for pivoting and/or failure of the original research agenda must be thoroughly probed and documented, since this sort of outcome is part and parcel of research efforts deeply grounded in real-life data and, further, provides valuable insight.

Computational methods. In your linguistic analysis, you are expected to employ various computational methods including natural language processing, statistics, machine learning, topic modeling and more. Proper techniques should be used in accordance with your research question and the specifics of your data. At the same time, you should demonstrate mastery of these techniques by justifying your choice of computational methods and thoroughly evaluating the outcome, rather than blindly applying them and accepting the returned output. As with linguistic analysis, failed experimentation should not be brushed aside, but rather receive proper investigation and documentation, as this is all part of the discovery process.

C. Presentation

This component encompasses all audience-facing aspects of your project, which include but are not limited to:

Weight distribution. Ideally, a project will have the three components in perfect balance: a total of say 180 points will be equally split between data/analysis/presentation as 60-60-60. In reality, everyone’s project will be different: some will have ambitious and challenging data curation plans, while others might wish to focus their efforts on extensive use of advanced computational methods. To accommodate this, a limited amount of trade-off is provisioned between the “data” and the “analysis” components: more data-focused projects therefore may have up to 70-50-60 distribution with more data-side contribution, while projects heavily focused on analysis are allowed to go easier on data-related efforts, with up to 50-70-60 split.
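
The trade-off above can be checked with a few lines of arithmetic. This is only a sketch under assumptions: the function name is hypothetical, it assumes the presentation share stays fixed at 60, and it reads "up to" as allowing in-between splits such as 65-55-60.

```python
# Sketch of the data/analysis/presentation trade-off for the 180-point final submission.
# Assumptions (not stated in the guidelines): presentation is fixed at 60 points,
# and any split between 50-70-60 and 70-50-60 is acceptable.
TOTAL = 180
PRESENTATION = 60
COMPONENT_RANGE = (50, 70)  # allowed points for data and for analysis


def is_allowed_split(data, analysis, presentation=PRESENTATION):
    """Return True if the split respects the total and the trade-off limits."""
    low, high = COMPONENT_RANGE
    return (
        presentation == PRESENTATION
        and low <= data <= high
        and low <= analysis <= high
        and data + analysis + presentation == TOTAL
    )


for split in [(60, 60, 60), (70, 50, 60), (50, 70, 60), (80, 40, 60)]:
    print(split, is_allowed_split(*split))
```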

Submission

Your project should be initiated and developed in the form of a GitHub-hosted public repository. The final deliverables should include:

Milestones

The term project carries a total of 400 points, which you accrue over the course of the semester by meeting several structured milestones. Refer to the Schedule page for the dates.

| # | Milestone | Points | Distribution (Ⓓ Data; Ⓐ Analysis; ⓟ Presentation) | What |
|---|---|---|---|---|
| 1 | Project ideas | 20 | ⒹⒹⒶⒶ | Send instructor 1-2 project ideas. |
| 2 | Project plan | 20 | ⓟⓟ | Finalize project plan, create a GitHub project repository. |
| 3 | 1st progress report | 40 | ⒹⒹⒹⒹⒹⒹⓟⓟ | Focus on data curation, report progress. |
| 4 | 2nd progress report | 40 | ⒹⒹⒹⒹⒶⒶⓟⓟ | Continue with data curation, attempt analysis. |
| 5 | 3rd progress report | 40 | ⒹⒹⒶⒶⒶⒶⓟⓟ | Data-side effort should be done; ramp up analysis. |
| 6 | Project presentation | 60 | ⒶⒶⒶⒶⒶⒶⓟⓟⓟⓟⓟⓟ | Oral presentation of your work in classroom. |
| 7 | Final project submission | 180 | ⒹⒹⒹⒹⒹⒹⒶⒶⒶⒶⒶⒶⓟⓟⓟⓟⓟⓟ ⒹⒹⒹⒹⒹⒹⒶⒶⒶⒶⒶⒶⓟⓟⓟⓟⓟⓟ | Turn in final project in the form of a GitHub repository. |

More detail will follow as each milestone approaches.
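
As a quick sanity check on how the points accumulate, the milestone values can be tallied as below. This is a minimal sketch: the point values are copied from the milestone listing, while the variable names and the printed layout are just illustrative.

```python
# Tally the milestone points and show the running share of the 400-point total.
MILESTONES = [
    ("Project ideas", 20),
    ("Project plan", 20),
    ("1st progress report", 40),
    ("2nd progress report", 40),
    ("3rd progress report", 40),
    ("Project presentation", 60),
    ("Final project submission", 180),
]

total = sum(points for _, points in MILESTONES)

running = 0
for name, points in MILESTONES:
    running += points
    # Columns: milestone, its points, cumulative points, cumulative share of total
    print(f"{name:<26}{points:>5}{running:>6}{running / total:>8.1%}")

print("Total:", total)
```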

Project Ideas

You should come up with one or two project ideas. Include these details:

Submission: In the project_ideas/ directory of Class-Exercise-Repo, create a markdown-formatted text file named project_ideas_YOURNAME.md. Commit, push to your fork, and create a pull request for the instructors.

Project Plan

Launch your project as a GitHub repository and publish a project plan.


Submission: Your project repo counts as your submission.

1st Progress Report

For the 1st progress report, focus on your data. This milestone consists of 30 data points and 10 presentation points. Goals:

Contents:

  1. progress_report.md
    • Create a section entitled “1st Progress Report”, and then provide a summary of what you accomplished. Keep it short (a screen-full), and provide links to related documents, including your Jupyter Notebook and data samples.
    • Include a subsection where you outline a couple of options (or a single option, if you are fairly sure) regarding the “sharing plan” for your data. You should plan out how much and what you will be sharing. Make sure to include a justification.
  2. A Python script in the form of a Jupyter Notebook.
    • Provide an overview of your data. Clearly document each step of your data processing pipeline.
    • Compile some basic stats on your data: the size and the make up are the bare minimum.
    • Bullet points have their uses, but let’s see some written summaries and explanations too.
    • Remember: your Jupyter Notebook file is also your presentation. Make it easy for the instructors and your classmates to understand what you are doing. Explain your goals, show your data and your processes.
  3. Some form of your data. If all of your data is currently stored in a git-ignored directory, make appropriately sized samples available in a directory called data_samples/.
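
For the "basic stats" item above, a notebook cell along these lines is one possible starting point. It is a minimal sketch, not a required method: the regex tokenizer, the function names, and the toy documents are all assumptions for illustration, and your own data will likely call for a proper tokenizer and its own measures of make-up.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokenizer (an assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def basic_stats(docs):
    """Compute size and make-up figures for a {name: text} mapping of documents."""
    counts = Counter()
    tokens_per_document = {}
    for name, text in docs.items():
        tokens = tokenize(text)
        tokens_per_document[name] = len(tokens)
        counts.update(tokens)
    n_tokens = sum(counts.values())
    return {
        "documents": len(docs),
        "tokens": n_tokens,
        "types": len(counts),
        "type_token_ratio": len(counts) / n_tokens if n_tokens else 0.0,
        "tokens_per_document": tokens_per_document,
        "most_common": counts.most_common(5),
    }


# Toy stand-in for real data; replace with texts loaded from your own files.
toy_docs = {
    "a.txt": "The cat sat on the mat.",
    "b.txt": "The dog sat too.",
}
stats = basic_stats(toy_docs)
for key, value in stats.items():
    print(f"{key}: {value}")
```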

Above are the minimum requirements, but do feel free to impose additional organization as you see fit. This is your project after all! But when you do so, make sure you provide an explanation.

Some of you may have discovered that your project is not panning out as you had hoped and you need to start over. This is your last chance to do so; you will have to launch a viable project quickly. As far as your project repository is concerned, you should keep the old one (along with its Git history) but alter it to fit your new project:
  • Change your GitHub repository's name. That changes its URL, so you will need to update your local Git's remotes setting.
  • In your project_plan.md file, round up the old content into a section and mark it clearly as your old plan which is no longer current. On the top, write out your new plan.
  • Update your progress_report.md file with an explanation of what happened and why this change of course was necessary. You'll need your 1st progress report too.
  • You should edit README.md and any other files accordingly to fit your new project direction. They shouldn't contain references to your old plan.
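
Updating the local remote after a repository rename comes down to one Git command, `git remote set-url`. The sketch below wraps it in Python so it can be run from a notebook or script; the helper names and the example URL are hypothetical placeholders, and it assumes your remote is named origin.

```python
import subprocess


def set_url_command(new_url, remote="origin"):
    """Build the git command that points an existing remote at a new URL."""
    return ["git", "remote", "set-url", remote, new_url]


def update_remote(new_url, repo_dir=".", remote="origin"):
    """Run the command inside repo_dir, then return `git remote -v` output to verify."""
    subprocess.run(set_url_command(new_url, remote), cwd=repo_dir, check=True)
    result = subprocess.run(
        ["git", "remote", "-v"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Placeholder URL (hypothetical): substitute your renamed repository's address.
print(" ".join(set_url_command("https://github.com/YOURNAME/New-Project-Name.git")))
```

Calling update_remote(...) from inside your local clone applies the change; the same command can of course be typed directly in a terminal.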

Submission: Your project repo counts as your submission.

2nd Progress Report

For the 2nd progress report, ease up your focus on data and start working on analysis. This milestone consists of 20 data points, 10 analysis points and 10 presentation points. Goals:

As for the progress report itself, these should be the content:

  1. Your progress report: progress_report.md
    • Create a section entitled “2nd Progress Report”, and then provide a summary of what you accomplished. Again keep it short (a screen-full), and provide links to related documents, including your Jupyter Notebook and other folders/documents.
    • Include two subsections:
      • Sharing scheme for the “found” portion of your data. You had already made some tentative plans as part of the previous progress report; you are finalizing the scheme here.
      • Your decision on licensing for your project and reasons/justification. See “Your license” item below.
  2. Your code in the form of a Jupyter Notebook. You have three options:
    • EXISTING: the existing script file which was part of your 1st progress report. You continue to update and add to it.
    • NEW REPLACEMENT: a whole new script file that replaces the earlier one. The script you submitted earlier as part of the 1st progress report is now regarded as initial exploration and is no longer part of your work pipeline.
    • NEW CONTINUING: a new script file that’s part of a pipeline. The earlier script you submitted for the 1st progress report accomplishes PART 1 of your work pipeline, and this new file is PART 2 that picks up where PART 1 left off.
    • On top of your script, specify which type it is so we will have a sense of how the script fits in your project. Make a note of this in your progress report section as well.
  3. Your data: include it in a designated folder. Suggested name: data/. Be careful not to commit anything that you cannot publish.
    • If including the found portion of your data in its entirety, make sure it’s within your right to do so. Present a justification in your progress report.
    • If you are including samples, make sure it’s within fair use. Document your sampling method and justification in your progress report.
    • Are you including derived data? Again provide justification.
    • Are you including some new data you created yourself, like annotation? Again, document it.
  4. Your license: LICENSE.md.
    • This is a binding licensing document, intended as audience-facing. This is where you lay out your licensing terms for your future visitors wanting to use your data and code.
    • Do not confuse this with the license of the dataset you downloaded: this document is about YOU specifying a license for YOUR PROJECT REPOSITORY.
    • You may adopt popular, existing licensing standards: revisit Lauren Collister’s materials, and consult this quick guide and also this one from GitHub Help.
    • Include reasons/justifications in the appropriate subsection in your progress report.
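
On the data samples mentioned under item 3: a sampling method is easier to document when it is reproducible. The following is a minimal sketch of one option, seeded random sampling of file names; the function name, the seed value, and the file names are hypothetical, and whether a given sample size falls within fair use is something you still need to judge and justify yourself.

```python
import random


def sample_documents(names, k, seed=2022):
    """Pick k names at random, reproducibly, from a collection of file names."""
    rng = random.Random(seed)      # fixed seed: the same sample on every run
    pool = sorted(names)           # sort first so input order does not matter
    k = min(k, len(pool))          # never ask for more items than exist
    return sorted(rng.sample(pool, k))


# Hypothetical file names standing in for a full, git-ignored data directory.
all_files = [f"doc_{i:03d}.txt" for i in range(100)]
sample = sample_documents(all_files, k=10)
print(sample)
```

Recording the seed and the sample size in your progress report is then enough for a reader to reconstruct exactly which files were shared.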

Submission: Your project repo counts as your submission.

NOTE: After ‘submission’, don’t hold yourself back from pushing more updates and changes thinking you should freeze the repo until grading is done. There’s no need: the instructors have access to your repo at every stage it moves through.

3rd Progress Report

For the 3rd and last progress report, you should focus on analysis. This milestone consists of 10 data points, 20 analysis points and 10 presentation points. Goals:

As for the progress report itself, these should be the content:

  1. Your progress report: progress_report.md
    • Create a section entitled “3rd Progress Report”, and then provide a summary of what you accomplished. Again keep it short (a screen-full), and provide links to related documents, including your Jupyter Notebook and other folders/documents.
  2. Your code in the form of a Jupyter Notebook. The same three options:
    • EXISTING: the existing script file which was part of your earlier progress report. You continue to update and add to it.
    • NEW REPLACEMENT: a whole new script file that replaces something earlier. The script you submitted earlier as part of a previous progress report is now regarded as initial exploration and is no longer part of your work pipeline.
    • NEW CONTINUING: a new script file that’s part of a pipeline. The earlier script you submitted for a previous progress report accomplishes PART 1 of your work pipeline, and this new file is PART 2 that picks up where PART 1 left off.
    • On top of your script, specify which type it is so we will have a sense of how the script fits in your project. Make a note of this in your progress report section as well.
  3. Your data:
    • Some of you have worked on your data files. Make sure to note it in your progress report.
    • Are your data files finished as of the last progress report? No new changes since? If so, make a note of it in your progress report.
  4. README.md file:
    • We’ll give this file a proper structure in due time, but for now, put a link to your guestbook, so you’ll have a handy link.

Submission: Your project repo counts as your submission.

Presentation Guidelines

Format

Content

Evaluation

Final Project Submission Guidelines

You’ve worked hard through many project milestones, and it’s time to prepare your project for final submission. Unlike the three progress reports where the focus was firmly on the process, the final submission should highlight the results and your interpretation of them. The process should still get a fair and clear illustration, but you should prune out from your production code any “branches” representing trial and error that led to a dead end. (You are encouraged to move any old code bits into a designated subfolder.) All in all, your GitHub repo should present a coherent picture of your project, from start to finish.

Your repo: files and folders

(Note: Objects that are entirely/substantially new in this submission are in orange.) Below are the required files with fixed file names:

In addition, you should have:

Lastly, some of you might have extra files and directories serving some purpose. Perhaps a folder containing some old code that is no longer relevant, or something like that. Make sure to explain what these are in your README.md document.

README.md

Revamp your README document and give it a proper structure. This document is what greets your visitors, so its goal should be to give them a short but proper orientation. It should include:

Images folder and files

Your final_report.md file will need figures and graphs for illustration.

Your code: Jupyter Notebook

The same usual guidelines for your Jupyter Notebook files continue to apply: your code should work correctly while walking the audience through the whole process. This time around, however, your code should be in a streamlined form: you should prune your code of any unsuccessful bits and experiments that have since been abandoned. In other words, your code files should demonstrate your project in a lean and coherent manner. Some important points:

  1. For many of you, breaking down your code into multiple Jupyter Notebook files will make organizational sense. For example, the first notebook could focus on data clean-up effort, and the second one takes from there and conducts data analysis, and so forth.
  2. Still, your Jupyter Notebook file will get long and unwieldy, which makes scrolling a pain. So, a remedy: at the top, include a “Table of contents” that provides handy shortcuts to various subsections via section anchors, which are auto-generated from section headers. See this screenshot for how it’s done. More about this in a bit.
  3. Your code may produce interim outputs (such as saved pickle files). If you decide against sharing them, make sure to exclude them from GitHub repo via .gitignore.
  4. Make sure to “Restart & Run All” your Jupyter Notebook file before pushing to your GitHub repo! You want all cell outputs to be tidy.
  5. So, about those section anchors. The unfortunate thing is, they don’t work on GitHub once your JNB is uploaded there. See my notebook on GitHub, where clicking on the links doesn’t do a thing (annotated screenshot of this tragedy). Luckily for us, there is an alternative online viewing method: jupyter.org’s nbviewer. After I copy and paste my GitHub’s notebook URL into the box, my notebook then becomes accessible here. And the links all work (see annotated screenshot here)… hurray! But why all this fuss? The thing is, when you write up your final report (next section), you will need to refer back to relevant sections of your Jupyter Notebook. Without section links, the best you can do is link to the entire notebook which could be miles long, and that’s just not very reader-friendly, at all.
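
The table of contents and the nbviewer links described in points 2 and 5 can be put together with a few helper functions. This is a sketch under assumptions: the helper names and the example URL are hypothetical, the anchor rule (spaces in the header text replaced by hyphens) reflects how Jupyter typically generates header anchors, and the URL rewrite mirrors what nbviewer produces when a GitHub notebook URL is pasted into its box. Check the resulting links in your own notebook.

```python
def header_anchor(header):
    """Anchor for a notebook section header: spaces become hyphens (assumed rule)."""
    return "#" + header.strip().replace(" ", "-")


def toc_markdown(headers):
    """Markdown list of section links, to paste into a Markdown cell at the top."""
    return "\n".join(f"- [{h}]({header_anchor(h)})" for h in headers)


def nbviewer_url(github_url):
    """Rewrite a github.com notebook URL into its nbviewer equivalent."""
    prefix = "https://github.com/"
    if not github_url.startswith(prefix):
        raise ValueError("expected a URL starting with " + prefix)
    return "https://nbviewer.org/github/" + github_url[len(prefix):]


def section_link(github_url, header):
    """Link to one section of a notebook as rendered by nbviewer."""
    return nbviewer_url(github_url) + header_anchor(header)


# Hypothetical section headers and notebook URL, for illustration only.
headers = ["Data clean-up", "Basic stats", "Analysis"]
notebook = "https://github.com/YOURNAME/YOUR-REPO/blob/main/analysis.ipynb"

print(toc_markdown(headers))
print(section_link(notebook, "Basic stats"))
```

Links built this way are the kind you would cite from final_report.md when pointing readers to a specific part of a notebook.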

final_report.md

Think of this as a usual “final report” that is in a markdown format instead of MS Word. Details:

Submission: Your entire project repo is your submission. Make sure everything is in order and looks good. It is due 6pm on May 1 (Sun).