Learning Resources by Topic

This is not just a link dump. These resources are carefully curated textbook stand-ins, and you are fully expected to learn from them! There are multiple types:

Online tutorials. Watch, practice and learn. I pre-screened these and narrowed them down to only the most essential and relevant content, so you can stop wondering if you should learn the whole thing!
Articles. Read them -- they will be referenced in lectures and used in classroom discussions.
Books and book chapters. Python Data Science Handbook neatly aligns with our data science focus and doubles as a reference book. Parts of the NLTK Book will also be referenced.
Software installation links. Download and install on your machine.
Bookmark pages. These are lists of useful links compiled by someone else, which often contain pointers to data sets or resources. Explore them and use them as needed; you should become familiar with what's on them.
References -- for looking things up.

Linguistic Data, Open Access, Data Publishing

Linguistics Data Repositories, Guides at UC San Diego Libraries [link]
Linguistic Linked Open Data [link]
Linguistic Data Consortium (LDC) [link]
Data Sharing for Linguists, guest lecture by Lauren Collister [slides]
Justin Kitzes. (2018) The Basic Reproducible Workflow Template. In Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.) The Practice of Reproducible Research. [link]
D-Scholarship @ Pitt: Institutional Repository at the University of Pittsburgh [link]
Copyright and Intellectual Property Toolkit by Lauren Collister [link]
TEI: A Gentle Introduction to XML [link]
json.org: Introducing JSON [link], JSON example (vs. XML) [link]
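
To make the JSON vs. XML comparison concrete, here is a minimal sketch that writes one small record in both formats using only Python's standard library. The record itself (a single token with form, lemma, and part of speech) is a made-up example, not taken from any of the resources above, and storing the fields as XML attributes is just one of several reasonable encodings.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration only.
token = {"form": "dogs", "lemma": "dog", "pos": "NOUN"}

# JSON: the dictionary maps directly onto a JSON object.
json_text = json.dumps(token)

# XML: the same fields stored as attributes on a single <token> element.
elem = ET.Element("token", attrib=token)
xml_text = ET.tostring(elem, encoding="unicode")

# Both formats can be parsed back into the same Python dictionary.
from_json = json.loads(json_text)
from_xml = dict(ET.fromstring(xml_text).attrib)

print(json_text)
print(xml_text)
print(from_json == from_xml)
```

Note that JSON keeps data types such as numbers and lists, while XML attribute values always come back as strings, so real data often needs an extra conversion step on the XML side.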

Corpus Linguistics

Stefan Th. Gries and John Newman. (2013) Creating and using corpora. In Robert J. Podesva and Devyani Sharma (Eds.), Research Methods in Linguistics. [PDF]
NLTK Book Ch.2 Accessing Text Corpora and Lexical Resources [chapter]
NLTK Book Ch.11 Managing Linguistic Data [chapter]
NLTK Corpora Index [link] [GitHub repo]
FSNLP Ch.4 Corpus-Based Work Links [link]
Corpus-based Linguistics Links [link]
Corpus Resource Database (CoRD) [link]

Linguistic Annotation, Ontology, and Knowledge Engineering

Adding Linguistic Annotation, Geoffrey Leech (2005). In Martin Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice. [link]
Natural Language Annotation for Machine Learning [Chapter 1, full ebook]
NLTK Book Ch.11 Managing Linguistic Data [chapter]
WordNet: a lexical database for English [link]
Universal Dependencies [project home]
AMR: Abstract Meaning Representation [project home]
WebAnno annotation tool [home, GitHub]

Speech and Multimedia Data

Using Praat for Linguistic Research (by Will Styler) [tutorial]
Corpus Phonetics Tutorial (by Eleanor Chodroff) [tutorial]
The Language Archive (home of ELAN) [home]
ELAN Linguistic Annotator [home], BU ASL Corpus as ELAN [link]
(DataCamp) Spoken Language Processing in Python [course]
Jurafsky & Martin (2020) Speech and Language Processing Ch. 26 Automatic Speech Recognition and Text-to-Speech [PDF]
Less technical overviews of ASR:
- Speech Recognition – ASR Model Training (by Jonathan Hui) [part 1], [part 2], [part 3 with recap]
- Introduction to ASR (by Maël Fabien, with IPA!!) [link]
More info on software, tools (Python libraries, MFA, SoX, etc.) in the “Tools” section below!

Statistics References

Python Statistics Fundamentals: How to Describe Your Data [link]
SPSS Tutorials: Analyzing Data [link]
T-test using Python and Numpy [tutorial]
Laerd.com [independent t-test, dependent t-test, ANOVA, Spearman’s rank-order correlation]
Python for Data Science [covariance/correlation, t-test, ANOVA]
(DataCamp) Introduction to Statistics in Python [course]
(DataCamp) Statistics Fundamentals with Python [skill track]

Data Processing Fundamentals: Python’s numpy, pandas, and visualization libraries

Python Data Science Handbook. (2016) O’Reilly Media [book]
(DataCamp) Introduction to Python for Data Science, Ch.4 NumPy [course] [Ch.4 Numpy]
(DataCamp) Intermediate Python for Data Science. Focus on Matplotlib, Numpy & Pandas. [course]
(DataCamp) pandas Foundations [course]
(DataCamp) Manipulating DataFrames with pandas [course]
Visualization: pandas 0.20.3 documentation [link]
Chris Albon’s Notes on ML & AI, “Data Wrangling” [link]
19 Essential Snippets in Pandas [link]

Twitter text mining tutorials: [Adil Moujahid], [Marco Bonzanini], [Anthony Sistilli]
(DataCamp) Analyzing Social Media Data in Python [course]
(DataCamp) Web Scraping in Python [course]
Web Scraping with Python (LinkedIn Learning) [link]
Scrapy Tutorial (official documentation) [link]
Mapping the United Swears of America by Jack Grieve [link]

Machine Learning

Python Data Science Handbook. (2016) O’Reilly Media [book]
Movie Reviews Sentiment Analysis with Scikit-Learn [link]
Topic Modeling with Scikit Learn [link]
(DataCamp) Supervised Learning with scikit-learn [tutorial]
(DataCamp) Unsupervised Learning in Python [course]
(DataCamp) NLP Fundamentals in Python [course]
Scikit-Learn cheat sheet [Towards Data Science], [DataCamp], [PDF]

Big Data Essentials

CRC: Center for Research Computing at Pitt [link] [h2p user guide] [JupyterHub (SMP)]
Why and How to Use Pandas with Large Data (but not big data…) [link]
A Beginner’s Guide to Big O Notation [link]
Learn Big Data Analytics using Top YouTube Videos, TED Talks & other resources [link]

Advanced NLP

(We only touched on this topic in class; these resources are for your own future learning!)

spaCy: Industrial-Strength Natural Language Processing in Python [link]
(DataCamp) Advanced NLP with spaCy [course]
(DataCamp) NLP in Python [skill track]
TensorFlow
- Classify Text with BERT [Google tutorial]
- What is TensorFlow [intro]
BERT + PyTorch
- TowardsDS: Bert text classification using PyTorch [tutorial]
- Bert for Advance (sic) NLP with Transformers in Pytorch [tutorial], Text classification on the Corpus of Linguistic Acceptability (COLA) [tutorial]
- Analytics Vidhya: [demystifying BERT], [BERT for text classification]
- Visual guide to BERT and ELMo [BERT], [BERT and ELMo]

Tools

The resources below focus on software tools.

Git and GitHub

Git download & installation [link]
LSA 2019 Reproducible Research Workshop tutorials: Part 1 Intro to Git, Part 2 Linking Git with GitHub
GitHub Help: Fork a repo [link]
How to get started with Git and GitHub [YouTube]
git - the simple guide [link]
Tutorials: Become a git guru. (Uses BitBucket instead of GitHub; ignore the parts on SVN.) [link]

Markdown

GitHub Guides: Mastering Markdown [link]
Chrome browser Markdown Viewer extension [link]

Anaconda and Jupyter Notebook

Anaconda Python download & installation: use version 3.8. [link]
Don’t want Anaconda? If you already have another Python distribution installed, you can simply add on jupyter via pip3. Follow the directions on this page, under OPTION 2 for Python.
Using IDLE in Anaconda Python
- Windows 10: follow this setup guide
- macOS: open Terminal, type idle3, then press Enter.
Lynda.com Tutorial: Introduction to Jupyter Notebook, basics, Markdown, How to Launch. (Skip “mathematical typesetting” video.) [link]
Jupyter Notebook Tutorial: The Definitive Guide on DataCamp (more advanced) [link]
Jupyter nbviewer (alternative to GitHub’s view) [link]

Command line, Bash/Zsh and Unix Tools

(Mac only) Bash vs. Zsh [link]
Software Carpentry Lesson: The Unix Shell [link]
Using Unix for Linguistic Research (by Will Styler) [link]
Regular Expressions Tutorial [link]
The best introduction ever to grep, by softpanorama.org [link]
(Mac only) Installing GNU grep [link], installing pcre grep [link]

Text Editor

Atom [link] recommended for all systems.
Also good: Notepad++ [link] (Windows only) and Sublime Text [link] (all platforms).
On the command-line side, nano is easiest to use. It is already on Macs; on Windows, it comes with Git-Bash.

Speech and Multimedia Software

Praat: doing phonetics by computer [home]
praat-textgrids: Praat TextGrid manipulation in Python [library link]
Parselmouth – Praat in Python, the Pythonic way [GitHub link]
SpeechRecognition: Speech recognition module for Python, supporting several engines and APIs [GitHub link]
SoX [home] and FFmpeg [home], quick tutorial on both [link]
Montreal Forced Aligner [home], [GitHub repo]
Kaldi ASR Toolkit [home]
Phonetisaurus G2P [GitHub repo]
OpenFST Library [home]
ELAN: annotation tool for audio and video recordings [home]
More info in the “Speech and Multimedia Data” topic section above!

The topics below are not among the focus areas of this course, but parts of them will be relevant. They are provided for reference.

Natural Language Processing, NLTK, Computational Linguistics

Natural Language Toolkit (NLTK) Project Home [link]
NLTK Book, Python3 Edition [index] [navigation panel]
NLTK HOWTOs [link]
LING 1330/2330 Intro to CL

Python References

Python 3 Quick Reference [PDF]
Python 3 Notes
- FAQ
- Text samples