Learning Resources by Topic

This is not just a link dump. These resources are carefully curated textbook stand-ins, and you are fully expected to learn from them! There are multiple types:

Online tutorials. Watch, practice, and learn. I have pre-screened these and narrowed them down to the essential, relevant content, so you can stop wondering whether you should learn the whole thing!
Articles. Read them -- they will be referenced in lectures and used in classroom discussions.
Books and book chapters. Python Data Science Handbook neatly aligns with our data science focus and doubles as a reference book. Parts of the NLTK Book will also be referenced.
Software installation links. Download and install on your machine.
Bookmark pages. These are lists of useful links compiled by someone else, which often contain pointers to data sets or resources. Explore them and use them as needed; you should become familiar with what's on them.
References -- for looking things up.

Linguistic Data, Open Access, Data Publishing

Linguistics Data Repositories, Guides at UC San Diego Libraries [link]
Linguistic Linked Open Data [link]
Linguistic Data Consortium (LDC) [link]
The Open Handbook of Linguistic Data Management. (2022) MIT Press. Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, Lauren B. Collister (Eds.) [full book]
- Lauren Collister. (2022) Copyright and Sharing Linguistic Data. [chapter]
- Na-Rae Han. (2022) Transforming Data. [chapter]
Copyright and Intellectual Property Toolkit by Lauren Collister [link]
D-Scholarship @ Pitt: Institutional Repository at the University of Pittsburgh [link]
Justin Kitzes. (2018) The Basic Reproducible Workflow Template. In Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.) The Practice of Reproducible Research. [link]

Corpus Linguistics

Stefan Th. Gries and John Newman. (2013) Creating and using corpora. In Robert J. Podesva and Devyani Sharma (Eds.), Research Methods in Linguistics. [PDF]
NLTK Book Ch.2 Accessing Text Corpora and Lexical Resources [chapter]
NLTK Book Ch.11 Managing Linguistic Data [chapter]
NLTK Corpora Index [link] [GitHub repo]
FSNLP Ch.4 Corpus-Based Work Links [link]
Corpus-based Linguistics Links [link]
Corpus Resource Database (CoRD) [link]
Lillian Lee’s Corpus Datasets index [link]
TEI: A Gentle Introduction to XML [link]
json.org: Introducing JSON [link], JSON example (vs. XML) [link]
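
As a small companion to the XML and JSON introductions above, here is a minimal sketch of how the same record looks in each format and how Python's standard library reads it. The token record is hypothetical, invented for illustration only, and it does not come from any of the linked resources.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical token record, invented for illustration only.
json_text = '{"token": {"form": "dogs", "lemma": "dog", "pos": "NNS"}}'
xml_text = '<token pos="NNS"><form>dogs</form><lemma>dog</lemma></token>'

# JSON parses directly into Python dicts, lists, strings, and numbers.
record = json.loads(json_text)
json_lemma = record["token"]["lemma"]

# XML parses into a tree of elements; values live in text nodes or attributes.
root = ET.fromstring(xml_text)
xml_lemma = root.find("lemma").text
xml_pos = root.get("pos")

print(json_lemma, xml_lemma, xml_pos)
```

Both snippets recover the same lemma. The difference is in the access pattern: dictionary keys for JSON, element lookups and attributes for XML.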

Linguistic Annotation, Ontology, and Knowledge Engineering

Geoffrey Leech. (2005) Adding Linguistic Annotation. In Martin Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice. [link]
Natural Language Annotation for Machine Learning [Chapter 1, full ebook]
NLTK Book Ch.11 Managing Linguistic Data [chapter]
WordNet: a lexical database for English [link]
Universal Dependencies [project home]
AMR: Abstract Meaning Representation [project home]
WebAnno annotation tool [home, GitHub]

Speech and Multimedia Data

Using Praat for Linguistic Research (by Will Styler) [tutorial]
Corpus Phonetics Tutorial (by Eleanor Chodroff) [tutorial]
The Language Archive (home of ELAN) [home]
ELAN Linguistic Annotator [home], BU ASL Corpus as ELAN [link]
(DataCamp) Spoken Language Processing in Python [course]
Jurafsky & Martin (2020) Speech and Language Processing Ch. 26 Automatic Speech Recognition and Text-to-Speech [PDF]
Less technical overview of ASR:
- Speech Recognition – ASR Model Training (by Jonathan Hui) [part 1], [part 2], [part 3 with recap]
- Introduction to ASR (by Maël Fabien, with IPA!!) [link]
More info on software, tools (Python libraries, MFA, SoX, etc.) in the “Tools” section below!

Statistics References

Python Statistics Fundamentals: How to Describe Your Data [link]
SPSS Tutorials: Analyzing Data [link]
T-test using Python and Numpy [tutorial]
Laerd.com [independent t-test, dependent t-test, ANOVA, Spearman’s rank-order correlation]
Statistics By Jim [ANOVA and F-test, Interpreting correlations]
Python for Data Science [covariance/correlation, t-test, ANOVA]
(DataCamp) Introduction to Statistics in Python [course]
(DataCamp) Statistics Fundamentals with Python [skill track]

Data Processing Fundamentals: Python’s numpy, pandas, and visualization libraries

Python Data Science Handbook. (2016) O’Reilly Media [book]
(DataCamp) Career Track: Data Scientist with Python (includes all courses below and more) [track]
- (DataCamp) Introduction to Python, Ch.4 NumPy [course] [Ch.4 NumPy]
- (DataCamp) Intermediate Python for Data Science. Focus on Matplotlib, NumPy & pandas. [course]
- (DataCamp) Data Manipulation with pandas [course]
- (DataCamp) Joining Data with pandas [course]
- (DataCamp) Cleaning Data with Python [course]
- (DataCamp) Introduction to Data Visualization with Matplotlib [course]
- (DataCamp) Introduction to Data Visualization with Seaborn [course]
Visualization: pandas 0.20.3 documentation [link]
Chris Albon’s Notes on ML & AI, “Data Wrangling” [link]
19 Essential Snippets in Pandas [link]

Twitter text mining tutorials: [Marco Bonzanini] [Anthony Sistilli] [Suhem Parack, latest API v2]
Tweepy Twitter API documentation (v1.1, v2)
(DataCamp) Analyzing Social Media Data in Python [course]
(DataCamp) Web Scraping in Python [course]
Web Scraping with Python (LinkedIn Learning) [link]
Web Scraping with BeautifulSoup [tutorial]
Scrapy Tutorial (official documentation) [link]
Mapping the United Swears of America by Jack Grieve [link]

Machine Learning

Python Data Science Handbook. (2016) O’Reilly Media [book]
Movie Reviews Sentiment Analysis with Scikit-Learn [link]
Topic Modeling with Scikit Learn [link]
(DataCamp) Supervised Learning with scikit-learn [tutorial]
(DataCamp) Unsupervised Learning in Python [course]
(DataCamp) NLP Fundamentals in Python [course]
Scikit-Learn cheat sheet [Towards Data Science], [DataCamp], [PDF]

Big Data Essentials

CRC: Center for Research Computing at Pitt [link] [h2p user guide] [OnDemand, Help doc]
Why and How to Use Pandas with Large Data (but not big data…) [link]
A Beginner’s Guide to Big O Notation [link]
Learn Big Data Analytics using Top YouTube Videos, TED Talks & other resources [link]

Advanced NLP

(We will only touch on this topic in class. These resources are for your own future learning!)

spaCy: Industrial-Strength Natural Language Processing in Python [link]
(DataCamp) Skill track: NLP in Python (includes below and more) [track]
- (DataCamp) Advanced NLP with spaCy [course]
- (DataCamp) Sentiment Analysis in Python [course]
- (DataCamp) Feature Engineering for NLP in Python [course]
TensorFlow
- Classify Text with BERT [Google tutorial]
- What is TensorFlow [intro]
BERT + PyTorch
- TowardsDS: Bert text classification using PyTorch [tutorial]
- Bert for Advance (sic) NLP with Transformers in Pytorch [tutorial], Text classification on the Corpus of Linguistic Acceptability (COLA) [tutorial]
- Analytics Vidhya: [demystifying BERT], [BERT for text classification]
- Visual guide to BERT and ELMo [BERT], [BERT and ELMo]

Tools

The sections below focus on the software tools side of the resources.

Git and GitHub

Git download & installation [link]
LSA 2019 Reproducible Research Workshop tutorials: Part 1 Intro to Git, Part 2 Linking Git with GitHub
(LinkedIn Learning) Git Essential Training: The Basics [tutorial]
git - the simple guide [link]
GitHub Help: Fork a repo [link]
(Macs) Troubleshooting GitHub authentication errors: How to set up HTTPS Personal Access Tokens [guide, tutorial]

Markdown

GitHub Guides: Mastering Markdown [link]
Chrome browser Markdown Viewer extension [link]

Anaconda and Jupyter Notebook

Anaconda Python download & installation: current Python version is 3.9. [link]
Using IDLE in Anaconda Python
- Windows 10 or 11: follow this setup guide
- macOS: open Terminal, type idle3, then press Enter.
(LinkedIn Learning) Work with Jupyter Notebooks, how to launch, basics, Markdown [link]
Jupyter Notebook Tutorial: The Definitive Guide on DataCamp (more advanced) [link]
Jupyter nbviewer (alternative to GitHub’s view) [link]

Command line, Bash/Zsh and Unix Tools

(Mac only) Bash vs. Zsh [link]
Software Carpentry Lesson: The Unix Shell [link]
Using Unix for Linguistic Research (by Will Styler) [link]
Regular Expressions Tutorial [link]
The best introduction ever to grep, by softpanorama.org [link]
(Mac only) Installing GNU grep [link], installing pcre grep [link]

Text Editor

Recommended: Notepad++ [link] (Windows only); VS Code [link] and Sublime Text [link] (all platforms).
On the command-line side, nano is easiest to use. It is already on Macs; on Windows, it comes with Git-Bash.
(Atom [link] has been great but it is being discontinued, unfortunately.)

Speech and Multimedia Software

Praat: doing phonetics by computer [home]
praat-textgrids: Praat TextGrid manipulation in Python [library link]
Parselmouth – Praat in Python, the Pythonic way [GitHub link]
SpeechRecognition: Speech recognition module for Python, supporting several engines and APIs [GitHub link]
SoX [home] and FFmpeg [home], quick tutorial on both [link]
Montreal Forced Aligner [home], [GitHub repo]
Kaldi ASR Toolkit [home]
Phonetisaurus G2P [GitHub repo]
OpenFst Library [home]
ELAN: annotation tool for audio and video recordings [home]
More info in the “Speech and Multimedia Data” topic section above!

The topics below are not among the focus areas of this course, but parts of them will be relevant. They are provided for reference.

Natural Language Processing, NLTK, Computational Linguistics

Natural Language Toolkit (NLTK) Project Home [link]
NLTK Book, Python3 Edition [index] [navigation panel]
NLTK HOWTOs [link]
LING 1330/2330 Intro to CL

Python References

Python 3 Quick Reference [PDF]
Python 3 Notes
- FAQ
- Text samples