Learning Resources by Topic
This is not just a link dump. These resources are carefully curated textbook stand-ins, and you are fully expected to learn from them! There are multiple types:
- Online tutorials. Watch, practice, and learn. I pre-screened them and narrowed them down to the essential, relevant content, so you can stop wondering whether you should learn the whole thing!
- Articles. Read them -- they will be referenced in lectures and used in classroom discussions.
- Books and book chapters. Python Data Science Handbook neatly aligns with our data science focus and doubles as a reference book. Parts of the NLTK Book will also be referenced.
- Software installation links. Download and install on your machine.
- Bookmark pages. These are lists of useful links compiled by someone else, which often contain pointers to data sets or resources. Explore them and use them as needed; you should become familiar with what's on them.
- References -- for looking things up.
Linguistic Data, Open Access, Data Publishing
- Linguistics Data Repositories, Guides at UC San Diego Libraries [link]
- Linguistic Linked Open Data [link]
- Linguistic Data Consortium (LDC) [link]
- The Open Handbook of Linguistic Data Management. (2022) MIT Press. Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, Lauren B. Collister (Eds.) [full book]
- Lauren Collister. (2022) Copyright and Sharing Linguistic Data. [chapter]
- Na-Rae Han. (2022) Transforming Data. [chapter]
- Copyright and Intellectual Property Toolkit by Lauren Collister [link]
- D-Scholarship @ Pitt: Institutional Repository at the University of Pittsburgh [link]
- Justin Kitzes. (2018) The Basic Reproducible Workflow Template. In Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.) The Practice of Reproducible Research. [link]
Corpus Linguistics
- Stefan Th. Gries and John Newman. (2013) Creating and using corpora. In Robert J. Podesva and Devyani Sharma (Eds.), Research Methods in Linguistics. [PDF]
- NLTK Book Ch.2 Accessing Text Corpora and Lexical Resources [chapter]
- NLTK Book Ch.11 Managing Linguistic Data [chapter]
- NLTK Corpora Index [link] [GitHub repo]
- FSNLP Ch.4 Corpus-Based Work Links [link]
- Corpus-based Linguistics Links [link]
- Corpus Resource Database (CoRD) [link]
- Lillian Lee’s Corpus Datasets index [link]
- TEI: A Gentle Introduction to XML [link]
- json.org: Introducing JSON [link], JSON example (vs. XML) [link]
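To make the JSON vs. XML comparison concrete, here is a minimal sketch using only Python's standard library. The token record and its field names (form, lemma, pos) are made up for illustration and are not taken from any of the resources above.

```python
import json
import xml.etree.ElementTree as ET

# A made-up token record, written once as JSON and once as XML.
json_text = '{"token": {"form": "dogs", "lemma": "dog", "pos": "NNS"}}'
xml_text = '<token pos="NNS"><form>dogs</form><lemma>dog</lemma></token>'

# JSON parses directly into Python dicts, lists, strings, and numbers.
from_json = json.loads(json_text)["token"]

# XML parses into an element tree; attributes and child text are read separately.
root = ET.fromstring(xml_text)
from_xml = {
    "form": root.findtext("form"),
    "lemma": root.findtext("lemma"),
    "pos": root.get("pos"),
}

# Both formats carry the same information once parsed.
print(from_json == from_xml)
```

JSON maps straight onto Python's built-in types, while XML requires a decision about what goes in attributes and what goes in child elements. This is only one possible XML layout for the record.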
Linguistic Annotation, Ontology, and Knowledge Engineering
- Adding Linguistic Annotation, Geoffrey Leech (2005). In Martin Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice. [link]
- Natural Language Annotation for Machine Learning [Chapter 1, full ebook]
- NLTK Book Ch.11 Managing Linguistic Data [chapter]
- WordNet: a lexical database for English [link]
- Universal Dependencies [project home]
- AMR: Abstract Meaning Representation [project home]
- WebAnno annotation tool [home, GitHub]
Speech and Multimedia Data
- Using Praat for Linguistic Research (by Will Styler) [tutorial]
- Corpus Phonetics Tutorial (by Eleanor Chodroff) [tutorial]
- The Language Archive (home of ELAN) [home]
- ELAN Linguistic Annotator [home], BU ASL Corpus as ELAN [link]
- (DataCamp) Spoken Language Processing in Python [course]
- Jurafsky & Martin (2020) Speech and Language Processing Ch. 26 Automatic Speech Recognition and Text-to-Speech [PDF]
- Less technical overview of ASR:
- More info on software, tools (Python libraries, MFA, SoX, etc.) in the “Tools” section below!
Statistics References
Data Processing Fundamentals: Python’s numpy, pandas, and visualization libraries
- Python Data Science Handbook. (2016) O’Reilly Media [book]
- (DataCamp) Career Track: Data Scientist with Python (includes all courses below and more) [track]
- (DataCamp) Introduction to Python, Ch.4 NumPy [course] [Ch.4 Numpy]
- (DataCamp) Intermediate Python for Data Science. Focus on Matplotlib, Numpy & Pandas. [course]
- (DataCamp) Data Manipulation with pandas [course]
- (DataCamp) Joining Data with pandas [course]
- (DataCamp) Cleaning Data with Python [course]
- (DataCamp) Introduction to Data Visualization with Matplotlib [course]
- (DataCamp) Introduction to Data Visualization with Seaborn [course]
- Visualization: pandas 0.20.3 documentation [link]
- Chris Albon’s Notes on ML & AI, “Data Wrangling” [link]
- 19 Essential Snippets in Pandas [link]
Machine Learning
- Python Data Science Handbook. (2016) O’Reilly Media [book]
- Movie Reviews Sentiment Analysis with Scikit-Learn [link]
- Topic Modeling with Scikit Learn [link]
- (DataCamp) Supervised Learning with scikit-learn [tutorial]
- (DataCamp) Unsupervised Learning in Python [course]
- (DataCamp) NLP Fundamentals in Python [course]
- Scikit-Learn cheat sheet [Towards Data Science], [DataCamp], [PDF]
Big Data Essentials
- CRC: Center for Research Computing at Pitt [link] [h2p user guide] [OnDemand, Help doc]
- Why and How to Use Pandas with Large Data (but not big data…) [link]
- A Beginner’s Guide to Big O Notation [link]
- Learn Big Data Analytics using Top YouTube Videos, TED Talks & other resources [link]
Advanced NLP
(We only touched on this topic in class; these are for your own future learning!)
- spaCy: Industrial-Strength Natural Language Processing in Python [link]
- (DataCamp) Skill track: NLP in Python (includes below and more) [track]
- (DataCamp) Advanced NLP with spaCy [course]
- (DataCamp) Sentiment Analysis in Python [course]
- (DataCamp) Feature Engineering for NLP in Python [course]
- TensorFlow
- BERT + PyTorch
The sections below focus on the software tools side of resources.
Git and GitHub
- Git download & installation [link]
- LSA 2019 Reproducible Research Workshop tutorials: Part 1 Intro to Git, Part 2 Linking Git with GitHub
- GitHub Help: Fork a repo [link]
- How to get started with Git and GitHub [YouTube]
- git - the simple guide [link]
- Tutorials: Become a git guru. (Uses BitBucket instead of GitHub, ignore parts on SVN) [link]
- GitHub authentication: How to set up HTTPS Personal Access Tokens [GitHub Help page, Tutorial]
Markdown
- GitHub Guides: Mastering Markdown [link]
- Chrome browser Markdown Viewer extension [link]
Anaconda and Jupyter Notebook
- Anaconda Python download & installation: current Python version is 3.9. [link]
- Don’t want Anaconda? If you already have another Python distribution installed, you can simply add on jupyter via pip3. Follow the directions on this page, under OPTION 2 for Python.
- Using IDLE in Anaconda Python
- Windows 10: follow this setup guide
- Mac OS: open up Terminal, type idle3, then press Enter.
- Lynda.com Tutorial: Work with Jupyter Notebooks, how to launch, basics, Markdown [link]
- Jupyter Notebook Tutorial: The Definitive Guide on DataCamp (more advanced) [link]
- Jupyter nbviewer (alternative to GitHub’s view) [link]
- (Mac only) Bash vs. Zsh [link]
- Software Carpentry Lesson: The Unix Shell [link]
- Using Unix for Linguistic Research (by Will Styler) [link]
- Regular Expressions Tutorial [link]
- The best introduction ever to grep, by softpanorama.org [link]
- (Mac only) Installing GNU grep [link], installing pcre grep [link]
Text Editor
- Atom [link] recommended for all systems.
- Also good: Notepad++ [link] (Windows only) and Sublime Text [link] (all platforms).
- On the command-line side, nano is easiest to use. It is already on Macs; on Windows, it comes with Git-Bash.
- Praat: doing phonetics by Computer [home]
- praat-textgrids: Praat TextGrid manipulation in Python [library link]
- Parselmouth – Praat in Python, the Pythonic way [GitHub link]
- SpeechRecognition: Speech recognition module for Python, supporting several engines and APIs [GitHub link]
- SoX [home] and FFmpeg [home], quick tutorial on both [link]
- Montreal Forced Aligner [home], [GitHub repo]
- Kaldi ASR Toolkit [home]
- Phonetisaurus G2P [GitHub repo]
- OpenFST Library [home]
- ELAN: annotation tool for audio and video recordings [home]
- More info in the “Speech and Multimedia Data” topic section above!
The topics below are not among the focus areas of this course, but parts of them will be relevant. They are provided for reference.
Natural Language Processing, NLTK, Computational Linguistics
Python References