Learning Resources by Topic
This is not just a link dump. These resources are carefully curated textbook stand-ins, and you are fully expected to learn from them! There are multiple types:
- Online tutorials. Watch, practice, and learn. I pre-screened them and narrowed them down to only the essential and relevant content, so you can stop wondering whether you should learn the whole thing!
- Articles. Read them -- they will be referenced in lectures and used in classroom discussions.
- Books and book chapters. Python Data Science Handbook neatly aligns with our data science focus and doubles as a reference book. Parts of the NLTK Book will also be referenced.
- Software installation links. Download and install on your machine.
- Bookmark pages. These are lists of useful links compiled by someone else, which often contain pointers to data sets or resources. Explore them and use them as needed; you should become familiar with what's on them.
- References -- for looking things up.
Linguistic Data, Open Access, Data Publishing
- Linguistics Data Repositories, Guides at OSH [link]
- Linguistic Linked Open Data [link]
- Linguistic Data Consortium (LDC) [link]
- Data Management Plans for Linguistic Research, Workshop at 2017 LSA Summer Institute [link] [slides]
- Justin Kitzes. (2018) The Basic Reproducible Workflow Template. In Justin Kitzes, Daniel Turek, Fatma Deniz (Eds.) The Practice of Reproducible Research. [link]
- D-Scholarship @ Pitt: Institutional Repository at the University of Pittsburgh [link]
- Copyright and Intellectual Property Toolkit by Lauren Collister [link]
- TEI: A Gentle Introduction to XML [link]
- json.org: Introducing JSON [link], JSON example (vs. XML) [link]
Corpus Linguistics
- Stefan Th. Gries and John Newman. (2013) Creating and using corpora. In Robert J. Podesva and Devyani Sharma (Eds.), Research Methods in Linguistics. [PDF]
- NLTK Book Ch.2 Accessing Text Corpora and Lexical Resources [chapter]
- NLTK Book Ch.11 Managing Linguistic Data [chapter]
- NLTK Corpora Index [link] [GitHub repo]
- FSNLP Ch.4 Corpus-Based Work Links [link]
- Corpus-based Linguistics Links [link]
- Corpus Resource Database (CoRD) [link]
Linguistic Annotation, Ontology, and Knowledge Engineering
- NLTK Book Ch.11 Managing Linguistic Data [chapter]
- Adding Linguistic Annotation, Geoffrey Leech [link]
- Natural Language Annotation for Machine Learning [Chapter 1, full ebook]
- WordNet: a lexical database for English [link]
- Universal Dependencies [project home]
- AMR: Abstract Meaning Representation [project home]
- WebAnno annotation tool [home, GitHub]
Statistics References
- Python for Data Science: Inferential Statistics [webpage]
- Rice Virtual Lab in Statistics [webpage]
Data Processing Fundamentals: Python’s numpy, pandas, and visualization libraries
- Python Data Science Handbook. (2016) O’Reilly Media [book]
- (DataCamp) Introduction to Python for Data Science, Ch.4 NumPy [tutorial]
- (DataCamp) Intermediate Python for Data Science. Focus on Matplotlib, NumPy, and pandas. [tutorial]
- (DataCamp) pandas Foundations [tutorial]
- (DataCamp) Manipulating DataFrames with pandas [tutorial]
- Visualization: pandas 0.20.3 documentation [link]
- Chris Albon’s Notes on ML & AI, “Data Wrangling” [link]
- 19 Essential Snippets in Pandas [link]
Data Mining & Machine Learning
- Twitter text mining tutorials: [The Code Way], [Adil Moujahid], [Marco Bonzanini]
- Scrapy tutorial [link]
- Mapping the United Swears of America by Jack Grieve [link]
- Python Data Science Handbook. (2016) O’Reilly Media [book]
- Movie Reviews Sentiment Analysis with Scikit-Learn [link]
- Topic Modeling with Scikit Learn [link]
- (DataCamp) Supervised Learning with scikit-learn [tutorial]
- (DataCamp) Unsupervised Learning in Python [tutorial]
- (DataCamp) NLP Fundamentals in Python [tutorial]
Big Data Essentials
- Why and How to Use Pandas with Large Data (but not big data…) [link]
- A Beginner’s Guide to Big O Notation [link]
- Learn Big Data Analytics using Top YouTube Videos, TED Talks & other resources [link]
- spaCy: Industrial-Strength Natural Language Processing in Python [link]
- CRC: Center for Research Computing at Pitt [link] [h2p] [hub]
The resources below focus more on software tools.
Git and GitHub
- Git download & installation [link]
- LSA 2019 Reproducible Research Workshop tutorials: Part 1 Intro to Git, Part 2 Linking Git with GitHub
- GitHub Help: Fork a repo [link]
- How to get started with Git and GitHub [YouTube]
- git - the simple guide [link]
- Tutorials: Become a git guru. (Uses Bitbucket instead of GitHub; ignore the parts on SVN.) [link]
Markdown
- GitHub Guides: Mastering Markdown [link]
- Chrome browser Markdown Viewer extension [link]
Anaconda and Jupyter Notebook
- Anaconda Python download & installation: use version 3.7. [link]
- Don’t want Anaconda? If you already have another Python distribution installed, you can simply add on jupyter via pip3. Follow the directions on this page, under OPTION 2 for Python.
- Lynda.com Tutorial: Introduction to Jupyter Notebook, basics, Markdown, How to Launch. (Skip “mathematical typesetting” video.) [link]
- Jupyter Notebook Tutorial: The Definitive Guide on DataCamp (more advanced) [link]
- Software Carpentry Lesson: The Unix Shell [link]
- Thirty Useful Unix Commands [PDF]
- Regular Expressions Tutorial [link]
- The best introduction ever to grep, by softpanorama.org [link]
Text Editor
- Atom [link] recommended for all systems.
- Also good: Notepad++ [link] (Windows only) and Sublime Text [link] (all platforms).
- On the command-line side, nano is easiest to use. It is already on Macs; on Windows, it comes with Git-Bash.
The topics below are not among the focus areas of this course, but parts of them will be relevant. They are provided for reference.
Natural Language Processing, NLTK, Computational Linguistics
Python References