Data Science for Linguists 2022

Course home for LING 1340/2340


LING 1340/2340 Data Science for Linguists

Spring 2022, University of Pittsburgh

Description

Data science is a fast-growing, highly interdisciplinary professional and academic discipline whose practice centers on domain expertise. This course introduces linguistics majors to core methods and practices in data science as they pertain to linguistic inquiry. Students will first learn the fundamentals of structuring, manipulating, and sharing various forms of linguistic data; receive hands-on training in practical aspects of data processing, including handling large quantities of text data (“big data”) and creating statistical language models through machine learning; and become acquainted with the emerging field of knowledge engineering and ontology. They will also have the opportunity to apply data-intensive methods to a term project of their choice. Upon successful completion of this course, students will be able to (1) identify the best methods for representing and analyzing linguistic data for a given purpose, (2) transform and process linguistic data in large volumes, and (3) understand how statistics-driven text analytics and machine learning methods operate.

Prerequisites

The course assumes that students have an introductory knowledge of linguistics as well as basic competency in Python, a general-purpose programming language. Both are prerequisites for enrollment.

Knowledge of statistics is highly recommended but not required. Students without a statistics background will need to pick up the basics quickly as they come up.

Textbooks

Python Data Science Handbook (2016, O’Reilly Media) is probably the closest thing to a textbook we will have, although it will be used more as a reference book. The scope of this course goes beyond core data science skills, so articles and other materials will be assigned as needed. Throughout the course, we will also use various resources available on the web: see the Learning Resources page for a list.

Required Software

We will be using Python 3, specifically the Anaconda distribution. It ships with Jupyter Notebook, a popular interactive environment for Python, which we will be using extensively. Another key piece of software is Git, which enables version control and collaboration. In addition, we will learn Unix tools and the Bash shell; a text editor is also required.
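A quick way to confirm that the software listed above is in place is a short check like the one below. This is only a sketch: it assumes the tools are exposed on your PATH under the command names git, bash, and jupyter, and the helper name check_environment is made up for illustration rather than taken from the course materials.

```python
import shutil
import sys


def check_environment(tools=("git", "bash", "jupyter")):
    """Return a dict mapping each requirement name to True (found) or False."""
    # The course uses Python 3, so confirm the running interpreter's major version.
    report = {"python3": sys.version_info.major == 3}
    for tool in tools:
        # shutil.which returns the executable's path, or None if it is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A "MISSING" result does not necessarily mean the tool is not installed. On Windows, for example, Bash may only be reachable from within a separate terminal application and not from the PATH that Python sees.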

Required Hardware

The software applications above should install and run on your own laptop, which you are expected to bring to every class meeting. Your laptop should run one of these operating systems: Mac OS X (10.6 or later), Windows (10 or 11), or Linux (any distribution). Mobile and cloud-based operating systems are not supported, so iPads and Chromebooks are not suitable platforms for this class.

Course Requirements, Grading and Policies

Please see the Course Policies page.

Class Schedule

Please see the Class Schedule page.