Course overview

Welcome to CMPINF 2120! The ability to collect, store and process large amounts of detailed data in a variety of fields has led to a surge in the use of data for decision-making tasks, ranging from governmental policymaking to drafting players in sports. Data literacy is therefore essential, and in this first introductory course we will focus on shifting from the traditional mode of deterministic (yes/no) thinking to probabilistic thinking. We will review concepts from applied probability and statistics and explore how they can be used to build a data-driven infrastructure, with applications ranging from understanding everyday phenomena (e.g., descriptive modeling) to making decisions based on data (e.g., predictive modeling). In particular, we will focus on the principles and best practices for working with data, including (a) understanding the bias-variance tradeoff, (b) avoiding overfitting, (c) choosing the most appropriate model for your data and (d) evaluating your model’s performance.

While the main focus of the course is on supervised learning, we will also introduce unsupervised learning, in particular the problem of clustering. We will explore Monte Carlo simulation and resampling, and how they can be used to make predictions for systems that are too complicated to be solved in closed form. Finally, we will provide an overview of analytical methods for specialized forms of data, including time series and relational data.
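To give a flavor of the Monte Carlo idea mentioned above, here is a minimal sketch in Python. It is an illustration only, not course material: the function name, the number of trials and the choice of problem (the chance that at least two people in a group share a birthday, assuming 365 equally likely birthdays) are assumptions made for this example. This particular problem does have a closed-form answer, which makes it a convenient check that the simulation behaves sensibly; the same recipe (simulate many times, then count) carries over to systems where no closed form is available.

```python
import random


def estimate_shared_birthday_prob(group_size, n_trials=20000, seed=0):
    """Estimate P(at least two people share a birthday) by simulation.

    Assumes 365 equally likely birthdays (a simplification).
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(n_trials):
        # Draw one random birthday (day 0..364) per person in the group.
        birthdays = [rng.randrange(365) for _ in range(group_size)]
        # A repeated day means the set is smaller than the group.
        if len(set(birthdays)) < group_size:
            hits += 1
    # The fraction of simulated groups with a shared birthday is the estimate.
    return hits / n_trials


if __name__ == "__main__":
    print(estimate_shared_birthday_prob(23))
```

For a group of 23, the estimate should land close to the known exact value of roughly 0.507, and increasing n_trials tightens it further.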

Upon successful completion of this course, the student will be able to:

  1. Demonstrate a deep understanding of the fundamental concepts of statistical learning and modeling, including limitations associated with data and methodologies
  2. Implement, deploy and evaluate a range of different learning models in Python
  3. Apply data science concepts and methods to solve real-world problems and communicate the results of their analysis

We will be covering the following topics:
  • Probability and Statistics
  • Regression and Classification
  • Complexity and Overfitting
  • Survival Analysis and Bayesian Inference
  • Situate computing and information practices within a socio-cultural context
  • Simulation and Sampling
  • Unsupervised Learning
  • Matrix Factorization
  • Networks and Graphs
  • Time Series Analysis
  • Neural Networks

Instructor

Name: Arjun Chandrasekhar
Email: arjunc@pitt.edu
Student Hours:
  • Tuesday 11:00am-12:30pm via Zoom (Passcode CMPINF2120)
  • Or by appointment
    • I really cannot emphasize this enough. If my student hours don't work for you, then email me to set up something else. Please don't punt on getting the help you need because you reasonably but mistakenly interpreted the stated student hours to be set in stone!

Course Links