Pennsylvania State University

Data Science for Scientists and Scholars

Study Guide

Please note that the precise schedule is subject to change. The lecture slides (and lecture notes, if any) are updated after the lecture.


Lecture 1. Data science - what, why, and how?

Transdisciplinary foundations of data science. Descriptive, predictive, and causal data science. Data science in practice. Formulating and answering questions using data. Course overview and course mechanics. Getting started with Google Colab.

Required readings

Recommended materials

Optional Materials


Lecture 2. Descriptive statistics

Populations versus samples. Centrality measures: mean, median, and mode. Measures of spread and their interpretation.
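
The centrality and spread measures above can be computed directly with Python's standard `statistics` module; a minimal illustration on made-up exam scores:

```python
import statistics

# A small sample of exam scores (hypothetical data for illustration).
scores = [72, 85, 85, 90, 64, 78, 85, 70]

mean = statistics.mean(scores)        # arithmetic mean
median = statistics.median(scores)    # middle value of the sorted sample
mode = statistics.mode(scores)        # most frequent value
sample_sd = statistics.stdev(scores)  # spread: sample standard deviation (n - 1 denominator)

print(mean, median, mode, round(sample_sd, 2))
```

Note that the mean (78.625) and median (81.5) disagree here; skewed samples pull the mean away from the median, which is one reason to report both.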

Required Readings

  • Chapter 3, Sections 3.1-3.4 from Shah, Chirag (2020). A Hands-On Introduction to Data Science. Cambridge University Press.
  • Read: Chapter 2 of Skiena, Steven S. (2017). The Data Science Design Manual. Springer.
  • Lecture Slides, Vasant Honavar

Recommended readings

Optional Materials

  • Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.

Lecture 3. Probability, random variables, and distributions

Sample spaces, simple events, compound events, conditional probability, independence, and Bayes rule. Probability theory as a bridge between descriptive and inferential statistics.
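
Bayes rule is easy to exercise numerically; a sketch with hypothetical diagnostic-testing numbers:

```python
# Bayes rule on a classic diagnostic-testing example (all numbers hypothetical).
p_disease = 0.01            # prior: P(D)
p_pos_given_disease = 0.95  # sensitivity: P(+ | D)
p_pos_given_healthy = 0.05  # false positive rate: P(+ | not D)

# Total probability of a positive test: P(+) = P(+|D)P(D) + P(+|~D)P(~D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes rule: P(D | +) = P(+ | D) P(D) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))
```

Even with a sensitive test, the posterior probability of disease given a positive result is only about 16% here, because the prior is small; this is the standard base-rate effect.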

Required Readings

  • Read: Chapter 2 of Skiena, Steven S. (2017). The Data Science Design Manual. Springer.
  • Lecture Slides, Vasant Honavar

Recommended readings

Optional Materials

  • Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.

Lecture 4. Probability Distributions: Discrete

Bernoulli, binomial, multinomial, Poisson distributions
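
These distributions can be evaluated from first principles with the standard `math` module; a brief sketch (the parameters are illustrative):

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# A Bernoulli(p) variable is the n = 1 special case of the binomial.
assert binomial_pmf(1, 1, 0.3) == 0.3

# P(exactly 2 heads in 10 tosses of a fair coin)
p2 = binomial_pmf(2, 10, 0.5)  # 45 / 1024
# P(exactly 3 arrivals when the mean rate is 2 per interval)
p3 = poisson_pmf(3, 2.0)
print(round(p2, 4), round(p3, 4))
```

A quick sanity check on any pmf is that its probabilities sum to one over the support.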

Required Readings

  • Read: Chapter 5, Section 5.1 of Skiena, Steven S. (2017). The Data Science Design Manual. Springer.
  • Lecture Slides, Vasant Honavar

Recommended readings

  • Read: Chapter 4, Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.

Optional Materials

  • Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.

Lecture 5. Probability Distributions: Continuous

Normal and standard normal distributions; calculating probabilities.
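
Normal probabilities can be computed without tables via the error function in Python's `math` module; an illustrative sketch:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ Normal(mu, sigma), via the error function."""
    z = (x - mu) / sigma  # standardize to the standard normal
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# About 68% of the probability mass lies within one standard deviation.
within_1sd = normal_cdf(1) - normal_cdf(-1)

# P(X > 110) for X ~ Normal(100, 15), an IQ-style scale (hypothetical)
p_above_110 = 1 - normal_cdf(110, mu=100, sigma=15)
print(round(within_1sd, 4), round(p_above_110, 4))
```

The standardization step inside `normal_cdf` is exactly the z-score transformation used when reading a standard normal table.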

Required Readings

  • Read: Chapter 5, Section 5.1 of Skiena, Steven S. (2017). The Data Science Design Manual. Springer.
  • Lecture Slides, Vasant Honavar

Recommended readings

  • Read: Chapter 4, Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.

Optional Materials

  • Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.

Lecture 6. Estimation

From statistics to probability - from sample to population. Sampling distributions. Point estimates. Large Sample Estimation. Error and confidence of estimates of means and proportions. Differences in means and differences in proportions.
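
A large-sample confidence interval for a mean follows directly from the formula mean ± z · s/√n; a sketch with made-up summary statistics:

```python
import math

# Hypothetical large sample summarized by its size, mean, and standard deviation.
n = 64
sample_mean = 50.0
sample_sd = 8.0

# Large-sample 95% confidence interval for the population mean:
# sample_mean +/- z * sd / sqrt(n), with z = 1.96 for 95% confidence.
z = 1.96
margin = z * sample_sd / math.sqrt(n)  # z times the standard error
ci = (sample_mean - margin, sample_mean + margin)
print(ci)
```

The same template gives intervals for proportions (with the binomial standard error) and, with pooled standard errors, for differences in means and proportions.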

Required Readings

  • Read: Chapter 5, Section 5.1 of Skiena, Steven S. (2017). The Data Science Design Manual. Springer.
  • Lecture Slides, Vasant Honavar

Recommended readings

  • Read: Chapter 5, Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.

Optional Materials

  • Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.

Lecture 7. Estimation

Small Sample estimates. Confidence intervals. Elements of Hypothesis Testing. Type 1 and Type 2 errors. When do we reject a null hypothesis and what does it mean to reject a null hypothesis? Statistical Significance. p-values. Interpretation of p-values.

Required Readings

  • Read: Chapter 5, Section 5.1 of Skiena, Steven S. (2017). The Data Science Design Manual. Springer.
  • Lecture Slides, Vasant Honavar

Recommended readings

Optional Materials


Lecture 8. Introduction to Python for Data Science. Guest Lecture: Neil Ashtekar

Basics of Python for data science: pandas, numpy, matplotlib, seaborn, scipy, and other useful libraries.

Required Readings

Recommended materials

Optional Materials


Lecture 9. Python Statistics Modules and How to Use them. Guest Lecture: Neil Ashtekar

Python libraries for descriptive statistics, estimation, and hypothesis testing.
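
Python's built-in `statistics` module already covers much of descriptive statistics before reaching for scipy; a quick tour on made-up measurements:

```python
import statistics

data = [2.1, 2.4, 1.9, 2.6, 2.2, 2.8, 2.0]

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "stdev": statistics.stdev(data),        # sample standard deviation
    "variance": statistics.variance(data),  # sample variance
    "quartiles": statistics.quantiles(data, n=4),  # three cut points
}
print(summary)
```

For estimation and hypothesis testing beyond these summaries (t tests, confidence intervals, nonparametric tests), `scipy.stats` is the usual next step.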

Required Readings

Recommended materials

Optional Materials


Lecture 10. Additional Python Data Science Tools for Tabular, Text, Image, and Other Types of Data. Guest Lecture: Neil Ashtekar

Required Readings

Recommended materials

Optional Materials


Lecture 11. Assembling data for data science projects

Anatomy of a data science project. Coming up with question(s). Assembling data. Avoiding common pitfalls. The importance of digging into the story behind the data. Illustrative case studies.

Required Readings

  • Review: Lecture Slides, Vasant Honavar
  • Read: Chapter 3 of Skiena, Steven S. (2017). The Data Science Design Manual. Springer.

Recommended materials

Optional Materials


Lecture 12. Predictive Modeling Using Machine Learning

What is machine learning? Why do we care about machine learning? What can we do with machine learning? Applications of machine learning. Types of machine learning problems. A simple machine learning algorithm - the K nearest neighbor classifier. Distance measures. K nearest neighbor regression.
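
The K nearest neighbor classifier fits in a few lines of plain Python; a sketch on made-up 2-D data using Euclidean distance:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (feature_vector, label) pairs; distance is Euclidean."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters (hypothetical).
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]

print(knn_predict(train, (1.5, 1.5), k=3))  # near the "A" cluster
print(knn_predict(train, (8.5, 8.5), k=3))  # near the "B" cluster
```

Swapping the distance function (e.g., Manhattan distance) or averaging the neighbors' target values instead of voting turns the same skeleton into K nearest neighbor regression.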

Required readings

Recommended readings

Optional Readings


Lecture 13. Deeper dive into data and data representation

Types of attributes: Nominal (categorical), ordinal, real-valued. Properties of attributes: distinctiveness, order, meaningfulness of differences, meaningfulness of ratios.

Types of data: Tabular data, ordered data (e.g., text, DNA sequences, SMILES representation of molecules), graph data (e.g., social networks, molecular structures), geo-spatial data.

Probability and random variables revisited. Joint distributions of (multiple) random variables. Conditional distributions. Marginalization. Bayes rule.
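
Joint distributions over a few discrete variables can be manipulated directly as dictionaries; a sketch with illustrative rain/sprinkler numbers:

```python
# A joint distribution over two binary variables, stored as a dict (toy numbers).
joint = {  # P(Rain, Sprinkler)
    ("rain", "on"): 0.05, ("rain", "off"): 0.25,
    ("dry",  "on"): 0.40, ("dry",  "off"): 0.30,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12  # must be a valid distribution

# Marginalization: P(Rain = rain) = sum over Sprinkler of the joint.
p_rain = sum(p for (r, s), p in joint.items() if r == "rain")

# Conditioning: P(Sprinkler = on | Rain = dry) = P(dry, on) / P(dry).
p_dry = sum(p for (r, s), p in joint.items() if r == "dry")
p_on_given_dry = joint[("dry", "on")] / p_dry
print(p_rain, round(p_on_given_dry, 4))
```

Marginalization and conditioning are the two operations Bayes rule composes; the dictionary representation makes both one-line sums.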

Required readings

Recommended readings

Optional Readings


Lecture 14: Probabilistic Generative Models: Naive Bayes

Probabilistic generative models. Bayes optimal classifier (minimum error classifier). Naive Bayes classifier and its applications. Classification of tabular, sequence, text, and image data. Multivariate, multinomial, and Gaussian Naive Bayes. Avoiding overfitting: robust probability estimates.
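
A multinomial Naive Bayes text classifier with add-one (Laplace) smoothing can be sketched in plain Python; the four-document corpus below is made up:

```python
import math
from collections import Counter, defaultdict

# Tiny labeled text corpus (hypothetical) for a multinomial Naive Bayes sketch.
docs = [("spam", "win money now"), ("spam", "win prize now"),
        ("ham", "meeting at noon"), ("ham", "lunch at noon")]

word_counts = defaultdict(Counter)  # per-class word counts
class_counts = Counter()
for label, text in docs:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(label, text):
    """log P(label) + sum of log P(word | label), Laplace-smoothed."""
    logp = math.log(class_counts[label] / len(docs))
    total = sum(word_counts[label].values())
    for w in text.split():
        # add-one smoothing gives robust estimates: no zero probabilities
        logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return logp

def classify(text):
    return max(class_counts, key=lambda c: log_posterior(c, text))

print(classify("win money"))
print(classify("lunch at noon"))
```

Working in log space avoids underflow when many word probabilities are multiplied, and the add-one counts are exactly the "robust probability estimates" that guard against overfitting to unseen words.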

Required readings

Recommended Readings

Optional Readings


Lecture 15: Evaluating predictive models

Evaluation of classifiers. Accuracy, Sensitivity, Specificity, Correlation Coefficient. Tradeoffs between sensitivity and specificity. When does a classifier outperform another? ROC curves. Estimating performance measures; confidence interval calculation for estimates; cross-validation based estimates of model performance; leave-one-out and bootstrap estimates of performance; comparing two models; comparing two learning algorithms.
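
Accuracy, sensitivity, and specificity all follow from the four confusion-matrix counts; a sketch on made-up predictions:

```python
# Confusion-matrix-based metrics from true vs. predicted labels (toy data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
print(accuracy, sensitivity, specificity)
```

Sweeping the classifier's decision threshold and plotting sensitivity against 1 - specificity at each setting traces out the ROC curve discussed above.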

Required Readings

Recommended readings

Optional readings


Lecture 16: Probabilistic Generative Models: Bayes Networks

Bayes Networks. Compact representation of Joint Probability distributions using Bayes Networks. Conditional independence and d-separation. Factorization of joint probability distributions based on conditional independence assumptions encoded by a Bayes Network. Probabilistic inference using Bayes Networks. Exact inference in polytrees. Approximate inference using stochastic simulation. Learning Bayes Networks from data: Learning parameters. Learning Structure. A simple algorithm using conditional independence queries.
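
The factorization a Bayes network encodes makes exact inference by enumeration straightforward for tiny networks; a sketch for a three-node chain A → B → C with illustrative CPTs:

```python
# A three-node Bayes network A -> B -> C (toy CPTs), where the joint
# factorizes as P(A, B, C) = P(A) * P(B | A) * P(C | B).
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
p_c_given_b = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Exact inference by enumeration: P(C = True), summing out A and B.
vals = [True, False]
p_c = sum(joint(a, b, True) for a in vals for b in vals)

# Sanity check: the full joint must sum to 1 if the CPTs are consistent.
total = sum(joint(a, b, c) for a in vals for b in vals for c in vals)
print(round(p_c, 4), round(total, 4))
```

Enumeration is exponential in the number of variables, which is why polytree algorithms and stochastic simulation matter for larger networks.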

Required readings

Recommended Readings

Optional Readings


Lecture 17: Decision Trees and Random Forests

Modeling dependence between attributes. The decision tree classifier. Introduction to information theory. Information, entropy, mutual information, and related concepts. Algorithm for learning decision tree classifiers from data.
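
Entropy and information gain, the quantities behind the splitting criterion, can be computed in a few lines; a sketch on a made-up binary split:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Toy node: does splitting on a binary attribute reduce entropy?
labels = ["yes"] * 6 + ["no"] * 4  # parent node
left   = ["yes"] * 5 + ["no"] * 1  # examples with attribute = 0
right  = ["yes"] * 1 + ["no"] * 3  # examples with attribute = 1

# Information gain = parent entropy - weighted average child entropy.
n = len(labels)
gain = (entropy(labels)
        - (len(left) / n) * entropy(left)
        - (len(right) / n) * entropy(right))
print(round(entropy(labels), 4), round(gain, 4))
```

Decision tree induction greedily chooses the attribute with the largest gain at each node; the multi-valued-split pitfall arises because gain alone favors attributes with many values.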

Pruning decision trees. Pitfalls of entropy as a splitting criterion for multi-valued splits and ways to avoid the pitfalls. Incorporating attribute measurement costs and misclassification costs into decision tree induction.

Dealing with categorical, numeric, and ordinal attributes. Dealing with missing attribute values during tree induction and instance classification.

Why a forest is better than a single tree. The random forest algorithm for classification. Why random forest works.

Required Readings

Recommended Readings

Optional Readings


Lecture 18: Linear Classifiers

Linear Classifiers. Threshold logic unit (perceptron). Linear separability. Perceptron Learning algorithm. Winner-Take-All Networks. Alternative loss functions for the perceptron. Gradient-based minimization of loss functions.
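
The perceptron learning rule converges on any linearly separable problem; a sketch learning the logical AND function (the learning rate and epoch count are arbitrary choices):

```python
# Perceptron learning on a linearly separable problem (logical AND).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [0.0, 0.0]  # weights
b = 0.0         # bias (negative threshold)
lr = 0.1        # learning rate

def predict(x):
    """Threshold logic unit: fire iff the weighted sum exceeds the threshold."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Perceptron rule: on each error, nudge the weights toward the correct side.
for _ in range(20):  # a few passes over the data suffice here
    for x, target in data:
        error = target - predict(x)
        w[0] += lr * error * x[0]
        w[1] += lr * error * x[1]
        b += lr * error

print([predict(x) for x, _ in data])
```

The same loop never terminates on a non-separable problem such as XOR, which motivates the alternative loss functions and gradient-based training discussed above.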

Required Readings

Recommended Readings


Lecture 19: Maximum Margin Classifiers and Kernel Machines

Maximum Margin Classifier (Linear Support Vector Machine). Kernel functions for handling non-linear decision surfaces. Addressing the computational and generalization challenges posed by high dimensional kernel-induced feature spaces. Kernel trick. Properties of Kernel Functions. Examples of Kernel Functions. Constructing Kernel Functions. Distinguishing good kernels from bad ones. Applications of Kernel Machines. Text classification, Image classification, etc.
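
The kernel trick can be verified numerically: a degree-2 polynomial kernel evaluated in the original space equals an inner product under an explicit higher-dimensional feature map. A small sketch:

```python
import math

def phi(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, evaluated directly in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, z)                           # cheap: 2-D dot product, squared
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # same value via 3-D features
print(implicit, round(explicit, 10))
```

Because the kernel never constructs `phi` explicitly, the same idea scales to feature spaces of very high (even infinite) dimension, which is the computational point of kernel machines.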

Required Readings

Recommended Readings

Optional Readings


Lectures 20-21. Approximating Real-Valued Functions with Neural Networks

Approximating linear functions (linear regression). Locally weighted linear regression.
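
Simple linear regression has a closed-form least-squares solution; a sketch on made-up data:

```python
import statistics

# Least-squares fit of y = a + b*x (toy data with a roughly linear trend).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

xbar, ybar = statistics.mean(xs), statistics.mean(ys)

# Closed-form slope: covariance(x, y) / variance(x).
b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))
a = ybar - b * xbar  # the fitted line passes through (xbar, ybar)
print(round(a, 3), round(b, 3))
```

Locally weighted regression reuses this machinery, but reweights each training point by its distance to the query before solving.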

Introduction to neural networks for nonlinear function approximation. Nonlinear function approximation using multi-layer neural networks. Universal function approximation theorem. The generalized delta rule (GDR) (the backpropagation learning algorithm).

The generalized delta rule (backpropagation algorithm) in practice: avoiding overfitting, choosing neuron activation functions, choosing the learning rate, choosing initial weights, speeding up learning, improving generalization, and circumventing local minima. Variations: radial basis function networks. Learning nonlinear functions by searching the space of network topologies as well as weights.

Required readings


Lecture 22. Deep Learning

Introduction to deep learning. Pros and cons. Stacked autoencoders and representation learning. Convolutional auto-encoders. Image classification using deep neural networks. Sequence classification using deep neural networks. Deep neural networks as generative models. Generative adversarial networks. Transformers for NLP and related applications. Large language models and related generative models. Hype versus reality.

Review and wrap-up.

Required readings


Lecture 23. Predictive Modeling in Practice

Predictive modeling from heterogeneous data. Predictive modeling from high-dimensional data. Feature selection. Dimensionality reduction.

Required readings


Lecture 24: Causal Modeling

What is a cause? Why study causal inference? Causation versus association: seeing versus doing. Why data alone are not always enough to draw sound causal conclusions. The causal effect defined. Pitfalls of causal inference from observational data. Randomized experiments. Causal effect estimation under identifiability assumptions. Causal graphs for expressing qualitative causal assumptions.
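
The contrast between seeing and doing can be made concrete with the backdoor adjustment formula for a single confounder; all probabilities below are illustrative:

```python
# Adjustment for a single confounder Z of X -> Y (toy numbers):
# P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, Z=z) * P(Z=z),
# which generally differs from the observational P(Y=1 | X=1).
p_z = {0: 0.5, 1: 0.5}                      # confounder distribution
p_x1_given_z = {0: 0.2, 1: 0.8}             # Z influences treatment X
p_y1_given_xz = {(1, 0): 0.9, (1, 1): 0.5,  # Z also influences outcome Y
                 (0, 0): 0.6, (0, 1): 0.2}

# "Seeing": P(Y=1 | X=1) weights Z by P(Z | X=1), which is skewed by Z's
# influence on who gets treated.
p_x1 = sum(p_x1_given_z[z] * p_z[z] for z in p_z)
p_y1_given_x1 = sum(
    p_y1_given_xz[(1, z)] * p_x1_given_z[z] * p_z[z] for z in p_z
) / p_x1

# "Doing": the backdoor adjustment weights Z by its marginal P(Z).
p_y1_do_x1 = sum(p_y1_given_xz[(1, z)] * p_z[z] for z in p_z)

print(round(p_y1_given_x1, 3), round(p_y1_do_x1, 3))
```

The two numbers disagree (0.58 versus 0.70) precisely because Z opens a backdoor path between X and Y; a randomized experiment would recover the interventional value directly.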

Required Readings


Lecture 25: Causal Modeling: Structural causal models

Interpreting causal graphs. d-separation. do operator for modeling interventions. Confounding defined in terms of causal graphs and do operator. Causal effect identifiability. do calculus. Necessary and sufficient criteria for causal effects identifiability.

Required Readings

  • Review: Lecture Slides.
  • Read: Chapter 2 and Chapter 3 (Sections 3.1-3.5), Pearl, J., Glymour, M. and Jewell, N.P. (2016). Causal Inference in Statistics: A Primer. John Wiley & Sons.


Lecture 26: Causal Modeling: Linear causal models, causal transportability

Linear causal models. When regression can and cannot be used to find causal effects. Relation between causal coefficients and regression coefficients. Identifying causal effects from observations with respect to a linear causal model. Identifiability conditions.

Transportability of causal effects and related problems.

Required Readings

  • Review: Lecture Slides.
  • Read: Chapter 2 and Chapter 3 (Section 3.8), Pearl, J., Glymour, M. and Jewell, N.P. (2016). Causal Inference in Statistics: A Primer. John Wiley & Sons.

Recommended Readings


Lecture 27: Summary: Data Science for Scientists and Scholars

Summary of the course.