Data Science for Scientists and Scholars
Study Guide
Please note that the precise schedule is subject to change. The lecture slides (and lecture notes, if any) are updated after the lecture.
Lecture 1. Data science - what, why, and how?
Transdisciplinary foundations of data science. Descriptive, predictive, and causal data science. Data science in practice. Formulating and answering questions using data. Course overview and course mechanics. Getting started with Google Colab.
Required readings
Recommended materials
Optional Materials
Lecture 2. Descriptive Statistics
Populations versus samples. Centrality measures - mean, median, mode. Measures of spread and their interpretation.
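As a quick illustration (made-up numbers, Python's standard library plus numpy; not part of the assigned materials), the measures above can be computed as follows:

```python
# Illustrative only: summary statistics for a small, made-up sample.
import statistics
import numpy as np

sample = [4.2, 5.1, 5.1, 6.3, 7.0, 8.4, 9.9]

print("mean  :", statistics.mean(sample))       # arithmetic average
print("median:", statistics.median(sample))     # middle value, robust to outliers
print("mode  :", statistics.mode(sample))       # most frequent value (5.1)
print("range :", max(sample) - min(sample))     # simplest measure of spread
print("var   :", statistics.variance(sample))   # sample variance (n - 1 denominator)
print("stdev :", statistics.stdev(sample))      # sample standard deviation
print("IQR   :", np.percentile(sample, 75) - np.percentile(sample, 25))
```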
Required Readings
- Read: Chapter 3, Sections 3.1-3.4 of Shah, Chirag (2020). A Hands-On Introduction to Data Science. Cambridge University Press.
- Read: Chapter 2 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
Optional Materials
- Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Lecture 3. Probability, Random Variables, and Distributions
Sample spaces, simple events, compound events, conditional probability, independence, Bayes rule. Probability theory as a bridge between descriptive and inferential statistics.
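A worked Bayes-rule calculation in Python, using hypothetical numbers for a diagnostic-test scenario:

```python
# Bayes rule on a hypothetical diagnostic test (all numbers are made up).
p_disease = 0.01           # prior P(D)
p_pos_given_d = 0.95       # sensitivity P(+ | D)
p_pos_given_not_d = 0.05   # false-positive rate P(+ | not D)

# Law of total probability: overall chance of a positive test
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Posterior P(D | +) by Bayes rule
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(f"P(D | +) = {p_d_given_pos:.3f}")   # ~0.161: still unlikely despite a positive test
```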
Required Readings
- Read: Chapter 2 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
Optional Materials
- Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Lecture 4. Probability Distributions: Discrete
Bernoulli, binomial, multinomial, Poisson distributions
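A minimal sketch of how these distributions can be queried with scipy.stats (arbitrary example parameters):

```python
# Discrete distributions with scipy.stats (parameters are arbitrary examples).
from scipy.stats import bernoulli, binom, poisson

print(bernoulli.pmf(1, p=0.3))      # P(success) in a single Bernoulli trial
print(binom.pmf(3, n=10, p=0.3))    # P(exactly 3 successes in 10 trials)
print(binom.cdf(3, n=10, p=0.3))    # P(at most 3 successes)
print(poisson.pmf(2, mu=4))         # P(exactly 2 events when the mean rate is 4)
print(binom.mean(n=10, p=0.3), binom.var(n=10, p=0.3))  # np and np(1 - p)
```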
Required Readings
- Read: Chapter 5, Section 5.1 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
- Read: Chapter 4 of Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Optional Materials
- Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Lecture 5. Probability Distributions: Continuous
Normal, Standard Normal distributions, Calculating Probabilities
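An illustrative sketch of normal-distribution probability calculations with scipy.stats (made-up mean and standard deviation):

```python
# Normal-distribution probabilities with scipy.stats (illustrative values).
from scipy.stats import norm

mu, sigma = 100, 15                        # a made-up measurement scale
z = (120 - mu) / sigma                     # standardize: z = (x - mu) / sigma

print(norm.cdf(120, loc=mu, scale=sigma))  # P(X <= 120)
print(norm.cdf(z))                         # same probability via the standard normal
print(1 - norm.cdf(120, loc=mu, scale=sigma))  # P(X > 120)
print(norm.ppf(0.975))                     # z cutting off the upper 2.5% (about 1.96)
```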
Required Readings
- Read: Chapter 5, Section 5.1 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
- Read: Chapter 4 of Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Optional Materials
- Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Lecture 6. Estimation
From statistics to probability - from sample to population. Sampling distributions. Point estimates. Large Sample Estimation. Error and confidence of estimates of means and proportions. Differences in means and differences in proportions.
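A minimal sketch of a large-sample confidence interval for a mean, on simulated data (z-based, 95% coverage assumed):

```python
# Large-sample 95% confidence interval for a mean (simulated data, z-based).
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)   # pretend this is the observed sample

xbar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))    # estimated standard error of the mean
z = 1.96                                          # large-sample 95% multiplier

print(f"point estimate: {xbar:.2f}")
print(f"95% CI: ({xbar - z*se:.2f}, {xbar + z*se:.2f})")
```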
Required Readings
- Read: Chapter 5, Section 5.1 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
- Read: Chapter 5 of Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Optional Materials
- Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Lecture 7. Estimation
Small Sample estimates. Confidence intervals. Elements of Hypothesis Testing. Type 1 and Type 2 errors. When do we reject a null hypothesis and what does it mean to reject a null hypothesis? Statistical Significance. p-values. Interpretation of p-values.
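For a concrete feel, a two-sample t-test on simulated data, with the p-value interpreted in the sense described above:

```python
# Two-sample t-test on simulated data; the p-value is interpreted, not "proved".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.6, scale=1.0, size=30)

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05 we reject H0 (equal means) at the 5% level. The p-value is the
# probability of data at least this extreme *assuming H0 is true*; it is not P(H0).
```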
Required Readings
- Read: Chapter 5, Section 5.1 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
Optional Materials
Lecture 8. Introduction to Python for Data Science. Guest Lecture: Neil Ashtekar
Basics of Python for Data Science. pandas, numpy, matplotlib, seaborn, scipy and other useful libraries
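A small, self-contained taste of pandas and matplotlib on a made-up table (not part of the assigned materials):

```python
# A first look at pandas + matplotlib on a tiny, made-up table.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "species": ["A", "A", "B", "B", "B"],
    "length_cm": [4.1, 4.6, 7.2, 6.9, 7.5],
})

print(df.describe())                               # count, mean, std, quartiles
print(df.groupby("species")["length_cm"].mean())   # group-wise means

df["length_cm"].hist(bins=5)                       # quick histogram
plt.xlabel("length (cm)")
plt.show()
```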
Required Readings
Recommended materials
Optional Materials
Lecture 9. Python Statistics Modules and How to Use them. Guest Lecture: Neil Ashtekar
Python libraries for Descriptive Statistics, Estimation, Hypothesis testing.
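An illustrative sketch of interval estimation and a normality check with scipy.stats (made-up measurements):

```python
# Interval estimation and a normality check with scipy.stats (made-up measurements).
import numpy as np
from scipy import stats

x = np.array([9.8, 10.2, 10.1, 9.7, 10.4, 10.0, 9.9, 10.3])

# 95% t-based confidence interval for the mean (small sample)
ci = stats.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=stats.sem(x))
print("95% t-interval for the mean:", ci)

stat, p = stats.shapiro(x)          # small-sample check of the normality assumption
print(f"Shapiro-Wilk p-value: {p:.3f}")
```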
Required Readings
Recommended materials
Optional Materials
Lecture 10. Additional Python Data Science Tools for Tabular, Text, Image, and Other Types of Data. Guest Lecture: Neil Ashtekar
Required Readings
Recommended materials
Optional Materials
Lecture 11. Assembling data for data science projects
Anatomy of a data science project. Coming up with question(s). Assembling data. Avoiding common pitfalls. Importance of digging into the story behind the data. Illustrative case studies.
Required Readings
- Review: Lecture Slides, Vasant Honavar
- Read: Chapter 3, Data Science Design Manual, S. Skiena.
Recommended materials
Optional Materials
Lecture 12. Predictive Modeling Using Machine Learning
What is machine learning? Why do we care about machine learning? What can we do with machine learning? Applications of machine learning. Types of machine learning problems. A simple machine learning algorithm - the K nearest neighbor classifier. Distance measures. K nearest neighbor regression.
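A minimal k-nearest-neighbor example with scikit-learn, using the built-in iris data purely as a stand-in dataset:

```python
# k-nearest-neighbor classification with scikit-learn (iris as a stand-in dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5, Euclidean distance by default
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```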
Required readings
Recommended readings
- Videos 1, 2 from the Introduction to Machine Learning in Python with Scikit-Learn video series, Kevin Markham.
- Chapter 1 from Muller, A., and Guido, S. (2016). Introduction to Machine Learning with Python. O'Reilly.
- Cunningham, P. and Delany, S.J., 2021. k-Nearest neighbour classifiers-A Tutorial. ACM computing surveys (CSUR), 54(6), pp.1-25.
- Slaney, M., & Casey, M. (2008). Locality-sensitive hashing for finding nearest neighbors [lecture notes]. IEEE Signal processing magazine, 25(2), 128-131.
- Reference: Scikit Learn Tutorial
Optional Readings
Lecture 13. Deeper dive into data and data representation
Types of attributes: Nominal (categorical), ordinal, real-valued. Properties of attributes: distinctiveness, order, meaningfulness of differences, meaningfulness of ratios.
Types of data: Tabular data, ordered data (e.g., text, DNA sequences, SMILES representation of molecules), graph data (e.g., social networks, molecular structures), geo-spatial data.
Probability and random variables revisited. Joint distributions of (multiple) random variables. Conditional distributions. Marginalization. Bayes rule.
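A small numpy sketch of marginalization, conditioning, and Bayes rule on a made-up 2x2 joint distribution:

```python
# Joint, marginal, and conditional distributions from a small made-up joint table.
import numpy as np

# Rows: X in {0, 1}; columns: Y in {0, 1}. Entries are P(X = x, Y = y).
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)               # marginalize out Y: P(X)
p_y = joint.sum(axis=0)               # marginalize out X: P(Y)
p_y_given_x1 = joint[1] / p_x[1]      # condition on X = 1: P(Y | X = 1)
p_x1_given_y1 = joint[1, 1] / p_y[1]  # Bayes rule outcome: P(X = 1 | Y = 1)

print(p_x, p_y, p_y_given_x1, p_x1_given_y1)
```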
Required readings
Recommended readings
Optional Readings
Lecture 14: Probabilistic Generative Models: Naive Bayes
Probabilistic generative models. Bayes Optimal classifier (Minimum error classifier).
Naive Bayes Classifier. Applications of Naive Bayes Classifiers. Tabular, Sequence, Text, and Image Data Classification. Multivariate, multinomial, and Gaussian Naive Bayes. Avoiding overfitting - robust probability estimates.
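A minimal Gaussian naive Bayes example with scikit-learn (iris data as a stand-in; the comments note the multinomial variant used for text):

```python
# Gaussian naive Bayes with scikit-learn (toy data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()                   # one class-conditional Gaussian per feature
nb.fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
# For word-count (text) features, MultinomialNB with Laplace smoothing (alpha=1)
# plays the same role and guards against zero probability estimates.
```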
Required readings
Recommended Readings
- Kang, D-K., Silvescu, A. and Honavar, V. (2006). RNBL-MN: A Recursive Naive Bayes Learner for Sequence Classification. In: Proceedings of the Tenth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006). Lecture Notes in Computer Science, Berlin: Springer-Verlag, pp. 45-54.
- Yan, C., Terribilini, M., Wu, F., Jernigan, R.L., Dobbs, D. and Honavar, V. (2006). Identifying amino acid residues involved in protein-DNA interactions from sequence. BMC Bioinformatics.
- Kang, D-K., Fuller, D., and Honavar, V. (2005). Learning Misuse and Anomaly Detectors from System Call Frequency Vector Representation. IEEE International Conference on Intelligence and Security Informatics. Lecture Notes in Computer Science, Vol. 3495, Springer-Verlag, pp. 511-516.
Optional Readings
Lecture 15: Evaluating predictive models
Evaluation of classifiers. Accuracy, Sensitivity, Specificity, Correlation Coefficient. Tradeoffs between sensitivity and specificity. When does a classifier outperform another? ROC curves. Estimating performance measures; confidence interval calculation for estimates; cross-validation based estimates of model performance; leave-one-out and bootstrap estimates of performance; comparing two models; comparing two learning algorithms.
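An illustrative sketch of cross-validation-based performance estimation with scikit-learn (logistic regression on a built-in dataset as a stand-in classifier):

```python
# Cross-validated estimates of accuracy, sensitivity, and ROC AUC (toy binary problem).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# "recall" of the positive class is sensitivity; specificity would need the
# confusion matrix (or recall computed on the negative class).
for metric in ("accuracy", "recall", "roc_auc"):
    scores = cross_val_score(clf, X, y, cv=5, scoring=metric)
    print(f"{metric:>8}: {scores.mean():.3f} +/- {scores.std():.3f}")
```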
Required Readings
Recommended readings
Optional readings
Lecture 16: Probabilistic Generative Models: Bayes Networks
Bayes Networks. Compact representation of Joint Probability distributions using Bayes Networks. Conditional independence and d-separation. Factorization of joint probability distributions based on conditional independence assumptions encoded by a Bayes Network. Probabilistic inference using Bayes Networks. Exact inference in polytrees. Approximate inference using stochastic simulation. Learning Bayes Networks from data: Learning parameters. Learning Structure. A simple algorithm using conditional independence queries.
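A tiny inference-by-enumeration sketch on a hypothetical three-variable network (all conditional probability table entries are made up):

```python
# Inference by enumeration in a tiny Bayes network:
# Rain -> WetGrass <- Sprinkler (binary variables; CPT numbers are made up).
from itertools import product

P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.3, False: 0.7}
P_wet = {  # P(WetGrass = True | Sprinkler, Rain)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(r, s, w):
    """Factorized joint: P(R) * P(S) * P(W | S, R)."""
    pw = P_wet[(s, r)] if w else 1 - P_wet[(s, r)]
    return P_rain[r] * P_sprinkler[s] * pw

# P(Rain = True | WetGrass = True), summing out Sprinkler
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(f"P(Rain | WetGrass) = {num / den:.3f}")
```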
Required readings
Recommended Readings
Optional Readings
- Bielza, Concha, and Pedro Larrañaga. Bayesian networks in neuroscience: a survey. Frontiers in Computational Neuroscience 8 (2014): 131.
- Needham, C.J., Bradford, J.R., Bulpitt, A.J. and Westhead, D.R., 2007. A primer on learning in Bayesian networks for computational biology. PLoS computational biology, 3(8), p.e129.
- Aguilera, P.A., Fernández, A., Fernández, R., Rumí, R. and Salmerón, A., 2011. Bayesian networks in environmental modelling. Environmental Modelling & Software, 26(12), pp.1376-1388.
- Arora, Paul, Devon Boyne, Justin J. Slater, Alind Gupta, Darren R. Brenner, and Marek J. Druzdzel. Bayesian networks for risk prediction using real-world data: a tool for precision medicine. Value in Health 22, no. 4 (2019): 439-445.
Lecture 17: Decision Trees and Random Forests
Modeling dependence between attributes. The decision tree classifier. Introduction to information theory. Information, entropy, mutual information, and related concepts. Algorithm for learning decision tree classifiers from data.
Pruning decision trees. Pitfalls of entropy as a splitting criterion for multi-valued splits and ways to avoid the pitfalls. Incorporating attribute measurement costs and misclassification costs into decision tree induction.
Dealing with categorical, numeric, and ordinal attributes. Dealing with missing attribute values during tree induction and instance classification.
Why a forest is better than a single tree.
The random forest algorithm for classification. Why random forest works.
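Two illustrative fragments: computing the entropy of class distributions (the impurity measure behind information gain) and fitting a random forest with scikit-learn on stand-in data:

```python
# Entropy of a class distribution and a random forest fit, side by side (toy data).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

print(entropy([5, 5]))   # 1.0 bit: maximally impure two-class node
print(entropy([9, 1]))   # ~0.47 bits: much purer, so a better split candidate

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0)  # bagged, de-correlated trees
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```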
Required Readings
Recommended Readings
Optional Readings
- Zhang, J. and Honavar, V. (2003). Learning Decision Tree Classifiers from Attribute Value Taxonomies and Partially Specified Data. In: Proceedings of the International Conference on Machine Learning (ICML-03). Washington, DC. pp. 880-887.
- Fayyad, U. and Irani, K.B. (1992). On the handling of continuous valued attributes in decision tree generation. Machine Learning, Vol. 8, pp. 87-102.
- Domingos, P. (1999). The Role of Occam's Razor in Knowledge Discovery. Data Mining and Knowledge Discovery, Vol. 3, No. 4, pp. 409-425.
- Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, Vol. 27.
- Silvescu, A., and Honavar, V. (2001). Temporal Boolean Network Models of Genetic Networks and Their Inference from Gene Expression Time Series. Complex Systems, Vol. 13, No. 1, pp. 54-.
- Codrington, C. W. and Brodley, C. E. On the Qualitative Behavior of Impurity-Based Splitting Rules: The Minima-Free Property. Tech. Rep. 97-05, Dept. of Computer Science, Cornell University.
- Atramentov, A., Leiva, H., and Honavar, V. (2003). A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments. In: Proceedings of the Thirteenth International Conference on Inductive Logic Programming. Berlin: Springer-Verlag. Lecture Notes in Computer Science, Vol. 2835, pp. 38-56.
Lecture 18: Linear Classifiers
Linear Classifiers. Threshold logic unit (perceptron).
Linear separability. Perceptron Learning algorithm. Winner-Take-All Networks.
Alternative loss functions for the perceptron. Gradient-based minimization of loss functions.
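A minimal sketch of the perceptron update rule on a linearly separable toy problem (labels coded as -1/+1; made-up points):

```python
# The perceptron update rule on a linearly separable toy problem.
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)
b = 0.0
eta = 0.1                                  # learning rate

for _ in range(20):                        # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:         # misclassified (or on the boundary)
            w += eta * yi * xi             # nudge the weights toward the example
            b += eta * yi

print("weights:", w, "bias:", b)
print("predictions:", np.sign(X @ w + b))
```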
Required Readings
Recommended Readings
Lecture 19: Maximum Margin Classifiers and Kernel Machines
Maximum Margin Classifier (Linear Support Vector Machine). Kernel functions for handling non-linear decision surfaces. Addressing the computational and generalization challenges posed by high dimensional kernel-induced feature spaces. Kernel trick. Properties of Kernel Functions. Examples of Kernel Functions. Constructing Kernel Functions. Distinguishing good kernels from bad ones. Applications of Kernel Machines. Text classification, Image classification, etc.
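An illustrative comparison of a linear SVM and an RBF-kernel SVM on data that is not linearly separable (scikit-learn's make_moons as a stand-in):

```python
# Linear versus RBF-kernel SVM on a dataset that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale")   # kernel trick: implicit feature map

print("linear kernel CV accuracy:", cross_val_score(linear_svm, X, y, cv=5).mean())
print("RBF kernel CV accuracy   :", cross_val_score(rbf_svm, X, y, cv=5).mean())
```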
Required Readings
Recommended Readings
Optional Readings
Lectures 20-21: Approximating Real-Valued Functions with Neural Networks
Approximating linear functions (linear regression).
Locally weighted linear regression.
Introduction to neural networks for nonlinear function approximation.
Nonlinear function approximation using multi-layer neural networks.
Universal function approximation theorem. The generalized delta rule (GDR) (the backpropagation learning algorithm).
Generalized delta rule (backpropagation algorithm) in practice - avoiding overfitting, choosing neuron activation functions, choosing learning rate, choosing initial weights, speeding up learning, improving generalization, circumventing local minima.
Variations - radial basis function networks. Learning nonlinear functions by searching the space of network topologies as well as weights.
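A minimal sketch of nonlinear function approximation with a small multi-layer network, using scikit-learn's MLPRegressor (which trains by backpropagation) on a noisy sine curve:

```python
# Fitting a nonlinear function with a small multi-layer network.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=500)   # noisy sine wave to approximate

net = MLPRegressor(hidden_layer_sizes=(32, 32), activation="tanh",
                   learning_rate_init=0.01, max_iter=5000, random_state=0)
net.fit(X, y)
print("training R^2:", net.score(X, y))              # close to 1 if the fit is good
```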
Required readings
Lecture 22: Deep Learning
Introduction to deep learning. Pros and cons. Stacked autoencoders and representation learning. Convolutional auto-encoders. Image classification using deep neural networks. Sequence classification using deep neural networks. Deep neural networks as generative models. Generative adversarial networks. Transformers for NLP and related applications. Large language models and related generative models. Hype versus reality.
Review and wrap-up.
Required readings
- Ng et al. (2015). Deep Learning Tutorial.
- Le, Q. (2015). A tutorial on Deep Neural Networks Part 1: Nonlinear Classifiers and The Backpropagation Algorithm
- Le, Q. (2015). A tutorial on Deep Neural Networks Part 2: Autoencoders, Convolutional Neural Networks and Recurrent Neural Networks
- Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B. and Bharath, A. (2018). Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1), pp. 53-65.
- Ghojogh, Benyamin, and Ali Ghodsi (2020). Attention mechanism, transformers, BERT, and GPT: Tutorial and survey.
- Lecture Slides, Vasant Honavar
Lecture 23: Predictive Modeling in Practice
Predictive modeling from heterogeneous data. Predictive modeling from high-dimensional data. Feature selection. Dimensionality reduction.
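An illustrative sketch of univariate feature selection and PCA with scikit-learn (a built-in dataset standing in for high-dimensional data):

```python
# Univariate feature selection and PCA on a toy, moderately high-dimensional dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)   # keep the 5 most informative features
print("selected feature indices:", selector.get_support(indices=True))

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)                         # project onto 2 principal components
print("variance explained:", pca.explained_variance_ratio_.sum())
```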
Required readings
- Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R.P., Tang, J. and Liu, H., 2017. Feature selection: A data perspective. ACM computing surveys (CSUR), 50(6), pp.1-45.
- Li, Yun, Tao Li, and Huan Liu. Recent advances in feature selection and its applications. Knowledge and Information Systems 53 (2017): 551-577.
- Van Der Maaten, L., Postma, E.O. and van den Herik, H.J., 2009. Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10(66-71), p.13.
- Review: Lecture Slides, Vasant Honavar
Lecture 24: Causal Modeling
What is a cause? Why study causal inference? Causation versus association; seeing versus doing. Why data are not always enough for drawing sound causal conclusions. Causal effect defined. Pitfalls of causal inference from observational data. Randomized experiments. Causal effect estimation under identifiability assumptions. Causal graphs for expressing qualitative causal assumptions.
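A simulation sketch of the seeing-versus-doing distinction: a confounder makes the observational regression slope overstate a made-up true causal effect, while randomizing the treatment recovers it:

```python
# Simulated confounding: the observational X-Y association overstates the causal
# effect because Z causes both X and Y (all numbers are made up).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                      # confounder
x = 2 * z + rng.normal(size=n)              # "treatment" influenced by Z
y = 1.0 * x + 3 * z + rng.normal(size=n)    # true causal effect of X on Y is 1.0

# "Seeing": naive regression of Y on X mixes the causal path with the back-door path via Z.
naive = np.polyfit(x, y, 1)[0]

# "Doing": randomize X so it no longer depends on Z.
x_rand = rng.normal(size=n)
y_rand = 1.0 * x_rand + 3 * z + rng.normal(size=n)
randomized = np.polyfit(x_rand, y_rand, 1)[0]

print(f"observational slope: {naive:.2f}   randomized slope: {randomized:.2f} (truth = 1.0)")
```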
Required Readings
- Review: Lecture Slides.
- Read: Chapter 1, Pearl, J., Glymour, M. and Jewell, N.P., 2016. Causal inference in statistics: A primer. John Wiley & Sons.
- Watch: Judea Pearl, The New Science of Cause and Effect
- Read: Hernan, M. and Robins, J.M. (2020). Chapters 1 and 2, Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
Lecture 25: Causal Modeling: Structural causal models
Interpreting causal graphs. d-separation. do operator for modeling interventions. Confounding defined in terms of causal graphs and do operator. Causal effect identifiability. do calculus. Necessary and sufficient criteria for causal effects identifiability.
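A worked back-door adjustment on a made-up discrete example, contrasting P(Y | do(X)) with the plain conditional P(Y | X):

```python
# Back-door adjustment: P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, Z=z) P(Z=z).
# All probabilities below are made up for illustration.
p_z = {0: 0.6, 1: 0.4}                        # confounder Z
p_x1_given_z = {0: 0.2, 1: 0.8}               # P(X=1 | Z=z)
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5,    # P(Y=1 | X=x, Z=z)
                 (1, 0): 0.3, (1, 1): 0.7}

# "Doing": average over the marginal distribution of Z (adjustment formula).
p_y1_do_x1 = sum(p_y1_given_xz[(1, z)] * p_z[z] for z in p_z)

# "Seeing": Z is re-weighted by how likely it is given X=1.
p_x1 = sum(p_x1_given_z[z] * p_z[z] for z in p_z)
p_y1_given_x1 = sum(p_y1_given_xz[(1, z)] * p_x1_given_z[z] * p_z[z] for z in p_z) / p_x1

print(f"P(Y=1 | do(X=1)) = {p_y1_do_x1:.3f}")
print(f"P(Y=1 | X=1)     = {p_y1_given_x1:.3f}")
```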
Required Readings
- Review: Lecture Slides.
- Read: Chapter 2, Chapter 3 (Sections 3.1-3.5) Pearl, J., Glymour, M. and Jewell, N.P., 2016. Causal inference in statistics: A primer. John Wiley & Sons.
Lecture 26: Causal Modeling: Linear causal models, causal transportability
Linear causal models. When regression can and cannot be used to find causal effects. Relation between causal coefficients and regression coefficients. Identifying causal effects from observations with respect to a linear causal model. Identifiability conditions.
Transportability of causal effects and related problems
Required Readings
- Review: Lecture Slides.
- Read: Chapter 2, Chapter 3 (Section 3.8), Pearl, J., Glymour, M. and Jewell, N.P., 2016. Causal inference in statistics: A primer. John Wiley & Sons.
Recommended Readings
Lecture 27: Summary: Data Science for Scientists and Scholars
Summary of the course.