Data Science for Scientists and Scholars
Study Guide
Please note that the precise schedule is subject to change. The lecture slides (and lecture notes, if any) are updated after the lecture.
Lecture 1. Data science - what, why, and how?
Transdisciplinary foundations of data science. Descriptive, predictive, and causal data science. Data science in practice. Formulating and answering questions using data. Course overview and course mechanics. Getting started with Google Colab.
Required readings
Recommended materials
Optional Materials
Lecture 2. Descriptive Statistics
Populations versus samples. Centrality measures - mean, median, mode. Measures of spread and their interpretation.
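As a quick illustration (made-up numbers, Python's standard library plus numpy; not part of the assigned materials), the measures above can be computed as follows:

```python
# Illustrative only: summary statistics for a small, made-up sample.
import statistics
import numpy as np

sample = [4.2, 5.1, 5.1, 6.3, 7.0, 8.4, 9.9]

print("mean  :", statistics.mean(sample))       # arithmetic average
print("median:", statistics.median(sample))     # middle value, robust to outliers
print("mode  :", statistics.mode(sample))       # most frequent value (5.1)
print("range :", max(sample) - min(sample))     # simplest measure of spread
print("var   :", statistics.variance(sample))   # sample variance (n - 1 denominator)
print("stdev :", statistics.stdev(sample))      # sample standard deviation
print("IQR   :", np.percentile(sample, 75) - np.percentile(sample, 25))
```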
Required Readings
- Read: Chapter 3, Sections 3.1-3.4 of Shah, Chirag (2020). A Hands-On Introduction to Data Science. Cambridge University Press.
- Read: Chapter 2 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
Optional Materials
- Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Lecture 3. Probability, Random Variables, and Distributions
Sample spaces, simple events, compound events, conditional probability, independence, Bayes rule. Probability theory as a bridge between descriptive and inferential statistics.
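A worked Bayes-rule calculation in Python, using hypothetical numbers for a diagnostic-test scenario:

```python
# Bayes rule on a hypothetical diagnostic test (all numbers are made up).
p_disease = 0.01           # prior P(D)
p_pos_given_d = 0.95       # sensitivity P(+ | D)
p_pos_given_not_d = 0.05   # false-positive rate P(+ | not D)

# Law of total probability: overall chance of a positive test
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Posterior P(D | +) by Bayes rule
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(f"P(D | +) = {p_d_given_pos:.3f}")   # ~0.161: still unlikely despite a positive test
```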
Required Readings
- Read: Chapter 2 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
Optional Materials
- Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Lecture 4. Probability Distributions: Discrete
Bernoulli, binomial, multinomial, Poisson distributions
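A minimal sketch of how these distributions can be queried with scipy.stats (arbitrary example parameters):

```python
# Discrete distributions with scipy.stats (parameters are arbitrary examples).
from scipy.stats import bernoulli, binom, poisson

print(bernoulli.pmf(1, p=0.3))      # P(success) in a single Bernoulli trial
print(binom.pmf(3, n=10, p=0.3))    # P(exactly 3 successes in 10 trials)
print(binom.cdf(3, n=10, p=0.3))    # P(at most 3 successes)
print(poisson.pmf(2, mu=4))         # P(exactly 2 events when the mean rate is 4)
print(binom.mean(n=10, p=0.3), binom.var(n=10, p=0.3))  # np and np(1 - p)
```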
Required Readings
- Read: Chapter 5, Section 5.1 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
- Read: Chapter 4 of Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Optional Materials
- Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Lecture 5. Probability Distributions: Continuous
Normal, Standard Normal distributions, Calculating Probabilities
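An illustrative sketch of normal-distribution probability calculations with scipy.stats (made-up mean and standard deviation):

```python
# Normal-distribution probabilities with scipy.stats (illustrative values).
from scipy.stats import norm

mu, sigma = 100, 15                        # a made-up measurement scale
z = (120 - mu) / sigma                     # standardize: z = (x - mu) / sigma

print(norm.cdf(120, loc=mu, scale=sigma))  # P(X <= 120)
print(norm.cdf(z))                         # same probability via the standard normal
print(1 - norm.cdf(120, loc=mu, scale=sigma))  # P(X > 120)
print(norm.ppf(0.975))                     # z cutting off the upper 2.5% (about 1.96)
```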
Required Readings
- Read: Chapter 5, Section 5.1 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
- Read: Chapter 4 of Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Optional Materials
- Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Lecture 6. Estimation
From statistics to probability - from sample to population. Sampling distributions. Point estimates. Large Sample Estimation. Error and confidence of estimates of means and proportions. Differences in means and differences in proportions.
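A minimal sketch of a large-sample confidence interval for a mean, on simulated data (z-based, 95% coverage assumed):

```python
# Large-sample 95% confidence interval for a mean (simulated data, z-based).
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)   # pretend this is the observed sample

xbar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))    # estimated standard error of the mean
z = 1.96                                          # large-sample 95% multiplier

print(f"point estimate: {xbar:.2f}")
print(f"95% CI: ({xbar - z*se:.2f}, {xbar + z*se:.2f})")
```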
Required Readings
- Read: Chapter 5, Section 5.1 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
- Read: Chapter 5 of Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Optional Materials
- Reference: Kaptein, M. and van den Heuvel, E. Statistics for Data Scientists. Springer.
Lecture 7. Estimation
Small Sample estimates. Confidence intervals. Elements of Hypothesis Testing. Type 1 and Type 2 errors. When do we reject a null hypothesis and what does it mean to reject a null hypothesis? Statistical Significance. p-values. Interpretation of p-values.
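For a concrete feel, a two-sample t-test on simulated data, with the p-value interpreted in the sense described above:

```python
# Two-sample t-test on simulated data; the p-value is interpreted, not "proved".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.6, scale=1.0, size=30)

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05 we reject H0 (equal means) at the 5% level. The p-value is the
# probability of data at least this extreme *assuming H0 is true*; it is not P(H0).
```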
Required Readings
- Read: Chapter 5, Section 5.1 of Skiena, Steve (2017). Data Science Design Manual. Springer
- Lecture Slides, Vasant Honavar
Recommended readings
Optional Materials
Lecture 8. Introduction to Python for Data Science. Guest Lecture: Neil Ashtekar
Basics of Python for Data Science. pandas, numpy, matplotlib, seaborn, scipy and other useful libraries
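A small, self-contained taste of pandas and matplotlib on a made-up table (not part of the assigned materials):

```python
# A first look at pandas + matplotlib on a tiny, made-up table.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "species": ["A", "A", "B", "B", "B"],
    "length_cm": [4.1, 4.6, 7.2, 6.9, 7.5],
})

print(df.describe())                               # count, mean, std, quartiles
print(df.groupby("species")["length_cm"].mean())   # group-wise means

df["length_cm"].hist(bins=5)                       # quick histogram
plt.xlabel("length (cm)")
plt.show()
```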
Required Readings
Recommended materials
Optional Materials
Lecture 9. Python Statistics Modules and How to Use them. Guest Lecture: Neil Ashtekar
Python libraries for Descriptive Statistics, Estimation, Hypothesis testing.
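An illustrative sketch of interval estimation and a normality check with scipy.stats (made-up measurements):

```python
# Interval estimation and a normality check with scipy.stats (made-up measurements).
import numpy as np
from scipy import stats

x = np.array([9.8, 10.2, 10.1, 9.7, 10.4, 10.0, 9.9, 10.3])

# 95% t-based confidence interval for the mean (small sample)
ci = stats.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=stats.sem(x))
print("95% t-interval for the mean:", ci)

stat, p = stats.shapiro(x)          # small-sample check of the normality assumption
print(f"Shapiro-Wilk p-value: {p:.3f}")
```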
Required Readings
Recommended materials
Optional Materials
Lecture 10. Additional Python Data Science Tools for Tabular, Text, Image, and Other Types of Data. Guest Lecture: Neil Ashtekar
Required Readings
Recommended materials
Optional Materials
Lecture 11. Assembling data for data science projects
Anatomy of a data science project. Coming up with question(s). Assembling data. Avoiding common pitfalls. Importance of digging into the story behind the data. Illustrative case studies.
Required Readings
- Review: Lecture Slides, Vasant Honavar
- Read: Chapter 3, Data Science Design Manual, S. Skiena.
Recommended materials
Optional Materials
Lecture 12. Predictive Modeling Using Machine Learning
What is machine learning? Why do we care about machine learning? What can we do with machine learning? Applications of machine learning. Types of machine learning problems. A simple machine learning algorithm - the K nearest neighbor classifier. Distance measures. K nearest neighbor regression.
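A minimal k-nearest-neighbor example with scikit-learn, using the built-in iris data purely as a stand-in dataset:

```python
# k-nearest-neighbor classification with scikit-learn (iris as a stand-in dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5, Euclidean distance by default
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```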
Required readings
Recommended readings
- Videos 1, 2 from the Introduction to Machine Learning in Python with Scikit-Learn video series, Kevin Markham.
- Chapter 1 from Muller, A., and Guido, S. (2016). Introduction to Machine Learning with Python. O'Reilly.
- Cunningham, P. and Delany, S.J., 2021. k-Nearest neighbour classifiers-A Tutorial. ACM computing surveys (CSUR), 54(6), pp.1-25.
- Slaney, M., & Casey, M. (2008). Locality-sensitive hashing for finding nearest neighbors [lecture notes]. IEEE Signal processing magazine, 25(2), 128-131.
- Reference: Scikit Learn Tutorial
Optional Readings
Lecture 13. Deeper dive into data and data representation
Types of attributes: Nominal (categorical), ordinal, real-valued. Properties of attributes: distinctiveness, order, meaningfulness of differences, meaningfulness of ratios.
Types of data: Tabular data, ordered data (e.g., text, DNA sequences, SMILES representation of molecules), graph data (e.g., social networks, molecular structures), geo-spatial data.
Probability and random variables revisited. Joint distributions of (multiple) random variables. Conditional distributions. Marginalization. Bayes rule.
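A small numpy sketch of marginalization, conditioning, and Bayes rule on a made-up 2x2 joint distribution:

```python
# Joint, marginal, and conditional distributions from a small made-up joint table.
import numpy as np

# Rows: X in {0, 1}; columns: Y in {0, 1}. Entries are P(X = x, Y = y).
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)               # marginalize out Y: P(X)
p_y = joint.sum(axis=0)               # marginalize out X: P(Y)
p_y_given_x1 = joint[1] / p_x[1]      # condition on X = 1: P(Y | X = 1)
p_x1_given_y1 = joint[1, 1] / p_y[1]  # Bayes rule outcome: P(X = 1 | Y = 1)

print(p_x, p_y, p_y_given_x1, p_x1_given_y1)
```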
Required readings
Recommended readings
Optional Readings
Lecture 14: Probabilistic Generative Models: Naive Bayes
Probabilistic generative models. Bayes Optimal classifier (Minimum error classifier).
Naive Bayes Classifier. Applications of Naive Bayes Classifiers. Tabular, Sequence, Text, and Image Data Classification. Multivariate, multinomial, and Gaussian Naive Bayes. Avoiding overfitting - robust probability estimates.
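A minimal Gaussian naive Bayes example with scikit-learn (iris data as a stand-in; the comments note the multinomial variant used for text):

```python
# Gaussian naive Bayes with scikit-learn (toy data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()                   # one class-conditional Gaussian per feature
nb.fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
# For word-count (text) features, MultinomialNB with Laplace smoothing (alpha=1)
# plays the same role and guards against zero probability estimates.
```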
Required readings
Recommended Readings
- Kang, D-K., Silvescu, A. and Honavar, V. (2006). RNBL-MN: A Recursive Naive Bayes Learner for Sequence Classification. In: Proceedings of the Tenth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006). Lecture Notes in Computer Science, Berlin: Springer-Verlag, pp. 45-54.
- Yan, C., Terribilini, M., Wu, F., Jernigan, R.L., Dobbs, D. and Honavar, V. (2006). Identifying amino acid residues involved in protein-DNA interactions from sequence. BMC Bioinformatics.
- Kang, D-K., Fuller, D., and Honavar, V. (2005). Learning Misuse and Anomaly Detectors from System Call Frequency Vector Representation. IEEE International Conference on Intelligence and Security Informatics. Lecture Notes in Computer Science, Vol. 3495, Springer-Verlag, pp. 511-516.
Optional Readings
Lecture 15: Evaluating predictive models
Evaluation of classifiers. Accuracy, Sensitivity, Specificity, Correlation Coefficient. Tradeoffs between sensitivity and specificity. When does a classifier outperform another? ROC curves. Estimating performance measures; confidence interval calculation for estimates; cross-validation based estimates of model performance; leave-one-out and bootstrap estimates of performance; comparing two models; comparing two learning algorithms.
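An illustrative sketch of cross-validation-based performance estimation with scikit-learn (logistic regression on a built-in dataset as a stand-in classifier):

```python
# Cross-validated estimates of accuracy, sensitivity, and ROC AUC (toy binary problem).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# "recall" of the positive class is sensitivity; specificity would need the
# confusion matrix (or recall computed on the negative class).
for metric in ("accuracy", "recall", "roc_auc"):
    scores = cross_val_score(clf, X, y, cv=5, scoring=metric)
    print(f"{metric:>8}: {scores.mean():.3f} +/- {scores.std():.3f}")
```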
Required Readings
Recommended readings
Optional readings
Lecture 16: Probabilistic Generative Models: Bayes Networks
Bayes Networks. Compact representation of Joint Probability distributions using Bayes Networks. Conditional independence and d-separation. Factorization of joint probability distributions based on conditional independence assumptions encoded by a Bayes Network. Probabilistic inference using Bayes Networks. Exact inference in polytrees. Approximate inference using stochastic simulation. Learning Bayes Networks from data: Learning parameters. Learning Structure. A simple algorithm using conditional independence queries.
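A tiny inference-by-enumeration sketch on a hypothetical three-variable network (all conditional probability table entries are made up):

```python
# Inference by enumeration in a tiny Bayes network:
# Rain -> WetGrass <- Sprinkler (binary variables; CPT numbers are made up).
from itertools import product

P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.3, False: 0.7}
P_wet = {  # P(WetGrass = True | Sprinkler, Rain)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(r, s, w):
    """Factorized joint: P(R) * P(S) * P(W | S, R)."""
    pw = P_wet[(s, r)] if w else 1 - P_wet[(s, r)]
    return P_rain[r] * P_sprinkler[s] * pw

# P(Rain = True | WetGrass = True), summing out Sprinkler
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(f"P(Rain | WetGrass) = {num / den:.3f}")
```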
Required readings
Recommended Readings
Optional Readings
- Bielza, Concha, and Pedro Larrañaga. Bayesian networks in neuroscience: a survey. Frontiers in Computational Neuroscience 8 (2014): 131.
- Needham, C.J., Bradford, J.R., Bulpitt, A.J. and Westhead, D.R., 2007. A primer on learning in Bayesian networks for computational biology. PLoS computational biology, 3(8), p.e129.
- Aguilera, P.A., Fernández, A., Fernández, R., Rumí, R. and Salmerón, A., 2011. Bayesian networks in environmental modelling. Environmental Modelling & Software, 26(12), pp.1376-1388.
- Arora, Paul, Devon Boyne, Justin J. Slater, Alind Gupta, Darren R. Brenner, and Marek J. Druzdzel. Bayesian networks for risk prediction using real-world data: a tool for precision medicine. Value in Health 22, no. 4 (2019): 439-445.
Lecture 17: Decision Trees and Random Forests
Modeling dependence between attributes. The decision tree classifier. Introduction to information theory. Information, entropy, mutual information, and related concepts. Algorithm for learning decision tree classifiers from data.
Pruning decision trees. Pitfalls of entropy as a splitting criterion for multi-valued splits and ways to avoid the pitfalls. Incorporating attribute measurement costs and misclassification costs into decision tree induction.
Dealing with categorical, numeric, and ordinal attributes. Dealing with missing attribute values during tree induction and instance classification.
Why a forest is better than a single tree.
The random forest algorithm for classification. Why random forest works.
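Two illustrative fragments: computing the entropy of class distributions (the impurity measure behind information gain) and fitting a random forest with scikit-learn on stand-in data:

```python
# Entropy of a class distribution and a random forest fit, side by side (toy data).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

print(entropy([5, 5]))   # 1.0 bit: maximally impure two-class node
print(entropy([9, 1]))   # ~0.47 bits: much purer, so a better split candidate

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0)  # bagged, de-correlated trees
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```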
Required Readings
Recommended Readings
Optional Readings
- Zhang, J. and Honavar, V. (2003). Learning Decision Tree Classifiers from Attribute Value Taxonomies and Partially Specified Data. In: Proceedings of the International Conference on Machine Learning (ICML-03). Washington, DC. pp. 880-887.
- Fayyad, U. and Irani, K.B. (1992). On the handling of continuous valued attributes in decision tree generation. Machine Learning, Vol. 8, pp. 87-102.
- Domingos, P. (1999). The Role of Occam's Razor in Knowledge Discovery. Data Mining and Knowledge Discovery, Vol. 3, No. 4, pp. 409-425.
- Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, Vol. 27.
- Silvescu, A., and Honavar, V. (2001). Temporal Boolean Network Models of Genetic Networks and Their Inference from Gene Expression Time Series. Complex Systems, Vol. 13, No. 1, pp. 54-.
- Codrington, C. W. and Brodley, C. E. On the Qualitative Behavior of Impurity-Based Splitting Rules: The Minima-Free Property. Tech. Rep. 97-05, Dept. of Computer Science, Cornell University.
- Atramentov, A., Leiva, H., and Honavar, V. (2003). A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments. In: Proceedings of the Thirteenth International Conference on Inductive Logic Programming. Berlin: Springer-Verlag. Lecture Notes in Computer Science, Vol. 2835, pp. 38-56.
Lecture 18: Linear Classifiers
Linear Classifiers. Threshold logic unit (perceptron).
Linear separability. Perceptron Learning algorithm. Winner-Take-All Networks.
Alternative loss functions for the perceptron. Gradient-based minimization of loss functions.
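A minimal sketch of the perceptron update rule on a linearly separable toy problem (labels coded as -1/+1; made-up points):

```python
# The perceptron update rule on a linearly separable toy problem.
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)
b = 0.0
eta = 0.1                                  # learning rate

for _ in range(20):                        # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:         # misclassified (or on the boundary)
            w += eta * yi * xi             # nudge the weights toward the example
            b += eta * yi

print("weights:", w, "bias:", b)
print("predictions:", np.sign(X @ w + b))
```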
Required Readings
Recommended Readings
Lecture 19: Maximum Margin Classifiers and Kernel Machines
Maximum Margin Classifier (Linear Support Vector Machine). Kernel functions for handling non-linear decision surfaces. Addressing the computational and generalization challenges posed by high dimensional kernel-induced feature spaces. Kernel trick. Properties of Kernel Functions. Examples of Kernel Functions. Constructing Kernel Functions. Distinguishing good kernels from bad ones. Applications of Kernel Machines. Text classification, Image classification, etc.
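An illustrative comparison of a linear SVM and an RBF-kernel SVM on data that is not linearly separable (scikit-learn's make_moons as a stand-in):

```python
# Linear versus RBF-kernel SVM on a dataset that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale")   # kernel trick: implicit feature map

print("linear kernel CV accuracy:", cross_val_score(linear_svm, X, y, cv=5).mean())
print("RBF kernel CV accuracy   :", cross_val_score(rbf_svm, X, y, cv=5).mean())
```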
Required Readings
Recommended Readings
Optional Readings
Lectures 20-21: Approximating Real-Valued Functions with Neural Networks
Approximating linear functions (linear regression).
Locally weighted linear regression.
Introduction to neural networks for nonlinear function approximation.
Nonlinear function approximation using multi-layer neural networks.
Universal function approximation theorem. The generalized delta rule (GDR) (the backpropagation learning algorithm).
Generalized delta rule (backpropagation algorithm) in practice - avoiding overfitting, choosing neuron activation functions, choosing learning rate, choosing initial weights, speeding up learning, improving generalization, circumventing local minima.
Variations - radial basis function networks. Learning nonlinear functions by searching the space of network topologies as well as weights.
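A minimal sketch of nonlinear function approximation with a small multi-layer network, using scikit-learn's MLPRegressor (which trains by backpropagation) on a noisy sine curve:

```python
# Fitting a nonlinear function with a small multi-layer network.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=500)   # noisy sine wave to approximate

net = MLPRegressor(hidden_layer_sizes=(32, 32), activation="tanh",
                   learning_rate_init=0.01, max_iter=5000, random_state=0)
net.fit(X, y)
print("training R^2:", net.score(X, y))              # close to 1 if the fit is good
```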
Required readings
Lecture 22: Deep Learning
Introduction to deep learning. Pros and cons. Stacked autoencoders and representation learning. Convolutional auto-encoders. Image classification using deep neural networks. Sequence classification using deep neural networks. Deep neural networks as generative models. Generative adversarial networks. Transformers for NLP and related applications. Large language models and related generative models. Hype versus reality.
Review and wrap-up.
Required readings
- Ng et al. (2015). Deep Learning Tutorial.
- Le, Q. (2015). A tutorial on Deep Neural Networks Part 1: Nonlinear Classifiers and The Backpropagation Algorithm
- Le, Q. (2015). A tutorial on Deep Neural Networks Part 2: Autoencoders, Convolutional Neural Networks and Recurrent Neural Networks
- Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B. and Bharath, A. (2018). Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1), pp. 53-65.
- Ghojogh, Benyamin, and Ali Ghodsi (2020). Attention mechanism, transformers, BERT, and GPT: Tutorial and survey.
- Lecture Slides, Vasant Honavar
Lecture 23: Predictive Modeling in Practice
Predictive modeling from heterogeneous data. Predictive modeling from high-dimensional data. Feature selection. Dimensionality reduction.
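An illustrative sketch of univariate feature selection and PCA with scikit-learn (a built-in dataset standing in for high-dimensional data):

```python
# Univariate feature selection and PCA on a toy, moderately high-dimensional dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)   # keep the 5 most informative features
print("selected feature indices:", selector.get_support(indices=True))

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)                         # project onto 2 principal components
print("variance explained:", pca.explained_variance_ratio_.sum())
```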
Required readings
- Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R.P., Tang, J. and Liu, H., 2017. Feature selection: A data perspective. ACM computing surveys (CSUR), 50(6), pp.1-45.
- Li, Yun, Tao Li, and Huan Liu. Recent advances in feature selection and its applications. Knowledge and Information Systems 53 (2017): 551-577.
- Van Der Maaten, L., Postma, E.O. and van den Herik, H.J., 2009. Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10(66-71), p.13.
- Review: Lecture Slides, Vasant Honavar
Lecture 24: Causal Modeling
What is a cause? Why study causal inference? Causation versus association; seeing versus doing. Why data are not always enough for drawing sound causal conclusions. Causal effect defined. Pitfalls of causal inference from observational data. Randomized experiments. Causal effect estimation under identifiability assumptions. Causal graphs for expressing qualitative causal assumptions.
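A simulation sketch of the seeing-versus-doing distinction: a confounder makes the observational regression slope overstate a made-up true causal effect, while randomizing the treatment recovers it:

```python
# Simulated confounding: the observational X-Y association overstates the causal
# effect because Z causes both X and Y (all numbers are made up).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                      # confounder
x = 2 * z + rng.normal(size=n)              # "treatment" influenced by Z
y = 1.0 * x + 3 * z + rng.normal(size=n)    # true causal effect of X on Y is 1.0

# "Seeing": naive regression of Y on X mixes the causal path with the back-door path via Z.
naive = np.polyfit(x, y, 1)[0]

# "Doing": randomize X so it no longer depends on Z.
x_rand = rng.normal(size=n)
y_rand = 1.0 * x_rand + 3 * z + rng.normal(size=n)
randomized = np.polyfit(x_rand, y_rand, 1)[0]

print(f"observational slope: {naive:.2f}   randomized slope: {randomized:.2f} (truth = 1.0)")
```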
Required Readings
- Review: Lecture Slides.
- Read: Chapter 1, Pearl, J., Glymour, M. and Jewell, N.P., 2016. Causal inference in statistics: A primer. John Wiley & Sons.
- Watch: Judea Pearl, The New Science of Cause and Effect
- Read: Hernan, M. and Robins, J.M. (2020). Chapters 1 and 2, Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
Lecture 25: Causal Modeling: Structural causal models
Interpreting causal graphs. d-separation. do operator for modeling interventions. Confounding defined in terms of causal graphs and do operator. Causal effect identifiability. do calculus. Necessary and sufficient criteria for causal effects identifiability.
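A worked back-door adjustment on a made-up discrete example, contrasting P(Y | do(X)) with the plain conditional P(Y | X):

```python
# Back-door adjustment: P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, Z=z) P(Z=z).
# All probabilities below are made up for illustration.
p_z = {0: 0.6, 1: 0.4}                        # confounder Z
p_x1_given_z = {0: 0.2, 1: 0.8}               # P(X=1 | Z=z)
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5,    # P(Y=1 | X=x, Z=z)
                 (1, 0): 0.3, (1, 1): 0.7}

# "Doing": average over the marginal distribution of Z (adjustment formula).
p_y1_do_x1 = sum(p_y1_given_xz[(1, z)] * p_z[z] for z in p_z)

# "Seeing": Z is re-weighted by how likely it is given X=1.
p_x1 = sum(p_x1_given_z[z] * p_z[z] for z in p_z)
p_y1_given_x1 = sum(p_y1_given_xz[(1, z)] * p_x1_given_z[z] * p_z[z] for z in p_z) / p_x1

print(f"P(Y=1 | do(X=1)) = {p_y1_do_x1:.3f}")
print(f"P(Y=1 | X=1)     = {p_y1_given_x1:.3f}")
```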
Required Readings
- Review: Lecture Slides.
- Read: Chapter 2, Chapter 3 (Sections 3.1-3.5) Pearl, J., Glymour, M. and Jewell, N.P., 2016. Causal inference in statistics: A primer. John Wiley & Sons.
Lecture 26: Causal Modeling: Linear causal models, causal transportability
Linear causal models. When regression can and cannot be used to find causal effects. Relation between causal coefficients and regression coefficients. Identifying causal effects from observations with respect to a linear causal model. Identifiability conditions.
Transportability of causal effects and related problems
Required Readings
- Review: Lecture Slides.
- Read: Chapter 2, Chapter 3 (Section 3.8), Pearl, J., Glymour, M. and Jewell, N.P., 2016. Causal inference in statistics: A primer. John Wiley & Sons.
Recommended Readings
Lecture 27: Summary: Data Science for Scientists and Scholars
Summary of the course.