Principles of Machine Learning

COLLEGE OF
INFORMATION SCIENCES AND TECHNOLOGY

Weekly Study Guide

Course Materials

AI Resources

Quick Links

Principles of Machine Learning

Study Guide

Please Note: Lecture notes will be updated after the lecture.

Week 1 (January 9, 2017)

Overview of the course. Computational models of intelligence. Overview of machine learning. Why should machines learn? Operational definition of learning. Taxonomy of machine learning.

Review of probability theory and random variables. Probability spaces. Ontological and epistemological commitments of probabilistic representations of knowledge. Bayesian (subjective view of probability) -- Probabilities as measures of belief conditioned on the agent's knowledge. Possible world interpretation of probability. Axioms of probability. Conditional probability. Bayes theorem. Random Variables. Discrete Random Variables as functions from event spaces to Value sets. Possible world interpretation of random variables. Review of probability theory, random variables, and related topics (continued). Joint Probability distributions. Conditional Probability Distributions. Conditional Independence of Random variables. Pair-wise independence and independence.

Required readings

Overview of the Course.
Course Policies
Lecture slides, Vasant Honavar
(Optional, for those needing a refresher on probability) Chapters 1 and 4 from Introduction to Probability -- Charles Grinstead and Laurie Snell.
Sections 1.1, 1.2 Introduction to Machine Learning, Alex Smola and Swaminathan Vishwanathan, Cambridge University Press, 2008.
Does Machine Learning Really Work?, Tom Mitchell.

Recommended Readings

Overview of Artificial Intelligence (PDF), Vasant Honavar.
Chapters 1 and 2 from Probability -- The Logic of Science by E.T. Jaynes.
An article on Probability Theory by Tom Loredo
Automated Learning and Discovery: State-Of-The-Art and Research Topics in a Rapidly Growing Field Sebastian Thrun, Christos Faloutsos, Tom Mitchell and Larry Wasserman. Technical Report CMU-CS-CALD-98-100. 1998.
Probabilistic machine learning and artificial intelligence. Ghahramani, Z. Nature Nature 521, 452–459 (28 May 2015) doi:10.1038/nature1454

Additional Information

AAAI Machine Learning Topics Page
Jaynes, E.T. Probability Theory: The Logic of Science, Cambridge University Press, 2003.
Cox, R.T. The Algebra of Probable Inference, The Johns Hopkins Press, 1961.
Boole, G. The Laws of Thought, (First published: 1854). Prometheus Books, 2003.
Feller, W. An Introduction to Probability Theory and its Applications. Vols 1, 2. New York: Wiley. 1968.

Week 2 (Beginning January 16, 2017)

Decision theoretic models of classification. Bayes optimal classifier. Naive Bayes Classifier. Applications of Naive Bayes Classifiers - Sequence and Text Classification. Maximum Likelihood Probability Estimation. Properties of Maximum Likelihood Estimators. Limitations of Maximum Likelihood Estimators. Bayesian Estimation. Conjugate Priors. Bayesian estimation in the multinomial case using Dirichlet priors. Maximum A posteriori Estimation. Representative applications of Naive Bayes classifiers.

Evaluation of classifiers. Accuracy, Precision, Recall, Correlation Coefficient, ROC curves.

Evaluation of classifiers -- estimation of performance measures; confidence interval calculation for estimates; cross-validation based estimates of hypothesis performance; leave-one-out and bootstrap estimates of performance; comparing two hypotheses; hypothesis testing; comparing two learning algorithms.

Required readings

Section 1.3.1 from Introduction to Machine Learning, Alex Smola and Swaminathan Vishwanathan, Cambridge University Press, 2008.
Sections 9.1-9.4 Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.
Lecture Slides. Vasant Honavar
Baldi, P., Brunak, S., Chauvin, Y. and Nielsen, H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics Vol. 16. pp. 412-424.
Confidence Interval and Hypothesis Testing.
Kochanski, G. (1995). Confidence Intervals and Hypothesis Testing.
F Provost, T Fawcett, R Kohavi (1998). A case against accuracy estimation of machine learning algorithms. In: proceedings of the fifteenth International Conference on Machine Learning.
Fawcett, T. (2003) ROC Graphs: Notes and Practical Considerations for Researchers HP Labs Tech. Report HPL2003-4.
Hand, D. Measuring classifier performance: a coherent alternative to the area under the ROC curveMachine Learning, Vol. 77, pp. 103-123.

Recommended Readings

P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103--130, 1997.
Rish, I. An Empirical Study of Naive Bayes Classifier, In: Proc. ICML 2001.
D. D. Lewis. Naive Bayes at forty: The independence assumption in information retrieval. In ECML-98: Proceedings of the Tenth European Conference on Machine Learning, pages 4--15, Chemnitz, Germany, April 1998. Springer.
Zhang, J., Kang, D-K., Silvesu, A. and Honavar, V. (2006). Learning Compact and Accurate Naive Bayes Classifiers from Attribute Value Taxonomies and Data Journal of Knowledge and Information Systems.
McCallum, A. and Nigam, K. A Comparison of Event Models for Naive Bayes Text Classification.. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48. Technical Report WS-98-05. AAAI Press. 1998.
Jason D. M. Rennie, Lawrence Shih, Jaime Teevan and David R. Karger Tackling the Poor Assumptions of Naive Bayes Text Classifiers Proceedings of the Twentieth International Conference on Machine Learning. 2003.
Kang, D-K., Silvescu, A. and Honavar, V. (2006). RNBL-MN: A Recursive Naive Bayes Learner for Sequence Classification. In: Proceedings of the Tenth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006). Lecture Notes in Computer Science.. Berlin: Springer-Verlag.

Additional Information

Langley, P., Iba, W., and Thompson, K. (1992). An Analysis of Naive Bayes Classifier. In: Proceedings of AAAI. 1992.
Langley, P. and Sage, S. (1999). Tractable average-case analysis of naive Bayesian classifiers.. Proceedings of the Sixteenth International Conference on Machine Learning (pp. 220-228). Bled, Slovenia: Morgan Kaufman.
Thorsten Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical Report CMU-CS-96-118, School of Computer Science, Carnegie Mellon University, March 1996.
Yang, Y. and G. I. Webb (2003). On Why Discretization Works for Naive-Bayes Classifiers. In Proceedings of the 16th Australian Conference on AI (AI 03)Lecture Notes AI 2903, pages 440-452. Berlin: Springer-Verlag.
George H. John, Pat Langley, Estimating Continuous Distributions in Bayesian Classifiers Proceedings of the 1995 Conference on Machine Learning.
Susana Eyheramendy, David Lewis, David Madigan On the Naive Bayes Model for Text Categorization. In: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics., Bishop, C.M. and Frey, B. (Ed). 2003.

Week 3 (Beginning Jan 23 2017)

Introduction to Artificial Neural Networks and Linear Discriminant Functions. Threshold logic unit (perceptron) and the associated hypothesis space. Connection with Logic and Geometry. Weight space and pattern space representations of perceptrons. Linear separability and related concepts. Perceptron Learning algorithm and its variants. Convergence properties of perceptron algorithm. Winner-Take-All Networks.

Required Readings

Lecture Slides. Vasant Honavar
Section 4.1 from Bishop, C. (2006). Pattern Recognition and Machine Learning.
Honavar, V. Threshold Logic Units.
Honavar, V. Perceptron Learning Algorithm.
Honavar, V. Multi-category Generalizations of Perceptron Algorithm.

Recommended Readings

Dietterich, T. (1998). Approximate Statistical Tests for Comparing Supervised Classification Algorithms Neural Computation. 10(7):1895-1923.
Adam J. Grove, Nick Littlestone, and Dale Schuurmans. General convergence results for linear discriminant updates. In COLT-97, pages 171--183, 1997.
Welling, M. (2005). Fisher Linear Discriminant Analysis.

Additional Information

Nilsson, N. J. Mathematical Foundations of Learning Machines. Palo Alto, CA: Morgan Kaufmann (1992).
Minsky, M. amd Papert, S. Perceptrons: Introduction to Computational Geometry. Cambridge, MA: MIT Press (1988).
McCulloch, W. Embodiments of Mind. Cambridge, MA: MIT Press.

Week 4 (Beginning January 30, 2017)

Generative versus Discriminative Models for Classification. Bayesian Framework for classification revisited. Naive Bayes classifier as a generative model. Relationship between generative models and linear classifiers. Additional examples of generative models. Generative models from the exponential family of distributions. Generative models versus discriminative models for classification.

Required Readings

Lecture Slides. Vasant Honavar
Sections 4.2 and 4.3 from Bishop, C. (2006), Pattern Recognition and Machine Learning.
Section 8.5-8.8, Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.
Rubinstein, D and Hastie, T. Discriminative vs Informative Learning. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1997.
Mitchell, T. (2005). Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression, Draft book chapter.

Recommended Readings

Rubinstein, D and Hastie, T. Discriminative vs Informative Learning. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1997.
Bouchard, G. and Triggs, B. (2004). The tradeoff between Generative and Discriminative Classifiers, Proceedings of Computational Statistics (Compstat 04).
Raina, R., Shen, Y., Ng, A., and McCallum, A. (2003). Classification with Hybrid Generative/Discriminative Models. In Proceedings of the IEEE Conference on Neural Information Systems (NIPS 2003).

Additional Information

Yakhnenko, O., Silvescu, A. and Honavar, V. (2005). Discriminatively Trained Markov Models for Sequence ClassificationIn: IEEE Conference on Data Mining (ICDM 2005). Houston, Texas. IEEE Press. pp. 498-505.
Lasserre, J., C. M. Bishop, and T. Minka (2006). Principled hybrids of generative and discriminative models. In: Proceedings 2006 IEEE Conference on Computer Vision and Pattern Recognition, New York.
Ulusoy, I. and C. M. Bishop (2005b). Generative versus discriminative models for object recognition. In Proceedings IEEE International Conference on Computer Vision and Pattern Recognition, CVPR., San Diego.
Wallach, H. M. (2004). Conditional Random Fields: An Introduction
Lafferty, J., McCallum, A. Pereira, F. (2000). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. McCallum says: "Don't bother reading the section on parameter estimation---use BFGS instead of Iterative Scaling; e.g. see [McCallum UAI 2003]")
Ng, A. and Jordan, M. (2002) On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes, Proceedings of the IEEE Conference on Neural Information Systems (NIPS 2002).

Week 5 (Beginning February 6, 2017)

Topics in Computational Learning Theory

Probably Approximately Correct (PAC) Learning Model. Efficient PAC learnability. Sample Complexity of PAC Learning in terms of cardinality of hypothesis space (for finite hypothesis classes). Some Concept Classes that are easy to learn within the PAC setting.

Efficiently PAC learnable concept classes. Sufficient conditions for efficient PAC learnability. Some concept classes that are not efficiently learnable in the PAC setting.

Required readings

Lecture Slides. V. Honavar
Overview of the Probably Approximately Correct (PAC) Learning Framework. D. Haussler, 1995.
Kearns, M. 1998. Efficient Noise Tolerant Learning from Statistical Queries. Journal of the ACM. Vol. 45, pp. 983-1006.

Recommended Readings

Cesa-Bianchi, N., Dichterman, E., Fischer, P., Shamir, E., Simon, H. 1999. Sample-Efficient Strategies for Learning in the Presence of Noise. Journal of the ACM. Vol. 46. pp. 684-719.
Goldreich, O. and Goldwasser, S. 1998. Property testing and its connection to Learning and approximation. Journal of the ACM. Vol. 45. pp. 653-750.
Khardon, R. and Roth, D. 1997. Learning to Reason. Journal of the ACM. Vol, 44. pp. 697-725.
Valiant, L. 2000. A Neuroidal Architecture for Cognitive Computation. Journal of the ACM. Vol. 47. pp. 854-882.
Maass, W. 1994. Efficient Agnostic PAC Learning With Simple Hypotheses. . Proceedings of the Seventh Annual Conference on Computational Learning Theory. 1994. pp. 67-75.
Benedek. G. and Itai, A. Dominating Distributions and Learnability. In: Annual Workshop on Computational Learning Theory. 1992.

Additional Information

Kearns, M. and Vazirani, U. (1994). An Introduction to Computational Learning Theory, MIT Press.
Computational Learning Theory Sally Goldman.

Week 6 (beginning February 13, 2017).

Making hard-to-learn concept classes efficiently learnable -- transforming instance representation and hypothesis representation. Occam Learning Algorithms. Mistake bound analysis of learning algorithms. Mistake bound analysis of online algorithms for learning Conjunctive Concepts. Optimal Mistake Bounds. Version Space Halving Algorithm. Randomized Halving Algorithm. Learning monotone disjunctions in the presence of irrelevant attributes -- the Winnow and Balanced Winnow Algorithms. Multiplicative Update Algorithms for concept learning and function approximation. Weighted majority algorithm. Applications.

Required readings

Lecture Slides. V. Honavar
Computational Complexity
The weighted Majority Algorithm, Littlestone, N., and Warmuth, M. Information and Computation Vol. 108: 212-261. 1994.
Empirical Support for Winnow and Weighted Majority Algorithms, Blum, A. In: Proceedings of the Twelfth International Conference on Machine Learning, pages 64--72. Morgan Kaufmann, 1995.
Applying Winnow to Context-Sensitive Spelling Correction Golding, A., and Roth, D. Machine Learning, 34(1-3):107--130, 1999.
M.H. Yang, D. Roth, and N. Ahuja. A SNoW-based face detector. NIPS(12), pages 855-861, 2000.

Recommended Readings

A.J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. In Proc. 10th Annu. Conf. on Comput. Learning Theory, pages 171--183, 1997.
Tong Zhang. Regularized winnow methods. In Advances in Neural Information Processing Systems 13, pages 703-709, 2001.
Helmbold, D.P., Schapire, R.E., Singer, Y., Warmuth, M.K. On-line portfolio selection using multiplicative updates. Mathematical Finance, vol. 8 (4), pp.325-347, 1998.

Additional Information

Kearns, M. and Vazirani, U. (1994). An Introduction to Computational Learning Theory, MIT Press.

Week 7 (Beginning February 20, 2017)

PAC learnability of infinite concept classes. Vapnik-Chervonenkis (VC) Dimension. Some properties of VC dimension. Example of VC dimension calculations. Sample complexity expressed in terms of VC dimension. Learnability of concepts when the size of the target concept is unknown. PAC-learnability in the presence of noise - attribute noise, label noise, malicious noise.

Required readings

Lecture Slides. V. Honavar
Computational Learning Theory Lecture Notes Sally Goldman. (Chapters 4,5,7,14)

Week 8 (Beginning February 27, 2017)

Ensemble Classifiers. Techniques for generating base classifiers; techniques for combining classifiers. Committee Machines and Bagging. Boosting. The Adaboost Algorithm. Theoretical performance of Adaboost. Boosting in practice. When does boosting help? Why does boosting work? Boosting and additive models. Loss function analysis. Boosting of multi-class classifiers. Boosting using classifiers that produce confidence estimates for class labels. Boosting and margin. Variants of boosting - generating classifiers by changing instance distribution; generating classifiers by using subsets of features; generating classifiers by changing the output code. Further insights into boosting.

Learning under helpful distributions. Kolmogorov Complexity and Universal distribution. Learnabiliy under universal distribution implies learnability under every enumerable distribution.

Required readings

Lecture Slides. V. Honavar
Chapter 14 (Sections 14.1-14.3) from Pattern Recognition and Machine Learning, C. Bishop.
Freund, R. (1999) A Short Introduction to Boosting Journal of the Japanese Society for Artificial Intelligence, Vol 14, pp. 771-780. (English Translation by Naoki Abe).
Friedman, J., Hastie, T., and Tibshirani, R. (2000). A Statistical View of Boosting, Annals of Statistics, Vol. 35, pp. 337-407.
Meir, R. and Ratsch, G. (2002). An Introduction to Boosting and Leveraging. Advanced Lectures on Machine Learning. Lecture Notes in Computer Science, pp. 118-183, Berlin: Springer-Verlag.
Baur, E., and Kohavi, R. (1999) An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants Machine Learning. Vol. 36. pp. 105-142.
Breiman, L. (1994). Bagging Predictors. Tech. Rep. 421, Department of Statistics, University of California, Berkeley, CA.

Recommended Readings

Efficient Margin Maximization Using Boosting G. Ratsch and MK Warmuth, JMLR 2005.
Freund, R. (1996). Game Theory, Online Prediction, and Boosting In: Proceedings of the Conference on Computational Learning Theory (COLT 1996).
Schapire, R. and Singer, Y. (1999). Improved Boosting Algorithms Using Confidence-Rated Predictions, Machine Learning Vol. 37.

Spring Break

Week 9 (Beginning March 12, 2017), Week 10 (Beginning March 19, 2017)

Maximum Margin Classifiers. Empirical risk and Risk bounds for linear classifiers. Vapnik's bounds on Misclassification rate (error rate). Minimizing misclassification risk by maximizing margin. Formulation of the problem of finding margin maximizing separating hyperplane as an optimization problem. Kernel Machines. Kernel Functions. Properties of Kernel Functions. Kernel Matrices. How to tell a good kernel from a bad one. How to construct kernels.

From Kernel Machines to Support Vector Machines.

Introduction to Lagrange/Karush-Kuhn-Tucker Optimization Theory. Optimization problems. Linear, quadratic, and convex optimization problems. Primal and dual representations of optimization problems. Convex Quadratic programming formulation of the maximal margin separating hyperplane finding problem. Characteristics of the maximal margin separating hyperplane. Implementation of Support Vector Machines.

Required Readings

Lecture Slides.. Vasant Honavar
Convex Optimization, Max Welling.
Kernel Support Vector Machines, Max Welling.
A Tutorial on Optimization by Martin Osborne (chapters 1-7).
Sections 6.1, 6.2 and 7.1 from Bishop, C. (2006). Pattern Recognition and Machine Learning.
Skolkopf, B. (2000). Statistical Learning and Kernel Methods. Technical Report MSR-TR-2000-23. Microsoft Research. 2000.
A Tutorial on Support Vector Machines. Nello Christianini, International Conference on Machine Learning (ICML 2001).
Graepel, T., Herbrich, R., & Williamson, R. C. (2001). From margin to sparsity. In Advances in Neural Information System Processing 13.
M. A. Hearst, B. Schvlkopf, S. Dumais, E. Osuna, and J. Platt. Trends and controversies - support vector machines. IEEE Intelligent Systems, 13(4):18-28, 1998.
Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, M. Jr, and Haussler, D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA 97: 262-267: 2000.
T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning (ECML-98), 1998.
Platt, J. Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods --- Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.

Recommended Readings

J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121--167, 1998.
Scheinberg, K. (2006). An Efficient Implementation of an Active Set Method for SVMs Journal of Machine Learning Research, Vol. 7. pp. 2237-2257.
Mangasarian, O. (2006). Exact 1-Norm Support Vector Machines via Unconstrained Convex Differentiable Minimization. Journal of Machine Learning Research, Vol. 7. pp. 1517-1530.
Laskov, P., Gehl, C., Kruger, S., Muller, K-R. (2006) Incremental Support Vector Learning: Analysis, Implementation and Applications , Journal of machine learning research, Vol. 7. pp. 1909-1936.
Hsu, C-W., Lin, C-J. (2002). A Simple Decomposition Method for Support Vector Machines, Machine Learning, Vol. 46, pp. 291-314.
Sollich, P. (2002). Bayesian Methods for Support Vector Machines: Evidence and Predictive Class Probabilities, Machine learning, Vol. 46, pp. 21-52.
Fung, G. and Mangasarian, O. (2005). Multicategory Proximal Support Vector Machine Classifiers , Machine Learning Vol. 59, pp. 77-97.
Leslie, C., Eskin, E., Weston, J., and Noble, W.S. (2002). Mismatch String Kernels for Protein Classification. Neural Information Processing Systems 2002.
Hsu, C-W., and Lin, C-J. (2002). A Comparison of methods for multi-class Support Vector Machines, IEEE Transactions on Neural Networks, Vol. 13, pp. 415-425.
Duan, K-B., and Keerthi, S. (2005). Which is the best multiclass SVM Method? - An Empirical Study, Springer-Verlag Lecture Notes in Computer Science Vol. 3541, pp. 278-285.
Fung, G.M., Mangasarian, O., and Shavlik, J. (2006). Knowledge Based Support Vector Machines. NIPS.
Le, Q., and Smola, A.J. (2006). A simpler knowledge-based Support Vector Machine.

Additional Information

V. Vapnik, A. Lerner (1963). Pattern recognition using generalized portrait method, Automation and Remote Control 24 774--780.
V. Vapnik, A. Chervonenkis (1964). A note on one class of perceptrons. Automation and Remote Control, Vol. 25.
Mangasarian, O. (1965). Linear and Nonlinear Separation of Patterns by Linear Programming, Operations Research, Vol. 13. pp. 444-452.
Cristianini, N. and Shawe-Taylor, J. (2000). Support Vector Machines. London: Cambridge University Press.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Classification. London: Cambridge University Press.
Muller, A.R., Mika. S., Ratsch, G., Tsuda, K., and Skolkopf. B. (2001). An Introduction to Kernel Methods for Pattern Classification. IEEE Transactions on Neural Networks. Vol. 12, pp. 188-201.
Edgar Osuna, Robert Freund, and Federico Girosi. Support vector machines: Training and applications. Technical Report AIM-1602, 1997.
S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, and K.R.K. Murthy. Improvements to platt's SMO algorithm for SVM classifier design. Technical report, Dept of CSA, IISc, Bangalore, India, 1999.
S.S. Keerthi and E.G. Gilbert, Convergence of a generalized SMO algorithm for SVM classifier design, Technical Report CD-00-01, Dept. of Mechanical and Production Eng., National University of Singapore, 2000.
Yi Li and Philip M. Long, The Relaxed Online Maximum Margin Algorithm, Machine Learning Vol. 46, pp. 361-, 2002.
Manewitz, L. and Yousef, M. (2001) One-class Support Vector Machine for Document Classification. Journal of Machine learning research. Vol. 2. pp. 139-154.
J. Platt, N. Cristianini, J. Shawe-Taylor, Large Margin DAGs for Multiclass Classification, in: Advances in Neural Information Processing Systems 12, pp. 547-553, MIT Press, (2000).
Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y. (2005). Large Margin Methods for Structured and Interdependent Variables Journal of Machine Learning Research Vol. 6. pp. 1456-1484.
Zelenko, D., Aone, C., and Richardella, A. (2003). Kernel Methods for Relation Extraction Journal of Machine Learning Research. Vol. 3. pp. 1083-1106.
Jebara, T., Kondor, R., and Howard, A. (2004). Probability Product Kernels Journal of Machine Learning Research. Vol. 5. pp. 819-844.

Week 11 (Beginning March 26, 2017) through Week 13 (Beginning April 10, 2017)

Probabilistic Graphical Models. Bayesian Networks.

Independence and Conditional Independence. Exploiting independence relations for compact representation of probability distributions. Introduction to Bayesian Networks. Semantics of Bayesian Networks. D-separation. D-separation examples. Answering Independence Queries Using D-Separation tests. Probabilistic Inference Using Bayesian Networks. Bayesian Network Inference. Approximate inference using stochastic simulation (sampling, rejection sampling, and liklihood weighted sampling

Learning Bayesian Networks from Data. Learning of parameters (conditional probability tables) from fully specified instances (when no attribute values are missing) in a network of known structure (review).

Learning Bayesian networks with unknown structure -- scoring functions for structure discovery, searching the space of network topologies using scoring functions to guide the search, structure learning in practice, Bayesian approach to structure discovery, examples.

Learning Bayesian network parameters in the presence of missing attribute values (using Expectation Maximization) when the structure is known; Learning networks of unknown structure in the presence of missing attribute values.

Some special classes of probabilistic graphical models. Markov models, mixture models.

Probabilistic Relational Models

Required readings

Lecture Slides.. Vasant Honavar
Chapter 14 from the Artificial Intelligence: A Modern Approach textbook by Russell and Norvig.
Chapter 8 from. Sections 8.2, 8.3, 8.4 (8.4.1-8.4.6), Chapter 11 (section 11.1) from Pattern Recognition and Machine Learning, by C. Bishop.
A Tutorial on Learning with Bayesian Networks, David Heckerman. Tech. Rep. MSR-TR-95-06. Microsoft Research.
Approximating Discrete Probability Distributions with Dependence Trees. Chou, C.K. and Liu, C.N. IEEE Transactions on Information Theory. 14(3), 1968. pp. 462-467.
Learning Bayesian belief networks: An approach based on the MDL principle., W. Lam and F. Bacchus, Computational Intelligence, 10(4), 1994.
Bayesian Network Classifiers Friedman, N., Geiger, D., and Goldszmidt, M. Machine Learning 29: pp. 131-163. 1997.
An Introduction to MCMC for Machine Learning, Andrieu et al., Machine Learning, 2001.

Recommended Readings

Dependency Modeling Course, University of Helsinki
Bayesian Networks and Decision-Theoretic Reasoning for Artificial Intelligence, Daphne Koller and Jack Breese. Tutorial Given at AAAI-97.
A Logical Notion of Conditional Independence: Properties and Applications by A. Darwiche.

Dependency Modeling Course

Inference in Bayesian Networks -- A Procedural Guide Huang, C., A. Darwiche. Journal of Approximate Reasoning. Vol 15. pp. 225-263.
http://www.cs.ubc.ca/~nando/papers/mlintro.pdfIntroduction to MCMC for Machine Learning
Being Bayesian about Network Structure - A Bayesian Approach To Structure Discovery in Bayesian Networks N. Friedman and D. Koller. Machine Learning Vol. 50. pp. 95-125. 2003.
Tractable Learning of Large Bayes Net Structures from Sparse Data, Goldernberg, A. and Moore, A. (2004). In Proceedings of the International Conference on Machine Learning, 2004.
Learning Bayesian Network Structure from Massive Data Sets - The Sparse Candidate Algorithm In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI) 1999.
Inferring Cellular Networks Using Probabilistic Graphical Models Science Vol. 303, pp. 799-805.
Learning Bayesian Belief Networks Based on the Minimum Description Length Principle: Basic Properties.J. Suzuki, IEICE Transactions on Fundamentals, vol. E82, No. 10., pp. 2237 2245
Comparing Model Selection Criteria for Belief Networks Tim Van Allen, Russ Greiner, 2000.
Module Networks: Identifying Regulatory Modules and their Condition-Specific Regulators from Gene Expression Data, Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., and Friedman, N. Nature Genetics Vol. 34, pp. 166-176. 2003.
Learning Bayesian Network Classifiers for Credit Scoring Using Markov Chain Monte Carlo Search, B. Baesens et al., 2001.
Operations for Learning with Graphical Models Wray Buntine. Journal of Artificial Intelligence Research. Vol. 2. pp. 159-225. 1994.

Week 14 (Beginning April 17, 2017) Function Approximation and Deep Neural Networks

Bayesian Recipe for function approximation and Least Mean Squared (LMS) Error Criterion. Introduction to neural networks as trainable function approximators. Function approximation from examples. Minimization of Error Functions. Derivation of a Learning Rule for Minimizing Mean Squared Error Function for a Simple Linear Neuron. Momentum modification for speeding up learning. Introduction to neural networks for nonlinear function approximation. Nonlinear function approximation using multi-layer neural networks. Universal function approximation theorem. Derivation of the generalized delta rule (GDR) (the backpropagation learning algorithm).

Generalized delta rule (backpropagation algorithm) in practice - avoiding overfitting, choosing neuron activation functions, choosing learning rate, choosing initial weights, speeding up learning, improving generalization, circumventing local minima, using domain-specific constraints (e.g., translation invariance in visual pattern recognition), exploiting hints, using neural networks for function approximation and pattern classification. Relationship between neural networks and Bayesian pattern classification. Variations -- Radial basis function networks. Learning non linear functions by searching the space of network topologies as well as weights.

Lazy Learning Algorithms. Instance based Learning, K-nearest neighbor classifiers, distance functions, locally weighted regression. Relative advantages and disadvantages of lazy learning and eager learning.

Introduction to deep learning, Stacked Auto-encoders.

Required readings

Chapter 5, Pattern Recognition and Machine Learning, C. Bishop.
Lecture Slides.. Vasant Honavar
Honavar, V. Function Approximation from Examples.
Honavar, V. Multi-layer networks.
Honavar, V. Radial Basis Function Networks.
C. G. Atkeson, S. A. Schaal and Andrew W, Moore, Locally Weighted Learning, AI Review,Volume 11, Pages 11-73 (Kluwer Publishers) 1997

Recommended Readings

T. G. Dietterich, H. Hild, and G. Bakiri. A comparative study of ID3 and backpropagation for English text-to-speech mapping. In Proceedings of 7th IMLW, Austin, 1990. Morgan Kaufmann.
Y. Le Cun, B. Boser, J.S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel.Handwritten digit recognition with a backpropagation neural network. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 396--404. Morgan Kaufmann, San Mateo, CA, 1990.
Thrun and T.M. Mitchell. Learning one more thing. Technical Report CMU-CS-94-184, Carnegie Mellon University, Pittsburgh, PA 15213, 1994.
R. Williams, and D. Zipser. Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity. In Backpropagation: Theory, Architectures, and Applications, Chauvin and Rumelhart, Eds., LEA, 1995, pp. 433-485.
Solomon, R. and J. L. van Hemmen (1996). Accelerating backpropagation through dynamic self-adaptation. Neural Networks 9 (4), 589--601.
Craven, M. and Shavlik, J. Using Neural Networks for Data Mining. Future Generation Computer Systems 13:211-229.
R. Setiono, W.K. Leow and J.M. Zurada. Extraction of rules from artificial neural networks for nonlinear regression, IEEE Transactions on Neural Networks, 2002.
Poggio et al. 1995. Regularization Theory and Neural Network Architectures Neural Computation, 7:219--269, 1995.
Fahlman, S. and Lebiere, C. 1991. Cascade Correlation Architecture Technical Report CMU-CS-90-100 Carnegie Mellon University, August 1991.
Poggio et al. 1989. A Theory of Networks for Approximation and Learning Technical Report 1140, MIT Artificial Intelligence Laboratory, 1989.
Ng et al., 2015. Deep Learning Tutorial

Pennsylvania State University

Weekly Study Guide

Course Materials

AI Resources

Quick Links