Machine Learning

CS4780/CS5780

Overview
- Administrative stuff
- Supervised learning setup:
- Feature vectors, Labels
- 0/1 loss, squared loss, absolute loss
- Train / Test split
Reading:
- 00 Introduction
- MLaPP: sec. 1.1, 1.2, 1.4.2, 1.4.3, 1.4.9

2017 Videolecture:
- #1: CS4780 Lecture #1: Introduction!
k-nearest neighbors
- Hypothesis classes
- Nearest Neighbor Classifier
- Sketch of the Cover and Hart proof (that the 1-NN error converges to at most 2x the Bayes error in the large-sample limit)
- Curse of Dimensionality
Reading:
- MLaPP: sec. 1.1, 1.2, 1.4.2, 1.4.3, 1.4.9
- Start reading: 2.1 -- 2.5
- Videos of kNN with different values of k
- Video describing nearest neighbors
- Probably the best explanation of nearest neighbors

2017 Videolectures:
- #4 kNN I
- #5 kNN II (proof of convergence)
- #8 curse of dimensionality
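
A minimal sketch of the k-nearest-neighbor classifier covered above, assuming numpy, Euclidean distance, and non-negative integer labels; the names (knn_predict, Xtrain, ytrain) are illustrative, not from the course code:

    import numpy as np

    def knn_predict(Xtrain, ytrain, Xtest, k=1):
        """Predict a label for each test point by majority vote among the
        k nearest training points under Euclidean distance."""
        preds = []
        for x in Xtest:
            dists = np.linalg.norm(Xtrain - x, axis=1)  # distance to every training point
            nearest = np.argsort(dists)[:k]             # indices of the k closest
            votes = ytrain[nearest]
            preds.append(np.bincount(votes).argmax())   # majority vote (labels must be >= 0)
        return np.array(preds)

With k=1 this is the nearest-neighbor classifier from the Cover and Hart result; increasing k smooths the decision boundary, as in the linked videos.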
Perceptron
- Linear Classifiers
- Absorbing the bias into a (d+1)-dimensional weight vector
- Perceptron convergence proof
Reading:
- The Perceptron Wiki page
- MLaPP 8.5.4
- Article in the New Yorker on the Perceptron

Lectures:
- #9 Perceptron Algorithm
- #10 Perceptron convergence proof
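
A minimal sketch of the perceptron update with the bias absorbed into a (d+1)-dimensional weight vector, assuming numpy and labels in {-1, +1}; names are illustrative:

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        X = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb bias: x -> (x, 1)
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * np.dot(w, xi) <= 0:   # misclassified (or exactly on the boundary)
                    w += yi * xi              # perceptron update
                    mistakes += 1
            if mistakes == 0:                 # no mistakes in a full pass: converged
                break
        return w

The convergence proof bounds the number of updates (not epochs) when the data are linearly separable with margin gamma: at most (R/gamma)^2 mistakes, where R bounds the norm of the inputs.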
Estimating Probabilities from data
- MLE
- MAP
- Bayesian vs. Frequentist statistics
Reading:
- Ben Taskar’s notes on MLE vs MAP
- Tom Mitchell’s book chapter on Estimating Probabilities
- YouTube videos on MLE, MAP, and the Dirichlet distribution

Lectures:
- #11 MLE and MAP
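
A small worked example contrasting MLE and MAP for a coin flip with a Beta prior; the hyperparameters are illustrative, not from the lecture:

    def mle_map_coin(heads, tails, alpha=2.0, beta=2.0):
        """MLE maximizes the likelihood; MAP maximizes the posterior under a
        Beta(alpha, beta) prior, which acts like alpha-1 / beta-1 pseudo-counts."""
        mle = heads / (heads + tails)
        map_ = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)
        return mle, map_

    # e.g. 3 heads, 1 tail: MLE = 0.75, while MAP with a Beta(2, 2) prior = 2/3;
    # as the amount of data grows, the two estimates converge.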
Naive Bayes
- Naive Bayes Assumption
- Why is estimating probabilities difficult in high dimensions?

Reading:
- Ben Taskar’s notes on Naïve Bayes
- Tom Mitchell’s book chapter on Naive Bayes (chapters 1-3)
- YouTube videos on Naive Bayes
- Xiaojin Zhu’s notes on Multinomial Naive Bayes
- Manning’s description of Multinomial Naive Bayes

Julia code from the MLE / MAP demo. You can try it out in JuliaBox.

Lectures:
- #12 MLE and MAP Example / Naive Bayes
- #13 Naive Bayes - parameter estimation
- #14 Naive Bayes with continuous variables
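
A minimal sketch of multinomial Naive Bayes parameter estimation with +alpha (Laplace) smoothing, assuming numpy, a word-count matrix X, and integer class labels; names are illustrative:

    import numpy as np

    def fit_multinomial_nb(X, y, alpha=1.0):
        """Estimate log class priors and smoothed log word probabilities."""
        classes = np.unique(y)
        log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
        log_theta = []
        for c in classes:
            counts = X[y == c].sum(axis=0) + alpha      # smoothed word counts per class
            log_theta.append(np.log(counts / counts.sum()))
        return classes, log_prior, np.vstack(log_theta)

    def predict_nb(X, classes, log_prior, log_theta):
        # argmax_c [ log P(y=c) + sum_j x_j log theta_{c,j} ]
        scores = X @ log_theta.T + log_prior
        return classes[np.argmax(scores, axis=1)]

Smoothing is one answer to why estimating probabilities is difficult in high dimensions: without it, any word unseen in a class drives that class's probability to zero.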
Logistic Regression
- Logistic Regression formulation
- Relationship of LR with Naive Bayes

Reading:
- MLAPP 8, 8.1, 8.2, 8.3.1, 8.3.2, 8.3.4, 8.6
- Ben Taskar’s notes on Logistic Regression
- Tom Mitchell’s book chapter on Naive Bayes and Logistic Regression
- YouTube videos on Logistic Regression
- Logistic Regression Wiki

Lectures:
- #15 Logistic Regression
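
A minimal sketch of logistic regression trained by gradient descent on the negative log-likelihood, assuming numpy and labels in {-1, +1}; names and step size are illustrative:

    import numpy as np

    def logistic_regression_gd(X, y, lr=0.1, steps=1000):
        """Minimize sum_i log(1 + exp(-y_i w^T x_i)), i.e. P(y|x,w) = 1/(1 + exp(-y w^T x))."""
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            margins = y * (X @ w)
            # gradient: -sum_i y_i x_i / (1 + exp(y_i w^T x_i))
            grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).sum(axis=0)
            w -= lr * grad
        return w

Unlike Naive Bayes, which models P(x|y) under the conditional-independence assumption, logistic regression models P(y|x) directly (its discriminative counterpart).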
Gradient Descent:
- Gradient Descent (GD)
- Taylor’s Expansion
- Proof that GD decreases the objective with every step if the step size is small enough
- Some tricks for setting the step size
- Newton’s Method
Reading:
- MLAPP 13.3-13.3.1, 8.3.2, 8.3.6, 13.5.3
- Nice blog post on Gradient Descent, Adagrad, and Newton’s method

Lectures:
- #16: Gradient Descent
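
A minimal sketch of plain gradient descent, assuming numpy; the example objective and step size are illustrative:

    import numpy as np

    def gradient_descent(grad, w0, lr=0.1, steps=100):
        """Repeat w <- w - lr * grad(w). For a smooth objective this decreases
        the function value at every step, provided lr is small enough."""
        w = np.asarray(w0, dtype=float)
        for _ in range(steps):
            w = w - lr * grad(w)
        return w

    # Example: minimize f(w) = ||w - 3||^2, whose gradient is 2 (w - 3); converges to w = 3.
    w_star = gradient_descent(lambda w: 2 * (w - 3.0), w0=[0.0], lr=0.1, steps=200)

Newton's method replaces the fixed step with an inverse-Hessian step, w <- w - H^{-1} grad(w), which typically needs far fewer iterations but costs more per step.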
Linear Regression:
- Assumptions of linear regression with Gaussian noise
- Ordinary Least Squares (OLS) = MLE
- Ridge Regression = MAP
Reading:
- MLAPP 7-7.5.1 + 7.5.4
- OLS wiki page
- YouTube video on OLS

Lectures:
- #17: Linear Regression
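
A minimal sketch of the two closed-form solutions, assuming numpy; lam is an illustrative regularization constant:

    import numpy as np

    def ols(X, y):
        # OLS = MLE under Gaussian noise: w = (X^T X)^{-1} X^T y
        return np.linalg.solve(X.T @ X, X.T @ y)

    def ridge(X, y, lam=1.0):
        # Ridge = MAP with a zero-mean Gaussian prior on w: w = (X^T X + lam I)^{-1} X^T y
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)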
Linear SVM:
- What is the margin of a hyperplane classifier
- How to derive a max margin classifier
- Why the SVM optimization problem is convex
- The final quadratic program (QP) of the SVM
- Slack variables
- The unconstrained SVM formulation
Reading:
- MLAPP: 14.5 - 14.5.2.2
- Ben Taskar’s Notes on SVMs

Lectures:
- #18: Support Vector Machines
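
A minimal sketch of the unconstrained SVM formulation trained by subgradient descent, assuming numpy and labels in {-1, +1}; C, lr and the names are illustrative:

    import numpy as np

    def linear_svm_subgradient(X, y, C=1.0, lr=0.01, steps=1000):
        """Minimize ||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i)."""
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            margins = y * (X @ w)
            viol = margins < 1                 # points inside the margin or misclassified
            # subgradient: 2w - C * sum over violators of y_i x_i
            grad = 2 * w - C * (X[viol] * y[viol][:, None]).sum(axis=0)
            w -= lr * grad
        return w

The constrained max-margin derivation and this unconstrained hinge-loss form have the same solution; the slack variables correspond exactly to the hinge losses max(0, 1 - y_i w^T x_i).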
Empirical Risk Minimization:
- Setup of loss function and regularizer
- Classification loss functions: hinge loss, log loss, zero-one loss, exponential loss
- Regression loss functions: absolute loss, squared loss, Huber loss, log-cosh loss
- Properties of the various loss functions
- Which ones are more susceptible to noise, and which ones are more robust (see the loss-function sketch after this list)
- Special cases: OLS, Ridge regression, Lasso, Logistic Regression
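
A small sketch of the loss functions listed above as functions of the margin m = y * f(x) (classification) or the residual r = y - f(x) (regression), assuming numpy:

    import numpy as np

    # classification losses (margin m = y * f(x), labels in {-1, +1})
    hinge_loss = lambda m: np.maximum(0.0, 1.0 - m)
    log_loss   = lambda m: np.log(1.0 + np.exp(-m))
    zero_one   = lambda m: (m <= 0).astype(float)
    exp_loss   = lambda m: np.exp(-m)

    # regression losses (residual r = y - f(x))
    absolute_loss = lambda r: np.abs(r)
    squared_loss  = lambda r: r ** 2
    huber_loss    = lambda r, delta=1.0: np.where(np.abs(r) <= delta,
                                                  0.5 * r ** 2,
                                                  delta * (np.abs(r) - 0.5 * delta))
    log_cosh_loss = lambda r: np.log(np.cosh(r))

Losses that grow fast (squared, exponential) are more sensitive to noisy points and outliers; absolute, Huber, and hinge losses grow at most linearly and are more robust.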
ML Debugging, Over- / Underfitting:
- k-fold cross validation
- regularization
- How to debug ML algorithms
- How to recognize high variance scenarios
- What to do about high variance
- How to recognize high bias
- What to do about high bias
Reading:
- Ben Taskar’s description of under- and overfitting
- MLaPP: 1.4.7
- Andrew Ng’s lecture on ML debugging

Lectures:
- #25: ML Debugging, Kernels
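
A minimal sketch of k-fold cross validation, assuming numpy; train_and_eval is a placeholder for whatever model is being debugged:

    import numpy as np

    def k_fold_cv(X, y, train_and_eval, k=5, seed=0):
        """Average validation error over k folds; each point is validated exactly once."""
        idx = np.random.default_rng(seed).permutation(X.shape[0])
        folds = np.array_split(idx, k)
        errors = []
        for i in range(k):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            errors.append(train_and_eval(X[tr], y[tr], X[va], y[va]))
        return np.mean(errors)

A large gap between training error and this cross-validation error is the typical signature of high variance (overfitting); high error on both suggests high bias (underfitting).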
Bias / Variance Tradeoff:
- What is Bias?
- What is Variance?
- What is Noise?
- How error decomposes into Bias, Variance, Noise
Reading:
- Ben Taskar’s Notes (recommended!!)
- Excellent notes by Scott Fortmann-Roe
- EoSLR Chapter 2.9
- MLAPP: 6.2.2

Lectures:
- #23: Bias-variance tradeoff
- #24: Bias-variance tradeoff (cont.), regularization, cross validation
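
A Monte-Carlo sketch of the decomposition at a single test point x0, assuming numpy; fit_predict and true_f are placeholders (a learner and the noiseless target function):

    import numpy as np

    def bias_variance_noise(fit_predict, true_f, x0, n_train=50, sigma=0.3,
                            trials=500, seed=0):
        """Estimate the bias^2, variance, and noise terms of E[(y - h_D(x0))^2]
        by retraining on many freshly sampled training sets D."""
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(trials):
            Xtr = rng.uniform(-1, 1, size=n_train)
            ytr = true_f(Xtr) + sigma * rng.standard_normal(n_train)   # noisy labels
            preds.append(fit_predict(Xtr, ytr, x0))
        preds = np.array(preds)
        bias_sq  = (preds.mean() - true_f(x0)) ** 2   # squared bias
        variance = preds.var()                        # variance across training sets
        noise    = sigma ** 2                         # irreducible label noise
        return bias_sq, variance, noise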
Kernels (reducing Bias):
- How to kernelize an algorithm.
- Why to kernelize an algorithm.
- RBF Kernel, Polynomial Kernel, Linear Kernel
- What happens when you change the RBF kernel width.
- What is required for the kernel trick to apply
1. the weight vector must be a linear combination of the inputs
2. all inputs are only accessed through inner products
- The kernel trick allows you to perform classification indirectly (!) in very high-dimensional spaces (see the kernel ridge sketch after this topic's lectures)

Kernels (Lecture Continued):
- Constructing new kernels
- Kernel SVM
Reading:
- Ben Taskar’s Notes on SVMs
- MLAPP: 14-14.2.1, 14.4, 14.4.1, 14.5.2
- Derivation of kernel Ridge regression by Max Welling
- Kernel Cookbook by David Duvenaud
- Laurent El Ghaoui’s lectures on duality

Lectures:
- #26: Kernels continued
- #27: Even more Kernels
- #28: Kernel regression, Kernel SVM
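
A minimal sketch of kernelizing ridge regression with an RBF kernel, assuming numpy; gamma and lam are illustrative. Note the two requirements above: the solution is a linear combination of the inputs (the alpha vector), and the data are accessed only through inner products (the kernel matrix):

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # k(a, b) = exp(-gamma * ||a - b||^2); smaller gamma means a wider kernel
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def kernel_ridge_fit(X, y, lam=1.0, gamma=1.0):
        # alpha = (K + lam I)^{-1} y, so that f(x) = sum_i alpha_i k(x_i, x)
        K = rbf_kernel(X, X, gamma)
        return np.linalg.solve(K + lam * np.eye(len(X)), y)

    def kernel_ridge_predict(X, alpha, Xtest, gamma=1.0):
        return rbf_kernel(Xtest, X, gamma) @ alpha

See Max Welling's notes (linked above) for the derivation that this matches ridge regression performed in the kernel's implicit feature space.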
Gaussian Processes / Bayesian Global Optimization:
- Properties of Gaussian Distributions
- Gaussian Processes (Assumptions)
- GPs are kernel machines
Reading:
- MLAPP: 15
- MATLAB code for Bayesian optimization
- Genetic Programming bike demo (the “other” GP)

Lectures:
- #29: Comparison of kNN and kernelized SVM, Gaussian distribution
- #30: Gaussian Process Regression
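
A minimal sketch of Gaussian process regression with a zero-mean prior, assuming numpy; kernel is any covariance function (e.g. an RBF kernel as sketched earlier) and noise is an illustrative observation-noise variance:

    import numpy as np

    def gp_posterior(Xtrain, ytrain, Xtest, kernel, noise=1e-2):
        """Posterior mean and covariance:
        mu = K*^T (K + sigma^2 I)^{-1} y,  Sigma = K** - K*^T (K + sigma^2 I)^{-1} K*."""
        K   = kernel(Xtrain, Xtrain) + noise * np.eye(len(Xtrain))
        Ks  = kernel(Xtrain, Xtest)
        Kss = kernel(Xtest, Xtest)
        mu  = Ks.T @ np.linalg.solve(K, ytrain)
        cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
        return mu, cov

The posterior mean is a kernel machine (a weighted sum of kernel evaluations at the training points); the posterior variance is what Bayesian global optimization uses to trade off exploration and exploitation.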
k-nearest neighbors data structures (not covered in SP2018)
- How to construct and use a KD-tree
- How to construct and use a ball tree
- What are the advantages of KD-trees over ball trees and vice versa?
Decision / Regression Trees:
- ID3 algorithm
- Gini Index
- Entropy splitting function (aka Information Gain)
- Regression Trees
Reading:
- Decision Tree wiki page
- Ben Taskar’s old notes
- MLAPP: 16.2

Lectures:
- #33: Decision Trees, Gini-Index and Entropy
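
A minimal sketch of the impurity measures and the exhaustive split search used by ID3-style tree building, assuming numpy and non-negative integer labels; names are illustrative:

    import numpy as np

    def entropy(y):
        p = np.bincount(y) / len(y)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def gini(y):
        p = np.bincount(y) / len(y)
        return 1.0 - (p ** 2).sum()

    def best_split(X, y, impurity=entropy):
        """Return (feature, threshold, gain) maximizing the impurity reduction
        (information gain when impurity=entropy)."""
        best = (None, None, -np.inf)
        base = impurity(y)
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left, right = y[X[:, j] <= t], y[X[:, j] > t]
                if len(left) == 0 or len(right) == 0:
                    continue
                gain = base - (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
                if gain > best[2]:
                    best = (j, t, gain)
        return best

Regression trees use the same recursion but split on squared-error reduction and predict the mean label of each leaf.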
Bagging:
- What is Bagging, how does it work, why does it reduce variance
- Bootstrapping
- What the weak law of large numbers has to do with bagging
- Bootstrapping yields samples that are correctly (yet not independently) distributed
- Random Forests:
  - What are the parameters
  - What is the algorithm, and what is so nice about it
Reading:
- MLAPP: 16.2 - 16.2.6
- Paper on Random Forest and Gradient Boosting that contains easy-to-understand pseudo-code.

Lectures:
- #34: Decision tree regression, bagging and bootstrapping
- #35: Bagging, Random Forests, Boosting
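
A minimal sketch of bagging, assuming numpy; fit_predict is a placeholder for any base learner (e.g. a full-depth decision tree), and adding a random feature subsample per split would turn this into a random forest:

    import numpy as np

    def bagging_predict(X, y, Xtest, fit_predict, n_estimators=25, seed=0):
        """Train each learner on a bootstrap sample (drawn with replacement)
        and average the predictions; averaging is what reduces the variance."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        preds = []
        for _ in range(n_estimators):
            idx = rng.integers(0, n, size=n)            # bootstrap sample of size n
            preds.append(fit_predict(X[idx], y[idx], Xtest))
        return np.mean(preds, axis=0)                   # average (or majority vote for labels)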
Boosting:
- What is a weak learner (what is the assumption made here)
- Boosting reduces bias
- The Anyboost algorithm
- The Gradient boost algorithm
- The Adaboost algorithm
- The derivation of how to compute the weight of a weak learner

Reading:
- Paper on Random Forest and Gradient Boosting
- MLAPP: 16.4, 16.4.3, 16.4.8 (I believe there may be a mistake in line 6 of Algorithm 16.2: it should be 2*(I()-1) instead of just I())
Additional reading:
- Freund & Schapire’s tutorial on Boosting
- Ben Taskar’s slides on Adaboost

Lectures:
- #36: Boosting (cont.)
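
A minimal sketch of AdaBoost, assuming numpy and labels in {-1, +1}; fit_weak is a placeholder for any weak learner that accepts sample weights and returns a classifier h(X) -> {-1, +1}:

    import numpy as np

    def adaboost(X, y, fit_weak, T=50):
        n = X.shape[0]
        w = np.full(n, 1.0 / n)                      # uniform sample weights
        learners, alphas = [], []
        for _ in range(T):
            h = fit_weak(X, y, w)
            pred = h(X)
            eps = w[pred != y].sum()                 # weighted training error
            if eps >= 0.5:                           # weak-learner assumption violated
                break
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # weight of this weak learner
            w *= np.exp(-alpha * y * pred)           # upweight mistakes, downweight correct points
            w /= w.sum()
            learners.append(h)
            alphas.append(alpha)
        return lambda Xt: np.sign(sum(a * h(Xt) for a, h in zip(alphas, learners)))

The alpha formula is the closed-form minimizer of the exponential loss for the newly added learner, which is the derivation referenced above; Anyboost and gradient boosting generalize the same scheme to other loss functions.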
Artificial Neural Networks / Deep Learning:
- What artificial neural networks (ANN) are
- Forward propagation
- Backward propagation
- Why you need a squashing function
Reading:
- Intro to backprop
- MLAPP 28.1-28.5
- Tips & tricks on how to make ANNs perform better

- Neural Network Playground
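
A minimal sketch of forward and backward propagation for one hidden layer with a sigmoid squashing function and squared-error loss, assuming numpy; shapes and names are illustrative. Without the squashing function the two layers would collapse into a single linear map:

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        z1 = W1 @ x + b1
        a1 = 1.0 / (1.0 + np.exp(-z1))     # squashing function (sigmoid)
        yhat = W2 @ a1 + b2                # linear output layer
        return z1, a1, yhat

    def backward(x, y, W1, b1, W2, b2):
        """Gradients of L = 0.5 * ||yhat - y||^2 via the chain rule."""
        z1, a1, yhat = forward(x, W1, b1, W2, b2)
        delta2 = yhat - y                          # dL/dyhat
        dW2, db2 = np.outer(delta2, a1), delta2
        delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # chain rule through the sigmoid
        dW1, db1 = np.outer(delta1, x), delta1
        return dW1, db1, dW2, db2

Training then follows the gradient-descent recipe from earlier in the course: update each weight matrix with w <- w - lr * dw.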