New Time:
New Place: Upson 5130
Instructor: Rich Caruana (caruana@cs.cornell.edu)
Office: Upson 4157
Office Hours: Tue 4:30-5:30, Wed 2:30-3:30
In this course we will compare the performance of different machine learning algorithms on a variety of test problems. The goal is to figure out what learning methods work best on each problem. The learning methods we might use include:
Support Vector Machines (SVMs)
Artificial Neural Nets (ANNs)
Nearest Neighbor Methods (e.g., kNN)
Decision Trees (DTs)
Splines (e.g., MARS: Multivariate Adaptive Regression Splines)
Logistic Regression
Rule Learning
We will try to run each algorithm nearly optimally by tuning its parameters. We will measure performance using a variety of measures such as accuracy, squared error, precision and recall, and ROC. We will use sound statistical tests to analyze the results. In other words, we are going to try to do the comparisons as thoroughly as we can.
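As a concrete illustration of the measures mentioned above, here is a from-scratch sketch using only the Python standard library. The labels and predicted probabilities are made up for the example; in the actual experiments we would of course use real model outputs (and likely library implementations of these metrics).

```python
# Hypothetical predictions for a two-class problem (illustration only).
y_true = [1, 1, 0, 0, 1, 0]                      # true 0/1 labels
y_prob = [0.9, 0.6, 0.4, 0.2, 0.3, 0.7]          # predicted P(class = 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # thresholded predictions

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def squared_error(y_true, y_prob):
    # mean squared error between 0/1 labels and predicted probabilities
    return sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

def precision_recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(y_true, y_prob):
    # area under the ROC curve, computed as the probability that a
    # random positive example is scored above a random negative one
    pos = [s for t, s in zip(y_true, y_prob) if t == 1]
    neg = [s for t, s in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(accuracy(y_true, y_pred))
print(squared_error(y_true, y_prob))
print(precision_recall(y_true, y_pred))
print(roc_auc(y_true, y_prob))
```

Note how accuracy and precision/recall depend on the 0.5 threshold while squared error and ROC do not, which is one reason to report several measures rather than just one.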
There are many issues that arise when doing this kind of empirical research. Some of these are:
How do we optimize each learning method?
What data sets should we use?
What do we do with missing values? Some methods, such as decision trees, handle missing values easily, but most learning methods do not. How do we make a fair comparison between methods on data sets that have missing values?
What performance measures should we use to compare methods?
Some of the better performance measures such as ROC are only defined for two-class problems. Also, some learning methods such as SVMs are best suited to two-class problems. What do we do with problems that have more than two classes?
Some data sets are large enough that we can hold out a large final test set. Many are not. How do we structure cross-validation so that we can optimize each method and still do a final comparison between methods?
What statistical tests should we use to compare methods?
Should we use bagging or boosting?
What do we do if a data set is too big or the experiments too costly for some of the learning methods?
???
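One common answer to the cross-validation question above is nested cross-validation: an outer loop that estimates generalization performance, and an inner loop that tunes each method's parameters using only the outer training folds, so the final test folds never influence parameter selection. The sketch below is a toy illustration, with a made-up 1-D synthetic data set and a simple kNN learner standing in for the real methods and data sets.

```python
import random
import statistics

# Toy data: two overlapping 1-D Gaussian classes (a stand-in for a real data set).
random.seed(0)
data = [(random.gauss(c, 0.5), c) for c in (0, 1) for _ in range(30)]
random.shuffle(data)

def knn_accuracy(train, test, k):
    """Accuracy of a simple k-nearest-neighbor majority vote on 1-D points."""
    correct = 0
    for x, y in test:
        neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        vote = round(sum(label for _, label in neighbors) / k)
        correct += (vote == y)
    return correct / len(test)

def folds(points, n):
    """Split points into n disjoint folds."""
    return [points[i::n] for i in range(n)]

def cv_score(points, n, k):
    """Plain n-fold cross-validated accuracy for a given k (the inner loop)."""
    fs = folds(points, n)
    scores = []
    for i in range(n):
        train = [p for j in range(n) if j != i for p in fs[j]]
        scores.append(knn_accuracy(train, fs[i], k))
    return statistics.mean(scores)

# Outer loop: each fold is used once as a final test set; k is tuned by
# inner cross-validation on the remaining folds only.
outer = folds(data, 5)
outer_scores = []
for i in range(5):
    train = [p for j in range(5) if j != i for p in outer[j]]
    best_k = max((1, 3, 5, 7), key=lambda k: cv_score(train, 4, k))
    outer_scores.append(knn_accuracy(train, outer[i], best_k))

print(round(statistics.mean(outer_scores), 3))
```

The per-fold scores from two competing methods, run on identical outer folds, can then be compared with a paired test (for example, a paired t-test on the fold-by-fold score differences), which speaks to the statistical-testing question as well.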
There will be a half dozen lectures and papers to read to bring us all up to speed, but most of the classes will be more like group meetings than like lectures. We'll use off-the-shelf code for most learning methods so that we don't have to implement everything from scratch.
If all goes well, we'll publish a group paper on the results with each of us as co-authors.
This is a 700-level course. You should not take this course unless you already have some experience in machine learning (e.g., CS478, CS578, or equivalent) or statistical modeling. Please contact the instructor if you are not sure whether your background is adequate.
Textbooks: There are no required textbooks. The following texts might prove useful: