CS578 Empirical Methods in Machine Learning and Data Mining
Course Project
Predictions Due Saturday December 7, 2001

The goal of this project is to apply decision trees, neural nets, and/or k-nearest neighbor to a data set, using any/all of the methods we studied in the course to improve performance. These include:

- bagging and boosting
- cross validation
- model averaging (combining predictions from two or more models or learning methods)
- early stopping
- feature re-coding
- feature selection
- feature weighting
- distance metric hacking
- ...

You will be given two data sets: a train set and a final test set. The train set will contain 5,000 cases. You can do anything you want with the training set. We strongly encourage you to use cross validation to create your own test sets from this training set so that you get unbiased estimates of the performance of the methods you try. There are no missing values in the data.

The test set will not contain targets! You will run your final model on the test set and email the predictions to alexn at cs cornell edu. We will compute the performance of your method from these predictions and use this as part of the grading. The test set will contain 15,000 cases so that we can reliably estimate performance.

You may work on the project in groups of 1-4 students. If you work in a group, briefly document who did what. For example: "X was responsible for decision trees, implementing cross validation for all the experiments, and preprocessing the data. Y did neural nets and k-NN and looked at feature weighting in k-NN (which helped, but not enough to make k-NN competitive with trees and neural nets). Z implemented feature selection and bagging, and generated most of the graphs in this report. We all decided how to do cross validation before starting the project, and which model performed best at the end of the project."

The project will be graded as follows:

50% TECHNICAL APPROACH: How well did you tackle the problem? What method(s) did you use to optimize performance? How well did you apply them? How well did you interpret the results? The project is open ended, and you are expected to think about how to find/train good models in the allotted time. You can't try all possible combinations of methods, so it is important to create a plan for tackling the problem and to adjust the plan as you collect intermediate results.

25% WRITE-UP: Is your report clear, concise, and complete? The write-up should outline your plan for tackling the problem and summarize the performance of all the models you trained. It should clearly state which model you think is best and how that final model is trained. You must include estimates of roughly how well you think the final model will perform on the final test set (based on the performance you observe on your own test sets).

25% PERFORMANCE ON THE FINAL TEST SET: We'll measure the accuracy, RMS, and ROC Area of your predictions. Because a model that optimizes accuracy might not be optimal for ROC Area or RMS, you will submit separate predictions for accuracy, for RMS, and for ROC Area. It is OK if the predictions you submit for accuracy are the same as the ones you submit for RMS and ROC Area.

To submit your predictions, send one email to alexn at cs cornell edu with the subject line "CS578 Predictions for Final Project". The email should have three attached files, each named groupname.{acc|rms|roc}. The groupname can be anything you want as long as it is unlikely to be duplicated.
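As a concrete illustration of the cross-validation and evaluation-metric suggestions above, here is a minimal Python sketch, not a required implementation: it assumes 0/1 labels, assumes both classes appear in every fold, and treats train_and_predict (a hypothetical function and signature) as a stand-in for whatever learner you actually use. It estimates accuracy, RMS, and ROC Area with 5-fold cross validation on the training set.

    import random

    def k_fold_indices(n, k=5, seed=0):
        """Shuffle the row indices 0..n-1 and split them into k roughly equal folds."""
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        return [idx[i::k] for i in range(k)]

    def accuracy(probs, labels):
        """Fraction of cases where thresholding the probability at 0.5 matches the 0/1 label."""
        return sum((p >= 0.5) == (y == 1) for p, y in zip(probs, labels)) / float(len(labels))

    def rms(probs, labels):
        """Root-mean-squared error between predicted probabilities and the 0/1 labels."""
        return (sum((p - y) ** 2 for p, y in zip(probs, labels)) / float(len(labels))) ** 0.5

    def roc_area(probs, labels):
        """ROC Area via the rank-sum formulation; tied scores get the average rank."""
        order = sorted(range(len(probs)), key=lambda i: probs[i])
        ranks = [0.0] * len(probs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and probs[order[j + 1]] == probs[order[i]]:
                j += 1
            for k in range(i, j + 1):
                ranks[order[k]] = (i + j) / 2.0 + 1.0   # average 1-based rank
            i = j + 1
        pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
        n_pos, n_neg = len(pos_ranks), len(labels) - len(pos_ranks)
        return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2.0) / float(n_pos * n_neg)

    def cross_validate(rows, labels, train_and_predict, k=5):
        """Average accuracy, RMS, and ROC Area over k train/test splits of the training data."""
        folds = k_fold_indices(len(rows), k)
        totals = [0.0, 0.0, 0.0]
        for fold in folds:
            held_out = set(fold)
            tr = [i for i in range(len(rows)) if i not in held_out]
            probs = train_and_predict([rows[i] for i in tr], [labels[i] for i in tr],
                                      [rows[i] for i in fold])
            truth = [labels[i] for i in fold]
            scores = (accuracy(probs, truth), rms(probs, truth), roc_area(probs, truth))
            for m, score in enumerate(scores):
                totals[m] += score / k
        return totals  # [mean accuracy, mean RMS, mean ROC Area]

Averaging over folds this way gives you an unbiased estimate to report in your write-up and to use when comparing the methods you try.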
Each file should observe the following format:

- The 1st line of each file contains the names of the project members.
- The next line should be blank.
- The next 15,000 lines should be the predictions we should use to compute accuracy, RMS, or ROC Area. The predictions should be one entry per line: the probability your model predicts for class 1.

IMPORTANT! You must return predictions to us in the same order as the cases in the unlabeled final test set!

Sample Email:

...
Subject: CS578 Predictions for Final Project
...
attachment: carnic.acc
attachment: carnic.rms
attachment: carnic.roc

Sample File:

Rich Caruana, Alex Niculescu

0.66
0.09
0.38
... 14,995 more of these
0.99
0.59

Comments about the format:

- Probabilities can use any reasonable number of significant digits.
- The probability to give us is the probability the item is class 1!
- The order of the predictions is critical!
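For completeness, a minimal Python sketch of writing one prediction file in this format; test_probs is a hypothetical list holding your model's class-1 probabilities for the 15,000 test cases, already in the order of the unlabeled final test set.

    def write_predictions(filename, member_names, probs):
        """Write one prediction file: member names on line 1, then a blank line,
        then one class-1 probability per line, in the same order as the cases
        in the unlabeled final test set."""
        with open(filename, "w") as f:
            f.write(", ".join(member_names) + "\n")
            f.write("\n")
            for p in probs:
                f.write("%.2f\n" % p)

    # Hypothetical usage, with the group name from the sample above:
    # write_predictions("carnic.acc", ["Rich Caruana", "Alex Niculescu"], test_probs)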