CS578 Fall 2002
Empirical Methods in Machine Learning and Data Mining
Homework Assignment #3
Due: Thursday, November 21, 2002

The goal of this assignment is to experiment with artificial neural nets trained with backpropagation, early stopping, and 5-fold cross validation. For this assignment use the same data set used in HW2. The data is still available from the web page if you want to get a clean copy. As before, the goal is to predict the same boolean variable (col 1) from the 143 inputs (cols 2-144).

You may implement backprop yourself, or use a commercial/public domain implementation. Note that it will probably take more time to install, learn to use, and modify someone else's implementation than to program backprop yourself, so we encourage you to code backprop yourself. In fact, implementing bp yourself counts as extra credit. If you decide to use someone else's package, it is up to you to make sure that it will support the experiments needed for this assignment. One public domain package you might want to consider that runs on a variety of platforms is SNNS, the Stuttgart Neural Network Simulator. There is also a Matlab toolbox for neural nets that is supposed to be pretty good.

EXPERIMENTS:

0: Scale each attribute so that the min value of the attribute is 0 and the max value is 1: new_val = (val-min)/(max-min). Your code from HW2 might help you here. (A scaling sketch appears after this list.)

1: For neural nets you need train sets (backprop sets), early stopping sets (technically still part of the train set), and test sets. Use 5-fold cross validation for the train/test sets. The early stopping set should be held out of the train set. One way to do this is to split the data into 5 folds: do backprop on folds 1-3 (3/4 of the train data), use fold 4 for early stopping (the remaining 1/4 of the train data), and test on fold 5. Repeat this process 5 times for 5-fold CV. There are other ways to do this. Carefully explain how you choose to do 5-fold CV; a diagram or table would be helpful. (A fold-splitting sketch appears after this list.)

2: Train fully-connected feedforward neural nets using vanilla backpropagation with momentum. Every backprop implementation defines learning rate and momentum somewhat differently, and the definitions often vary between batch mode (updating once per epoch, i.e., once per full pass through the training set) and per-pattern updating, so you'll have to experiment with the parameter settings to find values that work well with your code. You may use batch, per-pattern, or per-group-of-patterns updating. (If the nets are fully trained after fewer than 100 passes through the train set, you're probably training too fast. If the nets are taking more than 10^5 passes through the train set, you're probably training slower than necessary.) (A backprop sketch appears after this list.)

3: Compute the accuracy and RMSE on the train, early stopping, and test sets. Show graphs of performance vs. number of epochs for the train and early stopping sets. The performances on the test sets should be reported at the early stopping point. Is the early stopping point for accuracy the same as the early stopping point for RMSE? (A bookkeeping sketch appears after this list.)

4: Do some quick experiments to show how performance varies with the train set size. Plot a learning curve. (The sweep sketch after this list can be adapted by varying the backprop set size.)

5: Experiment with different numbers of hidden units. You might try 1, 2, 4, 8, 16, 32, 64, ... or even 1, 4, 16, 64, ... What size net yields the best generalization performance? (A sweep sketch appears after this list.)
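The sketches below are illustrations only, written in Python with NumPy; you may use any language or package, and every file name, parameter value, and helper name in them is a placeholder rather than part of the assignment. First, min-max scaling for experiment 0:

    import numpy as np

    def minmax_scale(X):
        # Rescale each column to [0, 1]: new_val = (val - min) / (max - min).
        mins = X.min(axis=0)
        maxs = X.max(axis=0)
        span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant columns
        return (X - mins) / span

    # "hw2.data" is a placeholder; point this at wherever you saved the HW2 file.
    data = np.loadtxt("hw2.data")
    y = data[:, 0]                 # boolean target, col 1
    X = minmax_scale(data[:, 1:])  # the 143 inputs, cols 2-144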
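For experiment 1, one possible fold rotation (a sketch of just one of the splitting schemes the assignment allows; each fold serves once as the test set and once as the early stopping set):

    import numpy as np

    def five_fold_splits(n, seed=0):
        # Yield (backprop, early_stop, test) index arrays for 5-fold CV.
        # Fold i is the test set, fold i+1 (mod 5) the early stopping set,
        # and the remaining three folds form the backprop set.
        rng = np.random.RandomState(seed)
        folds = np.array_split(rng.permutation(n), 5)
        for i in range(5):
            test, early = folds[i], folds[(i + 1) % 5]
            backprop = np.concatenate([folds[j] for j in range(5)
                                       if j not in (i, (i + 1) % 5)])
            yield backprop, early, test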
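For experiment 2 (and the implement-it-yourself extra credit), a minimal batch-mode sketch of a one-hidden-layer sigmoid net trained on squared error with momentum. The weight ranges, learning rate, and momentum below are placeholders to experiment with, not recommendations:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class Net:
        # Fully-connected feedforward net: inputs -> n_hidden sigmoids -> 1 sigmoid.
        def __init__(self, n_in, n_hidden, seed=0):
            rng = np.random.RandomState(seed)
            self.W1 = rng.uniform(-0.1, 0.1, (n_in + 1, n_hidden))  # +1 for bias
            self.W2 = rng.uniform(-0.1, 0.1, (n_hidden + 1, 1))
            self.dW1 = np.zeros_like(self.W1)  # previous steps, for momentum
            self.dW2 = np.zeros_like(self.W2)

        def forward(self, X):
            Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias input
            H = sigmoid(Xb @ self.W1)
            Hb = np.hstack([H, np.ones((len(H), 1))])
            return Xb, Hb, sigmoid(Hb @ self.W2)

        def epoch(self, X, y, lrate=0.1, momentum=0.9):
            # One batch update on squared error.
            Xb, Hb, out = self.forward(X)
            y = y.reshape(-1, 1)
            d_out = (out - y) * out * (1.0 - out)  # output-layer delta
            H = Hb[:, :-1]                         # hidden activations
            d_hid = (d_out @ self.W2[:-1, :].T) * H * (1.0 - H)
            self.dW2 = momentum * self.dW2 - lrate * (Hb.T @ d_out) / len(X)
            self.dW1 = momentum * self.dW1 - lrate * (Xb.T @ d_hid) / len(X)
            self.W2 += self.dW2
            self.W1 += self.dW1

Per-pattern or per-group updating just means calling the same update on single patterns or small slices of the train set instead of the whole thing.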
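For experiment 3, a sketch of the per-epoch bookkeeping. It records accuracy and RMSE on the early stopping set every epoch so you can compare the two early stopping points afterwards; the 0.5 threshold for accuracy and the epoch cap are assumptions:

    import numpy as np

    def rmse(net, X, y):
        out = net.forward(X)[2].ravel()
        return np.sqrt(np.mean((out - y) ** 2))

    def accuracy(net, X, y):
        out = net.forward(X)[2].ravel()
        return np.mean((out > 0.5) == (y > 0.5))

    def train_with_early_stopping(net, Xtr, ytr, Xes, yes, max_epochs=5000):
        # Returns the epoch of min RMSE and the epoch of max accuracy on the
        # early stopping set, plus both histories (for the required graphs).
        # In a real run, also save the weights whenever the early stopping
        # measure improves, so the test fold can be scored at that point.
        es_rmse, es_acc = [], []
        for _ in range(max_epochs):
            net.epoch(Xtr, ytr)
            es_rmse.append(rmse(net, Xes, yes))
            es_acc.append(accuracy(net, Xes, yes))
        return int(np.argmin(es_rmse)), int(np.argmax(es_acc)), es_rmse, es_acc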
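For experiment 5, a sweep over hidden layer sizes built from the sketches above; the same loop handles the learning curve of experiment 4 if you vary the amount of the backprop set used instead of n_hidden:

    import numpy as np

    # Reuses X, y, five_fold_splits, Net, train_with_early_stopping, and
    # accuracy from the earlier sketches.
    for n_hidden in [1, 2, 4, 8, 16, 32, 64]:
        fold_accs = []
        for bp, es, test in five_fold_splits(len(X)):
            net = Net(X.shape[1], n_hidden)
            train_with_early_stopping(net, X[bp], y[bp], X[es], y[es])
            # For the real experiment, restore the weights saved at the
            # early stopping epoch before scoring the test fold.
            fold_accs.append(accuracy(net, X[test], y[test]))
        print(n_hidden, np.mean(fold_accs))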
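Looking ahead to the first extra credit item below, averaging the predictions of the 5 CV-trained nets can be as simple as this sketch (the comparison against each net alone should be made on the held-out final test set mentioned in the hint):

    import numpy as np

    def average_prediction(nets, X):
        # Mean output of the 5 CV-trained nets; threshold at 0.5 for a label.
        outs = np.column_stack([net.forward(X)[2].ravel() for net in nets])
        return outs.mean(axis=1)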
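And for the sensitivity analysis extra credit, a sketch of the noise-injection variant (the noise level and trial count are arbitrary placeholders):

    import numpy as np

    def sensitivity(net, X, y, noise=0.1, trials=10, seed=0):
        # Accuracy drop when Gaussian noise is injected into one input at a
        # time; a larger drop suggests the net relies more on that input.
        rng = np.random.RandomState(seed)
        base = accuracy(net, X, y)  # accuracy() from the bookkeeping sketch
        drops = []
        for j in range(X.shape[1]):
            accs = []
            for _ in range(trials):
                Xn = X.copy()
                Xn[:, j] = Xn[:, j] + rng.normal(0.0, noise, len(X))
                accs.append(accuracy(net, Xn, y))
            drops.append(base - np.mean(accs))
        return np.array(drops)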
EXTRA CREDIT -- do one or more of the following:

- 5-fold CV leaves you with 5 or more trained neural nets. Compare the average prediction of these nets with the performance of each of the nets alone to see which works better (see the averaging sketch above). To do this experiment right you'll need to either use an extra level of cross validation or hold out a final test set. (HINT: if you might do this extra credit, hold out the final test set(s) *before* doing the assigned experiments above so that you don't have to repeat the runs!)

- Do a study of the effect of altering the learning rate and momentum on the generalization performance of the nets.

- Try nets with two or more hidden layers. Do they perform better than nets with one hidden layer? Are they harder to train?

- Compare weight decay with early stopping. Does one perform better than the other? Is one easier to use than the other?

- Do feature selection to find a subset of the features that seems to perform better than using all the features.

- Do a sensitivity analysis to figure out which inputs the trained nets use most. Sensitivity analysis can be done by looking at derivatives of the output of the net with respect to the inputs, or by injecting noise into the inputs one at a time (see the noise-injection sketch above).

- Take variable type into account when coding inputs to the nets.

- Implement vanilla backprop with momentum for fully-connected feedforward neural nets containing one hidden layer and trained with squared error.

Hand in a brief summary of the results with enough documentation that we can see what you did and how you did it. Do not write a paper or anything like that; this is homework, not a class project. You'll probably want to use the neural net code later in the class project, so effort spent now writing good code or becoming familiar with whatever package you use should pay off later.

Have fun.