CS578 Fall 2001 Empirical Methods in Machine Learning and Data Mining Optional Homework Assignment #4 Due: Wednesday, December 19, 2001 This is an optional hw assignment for those who did not do well on one of the earlier assignments. The grade on the optional assignment will be combined with the lowest grade received on a previous assignment as follows: new_grade = 0.75*optional + 0.25*previous. Thus the grade on this assignment almost replaces your previous lowest grade. The goal of the assignment is to implement hierarchical clustering and apply it to the protein data we discussed in class. EXPERIMENTS: 0: The data set (available through the web page) contains the pairwise distances between proteins in upper diagonal format. 1: Write a program that performs greedy agglomerative clustering on this data. Cluster the data starting with 641 singleton clusters until only one cluster remains. Merge clusters that have minimum mean pointwise distance between them. 2: Plot the mean internal pointwise distance of the clusters as they are formed, the mean distance between the two clusters that are merged, and the mean pointwise distance between clusters, as a function of the number of clusters (or the number fo merges). 3: Compute the correlation fractal dimension of the data using the methods discussed in class: plot the number of pairs of points vs. radius and determine the slope of the graph. Is the graph straight? Are there plateaus? What size cluster seems natural for this data? 4: Discuss the results. How many clusters seem natural for this data set? Do the clustering results agree with the fd results? EXTRA CREDIT -- do one or more of the following: - apply multidimensional scaling to the data set and compare the results with clustering and fd analysis above - draw the clustering from above as a tree using your own code or any tree drawing package you can find on the web - cluster the data using nearest neighbor instead of average distance, plotting the nearest distance between clusters as the clusters are merged. calculate the mean internal distance of the clusters as above, and plot this as well. does merging based on nearest neighbor change the average size of clusters? - do something to make your code efficient such as incremental updating of cluster distances, using minimal spanning trees, using clever data structures, etc. Have fun.