CS578 Fall 2001
Empirical Methods in Machine Learning and Data Mining 
Optional Homework Assignment #4
Due: Wednesday, December 19, 2001

This is an optional hw assignment for those who did not do well on one
of the earlier assignments.  The grade on the optional assignment will
be combined with the lowest grade received on a previous assignment as
follows: new_grade = 0.75*optional + 0.25*previous.  Thus the grade on
this assignment almost replaces your previous lowest grade.

The goal of the assignment is to implement hierarchical clustering and
apply it to the protein data we discussed in class.

EXPERIMENTS:

 0: The data set (available through the web page) contains the
    pairwise distances between proteins in upper diagonal format.

 1: Write a program that performs greedy agglomerative clustering
    on this data.  Cluster the data starting with 641 singleton
    clusters until only one cluster remains.  Merge clusters that
    have minimum mean pointwise distance between them.

 2: Plot the mean internal pointwise distance of the clusters as
    they are formed, the mean distance between the two clusters
    that are merged, and the mean pointwise distance between
    clusters, as a function of the number of clusters (or the
    number fo merges).

 3: Compute the correlation fractal dimension of the data using 
    the methods discussed in class: plot the number of pairs of
    points vs. radius and determine the slope of the graph.  Is
    the graph straight?  Are there plateaus?  What size cluster
    seems natural for this data?

 4: Discuss the results.  How many clusters seem natural for this 
    data set?  Do the clustering results agree with the fd results?


EXTRA CREDIT -- do one or more of the following:

 - apply multidimensional scaling to the data set and compare the
   results with clustering and fd analysis above
 - draw the clustering from above as a tree using your own code or
   any tree drawing package you can find on the web
 - cluster the data using nearest neighbor instead of average
   distance, plotting the nearest distance between clusters as
   the clusters are merged.  calculate the mean internal distance
   of the clusters as above, and plot this as well.  does merging
   based on nearest neighbor change the average size of clusters?
 - do something to make your code efficient such as incremental
   updating of cluster distances, using minimal spanning trees,
   using clever data structures, etc.

Have fun.