Machine Learning is a major computer science topic these days. And while it can seem really powerful, it is often just the application of statistics to large amounts of data. As computers become both ubiquitous and more powerful, many applications — from science to business to entertainment — are generating huge amounts of data.
In this assignment you will work with one of the most popular (though not necessarily most efficient) tools for analyzing large amounts of data: k-means clustering. In k-means clustering, you organize the data into a small number (k) of clusters of similar values. For example, in the picture to the right, there are a lot of points in 3-dimensional space. These points are color-coded into five clusters, with all the points in a cluster being near to one another.
If you look at the wikipedia entry for k-means clustering, you can see that it has many, many applications. It has been used in market analysis, recommender systems (e.g. how Netflix recommends new movies to you), and computer vision, just to name a few. It is an essential tool to have if you are working with data. With the skills and knowledge you have been acquiring in CS 1110, you are now able to implement it yourself!
This assignment is a little longer than the previous one. The code in any individual method is not that complicated, but they all often call each other as helpers. Therefore, it is important to have the “big picture” in mind at all times while working on this assignment, so that you know how everything works together. To keep you from getting overwhelmed, we highly recommend that you follow our recommended deadlines throughout the assignment.
Authors: W. White, D. Murphy, T. Westura, S. Marschner, L. Lee
Learning Objectives
This assignment is designed to help you understand the following concepts.
- It gives you practice with writing your own classes.
- It gives you practice with writing code spread across many files.
- It demonstrates the importance of class invariants.
- It gives you practice with using data encapsulation in classes.
- It gives you practice with both both 2-dimensional and 3-dimensional data.
- It gives you experience with structuring your code with helper functions.
- It demonstrates how to implement a powerful data-analysis algorithm on real data.
Table of Contents
- Academic Integrity and Collaboration
- K-Means Clustering
- Assignment Source Code
- Assignment Instructions
- Finishing Touches
Academic Integrity and Collaboration
This is a very old assignment that has been resurrected with some interesting changes. With that said, there is some relevant code out in the wild. Please avoid looking at solutions online, or looking at code from other students this semester, or semesters previous.
In this assignment, it is highly unlikely that your code for this assignment will look exactly like someone else’s. We will be using Moss to check for instances of plagiarism. We also ask that you do not enable violations of academic policy. Do not post your code to Pastebin, GitHub, or any other publicly accessible site.
Collaboration Policy
You may do this assignment with one other person. If you are going to work together, then form your group on CMS as soon as possible. If you do this assignment with another person, you must work together. It is against the rules for one person to do some programming on this assignment without the other person sitting nearby and helping.
With the exception of your CMS-registered partner, we ask that you do not look at anyone else’s code or show your code to anyone else (except a CS1110 staff member) in any form whatsoever. This includes posting your code on Ed Discussions to ask for help. It is okay to post error messages on online, but not code. If we need to see your code, we will ask for it.
K-Means Clustering
In this section, we describe what cluster analysis is, and how we use k-means clustering to implement it.
Cluster analysis is the act of grouping together similar points into groups. For example, one of the applications of clustering is understanding ecological systems. Suppose that we have a large dataset of locations (longitude-latitude pairs) where whales are spotted in the ocean. Clustering this data into groups can give us a sense of where different whale community populations are centered.
When we say point, it could be a 2-dimensional point, a 3-dimensional point, a 4-dimensional point, and so on. In fact, we could have points of arbitrary dimension, particularly if the data is not spatial. For example, one of the data sets for this assignment is a fictional market analysis of candies. Each candy is a 4-dimensional point that represents how sweet, sour, nutty, and crunchy it is.
Distances between Points
When we cluster points, we use the Euclidean Distance to measure the distance between them. You may already know how to compute the distance between two 2-dimensional points:
\[d(\mathbf{p},\mathbf{q}) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}\]or between 3-dimensional points:
\[d(\mathbf{p},\mathbf{q}) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2}.\]These are special cases of the general definition: given two $n$-dimensional points $\textbf{p} = (p_1, p_2, \ldots, p_n)$ and $\textbf{q} = (q_1, q_2, \ldots, q_n)$, the Euclidean distance between them is:
\[d(\textbf{p},\textbf{q}) = \sqrt{(p_1-q_1)^2+(p_2-q_2)^2+\cdots+(p_n-q_n)^2}\]For example, suppose we have two points: $\textbf{p} = (0.1,0.2,0.3,0.4)$ and $\textbf{q} = (0.0,0.2,0.3,0.2)$. Then:
\[d(\textbf{p},\textbf{q}) = \sqrt{(0.1-0.0)^2+(0.2-0.2)^2+(0.3-0.3)^2+(0.4-0.2)^2} = 0.7071\ldots\]Cluster Centroids
Given a set of points, we can identify its centroid. The centroid is the center of mass of the cluster. This means it is the average, or mean, of all of the points in the cluster. The centroid might happen to be one of the points in the cluster, but it does not have to be. For example, the picture to the right is an example of a 4-point cluster (black circles) with a centroid (red circle) that is not one of the points in the cluster.
To compute a centroid, we average. Remember that to average, you add up a collection of things and divide by the total number of items. The same is true for points; we add up all the points, and then “divide” the result by the total number of points. Adding two points results in a new point that is the coordinate-wise addition of the two input points:
\[\textbf{p}+\textbf{q} = (p_1+q_1, p_2+q_2, \ldots, p_n+q_n)\]When we divide a point by a number, the division is done coordinate-wise, as follows:
\[\textbf{p}/a = (p_1/a, \>p_2/a,\> \ldots,\> p_n/a)\]The Clustering Algorithm
The idea of k-means is quite simple: each point should belong to the cluster whose mean (centroid) it is closest to. But, the centroid of a cluster depends on which points you put into it. This creates a kind of chicken-and-egg problem, which is solved using an iterative approach: first assign points to clusters, then compute new centroids for these clusters, then re-assign points using the new centroids, and so on.
To make this general idea into a precise algorithm we can implement, we break down k-means clustering into several steps. Throughout, let $n$ be the dimension of the points.
1. Pick $k$ Centroids
Our goal is to divide the points into $k$ adjacent groups. The first step is to guess $k$ centroids. Throughout these instructions, we will refer to the centroids as $\textbf{m}_1,\textbf{m}_2,\ldots, \textbf{m}_k$. In this case the letter m stands for mean, or centroid, which is where the algorithm’s name comes from.
Remember that the centroids are themselves $n$-dimensional points. Hence, for any mean $\textbf{m}_j$, where $1 <= j <= k$, we have
\[\textbf{m}_j = (m_{j,1},m_{j,2},\ldots,m_{j,n})\]So each $m_{j,i}$ represents the ith coordinate of the jth centroid.
To pick $k$ centroids, all we do is choose $k$ points from our original dataset. We will choose these points at random, as if drawing cards from a deck. This is referred to as the Forgy Method, and it is one of the most popular ways to start k-means clustering.
2. Partition the Dataset
Now that we have $k$ centroids, we assign every point in the data set to a centroid. We do this by matching each point to the nearest centroid $\textbf{m}_j$, and then have the points assigned to each centroid form a new cluster. This is known as partitioning because each point will belong to one and only one cluster.
If there happens to be a tie between which centroids are closest to a given point, we arbitrarily break the tie in favor of the $\textbf{m}_j$ with smallest $j$. If we put the centroids in a list, this would be the centroid that comes first in the list.
There will then be $k$ sets $S_1, S_2, \ldots, S_k$, where set $S_j$ is the cluster consisting of all the points associated with centroid $\textbf{m}_j$.
3. Recompute the Means
Once we have the sets $S_1, S_2, \ldots, S_k$, we need to compute a new mean $\textbf{m}_j$ for each $S_j$. This new mean is just the average of all of the points in that set. Let
\[S_j = \lbrace \textbf{q}_{1},\textbf{q}_{2},\ldots,\textbf{q}_{c_j}\rbrace\]be the set of points associated with centroid $\textbf{m}_j$, where $c_j$ is the number of points inside of $S_j$. Then the new mean is the $n$-dimensional point
\[\textbf{m}_j = \frac{1}{c_j}(\textbf{q}_{1}+\textbf{q}_{2}+\cdots+\textbf{q}_{c_j}).\]The formula above uses the rules for adding points and dividing points by numbers that we discussed above. If you do not understand this formula, please talk to a staff member immediately.
Now, because we have recomputed the means, some points might be closer to the centroid for a cluster they are not currently in than to the centroid for the cluster they are in. To fix this, we repartition the dataset by going back to step 2 and proceeding as before.
4. Repeat Steps 2 and 3 Until Convergence
The clustering algorithm continues this process: recompute the means and repartition the dataset. We want to keep doing this until the means stop changing. If all the means are unchanged after step 3, we know that partitioning again will produce the same clusters, which will have the same means, so there is no point in continuing. When this happens, we say that the algorithm has converged.
Sometimes it can take a very long time to converge — thousands of thousands of steps. Therefore, when we implement k-means clustering, we often add a time-out factor. This is a maximum number of times to repeat steps 2 and 3. If the algorithm does not converge before this maximum number of times, we stop anyway, and use the clusters that we do have.
Assignment Source Code
The assignment code is much more complicated than it has been in previous assignments. To make sure that you have everything, we have provided with two zip files.
- cluster.zip: The package for the assignment application code
- datasets.zip: Sample data files for clustering
You should download the zip archive cluster.zip
from the link above. Unzip it
and put the contents in a new directory. This time, you will see that this
directory contains a lot of files. Many of these files are not all that
important; they are similar to a3app.py
in that they drive a GUI application
that you do not need to understand. In fact, you only need to pay attention to
three files:
- cluster/a6dataset.py: This module contains the
Dataset
class. You will need to complete it to be able to start visualizer. - cluster/a6cluster.py: This module contains the
Cluster
class. You will need to complete it to visualize a single cluster and its centroid. - cluster/a6algorithm.py: This module contains the
Algorithm
class. You will need to complete it to visualize the clustering process.
There is one other file that you should read, even though it is not intended to be modified.
- cluster/a6test.py: This module contains test cases for this assignment.
These are not the only files, but these are the ones to read or attention to. Do not download the individual files. Download the zip files provided. That way you can guarantee you have everything.
Understanding the Application
Classes are ideal for representing complex mathematical objects. For example, we saw in lecture how to use classes to represent rectangles or times of day. There are several interesting mathematical objects in the description above (points, centroids, sets/clusters) and we could use classes to represent all of these.
Deciding which of these things to create classes for is the core of object-oriented software design. For this assignment we have made these decisions for you, since it is not always easy to hit the right balance of complexity against structure the first time, and we want you to spend your time implementing rather than redesigning.
For our implementation of k-means, we decided not to use classes to represent
points and centroids, because Python’s built-in lists serve well enough. In
particular, you should not use the Point3
class in introcs
. That class
only supports 3-dimensional points, but we would like to support points in any
ddimension.
In the end, we decided to create classes for three important concepts in k-means.
First there is the entire dataset, represented by an instance of the class
Dataset
. Second, there is a cluster of data, represented by an instance of
the class Cluster
. Finally, there is the set of clusters created by k-means,
represented by an instance of the class Algorithm
.
Points as Lists
Throughout this assignment, a point will be represented as a list of numbers; both ints and floats are allowed. This is true for centroids as well. For example, here is an example of a possible centroid if we were working with 4-dimensional points:
\[\textbf{m}_j \mbox{ is represented as the list } [0.1, 0.5, 0.3, 0.7]\]In this example, if x
stores the name of the list, then $m_{j,2}$
(using the notation from a previous setion) is
x[1]
, or 0.5, while $m_{j,4}$ is x[3]
, or 0.7. Note that our mathematical
notation starts at 1, while list indices start at 0.
Class Dataset
The k-means algorithm starts with a dataset—a collection of points that
will be grouped into clusters. This dataset is stored in an instance of the
class Dataset
. This class is very simple; it just holds the list of points,
so its most important (hidden) attribute is _contents
,
which is a list of points in the dataset.
For example, if a dataset contains the points [0.1,0.5,0.3,0.7] and [0.2,0.0,0.2,0.1],
then the attribute _contents
is
[[0.1,0.5,0.3,0.7], [0.2,0.0,0.2,0.1]]
The points are all lists of numbers, and they all have to be the same length.
Remember that this length is the dimension of the data. We keep track of this
dimension in a second attribute called _dimension
.
The fact that all the points are supposed to have the right dimension is expressed by a class invariant:
Each element of _contents has length equal to _dimension.
You might argue that storing the dimension separately is redundant, since you
could always look at the points in _contents
to find it. However, we would like the dimension to be defined even before we
add any points to the dataset.
One of the things that you might notice about these attributes is that they all
start with an underscore. As we mentioned in class, that is a coding convention
to indicate that the attributes should not be accessed directly outside of the
class Dataset
. In particular, neither Cluster
nor Algorithm
should access
these attributes at all. When those two classes do need to access these
attributes, they should do it through getters and setters. One of the things
you will do in this assignment is implement these getters and setters.
Important: If a method of any class accesses the hidden attributes of an object in another class, you will lose points. A lot of points.
Class Cluster
The next idea in the algorithm that we have defined a class for is a cluster
$S_j$. To support this type, we defined the class Cluster
. Objects of this
class will actually hold two different values: the centroid $\textbf{m}_j$ and the set $S_j$.
We do this because the centroid and cluster are associated with one another, so
it makes sense to store them in a single object. We need two attributes
corresponding to these ideas:
_centroid
: The point $\textbf{m}_j$ represented as a list._indices
: A list of integers indexing the points in the cluster.
Note that we are not storing the points belonging to a cluster; we are only
storing the indices of the points in the dataset. This simplifies Cluster
because it does not need to worry about maintaining lists of lists. Code that
works with clusters does need to be able to get to the points, however.
Therefore, the final attribute is _dataset
,
which is a reference to the dataset.
Once again, these attributes start with an underscore, so we want getters and
(and sometimes setters) if we are going to access them in the class Algorithm
(though it is okay if a method in Cluster
accesses one of its own attributes
with an underscore in it). This is the purpose of methods like getCentroid
and
addIndex
. There is also a special getter getContents
that is used by the
visualizer.
In addition to the getters and setters, the class Cluster
is the logical
place to put code that does computations involving a single cluster. Reading the
description of the k-means algorithm above, we find that the two core operations
with clusters are computing the distance from a point to a cluster’s centroid
and finding a new centroid for the cluster. These operations are handled by the
two most important methods of the Cluster
distance
: A method to find the distance from a point to the cluster’s centroidupdate
: A method to compute a new centroid for a cluster
Finally, this class contains the special Python methods __str__
and __repr__
.
This methods are to allow you to print clusters, which can be very informative
when debugging.
Class Algorithm
The k-means algorithm needs to work with several clusters at once. Hence it does
not make sense to put this code in the Cluster
class. Instead, we put this code
in a separate class called Algorithm
. The data stored in this class is simple:
a list of all clusters in the algorithm, and as with Cluster
, a reference back
to the dataset.
_dataset
: The dataset we are clustering_clusters
: The list of current clusters $S_1, \ldots, S_k$
Once again, we have getters and setters for these attributes. But the important
methods of class Algorithm
are the core of the k-means algorithm.
_partition
: A method to partition the dataset, implementing step 2 of the algorithm._nearest
: A helper method for_partition
to find the nearest cluster to a point._update
: A method to update all the cluster centroids, implementing step 3
Finally there are the methods that orchestrate the whole process:
step
: A method that executes one step of the process, updating and re-partitioning.run
: A method that computes a clustering, from start to finish
So a user of the class who just wants to cluster a dataset into $k$ clusters
would create a Dataset
, add all the points to it, create a Algorithm
with
$k$ clusters, and then call run
for the algorithm object.
You will notice that some of the methods have names starting with an underscore.
The meaning is the same as with attributes: these methods are meant to be used
internally rather than called directly by code outside the Algorithm
class.
Precondition Enforcement
Once again, we want you to get into the habit of asserting preconditions. Unlike the last assignment, we are not making a distinction between enforced and normal preconditions. If it is a precondition, you should enforce it with an assert statement.
Some of the preconditions are quite complete this time. For example, method
addPoint
in Dataset
requires that point
is a list of numbers. There are
two different things to check here:
point
is a list- Every element of
point
is a number (int or float)
The second part requires a for-loop (or equivalent) to check, and we do not know
how to mix for-loops and asserts. The best way to handle this precondition is to
write a helper function that returns True
if these are both true, and False
if not. We have a stub for such a helper is_point
for you in a6dataset.py
.
You are expected to complete this function. But when it is complete, you can
use it in addPoint
, by simply writing
assert is_point(point)
As with Assignment 4, we are not requiring
any error messages with your assert statements. Note that this is not the only
part of the precondition in addPoint
to enforce, but it is hardest part.
We have provided the stubs for two more functions to help with enforcing
preconditions. There is is_point_list
in a6dataset.py
and valid_seeds
in
a6algorithm.py
. You are expected to complete these functions according to
their specification. You should then use them to enforce the preconditions for
your methods.
With the exception of self
, you are expected to enforce all preconditions
for the methods in this assignment.
Running the Application
Because there are so many files involved, this application is handled a little
differently from previous assignments. To run the application, keep all of the
files inside of the folder cluster
. Do not rename this folder. To run the
program, change the directory in your command shell to just outside of the
folder cluster
and type
python cluster --view
In this case, Python will run the entire folder. What this really means is
that it runs the script in __main__.py
. This script imports each of the other
modules in this folder to create a complex application.
The script __main__.py
is actually quite sophisticated. If you look at the
code in this file, you will see that actually manages multiple applications
underneath one roof. Try typing
python cluster --test
the application will run test cases (provided in a6test.py
) on all of your
classes. You will want to rely on this, because the visualizer only works on
two or three-dimensional data.
Finally, if you type
python cluster file.csv 5
the application will load the CSV (comma separated values) file file.csv
and perform k-means clustering with 5 clusters, and print out the results.
Using the Test Cases
As we said, executing the application with the --test
option will run test
cases on the classes Dataset
, Cluster
and Algorithm
. These test cases are
designed so that you should be able to test your code in the order that you
implement it. However, if you want to “skip ahead” on a feature, you are
allowed to edit a6test.py
to remove a test. Those tests are simply there for
your convenience.
This test script is fairly long, but if you learn to read what this script is doing, you will understand exactly what is going on in this assignment and it will be easier to understand what is going wrong when there is a bug. However, one drawback of this script is that (unlike a grading program) it does not provide a lot of detailed feedback. You are encouraged to sit down with a staff member to talk about this test script in order to understand it better.
With that said, this test script is not complete. It does not have full
coverage of all the major cases, and it may miss some bugs in your code. It is
just enough to ensure that basic features are working. You may lose points
during grading even if you pass all the tests in this file (our grading program
has a lot more tests). Therefore, you may want to add additional tests as you
debug. However, we do not want you to submit the file a6test.py
when you are
done, even if you made modifications to the file.
Using the Visualizer
K-means clustering is cooler when you can what is actually going on. If you run
the application with the --view
option, you will get a visualizer that will
help you see the application in action. It is much more limited than the unit
test because you can only visualize 2-dimension points. It will also only work
properly once you have completed the entire assignment. However, even if you
have not completed the assignment, it is still useful and will show some
information.
When you start the visualizer for the first time, it will look like the window
shown below. You will see a empty plot of a 2-dimensional axis, as well as some
buttons and menus to the right. You will also get an error message telling you
to finish Dataset
.
As the error message says, this visualizer is not going to do anything until you
start implementing the various parts of this assignment. Once you have most of
the assignment completed (up to Part C of Algorithm
), you
can use the visualizer to perform a k-means clustering on a two-dimensional data
set.
Once you have everything working, click on the drop-down menu to select a dataset,
like sample1
. These options correspond to the CSV files in the data
directory.
Your initialization step will pick $k$ random centroids and partition the data.
The result will look something like the following:
Note that each cluster is a different color. The points in the cluster are crosses, while the solid circle is the centroid.
You can use the pull-down menu to change the k-value. There is a limit on the number of k values you can choose because there are only so many different colors that we could think of. Whenever you change the k, it will reinitialize the centroids. It will also reinitialize the centroids when you hit the reset button.
The overlay
option changes how the centroid is drawn. It if False
, the
centroid is a simple, small circle. If it is True
, then it grows to a large
transparent circle containing (most of) the dataset.
The overlay option requires that you finish the getRadius
method in Cluster
.
However, it is not perfect, and you should not assume that your getRadius
is
wrong if a few points stray outside of the overlay.
Each time you hit the step button, it will perform step 3 of the k-means
algorithm. You can watch the centroids change as you hit the button. When
the algorithm converges, you will see the word True
after Finished and
the step button will no longer do anything. Below is a picture of a finished
computation with final versions of the clusters.
This visualizer will work with valid any CSV file (see below). You can load a
new file by selecting the <select file> under Dataset. For
example, try the file basic-2d.csv
in datasets.zip
. This is a small set of
2-dimensional points. For data files that are more than two dimensional (like
the candy data sets), the visualizer will only use the first two values for
each point. For data sets that are one dimensional (like flat.csv
) it will
use 0.5 for the y values, creating a line in the middle of the window.
Finally, we have extended the visualizer to work on three-dimensional data. To enable this option call the view with an additional argument 3:
python cluster --view 3
This will launch a three-dimensional cluster visualizer like the one shown below. It behaves exactly like the two-dimensional view except that you can rotate the plotted data on the left. To rotate the data view, simply click on the graph and drag your mouse.
Using CSV Files
While working on this assignment, you might want to work with your own datasets.
The cluster
package can process any dataset you want, of any dimension you
want, so long as it is in a proper CSV file. A CSV file is a comma separated
value file, and these can be generated by most spreadsheet programs like Excel.
CSV files represent simple tables. They have a header row with the names of each
column, but then every row after the first only contains values. The files in
datasets.zip
are examples of CSV files.
The dataset CSV file must have a particular format. All of the values in the
non-header rows must be numbers. The only exception is if you have a column
named COMMENTS
. You can write whatever you want in this column; it is a way
of labeling your data. We have done that in the candy datasets.
To process a file, type the command
python cluster file.csv k
(where k
is a number and file.csv
is a name of a CSV file). This will
perform k-means clustering with k
clusters on the file file.csv
. It will
print the result on your screen. If you would like to output the results to a
new CSV file (which will have additional columns for the centroid and cluster
number), you can do that with the --output
option, like this
python cluster file.csv k --ouput output.csv
You can also use this command with the --view
option to open the visualizer
on a custom dataset.
Assignment Instructions
In the files a6dataset.py
, a6cluster.py
, and a6algorithm.py
we have
provided the definitions for the classes discussed above. You have the class
invariants and all method specifications, but almost all the method bodies are
just stubs. Your task is to implement all the stubbed methods in these classes.
We recommend starting with Dataset
, since that is needed by the other classes.
Then implement Cluster
, and test that all its methods are working correctly.
Finally implement and test Algorithm
. When that is working correctly, your
assignment is finished.
Micro-Deadlines
This is a longish assignment and you only have the same two weeks that you did for Assignment 4. While a lot of code is straight-forward (particularly for getters and setters), you need to understand how the classes fit together. That can be very overwhelming if you wait until the last minute on this assignment.
Therefore, at the end of each part, we have a suggested completion date. While this is not enforced, we recommend that you try to hit these deadlines. If you cannot make these deadlines, it might be a sign that you are having difficulty and need some extra help. In that case, you should see a staff member as soon as possible.
November 2 Task 0 November 4 Part A Dataset November 5 Part B Dataset November 9 Task A Cluster November 10 Task B Cluster November 11 Task A Algorithm November 12 Task B Algorithm November 13 Task C Algorithm November 14 Task D Algorithm
Task 0: The Assert Functions
One of the primary things that you will be doing in this assignment is enforcing preconditions and class invariants. Obviously that will require a lot of assert statements. However, the criteria that you need to assert can get really complicated. How do we enforce a precondition that a variable is a list of points?
To simplify the process, we have provided two functions in a6dataset.py
:
is_point
and is_point_list
. These functions will be used in the assert
statements for Task 1. Unfortunately, right now
they are just stubs. You will need to complete these. Read the specification
and do just that.
The important thing to understand about these functions is that they have no
preconditions themselves. They can take any value as an argument. So you
have to be prepared to return False
on the call is_point('Fred')
.
Testing it Out
Both functions is_point
and is_point_list
have their own test procedures
in a6test.py
, so you can test them out separately (but read
Using the Test Cases for how to execute the test script
properly). If you see output that says
function is_point appears correct
function is_point_list appears correct
then you can be pretty confident about this part.
We highly recommend that you finish these functions by Saturday, November 2. This gets you started on the assignment right away and helps you determine if you might run into any difficulties. It also puts you in a good place to finish the assignment at a somewhat leisurely pace.
Task 1: The Dataset Class
The Dataset class is split into two parts: Part A and Part B. That is make
testing a little bit easier. Only Part A is required to start testing and use
the visualizer. However, Part B is necessary to fully complete the class.
Each part has its own test procedure in a6test.py
, so that you can fully test
one part before moving on to the next.
Part A: Object Initialization
The first thing you should do before working on this class is read the class invariant.
# IMMUTABLE ATTRIBUTES
# Attribute _dimension: The point dimension for this dataset
# Invariant: _dimension is an int > 0.
#
# MUTABLE ATTRIBUTES
# Attribute _contents: The dataset contents
# Invariant: _contents is a table of numbers (float or int), possibly empty.
# Each element of _contents is a list of size _dimension
Note that these attributes begin with an underscore. That means they can be
accessed inside the code for Dataset
, but nowhere else. Access by any other
class (Cluster
or Algorithm
) requires a getter or setter.
This class has is largely defined by six methods: the initializer, four getters
and a setter (we are ignoring the method __str__
for right now). The
specifications of all of these methods are included in the file a6dataset.py
.
The getters are the most straight-forward. Most of the time, you just need to
return the attribute. The only interesting function is the initializer, and
that is because it needs to make sure the class invariant is true when it is
done.
Pay careful attention to the preconditions of all of these methods. Recall that preconditions are how we preserve the invariants, so we must assert these preconditions to prevent the invariants from being violated. You should use the functions that you implemented in Task 0 to help you with this.
When you implement the initializer, note that the specification says that the
class is supposed to keep a copy of the parameter contents
. In this context
this means that you need to set the _contents
attribute to a new list that
contains the points in contents
. Furthermore, because the points themselves
are lists, you need to perform a deep copy of this list.
Testing it Out
The initializer and getters/settets in class Dataset
have their own test
procedures in a6test.py
, so you can test them out separately (but read
Using the Test Cases for how to execute the test script
properly). If you see output that says
Part A of class Dataset appears correct'
then you can be pretty confident about this part.
You can also try your code in the visualizer. Run the visualizer and load a data file as described above. The points will only be displayed as black crosses, and the algorithm controls will not work, but you will be able to see the data.
If you click on the Reset button, you will see some error messages in the
terminal saying FAILED VISUALIZATION
. This is perfectly normal. These
error messages are the visualizer letting us know that it was not able to form
clusters yet and that it shifted down to testing only the Dataset part of your
code.
You should try to finish this part of Dataset
by
Saturday, November 2.
This will give you an experience with a (mostly completed class) and set you
up for the more difficult tasks to come.
Part B: The __str__
Method
There is one more method in Dataset
that we want you to implement: the __str__
method. The purpose of this method is to aid with debugging. When you are trying
to understand how the other classes are processing your dataset, it helps to
know which points are actually in the dataset (and where they are located).
Read the specification of this method carefully. The idea is that you are creating
a string so that, if you print the string, it shows each point on its own
line, with the point index at the start of the line. For example, a Dataset
with three points might look something like this:
0: [1.0,2.0]
1: [3.0,4.0]
2: [5.0,6.0]
However, you should not use print
anywhere in this method. Instead, you
are creating a string that someone else can print out should they wish. To
create a multiline string, you need to make use of the newline character 'n'
,
like this:
'0: [1.0,2.0]\n1: [3.0,4.0]\n2: [5.0,6.0]'
Another important thing to keep in mind is that points can technically contain
both int
and float
values. But this method will always show the values as
floats (e.g. it type casts the value before converting it to a string. So
if a dataset contains the points [1.0, 2, 3.5]
and [-3, 0.5, 6]
(in that
order) this method will return
'0: [1.0,2.0,3.5]\n1: [-3.0,0.5,5.0]'
Testing it Out
The test cases for this method are in a separate test procedure than those of Part A. Run the test script and look for the line
Part B of class Dataset appears correct'
Once you see that, you can have confidence in this part of the assignment.
We highly recommend that you finish this last method of Dataset
by
Tuesday, November 5.
This will allow you to take a break to work on
Assignment 5, which is actually due in the
middle of this assignment. It also puts you in a good place to finish the
remainder of the assignment at a somewhat leisurely pace.
With that said, even though this is only one method, some people may actually find it the hardest method in the entire assignment. Furthermore, nothing else in this assignment depends on this method (it is for debugging help only). Therefore if you are stuck on this problem, skip it for now and come back to it later. Either work on the next task, or start work on Assignment 5.
Task 2: The Cluster Class
The class Cluster
is a little more involved that Dataset
, so we have broken
it up into two parts: Part A and Part B. Each part has its own test procedure
in a6test.py
, so you can fully test one part before moving on to the next.
Part A: Getters and Setters
The class invariant for the Cluster class mentions the following instance attributes:
# IMMUTABLE ATTRIBUTES
# Attribute _dataset: The Dataset for this cluster
# Invariant: _dataset is an instance of Dataset
#
# Attribute _centroid: The centroid of this cluster
# Invariant: _centroid is a point (list of int/float) whose length is
# equal to to the dimension of _dataset.
#
# MUTABLE ATTRIBUTES
# Attribute _indices: the indices of this cluster's points in the dataset
# Invariant: _indices is a list of ints. For each element ind in _indices,
# 0 <= ind < _dataset.getSize()
Once again, these attributes begin with an underscore, which means that you will
need to implement getters and setters. Once again these are straight-forward.
The only unusual one is getContents
. The return value is supposed to be a list
containing the points themselves, not the indices of the those points. You
will need to loop over the indices, creating copies of the corresponding points
from the database and collecting those copies in a newly created list.
You will also need to implement the initializer in this part of the assignment.
When you are done, remember to assert all your preconditions.
Look very carefully at the precondition for dset
. In particular, use
isinstance
, not type
. As for the other preconditions, feel free to use the
functions provided by a6dataset.py
.
Part B: distance, getRadius, and update
The last three methods are more complex methods that are necessary for the
k-means algorithm. To implement distance
, refer to the definition of euclidean
distance given earlier in this document. The
function getRadius
will loop over all of the points in the cluster, compute
their distance to the centroid and return the largest value.
The function update
is the most complex function in this class. You will need
to first go through all the points, adding up their coordinates. You need the
sum of the first coordinates of all the points in the cluster, the sum of all
the second coordinates, and so on. After that, you will need to divide by the
number of points in the cluster. Finally, you need to decide whether the centroid
has changed. The function
numpy.allclose
is mentioned in the specification. This function takes two
lists of numbers and tells you whether the corresponding numbers from the two
lists are all very close in value.
When you are done, remember to assert all your preconditions.
Testing it Out
Both parts A and B have their own test procedures, so you can test them separately. Run the tests after completing each part. If everything is working, you will get the appropriate message for each part.
Once you complete both parts, you will see that the visualizer has a lot more functionality. When you load up a data file and click the Reset button, you will see all points in the data set as blue markers in the plot window. In addition, you will see a small blue circle with black border representing an initial centroid. Finally, you can turn the small blue circle into a big one by choosing the overlay option.
There will still be the error message FAILED VISUALIZATION
printed out
somewhere in the command shell. This is normal.
Ideally, you should be done with Part A of Cluster
by
Friday, November 8.
This is not a particularly hard part of the assignment and can be done quickly.
Note that this is a day after Assignment 5
is due.
We highly recommend that you finish Part B of Cluster
by
Sunday, November 10.
Part B is one of the (two) hardest parts of the assignment. Not only will
finishing it give you experience with a more complicated class, but update
is
also a very important for-loops exercise.
Task 3: The Algorithm Class
The last class to implement is the Algorithm
class. As before, you need to
implement the methods following their specifications. The methods in this class
do a lot, but because of all the work you did in the lower-level classes, the
code for these methods is very reasonable.
Once again, we have broken this assignment up into several parts. Each part has
its own test procedure in a6test.py
.
Part A: The Function valid_seeds
When we start the algorithm, we need to begin with a collection of seeds. The seeds are our choices for the initial centroids as described above. A seed list is a collection of integers representing the positions of these centroids in the dataset. In order to be a valid seed list, the values must all be less than the size of the dataset, and no integer can appear twice.
In the class Algorithm
we allow the user to start with a random collection of
seeds, or to specify their own seed list. But that means we need to enforce the
precondition that the seed list is valid. That is the purpose of this function.
We recommend you finish this function by Monday, November 11. This is a straight-forward function, not much harder than one you might find on a lab. In fact, it is simple enough that you might want to take the time to get a hard start on the next function.
Part B: The Initializer
The initializer has the job of setting up the clusters to their initial states
so that the first iteration of the algorithm can proceed. Each cluster needs its
centroid initialized, but will start off empty (no data points) until the first
partitioning step. You will need to create a list of Cluster
objects, and each
one needs to be created using a point for its centroid.
If a seed list is specified, we use this list to choose the centroids. Once
again, remember to assert all preconditions either
using the function valid_seeds
or the functions from a6dataset.py
.
If the seed list is not present, then we need to pick the centroids at random
from the dataset. It is important to note that we do not want to select the same
point twice, so simply selecting $k$ random indices into the dataset will not do
it. Fortunately, in the random
package there is a function random.sample
that selects a number of items from
a list. The returned list has no duplicates in it. You should use this function
to select which points from the dataset will become the seeds of clusters.
There is only one getter to implement in Part B. Since Algorithm
accesses
other classes, but no classes access it, it does not need a lot of getters.
We recommend that you finish this part by Monday, November 11. This part is very short, so you should be able to do it with little difficulty at this point. In fact, you might even be able to combine it with Part A to recover some time if you are behind.
Part C: Partitioning
In the next step, you must implement the two methods necessary to partition the
data set among the different clusters. When implementing the _nearest
method,
you should use a loop over the list of clusters, since they are the possible
candidates. During the execution of your loop, you may want to keep track of
both the nearest cluster found so far and the distance to the nearest cluster
found so far. An interesting question (with more than one good answer) is what
you should initialize these values to.
The _partition
method separates all the points into clusters. It can be very
simple since the distance computation is already implemented in Cluster
.
Try to finish Part C of Algorithm
by
Wednesday, November 13.
This is the second hardest part of the assignment (after update
in Cluster
).
Once you are done with this, the rest of the assignment is very
straight-forward.
Part D: Clustering
The final part of the assignment is to complete the clustering algorithm, using
partitioning to compute the clusters at each step until they change. This
algorithm is broken up into two functions: run
and step
.
The method run
is easy as it just involves calling step
repeatedly. The only
issue is that you have to know when to stop. You do not need a while-loop to do
this (as we will not have covered these in time). Just loop maxstep
times. If
the computation converges early, exit the loop with a return statement. However,
note that this method takes an argument, so you should remember to
assert your preconditions.
The function step
is where all the work is done. It is obviously going to use
partition
as helper. But it also relies on another helper method: _update
.
The _update
method involves updating all the clusters, which is easy because
the core operation is already implemented in Cluster
. The only complication
is keeping track of whether all the centroids remained unchanged. One approach
is to do something similar to what you did for finding the nearest cluster.
Simply keep track of whether a changed centroid has been seen so far.
The step
method is simple once you have _update
and _partition
. If you do
it right, it is only two lines.
This will complete the assignment and if you have followed the microdeadlines, you should finish in time. By this point you should have completed most of the hard work and the method implementations should actually be getting easier.
Testing it Out
Once you complete each part of Algorithm
, run the unit test. If everything
is working, you will get the appropriate message, just like with Cluster
.
You can also test this part in the visualizer. When you load a data file, you should be able to run the entire clustering process. You can see the initial clusters once the Clustering initializer is working. Once part D is done, you can see the process of refining clusters step by step. You should be able to visually confirm the operation of the algorithm; after each iteration, each point should have the same color as the nearest centroid.
You might find it interesting to run k-means clustering on the examples several times, and see how the means move around. Note that if you have too few clusters, points that seem far apart end up in the same cluster. But if you have too many clusters, you get clusters that seem to lie too close to each other. Cluster analysis can be finicky.
Finishing Touches
Once you have everything working you should go back and make sure that your program meets the class coding conventions. In particular, you should check that the following are all true:
- You have indented with spaces, not tabs ({Pulsar} handles this automatically).
- Functions are each separated by two blank lines.
- Lines are short enough (~80 characters) that horizontal scrolling is not necessary.
- Docstrings are only used for specifications, not general comments.
- Your name(s) and netid(s) are in the comments at the top of the modules.
We only want you to upload the three files you modified: a6dataset.py
,
a6cluster.py
and a6algorithm.py
. We do not want any other files, even if
you added more tests to a6test.py
. Upload these files to CMS by the due date:
Friday, November 15.
Completing the Survey
In addition to turning in the assignment, we ask that you complete the survey posted in CMS. Once again, the surveys will ask about things such as how long you spent on the assignment, your impression of the difficulty, and what could be done to improve it. Please try to complete the survey within a day of turning in this assignment. Remember that participation in surveys comprise 1% of your final grade.