Processing math: 100%

Tuesday, Thursday
9:05 (ILR 315)
11:15 (Call Aud)

CS 1110: Introduction to Computing Using Python

Fall 2018

Main

About:
   Announcements
   Course Staff
   Times & Places
   Calendar

Materials:
   Lectures
   Texts
   Python
   Text Shell
   Videos

Assessment:
   Grading
   Assignments
   Labs
   Exams

Resources:
   CMS
   Piazza
   AEWs
   FAQ
   Python API
   Introcs API
   Python Tutor

Style Guide
Terminology

Academic Integrity

Assignment 6:
Clustering

Due to CMS by Wednesday, November 14th at 11:59 pm.

Big Data is a major computer science topic these days. As computers become both ubiquitous and more powerful, many applications — from science to business to entertainment — are generating huge amounts of data. This data is too large to process by conventional means. That means we need new tools (such as new programming languages) and new algorithms to handle it all. The CS department has several upper level courses (such as CS 4780: Machine Learning for Intelligent Systems) dedicated to this topic.

In this assignment you will work with one of the most popular (though not necessarily most efficient) tools for analyzing large amounts of data: k-means clustering. In k-means clustering, you organize the data into a small number (k) of clusters of similar values. For example, in the picture to the right, there are a lot of points in 3-dimensional space. These points are color-coded into five clusters, with all the points in a cluster being near to one another.

If you look at the wikipedia entry for k-means clustering, you can see that it has many, many applications. It has been used in market analysis, recommender systems (e.g. how Netflix recommends new movies to you), and computer vision, just to name a few. It is an essential tool to have if you are working with data. With the skills and knowledge you have been acquiring in CS 1110, you are now able to implement it yourself!

Important: We know that this assignment straddles the exam. This is unfortunate timing of the semester. In addition, it is the only assignment in which you will get practice writing classes before we ask you to write classes on the exam. We highly recommend that you follow the recommended strategy. This puts micro-deadlines on each of the assignment, and ensures that you have finished enough of the assignment before the exam.

Learning Objectives

This assignment has several important objectives.

It gives you practice with writing your own classes.
It gives you practice with writing code spread across many files.
It illustrates the importance of class invariants.
It gives you practice with using data encapsulation in classes.
It gives you practice with both both 1-dimensional and 2-dimensional lists.
It gives you experience with structuring your code with helper functions.
It demonstrates how to implement a powerful data-analysis algorithm on real data.

Authors: W. White, D. Murphy, T. Westura, S. Marschner, L. Lee

Academic Integrity and Collaboration
K-Means Clustering

Distances between Points
Cluster Centroids
The Algorithm
Classes in our Implementation

How to Work on the Assignment
Recommended Implementation Strategy

The Dataset Class
The Cluster Class
The KMeans Algorithm Class

Finishing the Assignment

Academic Integrity and Collaboration

As we explained in class, this assignment is returning after a bried hiatus. There are several changes (not the least of which it is now in Python 3), but there is some relevant code out in the wild. Please avoid looking at solutions online, or looking code from other students this semester, or semesters previous.

In this assignment, it is highly unlikely that your code for this assignment will look exactly like someone else's. We will be using Moss to check for instances of plagiarism. We also ask that you do not enable violations of academic policy. Do not post your code to Pastebin, GitHub, or any other publicly accessible site.

Collaboration Policy

You may do this assignment with one other person. If you are going to work together, then form your group on CMS as soon as possible. If you do this assignment with another person, you must work together. It is against the rules for one person to do some programming on this assignment without the other person sitting nearby and helping.

With the exception of your CMS-registered partner, we ask that you do not look at anyone else's code or show your code to anyone else (except a CS1110 staff member) in any form whatsoever. This includes posting your code on Piazza to ask for help. It is okay to post error messages on Piazza, but not code. If we need to see your code, we will ask for it.

K-Means Clustering

In this section, we describe what cluster analysis is, and how we use k-means clustering to implement it. Cluster analysis is the act of grouping together similar points into groups. For example, one of the applications of clustering is epidemiological analysis. Suppose that you had data with 2-dimensional longitude-latitude pairs of where birds carrying different strains of avian flu were found. Clustering this data into groups can give insight into different regions where a flu strain appeared.

When we say point, it could be a 2-dimensional point, a 3-dimensional point, a 4-dimensional point, and so on. In fact, we could have points of arbitrary dimension, particularly if the data is not spatial. For example, one of the data sets for this assignment is a fictional market analysis of candies. Each candy is a 4-dimensional point that represents how sweet, sour, nutty, and crunchy it is.

Distances between Points

When we cluster points, we use the Euclidean Distance to measure the distance between them. You may already know how to compute the distance between two 2-dimensional points: $d(\mathbf{p},\mathbf{q}) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}$ or between 3-dimensional points: $d(\mathbf{p},\mathbf{q}) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2 + (p_z - q_z)^2}.$ These are special cases of the general definition: given two $n$ -dimensional points $\textbf{p} = (p_1, p_2, \ldots, p_n)$ and $\textbf{q} = (q_1, q_2, \ldots, q_n)$ , the Euclidean distance between them is:

$d(\textbf{p},\textbf{q}) = \sqrt{(p_1-q_1)^2+(p_2-q_2)^2+\cdots+(p_n-q_n)^2}$

For example, suppose we have two points: $\textbf{p} = (0.1,0.2,0.3,0.4)$ and $\textbf{q} = (0.0,0.2,0.3,0.2)$ . Then:

$d(\textbf{p},\textbf{q}) = \sqrt{(0.1-0.0)^2+(0.2-0.2)^2+(0.3-0.3)^2+(0.4-0.2)^2} = 0.7071\ldots$

Cluster Centroids

Given a set of points, we can identify its centroid. The centroid is the “center of mass” of the cluster, which is to say, the average, or mean, of all of the points in the cluster. The centroid might happen to be one of the points in the cluster, but it does not have to be. For example, the picture to the right is an example of a 4-point cluster (black circles) with a centroid (red circle) that is not one of the points in the cluster.

To compute a centroid, we average. Remember that to average, you add up a collection of things and divide by the total. The same is true for points; we add up all the points, and then “divide” the result by the total number of points. Adding two points results in a new point that is the coordinate-wise addition of the two input points:

$\textbf{p}+\textbf{q} = (p_1+q_1, p_2+q_2, \ldots, p_n+q_n).$

When we divide a point by a number, the division is done coordinate-wise, as follows:

$\textbf{p}/a = (p_1/a, \>p_2/a,\> \ldots,\> p_n/a)$

The Algorithm

The idea of k-means is quite simple: each point should belong to the cluster whose mean (centroid) it is closest to. But, the centroid of a cluster depends on which points you put into it. This creates a kind of chicken-and-egg problem, which is solved using an iterative approach: first assign points to clusters, then compute new centroids for these clusters, then re-assign points using the new centroids, and so on.

To make this general idea into a precise algorithm we can implement, we break down k-means clustering into several steps. Throughout, let $n$ be the dimension of the points.

1. Pick $k$ Centroids

Our goal is to divide the points into $k$ adjacent groups. The first step is to guess $k$ centroids. Throughout these instructions, we will refer to the centroids as $\textbf{m}_1,\textbf{m}_2,\ldots, \textbf{m}_k$ . (The letter m stands for “mean”, or centroid, which is where the algorithm's name comes from.)

Remember that the centroids are themselves $n$ -dimensional points. Hence, for any mean $\textbf{m}_j$ , where $1 <= j <= k$ , we have

$\textbf{m}_j = (m_{j,1},m_{j,2},\ldots,m_{j,n})$

So each $m_{j,i}$ represents the ith coordinate of the jth centroid.

To pick $k$ centroids, all we do is choose $k$ points from our original dataset. We will choose these points at random, as if drawing cards from a deck. This is referred to as the Forgy Method, and it is one of the most popular ways to start k-means clustering.

2. Partition the Dataset

Now that we have $k$ centroids, we assign every point in the data set to a centroid. We do this by matching each point to the nearest centroid $\textbf{m}_j$ , and then have the points assigned to each centroid form a new cluster. This is known as partitioning because each point will belong to one and only one cluster.

If there happens to be a tie between which centroids are closest to a given point, we arbitrarily break the tie in favor of the $\textbf{m}_j$ with smallest $j$ . If we put the centroids in a list, this would be the centroid that comes first in the list.

There will then be $k$ sets $S_1, S_2, \ldots, S_k$ , where set $S_j$ is the cluster consisting of all the points associated with centroid $\textbf{m}_j$ .

3. Recompute the Means

Once we have the sets $S_1, S_2, \ldots, S_k$ , we need to compute a new mean $\textbf{m}_j$ for each $S_j$ . This new mean is just the average of all of the points in that set. Let

$S_j = \lbrace \textbf{q}_{1},\textbf{q}_{2},\ldots,\textbf{q}_{c_j}\rbrace$

be the set of points associated with centroid $\textbf{m}_j$ , where $c_j$ is the number of points inside of $S_j$ . Then the new mean is the $n$ -dimensional point

$\textbf{m}_j = \frac{1}{c_j}(\textbf{q}_{1}+\textbf{q}_{2}+\cdots+\textbf{q}_{c_j}).$

The formula above uses the rules for adding points and dividing points by numbers that we discussed above. If you do not understand this formula, please talk to a staff member immediately.

Now, because we have recomputed the means, some points might be closer to the centroid for a cluster they are not currently in than to the centroid for the cluster they are in. To fix this, we repartition the dataset by going back to step 2 and proceeding as before.

4. Repeat Steps 2 and 3 Until Convergence

The clustering algorithm continues this process: recompute the means and repartition the dataset. We want to keep doing this until the means stop changing. If all the means are unchanged after step 3, we know that partitioning again will produce the same clusters, which will have the same means, so there is no point in continuing. When this happens, we say that the algorithm has converged.

Sometimes it can take a very long time to converge — thousands of thousands of steps. Therefore, when we implement k-means clustering, we often add a “time-out factor”. This is a maximum number of times to repeat steps 2 and 3. If the algorithm does not converge before this maximum number of times, we stop anyway, and use the clusters that we do have.

Classes Used in Our Implementation

Classes are ideal for representing complex mathematical objects. For example, we saw in class how to use classes to represent rectangles or times of day. There are several interesting mathematical objects in the description above (points, centroids, sets/clusters) and we could use classes to represent all of these.

Deciding which of these things to create classes for is the core of object-oriented software design. For this assignment we have made these decisions for you, since it is not always easy to hit the right balance of complexity against structure the first time, and we want you to spend your time implementing rather than redesigning.

For our implementation of k-means, we decided not to use classes to represent points and centroids, because Python's built-in lists serve well enough. In particular, you should not use the Point class in introcs. That class only supports 3-dimensional points, but we would like to support points in any dimension.

In the end, we decided to create classes for three important concepts in k-means. First there is the entire dataset, represented by an instance of the class Dataset. Second, there is a cluster of data, represented by an instance of the class Cluster. Finally, there is the set of clusters created by k-means, represented by an instance of the class ClusterGroup.

Points are Lists

Throughout this assignment, a point will be represented as a list of numbers; both ints and floats are allowed. This is true for centroids as well. For example, here is an example of a possible centroid if we were working with 4-dimensional points:

$\textbf{m}_j \mbox{ is represented as the list } [0.1, 0.5, 0.3, 0.7].$

In this example, if x stores the name of the list, then $m_{j,2}$ (using the notation from a previous section) is x[1], or 0.5, while $m_{j,4}$ is x[3], or 0.7. Note that our mathematical notation starts at 1, while list indices start at 0.

Class `Dataset`

The k-means algorithm starts with a dataset—a collection of points that will be grouped into clusters. This dataset is stored in an instance of the class Dataset. This class is very simple; it just holds the list of points, so its most important attribute is:

_contents: A list of points inside this dataset

For example, if a dataset contains the points (0.1,0.5,0.3,0.7) and (0.2,0.0,0.2,0.1), then the attribute _contents is

     [[0.1,0.5,0.3,0.7], [0.2,0.0,0.2,0.1]]

The points are all lists of numbers, and they all have to be the same length; remember that this length is the dimension of the data. We keep track of this dimension in a second attribute:

_dimension: The dimension of all points in this dataset

The fact that all the points are supposed to have the right dimension is expressed by a class invariant:

The number of columns in _contents is equal to _dimension. That is, for every item _contents[i] in the list _contents, len(_contents[i]) == _dimension.

You might argue that storing the dimension separately is redundant, since you could always look at the points in _contents to find it, but we would like the dimension to be defined even before we add any points to the dataset.

One of the things that you might notice about these attributes is that they all start with an underscore. As we mentioned in class, that is a coding convention to indicate that the attributes should not be accessed directly outside of the class Dataset. In particular, neither Cluster not ClusterGroup should access these attributes at all. When those two classes do need to access these attributes, they should do it through getters and setters. One of the things you will do in this assignment is implement these getters and setters.

Important: If a method of any class accesses the hidden attributes of an object in another class, you will lose points.

Class `Cluster`

The next idea in the algorithm that we have defined a class for is a cluster $S_j$ . To support this type, we defined the class Cluster. Objects of this class will actually hold two different values: the centroid $\textbf{m}_j$ and the set $S_j$ . We do this because the centroid and cluster are associated with one another, so it makes sense to store them in a single object. We need two attributes corresponding to these ideas:

_centroid: A list of numbers representing the point $\textbf{m}_j$
_indices: A list of integers that are the indices where the points of this cluster can be found in the dataset.

Note that we are not storing the points belonging to a cluster; we are only storing the indices of the points in the dataset. This simplifies Cluster because it does not need to worry about maintaining lists of lists. Code that works with clusters does need to be able to get to the points, however, so one of its instance variables is a reference to the dataset:

_dataset: The dataset of which this cluster is a subset.

Once again, these attributes start with an underscore, so we want getters and setters if we are going to access them in the class Algorithm (it is okay if a method in Cluster accesses one of its own attributes with an underscore in it). So again, you need getters, like getCentroid and setters like addIndex. There is also a special getter getContents that returns a copy of the points to be used by the visualizers.

In addition to the getters and setters, the class Cluster is the logical place to put code that does computations involving a single cluster. Reading the description of the k-means algorithm above, we find that the two core operations with clusters are computing the distance from a point to a cluster's centroid and finding a new centroid for the cluster. These operations are handled by the two most important methods of the Cluster class:

A method to find the distance from a point to the cluster (distance)
A method to compute a new centroid for a cluster (update)

Finally, this class contains the special Python methods __str__ and __repr__. This methods are to help you with printing clusters more informative while debugging.

Class `Algorithm`

The k-means algorithm needs to work with several clusters at once. Hence it does not make sense to put this code in the Cluster class. Instead, we put this code in a separate class called Algorithm. The data stored in this class is simple: a list of all clusters in the algorithm, and as with Cluster, a reference back to the dataset.

_dataset: The dataset we are clustering
_clusters: The list of current clusters $S_1, \ldots, S_k$

Once again, we have getters and setters for these attributes. But the important methods of class ClusterGroup are the core of the k-means algorithm.

A method to partition the dataset, implementing step 2 of the algorithm (_partition)
A helper method for _partition to find the nearest cluster to a point (_nearest)
A method to update all the cluster centroids, implementing step 3 (_update)

Finally there are the methods that orchestrate the whole process:

A method that executes one step of the process, updating the centroids and re-partitioning the data (step)
A method that computes a clustering, from start to finish (run)

So a user of the class who just wants to cluster a dataset into $k$ clusters would create a Dataset, add all the points to it, create a Algorithm with $k$ clusters, and then call run for the algorithm object.

You will notice that some of the methods have names starting with an underscore. The meaning is the same as with attribues: these methods are meant to be used internally rather than called directly by code outside the Algorithm class.

How to Work on the Assignment

This assignment involves implementing several things before you'll have the whole method working. As always, the key to finishing is to pace yourself and make effective use of all of the unit tests and the visualizer that we provide.

Assignment Source Code

The assignment code is much more complicated than it has been in previous assignments. To make sure that you have everything, we have provided with two zip files.

cluster.zip: The package for the assignment application code

Course Material Authors: D. Gries, L. Lee, S. Marschner, & W. White (over the years)

Assignment 6: Clustering

Learning Objectives

Table of Contents

Academic Integrity and Collaboration

Collaboration Policy

K-Means Clustering

Distances between Points

Cluster Centroids

The Algorithm

1. Pick kk Centroids

2. Partition the Dataset

3. Recompute the Means

4. Repeat Steps 2 and 3 Until Convergence

Classes Used in Our Implementation

Points are Lists

Class Dataset

Class Cluster

Class Algorithm

How to Work on the Assignment

Assignment Source Code

Pacing Yourself

Running the Application

Using the Test Cases

Using the Visualizer

Using CSV Files

Asserting Preconditions

Recommended Implementation Strategy

Task 1: The Dataset Class

Asserting Preconditions

Testing it Out

The Cluster Class

Part A: Getters and Setters

Part B: distance, getRadius, and update

Testing it Out

The Algorithm Class

Part A: The Initializer

Part B: Partitioning

Part C: Update

Part D: Run

Finishing the Assignment

Survey

Assignment 6:
Clustering

1. Pick $k$ Centroids

Class `Dataset`

Class `Cluster`

Class `Algorithm`