### CS4/5780 — Lecture 4
## Unsupervised Learning
# k-Means
Spring 2022
--- # Logistics * Placement quiz due today * Use this to gauge your preparedness for CS4/5780 * and what you need to review * Review session soon! * If you scored `$<13$` you won't get a Vocareum invite * If this is somehow in error, let me know --- #### So far...supervised learning. * Labeled training set $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n) \}$ * Goal: make predictions that are accurate on future data from the same source --- ## Unsupervised Learning * Dataset is now **unlabeled**: $\mathcal{D} = \{x_1, x_2, \ldots, x_n \}$ * Goal is to "uncover structure" in the feature vectors $x_i \in \R^d$ This is much more open-ended than supervised learning!
Q: What does "structure" even mean?
--- #### Examples of Unsupervised Learning Tasks * Clustering * Group inputs based on similarity * Dimensionality reduction * Embed inputs $x_i \in \R^d$ into $\R^m$ with $m \ll d$ * Anomaly/outlier detection * Identify rare inputs * Visualization * Decide how to display inputs --- ## Clustering * Input: $n$ data points $\mathcal{D} = \{x_1, x_2, \ldots, x_n \}$ * Output: $k$ **clusters** of the dataset * i.e., a function $\mathcal{D} \rightarrow \{1, \ldots, k\}$ --- #### Clustering Example: Topic Modeling * I have a large corpus of documents. * I want to split those documents by topic. * But I don't want to tell the system what the topics are. * I want the system to learn the topics from the corpus. * There may not even be a ground-truth set of topics. --- ## k-Means Clustering Partition data into $k$ groups where all examples in a group are close in Euclidean distance.
--- ## k-Means Clustering
--- ## k-Means Objective A k-Means clustering (the analog of a hypothesis in this case) is a partition of $\mathcal{D}$ into $k$ sets (clusters) $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k$ such that $\mathcal{C}_i \cap \mathcal{C}_j = \emptyset$ and `$\mathcal{C}_1 \cup \mathcal{C}_2 \cup \cdots \cup \mathcal{C}_k = \mathcal{D}$`.
We measure how "good" a clustering is by
`\[ Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}) = \sum_{\ell=1}^{k}\frac{1}{2\lvert\mathcal{C}_{\ell}\rvert} \sum_{i,j\in\mathcal{C}_{\ell}} \|x_{i} - x_{j}\|^{2}_{2}. \]`
This is the average squared distance between pairs of points in a cluster, weighted by cluster size.
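--- #### The k-Means Objective in Code
A minimal NumPy sketch of evaluating this objective. The function name `pairwise_objective` and the label array `z` are illustrative choices, not notation from the lecture.
```python
import numpy as np

def pairwise_objective(X, z, k):
    """Pairwise form of the k-Means objective Z(C_1, ..., C_k).

    X is an (n, d) array of points; z is an (n,) array of labels in {0, ..., k-1}.
    """
    total = 0.0
    for ell in range(k):
        C = X[z == ell]                           # points assigned to cluster ell
        if len(C) == 0:
            continue
        diffs = C[:, None, :] - C[None, :, :]     # all pairwise differences within the cluster
        total += (diffs ** 2).sum() / (2 * len(C))
    return total
```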
--- ## k-Means Objective — Centroids If `$\mu_\ell = \frac{1}{|\mathcal{C}_{\ell}|} \sum_{i \in \mathcal{C}_{\ell}} x_i$` denotes the mean (centroid) of `$\mathcal{C}_{\ell}$`,
`\begin{align*} Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}) &= \sum_{\ell=1}^{k}\frac{1}{2\lvert\mathcal{C}_{\ell}\rvert} \sum_{i,j\in\mathcal{C}_{\ell}} \|x_{i} - x_{j}\|^{2}_{2} \\&= \sum_{\ell=1}^{k}\frac{1}{2\lvert\mathcal{C}_{\ell}\rvert} \sum_{i,j\in\mathcal{C}_{\ell}} \| (x_{i} - \mu_\ell) - ( x_{j} - \mu_\ell) \|^{2}_{2} \\&= \sum_{\ell=1}^{k}\frac{1}{2\lvert\mathcal{C}_{\ell}\rvert} \sum_{i,j\in\mathcal{C}_{\ell}} \left( \| x_{i} - \mu_\ell \|^{2}_{2} + \| x_{j} - \mu_\ell \|^{2}_{2} -2 \langle x_{i} - \mu_\ell, x_{j} - \mu_\ell \rangle \right) \\&= \sum_{\ell=1}^{k}\frac{1}{2\lvert\mathcal{C}_{\ell}\rvert} \left( \lvert\mathcal{C}_{\ell}\rvert \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2} + \lvert\mathcal{C}_{\ell}\rvert \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2} -2 \langle 0, 0 \rangle \right) \\&= \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}. \end{align*}` (The cross terms vanish because `$\sum_{i\in\mathcal{C}_\ell} (x_i - \mu_\ell) = 0$`.)
--- ## k-Means Objective — Centroids So another way to write the k-Means objective is `\[ Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2} \]` where `$\mu_\ell = \frac{1}{|\mathcal{C}_{\ell}|} \sum_{i \in \mathcal{C}_{\ell}} x_i$`. This is the sum of the squares of the distances between each point and its cluster's centroid. --- ### How do we optimize this? In general, we won't be able to find the global optimum. --- ## "Augmented" k-Means Objective Now let the "centroids" `$\mu_\ell \in \R^d$` vary freely, and set `\[ Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}. \]`
Why is minimizing this objective equivalent to minimizing the original k-Means objective?
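--- #### Checking the Two Forms of the Objective
A small, self-contained numeric check that the pairwise form and the centroid form of `$Z$` agree; this is only a sketch, and the names (`centroid_objective`, the label array `z`) are illustrative rather than from the lecture.
```python
import numpy as np

def centroid_objective(X, z, k):
    """Centroid form: squared distances from each point to its cluster's mean."""
    return sum(((X[z == l] - X[z == l].mean(axis=0)) ** 2).sum()
               for l in range(k) if np.any(z == l))

# Random data and a random assignment: both forms of Z agree up to rounding.
rng = np.random.default_rng(0)
X, z = rng.normal(size=(200, 2)), rng.integers(0, 3, size=200)
pairwise = sum(((X[z == l][:, None] - X[z == l][None, :]) ** 2).sum() / (2 * (z == l).sum())
               for l in range(3))
assert np.isclose(pairwise, centroid_objective(X, z, 3))
```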
--- ## Minimizing this objective It's hard! In fact, it's **NP-Hard**. But, it's easy to minimize with respect to any one parameter, leaving the others fixed. --- ## Minimizing over $\mu$ Suppose the $\mathcal{C}$ are fixed, and we want to minimize over $\mu$. `\begin{align*} 0 &= \nabla_{\mu_j} Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) \\&= \nabla_{\mu_j} \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2} \\&= 2 \sum_{i\in\mathcal{C}_j} (\mu_j - x_{i}) = 2 \abs{\mathcal{C}_j} \mu_j - 2 \sum_{i\in\mathcal{C}_j} x_i. \end{align*}`
`$\mu_j = \frac{1}{\abs{\mathcal{C}_j}} \sum_{i\in\mathcal{C}_j} x_i$` is just the mean of $\mathcal{C}_j$.
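--- #### The Centroid Update in Code
The closed-form minimizer above is just a per-cluster mean. A minimal sketch (the function name is illustrative, and it assumes every cluster is non-empty):
```python
import numpy as np

def update_centroids(X, z, k):
    """Minimize Z over the centroids with the assignments fixed: mu_j is the mean of C_j."""
    # Assumes each cluster j in {0, ..., k-1} contains at least one point.
    return np.stack([X[z == j].mean(axis=0) for j in range(k)])
```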
--- ## Minimizing over $\mathcal{C}$ Suppose the $\mu$ are fixed, and we want to place $x_i$ in the cluster that minimizes the loss `\[ Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}. \]` By inspection, this happens when we place $x_i$ in the cluster with the closest $\mu_{\ell}$, i.e. `$x_i \in \mathcal{C}_{\arg \min_{\ell \in \{1,\ldots,k\}} \| x_i - \mu_{\ell} \|_2}$`. --- ### Alternating minimization: Lloyd's algorithm Idea:
alternate minimizing `$Z$` over the centroids and the cluster assignments.
Repeat until converged: `$$\mu_\ell := \frac{1}{\abs{\mathcal{C}_\ell}} \sum_{i\in\mathcal{C}_\ell} x_i \text{ for all }\ell \in \{1,\ldots,k\}.$$` `$$\mathcal{C}_\ell := \left\{ i \in \{1,\ldots,n\} \middle| \ell = \arg \min_{l \in \{1,\ldots,k\}} \| x_i - \mu_l \|_2 \right\}.$$` If the cluster assignments didn't change, we converged.
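--- #### Lloyd's Algorithm in Code
A compact sketch of the alternating minimization described above, using NumPy only. The function name `lloyd` and the handling of empty clusters (re-seeding from a random point) are illustrative choices, not part of the lecture.
```python
import numpy as np

def lloyd(X, k, max_iters=100, seed=0):
    """Alternate between the centroid update and the assignment update until
    the cluster assignments stop changing."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = rng.integers(0, k, size=n)                    # random initial assignment
    for _ in range(max_iters):
        # Centroid step: mu_ell is the mean of cluster ell (re-seed if the cluster is empty).
        mu = np.stack([X[z == ell].mean(axis=0) if np.any(z == ell) else X[rng.integers(n)]
                       for ell in range(k)])
        # Assignment step: each point moves to its nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):                  # assignments unchanged: converged
            break
        z = z_new
    return z, mu
```
Each iteration is dominated by the distance computation, which is the `$\mathcal{O}(ndk)$` per-iteration cost mentioned on a later slide.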
--- ### Lloyd's algorithm: Animated
--- ## Why must this converge? * Loss `$Z$` is
non-increasing
at each step. * Moving a point to a different cluster with a closer centroid must strictly decrease the loss. * Updating the centroids can't increase the loss. * There are only a finite number of cluster assignments. * The algorithm can't loop, because `$Z$` strictly decreases whenever the assignments change. * So it must terminate. --- ### Lloyd's algorithm: Caveats * Doesn't necessarily converge to the global optimum of `$Z$`. * A "local optimum" of Lloyd's algorithm isn't necessarily globally optimal. * Different initializations can yield different clusters. * Computational cost is `$\mathcal{O}(ndk)$` per iteration. * `$n$` is the dataset size, `$d$` the dimension, `$k$` the number of clusters. * Total run time depends on the number of iterations. --- ## How to choose `$k$`? Could choose the `$k$` that minimizes $Z$.
Problem: with `$k = n$` each point gets its own centroid and the loss is `$Z = 0$`.
--- ## How to choose `$k$`? One heuristic:
plot `$Z$` as a function of `$k$` and choose the `$k$` at which the loss stops decreasing significantly (the "elbow" of the curve).
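--- #### The Elbow Heuristic in Code
A sketch of this heuristic on toy data, assuming matplotlib is available and reusing the hypothetical `lloyd` function from the earlier code slide.
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + 3.0 * rng.integers(0, 4, size=(300, 1))  # toy data with clumps

ks = range(1, 11)
losses = []
for k in ks:
    z, mu = lloyd(X, k)                        # Lloyd's algorithm from the earlier code slide
    losses.append(((X - mu[z]) ** 2).sum())    # objective Z at the returned clustering

plt.plot(list(ks), losses, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("k-Means objective Z")
plt.show()                                     # pick a k near the "elbow" of this curve
```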
--- ## How to choose `$k$`? Often we use k-Means as part of a larger system, where the cluster output is passed into some downstream task. Another heuristic:
choose the `$k$` that results in the best performance on the downstream task.
--- ## How to initialize? Simple approach: assign each point to a cluster at random. --- ## How to initialize? Generally better approach: assign the centroids $\mu_\ell$ at random by sampling (without replacement) from the dataset. --- ## How to initialize? Even better approach:
k-means++
Assign the centroids $\mu_\ell$ at random from the dataset, weighting each point by its squared distance to the centroids already chosen, so that the centroids are spread out.
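--- #### k-means++ Initialization in Code
A minimal sketch of the k-means++ seeding step: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. The function name is illustrative.
```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Pick k initial centroids that tend to be spread out across the data."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centroids = [X[rng.integers(n)]]              # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest already-chosen centroid.
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.stack(centroids)
```
Points that are already centroids have zero weight, so they are never chosen twice.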
--- ## How to initialize? We usually try many random initializations and pick the one that results in the lowest loss `$Z$` after running Lloyd's algorithm.
This increases our chances of getting the globally optimal solution — although it still does not guarantee anything.
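--- #### Random Restarts in Code
A sketch of this restart strategy, reusing the hypothetical `lloyd` function from the earlier code slide.
```python
import numpy as np

def best_of_restarts(X, k, n_restarts=10):
    """Run Lloyd's algorithm from several random initializations; keep the lowest-Z result."""
    best = None
    for seed in range(n_restarts):
        z, mu = lloyd(X, k, seed=seed)
        Z = ((X - mu[z]) ** 2).sum()           # objective at this local optimum
        if best is None or Z < best[0]:
            best = (Z, z, mu)
    return best                                # (Z, assignments, centroids)
```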
--- #### What do k-Means clusters look like? A cluster occupies the region of space that is nearer to its centroid than to any other centroid.
Say our cluster's centroid is `$\mu \in \R^d$` and another cluster's centroid is `$\nu \in \R^d$`. Then `$x$` will be in our cluster if `\[ \| x - \mu \|^2 \le \| x - \nu \|^2 \]` --- ### What do k-Means clusters look like? `\[ \| x - \mu \|^2 \le \| x - \nu \|^2 \]` `\[ \| x \|^2 - 2 \langle x, \mu \rangle + \| \mu \|^2 \le \| x \|^2 - 2 \langle x, \nu \rangle + \| \nu \|^2 \]` --- ### What do k-Means clusters look like? `\[ \| x - \mu \|^2 \le \| x - \nu \|^2 \]` `\[ \| x \|^2 - 2 \langle x, \mu \rangle + \| \mu \|^2 \le \| x \|^2 - 2 \langle x, \nu \rangle + \| \nu \|^2 \]` `\[ 2 \langle x, \nu \rangle - 2 \langle x, \mu \rangle \le \| \nu \|^2 - \| \mu \|^2 \]` --- ### What do k-Means clusters look like? `\[ \| x - \mu \|^2 \le \| x - \nu \|^2 \]` `\[ \| x \|^2 - 2 \langle x, \mu \rangle + \| \mu \|^2 \le \| x \|^2 - 2 \langle x, \nu \rangle + \| \nu \|^2 \]` `\[ 2 \langle x, \nu \rangle - 2 \langle x, \mu \rangle \le \| \nu \|^2 - \| \mu \|^2 \]` `\[ \langle x, \nu - \mu \rangle \le \frac{ \| \nu \|^2 - \| \mu \|^2 }{2}. \]` This is just the equation for a half-space: `$\langle x, a \rangle \le b$`. --- ### What do k-Means clusters look like? Conclusion:
a k-Means cluster lies in the intersection of half-spaces. The intersection of half-spaces is a polytope`$^*$`.
`$^*$` if we use an expanded definition of "polytope" that includes unbounded objects.
--- ### What do k-Means clusters look like? Conclusion:
a k-Means cluster lies in the intersection of half-spaces. The intersection of half-spaces is a convex polytope`$^*$`. All these intersections together form a **Voronoi diagram**.
`$^*$` if we use an expanded definition of "polytope" that includes unbounded objects.
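--- #### Checking the Half-Space Description
A small numeric check of the derivation above: the "closer to `$\mu$` than to `$\nu$`" region and the half-space inequality pick out the same points. The centroids and query points here are random, purely for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
mu, nu = rng.normal(size=2), rng.normal(size=2)    # two centroids in R^2
x = rng.normal(size=(1000, 2))                     # random query points

closer_to_mu = ((x - mu) ** 2).sum(axis=1) <= ((x - nu) ** 2).sum(axis=1)
halfspace = x @ (nu - mu) <= ((nu ** 2).sum() - (mu ** 2).sum()) / 2
assert np.array_equal(closer_to_mu, halfspace)     # same region, written as a half-space
```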
--- ### k-Means as a Voronoi Diagram
--- ## We see the same thing in 1-NN classifiers: the decision space is described by polytopes and forms a Voronoi diagram. --- ## What does this mean?
Boundaries between k-Means clusters are flat: pieces of hyperplanes.
* This limits the sorts of datasets k-Means can cluster well.