### CS4/5780 — Lecture 4
## Unsupervised Learning
# k-Means
Spring 2022
--- # Logistics * Placement quiz due today * Use this to gauge your preparedness for CS4/5780 * and what you need to review * Review session soon! * If you scored `$<13$` you won't get a Vocareum invite * If this is somehow in error, let me know --- #### So far...supervised learning. * Labeled training set $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n) \}$ * Goal: make predictions that are accurate on future data from the same source --- ## Unsupervised Learning * Dataset is now **unlabeled**: $\mathcal{D} = \{x_1, x_2, \ldots, x_n \}$ * Goal is to "uncover structure" in the feature vectors $x_i \in \R^d$ This is much more open-ended than supervised learning!
Q: What does "structure" even mean?
--- #### Examples of Unsupervised Learning Tasks * Clustering * Group inputs based on similarity * Dimensionality reduction * Embed inputs $x_i \in \R^d$ into $\R^m$ with $m \ll d$ * Anomaly/outlier detection * Identify rare inputs * Visualization * Decide how to display inputs --- ## Clustering * Input: $n$ data points $\mathcal{D} = \{x_1, x_2, \ldots, x_n \}$ * Output: $k$ **clusters** of the dataset * i.e., a function $\mathcal{D} \rightarrow \{1, \ldots, k\}$ --- #### Clustering Example: Topic Modeling * I have a large corpus of documents. * I want to split those documents by topic. * But I don't want to tell the system what the topics are. * I want the system to learn the topics from the corpus. * There may not even be a ground-truth set of topics. --- ## k-Means Clustering Partition data into $k$ groups where all examples in a group are close in Euclidean distance.
--- ## k-Means Clustering
--- ## k-Means Objective A k-Means clustering (the analog of a hypothesis in this case) is a partition of $\mathcal{D}$ into $k$ sets (clusters) $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k$ such that $\mathcal{C}_i \cap \mathcal{C}_j = \emptyset$ and `$\mathcal{C}_1 \cup \mathcal{C}_2 \cup \cdots \cup \mathcal{C}_k = \mathcal{D}$`.
We measure how "good" a clustering is by
`\[ Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}) = \sum_{\ell=1}^{k}\frac{1}{2\lvert\mathcal{C}_{\ell}\rvert} \sum_{i,j\in\mathcal{C}_{\ell}} \|x_{i} - x_{j}\|^{2}_{2}. \]`
This is the average squared distance between pairs of points in a cluster, weighted by cluster size.
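--- #### The k-Means Objective in Code
A minimal NumPy sketch of evaluating this objective. The function name `pairwise_objective` and the label array `z` are illustrative choices, not notation from the lecture.
```python
import numpy as np

def pairwise_objective(X, z, k):
    """Pairwise form of the k-Means objective Z(C_1, ..., C_k).

    X is an (n, d) array of points; z is an (n,) array of labels in {0, ..., k-1}.
    """
    total = 0.0
    for ell in range(k):
        C = X[z == ell]                           # points assigned to cluster ell
        if len(C) == 0:
            continue
        diffs = C[:, None, :] - C[None, :, :]     # all pairwise differences within the cluster
        total += (diffs ** 2).sum() / (2 * len(C))
    return total
```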
--- ## k-Means Objective — Centroids If `$\mu_\ell = \frac{1}{|\mathcal{C}_{\ell}|} \sum_{i \in \mathcal{C}_{\ell}} x_i$` denotes the mean (centroid) of `$\mathcal{C}_{\ell}$`,
`\begin{align*} Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}) &= \sum_{\ell=1}^{k}\frac{1}{2\lvert\mathcal{C}_{\ell}\rvert} \sum_{i,j\in\mathcal{C}_{\ell}} \|x_{i} - x_{j}\|^{2}_{2} \\&= \sum_{\ell=1}^{k}\frac{1}{2\lvert\mathcal{C}_{\ell}\rvert} \sum_{i,j\in\mathcal{C}_{\ell}} \| (x_{i} - \mu_\ell) - ( x_{j} - \mu_\ell) \|^{2}_{2} \\&= \sum_{\ell=1}^{k}\frac{1}{2\lvert\mathcal{C}_{\ell}\rvert} \sum_{i,j\in\mathcal{C}_{\ell}} \left( \| x_{i} - \mu_\ell \|^{2}_{2} + \| x_{j} - \mu_\ell \|^{2}_{2} -2 \langle x_{i} - \mu_\ell, x_{j} - \mu_\ell \rangle \right) \\&= \sum_{\ell=1}^{k}\frac{1}{2\lvert\mathcal{C}_{\ell}\rvert} \left( \lvert\mathcal{C}_{\ell}\rvert \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2} + \lvert\mathcal{C}_{\ell}\rvert \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2} -2 \langle 0, 0 \rangle \right) \\&= \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}. \end{align*}` (The cross terms vanish because `$\sum_{i\in\mathcal{C}_\ell} (x_i - \mu_\ell) = 0$`.)
--- ## k-Means Objective — Centroids So another way to write the k-Means objective is `\[ Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2} \]` where `$\mu_\ell = \frac{1}{|\mathcal{C}_{\ell}|} \sum_{i \in \mathcal{C}_{\ell}} x_i$`. This is the sum of the squares of the distances between each point and its cluster's centroid. --- ### How do we optimize this? In general, we won't be able to find the global optimum. --- ## "Augmented" k-Means Objective Now let the "centroids" `$\mu_\ell \in \R^d$` vary freely, and set `\[ Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}. \]`
Why is minimizing this objective equivalent to minimizing the original k-Means objective?
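--- #### Checking the Two Forms of the Objective
A small, self-contained numeric check that the pairwise form and the centroid form of `$Z$` agree; this is only a sketch, and the names (`centroid_objective`, the label array `z`) are illustrative rather than from the lecture.
```python
import numpy as np

def centroid_objective(X, z, k):
    """Centroid form: squared distances from each point to its cluster's mean."""
    return sum(((X[z == l] - X[z == l].mean(axis=0)) ** 2).sum()
               for l in range(k) if np.any(z == l))

# Random data and a random assignment: both forms of Z agree up to rounding.
rng = np.random.default_rng(0)
X, z = rng.normal(size=(200, 2)), rng.integers(0, 3, size=200)
pairwise = sum(((X[z == l][:, None] - X[z == l][None, :]) ** 2).sum() / (2 * (z == l).sum())
               for l in range(3))
assert np.isclose(pairwise, centroid_objective(X, z, 3))
```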
--- ## Minimizing this objective It's hard! In fact, it's **NP-Hard**. But, it's easy to minimize with respect to any one parameter, leaving the others fixed. --- ## Minimizing over $\mu$ Suppose the $\mathcal{C}$ are fixed, and we want to minimize over $\mu$. `\begin{align*} 0 &= \nabla_{\mu_j} Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) \\&= \nabla_{\mu_j} \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2} \\&= 2 \sum_{i\in\mathcal{C}_j} (\mu_j - x_{i}) = 2 \abs{\mathcal{C}_j} \mu_j - 2 \sum_{i\in\mathcal{C}_j} x_i. \end{align*}`
`$\mu_j = \frac{1}{\abs{\mathcal{C}_j}} \sum_{i\in\mathcal{C}_j} x_i$` is just the mean of $\mathcal{C}_j$.
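--- #### The Centroid Update in Code
The closed-form minimizer above is just a per-cluster mean. A minimal sketch (the function name is illustrative, and it assumes every cluster is non-empty):
```python
import numpy as np

def update_centroids(X, z, k):
    """Minimize Z over the centroids with the assignments fixed: mu_j is the mean of C_j."""
    # Assumes each cluster j in {0, ..., k-1} contains at least one point.
    return np.stack([X[z == j].mean(axis=0) for j in range(k)])
```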
--- ## Minimizing over $\mathcal{C}$ Suppose the $\mu$ are fixed, and we want to place $x_i$ in the cluster that minimizes the loss `\[ Z(\mathcal{C}_{1},\ldots,\mathcal{C}_{k}, \mu_1, \ldots, \mu_k) = \sum_{\ell=1}^{k} \sum_{i\in\mathcal{C}_{\ell}} \| x_{i} - \mu_\ell \|^{2}_{2}. \]` By inspection, this happens when we place $x_i$ in the cluster with the closest $\mu_{\ell}$, i.e. `$x_i \in \mathcal{C}_{\arg \min_{\ell \in \{1,\ldots,k\}} \| x_i - \mu_{\ell} \|_2}$`. --- ### Alternating minimization: Lloyd's algorithm Idea:
alternate minimizing `$Z$` over the centroids and the cluster assignments.
Repeat until converged: `$$\mu_\ell := \frac{1}{\abs{\mathcal{C}_\ell}} \sum_{i\in\mathcal{C}_\ell} x_i \text{ for all }\ell \in \{1,\ldots,k\}.$$` `$$\mathcal{C}_\ell := \left\{ i \in \{1,\ldots,n\} \middle| \ell = \arg \min_{l \in \{1,\ldots,k\}} \| x_i - \mu_l \|_2 \right\}.$$` If the cluster assignments didn't change, we converged.
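--- #### Lloyd's Algorithm in Code
A compact sketch of the alternating minimization described above, using NumPy only. The function name `lloyd` and the handling of empty clusters (re-seeding from a random point) are illustrative choices, not part of the lecture.
```python
import numpy as np

def lloyd(X, k, max_iters=100, seed=0):
    """Alternate between the centroid update and the assignment update until
    the cluster assignments stop changing."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = rng.integers(0, k, size=n)                    # random initial assignment
    for _ in range(max_iters):
        # Centroid step: mu_ell is the mean of cluster ell (re-seed if the cluster is empty).
        mu = np.stack([X[z == ell].mean(axis=0) if np.any(z == ell) else X[rng.integers(n)]
                       for ell in range(k)])
        # Assignment step: each point moves to its nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):                  # assignments unchanged: converged
            break
        z = z_new
    return z, mu
```
Each iteration is dominated by the distance computation, which is the `$\mathcal{O}(ndk)$` per-iteration cost mentioned on a later slide.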
--- ### Lloyd's algorithm: Animated
--- ## Why must this converge? * Loss `$Z$` is
non-increasing
at each step. * Moving a point to a different cluster with a closer centroid must strictly decrease the loss. * Updating the centroids can't increase the loss. * There are only a finite number of cluster assignments. * The algorithm can't loop, because `$Z$` strictly decreases whenever the assignments change. * So it must terminate. --- ### Lloyd's algorithm: Caveats * Doesn't necessarily converge to the global optimum of `$Z$`. * A "local optimum" of Lloyd's algorithm isn't necessarily globally optimal. * Different initializations can yield different clusters. * Computational cost is `$\mathcal{O}(ndk)$` per iteration. * `$n$` is the dataset size, `$d$` the dimension, `$k$` the number of clusters. * Total run time depends on the number of iterations. --- ## How to choose `$k$`? Could choose the `$k$` that minimizes $Z$.
Problem: with `$k = n$` each point gets its own centroid and the loss is `$Z = 0$`.
--- ## How to choose `$k$`? One heuristic:
plot `$Z$` as a function of `$k$` and choose the `$k$` at which the loss stops decreasing significantly (the "elbow" of the curve).
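--- #### The Elbow Heuristic in Code
A sketch of this heuristic on toy data, assuming matplotlib is available and reusing the hypothetical `lloyd` function from the earlier code slide.
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + 3.0 * rng.integers(0, 4, size=(300, 1))  # toy data with clumps

ks = range(1, 11)
losses = []
for k in ks:
    z, mu = lloyd(X, k)                        # Lloyd's algorithm from the earlier code slide
    losses.append(((X - mu[z]) ** 2).sum())    # objective Z at the returned clustering

plt.plot(list(ks), losses, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("k-Means objective Z")
plt.show()                                     # pick a k near the "elbow" of this curve
```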
--- ## How to choose `$k$`? Often we use k-Means as part of a larger system, where the cluster output is passed into some downstream task. Another heuristic:
choose the `$k$` that results in the best performance on the downstream task.
--- ## How to initialize? Simple approach: assign each point to a cluster at random. --- ## How to initialize? Generally better approach: assign the centroids $\mu_\ell$ at random by sampling (without replacement) from the dataset. --- ## How to initialize? Even better approach:
k-means++
Assign the centroids $\mu_\ell$ at random from the dataset, weighting each point by its squared distance to the centroids already chosen, so that the centroids are spread out.
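--- #### k-means++ Initialization in Code
A minimal sketch of the k-means++ seeding step: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. The function name is illustrative.
```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Pick k initial centroids that tend to be spread out across the data."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centroids = [X[rng.integers(n)]]              # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest already-chosen centroid.
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.stack(centroids)
```
Points that are already centroids have zero weight, so they are never chosen twice.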
--- ## How to initialize? We usually try many random initializations and pick the one that results in the lowest loss `$Z$` after running Lloyd's algorithm.
This increases our chances of getting the globally optimal solution — although it still does not guarantee anything.
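--- #### Random Restarts in Code
A sketch of this restart strategy, reusing the hypothetical `lloyd` function from the earlier code slide.
```python
import numpy as np

def best_of_restarts(X, k, n_restarts=10):
    """Run Lloyd's algorithm from several random initializations; keep the lowest-Z result."""
    best = None
    for seed in range(n_restarts):
        z, mu = lloyd(X, k, seed=seed)
        Z = ((X - mu[z]) ** 2).sum()           # objective at this local optimum
        if best is None or Z < best[0]:
            best = (Z, z, mu)
    return best                                # (Z, assignments, centroids)
```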
--- #### What do k-Means clusters look like? A cluster occupies the region of space that is nearer to its centroid than to any other centroid.
Say our cluster's centroid is `$\mu \in \R^d$` and another cluster's centroid is `$\nu \in \R^d$`. Then `$x$` will be in our cluster if `\[ \| x - \mu \|^2 \le \| x - \nu \|^2 \]` --- ### What do k-Means clusters look like? `\[ \| x - \mu \|^2 \le \| x - \nu \|^2 \]` `\[ \| x \|^2 - 2 \langle x, \mu \rangle + \| \mu \|^2 \le \| x \|^2 - 2 \langle x, \nu \rangle + \| \nu \|^2 \]` --- ### What do k-Means clusters look like? `\[ \| x - \mu \|^2 \le \| x - \nu \|^2 \]` `\[ \| x \|^2 - 2 \langle x, \mu \rangle + \| \mu \|^2 \le \| x \|^2 - 2 \langle x, \nu \rangle + \| \nu \|^2 \]` `\[ 2 \langle x, \nu \rangle - 2 \langle x, \mu \rangle \le \| \nu \|^2 - \| \mu \|^2 \]` --- ### What do k-Means clusters look like? `\[ \| x - \mu \|^2 \le \| x - \nu \|^2 \]` `\[ \| x \|^2 - 2 \langle x, \mu \rangle + \| \mu \|^2 \le \| x \|^2 - 2 \langle x, \nu \rangle + \| \nu \|^2 \]` `\[ 2 \langle x, \nu \rangle - 2 \langle x, \mu \rangle \le \| \nu \|^2 - \| \mu \|^2 \]` `\[ \langle x, \nu - \mu \rangle \le \frac{ \| \nu \|^2 - \| \mu \|^2 }{2}. \]` This is just the equation for a half-space: `$\langle x, a \rangle \le b$`. --- ### What do k-Means clusters look like? Conclusion:
a k-Means cluster lies in the intersection of half-spaces. The intersection of half-spaces is a polytope`$^*$`.
`$^*$` if we use an expanded definition of "polytope" that includes unbounded objects.
--- ### What do k-Means clusters look like? Conclusion:
a k-Means cluster lies in the intersection of half-spaces. The intersection of half-spaces is a convex polytope`$^*$`. All these intersections together form a **Voronoi diagram**.
`$^*$` if we use an expanded definition of "polytope" that includes unbounded objects.
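--- #### Checking the Half-Space Description
A small numeric check of the derivation above: the "closer to `$\mu$` than to `$\nu$`" region and the half-space inequality pick out the same points. The centroids and query points here are random, purely for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
mu, nu = rng.normal(size=2), rng.normal(size=2)    # two centroids in R^2
x = rng.normal(size=(1000, 2))                     # random query points

closer_to_mu = ((x - mu) ** 2).sum(axis=1) <= ((x - nu) ** 2).sum(axis=1)
halfspace = x @ (nu - mu) <= ((nu ** 2).sum() - (mu ** 2).sum()) / 2
assert np.array_equal(closer_to_mu, halfspace)     # same region, written as a half-space
```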
--- ### k-Means as a Voronoi Diagram
--- ## We see the same thing in 1-NN classifiers: the decision space is described by polytopes and forms a Voronoi diagram. --- ## What does this mean?
Boundaries between k-Means clusters are flat: pieces of hyperplanes.
* This limits the sorts of datasets k-Means can cluster well.