Clustering

K-means Algorithm

For a given dataset $x^{(1)}, \ldots, x^{(n)}$ without any labels $y^{(i)}$, clustering is the task of partitioning the data into subsets (clusters) such that data points in the same cluster are more similar to each other than to those in other clusters. A common choice for this task is the k-means algorithm, which takes the number of clusters $k$ as an input:

\text{Initialize cluster centroids } \mu_1, \ldots, \mu_k \text{ randomly}

\text{Repeat until convergence \{}

\quad \text{For each } i, \text{ set } \text{\{}

\quad \quad c^{(i)} \leftarrow \arg \min_j \| x^{(i)} - \mu_j \|^2

\quad \text{\}}

\quad \text{For each } j, \text{ set } \text{\{}

\quad \quad \mu_j \leftarrow \frac{\sum_{i=1}^n 1\left\{c^{(i)} = j\right\} \, x^{(i)}}{\sum_{i=1}^n 1\left\{c^{(i)} = j\right\}}

\quad \text{\}}

\text{\}}

The first loop assigns to each data point $x^{(i)}$ the index $c^{(i)}$ of its closest centroid. The second loop moves each centroid $\mu_j$ to the mean of all data points $x^{(i)}$ currently assigned to cluster $j$.
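The two loops can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation. The function names, the `seed` and `max_iters` parameters, the choice to initialize from $k$ randomly chosen data points, and the rule of leaving an empty cluster's centroid where it is are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, seed=0, max_iters=100):
    """Run k-means on a list of equal-length tuples.

    Returns (assignments, centroids), where assignments[i] is the index
    c^(i) of the centroid closest to points[i].
    """
    rng = random.Random(seed)
    # Initialization (an assumption): use k randomly chosen data points.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # First loop: c^(i) <- argmin_j ||x^(i) - mu_j||^2
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so we have converged
        assignments = new_assignments
        # Second loop: mu_j <- mean of the points assigned to cluster j
        for j in range(k):
            members = [x for x, c in zip(points, assignments) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return assignments, centroids
```

Calling `kmeans(points, k=2)` on a list of 2-D tuples returns one cluster index per point together with the final centroids. Ties in the `argmin` go to the lowest index $j$, which is a side effect of Python's `min` and not something the algorithm prescribes.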

Figure: One round of k-means on three clusters. From random centroids (left), each $x^{(i)}$ is assigned to its nearest centroid, carving the plane into Voronoi regions (middle); each $\mu_j$ then moves to the mean of its assigned points (right). The two steps alternate until the assignments stop changing.

The distortion function, which measures the sum of squared distances between each data point and its assigned centroid, is:

J(c, \mu) = \sum_{i=1}^n \| x^{(i)} - \mu_{c^{(i)}} \|^2

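Each step of the k-means algorithm either decreases $J$ or leaves it unchanged: the assignment step minimizes $J$ over $c$ with $\mu$ held fixed, and the update step minimizes $J$ over $\mu$ with $c$ held fixed. Since $J \geq 0$ is bounded below, its value is guaranteed to converge. However, $J$ is non-convex, so the algorithm may converge to a local minimum rather than the global one.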
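To make both claims concrete, the sketch below records $J$ after every half-step and runs k-means from two hand-picked initializations on a toy dataset. The dataset (four corners of a wide rectangle), the starting centroids, and all function names are assumptions chosen for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def distortion(points, assign, centroids):
    """J(c, mu) = sum_i ||x^(i) - mu_{c^(i)}||^2"""
    return sum(sq_dist(x, centroids[c]) for x, c in zip(points, assign))

def assign_step(points, centroids):
    """Minimize J over c with mu fixed: nearest centroid for each point."""
    k = len(centroids)
    return [min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in points]

def update_step(points, assign, centroids):
    """Minimize J over mu with c fixed: each centroid becomes its cluster mean."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [x for x, c in zip(points, assign) if c == j]
        if members:
            new_centroids.append(
                tuple(sum(coords) / len(members) for coords in zip(*members))
            )
        else:
            new_centroids.append(old)  # assumption: empty cluster keeps its centroid
    return new_centroids

def trace_distortion(points, centroids, max_iters=50):
    """Run k-means from the given centroids, recording J after each half-step."""
    assign = assign_step(points, centroids)
    history = [distortion(points, assign, centroids)]
    for _ in range(max_iters):
        centroids = update_step(points, assign, centroids)
        history.append(distortion(points, assign, centroids))
        new_assign = assign_step(points, centroids)
        if new_assign == assign:
            break
        assign = new_assign
        history.append(distortion(points, assign, centroids))
    return history

# Four corners of a wide rectangle, clustered with k = 2.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = trace_distortion(points, [(1.0, 0.5), (3.0, 0.5)])  # left/right split
bad = trace_distortion(points, [(2.0, 0.0), (2.0, 1.0)])   # bottom/top split
print(good)  # [5.0, 1.0]
print(bad)   # [16.0, 16.0]
```

Both traces are non-increasing, but the second initialization is already a fixed point of the two steps: it pairs the points bottom/top and stays at $J = 16$, while the left/right split reaches $J = 1$.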

Figure: The same dataset clustered from two initializations. A lucky initialization recovers the three clusters ($J \approx 96$, left); an unlucky one splits one cluster and merges the other two, a far worse local optimum ($J \approx 659$, right). In practice we re-run k-means from several initializations and keep the result with the lowest $J$.
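The re-run strategy could look roughly like the sketch below. The helper names, the `n_restarts` and `seed` parameters, and the initialization from randomly chosen data points are assumptions made for the example, not part of the algorithm's definition.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def run_kmeans(points, k, rng, max_iters=100):
    """One k-means run from a random initialization; returns (J, centroids)."""
    centroids = rng.sample(points, k)  # assumption: start from k data points
    assign = None
    for _ in range(max_iters):
        new_assign = [
            min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in points
        ]
        if new_assign == assign:
            break
        assign = new_assign
        for j in range(k):
            members = [x for x, c in zip(points, assign) if c == j]
            if members:
                centroids[j] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    J = sum(sq_dist(x, centroids[c]) for x, c in zip(points, assign))
    return J, centroids

def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the run with the lowest J."""
    rng = random.Random(seed)
    runs = [run_kmeans(points, k, rng) for _ in range(n_restarts)]
    best = min(runs, key=lambda run: run[0])
    return best, [J for J, _ in runs]
```

`best_of_restarts(points, k)` returns the best `(J, centroids)` pair along with the distortion of every run, which makes it easy to see how often a poor local optimum was reached. More restarts lower the chance of keeping a bad solution but cannot rule it out.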
