k-means++

(1.2 hours to learn)

Summary

The k-means++ algorithm is a simple stochastic procedure for choosing the initial cluster centers for the classic k-means algorithm. This initialization guarantees that the expected objective of the final clustering solution will be within a constant factor of the optimal objective. Briefly stated, k-means++ selects the initial cluster centers as follows: the first center is chosen randomly from the input data points. Next, each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.

Context

This concept has the prerequisites:

Goals

  • Understand the k-means++ initialization procedure, how it's a stochastic algorithm, and the interpretation of the optimality guarantee

Core resources (read/watch one of the following)

-Free-

k-means++: The Advantage of Careful Seeding
Location: p. 1-3 (through section 2) provides the k-means++ algorithm, while the rest of the paper provides proofs and characterizations
Authors: David Arthur,Sergei Vassilvitskii

Supplemental resources (the following are optional, but you may find them useful)

-Free-

StackOverflow
Location: accepted answer
Other notes:
  • accepted answer provides a solid definition and python implementation

See also

-No Additional Notes-