# k-means++

(1.2 hours to learn)

## Summary

The k-means++ algorithm is a simple stochastic procedure for choosing the initial cluster centers for the classic k-means algorithm. This initialization guarantees that the expected objective of the final clustering solution will be within a constant factor of the optimal objective. Briefly stated, k-means++ selects the initial cluster centers as follows: the first center is chosen randomly from the input data points. Next, each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.

## Context

This concept has the prerequisites:

- k-means
- probability (k-means++ selects input data points as initial cluster locations with probability proportional to the points squared distance from the nearest existing cluster center)

## Goals

- Understand the k-means++ initialization procedure, how it's a stochastic algorithm, and the interpretation of the optimality guarantee

## Core resources (read/watch one of the following)

## -Free-

→ k-means++: The Advantage of Careful Seeding

Location:
p. 1-3 (through section 2) provides the k-means++ algorithm, while the rest of the paper provides proofs and characterizations

## Supplemental resources (the following are optional, but you may find them useful)

## -Free-

→ StackOverflow

Location:
accepted answer

Other notes:

- accepted answer provides a solid definition and python implementation

→ Wikipedia

## See also

-No Additional Notes-