k-means

(1.5 hours to learn)

Summary

K-means is a clustering algorithm, i.e. a way of partitioning a set of data points into "clusters," or sets of data points which are similar to one another. It works by iteratively reassigning data points to clusters and computing cluster centers based on the average of the point locations. It is commonly used for vector quantization and as an initialization for Gaussian mixture models.

Context

-this concept has no prerequisites-

Goals

  • What objective function does the k-means algorithm minimize?
  • What happens during the two stages of each k-means iteration?
  • Is it possible that the k-means algorithm increases the objective function in a full iteration?
  • What are some scenarios in which k-means gets stuck in poor local optima?
    • What are some general techniques to overcome these scenarios?
  • What are some techniques for choosing the initial points of the clusters?
  • What are some techniques for choosing the number of clusters?

Core resources (read/watch one of the following)

-Free-

Information Theory, Inference, and Learning Algorithms
A graudate-level textbook on machine learning and information theory.
Author: David MacKay
Coursera: Machine Learning (2013)
An online machine learning course aimed at a broad audience.
Location: Lecture sequence "Clustering"
Author: Andrew Y. Ng
Other notes:
  • Click on "Preview" to see the videos.
K-Means Clustering and Related Algorithms
Author: Ryan P. Adams

-Paid-

Supplemental resources (the following are optional, but you may find them useful)

-Free-

Visualizing K-Means Clustering
Author: Naftali Harris
Other notes:
  • provides an interactive visualization of the k-means algorithm for various datasets
StackOverflow Question: Kmeans without knowing the number of clusters?
Other notes:
  • the top answer provides a good set of links that dig into the question "how many clusters should I use?" (NB: this question is a Pandora's box)

-Paid-

See also