
Clustering: K-Means and Gaussian Mixture Models

A practical comparison of K-means and Gaussian mixture models, including assumptions, distance, soft assignments, initialization, and evaluation.

2020.02.01 · 1 min read · by Zhenlin Wang

Introduction

K-means and Gaussian mixture models (GMMs) are two classic clustering methods. Both summarize data as a small number of groups, but they make different assumptions about what a group looks like and how points belong to it.

K-means assigns each point to one cluster center. GMMs model data as a mixture of Gaussian distributions and produce soft probabilities for cluster membership.

K-Means

K-means tries to minimize within-cluster squared distance:

$$ \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2 $$

where $c_i$ is the index of the cluster assigned to point $x_i$ and $\mu_{c_i}$ is the centroid of that cluster.
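To make the objective concrete, here is a minimal sketch of the usual alternating procedure (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The function names, the toy data, and the choice to pass the initial centroids in by hand are assumptions made for illustration; a real project would more likely use a library implementation.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, c in zip(points, labels) if c == k]
        if not members:  # keep an empty cluster's centroid where it was
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids


def objective(points, labels, centroids):
    """Within-cluster sum of squared Euclidean distances."""
    return sum(math.dist(p, centroids[c]) ** 2 for p, c in zip(points, labels))


def kmeans(points, init_centroids, max_iter=100):
    centroids = [tuple(c) for c in init_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
    return labels, centroids


# Toy data: two well-separated groups (hypothetical values).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, init_centroids=[points[0], points[3]])
print(labels)
print(objective(points, labels, centroids))
```

Each step can only lower or keep the objective, which is why the loop terminates, but the result depends on the starting centroids and may be a local minimum.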

Use K-means when:

Watch out:

Gaussian Mixture Models

A GMM assumes data comes from a mixture of Gaussian components:

$$ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) $$

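Here $\pi_k$ is the mixing weight of component $k$ (the weights are non-negative and sum to 1), and $\mu_k$ and $\Sigma_k$ are its mean and covariance. Each point receives a probability of belonging to each component, rather than a single hard label.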
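As a rough illustration of soft assignment, the sketch below applies Bayes' rule to a one-dimensional mixture whose parameters are fixed by hand. The function names and the parameter values are hypothetical, and nothing is fitted here; a full GMM would estimate the weights, means, and variances from data, typically with EM.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component given x (Bayes' rule)."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)  # this is the mixture density p(x)
    return [j / total for j in joint]


# Hypothetical two-component mixture; parameters chosen by hand, not fitted.
weights = [0.5, 0.5]
means = [0.0, 4.0]
variances = [1.0, 1.0]

for x in (0.0, 2.0, 4.0):
    print(x, responsibilities(x, weights, means, variances))
```

A point halfway between the two means is split evenly between the components, while a point sitting on one mean is assigned almost entirely to that component. For points very far from every component the densities can underflow to zero, so practical implementations usually work in log space.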

Use GMMs when:

Watch out:

Choosing the Number of Clusters

Useful tools:

The best number of clusters is not only a metric decision. It should also be useful for the downstream task.
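One common metric-side check, shown here only as a sketch, is to run K-means for several values of k and look at how the within-cluster squared distance falls. Everything in this block is an assumption made for illustration: the toy data, the function names, and in particular the deterministic farthest-first initialization, which is used only so that the run is reproducible.

```python
import math


def farthest_first_init(points, k):
    """Deterministic init: start at points[0], then repeatedly add the point
    that is farthest from its nearest already-chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def nearest(p, centroids):
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))


def kmeans_inertia(points, k, max_iter=100):
    """Run Lloyd's iterations; return the final within-cluster squared distance."""
    centroids = farthest_first_init(points, k)
    labels = [nearest(p, centroids) for p in points]
    for _ in range(max_iter):
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return sum(math.dist(p, centroids[c]) ** 2 for p, c in zip(points, labels))


# Toy data with three visible groups (hypothetical values).
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 0.0), (10.0, 1.0), (11.0, 0.0),
    (5.0, 8.0), (5.0, 9.0), (6.0, 8.0),
]
inertia = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertia.items():
    print(k, round(value, 3))
```

On this toy data the value drops sharply up to k = 3 and only slightly after that, which is the "elbow" pattern. On real data the curve is rarely this clean, which is one reason the metric alone should not settle the choice.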

Closing

K-means is a fast hard-clustering baseline. GMMs are a probabilistic soft-clustering extension. With either method, scale the features first, check that the clusters are stable across runs, and interpret the results carefully.