
Clustering: K Means and Gaussian Mixture Models

The basic models for clustering

2020.02.01 · 10 min read · by Zhenlin Wang · updated 2021-08-19

Overview

In this post we cover K-means and Gaussian mixture models (GMMs), two of the most widely used and intuitive clustering algorithms. As we venture further into unsupervised learning and clustering problems, we will see more interesting problem formulations as well as more diverse evaluation metrics. I hope you enjoy the learning journey along the way :)

K means

1. Definition
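In brief, K-means (Lloyd's algorithm) alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal NumPy sketch of that loop (my own illustration, not production code):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment
    and centroid-mean updates until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged (to a local optimum)
        centroids = new_centroids
    return centroids, labels
```

Note the convergence guarantee here is only to a local optimum, which is why the result depends on initialization.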

2. Pros & Cons

Pros

  1. Easy to interpret
  2. Relatively fast
  3. Scalable for large data sets
  4. Initial centroid positions can be chosen in a smart way (e.g. k-means++) to speed up convergence
  5. Guarantees convergence

Cons

  1. The globally optimal result may not be achieved
  2. The number of clusters must be selected beforehand
  3. k-means is limited to linear cluster boundaries:
    • this can be addressed with a kernel trick, similar to what SVMs do
    • one popular solution is spectral clustering, i.e. kernelized k-means
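To make the linear-boundary limitation concrete, here is a small comparison on the classic "two moons" dataset (my own example using scikit-learn): plain K-means cannot separate the interleaved arcs, while spectral clustering, which builds a nearest-neighbor similarity graph first, recovers them.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: not linearly separable
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=0
).fit_predict(X)

# Adjusted Rand index vs. the true arcs (1.0 = perfect recovery)
ari_km = adjusted_rand_score(y, km_labels)  # low: boundary is not linear
ari_sc = adjusted_rand_score(y, sc_labels)  # near 1: arcs recovered
print(ari_km, ari_sc)
```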

3. Applications

4. Code Implementation

{% codeblock Sklearn package’s KMeans lang:python %}
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, max_iter=100)
km.fit(X_std)
centroids = km.cluster_centers_
{% endcodeblock %}

GMM

1. Definition
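In short, a GMM models the data density as a weighted sum of Gaussians and is typically fit with expectation-maximization (EM): the E-step computes each point's responsibility under each component, and the M-step re-estimates the weights, means, and variances from those responsibilities. A 1-D sketch of that loop (my own illustration; real libraries add covariance options and log-space numerics):

```python
import numpy as np

def gmm_em(x, k, n_iters=200):
    """EM for a 1-D Gaussian mixture with k components."""
    # Simple deterministic initialization: spread means over quantiles
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: responsibility r[i, j] ∝ w_j * N(x_i | mu_j, var_j)
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
                 / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```

Unlike K-means' hard assignments, the responsibilities are soft: each point belongs to every component with some probability.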

2. Pros & Cons

Pros

Cons

3. Applications

4. Simple code

{% codeblock Sklearn package’s GMM lang:python %}
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3)
labels = gmm.fit_predict(X_principal)
{% endcodeblock %}