Overview
Unsupervised learning is a vast topic, and clustering is really a big part (yet not all) of it. Whenever we have some ideas about clusteirng, we should first ask: is this idea comparable to some existing works? Now to answer this, we need some evaluation strategies and choose measures for such evaluations. This is what today’s blog will talk about.
Distance Metrics
We have four most popular distance metrics outlined below. In essense, one should understand the structure of each metric, and when to use them.
-
Minkowski Distance:
- Minkowski distance is a metric in Normed vector space.
- Formula $\text{Minkowski Distance} = (\sum_{i = 1}^{n} |x_i - y_i|^p)^\frac{1}{p}$
- p = 1, Manhattan Distance
- p = 2, Euclidean Distance
- p = ∞, Chebychev Distance
-
Manhattan Distance:
- We use Manhattan Distance if we need to calculate the distance between two data points in a grid like path.
-
Euclidean Distance:
- Euclidean distance formula can be used to calculate the distance between two data points in a plane.
-
Cosine Distance:
- Mostly Cosine distance metric is used to find similarities between different documents.
- In cosine metric we measure the degree of angle between two documents/vectors(the term frequencies in different documents collected as metrics).
- This particular metric is used when the magnitude between vectors does not matter but the orientation.
- Formula: $\text{Cosine Distance} = \cos \theta = \frac{\vec{a} \cdot \vec{b}}{\Vert\vec{a}\Vert \space \Vert\vec{b}\Vert}$
Evaluation Methods
1. Clustering Tendency
- Before evaluating the clustering performance, making sure that data set we are working has clustering tendency and does not contain uniformly distributed points is very important.
Hopkins Test: a statistical test for spatial randomness of a variable, can be used to measure the probability of data points generated by uniform data distribution.- Null Hypothesis ($H_0$) : Data points are generated by non-random, uniform distribution (implying no meaningful clusters)
- Alternate Hypothesis ($H_1$): Data points are generated by random data points (presence of clusters)
- If the H value is between {0.01, …,0.3}, the data is regularly spaced.
- If the H value is around 0.5, it is random.
- If the H value is between {0.7, …, 0.99}, it has a high tendency to cluster.
- The
Hopkins Testis highly influenced by outliers - Sample Code
import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) from sklearn.decomposition import PCA from sklearn import datasets from sklearn.preprocessing import scale from pyclustertend import hopkins ## the hopkins test from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt heart_df = pd.read_csv("heart.csv") X = heart_df[heart_df.columns[~heart_df.columns.isin(["target"])]].values y = heart_df[heart_df.columns[heart_df.columns.isin(["target"])]].values.flatten() display(hopkins(X, X.shape[0])) display(hopkins(scale(X),X.shape[0]))
2. Number of Optimal Clusters
Mainly 2 Direction
-
Domain knowledge — Domain knowledge might give some prior knowledge on finding number of clusters. For example, in case of clustering iris data set, if we have the prior knowledge of species (sertosa, virginica, versicolor) , then k = 3. Domain knowledge driven k value gives more relevant insights.
-
Data driven approach — If the domain knowledge is not available, mathematical methods help in finding out right number of clusters.
Mainly 2 Methods
- Statistical approach:
Gap statisticis a powerful statistical method to find the optimal number of clusters, k.- Sum of within-cluster (intra-cluster) variance is calculated for different values of k.
- $Sum_{ori, k}$: Sum-of-within-Cluster variance of original data set for k clusters
- $Sum_{null, k}$: Sum-of-within-cluster variance of reference data set (null reference data set of uniform distribution) of k clusters
- Formula: $\text{Gap stats} = E[\log Sum_{null, k}] - \log Sum_{ori, k}$
- As Gap statistic quantifies this deviation, More the Gap statistic means more the deviation.
- Cluster number with maximum Gap statistic value corresponds to optimal number of cluster.
- Sample Code
# Gap Statistics %matplotlib inline import time import hashlib import scipy import matplotlib.pyplot as plt import pandas as pd import numpy as np from sklearn.cluster import KMeans from sklearn.datasets.samples_generator import make_blobs plt.rcParams['figure.figsize'] = 10, 10 x, y = make_blobs(750, n_features=2, centers=12) plt.scatter(x[:, 0], x[:, 1]) plt.show() def optimalK(data, nrefs=3, maxClusters=15): """ Calculates KMeans optimal K using Gap Statistic from Tibshirani, Walther, Hastie Params: data: ndarry of shape (n_samples, n_features) nrefs: number of sample reference datasets to create maxClusters: Maximum number of clusters to test for Returns: (gaps, optimalK) """ gaps = np.zeros((len(range(1, maxClusters)),)) resultsdf = pd.DataFrame({'clusterCount':[], 'gap':[]}) for gap_index, k in enumerate(range(1, maxClusters)): # Holder for reference dispersion results refDisps = np.zeros(nrefs) # For n references, generate random sample and perform kmeans getting resulting dispersion of each loop for i in range(nrefs): # Create new random reference set randomReference = np.random.random_sample(size=data.shape) # Fit to it km = KMeans(k) km.fit(randomReference) refDisp = km.inertia_ refDisps[i] = refDisp # Fit cluster to original data and create dispersion km = KMeans(k) km.fit(data) origDisp = km.inertia_ # Calculate gap statistic gap = np.log(np.mean(refDisps)) - np.log(origDisp) # Assign this loop's gap statistic to gaps gaps[gap_index] = gap resultsdf = resultsdf.append({'clusterCount':k, 'gap':gap}, ignore_index=True) return (gaps.argmax() + 1, resultsdf) # Plus 1 because index of 0 means 1 cluster is optimal, index 2 = 3 clusters are optimal k, gapdf = optimalK(x, nrefs=5, maxClusters=15) print(f'Optimal k is: {k}')
- Elbow method:
- Within-cluster variance is a measure of compactness of the cluster. Lower the value of within cluster variance, higher the compactness of cluster formed.
- Sample Code
# Elbow Method import matplotlib.pyplot as plt import numpy as np import pandas as pd import seaborn as sns from sklearn.datasets.samples_generator import (make_blobs, make_circles, make_moons) from sklearn.cluster import KMeans, SpectralClustering from sklearn.preprocessing import StandardScaler from sklearn.metrics import silhouette_samples, silhouette_score # Import the data df = pd.read_csv('old_faithful.csv') # Standardize the data X_std = StandardScaler().fit_transform(df) sse = [] list_k = list(range(1, 10)) for k in list_k: km = KMeans(n_clusters=k) km.fit(X_std) sse.append(km.inertia_) # Plot sse against k plt.figure(figsize=(6, 6)) plt.plot(list_k, sse, '-o') plt.xlabel(r'Number of clusters $k$') plt.ylabel('Sum of squared distance')
3. Clustering Quality
There are majorly two types of measures to assess the clustering performance. For more details, check sklearn document on cluster performance evaluation.
- Extrinsic Measures:
- Require ground truth labels.
- Examples are Adjusted Rand index, Fowlkes-Mallows scores, Mutual information based scores, Homogeneity, Completeness and V-measure.
- Intrinsic Measures:
- Does not require ground truth labels.
- Examples are Silhouette Coefficient, Calinski-Harabasz Index, Davies-Bouldin Index etc.
# silhouette analysis
for i, k in enumerate([2, 3, 4]):
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)
# Run the Kmeans algorithm
km = KMeans(n_clusters=k)
labels = km.fit_predict(X_std)
centroids = km.cluster_centers_
# Get silhouette samples
silhouette_vals = silhouette_samples(X_std, labels)
# Silhouette plot
y_ticks = []
y_lower, y_upper = 0, 0
for i, cluster in enumerate(np.unique(labels)):
cluster_silhouette_vals = silhouette_vals[labels == cluster]
cluster_silhouette_vals.sort()
y_upper += len(cluster_silhouette_vals)
ax1.barh(range(y_lower, y_upper), cluster_silhouette_vals, edgecolor='none', height=1)
ax1.text(-0.03, (y_lower + y_upper) / 2, str(i + 1))
y_lower += len(cluster_silhouette_vals)
# Get the average silhouette score and plot it
avg_score = np.mean(silhouette_vals)
ax1.axvline(avg_score, linestyle='--', linewidth=2, color='green')
ax1.set_yticks([])
ax1.set_xlim([-0.1, 1])
ax1.set_xlabel('Silhouette coefficient values')
ax1.set_ylabel('Cluster labels')
ax1.set_title('Silhouette plot for the various clusters', y=1.02);
# Scatter plot of data colored with labels
ax2.scatter(X_std[:, 0], X_std[:, 1], c=labels)
ax2.scatter(centroids[:, 0], centroids[:, 1], marker='*', c='r', s=250)
ax2.set_xlim([-2, 2])
ax2.set_xlim([-2, 2])
ax2.set_xlabel('Eruption time in mins')
ax2.set_ylabel('Waiting time to next eruption')
ax2.set_title('Visualization of clustered data', y=1.02)
ax2.set_aspect('equal')
plt.tight_layout()
plt.suptitle(f'Silhouette analysis using k = {k}',
fontsize=16, fontweight='semibold', y=1.05);