Unsupervised Learning: Measures about Clustering
Overview
Unsupervised learning is a vast topic, and clustering is really a big part (yet not all) of it. Whenever we have some ideas about clusteirng, we should first ask: is this idea comparable to some existing works? Now to answer this, we need some evaluation strategies and choose measures for such evaluations. This is what today's blog will talk about.
Distance Metrics
We have four most popular distance metrics outlined below. In essense, one should understand the structure of each metric, and when to use them.
Minkowski Distance:
- Minkowski distance is a metric in Normed vector space.
- Formula
- p = 1, Manhattan Distance
- p = 2, Euclidean Distance
- p = ∞, Chebychev Distance
Manhattan Distance:
- We use Manhattan Distance if we need to calculate the distance between two data points in a grid like path.
Euclidean Distance:
- Euclidean distance formula can be used to calculate the distance between two data points in a plane.
Cosine Distance:
- Mostly Cosine distance metric is used to find similarities between different documents.
- In cosine metric we measure the degree of angle between two documents/vectors(the term frequencies in different documents collected as metrics).
- This particular metric is used when the magnitude between vectors does not matter but the orientation.
- Formula:
Evaluation Methods
1. Clustering Tendency
- Before evaluating the clustering performance, making sure that data set we are working has clustering tendency and does not contain uniformly distributed points is very important.
Hopkins Test
: a statistical test for spatial randomness of a variable, can be used to measure the probability of data points generated by uniform data distribution.- Null Hypothesis (
) : Data points are generated by non-random, uniform distribution (implying no meaningful clusters) - Alternate Hypothesis (
): Data points are generated by random data points (presence of clusters) - If the H value is between {0.01, …,0.3}, the data is regularly spaced.
- If the H value is around 0.5, it is random.
- If the H value is between {0.7, …, 0.99}, it has a high tendency to cluster.
- Null Hypothesis (
- The
Hopkins Test
is highly influenced by outliers - Sample Code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.preprocessing import scale
from pyclustertend import hopkins ## the hopkins test
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
heart_df = pd.read_csv("heart.csv")
X = heart_df[heart_df.columns[~heart_df.columns.isin(["target"])]].values
y = heart_df[heart_df.columns[heart_df.columns.isin(["target"])]].values.flatten()
display(hopkins(X, X.shape[0]))
display(hopkins(scale(X),X.shape[0]))
2. Number of Optimal Clusters
Mainly 2 Direction
Domain knowledge — Domain knowledge might give some prior knowledge on finding number of clusters. For example, in case of clustering iris data set, if we have the prior knowledge of species (sertosa, virginica, versicolor) , then k = 3. Domain knowledge driven k value gives more relevant insights.
Data driven approach — If the domain knowledge is not available, mathematical methods help in finding out right number of clusters.
Mainly 2 Methods
- Statistical approach:
Gap statistic
is a powerful statistical method to find the optimal number of clusters, k.- Sum of within-cluster (intra-cluster) variance is calculated for different values of k.
: Sum-of-within-Cluster variance of original data set for k clusters : Sum-of-within-cluster variance of reference data set (null reference data set of uniform distribution) of k clusters - Formula:
- As Gap statistic quantifies this deviation, More the Gap statistic means more the deviation.
- Cluster number with maximum Gap statistic value corresponds to optimal number of cluster.
- Sample Code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65# Gap Statistics
%matplotlib inline
import time
import hashlib
import scipy
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets.samples_generator import make_blobs
plt.rcParams['figure.figsize'] = 10, 10
x, y = make_blobs(750, n_features=2, centers=12)
plt.scatter(x[:, 0], x[:, 1])
plt.show()
def optimalK(data, nrefs=3, maxClusters=15):
"""
Calculates KMeans optimal K using Gap Statistic from Tibshirani, Walther, Hastie
Params:
data: ndarry of shape (n_samples, n_features)
nrefs: number of sample reference datasets to create
maxClusters: Maximum number of clusters to test for
Returns: (gaps, optimalK)
"""
gaps = np.zeros((len(range(1, maxClusters)),))
resultsdf = pd.DataFrame({'clusterCount':[], 'gap':[]})
for gap_index, k in enumerate(range(1, maxClusters)):
# Holder for reference dispersion results
refDisps = np.zeros(nrefs)
# For n references, generate random sample and perform kmeans getting resulting dispersion of each loop
for i in range(nrefs):
# Create new random reference set
randomReference = np.random.random_sample(size=data.shape)
# Fit to it
km = KMeans(k)
km.fit(randomReference)
refDisp = km.inertia_
refDisps[i] = refDisp
# Fit cluster to original data and create dispersion
km = KMeans(k)
km.fit(data)
origDisp = km.inertia_
# Calculate gap statistic
gap = np.log(np.mean(refDisps)) - np.log(origDisp)
# Assign this loop's gap statistic to gaps
gaps[gap_index] = gap
resultsdf = resultsdf.append({'clusterCount':k, 'gap':gap}, ignore_index=True)
return (gaps.argmax() + 1, resultsdf) # Plus 1 because index of 0 means 1 cluster is optimal, index 2 = 3 clusters are optimal
k, gapdf = optimalK(x, nrefs=5, maxClusters=15)
print(f'Optimal k is: {k}')
- Elbow method:
- Within-cluster variance is a measure of compactness of the cluster. Lower the value of within cluster variance, higher the compactness of cluster formed.
- Sample Code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31# Elbow Method
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets.samples_generator import (make_blobs,
make_circles,
make_moons)
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score
# Import the data
df = pd.read_csv('old_faithful.csv')
# Standardize the data
X_std = StandardScaler().fit_transform(df)
sse = []
list_k = list(range(1, 10))
for k in list_k:
km = KMeans(n_clusters=k)
km.fit(X_std)
sse.append(km.inertia_)
# Plot sse against k
plt.figure(figsize=(6, 6))
plt.plot(list_k, sse, '-o')
plt.xlabel(r'Number of clusters $k$')
plt.ylabel('Sum of squared distance')
3. Clustering Quality
There are majorly two types of measures to assess the clustering performance. For more details, check sklearn document on cluster performance evaluation.
- Extrinsic Measures:
- Require ground truth labels.
- Examples are Adjusted Rand index, Fowlkes-Mallows scores, Mutual information based scores, Homogeneity, Completeness and V-measure.
- Intrinsic Measures:
- Does not require ground truth labels.
- Examples are Silhouette Coefficient, Calinski-Harabasz Index, Davies-Bouldin Index etc.
1 | # silhouette analysis |
Unsupervised Learning: Measures about Clustering
https://criss-wang.github.io/post/blogs/unsupervised/unsupervised-learning/