# Unsupervised Learning: Measures about Clustering

### Overview

Unsupervised learning is a vast topic, and clustering is really a big part (yet not all) of it. Whenever we have some ideas about clusteirng, we should first ask: is this idea comparable to some existing works? Now to answer this, we need some evaluation strategies and choose measures for such evaluations. This is what today's blog will talk about.

### Distance Metrics

We have four most popular distance metrics outlined below. In essense, one should understand the structure of each metric, and when to use them.

Minkowski Distance:

- Minkowski distance is a metric in Normed vector space.
- Formula
- p = 1, Manhattan Distance
- p = 2, Euclidean Distance
- p = ∞, Chebychev Distance

Manhattan Distance:

- We use Manhattan Distance if we need to calculate the distance between two data points in a grid like path.

Euclidean Distance:

- Euclidean distance formula can be used to calculate the distance between two data points in a plane.

Cosine Distance:

- Mostly Cosine distance metric is used to find similarities between different documents.
- In cosine metric we measure the degree of angle between two documents/vectors(the term frequencies in different documents collected as metrics).
- This particular metric is used when the magnitude between vectors does not matter but the orientation.
- Formula:

### Evaluation Methods

#### 1. Clustering Tendency

- Before evaluating the clustering performance, making sure that data set we are working has clustering tendency and does not contain uniformly distributed points is very important.
`Hopkins Test`

: a statistical test for spatial randomness of a variable, can be used to measure the probability of data points generated by uniform data distribution.- Null Hypothesis (
) : Data points are generated by non-random, uniform distribution (implying no meaningful clusters) - Alternate Hypothesis (
): Data points are generated by random data points (presence of clusters) - If the H value is between {0.01, …,0.3}, the data is regularly spaced.
- If the H value is around 0.5, it is random.
- If the H value is between {0.7, …, 0.99}, it has a high tendency to cluster.

- Null Hypothesis (
- The
`Hopkins Test`

is highly influenced by outliers - Sample Code
1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18import numpy as np # linear algebra

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.decomposition import PCA

from sklearn import datasets

from sklearn.preprocessing import scale

from pyclustertend import hopkins ## the hopkins test

from mpl_toolkits.mplot3d import Axes3D

import matplotlib.pyplot as plt

heart_df = pd.read_csv("heart.csv")

X = heart_df[heart_df.columns[~heart_df.columns.isin(["target"])]].values

y = heart_df[heart_df.columns[heart_df.columns.isin(["target"])]].values.flatten()

display(hopkins(X, X.shape[0]))

display(hopkins(scale(X),X.shape[0]))

#### 2. Number of Optimal Clusters

**Mainly 2 Direction**

Domain knowledge — Domain knowledge might give some prior knowledge on finding number of clusters. For example, in case of clustering iris data set, if we have the prior knowledge of species (sertosa, virginica, versicolor) , then k = 3. Domain knowledge driven k value gives more relevant insights.

Data driven approach — If the domain knowledge is not available, mathematical methods help in finding out right number of clusters.

**Mainly 2 Methods**

- Statistical approach:
`Gap statistic`

is a powerful statistical method to find the optimal number of clusters, k.- Sum of within-cluster (intra-cluster) variance is calculated for different values of k.
: Sum-of-within-Cluster variance of original data set for k clusters : Sum-of-within-cluster variance of reference data set (null reference data set of uniform distribution) of k clusters - Formula:
- As Gap statistic quantifies this deviation, More the Gap statistic means more the deviation.
- Cluster number with maximum Gap statistic value corresponds to optimal number of cluster.
- Sample Code
1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65# Gap Statistics

%matplotlib inline

import time

import hashlib

import scipy

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

from sklearn.cluster import KMeans

from sklearn.datasets.samples_generator import make_blobs

plt.rcParams['figure.figsize'] = 10, 10

x, y = make_blobs(750, n_features=2, centers=12)

plt.scatter(x[:, 0], x[:, 1])

plt.show()

def optimalK(data, nrefs=3, maxClusters=15):

"""

Calculates KMeans optimal K using Gap Statistic from Tibshirani, Walther, Hastie

Params:

data: ndarry of shape (n_samples, n_features)

nrefs: number of sample reference datasets to create

maxClusters: Maximum number of clusters to test for

Returns: (gaps, optimalK)

"""

gaps = np.zeros((len(range(1, maxClusters)),))

resultsdf = pd.DataFrame({'clusterCount':[], 'gap':[]})

for gap_index, k in enumerate(range(1, maxClusters)):

# Holder for reference dispersion results

refDisps = np.zeros(nrefs)

# For n references, generate random sample and perform kmeans getting resulting dispersion of each loop

for i in range(nrefs):

# Create new random reference set

randomReference = np.random.random_sample(size=data.shape)

# Fit to it

km = KMeans(k)

km.fit(randomReference)

refDisp = km.inertia_

refDisps[i] = refDisp

# Fit cluster to original data and create dispersion

km = KMeans(k)

km.fit(data)

origDisp = km.inertia_

# Calculate gap statistic

gap = np.log(np.mean(refDisps)) - np.log(origDisp)

# Assign this loop's gap statistic to gaps

gaps[gap_index] = gap

resultsdf = resultsdf.append({'clusterCount':k, 'gap':gap}, ignore_index=True)

return (gaps.argmax() + 1, resultsdf) # Plus 1 because index of 0 means 1 cluster is optimal, index 2 = 3 clusters are optimal

k, gapdf = optimalK(x, nrefs=5, maxClusters=15)

print(f'Optimal k is: {k}')

- Elbow method:
- Within-cluster variance is a measure of compactness of the cluster. Lower the value of within cluster variance, higher the compactness of cluster formed.
- Sample Code
1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31# Elbow Method

import matplotlib.pyplot as plt

import numpy as np

import pandas as pd

import seaborn as sns

from sklearn.datasets.samples_generator import (make_blobs,

make_circles,

make_moons)

from sklearn.cluster import KMeans, SpectralClustering

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import silhouette_samples, silhouette_score

# Import the data

df = pd.read_csv('old_faithful.csv')

# Standardize the data

X_std = StandardScaler().fit_transform(df)

sse = []

list_k = list(range(1, 10))

for k in list_k:

km = KMeans(n_clusters=k)

km.fit(X_std)

sse.append(km.inertia_)

# Plot sse against k

plt.figure(figsize=(6, 6))

plt.plot(list_k, sse, '-o')

plt.xlabel(r'Number of clusters $k$')

plt.ylabel('Sum of squared distance')

#### 3. Clustering Quality

There are majorly two types of measures to assess the clustering performance. For more details, check sklearn document on cluster performance evaluation.

**Extrinsic Measures**:- Require ground truth labels.
- Examples are Adjusted Rand index, Fowlkes-Mallows scores, Mutual information based scores, Homogeneity, Completeness and V-measure.

**Intrinsic Measures**:- Does not require ground truth labels.
- Examples are Silhouette Coefficient, Calinski-Harabasz Index, Davies-Bouldin Index etc.

1 | # silhouette analysis |

Unsupervised Learning: Measures about Clustering

https://criss-wang.github.io/post/blogs/unsupervised/unsupervised-learning/