Unsupervised Learning: Measures for Clustering


Overview

Unsupervised learning is a vast topic, and clustering is a large part (though not all) of it. Whenever we have an idea about clustering, we should first ask: how does this idea compare to existing work? To answer that, we need evaluation strategies and suitable measures for those evaluations. That is what today's blog covers.

Distance Metrics

The four most popular distance metrics are outlined below. In essence, one should understand the structure of each metric and when to use it. A short code sketch computing each metric follows the list.

  1. Minkowski Distance:

    • Minkowski distance is a metric on a normed vector space.
    • Formula: $D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
      • p = 1: Manhattan Distance
      • p = 2: Euclidean Distance
      • p = ∞: Chebyshev Distance
  2. Manhattan Distance:

    • We use Manhattan distance when the distance between two data points must be measured along a grid-like path (the sum of absolute coordinate differences).
  3. Euclidean Distance:

    • The Euclidean distance formula calculates the straight-line distance between two data points in a plane.
  4. Cosine Distance:

    • The cosine distance metric is mostly used to find similarities between documents.
    • In the cosine metric we measure the angle between two document vectors (the term frequencies of different documents collected as vectors).
    • This metric is used when the magnitude of the vectors does not matter, only their orientation.
    • Formula: $D(x, y) = 1 - \cos(\theta) = 1 - \dfrac{x \cdot y}{\|x\| \, \|y\|}$
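
As a quick check of these definitions, here is a minimal sketch (using scipy.spatial.distance on two made-up vectors) showing that the Minkowski distance reduces to the other metrics for the corresponding values of p, plus the cosine distance:

import numpy as np
from scipy.spatial import distance

# Two arbitrary example vectors
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 8.0])

# Minkowski with p = 1 / p = 2 agrees with Manhattan / Euclidean
print(distance.minkowski(x, y, p=1), distance.cityblock(x, y))   # Manhattan (grid-like path)
print(distance.minkowski(x, y, p=2), distance.euclidean(x, y))   # Euclidean (straight line)
print(distance.chebyshev(x, y))                                  # largest coordinate difference
print(distance.cosine(x, y))                                     # 1 - cos(angle between x and y)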

Evaluation Methods

1. Clustering Tendency

  • Before evaluating clustering performance, it is important to make sure that the data set we are working with has a clustering tendency and does not consist of uniformly distributed points.
  • Hopkins Test: a statistical test for the spatial randomness of a variable; it measures the probability that the data points were generated by a uniform distribution.
    • Null Hypothesis ($H_0$): the data points are generated by a uniform, random distribution (implying no meaningful clusters)
    • Alternate Hypothesis ($H_1$): the data points are not generated by a uniform distribution (clusters are present)
    • If the H value is between 0.01 and 0.3, the data is regularly spaced.
    • If the H value is around 0.5, the data is random.
    • If the H value is between 0.7 and 0.99, the data has a high tendency to cluster.
  • The Hopkins test is highly influenced by outliers.
  • Sample Code
    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

    from sklearn.decomposition import PCA
    from sklearn import datasets
    from sklearn.preprocessing import scale
    from pyclustertend import hopkins ## the hopkins test


    from mpl_toolkits.mplot3d import Axes3D
    import matplotlib.pyplot as plt

    # Load the data set and separate the features from the label (the "target" column)
    heart_df = pd.read_csv("heart.csv")

    X = heart_df[heart_df.columns[~heart_df.columns.isin(["target"])]].values
    y = heart_df[heart_df.columns[heart_df.columns.isin(["target"])]].values.flatten()

    # Hopkins statistic on the raw and on the standardized features
    display(hopkins(X, X.shape[0]))
    display(hopkins(scale(X), X.shape[0]))
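  • For intuition, here is a minimal, self-contained sketch of the statistic itself, $H = \sum_i u_i / (\sum_i u_i + \sum_i w_i)$, where the $u_i$ are nearest-neighbour distances from points sampled uniformly over the data's bounding box and the $w_i$ are nearest-neighbour distances from sampled real points. This is only an illustration under my own simplifications (the ~10% sample size and the bounding-box sampling are assumptions), not the pyclustertend implementation.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def hopkins_statistic(X, m=None, random_state=0):
        """Sketch of the Hopkins statistic H: values near 1 suggest a clustering
        tendency, values near 0.5 suggest uniformly random data."""
        rng = np.random.default_rng(random_state)
        n, d = X.shape
        m = m or max(1, int(0.1 * n))  # sample size; ~10% of n is a common heuristic

        nn = NearestNeighbors(n_neighbors=2).fit(X)

        # u_i: distances from points sampled uniformly over the bounding box
        #      to their nearest real data point
        uniform_points = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
        u = nn.kneighbors(uniform_points, n_neighbors=1)[0].ravel()

        # w_i: distances from randomly chosen real points to their nearest
        #      *other* real data point (skip the point itself at index 0)
        sample_idx = rng.choice(n, m, replace=False)
        w = nn.kneighbors(X[sample_idx], n_neighbors=2)[0][:, 1]

        return u.sum() / (u.sum() + w.sum())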

2. Optimal Number of Clusters

There are mainly two directions:

  • Domain knowledge — Domain knowledge might provide a prior on the number of clusters. For example, when clustering the iris data set, if we know in advance that there are three species (setosa, virginica, versicolor), then k = 3. A domain-knowledge-driven k value gives more relevant insights.

  • Data-driven approach — If domain knowledge is not available, mathematical methods help in finding the right number of clusters.

There are mainly two methods:

  • Statistical approach:
    • The Gap statistic is a powerful statistical method for finding the optimal number of clusters, k.
    • The sum of within-cluster (intra-cluster) variance is calculated for different values of k.
    • $W_k$: sum of within-cluster variance of the original data set for k clusters
    • $W_k^{*}$: sum of within-cluster variance of a reference data set (a null reference drawn from a uniform distribution) for k clusters
    • Formula: $\mathrm{Gap}(k) = \mathbb{E}\left[\log W_k^{*}\right] - \log W_k$, where the expectation is taken over the reference data sets.
    • The Gap statistic quantifies how far the clustering structure of the data deviates from this null reference: the larger the Gap statistic, the larger the deviation.
    • The cluster number with the maximum Gap statistic value corresponds to the optimal number of clusters.
    • Sample Code
      # Gap Statistics
      %matplotlib inline
      import matplotlib.pyplot as plt
      import pandas as pd
      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.datasets import make_blobs

      plt.rcParams['figure.figsize'] = 10, 10

      x, y = make_blobs(750, n_features=2, centers=12)

      plt.scatter(x[:, 0], x[:, 1])
      plt.show()

      def optimalK(data, nrefs=3, maxClusters=15):
          """
          Calculates the KMeans optimal k using the Gap statistic from Tibshirani, Walther, Hastie
          Params:
              data: ndarray of shape (n_samples, n_features)
              nrefs: number of reference data sets to create
              maxClusters: maximum number of clusters to test for
          Returns: (optimalK, resultsdf)
          """
          gaps = np.zeros((len(range(1, maxClusters)),))
          resultsdf = pd.DataFrame({'clusterCount': [], 'gap': []})
          for gap_index, k in enumerate(range(1, maxClusters)):

              # Holder for reference dispersion results
              refDisps = np.zeros(nrefs)

              # For each of nrefs references, generate a random uniform sample
              # and record the k-means dispersion (inertia) on it
              for i in range(nrefs):

                  # Create new random reference set
                  randomReference = np.random.random_sample(size=data.shape)

                  # Fit to it
                  km = KMeans(k)
                  km.fit(randomReference)

                  refDisps[i] = km.inertia_

              # Fit k-means to the original data and record its dispersion
              km = KMeans(k)
              km.fit(data)

              origDisp = km.inertia_

              # Calculate gap statistic: log mean reference dispersion minus log original dispersion
              gap = np.log(np.mean(refDisps)) - np.log(origDisp)

              # Assign this loop's gap statistic to gaps
              gaps[gap_index] = gap

              resultsdf = pd.concat([resultsdf,
                                     pd.DataFrame({'clusterCount': [k], 'gap': [gap]})],
                                    ignore_index=True)

          return (gaps.argmax() + 1, resultsdf)  # Plus 1 because gaps index 0 corresponds to k = 1

      k, gapdf = optimalK(x, nrefs=5, maxClusters=15)
      print(f'Optimal k is: {k}')
  • Elbow method:
    • Within-cluster variance is a measure of the compactness of a cluster: the lower the within-cluster variance, the more compact the cluster formed.
    • The within-cluster variance (SSE) is plotted against k; the "elbow" point, after which the decrease in SSE levels off, is taken as the optimal number of clusters.
    • Sample Code
      # Elbow Method
      import matplotlib.pyplot as plt
      import numpy as np
      import pandas as pd
      import seaborn as sns
      from sklearn.datasets import make_blobs, make_circles, make_moons
      from sklearn.cluster import KMeans, SpectralClustering
      from sklearn.preprocessing import StandardScaler
      from sklearn.metrics import silhouette_samples, silhouette_score

      # Import the data
      df = pd.read_csv('old_faithful.csv')

      # Standardize the data
      X_std = StandardScaler().fit_transform(df)

      # Run k-means for a range of k and record the sum of squared errors (inertia)
      sse = []
      list_k = list(range(1, 10))

      for k in list_k:
          km = KMeans(n_clusters=k)
          km.fit(X_std)
          sse.append(km.inertia_)

      # Plot sse against k
      plt.figure(figsize=(6, 6))
      plt.plot(list_k, sse, '-o')
      plt.xlabel(r'Number of clusters $k$')
      plt.ylabel('Sum of squared distance')

3. Clustering Quality

There are two major types of measures for assessing clustering performance. For more details, check the scikit-learn documentation on clustering performance evaluation.

  1. Extrinsic Measures:
    • Require ground truth labels.
    • Examples are the Adjusted Rand Index, Fowlkes-Mallows scores, mutual-information-based scores, Homogeneity, Completeness and V-measure. A small sketch follows this list.
  2. Intrinsic Measures:
    • Do not require ground truth labels.
    • Examples are the Silhouette Coefficient, Calinski-Harabasz Index, Davies-Bouldin Index, etc. The silhouette analysis code below illustrates the first, and a short sketch after it shows the other two.
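
Since extrinsic measures only compare predicted labels against ground truth labels, here is a minimal sketch of how they can be computed with scikit-learn (the label arrays are made-up toy values):

from sklearn import metrics

# Toy example: ground truth labels vs. labels produced by some clustering
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

print(metrics.adjusted_rand_score(labels_true, labels_pred))         # Adjusted Rand Index
print(metrics.fowlkes_mallows_score(labels_true, labels_pred))       # Fowlkes-Mallows score
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))  # mutual-information based
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))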
# silhouette analysis
for i, k in enumerate([2, 3, 4]):
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # Run the Kmeans algorithm
    km = KMeans(n_clusters=k)
    labels = km.fit_predict(X_std)
    centroids = km.cluster_centers_

    # Get silhouette samples
    silhouette_vals = silhouette_samples(X_std, labels)

    # Silhouette plot
    y_ticks = []
    y_lower, y_upper = 0, 0
    for i, cluster in enumerate(np.unique(labels)):
        cluster_silhouette_vals = silhouette_vals[labels == cluster]
        cluster_silhouette_vals.sort()
        y_upper += len(cluster_silhouette_vals)
        ax1.barh(range(y_lower, y_upper), cluster_silhouette_vals, edgecolor='none', height=1)
        ax1.text(-0.03, (y_lower + y_upper) / 2, str(i + 1))
        y_lower += len(cluster_silhouette_vals)

    # Get the average silhouette score and plot it
    avg_score = np.mean(silhouette_vals)
    ax1.axvline(avg_score, linestyle='--', linewidth=2, color='green')
    ax1.set_yticks([])
    ax1.set_xlim([-0.1, 1])
    ax1.set_xlabel('Silhouette coefficient values')
    ax1.set_ylabel('Cluster labels')
    ax1.set_title('Silhouette plot for the various clusters', y=1.02)

    # Scatter plot of data colored with labels
    ax2.scatter(X_std[:, 0], X_std[:, 1], c=labels)
    ax2.scatter(centroids[:, 0], centroids[:, 1], marker='*', c='r', s=250)
    ax2.set_xlim([-2, 2])
    ax2.set_ylim([-2, 2])
    ax2.set_xlabel('Eruption time in mins')
    ax2.set_ylabel('Waiting time to next eruption')
    ax2.set_title('Visualization of clustered data', y=1.02)
    ax2.set_aspect('equal')
    plt.tight_layout()
    plt.suptitle(f'Silhouette analysis using k = {k}',
                 fontsize=16, fontweight='semibold', y=1.05)
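
The Calinski-Harabasz and Davies-Bouldin indices mentioned above can be computed in the same intrinsic fashion, directly from the data and the predicted labels; here is a minimal sketch reusing X_std and the labels from the last k-means fit above:

from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Intrinsic quality scores: no ground truth labels needed
print(silhouette_score(X_std, labels))         # in [-1, 1], higher is better
print(calinski_harabasz_score(X_std, labels))  # higher is better
print(davies_bouldin_score(X_std, labels))     # lower is better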
Author: Zhenlin Wang

Posted on: 2020-02-15

Updated on: 2021-10-04
