Unsupervised Learning: Measures for Clustering


Overview

Unsupervised learning is a vast topic, and clustering is a large part (though not all) of it. Whenever we have an idea about clustering, we should first ask: how does this idea compare to existing work? To answer that, we need evaluation strategies and suitable measures for those evaluations. That is what today's blog covers.

Distance Metrics

The four most popular distance metrics are outlined below. In essence, one should understand the structure of each metric and when to use it. A short code sketch computing each metric follows the list.

  1. Minkowski Distance:

    • Minkowski distance is a metric on a normed vector space.
    • Formula: $D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
      • p = 1: Manhattan Distance
      • p = 2: Euclidean Distance
      • p = ∞: Chebyshev Distance
  2. Manhattan Distance:

    • We use Manhattan distance when the distance between two data points must be measured along a grid-like path (the sum of absolute coordinate differences).
  3. Euclidean Distance:

    • The Euclidean distance formula calculates the straight-line distance between two data points in a plane.
  4. Cosine Distance:

    • The cosine distance metric is mostly used to find similarities between documents.
    • In the cosine metric we measure the angle between two document vectors (the term frequencies of different documents collected as vectors).
    • This metric is used when the magnitude of the vectors does not matter, only their orientation.
    • Formula: $D(x, y) = 1 - \cos(\theta) = 1 - \dfrac{x \cdot y}{\|x\| \, \|y\|}$
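
As a quick check of these definitions, here is a minimal sketch (using scipy.spatial.distance on two made-up vectors) showing that the Minkowski distance reduces to the other metrics for the corresponding values of p, plus the cosine distance:

import numpy as np
from scipy.spatial import distance

# Two arbitrary example vectors
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 8.0])

# Minkowski with p = 1 / p = 2 agrees with Manhattan / Euclidean
print(distance.minkowski(x, y, p=1), distance.cityblock(x, y))   # Manhattan (grid-like path)
print(distance.minkowski(x, y, p=2), distance.euclidean(x, y))   # Euclidean (straight line)
print(distance.chebyshev(x, y))                                  # largest coordinate difference
print(distance.cosine(x, y))                                     # 1 - cos(angle between x and y)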

Evaluation Methods

1. Clustering Tendency

  • Before evaluating clustering performance, it is important to make sure that the data set we are working with has a clustering tendency and does not consist of uniformly distributed points.
  • Hopkins Test: a statistical test for the spatial randomness of a variable; it measures the probability that the data points were generated by a uniform distribution.
    • Null Hypothesis ($H_0$): the data points are generated by a uniform, random distribution (implying no meaningful clusters)
    • Alternate Hypothesis ($H_1$): the data points are not generated by a uniform distribution (clusters are present)
    • If the H value is between 0.01 and 0.3, the data is regularly spaced.
    • If the H value is around 0.5, the data is random.
    • If the H value is between 0.7 and 0.99, the data has a high tendency to cluster.
  • The Hopkins test is highly influenced by outliers.
  • Sample Code
    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

    from sklearn.decomposition import PCA
    from sklearn import datasets
    from sklearn.preprocessing import scale
    from pyclustertend import hopkins ## the hopkins test


    from mpl_toolkits.mplot3d import Axes3D
    import matplotlib.pyplot as plt

    # Load the data set and separate the features from the label (the "target" column)
    heart_df = pd.read_csv("heart.csv")

    X = heart_df[heart_df.columns[~heart_df.columns.isin(["target"])]].values
    y = heart_df[heart_df.columns[heart_df.columns.isin(["target"])]].values.flatten()

    # Hopkins statistic on the raw and on the standardized features
    display(hopkins(X, X.shape[0]))
    display(hopkins(scale(X), X.shape[0]))
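  • For intuition, here is a minimal, self-contained sketch of the statistic itself, $H = \sum_i u_i / (\sum_i u_i + \sum_i w_i)$, where the $u_i$ are nearest-neighbour distances from points sampled uniformly over the data's bounding box and the $w_i$ are nearest-neighbour distances from sampled real points. This is only an illustration under my own simplifications (the ~10% sample size and the bounding-box sampling are assumptions), not the pyclustertend implementation.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def hopkins_statistic(X, m=None, random_state=0):
        """Sketch of the Hopkins statistic H: values near 1 suggest a clustering
        tendency, values near 0.5 suggest uniformly random data."""
        rng = np.random.default_rng(random_state)
        n, d = X.shape
        m = m or max(1, int(0.1 * n))  # sample size; ~10% of n is a common heuristic

        nn = NearestNeighbors(n_neighbors=2).fit(X)

        # u_i: distances from points sampled uniformly over the bounding box
        #      to their nearest real data point
        uniform_points = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
        u = nn.kneighbors(uniform_points, n_neighbors=1)[0].ravel()

        # w_i: distances from randomly chosen real points to their nearest
        #      *other* real data point (skip the point itself at index 0)
        sample_idx = rng.choice(n, m, replace=False)
        w = nn.kneighbors(X[sample_idx], n_neighbors=2)[0][:, 1]

        return u.sum() / (u.sum() + w.sum())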

2. Optimal Number of Clusters

There are mainly two directions:

  • Domain knowledge — Domain knowledge might provide a prior on the number of clusters. For example, when clustering the iris data set, if we know in advance that there are three species (setosa, virginica, versicolor), then k = 3. A domain-knowledge-driven k value gives more relevant insights.

  • Data-driven approach — If domain knowledge is not available, mathematical methods help in finding the right number of clusters.

There are mainly two methods:

  • Statistical approach:
    • The Gap statistic is a powerful statistical method for finding the optimal number of clusters, k.
    • The sum of within-cluster (intra-cluster) variance is calculated for different values of k.
    • $W_k$: sum of within-cluster variance of the original data set for k clusters
    • $W_k^{*}$: sum of within-cluster variance of a reference data set (a null reference drawn from a uniform distribution) for k clusters
    • Formula: $\mathrm{Gap}(k) = \mathbb{E}\left[\log W_k^{*}\right] - \log W_k$, where the expectation is taken over the reference data sets.
    • The Gap statistic quantifies how far the clustering structure of the data deviates from this null reference: the larger the Gap statistic, the larger the deviation.
    • The cluster number with the maximum Gap statistic value corresponds to the optimal number of clusters.
    • Sample Code
      # Gap Statistics
      %matplotlib inline
      import matplotlib.pyplot as plt
      import pandas as pd
      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.datasets import make_blobs

      plt.rcParams['figure.figsize'] = 10, 10

      x, y = make_blobs(750, n_features=2, centers=12)

      plt.scatter(x[:, 0], x[:, 1])
      plt.show()

      def optimalK(data, nrefs=3, maxClusters=15):
          """
          Calculates the KMeans optimal k using the Gap statistic from Tibshirani, Walther, Hastie
          Params:
              data: ndarray of shape (n_samples, n_features)
              nrefs: number of reference data sets to create
              maxClusters: maximum number of clusters to test for
          Returns: (optimalK, resultsdf)
          """
          gaps = np.zeros((len(range(1, maxClusters)),))
          resultsdf = pd.DataFrame({'clusterCount': [], 'gap': []})
          for gap_index, k in enumerate(range(1, maxClusters)):

              # Holder for reference dispersion results
              refDisps = np.zeros(nrefs)

              # For each of nrefs references, generate a random uniform sample
              # and record the k-means dispersion (inertia) on it
              for i in range(nrefs):

                  # Create new random reference set
                  randomReference = np.random.random_sample(size=data.shape)

                  # Fit to it
                  km = KMeans(k)
                  km.fit(randomReference)

                  refDisps[i] = km.inertia_

              # Fit k-means to the original data and record its dispersion
              km = KMeans(k)
              km.fit(data)

              origDisp = km.inertia_

              # Calculate gap statistic: log mean reference dispersion minus log original dispersion
              gap = np.log(np.mean(refDisps)) - np.log(origDisp)

              # Assign this loop's gap statistic to gaps
              gaps[gap_index] = gap

              resultsdf = pd.concat([resultsdf,
                                     pd.DataFrame({'clusterCount': [k], 'gap': [gap]})],
                                    ignore_index=True)

          return (gaps.argmax() + 1, resultsdf)  # Plus 1 because gaps index 0 corresponds to k = 1

      k, gapdf = optimalK(x, nrefs=5, maxClusters=15)
      print(f'Optimal k is: {k}')
  • Elbow method:
    • Within-cluster variance is a measure of the compactness of a cluster: the lower the within-cluster variance, the more compact the cluster formed.
    • The within-cluster variance (SSE) is plotted against k; the "elbow" point, after which the decrease in SSE levels off, is taken as the optimal number of clusters.
    • Sample Code
      # Elbow Method
      import matplotlib.pyplot as plt
      import numpy as np
      import pandas as pd
      import seaborn as sns
      from sklearn.datasets import make_blobs, make_circles, make_moons
      from sklearn.cluster import KMeans, SpectralClustering
      from sklearn.preprocessing import StandardScaler
      from sklearn.metrics import silhouette_samples, silhouette_score

      # Import the data
      df = pd.read_csv('old_faithful.csv')

      # Standardize the data
      X_std = StandardScaler().fit_transform(df)

      # Run k-means for a range of k and record the sum of squared errors (inertia)
      sse = []
      list_k = list(range(1, 10))

      for k in list_k:
          km = KMeans(n_clusters=k)
          km.fit(X_std)
          sse.append(km.inertia_)

      # Plot sse against k
      plt.figure(figsize=(6, 6))
      plt.plot(list_k, sse, '-o')
      plt.xlabel(r'Number of clusters $k$')
      plt.ylabel('Sum of squared distance')

3. Clustering Quality

There are two major types of measures for assessing clustering performance. For more details, check the scikit-learn documentation on clustering performance evaluation.

  1. Extrinsic Measures:
    • Require ground truth labels.
    • Examples are the Adjusted Rand Index, Fowlkes-Mallows scores, mutual-information-based scores, Homogeneity, Completeness and V-measure. A small sketch follows this list.
  2. Intrinsic Measures:
    • Do not require ground truth labels.
    • Examples are the Silhouette Coefficient, Calinski-Harabasz Index, Davies-Bouldin Index, etc. The silhouette analysis code below illustrates the first, and a short sketch after it shows the other two.
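
Since extrinsic measures only compare predicted labels against ground truth labels, here is a minimal sketch of how they can be computed with scikit-learn (the label arrays are made-up toy values):

from sklearn import metrics

# Toy example: ground truth labels vs. labels produced by some clustering
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

print(metrics.adjusted_rand_score(labels_true, labels_pred))         # Adjusted Rand Index
print(metrics.fowlkes_mallows_score(labels_true, labels_pred))       # Fowlkes-Mallows score
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))  # mutual-information based
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))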
# silhouette analysis
for i, k in enumerate([2, 3, 4]):
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # Run the Kmeans algorithm
    km = KMeans(n_clusters=k)
    labels = km.fit_predict(X_std)
    centroids = km.cluster_centers_

    # Get silhouette samples
    silhouette_vals = silhouette_samples(X_std, labels)

    # Silhouette plot
    y_ticks = []
    y_lower, y_upper = 0, 0
    for i, cluster in enumerate(np.unique(labels)):
        cluster_silhouette_vals = silhouette_vals[labels == cluster]
        cluster_silhouette_vals.sort()
        y_upper += len(cluster_silhouette_vals)
        ax1.barh(range(y_lower, y_upper), cluster_silhouette_vals, edgecolor='none', height=1)
        ax1.text(-0.03, (y_lower + y_upper) / 2, str(i + 1))
        y_lower += len(cluster_silhouette_vals)

    # Get the average silhouette score and plot it
    avg_score = np.mean(silhouette_vals)
    ax1.axvline(avg_score, linestyle='--', linewidth=2, color='green')
    ax1.set_yticks([])
    ax1.set_xlim([-0.1, 1])
    ax1.set_xlabel('Silhouette coefficient values')
    ax1.set_ylabel('Cluster labels')
    ax1.set_title('Silhouette plot for the various clusters', y=1.02)

    # Scatter plot of data colored with labels
    ax2.scatter(X_std[:, 0], X_std[:, 1], c=labels)
    ax2.scatter(centroids[:, 0], centroids[:, 1], marker='*', c='r', s=250)
    ax2.set_xlim([-2, 2])
    ax2.set_ylim([-2, 2])
    ax2.set_xlabel('Eruption time in mins')
    ax2.set_ylabel('Waiting time to next eruption')
    ax2.set_title('Visualization of clustered data', y=1.02)
    ax2.set_aspect('equal')
    plt.tight_layout()
    plt.suptitle(f'Silhouette analysis using k = {k}',
                 fontsize=16, fontweight='semibold', y=1.05)
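
The Calinski-Harabasz and Davies-Bouldin indices mentioned above can be computed in the same intrinsic fashion, directly from the data and the predicted labels; here is a minimal sketch reusing X_std and the labels from the last k-means fit above:

from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Intrinsic quality scores: no ground truth labels needed
print(silhouette_score(X_std, labels))         # in [-1, 1], higher is better
print(calinski_harabasz_score(X_std, labels))  # higher is better
print(davies_bouldin_score(X_std, labels))     # lower is better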
Author: Zhenlin Wang

Posted on: 2020-02-15

Updated on: 2021-10-04
