Unsupervised Learning: Measures for Clustering

We have clusters, and then?

2020.02.15 · 3 min read · by Zhenlin Wang · updated 2021-10-04

Overview

Unsupervised learning is a vast topic, and clustering is a big part (though not all) of it. Whenever we have an idea about clustering, we should first ask: is this idea comparable to existing work? To answer that, we need evaluation strategies and measures to go with them. That is what today's blog is about.

Distance Metrics

The four most popular distance metrics are outlined below. In essence, one should understand the structure of each metric and when to use it.

  1. Minkowski Distance:

    • Minkowski distance is a metric on a normed vector space.
    • Formula: $\text{Minkowski Distance} = (\sum_{i = 1}^{n} |x_i - y_i|^p)^\frac{1}{p}$
      • p = 1, Manhattan Distance
      • p = 2, Euclidean Distance
      • p = ∞, Chebyshev Distance
  2. Manhattan Distance:

    • We use Manhattan distance to measure the distance between two data points along a grid-like path (axis-aligned moves only).
  3. Euclidean Distance:

    • The Euclidean distance formula gives the straight-line distance between two data points in a plane.
  4. Cosine Distance:

    • Cosine distance is mostly used to measure similarity between documents.
    • It measures the angle between two vectors (e.g., the term-frequency vectors of different documents).
    • This metric is used when the orientation of the vectors matters but their magnitude does not.
    • Formula (note that cosine distance is one minus the cosine similarity): $\text{Cosine Similarity} = \cos \theta = \frac{\vec{a} \cdot \vec{b}}{\Vert\vec{a}\Vert \space \Vert\vec{b}\Vert}, \quad \text{Cosine Distance} = 1 - \cos \theta$
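The four metrics above can be sketched in a few lines of plain Python. These are illustrative implementations (in practice `scipy.spatial.distance` provides all of them):

```python
import math

def minkowski(x, y, p):
    """Minkowski distance: (sum |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def chebyshev(x, y):
    """Limit of Minkowski as p -> infinity: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 - cos(theta); compares orientation only, ignoring magnitudes."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, p=1))             # Manhattan: 5.0
print(minkowski(x, y, p=2))             # Euclidean: ~3.606
print(chebyshev(x, y))                  # 3.0
print(cosine_distance([1, 0], [0, 1]))  # orthogonal vectors: 1.0
```

Note how Manhattan and Euclidean are just Minkowski with p = 1 and p = 2, while Chebyshev needs its own `max` form because it is the p → ∞ limit.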

Evaluation Methods

1. Clustering Tendency
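Clustering tendency asks whether the data contains any non-random structure at all before we run a clustering algorithm on it. A standard check is the Hopkins statistic; below is a minimal pure-Python sketch (the function name and sampling details are my own illustrative choices):

```python
import math
import random

def hopkins(data, m=None, seed=0):
    """Hopkins statistic: ~0.5 for uniformly random data, near 1 for clustered data."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    m = m or max(1, n // 10)

    def nn_dist(p, points):
        return min(math.dist(p, q) for q in points)

    # Bounding box of the data, for drawing uniform reference points
    lo = [min(x[j] for x in data) for j in range(d)]
    hi = [max(x[j] for x in data) for j in range(d)]

    # u: nearest-data-point distances from m uniform random points
    u = [nn_dist([rng.uniform(lo[j], hi[j]) for j in range(d)], data)
         for _ in range(m)]
    # w: nearest-neighbor distances from m sampled data points to the rest
    sample = rng.sample(range(n), m)
    w = [nn_dist(data[i], [data[j] for j in range(n) if j != i])
         for i in sample]
    return sum(u) / (sum(u) + sum(w))

rng = random.Random(1)
clustered = [[rng.gauss(0, 0.05), rng.gauss(0, 0.05)] for _ in range(100)] + \
            [[rng.gauss(5, 0.05), rng.gauss(5, 0.05)] for _ in range(100)]
uniform = [[rng.uniform(0, 5), rng.uniform(0, 5)] for _ in range(200)]
print(hopkins(clustered))  # close to 1: strong clustering tendency
print(hopkins(uniform))    # close to 0.5: no real tendency
```

The intuition: if the data is clustered, random probe points land far from the data (large u) while real data points sit close to their neighbors (small w), pushing the ratio toward 1.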

2. Number of Optimal Clusters

Mainly two directions.

Mainly two methods.
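One widely used method for choosing the number of clusters (plausibly one of the two intended here, alongside the silhouette analysis shown later in this post) is the elbow method: plot the within-cluster sum of squares (inertia) against k and look for the bend. The sketch below uses a minimal pure-Python Lloyd's k-means with deterministic farthest-point seeding; both are illustrative implementations, not this blog's code:

```python
import math
import random

def farthest_point_init(data, k):
    """Deterministic seeding: start at the first point, then greedily add
    the point farthest from all chosen centroids (a k-means++-like idea)."""
    centroids = [data[0]]
    while len(centroids) < k:
        centroids.append(max(data, key=lambda p: min(math.dist(p, c)
                                                     for c in centroids)))
    return centroids

def kmeans_inertia(data, k, iters=50):
    """Minimal Lloyd's algorithm; returns the within-cluster sum of squares."""
    centroids = farthest_point_init(data, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in data:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Recompute centroids as cluster means (keep old one if a cluster empties)
        centroids = [[sum(col) / len(cl) for col in zip(*cl)] if cl
                     else centroids[j] for j, cl in enumerate(clusters)]
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in data)

# Three synthetic blobs: the "elbow" in inertia should appear at k = 3
rng = random.Random(2)
data = [[rng.gauss(cx, 0.2), rng.gauss(cy, 0.2)]
        for cx, cy in [(0, 0), (4, 0), (2, 3)] for _ in range(50)]
inertias = {k: kmeans_inertia(data, k) for k in range(1, 7)}
```

Inertia always decreases as k grows, so we pick the k past which the decrease flattens out; here that is k = 3, matching the number of blobs.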

3. Clustering Quality

There are broadly two types of measures for assessing clustering performance. For more details, see the scikit-learn documentation on clustering performance evaluation.

  1. Extrinsic Measures:
    • Require ground-truth labels.
    • Examples: Adjusted Rand Index, Fowlkes-Mallows scores, mutual-information-based scores, and homogeneity, completeness, and V-measure.
  2. Intrinsic Measures:
    • Do not require ground-truth labels.
    • Examples: Silhouette Coefficient, Calinski-Harabasz Index, and Davies-Bouldin Index.
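To make the extrinsic/intrinsic distinction concrete, here is a toy sketch with minimal pure-Python versions of one measure of each kind: the plain Rand index (the unadjusted cousin of the Adjusted Rand Index) and the Silhouette Coefficient. These are illustrative implementations; in practice one would call `sklearn.metrics`:

```python
import math
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Extrinsic: fraction of point pairs on which the two labelings agree
    (same cluster in both, or different clusters in both)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum((labels_true[i] == labels_true[j]) ==
                (labels_pred[i] == labels_pred[j]) for i, j in pairs)
    return agree / len(pairs)

def silhouette(points, labels):
    """Intrinsic: mean of (b - a) / max(a, b) per point, where a is the mean
    intra-cluster distance and b the mean distance to the nearest other cluster."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == lab)
            / sum(1 for j in range(len(points)) if labels[j] == lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
true_labels = [0, 0, 0, 1, 1, 1]
print(rand_index(true_labels, [1, 1, 1, 0, 0, 0]))  # label permutation: 1.0
print(silhouette(points, true_labels))              # well separated: close to 1
```

Note that the Rand index is invariant to a permutation of the cluster labels (it only looks at pairs), while the silhouette score never sees labels from outside the clustering at all.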
# Silhouette analysis for k = 2, 3, 4.
# Assumes X_std is a standardized (n_samples, 2) array, e.g. the
# Old Faithful eruption-time / waiting-time data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

for k in [2, 3, 4]:
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # Run the k-means algorithm
    km = KMeans(n_clusters=k)
    labels = km.fit_predict(X_std)
    centroids = km.cluster_centers_

    # Get per-sample silhouette values
    silhouette_vals = silhouette_samples(X_std, labels)

    # Silhouette plot: one horizontal bar per sample, grouped by cluster
    y_lower, y_upper = 0, 0
    for i, cluster in enumerate(np.unique(labels)):
        cluster_silhouette_vals = silhouette_vals[labels == cluster]
        cluster_silhouette_vals.sort()
        y_upper += len(cluster_silhouette_vals)
        ax1.barh(range(y_lower, y_upper), cluster_silhouette_vals,
                 edgecolor='none', height=1)
        ax1.text(-0.03, (y_lower + y_upper) / 2, str(i + 1))
        y_lower += len(cluster_silhouette_vals)

    # Plot the average silhouette score as a reference line
    avg_score = np.mean(silhouette_vals)
    ax1.axvline(avg_score, linestyle='--', linewidth=2, color='green')
    ax1.set_yticks([])
    ax1.set_xlim([-0.1, 1])
    ax1.set_xlabel('Silhouette coefficient values')
    ax1.set_ylabel('Cluster labels')
    ax1.set_title('Silhouette plot for the various clusters', y=1.02)

    # Scatter plot of the data colored by cluster label
    ax2.scatter(X_std[:, 0], X_std[:, 1], c=labels)
    ax2.scatter(centroids[:, 0], centroids[:, 1], marker='*', c='r', s=250)
    ax2.set_xlim([-2, 2])
    ax2.set_ylim([-2, 2])
    ax2.set_xlabel('Eruption time in mins')
    ax2.set_ylabel('Waiting time to next eruption')
    ax2.set_title('Visualization of clustered data', y=1.02)
    ax2.set_aspect('equal')
    plt.tight_layout()
    plt.suptitle(f'Silhouette analysis using k = {k}',
                 fontsize=16, fontweight='semibold', y=1.05)