
Clustering: Hierarchical, BIRCH and Spectral

The nitty-gritty of 'brute-force'

2020-02-05 · 14 min read · by Zhenlin Wang · updated 2021-08-21

Hierarchical Clustering

1. Definition

Hierarchical clustering builds a tree of nested clusters. The common agglomerative variant starts with every point in its own cluster and repeatedly merges the two closest clusters under a linkage criterion (Ward, single, complete or average). The merge history can be drawn as a dendrogram and cut at any level to obtain a flat clustering.

2. Pros & Cons

Pros

- No need to fix the number of clusters up front: the dendrogram can be cut at any level.
- The dendrogram exposes nested structure and is easy to interpret.
- Works with any pairwise distance and several linkage criteria.

Cons

- Quadratic memory and between quadratic and cubic time in the number of points, so it scales poorly.
- Merges are greedy and can never be undone.
- Sensitive to noise and outliers.

3. Application

A typical application is customer segmentation, which is exactly what the example below does on credit card usage data.

4. Code Implementation

The example below standardizes and normalizes the credit card dataset, projects it to two dimensions with PCA, draws the dendrogram, picks the number of clusters with the silhouette score, and fits the final model.

{% codeblock lang:python %}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as shc

# Load the credit card dataset, drop the ID column and forward-fill missing values
raw_df = pd.read_csv('CC GENERAL.csv')
raw_df = raw_df.drop('CUST_ID', axis=1)
raw_df = raw_df.ffill()

# Standardize the data
scaler = StandardScaler()
scaled_df = scaler.fit_transform(raw_df)

# Normalize the data
normalized_df = normalize(scaled_df)

# Convert the numpy array back into a pandas DataFrame
normalized_df = pd.DataFrame(normalized_df)

# Reduce the data to two dimensions for visualization
pca = PCA(n_components=2)
X_principal = pca.fit_transform(normalized_df)
X_principal = pd.DataFrame(X_principal, columns=['P1', 'P2'])

# Visualize the merge hierarchy as a dendrogram (Ward linkage)
plt.figure(figsize=(6, 6))
plt.title('Visualising the data')
dendrogram = shc.dendrogram(shc.linkage(X_principal, method='ward'))
plt.show()

# Determine the optimal number of clusters using the silhouette score
silhouette_scores = []
for n_cluster in range(2, 8):
    silhouette_scores.append(silhouette_score(
        X_principal,
        AgglomerativeClustering(n_clusters=n_cluster).fit_predict(X_principal)))

# Plot a bar graph to compare the results
k = [2, 3, 4, 5, 6, 7]
plt.bar(k, silhouette_scores)
plt.xlabel('Number of clusters', fontsize=10)
plt.ylabel('Silhouette Score', fontsize=10)
plt.show()

# Fit the final model with the chosen number of clusters
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X_principal)

# Visualize the clustering
plt.scatter(X_principal['P1'], X_principal['P2'], c=labels, cmap=plt.cm.winter)
plt.show()
{% endcodeblock %}
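
The silhouette sweep above averages s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points, where a(i) is a point's mean distance to its own cluster and b(i) its mean distance to the nearest other cluster. An alternative that plays to hierarchical clustering's strengths is to skip the sweep and cut the dendrogram directly at a distance threshold. A minimal sketch, assuming the `X_principal` DataFrame from the block above; the threshold of 5.0 is illustrative and should be read off your own dendrogram:

{% codeblock lang:python %}
from sklearn.cluster import AgglomerativeClustering

# Cut the dendrogram at a Ward distance of 5.0 instead of fixing n_clusters:
# merges whose linkage distance exceeds the threshold are simply not made
agg_cut = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                                  linkage='ward')
labels_cut = agg_cut.fit_predict(X_principal)
print('clusters found:', agg_cut.n_clusters_)
{% endcodeblock %}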

BIRCH Clustering

1. Definition

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) scans the data once and summarizes it in a height-balanced CF tree, where each node stores a Clustering Feature (the count, linear sum and squared sum of the points it covers). The leaf summaries are then clustered, optionally by a final global clustering step.

2. Pros & Cons

Pros

- A single scan of the data suffices, so it handles datasets that do not fit in memory.
- Supports incremental (streaming) updates via partial fits.
- Its subclusters can be handed to another, more expensive clusterer as a preprocessing step.

Cons

- Only handles numeric (metric) features.
- Results are sensitive to the threshold and branching_factor parameters.
- The CF-tree summaries favor roughly spherical clusters.

3. Applications

Clustering very large or streaming datasets, and compressing data ahead of a costlier algorithm.

4. Code implementation

The example below generates six Gaussian blobs and lets the CF tree's threshold, rather than a preset cluster count, decide how many clusters emerge.

{% codeblock lang:python %}
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
from sklearn.datasets import make_blobs
from sklearn.cluster import Birch

# Generate six Gaussian blobs as a toy dataset and plot them
X, clusters = make_blobs(n_samples=450, centers=6, cluster_std=0.70, random_state=0)
plt.scatter(X[:, 0], X[:, 1], alpha=0.7, edgecolors='b')
plt.show()

# Fit BIRCH; with n_clusters=None the CF tree's threshold decides the subclusters
brc = Birch(branching_factor=50, n_clusters=None, threshold=1.5)
brc.fit(X)
labels = brc.predict(X)

# Visualize the predicted clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow', alpha=0.7, edgecolors='b')
plt.show()
{% endcodeblock %}
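
Because the CF tree is an incremental summary, BIRCH can also ingest data in chunks without ever holding the full dataset in memory. A minimal sketch, assuming the same blob data as above; the streaming is simulated with `np.array_split`, and the chunk count and threshold are illustrative:

{% codeblock lang:python %}
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=450, centers=6, cluster_std=0.70, random_state=0)

# Feed the data in nine chunks, as if it arrived as a stream; each
# partial_fit call only updates the CF tree's summary statistics
brc = Birch(branching_factor=50, n_clusters=6, threshold=0.5)
for chunk in np.array_split(X, 9):
    brc.partial_fit(chunk)

labels = brc.predict(X)
{% endcodeblock %}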

Spectral Clustering

1. Definition

Spectral clustering builds a similarity graph over the points, embeds each point using the leading eigenvectors of the graph Laplacian, and then runs an ordinary clusterer (usually k-means) in that embedding.

2. Pros & Cons

Pros

- Makes no assumption about cluster shape, so it can recover non-convex clusters that k-means misses.
- Needs only a similarity (affinity) matrix, not raw coordinates.

Cons

- The number of clusters must still be chosen in advance.
- Building and eigendecomposing the affinity matrix is expensive for large datasets (up to cubic in the number of points).

3. Code Implementation

The example below reuses the credit card preprocessing pipeline and compares an RBF (Gaussian kernel) affinity against a nearest-neighbors affinity using the silhouette score.

{% codeblock lang:python %}
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Load the credit card dataset, drop the ID column and forward-fill missing values
raw_df = pd.read_csv('CC GENERAL.csv')
raw_df = raw_df.drop('CUST_ID', axis=1)
raw_df = raw_df.ffill()

# Preprocess the data to make it visualizable:
# scale, normalize, then reduce to two dimensions
scaler = StandardScaler()
X_scaled = scaler.fit_transform(raw_df)

X_normalized = normalize(X_scaled)
X_normalized = pd.DataFrame(X_normalized)

pca = PCA(n_components=2)
X_principal = pca.fit_transform(X_normalized)
X_principal = pd.DataFrame(X_principal, columns=['P1', 'P2'])

# Model 1: affinity matrix from a Gaussian (RBF) kernel
spectral_model_rbf = SpectralClustering(n_clusters=2, affinity='rbf')
labels_rbf = spectral_model_rbf.fit_predict(X_principal)

# Visualize the clustering
plt.scatter(X_principal['P1'], X_principal['P2'], c=labels_rbf, cmap=plt.cm.winter)
plt.show()

# Model 2: affinity matrix from a nearest-neighbors graph
spectral_model_nn = SpectralClustering(n_clusters=2, affinity='nearest_neighbors')
labels_nn = spectral_model_nn.fit_predict(X_principal)

# Visualize the clustering
plt.scatter(X_principal['P1'], X_principal['P2'], c=labels_nn, cmap=plt.cm.winter)
plt.show()

# Evaluate performance with the silhouette score, computed on the
# same data that was clustered
affinity = ['rbf', 'nearest_neighbors']
s_scores = [silhouette_score(X_principal, labels_rbf),
            silhouette_score(X_principal, labels_nn)]

# Plot a bar graph to compare the two affinities
plt.bar(affinity, s_scores)
plt.xlabel('Affinity')
plt.ylabel('Silhouette Score')
plt.title('Comparison of different Clustering Models')
plt.show()

print(s_scores)
{% endcodeblock %}
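
To make the pipeline behind SpectralClustering concrete, here is a from-scratch sketch of the standard Ng–Jordan–Weiss recipe on a toy dataset. The two-moons data, the gamma value and the dense eigendecomposition are illustrative assumptions (scikit-learn uses sparse solvers internally); the point is the affinity → Laplacian → eigenvectors → k-means sequence from the definition above:

{% codeblock lang:python %}
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
k = 2

# 1. Affinity matrix from a Gaussian kernel (gamma hand-tuned for this toy data)
A = rbf_kernel(X, gamma=20.0)

# 2. Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(X)) - D_inv_sqrt @ A @ D_inv_sqrt

# 3. Embed each point with the eigenvectors of the k smallest eigenvalues
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, :k]
# Ng-Jordan-Weiss additionally normalizes each row to unit length
embedding /= np.linalg.norm(embedding, axis=1, keepdims=True)

# 4. Cluster the embedded points with k-means
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
{% endcodeblock %}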