Share @ LinkedIn Facebook  sklearn, hierarchical-clustering
Scikit-Learn - Hierarchical Clustering

Scikit-Learn - Hierarchical Clustering

Table of Contents

Introduction

A clustering algorithm like KMeans is good for clustering tasks as it is fast and easy to implement but it has limitations that it works well if data can be grouped into globular or spherical clusters and also one needs to provide a number of clusters. As KMeans is based on distance from the cluster center to point in the cluster, it'll be able to cluster data that is organized globular/spherical shape and will fail to cluster data organized in a different manner.

We'll look at hierarchical methods in this tutorial.

Hierarchical Clustering

Hierarchical clustering is a kind of clustering that uses either top-down or bottom-up approach in creating clusters from data. It either starts with all samples in the dataset as one cluster and goes on dividing that cluster into more clusters or it starts with single samples in the dataset as clusters and then merges samples based on criteria to create clusters with more samples.

We can visualize the results of clustering as a dendrogram as hierarchical clustering progress. This helps us in deciding when we want to stop clustering further (how "deep") by setting "depth" with some threshold.

It has 2 types

  • Agglomerative Clustering (bottom-up approach) - We start with single samples and clusters and keep on combining them into clusters until we are left with a single cluster.
  • Divisive Clustering (top-down approach) - We start with the whole dataset as one cluster and then keep on dividing it into small clusters until each consists of a single sample.

To understand agglomerative clustering & divisive clustering, we need to understand concepts of single linkage and complete linkage. Single linkage helps in deciding the similarity between 2 clusters which can then be merged into one cluster. Complete linkage helps with divisive clustering which is based on dissimilarity measures between clusters. Divisive Clustering chooses the object with the maximum average dissimilarity and then moves all objects to this cluster that are more similar to the new cluster than to the remainder.

  • Single Linkage: We take pairs of most similar samples in each cluster and merge 2 clusters that have the most similar 2 members into the new cluster.
  • Complete Linkage: We take pairs of most dissimilar samples in each cluster and then merge them into 2 clusters where dissimilarity distance is least.
  • Average Linkage: We take pairs of most similar samples in each cluster using average distance and merge 2 clusters which have the most similar 2 members into the new cluster.
  • Ward: Minimizes the sum of squared distance between all pairs of clusters. Its concept is the same as KMeans but the approach is hierarchical.

Note that we measure similarity based on the distance which is generally euclidean distance.

We'll start by importing necessary libraries. We'll be explaining the usage of agglomerative clustering using scipy and scikit-learn both.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import scipy
from scipy.cluster import hierarchy

import sklearn

import sys

print("Python Version : ", sys.version)
print("Scikit-Learn Version : ", sklearn.__version__)
Python Version :  3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
Scikit-Learn Version :  0.21.2

Load IRIS Data

We'll start by loading the famous IRIS dataset which has information about 3 different IRIS flower categories as features. It also holds information about the flower category. Scikit-Learn provides this dataset as part of its datasets module. We are taking only two features from the dataset for explanation purposes.

In [2]:
from sklearn import datasets

iris = datasets.load_iris()
X, Y = iris.data[:, [2,3]], iris.target

print("Dataset Features : ", iris.feature_names)
print("Dataset Target : ", iris.target_names)
print('Dataset Size : ', X.shape, Y.shape)
Dataset Features :  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Dataset Target :  ['setosa' 'versicolor' 'virginica']
Dataset Size :  (150, 2) (150,)

Below we are visualizing the iris dataset where the X-axis represents the 1st component (petal length) of the dataset and the Y-axis represents 2nd component (petal width). We have also color-encoded points according to flower type. We can notice that one type of flower is totally independent of the other two flowers. The flowers represented with color green and yellow overlap a bit.

In [ ]:
with plt.style.context("ggplot"):
    plt.scatter(X[:,0], X[:, 1], c=Y)
    plt.xlabel(iris.feature_names[2])
    plt.ylabel(iris.feature_names[3])
    plt.title("IRIS Dataset")

Scikit-Learn - Hierarchical Clustering

scipy.hierarchy

The hierarchy module of scipy provides us with linkage() method which accepts data as input and returns an array of size (n_samples-1, 4) as output which iteratively explains hierarchical creation of clusters.

The array of size (n_samples-1, 4) is explained as below with the meaning of each column of it. We'll be referring to it as an array Z of size (n_samples-1, 4).

  • 0th Column - The data in this column represents cluster number at iteration i which will be combined with another cluster.
  • 1st Column - The data in this column represents cluster number at iteration i which will be combined with another cluster.
  • 2nd Column - The data in this column represents distance between clusters specified in 0th and 1st column.
  • 3rd Column - The number of original observations in this newly formed cluster.

At ith-iteration of clustering algorithm, clusters Z[i,0] and Z[i, 1] are combined to form cluster n_samples+i. A cluster with index n_samples corresponds to a cluster with the original sample.

Important Parameters of linkage()

Below is a list of other important parameters of method linkage():

  • method: It accepts a string specifying linkage algorithm to use for clustering. Below is list of string arguments which it accepts.
    • complete
    • single
    • average
    • ward
    • centroid
    • weighted
    • median
  • metric: It accepts string specifying distance calculation between two points. It's set to euclidean by default.
  • optimal_ordering: Its a boolean parameter which if set to True will try to minimize the distance between successive leaves.

Hierarchical Clustering - Complete Linkage

Below we are generating cluster details for iris dataset loaded above using linkage() method of scipy.hierarchy. We have used the linkage algorithm complete for this purpose.

In [4]:
clusters = hierarchy.linkage(X, method="complete")

clusters[:10]
Out[4]:
array([[  0.,   1.,   0.,   2.],
       [  4., 150.,   0.,   3.],
       [  2.,  36.,   0.,   2.],
       [  3.,   7.,   0.,   2.],
       [  8., 151.,   0.,   4.],
       [  6.,  17.,   0.,   2.],
       [ 10., 153.,   0.,   3.],
       [ 28., 154.,   0.,   5.],
       [  9.,  32.,   0.,   2.],
       [ 27., 156.,   0.,   4.]])

Visualizing Hierarchical Clustering as Dendrogram

The scipy.hirearchy module provides method named dendrogram() for visualization of dendrogram created by linkage() method of clustering. It'll display overall process of how labels were combined to create clusters. Below we are visualizing dendrogram created using the complete linkage algorithm above. We need to pass it the output of the linkage() method along with original cluster labels.

In [ ]:
def plot_dendrogram(clusters):
    plt.figure(figsize=(20,6))
    dendrogram = hierarchy.dendrogram(clusters, labels=Y, orientation="top",leaf_font_size=9, leaf_rotation=360)
    plt.ylabel('Euclidean Distance');

plot_dendrogram(clusters)

Scikit-Learn - Hierarchical Clustering

We can notice from the above visualization that the majority of labels that are from the same flower class are kept together in the same cluster. There are few errors though due to overlap in feature values as depicted in the original visualization of the dataset at the beginning.

Hierarchical Clustering - Single Linkage

Below we are generating cluster details for iris dataset loaded above using linkage() method of scipy.hierarchy. We have used the linkage algorithm single for this purpose.

In [6]:
clusters = hierarchy.linkage(X, method="single")

clusters[:5]
Out[6]:
array([[  0.,   1.,   0.,   2.],
       [  4., 150.,   0.,   3.],
       [  8., 151.,   0.,   4.],
       [ 28., 152.,   0.,   5.],
       [ 33., 153.,   0.,   6.]])

Visualizing Hierarchical Clustering as Dendrogram

In [ ]:
plot_dendrogram(clusters)

Scikit-Learn - Hierarchical Clustering

Hierarchical Clustering - Average Linkage

Below we are generating cluster details for iris dataset loaded above using linkage() method of scipy.hierarchy. We have used the linkage algorithm average for this purpose.

In [8]:
clusters = hierarchy.linkage(X, method="average")

clusters[:5]
Out[8]:
array([[  0.,   1.,   0.,   2.],
       [  4., 150.,   0.,   3.],
       [  2.,  36.,   0.,   2.],
       [  3.,   7.,   0.,   2.],
       [  8., 151.,   0.,   4.]])

Visualizing Hierarchical Clustering as Dendrogram

In [ ]:
plot_dendrogram(clusters)

Scikit-Learn - Hierarchical Clustering

Hierarchical Clustering - Ward Linkage

Below we are generating cluster details for iris dataset loaded above using linkage() method of scipy.hierarchy. We have used the linkage algorithm ward for this purpose.

In [10]:
clusters = hierarchy.linkage(X, method="ward")

clusters[:5]
Out[10]:
array([[  0.,   1.,   0.,   2.],
       [  4., 150.,   0.,   3.],
       [  2.,  36.,   0.,   2.],
       [  3.,   7.,   0.,   2.],
       [  8., 151.,   0.,   4.]])

Visualizing Hierarchical Clustering as Dendrogram

In [ ]:
plot_dendrogram(clusters)

Scikit-Learn - Hierarchical Clustering

Scikit-Learn

The scikit-learn also provides an algorithm for hierarchical agglomerative clustering. The AgglomerativeClustering class available as a part of the cluster module of sklearn can let us perform hierarchical clustering on data. We need to provide a number of clusters beforehand

Important Parameters of AgglomerativeClustering

Below is a list of important parameters of AgglomerativeClustering which can be tuned to improve the performance of the clustering algorithm:

  • linkage - It accepts string specifying linkage algorithms to use for clustering. It has a ward as the default value. Below is a list of values that this parameter accepts.
    • single
    • complete
    • average
    • ward
  • affinity - It accepts string value or function specifying metric used to calculate linkage distance between points. The default value is euclidean for it. Below is a list of possible values.
    • euclidean
    • l1
    • l2
    • manhattan
    • cosine
    • precomputed

Agglomerative Clustering - Ward

Below we are trying AgglomerativeClustering on IRIS data loaded earlier with linkage algorithm as ward. We'll fit the model on train data and predict labels using the fit_predict() method. We'll be using the default euclidean method of measuring distance between two points of data.

In [12]:
from sklearn.cluster import AgglomerativeClustering

clustering = AgglomerativeClustering(n_clusters=3, linkage="ward")
Y_preds = clustering.fit_predict(X)

Y_preds
Out[12]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Evaluating Performance Of Clustering Algorithm

We'll be using the adjusted_rand_score method for measuring the performance of the clustering algorithm by giving original labels and predicted labels as input to the method. It tries all possible pairs of clustering labels returns a value between -1.0 and 1.0. If the clustering algorithm has predicted labels randomly then it'll return value of 0.0. If the value of 1.0 is returned then it means that the algorithm predicted all labels correctly.

In [13]:
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(Y, Y_preds)
Out[13]:
0.8857921001989628

Visualizing Original and Predicted Labels for Comparison

Below we have designed a method that takes as input original data, original labels, predicted labels and then plots two scatter plot comparing original data with predicted labels. This plot can help us compare the result of the clustering algorithm by comparing points to check whether they have been assigned the same cluster (flower type) as original data.

In [ ]:
def plot_actual_prediction_iris(X, Y, Y_preds):
    with plt.style.context(("ggplot", "seaborn")):
        plt.figure(figsize=(17,6))

        plt.subplot(1,2,1)
        plt.scatter(X[Y==0,0],X[Y==0,1], c = 'red', marker="^", s=50)
        plt.scatter(X[Y==1,0],X[Y==1,1], c = 'green', marker="^", s=50)
        plt.scatter(X[Y==2,0],X[Y==2,1], c = 'blue', marker="^", s=50)
        plt.title("Original Data")

        plt.subplot(1,2,2)
        plt.scatter(X[Y_preds==0,0],X[Y_preds==0,1], c = 'red', marker="^", s=50)
        plt.scatter(X[Y_preds==1,0],X[Y_preds==1,1], c = 'green', marker="^", s=50)
        plt.scatter(X[Y_preds==2,0],X[Y_preds==2,1], c = 'blue', marker="^", s=50)

        plt.title("Clustering Algorithm Prediction");

plot_actual_prediction_iris(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Single Linkage

Below we are trying AgglomerativeClustering on IRIS data loaded earlier with linkage algorithm as single. We'll fit the model on train data and predict labels using the fit_predict() method. We'll be using the default euclidean method of measuring distance between two points of data.

We are then plotting data with original labels and predicted labels to compare the performance of the clustering algorithm.

In [15]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="single")
Y_preds = clustering.fit_predict(X)

Y_preds
Out[15]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Evaluating Performance Of Clustering Algorithm

In [16]:
adjusted_rand_score(Y, Y_preds)
Out[16]:
0.5657466140922338

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_iris(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Complete Linkage

Below we are trying AgglomerativeClustering on IRIS data loaded earlier with the linkage algorithm as complete. We'll fit the model on train data and predict labels using the fit_predict() method. We'll be using the default euclidean method of measuring distance between two points of data.

We are then plotting data with original labels and predicted labels to compare the performance of the clustering algorithm.

In [18]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="complete")
Y_preds = clustering.fit_predict(X)

Y_preds
Out[18]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Evaluating Performance Of Clustering Algorithm

In [19]:
adjusted_rand_score(Y, Y_preds)
Out[19]:
0.8857921001989628

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_iris(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Average Linkage

Below we are trying AgglomerativeClustering on IRIS data loaded earlier with linkage algorithm as average. We'll fit the model on train data and predict labels using the fit_predict() method. We'll be using the default euclidean method of measuring distance between two points of data.

We are then plotting data with original labels and predicted labels to compare the performance of the clustering algorithm.

In [21]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="average")
Y_preds = clustering.fit_predict(X)

Y_preds
Out[21]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Evaluating Performance Of Clustering Algorithm

In [22]:
adjusted_rand_score(Y, Y_preds)
Out[22]:
0.8857921001989628

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_iris(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Create Moons Dataset (Two Interleaving Circles)

Below we are creating two half interleaving circles using make_moons() method of scikit-learn's datasets module. It creates a dataset with 400 samples and 2 class labels.

We are also plotting the dataset to understand the dataset that was created.

In [24]:
X, Y = datasets.make_moons(n_samples=400,
                  noise=0.09,
                  random_state=1)

print('Dataset Size : ',X.shape, Y.shape)
Dataset Size :  (400, 2) (400,)
In [ ]:
with plt.style.context("ggplot"):
    plt.scatter(X[:,0],X[:,1], c = Y, marker="o", s=50)
    plt.title("Original Data");

Scikit-Learn - Hierarchical Clustering

Trying Agglomerative Clustering With Different Linkage

We'll now try agglomerative clustering with different linkage algorithms on a dataset created above to check how each clustering algorithm is performing on it.

Agglomerative Clustering - Complete Linkage

In [26]:
clustering = AgglomerativeClustering(n_clusters=2, linkage="complete")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [27]:
adjusted_rand_score(Y, Y_preds)
Out[27]:
0.11013979437153523

Visualizing Original and Predicted Labels for Comparison

In [ ]:
def plot_actual_prediction_moons(X,Y,Y_preds):
    with plt.style.context("ggplot"):
        plt.figure(figsize=(8,4))

        plt.subplot(1,2,1)
        plt.scatter(X[:,0],X[:,1], c = Y, marker="o", s=50)
        plt.title("Original Data")

        plt.subplot(1,2,2)
        plt.scatter(X[:,0],X[:,1], c = Y_preds , marker="o", s=50)

        plt.title("Clustering Algorithm Prediction");

plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Single Likage

In [29]:
clustering = AgglomerativeClustering(n_clusters=2, linkage="single")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [30]:
adjusted_rand_score(Y, Y_preds)
Out[30]:
1.0

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Ward Linkage

In [32]:
clustering = AgglomerativeClustering(n_clusters=2, linkage="ward")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [33]:
adjusted_rand_score(Y, Y_preds)
Out[33]:
0.569009661294173

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Average Linkage

In [35]:
clustering = AgglomerativeClustering(n_clusters=2, linkage="average")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [36]:
adjusted_rand_score(Y, Y_preds)
Out[36]:
0.569009661294173

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Create Dataset Of Circles (Large Circle Containing A Smaller Circle)

Below we are creating a dataset which has one circle inside another circle. It creates a dataset with 300 samples and 2 class labels representing each circle.

We are also plotting the dataset to look at data.

In [38]:
from sklearn import datasets

X, Y  = datasets.make_circles(n_samples=300, noise=0.01,factor=0.5)
print('Dataset Size : ',X.shape, Y.shape)
Dataset Size :  (300, 2) (300,)
In [ ]:
with plt.style.context("ggplot"):
    plt.scatter(X[:,0],X[:,1], c = Y, marker="o", s=50)
    plt.title("Original Data");

Scikit-Learn - Hierarchical Clustering

Trying Agglomerative Clustering With Different Linkage

We'll now try agglomerative clustering with different linkage algorithms on a dataset created above to check how each clustering algorithm is performing on it.

Agglomerative Clustering - Complete Linkage

In [41]:
clustering = AgglomerativeClustering(n_clusters=2, linkage="complete")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [42]:
adjusted_rand_score(Y, Y_preds)
Out[42]:
-0.0018522270480906466

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Average Linkage

In [44]:
clustering = AgglomerativeClustering(n_clusters=2, linkage="average")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [45]:
adjusted_rand_score(Y, Y_preds)
Out[45]:
0.016374761928410086

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Single Linkage

In [47]:
clustering = AgglomerativeClustering(n_clusters=2, linkage="single")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [48]:
adjusted_rand_score(Y, Y_preds)
Out[48]:
1.0

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Ward

In [50]:
clustering = AgglomerativeClustering(n_clusters=2, linkage="ward")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [51]:
adjusted_rand_score(Y, Y_preds)
Out[51]:
-0.002556081684854566

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Create Isotropic Gaussian Blobs dataset

Below we are creating isotropic Gaussian blobs dataset for clustering. It creates a dataset with 3 blobs. It has 400 samples and 3 class labels each representing blob label.

We are also plotting the dataset to understand it better.

In [ ]:
X, Y = datasets.make_blobs(n_samples=400, random_state=8)

with plt.style.context("ggplot"):
    plt.scatter(X[:,0],X[:,1], c = Y, marker="o", s=50)
    plt.title("Original Data");

Scikit-Learn - Hierarchical Clustering

Trying Agglomerative Clustering With Different Linkage

We'll now try agglomerative clustering with different linkage algorithms on a dataset created above to check how each clustering algorithm is performing on it.

Agglomerative Clustering - Ward

In [54]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="ward")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [55]:
adjusted_rand_score(Y, Y_preds)
Out[55]:
1.0

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Single Linkage

In [57]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="single")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [58]:
adjusted_rand_score(Y, Y_preds)
Out[58]:
1.0

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Complete Linkage

In [60]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="complete")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [61]:
adjusted_rand_score(Y, Y_preds)
Out[61]:
1.0

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Average Linkage

In [63]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="average")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [64]:
adjusted_rand_score(Y, Y_preds)
Out[64]:
1.0

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Create Isotropic Gaussian Blobs dataset & Skew Blobs Using Transformation

Below we are again creating a Gaussian blobs dataset but this time we are also skewing blobs by adding noise data to it.

In [66]:
random_state = 170
X, Y = datasets.make_blobs(n_samples=400, random_state=random_state)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X = np.dot(X, transformation)
In [ ]:
with plt.style.context("ggplot"):
    plt.scatter(X[:,0],X[:,1], c = Y, marker="o", s=50)
    plt.title("Original Data");

Scikit-Learn - Hierarchical Clustering

Trying Agglomerative Clustering With Different Linkage

We'll now try agglomerative clustering with different linkage algorithms on a dataset created above to check how each clustering algorithm is performing on it.

Agglomerative Clustering - Average Linkage

In [68]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="average")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [69]:
adjusted_rand_score(Y, Y_preds)
Out[69]:
0.5525388067906037

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Ward

In [71]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="ward")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [72]:
adjusted_rand_score(Y, Y_preds)
Out[72]:
0.5065903247847451

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Single Linkage

In [74]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="single")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [75]:
adjusted_rand_score(Y, Y_preds)
Out[75]:
0.5683499586603816

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

Agglomerative Clustering - Complete Linkage

In [77]:
clustering = AgglomerativeClustering(n_clusters=3, linkage="complete")
Y_preds = clustering.fit_predict(X)

Evaluating Performance Of Clustering Algorithm

In [78]:
adjusted_rand_score(Y, Y_preds)
Out[78]:
0.45733577518422425

Visualizing Original and Predicted Labels for Comparison

In [ ]:
plot_actual_prediction_moons(X, Y, Y_preds)

Scikit-Learn - Hierarchical Clustering

This ends our small tutorial explaining how to perform hierarchical clustering using scikit-learn. Please feel free to let us know your views in the comments section.

References



Sunny Solanki  Sunny Solanki