Clustering is a type of unsupervised machine learning. Unlike supervised learning, where the developer knows the target variable, clustering gives the developer no prior knowledge about the labels of the data.

Clustering is the task of grouping samples that share similar characteristics, based on a predefined similarity or dissimilarity measure such as Euclidean distance.

Typical applications of clustering include:

- Clustering students in a class by performance, using grades and other attributes, so that coaching can be customized later.
- Clustering documents that cover the same topics.
- Separating voices from different sources in a mixed recording.
- And many more.
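
To make the distance-based notion of similarity mentioned above concrete, here is a minimal sketch that computes pairwise Euclidean distances with scikit-learn's `pairwise_distances`. The three sample points are hypothetical and chosen only so the distances are easy to check by hand.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Three hypothetical samples with two features each.
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])

# Entry [i, j] holds the Euclidean distance between sample i and sample j.
dist = pairwise_distances(points, metric="euclidean")
print(dist)
```

Samples separated by a small distance are treated as similar and are candidates for the same cluster.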

The `sklearn.cluster` module provides a list of clustering algorithms which we'll try below. We'll start with `KMeans` and then explore other algorithms.

In [1]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn import cluster, datasets
import warnings
import sys
print("Python Version : ",sys.version)
print("Scikit-Learn Version : ",sklearn.__version__)
warnings.filterwarnings('ignore') ## Silence warnings (including future warnings).
np.set_printoptions(precision=3)
## The magic command below renders plots inline in the current notebook.
## The alternative (%matplotlib notebook) renders interactive plots inside the notebook.
%matplotlib inline
```

KMeans is an iterative algorithm that begins with initial cluster centers (for example, randomly chosen ones) and then tries to minimize the distance between sample points and these cluster centers. The number of clusters must be provided in advance. KMeans uses Euclidean distance to measure the distance between cluster centers and sample points. On each iteration, a sample point is reassigned to another cluster if it turns out to be nearer to that cluster's center, and each center is then recomputed as the mean of its assigned points.
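
The iteration described above can be sketched in plain NumPy. This is a teaching sketch of the basic assign-and-update loop, not scikit-learn's implementation; the function name `kmeans_sketch`, its parameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Basic K-means loop; a teaching sketch, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each sample joins the cluster with the nearest center.
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned samples.
        # An empty cluster keeps its previous center.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break  # Centers stopped moving, so assignments will not change.
        centers = new_centers
    return centers, labels

# Two well-separated hypothetical groups of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centers, labels = kmeans_sketch(X, n_clusters=2)
print(centers)
```

On data this well separated, the loop should settle with one center near each group. Which group receives label 0 depends on the random initialization.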

We'll create a dataset with 250 samples, 2 features, and 5 cluster centers using scikit-learn's `make_blobs` function.

In [2]:

```
samples, clusters = datasets.make_blobs(n_samples=250, n_features=2, centers=5, cluster_std=0.7, random_state=12345)
print('Dataset size : ', samples.shape, clusters.shape)
print('Cluster names : ',set(clusters))
```

We'll visualize the dataset with a scatter plot of Feature 1 against Feature 2. Each cluster gets its own color and marker so that the clusters are easy to tell apart.

In [3]:

```
## Only 'ggplot' is applied here; the 'seaborn' style name was renamed in newer Matplotlib releases.
with plt.style.context('ggplot'):
    plt.figure(figsize=(8,6))
    ## One scatter call per true cluster, each with its own color and marker.
    for i, c, m in zip(range(5),['red','green','blue','orange','purple'], ['s','+','^','o', 'x']):
        plt.scatter(samples[clusters == i,0],samples[clusters == i,1], color=c, marker=m, s=80, alpha = 0.8, label= 'Cluster %d'%i)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Visualizing Dataset')
    plt.legend(loc='best');
```

We initialize the KMeans clustering algorithm below with `n_clusters=5` because we already know the number of clusters. For cases where the number of clusters isn't known upfront, the elbow method explained below helps find a suitable number.

In [4]:

```
kmeans = cluster.KMeans(n_clusters=5)
kmeans.fit(samples)
```

In [5]:

```
preds = kmeans.predict(samples)
```

Below we print the accuracy and the confusion matrix. The confusion matrix shows that the cluster labels returned by KMeans differ from the actual labels, because KMeans numbers its clusters arbitrarily; this is why the accuracy can be low. The `adjusted_rand_score` function handles this scenario because it compares how samples are grouped rather than the label values themselves.

In [6]:

```
from sklearn.metrics import accuracy_score, confusion_matrix, adjusted_rand_score
print('Accuracy : %.3f'%accuracy_score(y_true = clusters, y_pred=preds))
print('Confusion Matrix : \n', confusion_matrix(y_true=clusters, y_pred=preds))
print('Adjusted Rand Score : %.3f'%adjusted_rand_score(labels_true=clusters, labels_pred=preds))
```
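
The label mismatch described above is easy to reproduce on a tiny example. The two label arrays below are hypothetical: they describe exactly the same grouping, with only the cluster ids permuted.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

true_labels = np.array([0, 0, 1, 1, 2, 2])
# Same grouping as true_labels, but the ids are permuted (0->2, 1->0, 2->1).
pred_labels = np.array([2, 2, 0, 0, 1, 1])

acc = accuracy_score(true_labels, pred_labels)
ari = adjusted_rand_score(true_labels, pred_labels)
print('Accuracy            : %.3f' % acc)
print('Adjusted Rand Score : %.3f' % ari)
```

Accuracy comes out as 0 because no label value matches position by position, while the adjusted Rand score is 1 because the groupings are identical.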

We can also access the center of each cluster using the `cluster_centers_` attribute of the KMeans object.

In [7]:

```
print('Cluster Centers : \n', str(kmeans.cluster_centers_))
```

We can also access the sum of squared distances of the samples from their closest cluster center using the `inertia_` attribute of the KMeans object. For a fixed number of clusters, a lower value indicates tighter clusters.

In [8]:

```
print('Sum of squared distances of samples to their closest cluster center : %.2f'%kmeans.inertia_)
```
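
As a sanity check on what this attribute measures, the same quantity can be recomputed by hand from `cluster_centers_`. The sketch below is self-contained and rebuilds the dataset with the same `make_blobs` arguments used earlier; `n_init=10` and `random_state=0` are assumptions added here only for reproducibility.

```python
import numpy as np
from sklearn import cluster, datasets

X, _ = datasets.make_blobs(n_samples=250, n_features=2, centers=5,
                           cluster_std=0.7, random_state=12345)
km = cluster.KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Squared Euclidean distance from every sample to every fitted center.
sq_dist = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
# Keep only the distance to the closest center, then sum over all samples.
manual_inertia = sq_dist.min(axis=1).sum()

print('Manual inertia : %.2f' % manual_inertia)
print('inertia_       : %.2f' % km.inertia_)
```

The two values should agree up to floating-point rounding.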

Below we plot all the sample points and link each one to its cluster center with a line.

In [9]:

```
## Only 'ggplot' is applied here; the 'seaborn' style name was renamed in newer Matplotlib releases.
with plt.style.context('ggplot'):
    plt.figure(figsize=(10,6))
    ## Scatter the samples of each predicted cluster with its own color and marker.
    plt.scatter(samples[preds == 0,0],samples[preds == 0,1], color='red', marker='s', s=80, alpha = 0.8, label= 'Cluster 0')
    plt.scatter(samples[preds == 1,0],samples[preds == 1,1], color='green', marker='^', s=80, alpha = 0.8, label= 'Cluster 1')
    plt.scatter(samples[preds == 2,0],samples[preds == 2,1], color='blue', marker='*', s=80, alpha = 0.8, label= 'Cluster 2')
    plt.scatter(samples[preds == 3,0],samples[preds == 3,1], color='orange', marker='o', s=80, alpha = 0.8, label= 'Cluster 3')
    plt.scatter(samples[preds == 4,0],samples[preds == 4,1], color='purple', marker='+', s=80, alpha = 0.8, label= 'Cluster 4')
    ## Draw a line from each cluster center to every sample assigned to it.
    for x,y in zip(samples[preds == 0,0],samples[preds == 0,1]):
        plt.plot([kmeans.cluster_centers_[0][0],x],[kmeans.cluster_centers_[0][1],y], color='red')
    for x,y in zip(samples[preds == 1,0],samples[preds == 1,1]):
        plt.plot([kmeans.cluster_centers_[1][0],x],[kmeans.cluster_centers_[1][1],y], color='green')
    for x,y in zip(samples[preds == 2,0],samples[preds == 2,1]):
        plt.plot([kmeans.cluster_centers_[2][0],x],[kmeans.cluster_centers_[2][1],y], color='blue')
    for x,y in zip(samples[preds == 3,0],samples[preds == 3,1]):
        plt.plot([kmeans.cluster_centers_[3][0],x],[kmeans.cluster_centers_[3][1],y], color='orange')
    for x,y in zip(samples[preds == 4,0],samples[preds == 4,1]):
        plt.plot([kmeans.cluster_centers_[4][0],x],[kmeans.cluster_centers_[4][1],y], color='purple')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Visualizing Predictions & Cluster Centers')
    plt.legend(loc='best');
```

In the scenario above, we knew the number of clusters in advance. But what if we encounter data for which the number of clusters is unknown? The elbow method can be used to address this problem.

Here we compute the cluster dispersion for different values of k and plot it. We then take the k at the "pit of the elbow" as the number of clusters. The intuition is that beyond this point, adding more clusters no longer reduces the sum of squared distances of samples from their cluster centers by much, so it is a reasonable number of clusters to try.

In [10]:

```
plt.figure(figsize=(8,5))
distortions = []
## Fit KMeans for k = 1..10 and record the inertia of each fit.
for i in range(1,11):
    kmeans = cluster.KMeans(n_clusters=i)
    kmeans.fit(samples)
    distortions.append(kmeans.inertia_)
print('Distortions (Sum Of Squared Distance of Samples from Closest Cluster Center) : ',distortions)
## Only 'ggplot' is applied here; the 'seaborn' style name was renamed in newer Matplotlib releases.
with plt.style.context('ggplot'):
    plt.plot(range(1,11), distortions)
    plt.scatter(range(1,11), distortions, color='red', marker='o', s=80)
    plt.xlabel('Number Of Clusters')
    plt.ylabel('Distortions')
    plt.title('The Elbow Method (Num of Clusters vs Distortions)')
    plt.xticks(range(1,11));
```
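
The elbow is usually read off the plot by eye. As a rough numerical aid, one can also print how much each additional cluster reduces the inertia. The sketch below is one possible heuristic, not part of the elbow method itself; it rebuilds the dataset with the same `make_blobs` arguments used earlier, and `n_init=10` and `random_state=0` are assumptions added only for reproducibility.

```python
import numpy as np
from sklearn import cluster, datasets

X, _ = datasets.make_blobs(n_samples=250, n_features=2, centers=5,
                           cluster_std=0.7, random_state=12345)

ks = list(range(1, 11))
distortions = np.array([
    cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
])

# Fraction of the previous inertia removed by going from k-1 to k clusters.
relative_drop = (distortions[:-1] - distortions[1:]) / distortions[:-1]

for k, drop in zip(ks[1:], relative_drop):
    print('k=%2d : inertia reduced by %5.1f%%' % (k, 100 * drop))
```

A k after which the reductions become small is a candidate for the elbow. The cutoff for "small" is a judgment call and depends on the data.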

**Clustering comes with assumptions:** A clustering algorithm finds clusters by making assumptions about how samples should be grouped. Each algorithm makes different assumptions. The quality and interpretability of the resulting clusters depend on how well these assumptions match your data and your goal. K-means, for example, assumes that all clusters are spherical and have equal variance.
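
To see what happens when the spherical-cluster assumption does not hold, here is a small sketch that runs KMeans on scikit-learn's two-moons toy dataset, whose clusters are crescent-shaped. The choice of `make_moons` and its parameters is an assumption made for illustration; it is not part of the dataset used earlier.

```python
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clusters that are clearly not spherical.
X, y = datasets.make_moons(n_samples=300, noise=0.05, random_state=0)

km = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
ari = adjusted_rand_score(y, km.labels_)
print('Adjusted Rand Score on two-moons data : %.3f' % ari)
```

The score should fall well short of 1, because KMeans tends to cut the data with a straight boundary instead of following the two crescents.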

Sunny Solanki