Yellowbrick - Text Data Visualizations [Python]

The yellowbrick is a Python library built on top of scikit-learn and matplotlib to visualize various machine learning metrics. It provides an API to visualize metrics related to classification, regression, text data analysis, clustering, feature relations, and more. We have already created a tutorial explaining the usage of yellowbrick for classification and regression metrics. This tutorial will specifically concentrate on the text data analysis visualizations available with yellowbrick.

Below is a list of visualizations available with yellowbrick for understanding text data better:

  • Term Frequency Bar Chart - It displays the frequency of words in the corpus as a bar chart, in ascending or descending order.
  • t-SNE Corpus Visualization - It uses scikit-learn's t-SNE (t-distributed stochastic neighbor embedding) algorithm to project high-dimensional document vectors down to 2 dimensions while preserving the neighborhood structure of the original space. The two-dimensional data is then plotted as a scatter chart, which makes clusters easy to spot.
  • Dispersion Plot - It shows where and how often each word from a chosen list appears across a corpus.
  • UMAP Corpus Visualization - It uses the UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction algorithm to reduce the data to 2 dimensions, which are then plotted as a scatter chart. It's also very useful for detecting clusters.
  • PosTag Visualization - It displays the distribution of parts of speech (verbs, nouns, prepositions, adjectives, etc.) in a text corpus as a bar chart.

We'll start by loading the necessary libraries.

In [1]:
import yellowbrick
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")

Load Dataset

We'll be using the spam/ham SMS collection dataset available from UCI to explain the text data analysis visualizations available with yellowbrick. We have first downloaded the dataset, unzipped it, and then loaded the data from the file as a list of messages and their labels. The dataset has the content of SMS messages and their labels (spam/ham).

In [2]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip
--2020-11-09 16:46:26--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/x-httpd-php]
Saving to: ‘smsspamcollection.zip’

smsspamcollection.z 100%[===================>] 198.65K  99.5KB/s    in 2.0s

2020-11-09 16:46:31 (99.5 KB/s) - ‘smsspamcollection.zip’ saved [203415/203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection
  inflating: readme
In [3]:
import collections

with open('SMSSpamCollection') as f:
    data = [line.strip().split('\t') for line in f.readlines()]

y, text = zip(*data)

collections.Counter(y)
Out[3]:
Counter({'ham': 4827, 'spam': 747})
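The file is plain text with one "label<TAB>message" pair per line, which is what the parsing above relies on. Below is a minimal sketch of that step run on an inline sample (the sample messages are made up for illustration, not taken from the dataset):

```python
import collections

# Made-up sample in the same "label<TAB>message" layout as SMSSpamCollection.
sample = (
    "ham\tSee you at lunch tomorrow\n"
    "spam\tWINNER!! Claim your prize now\n"
    "ham\tRunning late, start without me\n"
)

# Split each line on the first tab only, so tabs inside a message (if any) survive.
pairs = [line.split("\t", 1) for line in sample.strip().split("\n")]
labels, messages = zip(*pairs)

print(collections.Counter(labels))  # Counter({'ham': 2, 'spam': 1})
```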

Token Frequency Distribution

The yellowbrick provides a token frequency distribution bar chart as a part of the FreqDistVisualizer class. We have first transformed the text data into a matrix of word counts using the CountVectorizer feature extractor from scikit-learn.

We have then created an instance of FreqDistVisualizer, giving it the list of feature names and the number of top words we want to display as the n parameter. We have then fitted the transformed text data to the visualizer object. At last, we have generated the visualization by calling the show() method on the FreqDistVisualizer object.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vec = CountVectorizer(stop_words="english")

transformed_data = vec.fit_transform(text)
In [ ]:
from yellowbrick.text import FreqDistVisualizer

fig = plt.figure(figsize=(10,7))
ax = fig.add_subplot(111)

freq_dist_viz = FreqDistVisualizer(features=vec.get_feature_names(), color="tomato", n=30, ax=ax)

freq_dist_viz.fit(transformed_data)

freq_dist_viz.show();

Yellowbrick - Text Data Visualizations

The yellowbrick also provides easy-to-use methods if we don't want to use the class-based approach. Below we have created a token frequency distribution bar chart using the freqdist() method of the yellowbrick.text module. This time we have laid the bar chart out vertically by setting the orient parameter to "v".

In [ ]:
from yellowbrick.text import freqdist

fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)

freqdist(vec.get_feature_names(), transformed_data, orient="v", color="lime", ax=ax);

Yellowbrick - Text Data Visualizations

t-SNE Corpus Visualization

The yellowbrick lets us create a t-SNE corpus cluster visualization using the TSNEVisualizer class. We have first transformed our original text data to floats using TfidfVectorizer from sklearn. We have then created an instance of TSNEVisualizer and fitted the transformed data to it. The TSNEVisualizer class has a few important parameters.

  • decompose - It provides two options for decomposing the data before t-SNE is applied:
    • svd - This is the default.
    • pca
  • decompose_by - It lets us specify how many components to use for the decomposition. The default is 50.
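The decomposition step can be pictured with plain NumPy: keep only the top-k singular components of the document-term matrix before t-SNE runs on the reduced data. Below is a rough sketch of the idea with toy dimensions (the visualizer handles this step internally; this is only an illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 10))  # stand-in for an (n_docs, n_terms) TF-IDF matrix

# Truncated SVD: project documents onto the top-k singular vectors
# (k=5 here, analogous to decompose_by=50 in TSNEVisualizer).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
X_reduced = U[:, :k] * S[:k]

print(X.shape, "->", X_reduced.shape)  # (20, 10) -> (20, 5)
```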

We can see from the visualization below that the majority of spam and ham messages are grouped into their own clusters.

In [ ]:
from yellowbrick.text import TSNEVisualizer
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(stop_words="english")
transformed_text = vec.fit_transform(text)

fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(111)

tsne_viz = TSNEVisualizer(ax=ax,
                          decompose="svd",
                          decompose_by=50,
                          colors=["tomato", "lime"],
                          random_state=123)

tsne_viz.fit(transformed_text.toarray(), y)

tsne_viz.show();

Yellowbrick - Text Data Visualizations

Below we have created another t-SNE visualization of our dataset, this time using pca decomposition.

In [ ]:
fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(111)

tsne_viz = TSNEVisualizer(ax=ax,
                          decompose="pca",
                          decompose_by=50,
                          colors=["tomato", "lime"],
                          random_state=123)

tsne_viz.fit(transformed_text.toarray(), y)

tsne_viz.show();

Yellowbrick - Text Data Visualizations

Now, we have created the t-SNE visualization using the tsne() method available from yellowbrick. This time we have used 100 components for the decomposition.

In [ ]:
from yellowbrick.text.tsne import tsne

vec = TfidfVectorizer(stop_words="english")
transformed_text = vec.fit_transform(text)

fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111)

tsne(transformed_text.toarray(), y, ax=ax, decompose="pca", decompose_by=100, colors=["dodgerblue", "fuchsia"]);

Yellowbrick - Text Data Visualizations

Dispersion Plot

The yellowbrick provides us with the DispersionPlot class to create a dispersion plot. We have first split each individual message into a list of words. We have then listed the target words whose dispersion across the text corpus we want to see.

We have then created an instance of DispersionPlot, giving it the list of target words. We have then fitted the tokenized text data to it and, at last, called the show() method to display the chart.

In [ ]:
from yellowbrick.text import DispersionPlot

total_docs = [doc.split() for doc in text]

target_words = ["free", "download", "win", "congrats", "crazy", "customer"]

fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)

visualizer = DispersionPlot(target_words,
                            ignore_case=True,
                            color=["lime", "tomato"],
                            ax=ax)
visualizer.fit(total_docs, y)
visualizer.show();

Yellowbrick - Text Data Visualizations

Below we have created a dispersion chart using the dispersion() method available from yellowbrick. Please make a note that this time we have taken case into consideration by setting ignore_case to False, hence the words Free and free will be treated as two different words rather than one.
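As a tiny pure-Python illustration of the difference (not Yellowbrick's internals): with case ignored, Free, free, and FREE all count as the same token, while with case taken into account they do not.

```python
tokens = ["Free", "entry", "free", "FREE", "prize"]
target = "free"

# Exact match vs. case-insensitive match against the target word.
case_sensitive = sum(1 for t in tokens if t == target)
case_insensitive = sum(1 for t in tokens if t.lower() == target)

print(case_sensitive, case_insensitive)  # 1 3
```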

In [ ]:
from yellowbrick.text import dispersion

target_words = ["free", "download", "win", "congrats", "crazy", "customer"]

fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)

dispersion(target_words, total_docs, y=y, ax=ax, ignore_case=False, color=["lawngreen", "red"]);

Yellowbrick - Text Data Visualizations

UMAP Corpus Visualization

The third chart type that we'll explain is the UMAP corpus visualization using UMAPVisualizer from yellowbrick (it requires the umap-learn package to be installed). We have first transformed the text data using TfidfVectorizer. We have then created an instance of UMAPVisualizer, fitted the text data to it, and generated the visualization using the show() method.

In [ ]:
from yellowbrick.text import UMAPVisualizer


vec = TfidfVectorizer(stop_words="english")
transformed_text = vec.fit_transform(text)

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)

umap = UMAPVisualizer(ax=ax, colors=["lime", "tomato"])
umap.fit(transformed_text, y)
umap.show();

Yellowbrick - Text Data Visualizations

Below we have explained the second way of generating the UMAP corpus visualization, using the umap() method.

In [ ]:
from yellowbrick.text import umap

vec = TfidfVectorizer(stop_words="english")
transformed_text = vec.fit_transform(text)

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)

umap(transformed_text, y, ax=ax, colors=["lime", "tomato"]);

Yellowbrick - Text Data Visualizations

PosTag Visualization

The PosTag visualization is available through the PosTagVisualizer class from yellowbrick. The PosTagVisualizer takes as input, for each document, a list of sentences, where each sentence is a list of (word, part-of-speech tag) tuples (noun, verb, adjective, etc.). The nltk library provides an easy way to generate parts-of-speech tags for a list of texts.
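To make that expected input shape concrete, here is a hand-built example of the nested structure (the tags are written by hand for illustration, not produced by nltk):

```python
from collections import Counter

# Each document is a list of sentences; each sentence is a list of
# (word, part-of-speech tag) tuples.
pos_tagged_docs = [
    [  # document 1
        [("Congrats", "NNP"), ("you", "PRP"), ("won", "VBD")],
    ],
    [  # document 2
        [("See", "VB"), ("you", "PRP"), ("soon", "RB")],
    ],
]

# Tally tags across the corpus, mirroring the counts the visualizer plots.
tag_counts = Counter(tag for doc in pos_tagged_docs
                         for sent in doc
                         for _word, tag in sent)
print(tag_counts)
```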

We have first created an instance of PosTagVisualizer and then fitted the parts-of-speech tag data to it. We have then generated the visualization using the show() method.

In [ ]:
from yellowbrick.text import PosTagVisualizer
import nltk

# nltk.download('averaged_perceptron_tagger')  # run once if the tagger data is missing

# pos_tag_sents() expects tokenized sentences, so split each message into words first.
pos_tags_first_sents = [[val] for val in nltk.pos_tag_sents([doc.split() for doc in text[:100]])]

fig = plt.figure(figsize=(12,7))
ax = fig.add_subplot(111)

viz = PosTagVisualizer(ax=ax, frequency=True)
viz.fit(pos_tags_first_sents, y[:100])
viz.show();

Yellowbrick - Text Data Visualizations

Below we have explained another way to generate the PosTag visualization, but this time we have separated the parts-of-speech tag distribution per class of the classification dataset by setting stack to True.

In [ ]:
fig = plt.figure(figsize=(12,7))
ax = fig.add_subplot(111)

viz = PosTagVisualizer(ax=ax, frequency=True, stack=True)
viz.fit(pos_tags_first_sents, y[:100])
viz.show();

Yellowbrick - Text Data Visualizations

At last, we have explained how the above visualization can be created using the postag() method of yellowbrick.

In [ ]:
from yellowbrick.text.postag import postag

# Again, tokenize each message before tagging.
pos_tags_first_sents = [[val] for val in nltk.pos_tag_sents([doc.split() for doc in text[:500]])]

fig = plt.figure(figsize=(12,7))
ax = fig.add_subplot(111)

postag(pos_tags_first_sents, y[:500], ax=ax, stack=True);

Yellowbrick - Text Data Visualizations



Sunny Solanki