Yellowbrick is a Python library built on top of scikit-learn and matplotlib to visualize various machine learning metrics. It provides an API to visualize metrics related to classification, regression, text data analysis, clustering, feature relations, and more. We have already created tutorials explaining the usage of yellowbrick for classification and regression metrics. This tutorial will concentrate specifically on the text data analysis visualizations available with yellowbrick.
Below is a list of visualizations available with yellowbrick to visualize text data and better understand it:

- Token Frequency Distribution
- t-SNE Corpus Visualization
- Dispersion Plot
- UMAP Corpus Visualization
- PosTag Visualization
We'll start by loading the necessary libraries.
import yellowbrick
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
We'll be using the SMS spam/ham collection dataset available from UCI to explain the text data analysis visualizations available with yellowbrick. We have first downloaded the dataset, unzipped it, and then loaded the data from the file as a list of messages and their labels. The dataset has the content of the messages and their labels (spam/ham).
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip
import collections
with open('SMSSpamCollection') as f:
data = [line.strip().split('\t') for line in f.readlines()]
y, text = zip(*data)
collections.Counter(y)
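To illustrate the parsing step on its own, here is a minimal sketch using a few made-up lines in the same tab-separated `<label>\t<message>` format (the sample messages below are invented for illustration):

```python
import collections

# Hypothetical sample lines mimicking the SMSSpamCollection file format.
sample_lines = [
    "ham\tOk lar... Joking wif u oni...\n",
    "spam\tFree entry in a wkly comp to win FA Cup final tkts\n",
    "ham\tI'm gonna be home soon\n",
]

# Same pattern as above: strip the newline, split label from message on the tab.
data = [line.strip().split("\t") for line in sample_lines]
y, text = zip(*data)

print(collections.Counter(y))  # label distribution
print(text[0])                 # first message body
```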
Yellowbrick provides a token frequency distribution bar chart through the FreqDistVisualizer class. We have first transformed the text data into a matrix of word counts using the CountVectorizer feature extractor from scikit-learn.
We have then created an instance of FreqDistVisualizer, giving it the list of feature names and, as the n parameter, a count of how many top words we want to display. We have then fitted the transformed text data to the visualizer object. At last, we have generated the visualization by calling the show() method on the FreqDistVisualizer object.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vec = CountVectorizer(stop_words="english")
transformed_data = vec.fit_transform(text)
from yellowbrick.text import FreqDistVisualizer
fig = plt.figure(figsize=(10,7))
ax = fig.add_subplot(111)
freq_dist_viz = FreqDistVisualizer(features=vec.get_feature_names_out(), color="tomato", n=30, ax=ax)
freq_dist_viz.fit(transformed_data)
freq_dist_viz.show();

Yellowbrick also provides easy-to-use functions if we don't want to use the class-based approach. Below we have created a token frequency distribution bar chart using the freqdist() method of the yellowbrick.text module. This time we have laid out the bar chart vertically by setting the orient parameter to "v".
from yellowbrick.text import freqdist
fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)
freqdist(vec.get_feature_names_out(), transformed_data, orient="v", color="lime", ax=ax);

Yellowbrick lets us create a t-SNE corpus clustering visualization using the TSNEVisualizer class. We have first transformed our original text data into TF-IDF features using TfidfVectorizer from sklearn. We have then created an instance of the TSNEVisualizer visualizer and fitted the transformed data to it. The TSNEVisualizer class has a few important parameters:

- decompose - The decomposition applied to the data before t-SNE. It provides two options: svd (the default) and pca.
- decompose_by - The number of components to use for the decomposition. The default is 50.

We can see from the below visualization that the majority of spam and ham messages are grouped with other messages of their own class.
from yellowbrick.text import TSNEVisualizer
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(stop_words="english")
transformed_text = vec.fit_transform(text)
fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(111)
tsne_viz = TSNEVisualizer(ax=ax,
decompose="svd",
decompose_by=50,
colors=["tomato", "lime"],
random_state=123)
tsne_viz.fit(transformed_text.toarray(), y)
tsne_viz.show();

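The decompose step can be reproduced outside yellowbrick: reduce the TF-IDF matrix with TruncatedSVD first, then feed the result to t-SNE. Below is a rough sketch of that pipeline on a tiny invented corpus, with component counts chosen only for illustration (the real visualizer defaults to 50 components):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

docs = [
    "win a free prize today", "free entry win cash",
    "lunch at noon today", "see you at dinner",
    "claim your free reward", "meeting moved to monday",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Step 1: decompose="svd" -> TruncatedSVD down to a handful of components.
reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(tfidf)

# Step 2: t-SNE projects the reduced matrix to 2-D for plotting.
embedded = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(reduced)
print(embedded.shape)
```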
Below we have created another t-SNE visualization on our dataset but this time using pca decomposition.
fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(111)
tsne_viz = TSNEVisualizer(ax=ax,
decompose="pca",
decompose_by=50,
colors=["tomato", "lime"],
random_state=123)
tsne_viz.fit(transformed_text.toarray(), y)
tsne_viz.show();

Now, we have created t-SNE visualization using the tsne() method available from yellowbrick. We have used 100 components this time for decomposition.
from yellowbrick.text.tsne import tsne
vec = TfidfVectorizer(stop_words="english")
transformed_text = vec.fit_transform(text)
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111)
tsne(transformed_text.toarray(), y, ax=ax, decompose="pca", decompose_by=100, colors=["dodgerblue", "fuchsia"]);

Yellowbrick provides the DispersionPlot class to create a dispersion plot. We have first split each individual message into a list of words. We have then listed the target words for which we want to see dispersion across the text corpus.
We have then created an instance of DispersionPlot, giving it the list of target words. We have then fitted the transformed text data to it and, at last, called the show() method to display the chart.
from yellowbrick.text import DispersionPlot
total_docs = [doc.split() for doc in text]
target_words = ["free", "download", "win", "congrats", "crazy", "customer"]
fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)
visualizer = DispersionPlot(target_words,
ignore_case=True,
color=["lime", "tomato"],
ax=ax)
visualizer.fit(total_docs, y)
visualizer.show();

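Conceptually, a dispersion plot records, for each target word, the offsets (word positions) at which it occurs across the concatenated corpus. A simplified sketch of that bookkeeping on a pair of made-up messages:

```python
target_words = {"free", "win"}

docs = [
    "win a free phone now".split(),
    "are you free for lunch".split(),
]

# Walk every token in corpus order, recording (offset, word) for each target hit.
offsets = []
pos = 0
for doc in docs:
    for word in doc:
        if word.lower() in target_words:
            offsets.append((pos, word.lower()))
        pos += 1

print(offsets)
```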
Below we have created a dispersion chart using the dispersion() method available from yellowbrick. Please make a note that this time we have taken letter case into consideration by setting ignore_case to False; hence the words Free and free will be treated as two different words rather than one.
from yellowbrick.text import dispersion
target_words = ["free", "download", "win", "congrats", "crazy", "customer"]
fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)
dispersion(target_words, total_docs, y=y, ax=ax, ignore_case=False, color=["lawngreen", "red"]);

The next chart type that we'll explain is the UMAP corpus visualization, created using UMAPVisualizer from yellowbrick (it requires the umap-learn package to be installed). We have first transformed the text data using TfidfVectorizer. We have then created an instance of UMAPVisualizer, fitted the text data to it, and generated the visualization using the show() method.
from yellowbrick.text import UMAPVisualizer
vec = TfidfVectorizer(stop_words="english")
transformed_text = vec.fit_transform(text)
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
umap = UMAPVisualizer(ax=ax, colors=["lime", "tomato"])
umap.fit(transformed_text, y)
umap.show();

Below we have explained the second way of generating the UMAP corpus visualization, using the umap() method.
from yellowbrick.text import umap
vec = TfidfVectorizer(stop_words="english")
transformed_text = vec.fit_transform(text)
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
umap(transformed_text, y, ax=ax, colors=["lime", "tomato"]);

The PosTag visualization is available through PosTagVisualizer from yellowbrick. The PosTagVisualizer expects, for each document, a list of sentences, where each sentence is a list of tuples of a word and its part-of-speech tag (noun, verb, adjective, etc.). The nltk library provides an easy way to generate part-of-speech tags for a list of texts.
We have first created an instance of PosTagVisualizer and then fitted the part-of-speech tag data to it. We have then generated the visualization using the show() method.
from yellowbrick.text import PosTagVisualizer
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
tokenized = [nltk.word_tokenize(msg) for msg in text[:100]]
pos_tags_first_sents = [[val] for val in nltk.pos_tag_sents(tokenized)]
fig = plt.figure(figsize=(12,7))
ax = fig.add_subplot(111)
viz = PosTagVisualizer(ax=ax, frequency=True)
viz.fit(pos_tags_first_sents, y[:100])
viz.show();

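The nested input format matters: PosTagVisualizer expects documents, each containing sentences, each containing (token, tag) tuples. A small sketch with hand-written Penn Treebank tags (no nltk downloads needed) showing the structure and the kind of per-tag counting the chart's bars reflect:

```python
import collections

# Hand-tagged toy documents in the shape PosTagVisualizer expects:
# documents -> sentences -> (token, Penn Treebank tag) pairs.
tagged_docs = [
    [[("Win", "VB"), ("a", "DT"), ("free", "JJ"), ("prize", "NN")]],
    [[("See", "VB"), ("you", "PRP"), ("at", "IN"), ("lunch", "NN")]],
]

# Count tag frequencies across the corpus, as the chart's bars do.
tag_counts = collections.Counter(
    tag for doc in tagged_docs for sent in doc for _, tag in sent
)
print(tag_counts)
```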
Below we have explained another way to generate the PosTag visualization; this time we have separated the part-of-speech tag distribution per class of the classification dataset by setting stack=True.
fig = plt.figure(figsize=(12,7))
ax = fig.add_subplot(111)
viz = PosTagVisualizer(ax=ax, frequency=True, stack=True)
viz.fit(pos_tags_first_sents, y[:100])
viz.show();

At last, we have explained how the above visualization can be created using the postag() method of yellowbrick.
from yellowbrick.text.postag import postag
tokenized = [nltk.word_tokenize(msg) for msg in text[:500]]
pos_tags_first_sents = [[val] for val in nltk.pos_tag_sents(tokenized)]
fig = plt.figure(figsize=(12,7))
ax = fig.add_subplot(111)
postag(pos_tags_first_sents, y[:500], ax=ax, stack=True);

This ends our small tutorial explaining the text data analysis visualizations available with yellowbrick. Please feel free to let us know your views in the comments section.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.
Sunny Solanki