Word embeddings are one of the most popular approaches for encoding text as real-valued data for machine learning tasks. With the word embeddings approach, we first break text into a list of tokens (words) and then assign a real-valued vector (embedding) of a pre-decided length to each token. Other encoding approaches like one-hot, word frequency, TF-IDF, etc., assign just a single scalar to each token, which limits the representation power of the tokens: a single scalar can only capture a single notion of a word. Assigning a real-valued vector to each token gives it far more representation power. The same word can have different meanings in different contexts and different relationships with other words, and these nuances are better captured by embeddings.
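For intuition, here is a tiny, purely illustrative sketch (the words and numbers are made up, not taken from any trained model) of the difference between a single-number encoding and an embedding vector per token.

## Purely illustrative values -- not from any trained model.
one_number_per_token = {"car": 0, "truck": 1, "banana": 2}  ## e.g., a simple index / one-hot id

embedding_per_token = {
    "car":    [0.21, -0.43, 0.77, 0.05],   ## a made-up 4-dimensional embedding
    "truck":  [0.19, -0.40, 0.71, 0.02],   ## related words can end up with similar vectors
    "banana": [-0.60, 0.33, -0.12, 0.88],  ## unrelated words point in other directions
}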
GloVe (Global Vectors) is an unsupervised learning algorithm trained on a large corpus of text to capture the meaning of words by generating word embeddings for them. These word embeddings can then be reused by other ML tasks that have smaller datasets. Using pre-trained GloVe embeddings saves us the time needed to generate and train our own embeddings, which might not even be feasible with small datasets. There are different versions of GloVe embeddings, released by Stanford researchers, with different embedding lengths. Please feel free to check the link below if you want to learn more about GloVe.
As a part of this tutorial, we have designed neural networks using the Python deep learning library Keras (TensorFlow) that use GloVe word embeddings (840B.300d) for a text classification task. We have tried different ways of using the embeddings and recorded the results for comparison. We have also calculated various ML metrics to evaluate the performance of our networks. Apart from the metrics, we have also explained the predictions made by our networks using the LIME (Local Interpretable Model-Agnostic Explanations) algorithm.
Below, we have listed the important sections of the tutorial to give an overview of the material covered.
Below, we have imported the Python deep learning library Keras (TensorFlow) and printed the version used in this tutorial.
import tensorflow
from tensorflow import keras
print("Keras Version : {}".format(keras.__version__))
In this section, we are preparing the data to be given to the neural network for training. We'll follow the steps below to use GloVe word embeddings for the text classification task.

1. Tokenize the text examples and populate a vocabulary of all unique tokens.
2. Map the tokens of each text example to their vocabulary indexes and pad/truncate to a fixed length.
3. Build an embedding matrix that contains the GloVe embedding for each token in the vocabulary.
4. Use the token indexes to retrieve embeddings from this matrix (handled by the network's Embedding layer).
So basically, we first tokenize the text examples, populate a vocabulary, and retrieve the token indexes for the tokens of each text example. We can then retrieve GloVe embeddings for these token indexes by indexing the embedding matrix.
The first three steps mentioned above will be completed in this section, whereas the fourth step will be implemented in the neural network as the Embedding layer. The GloVe embedding matrix will be set as the weight matrix of the first layer of the network (the embedding layer), and this layer will retrieve embeddings from the weight matrix based on the input token indexes.
Below, we have included an image that explains word embeddings.
In this step, we have simply downloaded the GloVe 840B.300d word embeddings from the Stanford website as a zip file and then unzipped it. This set of embeddings contains vectors of length 300 for 2.2 million tokens.
!wget https://nlp.stanford.edu/data/glove.840B.300d.zip
!unzip glove.840B.300d.zip
In this section, we are simply loading the GloVe embeddings into memory from the file. We have created a simple dictionary whose keys are tokens (words) and whose values are embeddings.
%%time
import numpy as np
glove_embeddings = {}
with open("glove.840B.300d.txt") as f:
for line in f:
try:
line = line.split()
glove_embeddings[line[0]] = np.array(line[1:], dtype=np.float32)
except:
continue
embeddings = glove_embeddings["the"]
embeddings.shape, embeddings.dtype
In this section, we have loaded the dataset that we are going to use for our text classification task. We'll be using the newsgroups dataset available from scikit-learn. The dataset has ~20k text documents for 20 different categories of news. We have loaded text documents for four categories for our purpose. The dataset is already divided into train and test sets for our convenience.
import numpy as np
from sklearn import datasets
import gc
all_categories = ['alt.atheism','comp.graphics','comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware',
                  'comp.sys.mac.hardware','comp.windows.x', 'misc.forsale','rec.autos','rec.motorcycles',
                  'rec.sport.baseball','rec.sport.hockey','sci.crypt','sci.electronics','sci.med',
                  'sci.space','soc.religion.christian','talk.politics.guns','talk.politics.mideast',
                  'talk.politics.misc','talk.religion.misc']
target_classes = ['comp.sys.ibm.pc.hardware','rec.autos','rec.sport.hockey','talk.politics.mideast']
X_train_text, Y_train = datasets.fetch_20newsgroups(subset="train", categories=target_classes, return_X_y=True)
X_test_text , Y_test = datasets.fetch_20newsgroups(subset="test", categories=target_classes, return_X_y=True)
classes = np.unique(Y_train)
mapping = dict(zip(classes, target_classes))
len(X_train_text), len(X_test_text), classes, mapping
In this section, we have implemented the first two steps of our encoding process which we had explained at the beginning.
First, we have created an instance of Tokenizer and called its fit_on_texts() method, providing it with the train and test examples. A call to this method internally populates a vocabulary of all unique tokens inside the tokenizer object.
Next, we have called texts_to_sequences() method on Tokenizer object with train and test text examples. This method will tokenize each text example into tokens and then retrieve indexes of those tokens from the vocabulary.
Now, each of our text examples has a different length and hence a different number of tokens (words). We have decided to keep a maximum of 50 tokens per text example. To do this, we have called the pad_sequences() function on the lists of token indexes. This function makes sure that each example has exactly 50 token indexes: examples with more than 50 tokens are truncated to 50, and those with fewer than 50 tokens are padded with 0s to bring them to length 50.
So we have turned each text example into a list of token indexes. We'll later use these indexes to integer-index the embedding matrix and retrieve the embeddings of the tokens.
In the cells following the code cell below, we have printed the number of tokens in the vocabulary.
Below, we have illustrated with a simple example how a text example is vectorized.
text = "Hello, How are you? Where are you planning to go?"
tokens = ['hello', ',', 'how', 'are', 'you', '?', 'where',
'are', 'you', 'planning', 'to', 'go', '?']
vocab = {
'hello': 0,
'bye': 1,
'how': 2,
'the': 3,
'welcome': 4,
'are': 5,
'you': 6,
'to': 7,
'<unk>': 8,
}
vector = [0,8,2,4,6,8,8,5,6,8,7,8,8]
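To connect this toy example with the padding step described above, below is a minimal sketch (assuming a target length of 15 instead of the 50 we actually use, purely to keep the output short) of what pad_sequences() does to such a vector.

from keras.preprocessing.sequence import pad_sequences

toy_vector = [0, 8, 2, 5, 6, 8, 8, 5, 6, 8, 7, 8, 8]  ## 13 token indexes

## Pad at the end ("post") with 0s up to length 15; longer sequences would be truncated at the end.
padded = pad_sequences([toy_vector], maxlen=15, padding="post", truncating="post", value=0.)
print(padded)  ## [[0 8 2 5 6 8 8 5 6 8 7 8 8 0 0]]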
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
max_tokens = 50 ## Hyperparameter
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_text+X_test_text)
## Vectorizing data to keep 50 words per sample.
X_train_vect = pad_sequences(tokenizer.texts_to_sequences(X_train_text), maxlen=max_tokens, padding="post", truncating="post", value=0.)
X_test_vect = pad_sequences(tokenizer.texts_to_sequences(X_test_text), maxlen=max_tokens, padding="post", truncating="post", value=0.)
print(X_train_vect[:3])
X_train_vect.shape, X_test_vect.shape
print("Vocab Size : {}".format(len(tokenizer.word_index)))
## Which word has index 13 in the vocabulary?
print(tokenizer.index_word[13])
## How many times does it appear in the first text document?
print(X_train_text[0]) ## It appears 2 times.
In this section, we have implemented the third step of our encoding process that we explained earlier. We simply loop through our vocabulary and retrieve the GloVe embedding for each token. We then stack the embeddings of all vocabulary tokens into one big matrix. The matrix has shape (vocab_len, embed_len). The embedding length in our case is 300, as mentioned earlier.
We can now retrieve a GloVe embedding from this matrix by integer-indexing it with the token's index. For example, if the index of the token 'the' in our vocabulary is 1, then we can simply index the embedding matrix as 'word_embeddings[1]' to retrieve the embedding of 'the'.
%%time
embed_len = 300
word_embeddings = np.zeros((len(tokenizer.index_word)+1, embed_len))
## For each vocabulary token, copy its GloVe vector into the matrix row at its index.
## Tokens without a GloVe entry keep a zero vector.
for idx, word in tokenizer.index_word.items():
    word_embeddings[idx] = glove_embeddings.get(word, np.zeros(embed_len))
word_embeddings[1][:10]
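As a small optional sanity check (not part of the original flow, and assuming the token 'the' is present both in the tokenizer vocabulary and in the GloVe dictionary), we can verify that a row of our embedding matrix matches the vector loaded from the GloVe file.

## Optional sanity check: the matrix row for a token should equal its GloVe vector.
token = "the"
idx = tokenizer.word_index[token]                       ## Index of the token in the vocabulary
row = word_embeddings[idx]                              ## Corresponding row of our embedding matrix
ref = glove_embeddings.get(token, np.zeros(embed_len))  ## Vector loaded from the GloVe file
print(np.allclose(row, ref))                            ## Expected to print True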
Our first approach flattens GloVe embeddings and processes them through dense layers to make predictions. We have defined a simple network of one embedding layer and 3 dense layers for our text classification task. After training the network, we have also calculated ML metrics for evaluating the performance of the network.
In this section, we have defined a network that we'll use for our text classification task. The network consists of one embedding layer and 3 dense layers.
The first layer of the network is the embedding layer. We have created it using the Embedding() constructor, providing the vocabulary length as the number of tokens. The embedding length is set to 300. We have also set the GloVe embedding matrix we created earlier as the weights of the layer. We don't want to update these embeddings, hence we have set the trainable parameter to False, which prevents the weights from being updated. This layer is responsible for mapping token indexes to their respective embeddings. The shape of the input to the embedding layer is (batch_size, max_tokens) = (batch_size, 50) and the output shape is (batch_size, max_tokens, embed_len) = (batch_size, 50, 300).
The output of the embedding layer is flattened, which transforms the data shape from (batch_size, 50, 300) to (batch_size, 50 x 300) = (batch_size, 15000).
The flattened output is given to a dense layer with 128 output units, changing the shape from (batch_size, 15000) to (batch_size, 128). A relu activation is applied to the output of this dense layer.
The output of the first dense layer is given to the second dense layer, which has 64 output units, transforming the shape from (batch_size, 128) to (batch_size, 64). It also applies relu activation to its output.
The output of the second dense layer is given to the third and last dense layer of the network, which has 4 output units (the same as the number of target classes). This transforms the shape from (batch_size, 64) to (batch_size, 4). The last layer applies softmax activation to convert its outputs to probabilities.
After defining the network, we have also printed a summary of layers' output shapes and count of parameters.
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten
model = Sequential([
    Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=embed_len,
              input_length=max_tokens, trainable=False, weights=[word_embeddings]),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(64, activation="relu"),
    Dense(len(target_classes), activation="softmax")
])
model.summary()
model.weights[0][1][:10], word_embeddings[1][:10]
Here, we have compiled our network to use Adam optimizer, cross entropy loss, and accuracy metric.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
In this section, we have trained our network for 8 epochs with a batch size of 32 using the fit() method. We have provided the test dataset as a validation dataset for validation purposes. We can notice from the loss and accuracy values printed after each epoch that our network is doing a good job at the text classification task.
model.fit(X_train_vect, Y_train, batch_size=32, epochs=8, validation_data=(X_test_vect, Y_test))
In this section, we have evaluated the performance of our network by calculating accuracy score, classification report (precision, recall, and f1-score per target class) and confusion matrix metrics on test predictions. We can notice from the accuracy score that our network has done a good job at the classification task.
Here, we have calculated ML metrics using functions available from the Python library Scikit-learn. If you are interested in learning about various ML metrics available from sklearn then please check the below link which covers the majority of them in detail.
Apart from the calculations, we have also created a visualization of the confusion matrix. This lets us better understand which target classes the network handles well and which it handles poorly. We can notice from the visualization that the accuracy is almost the same for all 4 categories.
We have created confusion matrix visualization using Python library scikit-plot. It provides visualization for many ML metrics. Please feel free to check the below link if you are interested in it.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Y_preds = model.predict(X_test_vect).argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
In this section, we have explained the prediction made by our network using LIME (Local Interpretable Model-Agnostic Explanations) algorithm. The algorithm is commonly used to explain black-box ML models like deep neural networks. Currently, the Python library lime provides an implementation of the algorithm. We'll use it to create a visualization that highlights words that contributed to predicting a particular target label on a given text example.
If you are someone who is new to the concept of LIME and want to understand it better then we would recommend that you go through the below links in your spare time.
In order to create the explanation visualization using lime, we need to create an instance of LimeTextExplainer, which we have done below. We'll use it later.
from lime import lime_text
import numpy as np
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
explainer
Below, we have first created a prediction function. The function takes a batch of text examples and returns their probabilities predicted by the network. We'll use this function later to create an explanation.
Next, we randomly selected a text example from test data and made predictions on it using our trained network. We can notice that our network correctly predicts the target label as 'rec.autos' for the selected text example. Now, we'll create an explanation visualization for this text example.
def make_predictions(X_batch_text):
    ## Vectorize the batch of raw text examples the same way as the training data.
    X = tokenizer.texts_to_sequences(X_batch_text)
    X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bring all samples to max_tokens length.
    preds = model.predict(X)
    return preds
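As a quick optional check of this helper (not part of the original flow), we can call it on a couple of test examples and confirm that it returns one probability per target class.

## Optional check: the helper returns a (n_examples, n_classes) array of probabilities.
sample_probs = make_predictions(X_test_text[:2])
print(sample_probs.shape)        ## (2, 4) -- 2 examples, 4 target classes
print(sample_probs.sum(axis=1))  ## Each row sums to ~1.0 because of the softmax output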
rng = np.random.RandomState(12)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)
print("Prediction : ", target_classes[preds.argmax()])
print("Actual : ", target_classes[Y_test[idx]])
Below, we have first created an Explanation object by calling the explain_instance() method on the LimeTextExplainer instance. We provided the method with the selected text example, the prediction function we defined earlier, and the target label. Then, we have called the show_in_notebook() method on the explanation object to create the visualization.
We can notice from the visualization that words like 'cars', 'Datsun', 'Opel', 'owners', etc are contributing to predicting the target label as 'rec.autos'. This makes sense as these are commonly used words in the auto industry.
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions,
                                         labels=Y_test[idx:idx+1], num_features=15)
explanation.show_in_notebook()
Our approach in this section has a minor change in network architecture compared to the previous one: the network takes the average of the token embeddings of each text example instead of flattening them. The majority of the code is the same as in the previous section, with the only change being the network architecture.
Below, we have defined the network that we'll use for our text classification task. The network as usual consists of one embedding layer and three dense layers. The only major difference from the previous section is that the output of the embedding layer is averaged over the token dimension. This step transforms the data shape from (batch_size, max_tokens, embed_len) = (batch_size, 50, 300) to (batch_size, embed_len) = (batch_size, 300). This output is then given to the first dense layer and onward through the network.
After defining the network, we have printed a summary of layers output shapes and layer parameter counts.
from keras.models import Model
from keras.layers import Dense, Embedding, Input
inputs = Input(shape=(max_tokens, ))
embeddings_layer = Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=embed_len,
                             input_length=max_tokens, trainable=False, weights=[word_embeddings])
dense1 = Dense(128, activation="relu")
dense2 = Dense(64, activation="relu")
dense3 = Dense(len(target_classes), activation="softmax")
x = embeddings_layer(inputs)
x = tensorflow.reduce_mean(x, axis=1) ### Averaged embeddings of tokens of each example
x = dense1(x)
x = dense2(x)
outputs = dense3(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()
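As a side note on this design choice, the same averaging step could also be expressed with Keras's built-in GlobalAveragePooling1D layer instead of calling tensorflow.reduce_mean() directly. The sketch below is an equivalent alternative (reusing the inputs and embeddings_layer defined above), not the code used in this section.

## Equivalent alternative (not used above): average the token embeddings with a Keras layer.
from keras.layers import GlobalAveragePooling1D

x_alt = embeddings_layer(inputs)           ## (batch_size, 50, 300)
x_alt = GlobalAveragePooling1D()(x_alt)    ## (batch_size, 300), same as reduce_mean over axis=1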
Here, we have compiled the network to use the Adam optimizer, cross entropy loss, and accuracy metric.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
Now, we have trained our network for 8 epochs with a batch size of 32 using fit() method. We have provided test data as validation data for validation purposes. We can notice from the loss and accuracy values that our network is doing a good job at classification tasks.
model.fit(X_train_vect, Y_train, batch_size=32, epochs=8, validation_data=(X_test_vect, Y_test))
In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report, and confusion matrix on test predictions. We can notice that the accuracy is noticeably better than with our previous approach. We have also plotted the confusion matrix, which also shows an improvement in classifying text documents into the individual categories.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Y_preds = model.predict(X_test_vect).argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
In this section, we have explained the network's predictions using the LIME algorithm. Our trained network correctly predicts the target label 'rec.autos' for the randomly selected text example from the test data. The visualization shows that words like 'cars', 'Datsun', 'Nissan', 'Opel', 'mileage', etc. contribute to predicting the target category 'rec.autos'.
from lime import lime_text
import numpy as np
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
rng = np.random.RandomState(12)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)
print("Prediction : ", target_classes[preds.argmax()])
print("Actual : ", target_classes[Y_test[idx]])
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1], num_features=15)
explanation.show_in_notebook()
Our approach in this section is the same as the previous one; the only minor difference is that we take the sum of the embeddings instead of the average. The majority of the code is the same as earlier, with the only difference being in the network architecture.
Below, we have defined the network that we'll use for our task in this section. It has the same architecture as in the previous section, the only difference being that we take the sum of the token embeddings (using reduce_sum()) on the output of the embedding layer instead of the average. The rest of the code is exactly the same.
from keras.models import Model
from keras.layers import Dense, Embedding, Input
inputs = Input(shape=(max_tokens, ))
embeddings_layer = Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=embed_len,
                             input_length=max_tokens, trainable=False, weights=[word_embeddings])
dense1 = Dense(128, activation="relu")
dense2 = Dense(64, activation="relu")
dense3 = Dense(len(target_classes), activation="softmax")
x = embeddings_layer(inputs)
x = tensorflow.reduce_sum(x, axis=1) ### Summed embeddings of tokens of each example
x = dense1(x)
x = dense2(x)
outputs = dense3(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()
Here, we have compiled a network to use Adam optimizer, cross entropy loss, and accuracy metric.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
Below, we have trained our network using the same settings that we have been using for all our approaches. The loss and accuracy values getting printed after each epoch hint that the network is doing a good job at the classification task.
model.fit(X_train_vect, Y_train, batch_size=32, epochs=8, validation_data=(X_test_vect, Y_test))
In this section, we have evaluated the performance of the network by calculating ML metrics like accuracy score, classification report and confusion matrix on test predictions. We can notice from the accuracy score that it is the highest of all our approaches. We have also plotted the confusion matrix metric for reference purposes.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Y_preds = model.predict(X_test_vect).argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
In this section, we have explained predictions made by our network using LIME algorithm. The network correctly predicts target label as 'rec.autos' for selected text example from test dataset. The visualization shows that words like 'cars', 'datsun', 'nissan', 'mileage', 'opel', 'sell', 'rust', '240Z', '260Z', '280Z', '280ZX' etc are contributing to predicting target category as 'rec.autos'.
from lime import lime_text
import numpy as np
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
rng = np.random.RandomState(12)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)
print("Prediction : ", target_classes[preds.argmax()])
print("Actual : ", target_classes[Y_test[idx]])
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1], num_features=15)
explanation.show_in_notebook()
| Approach | Max Tokens | Embedding Length | Test Accuracy (%) |
|---|---|---|---|
| GloVe 840B Embeddings Flattened | 50 | 300 | 87.33 |
| GloVe 840B Embeddings Averaged | 50 | 300 | 90.14 |
| GloVe 840B Embeddings Summed | 50 | 300 | 90.85 |
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.