Updated On : May-11,2022 Time Investment : ~30 mins

Guide to Use GloVe Embeddings with MXNet Networks (GluonNLP)¶

When working with text data for machine learning tasks, we need to encode it. Encoding is the process of mapping text data to real-valued data, because ML algorithms can only work with numbers. There are many different ways to encode text data, like word frequency, one-hot encoding, Tf-Idf, word embeddings, etc. All these approaches break the text down into a list of tokens (words) and then assign real values to these tokens. Most of them represent each token with a single value. The word embeddings approach instead assigns a real-valued vector to each token. A vector gives far more room to represent a token (word) than a single number, so it can capture several aspects of a word's meaning.

GloVe (Global Vectors) is an unsupervised learning algorithm that generates word embeddings from a corpus of text. Researchers at Stanford University have already published GloVe embeddings of different lengths, trained on different large datasets (Wikipedia, Twitter, etc.), which we can use for our NLP tasks such as text classification. Please check the below link if you want to learn more about GloVe.

GloVe: Global Vectors for Word Representation

As a part of this tutorial, we have explained how to use GloVe embeddings (a kind of word embeddings) with a neural network designed with the Python deep learning library MXNet for text classification tasks. The GloVe embeddings are available from the GluonNLP Python library, which is a helper library of MXNet for NLP tasks. We have explained different approaches to using the embeddings. After training the networks, we have evaluated their performance for comparison purposes. We have also tried to explain the predictions made by the networks using the LIME algorithm, which is commonly used for black-box models.

Below, we have listed the important sections of the tutorial to give an overview of the material covered.

Important Sections Of Tutorial¶

Prepare Data
- 1.1 Load Dataset
- 1.2 Define Tokenizer
- 1.3 Populate Vocabulary
- 1.4 Define Vectorization Function
- 1.5 Create Data Loaders
Approach 1: GloVe '840B' Flattened (Max Tokens=50, Embeddings Length=300)
- Load GloVe 840B Embeddings
- Create Embeddings Matrix of Vocab Tokens using Glove Embeddings
- Define Network
- Train Network
- Evaluate Network Performance
- Explain Predictions using LIME Algorithm
Approach 2: GloVe '840B' Averaged (Max Tokens=50, Embeddings Length=300)
Approach 3: GloVe '840B' Summed (Max Tokens=50, Embeddings Length=300)
Approach 4: GloVe '42B' Flattened (Max Tokens=50, Embeddings Length=300)
Results Summary and Further Recommendations

Below, we have imported the necessary Python libraries and printed the versions we used in our tutorial.

import mxnet

print("MXNet Version : {}".format(mxnet.__version__))

MXNet Version : 1.9.0

import gluonnlp

print("GluonNLP Version : {}".format(gluonnlp.__version__))

GluonNLP Version : 0.10.0

import torchtext

print("TorchText Version : {}".format(torchtext.__version__))

import gc

1. Prepare Data ¶

In this section, we prepare the data to be given to the neural network. We'll use the GloVe word embeddings approach to encode our text data. To use GloVe with our text data, we need to perform the steps below.

1. Loop through each text example of the dataset, tokenize it, and populate the vocabulary of all unique tokens (words). A vocabulary is a simple mapping from tokens to integer indexes. Each token is assigned a unique index starting from integer 0.
2. Tokenize each text example and retrieve the indexes of its tokens from the vocabulary.
3. Retrieve GloVe embeddings for each token of the vocabulary and create an embedding matrix.
4. Set this matrix as the weight of the embedding layer of the neural network and prevent this layer's weights from being updated so that the embeddings stay unchanged. This embedding layer is responsible for retrieving GloVe embeddings of tokens by indexing the weight matrix with the input indexes created in the 2nd step.

So basically we first map tokens to their indexes and then retrieve GloVe embeddings for these indexes by indexing the weight matrix of the embedding layer.

The first two steps are implemented in this section. The 3rd step is implemented in the next section and the 4th step will be implemented as the Embedding layer of the network.
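The four steps above can be sketched on a toy example. The snippet below is only an illustration with NumPy: the small vocabulary `toy_vocab` and the 4-dimensional vectors in `toy_glove` are made up for this sketch and are not real GloVe values.

```python
import re
import numpy as np

# Hypothetical 4-dimensional "GloVe" vectors, invented for illustration only.
toy_glove = {
    "how": np.array([0.1, 0.2, 0.3, 0.4]),
    "are": np.array([0.5, 0.6, 0.7, 0.8]),
    "you": np.array([0.9, 1.0, 1.1, 1.2]),
}

# Step 1: vocabulary mapping tokens to integer indexes (0 is used for unknown tokens).
toy_vocab = {"<unk>": 0, "how": 1, "are": 2, "you": 3}

# Step 2: tokenize a text example and look up the index of each token.
tokens = re.findall(r"\w+", "how are you today".lower())
indexes = [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in tokens]

# Step 3: embedding matrix with one row per vocabulary token (zeros if no vector exists).
embed_matrix = np.zeros((len(toy_vocab), 4))
for tok, idx in toy_vocab.items():
    if tok in toy_glove:
        embed_matrix[idx] = toy_glove[tok]

# Step 4: an embedding layer is effectively a row lookup into this matrix.
embedded = embed_matrix[indexes]

print(indexes)         # [1, 2, 3, 0]
print(embedded.shape)  # (4, 4)
```

The word "today" is not in the toy vocabulary, so it falls back to index 0 and receives a row of zeros.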

The below image gives a simple idea about word embeddings.

1.1 Load Dataset¶

In this section, we have loaded the dataset that we are going to use for our text classification task. We have loaded the AG NEWS dataset available from the torchtext Python library. The dataset has text documents for 4 different news categories (["World", "Sports", "Business", "Sci/Tech"]) and is already divided into train and test sets. We have wrapped the text examples and their respective target labels in the ArrayDataset wrapper available from mxnet.gluon.data. These dataset objects will be used later to create data loaders that loop through the data in batches during the training process.

from mxnet.gluon.data import ArrayDataset

train_dataset, test_dataset = torchtext.datasets.AG_NEWS()

Y_train, X_train = zip(*list(train_dataset))
Y_test,  X_test  = zip(*list(test_dataset))

train_dataset = ArrayDataset(X_train, Y_train)
test_dataset  = ArrayDataset(X_test, Y_test)

train.csv: 29.5MB [00:00, 109MB/s]
test.csv: 1.86MB [00:00, 39.2MB/s]

1.2 Define Tokenizer¶

In this section, we have defined a tokenizer. The tokenizer is a function that takes a text document as input and returns a list of tokens (words). We have defined it using a regular expression (\w+) that captures each run of one or more word characters (letters, digits, and underscores), so punctuation and whitespace act as separators. The tokenizer is wrapped in the partial() function available from the Python module functools. We have also shown with a simple example how the tokenizer splits a text example.

import re
from functools import partial

tokenizer = partial(lambda X: re.findall(r"\w+", X))

tokenizer("Hello, How are you?")

['Hello', 'How', 'are', 'you']
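Because the pattern only keeps word characters, apostrophes, hyphens, and decimal points split tokens as well. The short sketch below uses only the standard library to show this on a made-up sentence; `simple_tokenizer` is a name introduced here for illustration.

```python
import re

def simple_tokenizer(text):
    # \w+ matches runs of letters, digits and underscores; everything else is a separator.
    return re.findall(r"\w+", text)

print(simple_tokenizer("Don't pay $3.50 for e-mail!"))
# ['Don', 't', 'pay', '3', '50', 'for', 'e', 'mail']
```

This behaviour is usually acceptable for topic classification, but it is worth keeping in mind when inspecting which tokens end up in the vocabulary.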

1.3 Populate Vocabulary¶

In this section, we have populated a vocabulary with all unique tokens of our datasets. To create a vocabulary, we first need a dictionary that has tokens as keys and their frequencies in the datasets as values. We maintain this dictionary as a Counter object from the Python module collections. We start with an empty Counter and then loop through the text examples of each dataset, calling the count_tokens() function from gluonnlp on each one. The function takes the list of tokens of a text example and the Counter object, and updates the Counter with the tokens and their frequencies. We have set to_lower=True, so tokens are lowercased before they are counted. After looping through all text examples, the Counter holds every token. We then create the vocabulary by passing the Counter to the Vocab() constructor from gluonnlp.

After initializing the vocabulary, we have also printed the number of tokens present in the vocabulary.

from collections import Counter

counter = Counter()

for dataset in [train_dataset, test_dataset]:
    for X, Y in dataset:
        gluonnlp.data.count_tokens(tokenizer(X), to_lower=True, counter=counter)

vocab = gluonnlp.Vocab(counter=counter, min_freq=1)

print("Vocabulary Size : {}".format(len(vocab)))

Vocabulary Size : 66505
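To make the idea of a vocabulary concrete, here is a simplified pure-Python stand-in. It is not gluonnlp's implementation: the function `build_vocab`, its reserved tokens, and its ordering rule are assumptions made for this sketch, and the real Vocab class may order and reserve tokens differently.

```python
from collections import Counter

def build_vocab(token_counts, min_freq=1, specials=("<unk>", "<pad>")):
    """Map tokens to indexes: special tokens first, then tokens by descending frequency."""
    idx_to_token = list(specials)
    # Sort by frequency (highest first); ties are broken alphabetically for determinism.
    for token, freq in sorted(token_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and token not in specials:
            idx_to_token.append(token)
    token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

counts = Counter()
for text in ["the cat sat", "the dog sat", "the end"]:
    counts.update(text.lower().split())

idx_to_token, token_to_idx = build_vocab(counts, min_freq=2)
print(idx_to_token)   # ['<unk>', '<pad>', 'the', 'sat']
```

Raising `min_freq` drops rare tokens, which shrinks the vocabulary and therefore the embedding matrix built from it.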

1.4 Define Vectorization Function¶

In this section, we have defined a vectorization function that will be used by the data loaders later. It takes a batch of text examples and their target labels as input, tokenizes the text examples, and retrieves the token indexes from the vocabulary. Text documents have different numbers of tokens, but the neural network requires inputs of a consistent size. Hence, we have set the maximum number of tokens per example to 50. Examples that have more than 50 tokens are truncated to 50 tokens, and examples that have fewer than 50 tokens are padded with 0s to bring their length to 50. In the end, the function returns MXNet ndarrays holding the token indexes of the text examples and their target labels.

After defining a function, we have also explained the usage with a simple example.

In this section, we have set the embedding length to 300 because we are going to use GloVe 840B embedding which has word embedding of length 300 for each token.

import gluonnlp.data.batchify as bf
from mxnet import nd
import numpy as np

max_tokens = 50
embed_len = 300

clip_seq = gluonnlp.data.ClipSequence(max_tokens)
pad_seq  = gluonnlp.data.PadSequence(length=max_tokens, pad_val=0, clip=True)

def vectorize(batch):
    X, Y = list(zip(*batch))
    ## Map each token to its vocabulary index.
    ## NOTE: the vocabulary was counted with to_lower=True, but tokens are not lowercased here,
    ## so tokens containing uppercase letters are looked up as-is and may map to the unknown token.
    X = [[vocab(word) for word in tokenizer(text)] for text in X]
    X = [pad_seq(tokens) for tokens in X] ## Bringing all samples to 50 length
    return nd.array(X, dtype=np.int32), nd.array(Y, dtype=np.int32) - 1 # Subtracting 1 from labels to bring them in range 0-3 from 1-4

%time X, Y = vectorize([["how are you", 1]])

X.shape, Y.shape

CPU times: user 3.87 ms, sys: 0 ns, total: 3.87 ms
Wall time: 8.13 ms

((1, 50), (1,))
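The truncate-or-pad behaviour described above can be sketched in a few lines of plain Python. This is an illustrative helper (`pad_or_clip` is a name introduced here), written under the assumption that padding is appended on the right and that truncation keeps the leading tokens.

```python
def pad_or_clip(indexes, length, pad_val=0):
    """Return a list of exactly `length` items: truncate long inputs, right-pad short ones."""
    clipped = indexes[:length]
    return clipped + [pad_val] * (length - len(clipped))

print(pad_or_clip([7, 3, 9], length=5))           # [7, 3, 9, 0, 0]
print(pad_or_clip([7, 3, 9, 4, 1, 8], length=5))  # [7, 3, 9, 4, 1]
```

Every example comes out with the same length, which is what allows a batch to be stacked into a single (batch_size, 50) array.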

1.5 Create Data Loaders¶

In this section, we have created train and test data loaders from the datasets we created earlier. These data loaders will be used during the training process to loop through the data in batches. The batch size is set to 1024, which means that a single batch holds 1024 examples and their labels. We have also provided the vectorization function from the previous section to the batchify_fn parameter. This function is applied to each batch of raw examples and its output becomes the batch returned by the loader.

Below, we have explained with a simple example how a text example is vectorized.

text = "Hello, How are you? Where are you planning to go?"

tokens = ['hello', ',', 'how', 'are', 'you', '?', 'where',
            'are', 'you', 'planning', 'to', 'go', '?']

vocab = {
    'hello': 0,
    'bye': 1,
    'how': 2,
    'the': 3,
    'welcome': 4,
    'are': 5,
    'you': 6,
    'to': 7,
    '<unk>': 8,
}

vector = [0,8,2,5,6,8,8,5,6,8,7,8,8]

from mxnet.gluon.data import ArrayDataset, DataLoader

train_loader = DataLoader(train_dataset, batch_size=1024, batchify_fn=vectorize)
test_loader  = DataLoader(test_dataset,  batch_size=1024, batchify_fn=vectorize)

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

for X, Y in train_loader:
    print(X.shape, Y.shape)
    break

(1024, 50) (1024,)
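As a rough check of what batching means here, the sketch below slices a list into batches and computes how many batches a batch size of 1024 produces. It assumes the AG NEWS split sizes of 120,000 train and 7,600 test examples and that the last, smaller batch is kept; `iter_batches` is a helper introduced for illustration, not part of MXNet.

```python
import math

def iter_batches(examples, batch_size):
    """Yield consecutive slices of at most `batch_size` examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

n_train, n_test, batch_size = 120_000, 7_600, 1024  # assumed AG NEWS split sizes

print(math.ceil(n_train / batch_size))  # 118
print(math.ceil(n_test / batch_size))   # 8
print([len(b) for b in iter_batches(list(range(2500)), batch_size)])  # [1024, 1024, 452]
```

The figure of 118 train batches agrees with the 118 steps shown by the progress bars during training.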

Approach 1: GloVe '840B' Flattened (Max Tokens=50, Embeddings Length=300) ¶

Our first approach uses GloVe 840B word embeddings, which cover 2.2 million tokens with an embedding length of 300 per token. We'll retrieve embeddings for the tokens of our populated vocabulary from the 840B set and then use them for our text classification task. As these embeddings are already trained, we'll freeze the embedding layer so that its weights are not updated during training.

Below, we have simply listed different GloVe embeddings available.

gluonnlp.embedding.list_sources(embedding_name="Glove")

['glove.42B.300d',
 'glove.6B.100d',
 'glove.6B.200d',
 'glove.6B.300d',
 'glove.6B.50d',
 'glove.840B.300d',
 'glove.twitter.27B.100d',
 'glove.twitter.27B.200d',
 'glove.twitter.27B.25d',
 'glove.twitter.27B.50d']

Load GloVe 840B Embeddings¶

The GloVe embeddings are available from the gluonnlp library. We just need to call the GloVe() constructor with the embedding name ('glove.840B.300d') to create a GloVe instance. This instance can be indexed like a dictionary to retrieve the embeddings of tokens, as shown in the examples below. As the last example shows, the unknown token "<unk>" and the empty string both map to a vector of zeros.

glove_embeddings = gluonnlp.embedding.GloVe(source='glove.840B.300d')

glove_embeddings

Embedding file glove.840B.300d.npz is not found. Downloading from Gluon Repository. This may take some time.
Downloading /root/.mxnet/embedding/glove/glove.840B.300d.npz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/embeddings/glove/glove.840B.300d.npz...

<gluonnlp.embedding.token_embedding.GloVe at 0x7f0bc6a9e310>

glove_embeddings["hello"].shape, glove_embeddings["hello"][:50]

((300,),

 [ 0.25233   0.10176  -0.67485   0.21117   0.43492   0.16542   0.48261
  -0.81222   0.041321  0.78502  -0.077857 -0.66324   0.1464   -0.29289
  -0.25488   0.019293 -0.20265   0.98232   0.028312 -0.081276 -0.1214
   0.13126  -0.17648   0.13556  -0.16361  -0.22574   0.055006 -0.20308
   0.20718   0.095785  0.22481   0.21537  -0.32982  -0.12241  -0.40031
  -0.079381 -0.19958  -0.015083 -0.079139 -0.18132   0.20681  -0.36196
  -0.30744  -0.24422  -0.23113   0.09798   0.1463   -0.062738  0.42934
  -0.078038]
 <NDArray 50 @cpu(0)>)

glove_embeddings["<unk>"], glove_embeddings[""]

(
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 <NDArray 300 @cpu(0)>,

 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 <NDArray 300 @cpu(0)>)

Create Embeddings Matrix of Vocab Tokens using Glove Embeddings¶

In this section, we are simply looping through tokens of our populated vocabulary (from our datasets) and retrieving GloVe embeddings for tokens using GloVe object. We have wrapped all embeddings in a matrix of shape (vocab_len, 300) = (66505, 300). This matrix will become the weight matrix of the embedding layer.

%%time

vocab_embeddings = nd.zeros((len(vocab), embed_len), dtype=np.float32)
for i, token in enumerate(vocab.idx_to_token):
    vocab_embeddings[i] = glove_embeddings[token]

vocab_embeddings[5][:10]

CPU times: user 1min 29s, sys: 3.85 s, total: 1min 33s
Wall time: 42.8 s

[ 0.31924   0.06316  -0.27858   0.2612    0.079248 -0.21462  -0.10495
  0.15495  -0.03353   2.4834  ]
<NDArray 10 @cpu(0)>
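Since tokens without a GloVe vector end up as rows of zeros, it can be useful to check how much of the vocabulary is actually covered. The NumPy sketch below illustrates the idea on toy data: `build_embedding_matrix`, `toy_lookup`, and `toy_tokens` are names and values invented for this example and are not part of gluonnlp.

```python
import numpy as np

def build_embedding_matrix(idx_to_token, lookup, embed_len):
    """Stack one vector per vocabulary token; tokens missing from `lookup` keep a zero row."""
    matrix = np.zeros((len(idx_to_token), embed_len), dtype=np.float32)
    for i, token in enumerate(idx_to_token):
        if token in lookup:
            matrix[i] = lookup[token]
    return matrix

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_lookup = {"hello": [0.1, 0.2, 0.3], "world": [0.4, 0.5, 0.6]}
toy_tokens = ["<unk>", "hello", "world", "qwertyzz"]

matrix = build_embedding_matrix(toy_tokens, toy_lookup, embed_len=3)
covered = np.abs(matrix).sum(axis=1) > 0   # True where a row is not all zeros
print(matrix.shape)    # (4, 3)
print(covered.mean())  # 0.5
```

The same kind of check on the real matrix would show what fraction of the 66505 vocabulary tokens received a pretrained vector.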

Define Network¶

In this section, we have defined a neural network that we'll use for our text classification task. The network consists of one embedding layer and 3 dense layers.

The first layer of the network is the embedding layer, which we have created using the Embedding() constructor. We have provided the vocabulary length (the number of tokens) and the embedding length to the constructor. After creating the layer, we have set its weight to the embedding matrix that we created in the previous section. Later on, we have to make sure that this weight matrix is not updated during the training process, as it holds already-trained embeddings. The embedding layer takes token indexes as input and retrieves GloVe embeddings by indexing this weight matrix. The input shape of the layer is (batch_size, max_tokens) = (batch_size, 50) and the output shape is (batch_size, max_tokens, embed_len) = (batch_size, 50, 300).

The output of embedding layer is flattened which will transform data shape from (batch_size, 50, 300) to (batch_size, 50 x 300) = (batch_size, 15000).

The flattened data is given to a dense layer with 128 output units. This will transform data to shape (batch_size, 128) from (batch_size, 15000). The dense layer also applies relu activation on the output.

The output of the first dense layer is given to the second dense layer that has 64 output units. This will transform the shape to (batch_size, 64). It also applies relu activation to the output.

The output of the second dense layer is given to the third and last dense layer of the network, which has 4 output units (the same as the number of target classes). The output of the third dense layer is the prediction of the network, which is of shape (batch_size, 4).

After defining the network, we have initialized it and performed a forward pass to make predictions for verification purposes. We can notice a warning saying that the embedding layer weight is already initialized and won't be initialized again with new weights, which is what we want. We have also printed a summary of the network's layer output shapes and parameter counts.

We have created a network using Sequential API of Gluon module of MXNet. Please feel free to check the below link if you are new to MXNet and want to learn how to create neural networks using it.

MXNet: Simple Guide to Create Neural Networks

from mxnet.gluon import nn

class EmbeddingClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(EmbeddingClassifier, self).__init__(**kwargs)
        embed_layer = nn.Embedding(len(vocab), embed_len) ## Create Embedding Layer
        embed_layer.initialize() ## Initialize layer
        embed_layer.weight.set_data(vocab_embeddings) ## Initialize it with GloVe Embeddings

        self.seq  = nn.Sequential()
        self.seq.add(embed_layer)
        self.seq.add(nn.Flatten()) ### Flatten Embeddings
        self.seq.add(nn.Dense(128, activation="relu"))
        self.seq.add(nn.Dense(64, activation="relu"))
        self.seq.add(nn.Dense(len(target_classes)))

    def forward(self, x):
        logits = self.seq(x)
        return logits #nd.softmax(logits)

model = EmbeddingClassifier()

model

EmbeddingClassifier(
  (seq): Sequential(
    (0): Embedding(66505 -> 300, float32)
    (1): Flatten
    (2): Dense(None -> 128, Activation(relu))
    (3): Dense(None -> 64, Activation(relu))
    (4): Dense(None -> 4, linear)
  )
)

from mxnet import init, initializer

model.initialize(initializer.Xavier())

preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

preds.shape

/opt/conda/lib/python3.7/site-packages/mxnet/gluon/parameter.py:896: UserWarning: Parameter 'embedding0_weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)

(10, 4)

model.summary(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

--------------------------------------------------------------------------------
        Layer (type)                                Output Shape         Param #
================================================================================
               Input                                    (10, 50)               0
         Embedding-1                               (10, 50, 300)        19951500
           Flatten-2                                 (10, 15000)               0
        Activation-3                                   (10, 128)               0
             Dense-4                                   (10, 128)         1920128
        Activation-5                                    (10, 64)               0
             Dense-6                                    (10, 64)            8256
             Dense-7                                     (10, 4)             260
EmbeddingClassifier-8                                     (10, 4)               0
================================================================================
Parameters in forward computation graph, duplicate included
   Total params: 21880144
   Trainable params: 21880144
   Non-trainable params: 0
Shared params in forward computation graph: 0
Unique parameters in model: 21880144
--------------------------------------------------------------------------------
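The parameter counts in the summary can be reproduced with simple arithmetic from the layer sizes given above. The snippet below is a small sanity check; `dense_params` is a helper introduced here for illustration.

```python
vocab_size, embed_len, max_tokens = 66505, 300, 50

def dense_params(n_in, n_out):
    # One weight per input/output pair plus one bias per output unit.
    return n_in * n_out + n_out

embedding = vocab_size * embed_len                    # 19,951,500
dense1 = dense_params(max_tokens * embed_len, 128)    # 1,920,128
dense2 = dense_params(128, 64)                        # 8,256
dense3 = dense_params(64, 4)                          # 260

total = embedding + dense1 + dense2 + dense3
print(total)               # 21880144
print(total - embedding)   # 1928644
```

The embedding matrix accounts for most of the parameters. The summary lists all of them as trainable; the embedding weights stay fixed later only because they are left out of the parameters handed to the Trainer.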

Train Network¶

In this section, we have trained the network we defined in the previous section. We have defined a training function that takes the Trainer object (which holds the network parameters to update), the train data loader, the validation data loader, and the number of epochs as input. The network and the loss function are read from the global variables model and loss_func. The function runs the training loop for the given number of epochs. For each epoch, it loops through the training data in batches. For each batch, it performs a forward pass to make predictions, calculates the loss, calculates the gradients, and updates the network parameters using the gradients. It records the loss of each batch and prints the average training loss at the end of each epoch. We have also created helper functions to calculate the validation loss and accuracy, which are printed at the end of each epoch.

from mxnet import autograd
from tqdm import tqdm
from sklearn.metrics import accuracy_score

def MakePredictions(model, val_loader):
    Y_actuals, Y_preds = [], []
    for X_batch, Y_batch in val_loader:
        preds = model(X_batch)
        preds = nd.softmax(preds)
        Y_actuals.append(Y_batch)
        Y_preds.append(preds.argmax(axis=-1))

    Y_actuals, Y_preds = nd.concatenate(Y_actuals), nd.concatenate(Y_preds)
    return Y_actuals, Y_preds

def CalcValLoss(model, val_loader):
    losses = []
    for X_batch, Y_batch in val_loader:
        val_loss = loss_func(model(X_batch), Y_batch)
        val_loss = val_loss.mean().asscalar()
        losses.append(val_loss)
    print("Valid CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))

def TrainModelInBatches(trainer, train_loader, val_loader, epochs):
    for i in range(1, epochs+1):
        losses = [] ## Record loss of each batch
        for X_batch, Y_batch in tqdm(train_loader):
            with autograd.record():
                preds = model(X_batch) ## Forward pass to make predictions
                train_loss = loss_func(preds.squeeze(), Y_batch) ## Calculate Loss
            train_loss.backward() ## Calculate Gradients

            train_loss = train_loss.mean().asscalar()
            losses.append(train_loss)

            trainer.step(len(X_batch)) ## Update weights

        #if i%5==0:
        print("Train CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))
        CalcValLoss(model, val_loader)
        Y_actuals, Y_preds = MakePredictions(model, val_loader)
        print("Valid Accuracy : {:.3f}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))

Here, we are actually training our network using the training routine. We have set the number of epochs to 8 and the learning rate to 0.001. Then, we have initialized our text classification network, cross entropy loss, Adam optimizer, and Trainer object. The Trainer object receives the network parameters (collected by calling the collect_params() method on the network) that will be updated during the training process.

Please NOTE that we have passed a regular expression ('dense') to the collect_params() method. It makes the method collect only the parameters of the Dense layers. The parameters of the Embedding layer are left out and hence are not updated, which is what we want because the GloVe embeddings are already trained.

If you are interested in learning about how we can provide regular expressions for different situations to collect_params() method so that it collects and updates parameters (weights/biases) of specific layers of the network only then please check the below link. In our case, we only wanted to update the parameters of dense layers.

collect_params()
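The effect of the pattern can be illustrated with the standard library alone. The sketch below assumes that the pattern is applied to parameter names with match-at-the-start semantics, as `re.match` does. The parameter names are also assumed: only a name of the form 'embedding0_weight' appears in the warning printed earlier, and the dense names are hypothetical. `select_params` is a helper introduced for illustration, not an MXNet function.

```python
import re

def select_params(param_names, pattern):
    """Keep the names that the pattern matches at the start of the string."""
    regex = re.compile(pattern)
    return [name for name in param_names if regex.match(name)]

# Assumed parameter names, for illustration only.
names = ["embedding0_weight", "dense0_weight", "dense0_bias",
         "dense1_weight", "dense1_bias", "dense2_weight", "dense2_bias"]

print(select_params(names, "dense"))     # the six dense parameters, no embedding weight
print(select_params(names, ".*weight"))  # every weight, including the embedding weight
```

Under these assumptions, the pattern 'dense' leaves the embedding weight out of the Trainer, which is why it is never updated.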

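The loss and accuracy values printed after each epoch show that the network learns the task quickly, reaching roughly 84% validation accuracy within the first two epochs. After that, the training loss keeps falling while the validation loss rises from 0.448 to 0.881 and the validation accuracy slips to 81.2%, which indicates that the network starts overfitting the training data.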

from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=8
learning_rate = 0.001

model = EmbeddingClassifier()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params('dense'), optimizer) ## Training only parameters of dense layers. Ignoring embeding layer.

TrainModelInBatches(trainer, train_loader, test_loader, epochs)

/opt/conda/lib/python3.7/site-packages/mxnet/gluon/parameter.py:896: UserWarning: Parameter 'embedding1_weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
100%|██████████| 118/118 [00:28<00:00,  4.16it/s]

Train CrossEntropyLoss : 0.580
Valid CrossEntropyLoss : 0.457
Valid Accuracy : 0.840

100%|██████████| 118/118 [00:28<00:00,  4.18it/s]

Train CrossEntropyLoss : 0.387
Valid CrossEntropyLoss : 0.448
Valid Accuracy : 0.845

100%|██████████| 118/118 [00:27<00:00,  4.31it/s]

Train CrossEntropyLoss : 0.318
Valid CrossEntropyLoss : 0.472
Valid Accuracy : 0.842

100%|██████████| 118/118 [00:36<00:00,  3.24it/s]

Train CrossEntropyLoss : 0.249
Valid CrossEntropyLoss : 0.534
Valid Accuracy : 0.832

100%|██████████| 118/118 [00:28<00:00,  4.14it/s]

Train CrossEntropyLoss : 0.190
Valid CrossEntropyLoss : 0.620
Valid Accuracy : 0.828

100%|██████████| 118/118 [00:27<00:00,  4.25it/s]

Train CrossEntropyLoss : 0.155
Valid CrossEntropyLoss : 0.732
Valid Accuracy : 0.822

100%|██████████| 118/118 [00:28<00:00,  4.15it/s]

Train CrossEntropyLoss : 0.126
Valid CrossEntropyLoss : 0.809
Valid Accuracy : 0.818

100%|██████████| 118/118 [00:28<00:00,  4.09it/s]

Train CrossEntropyLoss : 0.115
Valid CrossEntropyLoss : 0.881
Valid Accuracy : 0.812
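The trend in the log can be made explicit with a few lines of plain Python. The values below are copied from the output above. The `early_stop_epoch` helper and its `patience` parameter are a hypothetical illustration of a simple stopping rule and are not part of the training function used in this tutorial.

```python
val_acc  = [0.840, 0.845, 0.842, 0.832, 0.828, 0.822, 0.818, 0.812]
val_loss = [0.457, 0.448, 0.472, 0.534, 0.620, 0.732, 0.809, 0.881]

# 1-based epoch with the highest validation accuracy / lowest validation loss.
best_acc_epoch  = max(range(len(val_acc)),  key=val_acc.__getitem__) + 1
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_acc_epoch, best_loss_epoch)   # 2 2

def early_stop_epoch(losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if never triggered."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

print(early_stop_epoch(val_loss, patience=2))   # 4
```

Both metrics peak at epoch 2, so with this data a rule of this kind would have stopped training well before epoch 8.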

Evaluate Network Performance¶

In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report (precision, recall, and f1-score), and confusion matrix on test predictions. The test accuracy of about 81% shows that the model handles the classification task reasonably well. We have calculated the ML metrics using functions available from scikit-learn.

If you are interested in learning about various ML metrics available from sklearn for ML tasks then please check the below link. It covers the majority of them in detail.

Scikit-Learn: Model Evaluation and Scoring Metrics

Apart from the calculations, we have also created a visualization of the confusion matrix using the Python library scikit-plot. We can notice from the confusion matrix that our network is better at classifying Sports and World documents (recall of 0.94 and 0.89) than Business and Sci/Tech documents (recall of 0.77 and 0.66).

The scikit-plot library is built on top of the Python library Matplotlib and provides visualizations for many ML metrics. Please feel free to check the below link if you are interested in learning about it in depth.

Scikit-Plot: Visualizing Machine Learning Algorithm Results & Performance Metrics

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))

Test Accuracy : 0.8117105263157894
Classification Report :
              precision    recall  f1-score   support

       World       0.75      0.89      0.81      1900
      Sports       0.88      0.94      0.91      1900
    Business       0.76      0.77      0.77      1900
    Sci/Tech       0.88      0.66      0.75      1900

    accuracy                           0.81      7600
   macro avg       0.82      0.81      0.81      7600
weighted avg       0.82      0.81      0.81      7600


Confusion Matrix :
[[1683   92   91   34]
 [  89 1778   24    9]
 [ 246   63 1463  128]
 [ 229   90  336 1245]]
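The numbers in the classification report can be derived directly from the confusion matrix, where rows are actual classes and columns are predicted classes. The plain-Python sketch below recomputes accuracy, precision, and recall from the matrix printed above.

```python
cm = [
    [1683,   92,   91,   34],
    [  89, 1778,   24,    9],
    [ 246,   63, 1463,  128],
    [ 229,   90,  336, 1245],
]
classes = ["World", "Sports", "Business", "Sci/Tech"]
n = len(cm)

# Accuracy: correctly classified examples (the diagonal) over all examples.
accuracy = sum(cm[i][i] for i in range(n)) / sum(sum(row) for row in cm)
print("accuracy", round(accuracy, 4))   # accuracy 0.8117

precisions, recalls = [], []
for i, name in enumerate(classes):
    recall = cm[i][i] / sum(cm[i])                          # row total = actual class count
    precision = cm[i][i] / sum(cm[r][i] for r in range(n))  # column total = predicted count
    precisions.append(round(precision, 2))
    recalls.append(round(recall, 2))

print(precisions)   # [0.75, 0.88, 0.76, 0.88]
print(recalls)      # [0.89, 0.94, 0.77, 0.66]
```

The low Sci/Tech recall comes mostly from the 336 Sci/Tech documents predicted as Business and the 229 predicted as World.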

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Predictions using LIME Algorithm¶

In this section, we have explained predictions made by the network using the LIME algorithm. It is a commonly used algorithm for explaining predictions of black-box ML models like neural networks. The Python library lime has an implementation of the algorithm. It lets us create a visualization highlighting the words of a text that contributed to predicting a particular target label.

If you are someone who is new to the concept of LIME and want to learn about it then we recommend that you go through the below links in your free time. It'll help you understand the concept better.

Below, we have first simply retrieved test text examples and target labels.

X_test, Y_test = [], []
for X, Y in test_dataset:
    X_test.append(X)
    Y_test.append(Y-1)

Below, we have first created an instance of LimeTextExplainer which will be used to create an explanation object explaining predictions.

Then, we have created a prediction function. This function takes a batch of text examples as input and returns their prediction probabilities generated by the network. It tokenizes text examples, retrieves their indexes, and then gives them to the network to make predictions. The softmax function is applied to the output of the network to turn the output into probabilities.

After defining a function, we randomly selected a text example from the test dataset and made predictions on it. Our network correctly predicts the target label as Sci/Tech for the selected text example. Now, we'll explain prediction by creating an explanation object.

from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

def make_predictions(X_batch_text):
    X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_batch_text]
    X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
    logits = model(nd.array(X_batch, dtype=np.int32))
    preds = nd.softmax(logits)
    return preds.asnumpy()

rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Actual :     ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])

Actual :      Sci/Tech
Prediction :  Sci/Tech
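
As a small illustration of the softmax step used in the prediction function above, the sketch below converts one row of raw network outputs into probabilities. The logit values are made up for illustration and are not outputs of the tutorial's network.

```python
import math

def softmax(logits):
    """Turn one row of raw scores into probabilities that sum to 1."""
    # Subtracting the largest score keeps exp() from overflowing
    # and does not change the resulting probabilities.
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the four classes (World, Sports, Business, Sci/Tech).
example_logits = [0.5, -1.2, 1.0, 3.1]
probs = softmax(example_logits)
predicted_index = probs.index(max(probs))
print([round(p, 3) for p in probs], predicted_index)
```

The class with the largest logit also gets the largest probability, so taking argmax of the logits or of the probabilities gives the same predicted label.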

Below, we have first called the explain_instance() method on the LimeTextExplainer instance to create an Explanation object. We have provided the selected text example, the prediction function, and the target label to the method. Then, we have called the show_in_notebook() method on the explanation instance to create the visualization. The visualization shows that words like 'software', 'wireless', 'technology', 'devices', 'remote', 'conference', etc. contribute to predicting the target label as Sci/Tech. This makes sense, as these words are commonly used in the technology world.

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1], num_features=15)
explanation.show_in_notebook()
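
To give a rough feel for how perturbation-based explanations assign importance to words, here is a minimal sketch. It is not LIME's algorithm: LIME samples many variants of the text with random words removed and fits a weighted linear model to the predictions, whereas this sketch simply removes one word at a time and records how much the prediction changes. The classifier toy_classifier and the sample sentence are hypothetical stand-ins.

```python
def toy_classifier(texts):
    """Hypothetical stand-in for make_predictions: returns a Sci/Tech score per text."""
    tech_words = {"software", "wireless", "technology"}
    scores = []
    for text in texts:
        hits = sum(1 for w in text.lower().split() if w in tech_words)
        scores.append(hits / (hits + 1.0))  # more tech words -> higher score
    return scores

def leave_one_out_importance(text, classifier_fn):
    """Score each word by how much the prediction drops when it is removed."""
    words = text.split()
    base = classifier_fn([text])[0]
    variants = [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]
    scores = classifier_fn(variants)
    return [(w, base - s) for w, s in zip(words, scores)]

sample = "new wireless software helps remote teams"
importance = leave_one_out_importance(sample, toy_classifier)
for word, delta in sorted(importance, key=lambda t: -t[1]):
    print(f"{word:10s} {delta:+.3f}")
```

Words whose removal lowers the score the most are the ones the toy classifier relies on, which is the same kind of signal the LIME visualization conveys.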

Approach 2: GloVe '840B' Averaged (Max Tokens=50, Embeddings Length=300) ¶

Our approach in this section makes a minor change in the way we handle GloVe embeddings. Here, we average the embeddings that come from the embedding layer instead of flattening them as in the previous section. This is the only change in the architecture of the network; the rest of the code is the same as in our previous section.

Define Network¶

Below, we have defined the network that we'll use for our text classification task in this section. The network has the same embedding and dense layers defined in the __init__() method as in our previous section. The only difference is in the forward() method, where we call mean() with axis=1 on the output of the embedding layer to average the embeddings over the tokens of each example before giving the result to the dense layers. This changes the shape of the data from (batch_size, max_tokens, embed_len) = (batch_size, 50, 300) to (batch_size, embed_len) = (batch_size, 300). The dense layers are the same as in our previous approach.

As usual, after defining the network, we have initialized it and performed a forward pass to make sure that it works as expected. We have also printed a summary of the layers' output shapes and parameter counts.

from mxnet.gluon import nn

class EmbeddingClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(EmbeddingClassifier, self).__init__(**kwargs)
        self.word_embeddings = nn.Embedding(len(vocab), embed_len) ## Create Embedding Layer
        self.word_embeddings.initialize()
        self.word_embeddings.weight.set_data(vocab_embeddings) ## Initialize it with GloVe Embeddings

        self.dense1 = nn.Dense(128, activation="relu")
        self.dense2 = nn.Dense(64, activation="relu")
        self.dense3 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.word_embeddings(x)

        x = x.mean(axis=1)  ## Average Embeddings of All tokens of Single Text Examples

        x = self.dense1(x)
        x = self.dense2(x)
        logits = self.dense3(x)

        return logits #nd.softmax(logits)

model = EmbeddingClassifier()

model

EmbeddingClassifier(
  (word_embeddings): Embedding(66505 -> 300, float32)
  (dense1): Dense(None -> 128, Activation(relu))
  (dense2): Dense(None -> 64, Activation(relu))
  (dense3): Dense(None -> 4, linear)
)

from mxnet import init, initializer

model.initialize(initializer.Xavier())

preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

preds.shape

/opt/conda/lib/python3.7/site-packages/mxnet/gluon/parameter.py:896: UserWarning: Parameter 'embedding2_weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)

(10, 4)

model.summary(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

--------------------------------------------------------------------------------
        Layer (type)                                Output Shape         Param #
================================================================================
               Input                                    (10, 50)               0
         Embedding-1                               (10, 50, 300)        19951500
        Activation-2                                   (10, 128)               0
             Dense-3                                   (10, 128)           38528
        Activation-4                                    (10, 64)               0
             Dense-5                                    (10, 64)            8256
             Dense-6                                     (10, 4)             260
EmbeddingClassifier-7                                     (10, 4)               0
================================================================================
Parameters in forward computation graph, duplicate included
   Total params: 19998544
   Trainable params: 19998544
   Non-trainable params: 0
Shared params in forward computation graph: 0
Unique parameters in model: 19998544
--------------------------------------------------------------------------------
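
The shape change caused by averaging over the token axis can be sketched with plain Python lists. This is only an illustration with tiny made-up sizes, (batch_size, max_tokens, embed_len) = (2, 3, 4), standing in for (batch_size, 50, 300); the function name mean_pool is hypothetical.

```python
def mean_pool(batch):
    """Average token vectors per example: (batch, tokens, dim) -> (batch, dim)."""
    pooled = []
    for example in batch:
        n_tokens = len(example)
        # zip(*example) groups together the values of each embedding dimension.
        pooled.append([sum(dim_values) / n_tokens for dim_values in zip(*example)])
    return pooled

# Tiny stand-in batch: 2 examples, 3 tokens each, embedding length 4.
batch = [
    [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.0, 0.0, 0.0, 0.0], [6.0, 3.0, 0.0, 9.0], [0.0, 0.0, 0.0, 0.0]],
]
pooled = mean_pool(batch)
print(len(pooled), len(pooled[0]))
print(pooled)
```

Every token position contributes to the average, so positions filled by padding are averaged in as well, just as x.mean(axis=1) averages over all 50 positions in the network above.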

Train Network¶

Below, we have trained our network using exactly the same settings that we had used for our previous approach. We can notice from the loss and accuracy values getting printed that the network is doing a good job at the classification task.

from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=8
learning_rate = 0.001

model = EmbeddingClassifier()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params('dense'), optimizer) ## Training only parameters of dense layers. Ignoring embedding layer.

TrainModelInBatches(trainer, train_loader, test_loader, epochs)

/opt/conda/lib/python3.7/site-packages/mxnet/gluon/parameter.py:896: UserWarning: Parameter 'embedding3_weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
100%|██████████| 118/118 [00:41<00:00,  2.86it/s]

Train CrossEntropyLoss : 0.848
Valid CrossEntropyLoss : 0.495
Valid Accuracy : 0.828

100%|██████████| 118/118 [00:38<00:00,  3.04it/s]

Train CrossEntropyLoss : 0.462
Valid CrossEntropyLoss : 0.450
Valid Accuracy : 0.847

100%|██████████| 118/118 [00:39<00:00,  2.99it/s]

Train CrossEntropyLoss : 0.434
Valid CrossEntropyLoss : 0.435
Valid Accuracy : 0.849

100%|██████████| 118/118 [00:39<00:00,  3.01it/s]

Train CrossEntropyLoss : 0.422
Valid CrossEntropyLoss : 0.427
Valid Accuracy : 0.852

100%|██████████| 118/118 [00:40<00:00,  2.93it/s]

Train CrossEntropyLoss : 0.414
Valid CrossEntropyLoss : 0.422
Valid Accuracy : 0.854

100%|██████████| 118/118 [00:42<00:00,  2.80it/s]

Train CrossEntropyLoss : 0.407
Valid CrossEntropyLoss : 0.418
Valid Accuracy : 0.855

100%|██████████| 118/118 [00:40<00:00,  2.90it/s]

Train CrossEntropyLoss : 0.402
Valid CrossEntropyLoss : 0.414
Valid Accuracy : 0.856

100%|██████████| 118/118 [00:40<00:00,  2.93it/s]

Train CrossEntropyLoss : 0.397
Valid CrossEntropyLoss : 0.411
Valid Accuracy : 0.856
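
The trainer above is given only the parameters selected by collect_params('dense'). The sketch below illustrates the idea of such a selection, assuming the selector is treated as a regular expression matched against parameter names. The name 'embedding3_weight' is taken from the warning printed above; the dense parameter names and the helper select_params are hypothetical.

```python
import re

# Parameter names: the embedding name comes from the warning above,
# the dense names are assumptions made for this illustration.
param_names = [
    "embedding3_weight",
    "dense0_weight", "dense0_bias",
    "dense1_weight", "dense1_bias",
    "dense2_weight", "dense2_bias",
]

def select_params(names, pattern):
    """Keep only the names whose beginning matches the regular expression."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]

trainable = select_params(param_names, "dense")
frozen = [name for name in param_names if name not in trainable]
print("updated by trainer:", trainable)
print("left untouched    :", frozen)
```

Under this assumption, the embedding weights are never handed to the optimizer, so the GloVe vectors stay as they were loaded while the dense layers are trained.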

Evaluate Network Performance¶

In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report, and confusion matrix on test predictions. We can notice from the accuracy score that there is a clear improvement over the previous approach: test accuracy rises from about 81.2% to about 85.6%. We have also plotted the confusion matrix for reference purposes, which also indicates the improvement in performance.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))

Test Accuracy : 0.8557894736842105
Classification Report :
              precision    recall  f1-score   support

       World       0.86      0.86      0.86      1900
      Sports       0.91      0.95      0.93      1900
    Business       0.78      0.84      0.81      1900
    Sci/Tech       0.87      0.78      0.82      1900

    accuracy                           0.86      7600
   macro avg       0.86      0.86      0.86      7600
weighted avg       0.86      0.86      0.86      7600


Confusion Matrix :
[[1627   87  139   47]
 [  54 1804   28   14]
 [ 110   34 1596  160]
 [  98   52  273 1477]]

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
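
The numbers in the classification report can be recomputed from the confusion matrix printed above. The sketch below does this with plain Python, assuming the usual scikit-learn layout where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix printed above (rows = actual class, columns = predicted class).
class_names = ["World", "Sports", "Business", "Sci/Tech"]
conf_mat = [
    [1627, 87, 139, 47],
    [54, 1804, 28, 14],
    [110, 34, 1596, 160],
    [98, 52, 273, 1477],
]
n_classes = len(conf_mat)

total = sum(sum(row) for row in conf_mat)
correct = sum(conf_mat[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall divides the diagonal entry by its row total, precision by its column total.
recall = [conf_mat[i][i] / sum(conf_mat[i]) for i in range(n_classes)]
precision = [
    conf_mat[i][i] / sum(row[i] for row in conf_mat) for i in range(n_classes)
]

print(f"accuracy: {accuracy:.4f}")
for name, p, r in zip(class_names, precision, recall):
    print(f"{name:9s} precision={p:.2f} recall={r:.2f}")
```

Reading the matrix this way also shows where the errors are: the largest off-diagonal entry is 273 Sci/Tech examples predicted as Business.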

Explain Network Predictions using LIME Algorithm¶

In this section, we have explained predictions made by the network using LIME algorithm. Our network correctly predicts the target label as Sci/Tech for randomly selected text example from the test dataset. The visualization shows that words like 'software', 'device', 'wireless', 'technology', 'remote', 'conference', 'host', etc are contributing to predicting target label as Sci/Tech.

from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Actual :     ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1], num_features=15)
explanation.show_in_notebook()

Approach 3: GloVe '840B' Summed (Max Tokens=50, Embeddings Length=300) ¶

Our approach in this section uses the same architecture as our previous approach with a minor change. In this section, we have summed the embeddings of tokens instead of averaging them. The majority of the code is the same as in the previous section with the only difference in network architecture.

Define Network¶

Below, we have defined the network that we'll use for our text classification task in this section. The network has exactly the same architecture as in our previous approach, with the only change being that we call sum() instead of mean() on the output of the embedding layer. All other layers are laid out exactly the same as earlier.

After defining the network, we initialized it and performed a forward pass to make predictions for verification purposes. We have also printed a summary of the layers' output shapes and parameter counts.

from mxnet.gluon import nn

class EmbeddingClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(EmbeddingClassifier, self).__init__(**kwargs)
        self.word_embeddings = nn.Embedding(len(vocab), embed_len) ## Create Embedding Layer
        self.word_embeddings.initialize()
        self.word_embeddings.weight.set_data(vocab_embeddings) ## Initialize it with GloVe Embeddings

        self.dense1 = nn.Dense(128, activation="relu")
        self.dense2 = nn.Dense(64, activation="relu")
        self.dense3 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.word_embeddings(x)

        x = x.sum(axis=1) ## Sum Embeddings of All tokens of Single Text Examples

        x = self.dense1(x)
        x = self.dense2(x)
        logits = self.dense3(x)

        return logits #nd.softmax(logits)

model = EmbeddingClassifier()

model

EmbeddingClassifier(
  (word_embeddings): Embedding(66505 -> 300, float32)
  (dense1): Dense(None -> 128, Activation(relu))
  (dense2): Dense(None -> 64, Activation(relu))
  (dense3): Dense(None -> 4, linear)
)

from mxnet import init, initializer

model.initialize(initializer.Xavier())

preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

preds.shape

/opt/conda/lib/python3.7/site-packages/mxnet/gluon/parameter.py:896: UserWarning: Parameter 'embedding4_weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)

(10, 4)

model.summary(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

--------------------------------------------------------------------------------
        Layer (type)                                Output Shape         Param #
================================================================================
               Input                                    (10, 50)               0
         Embedding-1                               (10, 50, 300)        19951500
        Activation-2                                   (10, 128)               0
             Dense-3                                   (10, 128)           38528
        Activation-4                                    (10, 64)               0
             Dense-5                                    (10, 64)            8256
             Dense-6                                     (10, 4)             260
EmbeddingClassifier-7                                     (10, 4)               0
================================================================================
Parameters in forward computation graph, duplicate included
   Total params: 19998544
   Trainable params: 19998544
   Non-trainable params: 0
Shared params in forward computation graph: 0
Unique parameters in model: 19998544
--------------------------------------------------------------------------------
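
Because every example is padded or truncated to the same number of tokens, summing the token embeddings gives the same vector as averaging them, multiplied by the number of tokens. The sketch below checks this on a tiny made-up example; the function name pool and the values are hypothetical.

```python
def pool(example, mode):
    """Pool the token vectors of one example with 'sum' or 'mean'."""
    totals = [sum(dim_values) for dim_values in zip(*example)]
    if mode == "sum":
        return totals
    return [t / len(example) for t in totals]

# Tiny stand-in for one example: 4 token vectors of length 3.
example = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0], [0.0, 0.0, 2.0], [4.0, 4.0, 4.0]]
summed = pool(example, "sum")
averaged = pool(example, "mean")
rescaled = [v * len(example) for v in averaged]
print(summed, averaged, rescaled)
```

So with 50 tokens per example, the summed approach feeds the first dense layer inputs that are 50 times larger than in the averaged approach. The direction of each vector is the same; only the scale differs, which can still affect how training proceeds.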

Train Network¶

In this section, we have trained our network using exactly the same settings that we have been using for all our approaches. We can notice from the loss and accuracy values getting printed after each epoch that our network is doing a good job.

from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=8
learning_rate = 0.001

model = EmbeddingClassifier()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params('dense'), optimizer) ## Training only parameters of dense layers. Ignoring embedding layer.

TrainModelInBatches(trainer, train_loader, test_loader, epochs)

/opt/conda/lib/python3.7/site-packages/mxnet/gluon/parameter.py:896: UserWarning: Parameter 'embedding5_weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
100%|██████████| 118/118 [00:39<00:00,  2.99it/s]

Train CrossEntropyLoss : 0.562
Valid CrossEntropyLoss : 0.447
Valid Accuracy : 0.846

100%|██████████| 118/118 [00:43<00:00,  2.72it/s]

Train CrossEntropyLoss : 0.426
Valid CrossEntropyLoss : 0.424
Valid Accuracy : 0.854

100%|██████████| 118/118 [00:41<00:00,  2.87it/s]

Train CrossEntropyLoss : 0.405
Valid CrossEntropyLoss : 0.414
Valid Accuracy : 0.856

100%|██████████| 118/118 [00:40<00:00,  2.91it/s]

Train CrossEntropyLoss : 0.391
Valid CrossEntropyLoss : 0.407
Valid Accuracy : 0.859

100%|██████████| 118/118 [00:40<00:00,  2.92it/s]

Train CrossEntropyLoss : 0.381
Valid CrossEntropyLoss : 0.402
Valid Accuracy : 0.861

100%|██████████| 118/118 [00:39<00:00,  2.97it/s]

Train CrossEntropyLoss : 0.371
Valid CrossEntropyLoss : 0.396
Valid Accuracy : 0.863

100%|██████████| 118/118 [00:39<00:00,  2.96it/s]

Train CrossEntropyLoss : 0.363
Valid CrossEntropyLoss : 0.391
Valid Accuracy : 0.864

100%|██████████| 118/118 [00:39<00:00,  3.00it/s]

Train CrossEntropyLoss : 0.354
Valid CrossEntropyLoss : 0.388
Valid Accuracy : 0.866

Evaluate Network Performance¶

In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report, and confusion matrix on test predictions. We can notice from the accuracy score that it is a little better than our previous approach (about 86.6% versus about 85.6%). We have also plotted the confusion matrix for reference purposes. It shows that World, Sports, and Sci/Tech have more correct predictions now, while Business has slightly fewer (1583 versus 1596).

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))

Test Accuracy : 0.8660526315789474
Classification Report :
              precision    recall  f1-score   support

       World       0.87      0.86      0.86      1900
      Sports       0.92      0.96      0.94      1900
    Business       0.82      0.83      0.82      1900
    Sci/Tech       0.86      0.82      0.84      1900

    accuracy                           0.87      7600
   macro avg       0.87      0.87      0.87      7600
weighted avg       0.87      0.87      0.87      7600


Confusion Matrix :
[[1630   87  121   62]
 [  49 1817   16   18]
 [ 112   35 1583  170]
 [  82   45  221 1552]]

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
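
To compare the summed approach with the averaged one class by class, we can subtract the diagonals of the two confusion matrices printed in this tutorial. The sketch below only rearranges those printed numbers.

```python
class_names = ["World", "Sports", "Business", "Sci/Tech"]

# Correctly classified counts (diagonals of the two printed confusion matrices).
averaged_correct = [1627, 1804, 1596, 1477]
summed_correct = [1630, 1817, 1583, 1552]

changes = {
    name: s - a
    for name, a, s in zip(class_names, averaged_correct, summed_correct)
}
for name, delta in changes.items():
    print(f"{name:9s} {delta:+d}")
print("net change:", sum(changes.values()))
```

Most of the gain comes from the Sci/Tech class, while Business loses a few correct predictions.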

Explain Network Predictions using LIME Algorithm¶

In this section, we have explained the predictions using LIME algorithm. Our network correctly predicts the target label as Sci/Tech for the selected text example from the test dataset. The visualization highlights that words like 'device', 'software', 'wireless', 'technology', 'disparate', 'remote', 'host', 'conference', etc are contributing to predicting target label as Sci/Tech.

from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Actual :     ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1], num_features=15)
explanation.show_in_notebook()

Approach 4: GloVe '42B' Flattened (Max Tokens=50, Embeddings Length=300) ¶

Our approach in this section is the same as our first approach, with the only difference being that we use GloVe 42B.300d embeddings. It uses the same network architecture that flattens the embeddings. We'll see whether these embeddings help us improve performance over the first approach. GloVe 42B.300d provides embeddings for 1.9 million tokens, each of length 300.

Load Glove 42B Embeddings¶

In this section, we have simply loaded GloVe 42B.300d embedding using GloVe() constructor available from gluonnlp Python library.

glove_embeddings = gluonnlp.embedding.GloVe(source='glove.42B.300d')

glove_embeddings

Embedding file glove.42B.300d.npz is not found. Downloading from Gluon Repository. This may take some time.
Downloading /root/.mxnet/embedding/glove/glove.42B.300d.npz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/embeddings/glove/glove.42B.300d.npz...

<gluonnlp.embedding.token_embedding.GloVe at 0x7f0bb0b2ead0>

Create Embeddings Matrix of Vocab Tokens using Glove Embeddings¶

In this section, we have created an embedding matrix using tokens of our vocabulary. The embeddings for tokens of our vocabulary are retrieved from GloVe embeddings. This matrix will become the weight matrix of the embedding layer later on.

%%time

vocab_embeddings = nd.zeros((len(vocab), embed_len), dtype=np.float32)
for i, token in enumerate(vocab.idx_to_token):
    vocab_embeddings[i] = glove_embeddings[token]

vocab_embeddings[5][:10]

CPU times: user 1min 23s, sys: 4.73 s, total: 1min 27s
Wall time: 40.2 s

[-0.24837  -0.45461   0.039227 -0.28422  -0.031852  0.26355  -4.6323
  0.01389  -0.53928  -0.084454]
<NDArray 10 @cpu(0)>
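
The loop above fills one row of the matrix per vocabulary token. The sketch below shows the same idea with plain Python data. The miniature pretrained table, the vocabulary, and the choice of a zero vector for tokens without a pretrained embedding are all assumptions of this sketch; how the loaded GloVe object treats unknown tokens should be checked in the library itself.

```python
# Hypothetical miniature "pretrained" table standing in for GloVe vectors.
pretrained = {
    "software": [0.1, 0.2, 0.3],
    "wireless": [0.4, 0.5, 0.6],
}
embed_len = 3

# Hypothetical vocabulary; index 0 is an unknown-token placeholder.
idx_to_token = ["<unk>", "software", "gizmotron", "wireless"]

def build_matrix(tokens, table, dim):
    """Row i holds the vector of tokens[i]; tokens missing from the table get zeros."""
    zero = [0.0] * dim
    return [list(table.get(token, zero)) for token in tokens]

matrix = build_matrix(idx_to_token, pretrained, embed_len)
missing = [tok for tok in idx_to_token if tok not in pretrained]
print(len(matrix), len(matrix[0]))
print("tokens without a pretrained vector:", missing)
```

Counting the tokens that have no pretrained vector is a quick way to see how much of the vocabulary an embedding set actually covers.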

Define Network¶

Below, we have defined the network that we'll use for our classification task in this section. The network architecture is exactly the same as in our first approach, with the only difference being that the weight matrix of the embedding layer now consists of GloVe 42B.300d embeddings.

from mxnet.gluon import nn

class EmbeddingClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(EmbeddingClassifier, self).__init__(**kwargs)
        embed_layer = nn.Embedding(len(vocab), embed_len) ## Create Embedding Layer
        embed_layer.initialize()
        embed_layer.weight.set_data(vocab_embeddings) ## Initialize it with GloVe Embeddings

        self.seq  = nn.Sequential()
        self.seq.add(embed_layer)
        self.seq.add(nn.Flatten()) ### Flatten Embeddings
        self.seq.add(nn.Dense(128, activation="relu"))
        self.seq.add(nn.Dense(64, activation="relu"))
        self.seq.add(nn.Dense(len(target_classes)))

    def forward(self, x):
        logits = self.seq(x)
        return logits #nd.softmax(logits)

model = EmbeddingClassifier()

model

EmbeddingClassifier(
  (seq): Sequential(
    (0): Embedding(66505 -> 300, float32)
    (1): Flatten
    (2): Dense(None -> 128, Activation(relu))
    (3): Dense(None -> 64, Activation(relu))
    (4): Dense(None -> 4, linear)
  )
)
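
Flattening keeps all 50 x 300 embedding values of an example as inputs to the first dense layer, while averaging or summing reduces them to 300. The sketch below works out the parameter count of that first dense layer in both cases; the helper name dense_params is hypothetical.

```python
max_tokens, embed_len, hidden_units = 50, 300, 128

def dense_params(n_inputs, n_units):
    """Weights plus one bias per output unit of a fully connected layer."""
    return n_inputs * n_units + n_units

flattened = dense_params(max_tokens * embed_len, hidden_units)
pooled = dense_params(embed_len, hidden_units)
print("first dense layer, flattened input:", flattened)
print("first dense layer, pooled input   :", pooled)
print("ratio:", round(flattened / pooled, 1))
```

The pooled count agrees with the 38528 parameters reported for the first dense layer in the summaries of the averaged and summed networks. The much larger first layer of the flattened network may be one reason it fits the training data more easily than it generalizes.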

Train Network¶

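Below, we have trained our network using the same number of epochs, learning rate, loss function, and optimizer that we have been using for all our approaches. Note that the trainer here receives all parameters through collect_params(), so the embedding layer is updated along with the dense layers. The training loss keeps falling (from 0.584 to about 0.04), but the validation loss rises from 0.425 after the first epoch to 0.868 after the last one, while validation accuracy stays between 0.825 and 0.858. This suggests that the network starts overfitting after the first couple of epochs.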

from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=8
learning_rate = 0.001

model = EmbeddingClassifier()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)

/opt/conda/lib/python3.7/site-packages/mxnet/gluon/parameter.py:896: UserWarning: Parameter 'embedding7_weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
  v.initialize(None, ctx, init, force_reinit=force_reinit)
100%|██████████| 118/118 [00:34<00:00,  3.41it/s]

Train CrossEntropyLoss : 0.584
Valid CrossEntropyLoss : 0.425
Valid Accuracy : 0.854

100%|██████████| 118/118 [00:34<00:00,  3.43it/s]

Train CrossEntropyLoss : 0.329
Valid CrossEntropyLoss : 0.427
Valid Accuracy : 0.858

100%|██████████| 118/118 [00:34<00:00,  3.38it/s]

Train CrossEntropyLoss : 0.224
Valid CrossEntropyLoss : 0.503
Valid Accuracy : 0.843

100%|██████████| 118/118 [00:42<00:00,  2.79it/s]

Train CrossEntropyLoss : 0.147
Valid CrossEntropyLoss : 0.532
Valid Accuracy : 0.854

100%|██████████| 118/118 [00:37<00:00,  3.13it/s]

Train CrossEntropyLoss : 0.102
Valid CrossEntropyLoss : 0.618
Valid Accuracy : 0.850

100%|██████████| 118/118 [00:37<00:00,  3.16it/s]

Train CrossEntropyLoss : 0.056
Valid CrossEntropyLoss : 0.705
Valid Accuracy : 0.848

100%|██████████| 118/118 [00:36<00:00,  3.19it/s]

Train CrossEntropyLoss : 0.036
Valid CrossEntropyLoss : 0.988
Valid Accuracy : 0.825

100%|██████████| 118/118 [00:37<00:00,  3.15it/s]

Train CrossEntropyLoss : 0.045
Valid CrossEntropyLoss : 0.868
Valid Accuracy : 0.840
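
The validation metrics printed above can be used to pick the epoch at which training would ideally have stopped. The sketch below only processes the numbers from the log; it does not change the training loop, and the helper epochs_without_improvement is a hypothetical name.

```python
# Validation metrics copied from the eight epochs printed above.
valid_loss = [0.425, 0.427, 0.503, 0.532, 0.618, 0.705, 0.988, 0.868]
valid_acc = [0.854, 0.858, 0.843, 0.854, 0.850, 0.848, 0.825, 0.840]

best_loss_epoch = valid_loss.index(min(valid_loss)) + 1  # epochs are 1-based
best_acc_epoch = valid_acc.index(max(valid_acc)) + 1

def epochs_without_improvement(losses):
    """Count the epochs trained after the lowest validation loss was observed."""
    return len(losses) - 1 - losses.index(min(losses))

print("lowest validation loss at epoch:", best_loss_epoch)
print("highest validation accuracy at epoch:", best_acc_epoch)
print("epochs trained past the best loss:", epochs_without_improvement(valid_loss))
```

A check like this is the basis of early stopping: training is halted, or the best weights are kept, once the validation loss has not improved for a chosen number of epochs.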

Evaluate Network Performance¶

In this section, we have evaluated the performance of our network by calculating the accuracy score, classification report, and confusion matrix on test predictions. We can notice from the accuracy score (about 84.0%) that it is better than our first approach (about 81.2%) but a little lower than the averaged (about 85.6%) and summed (about 86.6%) approaches. We have also plotted the confusion matrix for reference purposes.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))

Test Accuracy : 0.8403947368421053
Classification Report :
              precision    recall  f1-score   support

       World       0.77      0.90      0.83      1900
      Sports       0.92      0.92      0.92      1900
    Business       0.86      0.73      0.79      1900
    Sci/Tech       0.83      0.81      0.82      1900

    accuracy                           0.84      7600
   macro avg       0.84      0.84      0.84      7600
weighted avg       0.84      0.84      0.84      7600


Confusion Matrix :
[[1708   62   67   63]
 [ 112 1755   14   19]
 [ 239   49 1379  233]
 [ 172   46  137 1545]]

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Network Predictions using LIME Algorithm¶

In this section, we have explained predictions made by the network using LIME algorithm. The network correctly predicts the target label as Sci/Tech for randomly selected text example from the test dataset. The visualization shows that words like 'technology', 'software', 'management', 'devices', 'wireless', 'guns', 'intel', etc are contributing to predicting target category as Sci/Tech.

from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Actual :     ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1], num_features=15)
explanation.show_in_notebook()

6. Results Summary and Further Recommendations ¶

Approach	Max Tokens	Embedding Length	Test Accuracy (%)
GloVe '840B' Flattened	50	300	81.17
GloVe '840B' Averaged	50	300	85.57
GloVe '840B' Summed	50	300	86.60
GloVe '42B' Flattened	50	300	84.03

Further Suggestions¶

Try different tokens per text example.
Try different functions on embeddings other than average and sum.
Train network for more epochs.
Try other GloVe embeddings like 42B, 27B, 6B, etc.
Add more dense layers with different output units.
Try different weight initializers.
Try different optimizers.
Try different activation functions.
Try learning rate schedulers.
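
As one example of a function other than average and sum, the embeddings of an example could be reduced with an element-wise maximum over its tokens. This is only a sketch of the idea on plain Python lists with made-up values; it was not tried in this tutorial, and the function name max_pool is hypothetical.

```python
def max_pool(batch):
    """Element-wise maximum over tokens: (batch, tokens, dim) -> (batch, dim)."""
    return [[max(dim_values) for dim_values in zip(*example)] for example in batch]

# Tiny stand-in batch: 2 examples, 2 tokens each, embedding length 3.
batch = [
    [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]],
    [[-1.0, -1.0, -1.0], [-2.0, 0.0, -3.0]],
]
pooled = max_pool(batch)
print(pooled)
```

Like averaging and summing, this reduces each example to a single vector of the embedding length, so the dense layers would not need to change.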

Sunny Solanki


