Updated On : Apr-07,2022 Tags mxnet, word-embeddings, …
How to Use Word Embeddings With MXNet For Text Classification Tasks?

Neural networks work only on numeric input (float/int), so text data must be converted to real-valued data before it can be used for machine learning tasks. This conversion is generally referred to as vectorization, and there are various ways to do it (frequency count, Tf-Idf, one-hot encoding, word embeddings, etc.). Word embeddings are one such vectorization approach.

Generally, we first tokenize the data, splitting each text example into tokens (words, punctuation marks, special characters, etc.). We keep track of all tokens from the whole dataset (all text examples) by creating a vocabulary of tokens. Then we assign a real-valued vector to each token. These vectors are generally referred to as word embeddings, and their length is a hyperparameter that we choose. Initially, the vectors hold random numbers; they are updated during training so that they come to capture the meaning of each token in the context of the task.

Other vectorization approaches like frequency count and Tf-Idf use just one real-valued number to represent a token, whereas word embeddings use a real-valued vector (a list of floats). Hence word embeddings have more representational power and can capture more about a word than those approaches.
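
To make the contrast concrete, here is a minimal pure-Python sketch (not MXNet code) of the two kinds of representation for a toy vocabulary. The names toy_vocab, count_vector, and embed, and the embedding length of 3, are illustrative assumptions; in a real network the embedding values are learned during training rather than left at their random initial values.

```python
import random

# Toy vocabulary: token -> integer index (index 0 is reserved for unknown tokens).
toy_vocab = {"<unk>": 0, "stocks": 1, "rise": 2, "team": 3, "wins": 4}

def count_vector(tokens):
    """Frequency-count vectorization: a single number per vocabulary entry."""
    counts = [0] * len(toy_vocab)
    for tok in tokens:
        counts[toy_vocab.get(tok, 0)] += 1
    return counts

# Embedding table: one vector of floats per vocabulary entry, randomly initialized.
random.seed(0)
embedding_len = 3
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_len)]
                   for _ in toy_vocab]

def embed(tokens):
    """Embedding vectorization: look up one vector per token, preserving order."""
    return [embedding_table[toy_vocab.get(tok, 0)] for tok in tokens]

tokens = ["stocks", "rise", "stocks"]
print(count_vector(tokens))                       # [0, 2, 1, 0, 0]
print(len(embed(tokens)), len(embed(tokens)[0]))  # 3 3
```

Both occurrences of "stocks" receive the same vector, because the lookup depends only on the token's index.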

How to Use Word Embeddings With MXNet Networks?

As a part of this tutorial, we'll explain how to design MXNet networks that use word embeddings for text classification tasks. We try several different approaches to handling word embeddings and compare them. We'll be using the gluonnlp library to count tokens and build the vocabulary used to vectorize text data.

Below, we have highlighted the important sections of the tutorial to give an overview of the material covered.

Important Sections Of Tutorial

  1. Prepare Data
    • 1.1 Load Dataset
    • 1.2 Tokenize Text Data And Populate Vocabulary
    • 1.3 Function To Vectorize Text Data
    • 1.4 Create Data Loaders
  2. Approach 1 - Word Embeddings (Max Tokens=50, Embeddings Length=15)
    • Define Text Classification Network
    • Train Network
    • Evaluate Network Performance
    • Explain Network Predictions Using LIME
  3. Approach 2 - Word Embeddings (Max Tokens=50, Embeddings Length=50)
  4. Approach 3 - Word Embeddings Averaged (Max Tokens=50, Embeddings Length=50)
  5. Approach 4 - Word Embeddings Summed (Max Tokens=50, Embeddings Length=50)
  6. Summarized Results Of Approaches And Further Suggestions

Below, we have imported the main libraries and printed the versions used in this tutorial.

In [1]:
import mxnet

print("MXNet Version : {}".format(mxnet.__version__))
MXNet Version : 1.9.0
In [2]:
import gluonnlp

print("GluonNLP Version : {}".format(gluonnlp.__version__))
GluonNLP Version : 0.10.0
In [3]:
import torchtext

print("TorchText Version : {}".format(torchtext.__version__))
TorchText Version : 0.10.1

1. Prepare Data

In this section, we have prepared our data for the text classification task. We have loaded AG NEWS dataset from torchtext library, tokenized text examples from the dataset, populated vocabulary with tokens from text examples, and created data loaders that will be used during training. The data loader will generate indexes of tokens based on populated vocabulary for neural network input.

1.1 Load Dataset

In this section, we have loaded the AG NEWS dataset available from the torchtext library. The dataset has text documents for 4 different categories (["World", "Sports", "Business", "Sci/Tech"]). Each example is a (label, text) pair, so we first unzip the examples into labels (Y) and texts (X). After loading the train and test datasets, we have converted each of them to a gluon ArrayDataset, the data structure mxnet uses to hold datasets. We'll use them later to create the data loaders used during training.

Category    Target Label
World       1
Sports      2
Business    3
Sci/Tech    4
In [4]:
from mxnet.gluon.data import ArrayDataset

train_dataset, test_dataset = torchtext.datasets.AG_NEWS()

Y_train, X_train = zip(*list(train_dataset))
Y_test,  X_test  = zip(*list(test_dataset))

train_dataset = ArrayDataset(X_train, Y_train)
test_dataset  = ArrayDataset(X_test, Y_test)
train.csv: 29.5MB [00:00, 99.8MB/s]
test.csv: 1.86MB [00:00, 66.7MB/s]

1.2 Tokenize Text Data And Populate Vocabulary

In this section, we have created a tokenizer that will be used to tokenize text data and then populated a vocabulary. The tokenizer takes a text document as input and returns a list of tokens, which are the words/punctuation marks of the document. The vocabulary is a simple mapping from tokens to integer indexes. Each token is assigned a unique index, with index 0 reserved for the unknown token (<unk>). Later on, we'll vectorize text data to a list of indexes using the tokenizer and vocabulary.

The gluonnlp library provides many tokenizers. Below, we show how to load the SpacyTokenizer available from it and use it to tokenize text. We won't be using it in the rest of the tutorial; it is included only to show that gluonnlp offers ready-made tokenizers.

In [5]:
import spacy

spacy.load('en_core_web_sm')

spacy_tokenizer = gluonnlp.data.SpacyTokenizer(lang="en_core_web_sm")

spacy_tokenizer("Hello, How are you?")
/opt/conda/lib/python3.7/site-packages/spacy/pipeline/lemmatizer.py:211: UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for one or more tokens. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
  warnings.warn(Warnings.W108)
Out[5]:
['Hello', ',', 'How', 'are', 'you', '?']

Below, we have created a simple tokenizer function that we'll use for our purpose. It takes a text document as input and returns a list of words from it, using the regular expression \w+ to match runs of word characters; punctuation is dropped.

In [6]:
import re
from functools import partial

def tokenizer(text):
    """Split text into word tokens (runs of letters, digits, underscores)."""
    return re.findall(r"\w+", text)

tokenizer("Hello, How are you?")
Out[6]:
['Hello', 'How', 'are', 'you']

Below, we have created a vocabulary using the Vocab() constructor available from the gluonnlp library. It requires a Counter object, which is simply a mapping from each token to its frequency (the number of times it appeared across all text documents). Every token in the Counter that meets the min_freq threshold is included in the vocabulary. We have assigned the string <unk> for unknown tokens. Later, when vectorizing, words that are not part of the vocabulary will be mapped to this token.

To populate the Counter object, we have used the count_tokens() function available from the data module of gluonnlp. We first create an empty Counter, then loop through each text document of both datasets, tokenize it, and call count_tokens() on the list of tokens, passing the same Counter to every call. Because to_lower=True is set, tokens are lowercased before being counted. The Counter gradually fills with tokens and their frequencies.

After populating the Counter object, we have created the Vocab object from it. We have also printed the vocabulary size (number of tokens in the vocabulary) at the end.

In [7]:
from collections import Counter

counter = Counter()

for dataset in [train_dataset, test_dataset]:
    for X, Y in dataset:
        gluonnlp.data.count_tokens(tokenizer(X), to_lower=True, counter=counter)

vocab = gluonnlp.Vocab(counter=counter, special_token="<unk>", min_freq=1)

print("Vocabulary Size : {}".format(len(vocab)))
Vocabulary Size : 66505
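
The mapping that a vocabulary maintains can be pictured with a small pure-Python sketch. This is not how gluonnlp implements Vocab; build_vocab and to_indices are hypothetical helpers, and the sketch reserves only index 0 for <unk>, whereas the real Vocab may reserve further indexes for other special tokens.

```python
import re
from collections import Counter

def build_vocab(documents, unknown_token="<unk>"):
    """Map each lowercased token to an index; index 0 is the unknown token."""
    counter = Counter()
    for doc in documents:
        counter.update(tok.lower() for tok in re.findall(r"\w+", doc))
    # Most frequent tokens get the smallest indexes; ties are broken alphabetically.
    ordered = sorted(counter, key=lambda tok: (-counter[tok], tok))
    token_to_idx = {unknown_token: 0}
    for tok in ordered:
        token_to_idx[tok] = len(token_to_idx)
    return token_to_idx

def to_indices(text, token_to_idx):
    """Tokenize, lowercase, and look up each token; unseen tokens map to 0."""
    return [token_to_idx.get(tok.lower(), 0) for tok in re.findall(r"\w+", text)]

docs = ["Oil prices rise", "Prices fall as oil supply grows"]
toy_vocab = build_vocab(docs)
print(len(toy_vocab))                             # 8 (7 distinct words + <unk>)
print(to_indices("Oil prices soar", toy_vocab))   # [1, 2, 0]
```

The word "soar" never appeared in the toy corpus, so it falls back to index 0, which is the role <unk> plays in the real vocabulary.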

1.3 Function To Vectorize Text Data

In this section, we have created a simple function that the data loaders will use later to vectorize text documents into lists of vocabulary indexes. The function takes a batch of data as input. It first separates the text documents (X) and their target labels (Y) into separate variables. Then it loops through each text document, tokenizes it into a list of tokens, and retrieves the index of each token from the vocabulary. Note that the vocabulary was built with to_lower=True, so it holds lowercased tokens, whereas the tokenizer does not lowercase its output; tokens containing uppercase letters are therefore looked up as-is and map to the <unk> index unless the text is lowercased first.

We have decided to keep 50 tokens per text example. The first 50 tokens of each example are kept and the rest, if present, are ignored. If an example has fewer than 50 tokens, we pad it with 0s (the <unk> token index). This way all examples have exactly 50 tokens.

Finally, we have converted the lists to mxnet nd arrays and returned them. Please note that we have subtracted 1 from the target labels, as they are in the range 1-4 and we want labels in the range 0-3.

In [8]:
import gluonnlp.data.batchify as bf
from mxnet import nd
import numpy as np

def vectorize(batch):
    X, Y = list(zip(*batch))
    X = [[vocab(word) for word in tokenizer(sample)] for sample in X]
    X = [sample+([0]* (50-len(sample))) if len(sample)<50 else sample[:50] for sample in X] ## Bringing all samples to 50 length.
    return nd.array(X, dtype=np.int32), nd.array(Y, dtype=np.int32) - 1 # Subtracting 1 from labels to bring them in range 0-3 from 1-4

vectorize([["how are you", 1]])
Out[8]:
(
 [[381  44 175   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0   0   0   0   0   0   0   0]]
 <NDArray 1x50 @cpu(0)>,

 [0]
 <NDArray 1 @cpu(0)>)
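
The pad-or-truncate step is easier to follow in isolation. The sketch below pulls it out into a hypothetical helper, pad_or_truncate (a name that does not appear in the tutorial's code), and uses a maximum length of 6 instead of 50 so the output stays readable.

```python
def pad_or_truncate(indices, max_tokens=50, pad_index=0):
    """Return a list of exactly max_tokens indexes."""
    if len(indices) >= max_tokens:
        return indices[:max_tokens]                              # keep the first max_tokens
    return indices + [pad_index] * (max_tokens - len(indices))   # right-pad with pad_index

short = pad_or_truncate([381, 44, 175], max_tokens=6)
long_ = pad_or_truncate(list(range(1, 11)), max_tokens=6)
print(short)   # [381, 44, 175, 0, 0, 0]
print(long_)   # [1, 2, 3, 4, 5, 6]
```

Because every example ends up with the same length, a batch of examples can be stacked into a single rectangular array.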

1.4 Create Data Loaders

In this section, we have created train and test data loaders using the datasets we created earlier. They will be used during the training process to loop through the data in batches. We have created the data loaders using the DataLoader() constructor available from mxnet. We have chosen a batch size of 1024, so each batch will have 1024 examples (the last batch may be smaller). We have also passed the vectorization function created in the previous section to the batchify_fn parameter of the DataLoader() constructor.

In [9]:
from mxnet.gluon.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=1024, batchify_fn=vectorize)
test_loader  = DataLoader(test_dataset,  batch_size=1024, batchify_fn=vectorize)

target_classes = ["World", "Sports", "Business", "Sci/Tech"]
In [10]:
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
(1024, 50) (1024,)
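
As a quick sanity check on the batching arithmetic, the sketch below computes how many batches one pass over each split produces when the final, smaller batch is kept. It assumes the standard AG NEWS split sizes of 120,000 training and 7,600 test examples; num_batches and last_batch_size are hypothetical helper names.

```python
import math

def num_batches(num_examples, batch_size):
    """Batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)

def last_batch_size(num_examples, batch_size):
    """Size of the final batch (equal to batch_size when it divides evenly)."""
    remainder = num_examples % batch_size
    return remainder if remainder else batch_size

print(num_batches(120_000, 1024), last_batch_size(120_000, 1024))  # 118 192
print(num_batches(7_600, 1024), last_batch_size(7_600, 1024))      # 8 432
```

The figure of 118 training batches is the count that a progress bar wrapped around the training loader would report for each epoch.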

2. Approach 1 - Word Embeddings (Max Tokens=50, Embeddings Length=15)

Our first approach uses an embedding size of 15 which means that it'll assign a real-valued vector of length 15 to each token of our data. We have designed a simple network to classify text documents.

Define Text Classification Network

In this section, we have designed a simple neural network that consists of one embedding layer and 2 dense layers for our text classification task. The embedding layer holds an embedding (real-valued vector) of length 15 for each token of our vocabulary. We have created it using the Embedding() constructor available from the nn module of mxnet by giving it the vocabulary length and an embedding size of 15. This internally creates a weight matrix of shape (vocab_len, 15). The embedding layer simply maps the index of a token to the corresponding row of that matrix: it takes a list of indexes as input and returns an embedding of length 15 for each index. As a single example consists of 50 tokens, the layer returns a 50 x 15 array per example. So if we give an input of shape (1024, 50) to the embedding layer, it returns an output of shape (1024, 50, 15), where 1024 is our batch size and 50 is the number of token indexes kept per example by the vectorize function.

The output of the embedding layer is flattened and given to the first dense layer. Flattening changes the shape from (1024, 50, 15) to (1024, 50 x 15) = (1024, 750). The flattened output is given to a dense layer that has 128 output units and applies relu activation. The output of the first dense layer is given to another dense layer that has 4 output units (the same as the number of unique target labels). The output of the second dense layer is our prediction.

After defining the network, we have also initialized it and performed a forward pass using random data for verification purposes.

We have built the layers using the Sequential API of mxnet, which applies layers to the input in the order in which they were added, and wrapped it in a Block subclass.

In [11]:
from mxnet.gluon import nn

class EmbeddingClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(EmbeddingClassifier, self).__init__(**kwargs)
        self.seq  = nn.Sequential()
        self.seq.add(nn.Embedding(len(vocab), 15)) ## word embeddings length=15
        self.seq.add(nn.Flatten()) ## Embeddings flattened
        self.seq.add(nn.Dense(128, activation="relu"))
        self.seq.add(nn.Dense(len(target_classes)))

    def forward(self, x):
        logits = self.seq(x)
        return logits #nd.softmax(logits)

model = EmbeddingClassifier()

model
Out[11]:
EmbeddingClassifier(
  (seq): Sequential(
    (0): Embedding(66505 -> 15, float32)
    (1): Flatten
    (2): Dense(None -> 128, Activation(relu))
    (3): Dense(None -> 4, linear)
  )
)
In [12]:
from mxnet import init, initializer

model.initialize(initializer.Xavier())

preds = model(nd.random.randint(1,10000, shape=(10,50)))

preds.shape
Out[12]:
(10, 4)
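
The shapes described above, and the number of trainable parameters they imply, can be traced with plain arithmetic. The sketch below is not MXNet code; embedding_classifier_shapes is a hypothetical helper, and the parameter counts assume each dense layer has one weight per input-output pair plus one bias per output unit.

```python
def embedding_classifier_shapes(batch_size, max_tokens, vocab_size, embed_len,
                                hidden_units=128, num_classes=4):
    """Trace output shapes and parameter counts layer by layer."""
    flat = max_tokens * embed_len
    shapes = {
        "input":     (batch_size, max_tokens),
        "embedding": (batch_size, max_tokens, embed_len),
        "flatten":   (batch_size, flat),
        "dense1":    (batch_size, hidden_units),
        "dense2":    (batch_size, num_classes),
    }
    params = {
        "embedding": vocab_size * embed_len,                  # one row per token
        "dense1":    flat * hidden_units + hidden_units,      # weights + biases
        "dense2":    hidden_units * num_classes + num_classes,
    }
    return shapes, params

shapes, params = embedding_classifier_shapes(1024, 50, 66505, 15)
print(shapes["embedding"], shapes["flatten"])   # (1024, 50, 15) (1024, 750)
print(params)                                   # embedding: 997575, dense1: 96128, dense2: 516
print(sum(params.values()))                     # 1094219
```

Most of the parameters sit in the embedding table, which is why the embedding length has such a large effect on model size.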

Train Network

In this section, we have trained our network. To do so, we have designed a helper function that takes a Trainer object, train data loader, validation data loader, and number of epochs as input. Note that it reads the network (model) and the loss function (loss_func) from global variables, so both must be defined before it is called.

The function runs the training loop once per epoch. For each epoch, it loops through the training data in batches using the train loader. For each batch, it calculates model predictions, calculates the loss value, calculates gradients, and updates network parameters by calling the step() function on the Trainer object. The function records the loss for each batch and prints the average loss at the end of the epoch. We have also calculated validation loss and validation accuracy at the end of each epoch, using two further helper functions.

In [13]:
from mxnet import autograd
from tqdm import tqdm
from sklearn.metrics import accuracy_score

def MakePredictions(model, val_loader):
    Y_actuals, Y_preds = [], []
    for X_batch, Y_batch in val_loader:
        preds = model(X_batch)
        preds = nd.softmax(preds)
        Y_actuals.append(Y_batch)
        Y_preds.append(preds.argmax(axis=-1))

    Y_actuals, Y_preds = nd.concatenate(Y_actuals), nd.concatenate(Y_preds)
    return Y_actuals, Y_preds

def CalcValLoss(model, val_loader):
    losses = []
    for X_batch, Y_batch in val_loader:
        val_loss = loss_func(model(X_batch), Y_batch)
        val_loss = val_loss.mean().asscalar()
        losses.append(val_loss)
    print("Valid CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))

def TrainModelInBatches(trainer, train_loader, val_loader, epochs):
    for i in range(1, epochs+1):
        losses = [] ## Record loss of each batch
        for X_batch, Y_batch in tqdm(train_loader):
            with autograd.record():
                preds = model(X_batch) ## Forward pass to make predictions
                train_loss = loss_func(preds.squeeze(), Y_batch) ## Calculate Loss
            train_loss.backward() ## Calculate Gradients

            train_loss = train_loss.mean().asscalar()
            losses.append(train_loss)

            trainer.step(len(X_batch)) ## Update weights

        print("Train CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))
        CalcValLoss(model, val_loader)
        Y_actuals, Y_preds = MakePredictions(model, val_loader)
        print("Valid Accuracy : {:.3f}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
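
To give a feel for the loss values that the training loop prints, the following pure-Python sketch computes softmax cross-entropy for a single example from raw logits and an integer label. It is an illustration of the formula only, not the MXNet implementation; softmax and cross_entropy are hypothetical helper names.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], label=2)    # no preference among 4 classes
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], label=2)  # strongly favors the true class
print(round(uniform, 3))      # 1.386, i.e. ln(4)
print(confident < uniform)    # True
```

With 4 classes, a loss near 1.386 therefore means the network is doing no better than guessing uniformly, and values well below that indicate it has learned something.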

Below, we have initiated the necessary parameters and performed training by calling the training routine we designed in the previous cell.

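We have set the number of epochs to 15 and the learning rate to 0.001. Then, we have initialized our network, cross entropy loss, RMSProp optimizer, and Trainer object. Finally, we have called our training function to perform training.

From the loss and accuracy printed after each epoch, we can see that validation accuracy climbs to about 0.87 by epochs 7-8 and then slowly declines, while validation loss bottoms out at 0.373 in epoch 6 and rises to 0.491 by epoch 15 even as training loss keeps falling. In other words, the network learns the task quickly but starts to overfit after the first few epochs.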

In [14]:
from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = EmbeddingClassifier()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.RMSProp(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:08<00:00, 13.69it/s]
Train CrossEntropyLoss : 1.202
Valid CrossEntropyLoss : 0.770
Valid Accuracy : 0.725
100%|██████████| 118/118 [00:07<00:00, 15.50it/s]
Train CrossEntropyLoss : 0.551
Valid CrossEntropyLoss : 0.471
Valid Accuracy : 0.832
100%|██████████| 118/118 [00:10<00:00, 11.53it/s]
Train CrossEntropyLoss : 0.399
Valid CrossEntropyLoss : 0.411
Valid Accuracy : 0.853
100%|██████████| 118/118 [00:10<00:00, 11.57it/s]
Train CrossEntropyLoss : 0.342
Valid CrossEntropyLoss : 0.388
Valid Accuracy : 0.861
100%|██████████| 118/118 [00:09<00:00, 12.67it/s]
Train CrossEntropyLoss : 0.307
Valid CrossEntropyLoss : 0.377
Valid Accuracy : 0.866
100%|██████████| 118/118 [00:07<00:00, 15.76it/s]
Train CrossEntropyLoss : 0.282
Valid CrossEntropyLoss : 0.373
Valid Accuracy : 0.867
100%|██████████| 118/118 [00:08<00:00, 14.26it/s]
Train CrossEntropyLoss : 0.260
Valid CrossEntropyLoss : 0.374
Valid Accuracy : 0.871
100%|██████████| 118/118 [00:11<00:00,  9.99it/s]
Train CrossEntropyLoss : 0.241
Valid CrossEntropyLoss : 0.378
Valid Accuracy : 0.872
100%|██████████| 118/118 [00:11<00:00,  9.92it/s]
Train CrossEntropyLoss : 0.223
Valid CrossEntropyLoss : 0.386
Valid Accuracy : 0.871
100%|██████████| 118/118 [00:10<00:00, 11.28it/s]
Train CrossEntropyLoss : 0.205
Valid CrossEntropyLoss : 0.397
Valid Accuracy : 0.869
100%|██████████| 118/118 [00:10<00:00, 11.31it/s]
Train CrossEntropyLoss : 0.188
Valid CrossEntropyLoss : 0.410
Valid Accuracy : 0.868
100%|██████████| 118/118 [00:08<00:00, 13.35it/s]
Train CrossEntropyLoss : 0.170
Valid CrossEntropyLoss : 0.426
Valid Accuracy : 0.866
100%|██████████| 118/118 [00:12<00:00,  9.57it/s]
Train CrossEntropyLoss : 0.153
Valid CrossEntropyLoss : 0.445
Valid Accuracy : 0.864
100%|██████████| 118/118 [00:12<00:00,  9.77it/s]
Train CrossEntropyLoss : 0.136
Valid CrossEntropyLoss : 0.465
Valid Accuracy : 0.862
100%|██████████| 118/118 [00:08<00:00, 13.54it/s]
Train CrossEntropyLoss : 0.119
Valid CrossEntropyLoss : 0.491
Valid Accuracy : 0.859

Evaluate Network Performance

In this section, we have evaluated the performance of our network by calculating the accuracy, classification report (precision, recall, and f1-score), and confusion matrix on test predictions. We have calculated all metrics using functions available from scikit-learn. The test accuracy is about 0.859. We can notice from the classification report that our model does better on the World and Sports categories (f1-scores of 0.87 and 0.92) than on Business and Sci/Tech (0.82 and 0.83); the confusion matrix shows that these last two categories are most often confused with each other.

In [15]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
Test Accuracy : 0.8585526315789473
Classification Report :
              precision    recall  f1-score   support

       World       0.88      0.85      0.87      1900
      Sports       0.89      0.95      0.92      1900
    Business       0.81      0.83      0.82      1900
    Sci/Tech       0.85      0.81      0.83      1900

    accuracy                           0.86      7600
   macro avg       0.86      0.86      0.86      7600
weighted avg       0.86      0.86      0.86      7600


Confusion Matrix :
[[1617  103  116   64]
 [  44 1804   26   26]
 [  90   51 1568  191]
 [  84   59  221 1536]]
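
The numbers in the classification report can be recomputed by hand from the confusion matrix, which is a useful check on how to read it: rows are actual classes and columns are predicted classes. The sketch below copies the matrix printed above and derives accuracy, precision, and recall in plain Python; the variable names are illustrative.

```python
confusion = [
    [1617,  103,  116,   64],
    [  44, 1804,   26,   26],
    [  90,   51, 1568,  191],
    [  84,   59,  221, 1536],
]
classes = ["World", "Sports", "Business", "Sci/Tech"]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total

per_class = {}
for i, name in enumerate(classes):
    recall = confusion[i][i] / sum(confusion[i])                    # row = actual class
    precision = confusion[i][i] / sum(row[i] for row in confusion)  # column = predicted class
    per_class[name] = (precision, recall)
    print("{:<9} precision={:.2f} recall={:.2f}".format(name, precision, recall))

print("accuracy={:.4f}".format(accuracy))   # accuracy=0.8586
```

The largest off-diagonal entries (191 and 221) are the Business/Sci/Tech pair, which is where most of the errors come from.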

Explain Network Predictions Using LIME

In this section, we have tried to explain a prediction made by our network using the LIME algorithm implementation available from the lime library.

To explain a prediction using LIME, we first need to create an instance of LimeTextExplainer from the lime_text module of lime. Then, we call the explain_instance() method on it to create an Explanation instance. Finally, we call the show_in_notebook() method on the Explanation instance to create a visualization that highlights the words contributing to the prediction.

Below, we have simply retrieved test examples from the test dataset.

In [16]:
X_test, Y_test = [], []
for X, Y in test_dataset:
    X_test.append(X)
    Y_test.append(Y-1)

Below, we have first initialized a LimeTextExplainer instance using the target class names. Then, we have created a function that takes a list of text documents as input and returns the class probabilities predicted by the model. The function tokenizes the data, vectorizes it using the vocabulary, and makes predictions using the model, applying softmax to the model output to obtain probabilities.

After defining the function, we have randomly selected one sample from the test examples and made a prediction on it using our trained model. Our model correctly predicts the target label as 'Business' for the selected sample.

In [17]:
from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

def make_predictions(X_batch_text):
    X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_batch_text]
    X_batch = [sample+([0]* (50-len(sample))) if len(sample)<50 else sample[:50] for sample in X_batch] ## Bringing all samples to 50 length.
    logits = model(nd.array(X_batch, dtype=np.int32))
    preds = nd.softmax(logits)
    return preds.asnumpy()

rng = np.random.RandomState(123)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [sample+([0]* (50-len(sample))) if len(sample)<50 else sample[:50] for sample in X_batch] ## Bringing all samples to 50 length.
preds = model(nd.array(X_batch)).argmax(axis=-1)
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
print("Actual :     ", target_classes[Y_test[idx]])
Prediction :  Business
Actual :      Business

Below, we have first called the explain_instance() method with the selected text example, the classification function, and the target label. It returns an Explanation object. Then, we have called the show_in_notebook() method on the Explanation object to create the visualization.

We can notice from the visualization that words like 'investor', 'forecasts', 'ticker', 'fullquote', etc are contributing to predicting the target category as 'Business' which makes sense as they are commonly used words in the business world.

In [ ]:
explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
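
LIME works by sampling many perturbed copies of the text with words removed and fitting a locally weighted linear model to the classifier's outputs. The sketch below is a much simpler stand-in (leave-one-word-out occlusion, not LIME itself) that conveys the same intuition: a word matters if removing it changes the predicted probability. Here toy_classifier is a hypothetical keyword-based scorer, used only so that the example runs without a trained network, and BUSINESS_WORDS is an assumed keyword list.

```python
import re

BUSINESS_WORDS = {"investor", "forecasts", "stocks"}  # assumption for the toy scorer

def toy_classifier(text):
    """Hypothetical stand-in: 'Business' probability grows with keyword hits."""
    tokens = re.findall(r"\w+", text.lower())
    hits = sum(tok in BUSINESS_WORDS for tok in tokens)
    return hits / (hits + 1)

def occlusion_importance(text, predict):
    """Score each word by the probability drop when that word is removed."""
    words = text.split()
    base = predict(text)
    scores = []
    for i, word in enumerate(words):
        without = " ".join(words[:i] + words[i + 1:])
        scores.append((word, base - predict(without)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = occlusion_importance("investor forecasts lift the market", toy_classifier)
for word, drop in ranking:
    print("{:<10} {:+.3f}".format(word, drop))
```

The two keywords come out on top with equal scores, while the remaining words score zero because removing them leaves the toy probability unchanged.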

3. Approach 2 - Word Embeddings (Max Tokens=50, Embeddings Length=50)

Our approach in this section is almost exactly the same as in the previous section; the only difference is the length of the embeddings. Here we use an embedding length of 50 per token instead of 15. The majority of the code is the same as in the previous section.

Define Text Classification Network

Below, we have defined a network that we'll use for our text classification task in this section. The network is exactly the same as our network from the previous section with the only difference in embedding length provided to Embedding layer which is 50 in this case. The rest of the network is the same as earlier.

In [19]:
from mxnet.gluon import nn

class EmbeddingClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(EmbeddingClassifier, self).__init__(**kwargs)
        self.seq  = nn.Sequential()
        self.seq.add(nn.Embedding(len(vocab), 50)) ## word embeddings length = 50
        self.seq.add(nn.Flatten()) ## Embeddings flattened
        self.seq.add(nn.Dense(128, activation="relu"))
        self.seq.add(nn.Dense(len(target_classes)))

    def forward(self, x):
        logits = self.seq(x)
        return logits #nd.softmax(logits)

model = EmbeddingClassifier()

model
Out[19]:
EmbeddingClassifier(
  (seq): Sequential(
    (0): Embedding(66505 -> 50, float32)
    (1): Flatten
    (2): Dense(None -> 128, Activation(relu))
    (3): Dense(None -> 4, linear)
  )
)

Train Network

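Below, we have trained our network using exactly the same settings as in the previous section. From the loss and accuracy values printed after each epoch, we can see that validation accuracy peaks at 0.868 in epoch 6, while validation loss reaches its minimum of 0.379 in epoch 5 and then climbs to 0.714 by epoch 15 as training loss drops to 0.019. The larger embeddings overfit more strongly than the length-15 embeddings did.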

In [20]:
from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = EmbeddingClassifier()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.RMSProp(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:13<00:00,  8.53it/s]
Train CrossEntropyLoss : 1.124
Valid CrossEntropyLoss : 0.711
Valid Accuracy : 0.721
100%|██████████| 118/118 [00:12<00:00,  9.44it/s]
Train CrossEntropyLoss : 0.513
Valid CrossEntropyLoss : 0.459
Valid Accuracy : 0.839
100%|██████████| 118/118 [00:12<00:00,  9.74it/s]
Train CrossEntropyLoss : 0.379
Valid CrossEntropyLoss : 0.405
Valid Accuracy : 0.855
100%|██████████| 118/118 [00:10<00:00, 11.50it/s]
Train CrossEntropyLoss : 0.323
Valid CrossEntropyLoss : 0.387
Valid Accuracy : 0.862
100%|██████████| 118/118 [00:10<00:00, 11.24it/s]
Train CrossEntropyLoss : 0.284
Valid CrossEntropyLoss : 0.379
Valid Accuracy : 0.867
100%|██████████| 118/118 [00:11<00:00, 10.38it/s]
Train CrossEntropyLoss : 0.249
Valid CrossEntropyLoss : 0.381
Valid Accuracy : 0.868
100%|██████████| 118/118 [00:11<00:00, 10.24it/s]
Train CrossEntropyLoss : 0.216
Valid CrossEntropyLoss : 0.391
Valid Accuracy : 0.867
100%|██████████| 118/118 [00:11<00:00, 10.16it/s]
Train CrossEntropyLoss : 0.181
Valid CrossEntropyLoss : 0.410
Valid Accuracy : 0.866
100%|██████████| 118/118 [00:13<00:00,  9.07it/s]
Train CrossEntropyLoss : 0.146
Valid CrossEntropyLoss : 0.436
Valid Accuracy : 0.863
100%|██████████| 118/118 [00:12<00:00,  9.68it/s]
Train CrossEntropyLoss : 0.112
Valid CrossEntropyLoss : 0.475
Valid Accuracy : 0.859
100%|██████████| 118/118 [00:11<00:00, 10.53it/s]
Train CrossEntropyLoss : 0.083
Valid CrossEntropyLoss : 0.517
Valid Accuracy : 0.856
100%|██████████| 118/118 [00:10<00:00, 10.85it/s]
Train CrossEntropyLoss : 0.058
Valid CrossEntropyLoss : 0.559
Valid Accuracy : 0.852
100%|██████████| 118/118 [00:10<00:00, 10.93it/s]
Train CrossEntropyLoss : 0.040
Valid CrossEntropyLoss : 0.611
Valid Accuracy : 0.851
100%|██████████| 118/118 [00:11<00:00, 10.57it/s]
Train CrossEntropyLoss : 0.027
Valid CrossEntropyLoss : 0.666
Valid Accuracy : 0.846
100%|██████████| 118/118 [00:12<00:00,  9.64it/s]
Train CrossEntropyLoss : 0.019
Valid CrossEntropyLoss : 0.714
Valid Accuracy : 0.848
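
Since validation loss starts rising well before the last epoch, it can help to locate the best epoch programmatically. The sketch below copies the validation losses printed above into a list and finds the epoch with the lowest loss, along with the epoch at which a simple patience-based early-stopping rule would have halted. Early stopping is not part of this tutorial's training function; best_epoch, early_stop_epoch, and the patience value of 2 are assumptions made for illustration.

```python
val_losses = [0.711, 0.459, 0.405, 0.387, 0.379, 0.381, 0.391, 0.410,
              0.436, 0.475, 0.517, 0.559, 0.611, 0.666, 0.714]

def best_epoch(losses):
    """1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1

def early_stop_epoch(losses, patience=2):
    """1-based epoch at which `patience` consecutive epochs have failed to
    improve on the best loss seen so far; None if that never happens."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

print(best_epoch(val_losses))                    # 5
print(early_stop_epoch(val_losses, patience=2))  # 7
```

Under these assumptions, training could have stopped after epoch 7 and kept the parameters from epoch 5, less than half of the 15 epochs actually run.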

Evaluate Network Performance

In this section, we have evaluated the performance of the network by calculating the accuracy, confusion matrix, and classification report on test predictions. The test accuracy (about 0.848) is slightly lower than that of our previous approach (about 0.859). The classification report shows that our model again does better on the World and Sports categories (f1-scores of 0.86 and 0.92) than on Business and Sci/Tech (0.80 and 0.81).

In [21]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
Test Accuracy : 0.8478947368421053
Classification Report :
              precision    recall  f1-score   support

       World       0.87      0.84      0.86      1900
      Sports       0.90      0.94      0.92      1900
    Business       0.80      0.81      0.80      1900
    Sci/Tech       0.81      0.81      0.81      1900

    accuracy                           0.85      7600
   macro avg       0.85      0.85      0.85      7600
weighted avg       0.85      0.85      0.85      7600


Confusion Matrix :
[[1597   91  115   97]
 [  46 1782   37   35]
 [  92   50 1531  227]
 [  94   51  221 1534]]

Explain Network Predictions Using LIME Algorithm

In this section, we have tried to explain a prediction made by the network using LIME algorithm. We have randomly selected a test example and our model correctly predicts the target label as 'Business' for it. Then, we have created a visualization explaining the network prediction. We can notice from the visualization that words like 'investor', 'fullquote', 'ticker', 'routers', 'forecasts', etc are contributing to the prediction.

In [ ]:
from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(123)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [sample+([0]* (50-len(sample))) if len(sample)<50 else sample[:50] for sample in X_batch] ## Bringing all samples to 50 length.
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

4. Approach 3 - Word Embeddings Averaged (Max Tokens=50, Embeddings Length=50)

Our approach in this section is almost the same as in the previous section, with a minor change in how the output of the embedding layer is handled. Both approaches we have tried so far flattened the embeddings of the tokens of a single text example. In this approach, we instead take the average of the embeddings of the tokens of a single text example. The only change needed to implement this is in the forward pass of the network. The rest of the code is almost the same as in the previous section.

Define Text Classification Network

Below, we have defined the network that we'll use for our task in this section. We have defined the layers in the __init__() method of the network class. During the forward pass, the output of the embedding layer is averaged across tokens (axis 1), turning an output of shape (batch_size, 50, 50) into (batch_size, 50), instead of being flattened as in our previous sections. Then, we have applied both dense layers to the averaged embeddings.

In [23]:
from mxnet.gluon import nn

class EmbeddingClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(EmbeddingClassifier, self).__init__(**kwargs)
        self.word_embeddings = nn.Embedding(len(vocab), 50)
        self.linear1 = nn.Dense(128, activation="relu")
        self.linear2 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.word_embeddings(x)
        x = x.mean(axis=1) ## Averaged Embeddings
        x = self.linear1(x)
        logits = self.linear2(x)

        return logits #nd.softmax(logits)

model = EmbeddingClassifier()

model
Out[23]:
EmbeddingClassifier(
  (word_embeddings): Embedding(66505 -> 50, float32)
  (flatten): Flatten
  (linear1): Dense(None -> 128, Activation(relu))
  (linear2): Dense(None -> 4, linear)
)
In [24]:
from mxnet import nd, initializer

model.initialize(initializer.Xavier())

preds = model(nd.random.randint(1,10000, shape=(10,50)))

preds.shape
Out[24]:
(10, 4)

Train Network

In this section, we have trained our network using the same settings that we used for all our previous approaches. The loss and accuracy values printed after each epoch show steady improvement: validation accuracy climbs from 0.335 after the first epoch to 0.862 after the last.

In [25]:
from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = EmbeddingClassifier()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.RMSProp(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:14<00:00,  8.33it/s]
Train CrossEntropyLoss : 1.385
Valid CrossEntropyLoss : 1.380
Valid Accuracy : 0.335
100%|██████████| 118/118 [00:13<00:00,  8.52it/s]
Train CrossEntropyLoss : 1.325
Valid CrossEntropyLoss : 1.208
Valid Accuracy : 0.439
100%|██████████| 118/118 [00:15<00:00,  7.67it/s]
Train CrossEntropyLoss : 1.066
Valid CrossEntropyLoss : 0.961
Valid Accuracy : 0.570
100%|██████████| 118/118 [00:15<00:00,  7.53it/s]
Train CrossEntropyLoss : 0.881
Valid CrossEntropyLoss : 0.839
Valid Accuracy : 0.650
100%|██████████| 118/118 [00:12<00:00,  9.14it/s]
Train CrossEntropyLoss : 0.773
Valid CrossEntropyLoss : 0.747
Valid Accuracy : 0.712
100%|██████████| 118/118 [00:12<00:00,  9.61it/s]
Train CrossEntropyLoss : 0.667
Valid CrossEntropyLoss : 0.640
Valid Accuracy : 0.765
100%|██████████| 118/118 [00:13<00:00,  9.04it/s]
Train CrossEntropyLoss : 0.560
Valid CrossEntropyLoss : 0.548
Valid Accuracy : 0.805
100%|██████████| 118/118 [00:13<00:00,  8.78it/s]
Train CrossEntropyLoss : 0.483
Valid CrossEntropyLoss : 0.494
Valid Accuracy : 0.824
100%|██████████| 118/118 [00:14<00:00,  7.98it/s]
Train CrossEntropyLoss : 0.436
Valid CrossEntropyLoss : 0.464
Valid Accuracy : 0.837
100%|██████████| 118/118 [00:12<00:00,  9.23it/s]
Train CrossEntropyLoss : 0.406
Valid CrossEntropyLoss : 0.445
Valid Accuracy : 0.845
100%|██████████| 118/118 [00:13<00:00,  8.90it/s]
Train CrossEntropyLoss : 0.383
Valid CrossEntropyLoss : 0.432
Valid Accuracy : 0.848
100%|██████████| 118/118 [00:15<00:00,  7.50it/s]
Train CrossEntropyLoss : 0.365
Valid CrossEntropyLoss : 0.421
Valid Accuracy : 0.854
100%|██████████| 118/118 [00:14<00:00,  7.92it/s]
Train CrossEntropyLoss : 0.348
Valid CrossEntropyLoss : 0.412
Valid Accuracy : 0.857
100%|██████████| 118/118 [00:14<00:00,  8.04it/s]
Train CrossEntropyLoss : 0.333
Valid CrossEntropyLoss : 0.404
Valid Accuracy : 0.860
100%|██████████| 118/118 [00:12<00:00,  9.24it/s]
Train CrossEntropyLoss : 0.320
Valid CrossEntropyLoss : 0.398
Valid Accuracy : 0.862

Evaluate Network Performance

In this section, we have evaluated the performance of the network as usual by calculating accuracy, classification report and confusion matrix metrics on test predictions. The test accuracy is about 86.2%, slightly higher than both of our previous approaches. The classification report indicates that the network is better at classifying documents of categories 'Sports' and 'World' than those of categories 'Business' and 'Sci/Tech', which both have an F1-score of 0.82.

In [26]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
Test Accuracy : 0.8623684210526316
Classification Report :
              precision    recall  f1-score   support

       World       0.86      0.87      0.87      1900
      Sports       0.92      0.95      0.93      1900
    Business       0.81      0.84      0.82      1900
    Sci/Tech       0.86      0.79      0.82      1900

    accuracy                           0.86      7600
   macro avg       0.86      0.86      0.86      7600
weighted avg       0.86      0.86      0.86      7600


Confusion Matrix :
[[1659   79  117   45]
 [  62 1797   12   29]
 [ 110   24 1597  169]
 [  96   57  246 1501]]
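
The per-class recall in the classification report can be recovered directly from the confusion matrix above: scikit-learn puts true labels on rows and predicted labels on columns, so recall is the diagonal entry divided by its row sum. A minimal plain-Python sketch, using the matrix printed above and assuming the class order 'World', 'Sports', 'Business', 'Sci/Tech' from the report (variable names are ours):

```python
classes = ["World", "Sports", "Business", "Sci/Tech"]
conf_mat = [
    [1659, 79, 117, 45],
    [62, 1797, 12, 29],
    [110, 24, 1597, 169],
    [96, 57, 246, 1501],
]

# Recall per class: correctly predicted examples / all examples of that class.
recall = {
    name: row[i] / sum(row)
    for i, (name, row) in enumerate(zip(classes, conf_mat))
}
# Overall accuracy: sum of the diagonal / total number of examples.
accuracy = sum(conf_mat[i][i] for i in range(len(classes))) / sum(map(sum, conf_mat))

for name, value in recall.items():
    print(f"{name:<9} recall = {value:.2f}")
print(f"accuracy = {accuracy:.4f}")
```

The rounded recalls (0.87, 0.95, 0.84, 0.79) and the accuracy (0.8624) agree with the report above.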

Explain Network Predictions Using LIME Algorithm

Here, we explain a prediction made by our network using the LIME algorithm. We randomly selected a test sample and predicted its label using our trained network, which correctly predicts the target label as 'Business'. We then generated a visualization explaining the prediction. It shows that words like 'stocks', 'investor', 'fullquote', 'earnings', etc. contribute to the prediction, which makes sense as they are commonly used in business news.

In [ ]:
from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(123)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [sample+([0]* (50-len(sample))) if len(sample)<50 else sample[:50] for sample in X_batch] ## Bringing all samples to 50 length.
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

Approach 4 - Word Embeddings Summed (Max Tokens=50, Embeddings Length=50)

Our approach in this section is almost exactly the same as in the previous section, with one minor change. In the previous approach, the embeddings from the embedding layer were averaged, whereas here we sum them: the embeddings of all tokens of a single text example are summed before being given to the dense layer. The rest of the code is almost the same as in our previous approaches.

Define Text Classification Network

Below, we have defined the network that we'll use for our text classification task in this section. The network design is exactly the same as in the previous section; the only change is in the forward pass, where we sum the embeddings over the token axis instead of averaging them.
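
Because every example is padded or truncated to exactly 50 tokens, the summed vector is simply the averaged vector scaled by the number of tokens. The toy sketch below (plain Python, made-up values, 4 tokens instead of 50, function names ours) checks that relationship; it is an illustration only and not part of the network code.

```python
def sum_tokens(example):
    """Sum the embeddings over the token axis."""
    return [sum(dim_values) for dim_values in zip(*example)]


def average_tokens(example):
    """Average the embeddings over the token axis."""
    return [sum(dim_values) / len(example) for dim_values in zip(*example)]


# Toy example: 4 tokens, embedding length 3 (values are illustrative only).
example = [
    [0.5, -1.0, 2.0],
    [1.5, 0.0, 0.0],
    [-1.0, 3.0, 1.0],
    [1.0, 2.0, 1.0],
]

summed = sum_tokens(example)
averaged = average_tokens(example)
print(summed)    # [2.0, 4.0, 4.0]
print(averaged)  # [0.5, 1.0, 1.0]

# With a fixed number of tokens, sum == n_tokens * mean.
assert summed == [len(example) * v for v in averaged]
```

The two variants therefore differ only in the scale of the input to the first dense layer, which may still affect how quickly training converges.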

In [28]:
from mxnet.gluon import nn

class EmbeddingClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(EmbeddingClassifier, self).__init__(**kwargs)
        self.word_embeddings = nn.Embedding(len(vocab), 50)
        self.linear1 = nn.Dense(128, activation="relu")
        self.linear2 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.word_embeddings(x)
        x = x.sum(axis=1)  ## Embeddings summed
        x = self.linear1(x)
        logits = self.linear2(x)

        return logits #nd.softmax(logits)

model = EmbeddingClassifier()

model
Out[28]:
EmbeddingClassifier(
  (word_embeddings): Embedding(66505 -> 50, float32)
  (flatten): Flatten
  (linear1): Dense(None -> 128, Activation(relu))
  (linear2): Dense(None -> 4, linear)
)
In [29]:
from mxnet import nd, initializer

model.initialize(initializer.Xavier())

preds = model(nd.random.randint(1,10000, shape=(10,50)))

preds.shape
Out[29]:
(10, 4)

Train Network

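Here, we have trained our network using exactly the same parameter settings that we used for all our previous approaches. The network converges much faster than in the previous section: validation accuracy reaches about 0.87 by the fifth epoch. After that, the validation loss starts rising (from 0.377 to 0.480) while the training loss keeps falling, which suggests that the network begins to overfit.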

In [30]:
from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = EmbeddingClassifier()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.RMSProp(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:14<00:00,  8.00it/s]
Train CrossEntropyLoss : 0.850
Valid CrossEntropyLoss : 0.517
Valid Accuracy : 0.815
100%|██████████| 118/118 [00:12<00:00,  9.50it/s]
Train CrossEntropyLoss : 0.420
Valid CrossEntropyLoss : 0.415
Valid Accuracy : 0.854
100%|██████████| 118/118 [00:15<00:00,  7.64it/s]
Train CrossEntropyLoss : 0.350
Valid CrossEntropyLoss : 0.392
Valid Accuracy : 0.862
100%|██████████| 118/118 [00:14<00:00,  8.05it/s]
Train CrossEntropyLoss : 0.313
Valid CrossEntropyLoss : 0.381
Valid Accuracy : 0.867
100%|██████████| 118/118 [00:13<00:00,  8.45it/s]
Train CrossEntropyLoss : 0.288
Valid CrossEntropyLoss : 0.377
Valid Accuracy : 0.869
100%|██████████| 118/118 [00:13<00:00,  8.63it/s]
Train CrossEntropyLoss : 0.268
Valid CrossEntropyLoss : 0.377
Valid Accuracy : 0.869
100%|██████████| 118/118 [00:12<00:00,  9.11it/s]
Train CrossEntropyLoss : 0.251
Valid CrossEntropyLoss : 0.382
Valid Accuracy : 0.870
100%|██████████| 118/118 [00:11<00:00, 10.07it/s]
Train CrossEntropyLoss : 0.237
Valid CrossEntropyLoss : 0.388
Valid Accuracy : 0.870
100%|██████████| 118/118 [00:13<00:00,  8.56it/s]
Train CrossEntropyLoss : 0.222
Valid CrossEntropyLoss : 0.397
Valid Accuracy : 0.870
100%|██████████| 118/118 [00:12<00:00,  9.31it/s]
Train CrossEntropyLoss : 0.210
Valid CrossEntropyLoss : 0.405
Valid Accuracy : 0.869
100%|██████████| 118/118 [00:13<00:00,  8.85it/s]
Train CrossEntropyLoss : 0.197
Valid CrossEntropyLoss : 0.420
Valid Accuracy : 0.868
100%|██████████| 118/118 [00:12<00:00,  9.66it/s]
Train CrossEntropyLoss : 0.185
Valid CrossEntropyLoss : 0.434
Valid Accuracy : 0.867
100%|██████████| 118/118 [00:12<00:00,  9.64it/s]
Train CrossEntropyLoss : 0.174
Valid CrossEntropyLoss : 0.447
Valid Accuracy : 0.867
100%|██████████| 118/118 [00:12<00:00,  9.70it/s]
Train CrossEntropyLoss : 0.164
Valid CrossEntropyLoss : 0.463
Valid Accuracy : 0.868
100%|██████████| 118/118 [00:12<00:00,  9.79it/s]
Train CrossEntropyLoss : 0.153
Valid CrossEntropyLoss : 0.480
Valid Accuracy : 0.866
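
In the log above, the validation loss bottoms out and then rises while the training loss keeps falling. A quick way to see where that happens is to look for the epoch with the lowest validation loss. The sketch below uses the validation losses copied from the log above; the variable names are ours.

```python
# Validation cross-entropy losses copied from the 15 epochs logged above.
valid_losses = [0.517, 0.415, 0.392, 0.381, 0.377, 0.377, 0.382, 0.388,
                0.397, 0.405, 0.420, 0.434, 0.447, 0.463, 0.480]

# Epochs are numbered from 1; min() returns the first epoch on ties.
best_epoch = min(range(1, len(valid_losses) + 1),
                 key=lambda epoch: valid_losses[epoch - 1])

print(best_epoch, valid_losses[best_epoch - 1])  # 5 0.377
```

Stopping training around that epoch, or keeping the parameters saved at it, is one common way to limit overfitting.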

Evaluate Network Performance

In this section, we have evaluated the performance of the network as usual by calculating accuracy, classification report and confusion matrix metrics on test predictions. The test accuracy of about 86.6% is the highest of all the approaches we have tried so far. The classification report shows that the network does best on categories 'Sports' and 'World', while 'Business' and 'Sci/Tech' remain the hardest categories.

In [31]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
Test Accuracy : 0.8660526315789474
Classification Report :
              precision    recall  f1-score   support

       World       0.88      0.86      0.87      1900
      Sports       0.91      0.95      0.93      1900
    Business       0.82      0.84      0.83      1900
    Sci/Tech       0.85      0.82      0.84      1900

    accuracy                           0.87      7600
   macro avg       0.87      0.87      0.87      7600
weighted avg       0.87      0.87      0.87      7600


Confusion Matrix :
[[1627   89  121   63]
 [  49 1804   25   22]
 [  90   42 1587  181]
 [  78   53  205 1564]]
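
The confusion matrix also shows which pairs of classes the network mixes up most. The plain-Python sketch below lists the largest off-diagonal entries of the matrix printed above, assuming rows are true labels and columns are predicted labels (scikit-learn's convention) and the class order from the report. Variable names are ours.

```python
classes = ["World", "Sports", "Business", "Sci/Tech"]
conf_mat = [
    [1627, 89, 121, 63],
    [49, 1804, 25, 22],
    [90, 42, 1587, 181],
    [78, 53, 205, 1564],
]

# Collect (count, true class, predicted class) for every misclassification cell.
errors = [
    (conf_mat[t][p], classes[t], classes[p])
    for t in range(len(classes))
    for p in range(len(classes))
    if t != p
]
errors.sort(reverse=True)

for count, true_name, pred_name in errors[:3]:
    print(f"{count:>4} '{true_name}' examples predicted as '{pred_name}'")
```

The two largest entries are 'Sci/Tech' predicted as 'Business' (205) and 'Business' predicted as 'Sci/Tech' (181), so most of the remaining errors come from confusing these two categories.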

Explain Network Predictions Using LIME Algorithm

In this section, we explain a prediction made by the network using the LIME algorithm. We randomly selected a test example and made a prediction on it using our trained network, which correctly predicts the target label as 'Business'. We then created a visualization to explain the prediction. It shows that words like 'fullquote', 'investor', 'stocks', 'ticker', 'forecasts', etc. contribute to the prediction.

In [ ]:
from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(123)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [sample+([0]* (50-len(sample))) if len(sample)<50 else sample[:50] for sample in X_batch] ## Bringing all samples to 50 length.
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

Summarized Results Of Approaches And Suggestions

The table below lists the approaches we tried and their accuracy on the test set.

Approach                                                                     | Test Accuracy
Approach 1 - Word Embeddings (Max Tokens=50, Embeddings Length=15)           | 85.85 %
Approach 2 - Word Embeddings (Max Tokens=50, Embeddings Length=50)           | 84.78 %
Approach 3 - Word Embeddings Averaged (Max Tokens=50, Embeddings Length=50)  | 86.23 %
Approach 4 - Word Embeddings Summed (Max Tokens=50, Embeddings Length=50)    | 86.60 %

Further Suggestions

Below, we have listed a few more suggestions that can be tried to improve network performance.

  • Try different embeddings sizes.
  • Try different token sizes per text example.
  • Try different weight initializers.
  • Try different numbers of linear layers and output units.
  • Train the network for more epochs (e.g. using the Adam optimizer).
  • Try different tokenizers available from gluonnlp. We have only used words, but including punctuation marks and special characters might improve results further.
  • Try character tokenizers instead of word tokenizers and try different char n-grams.
  • Try trained word embeddings like GloVe, FastText, etc.
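
As an illustration of the character n-gram suggestion, the sketch below splits a string into overlapping character n-grams in plain Python. The function name char_ngrams is hypothetical and not a gluonnlp API; a real pipeline would still need to build a vocabulary over the resulting n-grams before feeding their indexes to the embedding layer.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, in order."""
    text = text.lower()
    if len(text) < n:
        # Strings shorter than n are kept whole (or dropped if empty).
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("stocks", n=3))  # ['sto', 'toc', 'ock', 'cks']
print(char_ngrams("Oil", n=2))     # ['oi', 'il']
```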

Many more things can be tried to improve network performance, but it will take experimentation to find out which ones work.

This ends our small tutorial explaining how we can use the word embeddings approach for text classification tasks by designing networks using MXNet. We also explained various functionalities available from the gluonnlp module. Please feel free to let us know your views in the comments section.

Sunny Solanki
