Updated On: May-07, 2022 · Time Investment: ~30 mins

MXNet: LSTM Networks For Text Classification Tasks

Recurrent neural networks (RNNs) are commonly used when working with data that has some kind of internal sequence, like time-series, text, or speech data. In these kinds of datasets, the example at any stage depends on previous examples, and the previous few examples are the best estimate for predicting the current one. Traditional neural networks consisting of only dense layers are not good at capturing these kinds of sequences: they have no memory and cannot remember the previous data they saw. RNNs, in contrast, are quite good at remembering sequences. Though in theory vanilla RNNs should be able to capture sequences of any length, in practice they struggle with long sequences due to vanishing and exploding gradient problems. Hence, a variant of RNNs named Long Short-Term Memory (LSTM) was invented, which largely resolves these gradient problems of vanilla RNNs and is quite good at capturing long sequences. Below, we have included an image showing one cell of an LSTM layer. Many such LSTM cells are laid next to each other to create an LSTM layer, and an LSTM network can consist of one or more LSTM layers.

[Figure: A single cell of an LSTM layer]
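For reference, one LSTM cell at time step t combines the current input x_t with the previous hidden state h_{t-1} and cell state c_{t-1} through three gates. The textbook formulation, which MXNet's LSTM layer follows as well, is roughly:

% i, f, o are the input, forget, and output gates; sigma is the sigmoid function,
% and \odot denotes element-wise multiplication.
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}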

As a part of this tutorial, we have explained how we can design LSTM networks using the Python deep learning library MXNet (from Apache) for solving text classification tasks. We have used the word embeddings approach for encoding text data. The tutorial explains various ways of using LSTM layers in the network by trying different approaches and comparing their results at the end. We have also evaluated the performance of the networks by calculating various ML metrics. Apart from this, we have explained the predictions made by the network using the LIME algorithm.

Below, we have listed the important sections of the tutorial to give an overview of the material covered.

Important Sections Of Tutorial

  1. Prepare Data
    • 1.1 Load Dataset
    • 1.2 Tokenize Text Data And Populate Vocabulary
    • 1.3 Define Vectorization Function
    • 1.4 Create Data Loaders
  2. Approach 1: Single LSTM Layer Network (Max Tokens=25, Embeddings Length=40, LSTM Output=75)
    • Define LSTM Network
    • Train Network
    • Evaluate Network Performance
    • Explain Network Predictions using LIME Algorithm
  3. Approach 2: Single LSTM Layer Network (Max Tokens=50, Embeddings Length=40, LSTM Output=75)
  4. Approach 3: Single Bidirectional LSTM Layer Network (Max Tokens=50, Embeddings Length=40, LSTM Output=75)
  5. Approach 4: Multiple LSTM Layers Network (Max Tokens=50, Embeddings Length=40, LSTM Output=75)
  6. Approach 5: Stacking Multiple LSTM Layers (Max Tokens=50, Embeddings Length=40, LSTM Output=[50,60,75])
  7. Approach 6: Multiple Bidirectional LSTM Layers (Max Tokens=50, Embeddings Length=30, LSTM Output=50)
  8. Results Summary and Further Recommendations

Below, we have imported the necessary Python libraries that we are going to use in this tutorial and printed their versions as well.

import mxnet

print("MXNet Version : {}".format(mxnet.__version__))
MXNet Version : 1.9.0
import gluonnlp

print("GluonNLP Version : {}".format(gluonnlp.__version__))
GluonNLP Version : 0.10.0
import torchtext

print("TorchText Version : {}".format(torchtext.__version__))
TorchText Version : 0.10.1

1. Prepare Data

In this section, we have prepared our data to be given to the neural network for training and evaluation purposes. We are going to use the word embeddings approach to encode text data. We'll follow the below steps to encode text data to real-valued data using this approach.

  1. Loop through all text examples, tokenize them to generate tokens (words), and create a vocabulary of all unique tokens from the corpus. Vocabulary is a simple mapping from token to integer index. Each token is assigned a unique index starting from 0.
  2. Tokenize each text example to generate tokens and retrieve indexes of those tokens using vocabulary.
  3. Retrieve embeddings (real-valued vector) for these indexes.

The first two steps mentioned above are completed in this section, where we have created data loaders that return a list of token indexes for each text example. The third step is implemented in the neural network as an embedding layer that returns embeddings for those indexes.
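To make these three steps concrete, below is a minimal, self-contained sketch on toy data. The names (texts, vocab_map, embeddings) are hypothetical stand-ins for illustration only; the actual tokenizer, vocabulary, and embedding layer are built in the following sections.

import numpy as np

## Minimal sketch of the word embeddings pipeline on toy data (illustration only).
texts = ["hello world", "hello mxnet"]

## Step 1: Tokenize and build a token -> index mapping (the real code uses gluonnlp.Vocab).
tokens = [text.split() for text in texts]
vocab_map = {tok: idx for idx, tok in enumerate(sorted({t for sample in tokens for t in sample}))}

## Step 2: Map each text example to a list of token indexes.
indexed = [[vocab_map[t] for t in sample] for sample in tokens]

## Step 3: Look up a real-valued vector per index. Inside the network this is done by an
## embedding layer; here a random matrix stands in for the learned embedding weights.
embed_len = 4
embeddings = np.random.rand(len(vocab_map), embed_len)
encoded = [embeddings[sample] for sample in indexed]  ## per-example shape: (num_tokens, embed_len)
print(indexed, encoded[0].shape)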


1.1 Load Dataset

In this section, we have loaded the AG NEWS dataset that we are going to use for our classification task. The dataset has text documents for 4 different categories of news (["World", "Sports", "Business", "Sci/Tech"]). The dataset is already divided into train and test sets. After loading the datasets, we have also wrapped them in ArrayDataset objects, which is a standard MXNet data structure for holding data.

from mxnet.gluon.data import ArrayDataset

train_dataset, test_dataset = torchtext.datasets.AG_NEWS()

Y_train, X_train = zip(*list(train_dataset))
Y_test,  X_test  = zip(*list(test_dataset))

train_dataset = ArrayDataset(X_train, Y_train)
test_dataset  = ArrayDataset(X_test, Y_test)
train.csv: 29.5MB [00:00, 79.5MB/s]
test.csv: 1.86MB [00:00, 54.6MB/s]

1.2 Tokenize Text Data And Populate Vocabulary

Below, we have first defined a tokenizer. The tokenizer is a function that takes a text document as input and returns a list of tokens (words). The vocabulary will maintain all such unique words. We have created the tokenizer using a regular expression that catches sequences of one or more word characters. We have used the partial() function from Python's functools library to create the tokenization function.

import re
from functools import partial

tokenizer = partial(lambda X: re.findall(r"\w+", X))

tokenizer("Hello, How are you?")
['Hello', 'How', 'are', 'you']

Below, we are populating the vocabulary of all unique tokens. In order to create the vocabulary, we need to populate a Counter object with all unique tokens from the datasets along with their respective frequencies. The Counter object is a kind of dictionary available from Python's collections module that maintains token counts. We have started by creating an empty Counter object. Then, we loop through each text example of the train and test datasets and call the count_tokens() method (available from the data sub-module of the gluonnlp Python module) on the tokens of each text example. We also give our Counter object to this method, which keeps updating it with tokens and their respective frequencies.

In the end, we have called the Vocab() constructor from gluonnlp with the Counter object to create a vocabulary. The Vocab object holds a mapping from tokens to indexes. We have also printed the size of the vocabulary at the end.

from collections import Counter

counter = Counter()

for dataset in [train_dataset, test_dataset]:
    for X, Y in dataset:
        gluonnlp.data.count_tokens(tokenizer(X), to_lower=True, counter=counter)

vocab = gluonnlp.Vocab(counter=counter, special_token="<unk>", min_freq=1)

print("Vocabulary Size : {}".format(len(vocab)))
Vocabulary Size : 66505
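As a quick sanity check, the vocabulary mapping works in both directions. The calls below use gluonnlp's standard Vocab API; the exact indexes printed depend on the token frequencies in the corpus.

print(vocab["the"])               ## token -> index lookup
print(vocab.idx_to_token[:5])     ## first few entries of the index -> token mapping
print(vocab(["hello", "world"]))  ## calling the vocab maps a list of tokens to indexes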

1.3 Define Vectorization Function

In this section, we have defined a simple vectorization function that takes a batch of data (text examples and their target labels) as input and returns a list of token indexes for each text example. This function will be used by the data loaders later for preprocessing batches of data. The function tokenizes each text example into tokens and retrieves the indexes of those tokens using the vocabulary. We have decided to keep a maximum of 25 tokens per text example. Each text example can have a different number of tokens depending on the number of sentences and the number of words in each sentence. For text examples that have more than 25 tokens, we truncate the extra tokens, and for examples that have fewer than 25 tokens, we pad them with 0s (the <unk> token). After mapping tokens to their indexes, we return them and the target labels as MXNet ndarray objects.

import gluonnlp.data.batchify as bf
from mxnet import nd
import numpy as np

max_tokens = 25

clip_seq = gluonnlp.data.ClipSequence(max_tokens)
pad_seq  = gluonnlp.data.PadSequence(length=max_tokens, pad_val=0, clip=True)

def vectorize(batch):
    X, Y = list(zip(*batch))
    X = [[vocab(word) for word in tokenizer(sample)] for sample in X]
    #X = [sample+([0]* (50-len(sample))) if len(sample)<50 else sample[:50] for sample in X] .
    X = [pad_seq(tokens) for tokens in X] ## Bringing all samples to max_tokens length
    return nd.array(X, dtype=np.int32), nd.array(Y, dtype=np.int32) - 1 # Subtracting 1 from labels to bring them in range 0-3 from 1-4

vectorize([["how are you", 1]])
(
 [[381  44 175   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
     0   0   0   0   0   0   0]]
 <NDArray 1x25 @cpu(0)>,

 [0]
 <NDArray 1 @cpu(0)>)

1.4 Create Data Loaders

In this section, we have created train and test data loaders using the datasets we created earlier. The data loaders are used to loop through data in batches during the training process. We have kept the batch size at 1024. We have also provided the vectorization function defined in the previous cell to the batchify_fn parameter. This function will be applied to each batch of data before giving it to the neural network.

from mxnet.gluon.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=1024, batchify_fn=vectorize)
test_loader  = DataLoader(test_dataset,  batch_size=1024, batchify_fn=vectorize)

target_classes = ["World", "Sports", "Business", "Sci/Tech"]
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
(1024, 25) (1024,)

Approach 1: Single LSTM Layer Network (Max Tokens=25, Embeddings Length=40, LSTM Output=75)

Our approach in this section uses a recurrent neural network with a single LSTM layer. The network has three layers: an embedding layer, an LSTM layer, and a dense layer. The embedding layer maps token indexes to their embeddings, the LSTM layer processes the embeddings to understand the sequence, and the dense layer processes the output of the LSTM layer to generate 4 probabilities (for the 4 target classes) per text example. After training this network, we have also evaluated its performance by calculating various ML metrics. Apart from this, we have tried to explain the predictions made by the network using the LIME algorithm.

Define LSTM Network

Here, we have defined a network that we'll use for our text classification task. The network consists of 3 layers.

  1. Embedding Layer
  2. LSTM Layer
  3. Dense Layer

The first layer of the network is the embedding layer. We have created the embedding layer using the Embedding() constructor available from the 'nn' sub-module of the 'gluon' sub-module of the mxnet library. We have provided it with the length of the vocabulary (number of unique tokens) and the embedding length (40). An embedding length of 40 means that each token will be assigned a real-valued vector of length 40. When we create this layer, it internally creates a weight matrix of shape (vocab_len, embed_len). When we provide the network with a list of token indexes, it retrieves embeddings by indexing this weight matrix with the token indexes. The input shape to the embedding layer is (batch_size, max_tokens) = (batch_size, 25) and the output shape is (batch_size, max_tokens, embed_len) = (batch_size, 25, 40).

The second layer of the network is the LSTM layer. It takes the output of the embedding layer and loops through the embeddings of each text example to process them. We have created the LSTM layer using the LSTM() constructor available from the 'rnn' sub-module of the 'gluon' sub-module of mxnet. We have set the hidden_size of the LSTM layer to 75. We can also stack more than one LSTM layer by providing a count greater than 1 to the num_layers parameter; here, we have set it to 1. The input data shape to the LSTM layer is (batch_size, max_tokens, embed_len) and the output shape is (batch_size, max_tokens, hidden_size) = (batch_size, 25, 75).

The third layer is the dense layer with 4 output units (the same as the number of target classes). The output of the LSTM layer at the last timestep is given to the dense layer, whose output is returned as the prediction.
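To make the shape flow above concrete, here is a small standalone sketch (toy vocabulary size and random indexes, not the actual network defined below) that passes data through the three layers one by one:

from mxnet import nd
from mxnet.gluon import nn, rnn

## Standalone shape check of the three layers described above (toy values).
emb   = nn.Embedding(input_dim=100, output_dim=40)    ## toy vocabulary of 100 tokens
lstm  = rnn.LSTM(hidden_size=75, layout="NTC", input_size=40)
dense = nn.Dense(4)
for block in (emb, lstm, dense):
    block.initialize()

x   = nd.random.randint(0, 100, shape=(8, 25))        ## (batch_size, max_tokens)
e   = emb(x)                                          ## -> (8, 25, 40)
h   = lstm(e)                                         ## -> (8, 25, 75)
out = dense(h[:, -1])                                 ## last timestep output -> (8, 4)
print(e.shape, h.shape, out.shape)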

After defining the network, we initialized it and made predictions using random data for verification purposes. We have also printed the shape of weights/biases of layers of the network for information purposes.

We have not covered how to design networks using MXNet in detail here. If you are new to MXNet and want to learn how to create a neural network using it, please feel free to go through an introductory MXNet/Gluon tutorial first. It'll get you started with the library.

from mxnet.gluon import nn, rnn

embed_len = 40
hidden_dim = 75
n_layers = 1

class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=embed_len)
        self.dense = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)

        return self.dense(x[:, -1])

model = LSTMTextClassifier()

model
LSTMTextClassifier(
  (embedding): Embedding(66505 -> 40, float32)
  (lstm): LSTM(40 -> 75, NTC)
  (dense): Dense(None -> 4, linear)
)
from mxnet import init, initializer

model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

preds.shape
(10, 4)
for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
embedding0_weight         : (66505, 40)
lstm0_l0_i2h_weight       : (300, 40)
lstm0_l0_h2h_weight       : (300, 75)
lstm0_l0_i2h_bias         : (300,)
lstm0_l0_h2h_bias         : (300,)
dense0_weight             : (4, 75)
dense0_bias               : (4,)

Train Network

In this section, we are training the network that we designed in the previous cell. We have defined a function for training our network. The function takes the trainer object (network parameters), train data loader, validation data loader, and the number of epochs as input. It then executes the training loop for the given number of epochs. For each epoch, it loops through the training data in batches using the train data loader. For each batch of data, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates the network weights. It keeps track of the loss of each batch and prints the average loss of all batches at the end of each epoch. We have also created helper functions that calculate the validation loss and accuracy.

from mxnet import autograd
from tqdm import tqdm
from sklearn.metrics import accuracy_score

def MakePredictions(model, val_loader):
    Y_actuals, Y_preds = [], []
    for X_batch, Y_batch in val_loader:
        preds = model(X_batch)
        preds = nd.softmax(preds)
        Y_actuals.append(Y_batch)
        Y_preds.append(preds.argmax(axis=-1))

    Y_actuals, Y_preds = nd.concatenate(Y_actuals), nd.concatenate(Y_preds)
    return Y_actuals, Y_preds

def CalcValLoss(model, val_loader):
    losses = []
    for X_batch, Y_batch in val_loader:
        val_loss = loss_func(model(X_batch), Y_batch)
        val_loss = val_loss.mean().asscalar()
        losses.append(val_loss)
    print("Valid CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))

def TrainModelInBatches(trainer, train_loader, val_loader, epochs):
    for i in range(1, epochs+1):
        losses = [] ## Record loss of each batch
        for X_batch, Y_batch in tqdm(train_loader):
            with autograd.record():
                preds = model(X_batch) ## Forward pass to make predictions
                train_loss = loss_func(preds.squeeze(), Y_batch) ## Calculate Loss
            train_loss.backward() ## Calculate Gradients

            train_loss = train_loss.mean().asscalar()
            losses.append(train_loss)

            trainer.step(len(X_batch)) ## Update weights

        print("Train CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))
        CalcValLoss(model, val_loader)
        Y_actuals, Y_preds = MakePredictions(model, val_loader)
        print("Valid Accuracy : {:.3f}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))

Below, we are actually training our network using the function defined in the previous cell. We have first initialized the number of epochs to 15 and the learning rate to 0.001. Then, we have initialized our LSTM text classifier, the cross entropy loss, the Adam optimizer, and the Trainer object. At last, we have called the training routine with the necessary parameters to perform the training process. We can notice from the loss and accuracy getting printed after each epoch that our network is doing a good job at the text classification task.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:30<00:00,  3.89it/s]
Train CrossEntropyLoss : 1.053
Valid CrossEntropyLoss : 0.719
Valid Accuracy : 0.717
100%|██████████| 118/118 [00:29<00:00,  3.95it/s]
Train CrossEntropyLoss : 0.567
Valid CrossEntropyLoss : 0.548
Valid Accuracy : 0.810
100%|██████████| 118/118 [00:29<00:00,  3.97it/s]
Train CrossEntropyLoss : 0.437
Valid CrossEntropyLoss : 0.518
Valid Accuracy : 0.823
100%|██████████| 118/118 [00:29<00:00,  4.01it/s]
Train CrossEntropyLoss : 0.374
Valid CrossEntropyLoss : 0.512
Valid Accuracy : 0.826
100%|██████████| 118/118 [00:29<00:00,  3.97it/s]
Train CrossEntropyLoss : 0.334
Valid CrossEntropyLoss : 0.533
Valid Accuracy : 0.822
100%|██████████| 118/118 [00:29<00:00,  3.98it/s]
Train CrossEntropyLoss : 0.308
Valid CrossEntropyLoss : 0.538
Valid Accuracy : 0.825
100%|██████████| 118/118 [00:30<00:00,  3.93it/s]
Train CrossEntropyLoss : 0.301
Valid CrossEntropyLoss : 0.538
Valid Accuracy : 0.821
100%|██████████| 118/118 [00:29<00:00,  3.97it/s]
Train CrossEntropyLoss : 0.284
Valid CrossEntropyLoss : 0.575
Valid Accuracy : 0.815
100%|██████████| 118/118 [00:29<00:00,  3.94it/s]
Train CrossEntropyLoss : 0.266
Valid CrossEntropyLoss : 0.594
Valid Accuracy : 0.817
100%|██████████| 118/118 [00:29<00:00,  3.99it/s]
Train CrossEntropyLoss : 0.255
Valid CrossEntropyLoss : 0.605
Valid Accuracy : 0.817
100%|██████████| 118/118 [00:30<00:00,  3.93it/s]
Train CrossEntropyLoss : 0.249
Valid CrossEntropyLoss : 0.608
Valid Accuracy : 0.816
100%|██████████| 118/118 [00:29<00:00,  3.98it/s]
Train CrossEntropyLoss : 0.250
Valid CrossEntropyLoss : 0.660
Valid Accuracy : 0.801
100%|██████████| 118/118 [00:29<00:00,  3.97it/s]
Train CrossEntropyLoss : 0.255
Valid CrossEntropyLoss : 0.652
Valid Accuracy : 0.803
100%|██████████| 118/118 [00:29<00:00,  3.96it/s]
Train CrossEntropyLoss : 0.244
Valid CrossEntropyLoss : 0.626
Valid Accuracy : 0.812
100%|██████████| 118/118 [00:29<00:00,  3.96it/s]
Train CrossEntropyLoss : 0.223
Valid CrossEntropyLoss : 0.656
Valid Accuracy : 0.810

Evaluate Network Performance

In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report (precision, recall, and f1-score per target class), and confusion matrix on test predictions. We can notice from the accuracy score that our model has done a good job at the classification task. We have calculated the ML metrics using functions available from the Python library scikit-learn.

Scikit-learn provides functions to calculate many ML metrics. If you are interested in learning about the various functions available from sklearn to calculate ML metrics then please check the below link which covers the majority of them in detail.

Apart from the calculations, we have also plotted the confusion matrix using the Python library scikit-plot. From the visualization, we can notice that our model is doing a better job at classifying text documents in the Sports category compared to the other categories.

Please feel free to check the below link if you are new to scikit-plot as it provides visualizations for many ML metrics.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
Test Accuracy : 0.8103947368421053
Classification Report :
              precision    recall  f1-score   support

       World       0.81      0.82      0.81      1900
      Sports       0.85      0.90      0.87      1900
    Business       0.76      0.79      0.77      1900
    Sci/Tech       0.82      0.74      0.78      1900

    accuracy                           0.81      7600
   macro avg       0.81      0.81      0.81      7600
weighted avg       0.81      0.81      0.81      7600


Confusion Matrix :
[[1550  124  144   82]
 [  78 1706   69   47]
 [ 139   78 1498  185]
 [ 143   95  257 1405]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

[Figure: Normalized confusion matrix]

Explain Network Predictions using LIME Algorithm

In this section, we have explained the predictions made by our network using the LIME algorithm. We'll be using the Python library lime, which has an implementation of the algorithm. It lets us create visualizations that highlight the words in a text document that contributed to predicting a particular target label.

If you are someone who is new to the concept of LIME and want to learn about it in depth then we would suggest that you go through the below links.

Below, we have simply loaded text examples from the test dataset.

X_test, Y_test = [], []
for X, Y in test_dataset:
    X_test.append(X)
    Y_test.append(Y-1)

Below, we have first created an instance of LimeTextExplainer which we'll use later to create an explanation object for explaining network predictions.

Then, we have defined a prediction function. The function takes a batch of text examples as input and returns the probabilities predicted by the network for them. It tokenizes the text examples, retrieves their token indexes, and gives them to the network to make predictions. It also applies the softmax activation function to the output of the network to generate probabilities. We'll use this function later when generating an explanation for a text example.

After defining the function, we randomly selected a text example from the test dataset and made a prediction on it using our trained network. Our network correctly predicts the target label as Sci/Tech for the selected text example. Next, we'll explain this prediction.

from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

def make_predictions(X_batch_text):
    X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_batch_text]
    X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to max_tokens length
    logits = model(nd.array(X_batch, dtype=np.int32))
    preds = nd.softmax(logits)
    return preds.asnumpy()

rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to max_tokens length
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Actual :     ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
Actual :      Sci/Tech
Prediction :  Sci/Tech

Below, we have first called the explain_instance() method on the LimeTextExplainer object. We have provided the selected text example, the prediction function, and the target label to the method. This method creates an instance of an Explanation object which has details about the words contributing to the prediction.

Next, we have called the show_in_notebook() method on the Explanation object to generate a visualization of the explanation. We can notice from the visualization that words like 'remote', 'software', 'umbrella', 'departments', etc. are contributing to predicting the target label as Sci/Tech.

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

[Figure: LIME explanation visualization]
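If you prefer the contribution weights as plain numbers rather than the notebook visualization, the same Explanation object can also be queried with its as_list() method (standard lime API); the label argument below is the class index we asked to be explained.

print(explanation.as_list(label=Y_test[idx]))  ## list of (word, weight) pairs for the explained label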

Approach 2: Single LSTM Layer Network (Max Tokens=50, Embeddings Length=40, LSTM Output=75)

Our approach in this section is almost the same as in the previous section, with the main change being that we use a maximum of 50 tokens per text example this time (we have also added one extra dense layer to the network). The majority of the code is the same as earlier.

Create Data Loaders

Below, we have reinitialized the data loaders so that they use the new maximum token length, which is set to 50 this time.

from mxnet.gluon.data import DataLoader

max_tokens = 50
pad_seq  = gluonnlp.data.PadSequence(length=max_tokens, pad_val=0, clip=True)

train_loader = DataLoader(train_dataset, batch_size=1024, batchify_fn=vectorize)
test_loader  = DataLoader(test_dataset,  batch_size=1024, batchify_fn=vectorize)

target_classes = ["World", "Sports", "Business", "Sci/Tech"]
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
(1024, 50) (1024,)

Define Network

In this section, we have defined the network that we'll use for our task in this section. The definition of the network is almost the same as earlier, with one difference: we have added an extra dense layer with 100 output units and relu activation between the LSTM layer and the final output layer. After defining the network, we have also initialized it and printed the shape of the weights/biases of its layers.

from mxnet.gluon import nn, rnn

embed_len = 40
hidden_dim = 75
n_layers = 1

class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=embed_len)
        self.dense1 = nn.Dense(100, activation="relu")
        self.dense2 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)

        x = self.dense1(x[:, -1])

        return self.dense2(x)

model = LSTMTextClassifier()

model

from mxnet import init, initializer

model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
embedding2_weight         : (66505, 40)
lstm2_l0_i2h_weight       : (300, 40)
lstm2_l0_h2h_weight       : (300, 75)
lstm2_l0_i2h_bias         : (300,)
lstm2_l0_h2h_bias         : (300,)
dense2_weight             : (100, 75)
dense2_bias               : (100,)
dense3_weight             : (4, 100)
dense3_bias               : (4,)

Train Network

In this section, we have trained our network using exactly the same settings that we have used in our previous section. We'll be keeping these settings the same for all approaches so that comparison becomes easier. We can notice from the loss and accuracy values getting printed after each epoch that our network is doing a good job at the text classification task.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:51<00:00,  2.28it/s]
Train CrossEntropyLoss : 1.265
Valid CrossEntropyLoss : 0.985
Valid Accuracy : 0.520
100%|██████████| 118/118 [00:52<00:00,  2.24it/s]
Train CrossEntropyLoss : 0.811
Valid CrossEntropyLoss : 0.650
Valid Accuracy : 0.766
100%|██████████| 118/118 [00:51<00:00,  2.28it/s]
Train CrossEntropyLoss : 0.484
Valid CrossEntropyLoss : 0.450
Valid Accuracy : 0.843
100%|██████████| 118/118 [00:51<00:00,  2.31it/s]
Train CrossEntropyLoss : 0.356
Valid CrossEntropyLoss : 0.483
Valid Accuracy : 0.843
100%|██████████| 118/118 [00:52<00:00,  2.26it/s]
Train CrossEntropyLoss : 0.319
Valid CrossEntropyLoss : 0.534
Valid Accuracy : 0.841
100%|██████████| 118/118 [00:51<00:00,  2.28it/s]
Train CrossEntropyLoss : 0.307
Valid CrossEntropyLoss : 0.494
Valid Accuracy : 0.844
100%|██████████| 118/118 [00:51<00:00,  2.31it/s]
Train CrossEntropyLoss : 0.285
Valid CrossEntropyLoss : 0.473
Valid Accuracy : 0.845
100%|██████████| 118/118 [00:51<00:00,  2.28it/s]
Train CrossEntropyLoss : 0.273
Valid CrossEntropyLoss : 0.511
Valid Accuracy : 0.836
100%|██████████| 118/118 [00:51<00:00,  2.28it/s]
Train CrossEntropyLoss : 0.265
Valid CrossEntropyLoss : 0.480
Valid Accuracy : 0.847
100%|██████████| 118/118 [00:51<00:00,  2.31it/s]
Train CrossEntropyLoss : 0.247
Valid CrossEntropyLoss : 0.462
Valid Accuracy : 0.847
100%|██████████| 118/118 [00:51<00:00,  2.29it/s]
Train CrossEntropyLoss : 0.241
Valid CrossEntropyLoss : 0.494
Valid Accuracy : 0.846
100%|██████████| 118/118 [00:51<00:00,  2.28it/s]
Train CrossEntropyLoss : 0.237
Valid CrossEntropyLoss : 0.601
Valid Accuracy : 0.818
100%|██████████| 118/118 [00:50<00:00,  2.32it/s]
Train CrossEntropyLoss : 0.234
Valid CrossEntropyLoss : 0.571
Valid Accuracy : 0.823
100%|██████████| 118/118 [00:52<00:00,  2.26it/s]
Train CrossEntropyLoss : 0.231
Valid CrossEntropyLoss : 0.502
Valid Accuracy : 0.851
100%|██████████| 118/118 [00:52<00:00,  2.27it/s]
Train CrossEntropyLoss : 0.214
Valid CrossEntropyLoss : 0.544
Valid Accuracy : 0.852

Evaluate Network Performance

In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report, and confusion matrix on test predictions. We can notice from the accuracy score that it is noticeably better compared to our previous approach. It seems that increasing the maximum number of tokens per text example has helped improve accuracy. We have also plotted the confusion matrix for reference purposes.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
Test Accuracy : 0.8515789473684211
Classification Report :
              precision    recall  f1-score   support

       World       0.88      0.83      0.85      1900
      Sports       0.87      0.97      0.92      1900
    Business       0.81      0.82      0.82      1900
    Sci/Tech       0.85      0.79      0.82      1900

    accuracy                           0.85      7600
   macro avg       0.85      0.85      0.85      7600
weighted avg       0.85      0.85      0.85      7600


Confusion Matrix :
[[1568  151  130   51]
 [  32 1836   20   12]
 [  70   67 1567  196]
 [ 119   59  221 1501]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

[Figure: Normalized confusion matrix]

Explain Network Predictions using LIME Algorithm

In this section, we have explained predictions made by our network using LIME algorithm. First, we have randomly selected a text example from the test dataset and our network correctly predicts the target label as Sci/Tech for it. Then, we generated an explanation visualization for the selected text example. The visualization highlights that words like 'management', 'wireless', 'software', 'technology', 'departments', 'remote', etc are contributing to predicting the target label as Sci/Tech which makes sense as they are commonly used words in the technology world.

from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes)

rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Actual :     ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

[Figure: LIME explanation visualization]

Approach 3: Single Bidirectional LSTM Layer Network (Max Tokens=50, Embeddings Length=40, LSTM Output=75)

Our approach in this section is exactly the same as the approach in the previous section with one minor difference: we have used a bidirectional LSTM layer this time. By default, LSTM layers are unidirectional, which means that they go through the sequence in the forward direction only to understand it. A bidirectional LSTM layer goes through the data in both forward and backward directions to find patterns. The majority of the code is the same as earlier, with the only change being the LSTM layer becoming bidirectional.
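One practical consequence, sketched below on random data, is that a bidirectional LSTM concatenates the forward and backward outputs, so its output feature size is twice the hidden size (150 instead of 75 here):

from mxnet import nd
from mxnet.gluon import rnn

## Standalone check (random data): a bidirectional LSTM doubles the output feature size.
bilstm = rnn.LSTM(hidden_size=75, layout="NTC", input_size=40, bidirectional=True)
bilstm.initialize()
out = bilstm(nd.random.uniform(shape=(8, 50, 40)))  ## (batch_size, max_tokens, embed_len)
print(out.shape)                                    ## (8, 50, 150) = (batch, tokens, 2*hidden_dim)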

Define Network

Below, we have defined the network that we'll use for the text classification task. The definition of the network is exactly the same as in our previous approach with only one minor change: the LSTM layer is bidirectional. Inside the LSTM() constructor, we have set the bidirectional parameter to True to create a bidirectional LSTM layer. As usual, after defining the network, we have initialized it and printed the shape of the weights/biases of its layers.

from mxnet.gluon import nn, rnn

embed_len = 40
hidden_dim = 75
n_layers = 1

class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers,
                             layout="NTC", input_size=embed_len,
                             bidirectional=True # Bidirectional LSTM
                            )
        self.dense1 = nn.Dense(100, activation="relu")
        self.dense2 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)
        ## Output feature size of the LSTM will be 150 (2*hidden_dim): one set of features from the forward pass and one from the backward pass (bidirectional)
        x = self.dense1(x[:, -1])

        return self.dense2(x)

model = LSTMTextClassifier()

model
LSTMTextClassifier(
  (embedding): Embedding(66505 -> 40, float32)
  (lstm): LSTM(40 -> 75, NTC, bidirectional)
  (dense1): Dense(None -> 100, Activation(relu))
  (dense2): Dense(None -> 4, linear)
)
from mxnet import init, initializer

model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
embedding4_weight         : (66505, 40)
lstm4_l0_i2h_weight       : (300, 40)
lstm4_l0_h2h_weight       : (300, 75)
lstm4_l0_i2h_bias         : (300,)
lstm4_l0_h2h_bias         : (300,)
lstm4_r0_i2h_weight       : (300, 40)
lstm4_r0_h2h_weight       : (300, 75)
lstm4_r0_i2h_bias         : (300,)
lstm4_r0_h2h_bias         : (300,)
dense6_weight             : (100, 150)
dense6_bias               : (100,)
dense7_weight             : (4, 100)
dense7_bias               : (4,)

Train Network

In this section, we have trained our network using the same settings we have been using for all our approaches. We can notice from the loss and accuracy values getting printed after each epoch that our model is doing a good job at the classification task.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [01:35<00:00,  1.23it/s]
Train CrossEntropyLoss : 1.265
Valid CrossEntropyLoss : 0.960
Valid Accuracy : 0.562
100%|██████████| 118/118 [01:35<00:00,  1.24it/s]
Train CrossEntropyLoss : 0.756
Valid CrossEntropyLoss : 0.633
Valid Accuracy : 0.766
100%|██████████| 118/118 [01:36<00:00,  1.23it/s]
Train CrossEntropyLoss : 0.505
Valid CrossEntropyLoss : 0.482
Valid Accuracy : 0.836
100%|██████████| 118/118 [01:35<00:00,  1.23it/s]
Train CrossEntropyLoss : 0.366
Valid CrossEntropyLoss : 0.444
Valid Accuracy : 0.850
100%|██████████| 118/118 [01:36<00:00,  1.22it/s]
Train CrossEntropyLoss : 0.322
Valid CrossEntropyLoss : 0.489
Valid Accuracy : 0.837
100%|██████████| 118/118 [01:36<00:00,  1.22it/s]
Train CrossEntropyLoss : 0.302
Valid CrossEntropyLoss : 0.537
Valid Accuracy : 0.836
100%|██████████| 118/118 [01:36<00:00,  1.22it/s]
Train CrossEntropyLoss : 0.284
Valid CrossEntropyLoss : 0.489
Valid Accuracy : 0.852
100%|██████████| 118/118 [01:36<00:00,  1.23it/s]
Train CrossEntropyLoss : 0.270
Valid CrossEntropyLoss : 0.540
Valid Accuracy : 0.840
100%|██████████| 118/118 [01:36<00:00,  1.22it/s]
Train CrossEntropyLoss : 0.269
Valid CrossEntropyLoss : 0.454
Valid Accuracy : 0.848
100%|██████████| 118/118 [01:36<00:00,  1.22it/s]
Train CrossEntropyLoss : 0.237
Valid CrossEntropyLoss : 0.538
Valid Accuracy : 0.839
100%|██████████| 118/118 [01:36<00:00,  1.22it/s]
Train CrossEntropyLoss : 0.232
Valid CrossEntropyLoss : 0.553
Valid Accuracy : 0.843
100%|██████████| 118/118 [01:37<00:00,  1.21it/s]
Train CrossEntropyLoss : 0.235
Valid CrossEntropyLoss : 0.574
Valid Accuracy : 0.837
100%|██████████| 118/118 [01:36<00:00,  1.22it/s]
Train CrossEntropyLoss : 0.233
Valid CrossEntropyLoss : 0.662
Valid Accuracy : 0.826
100%|██████████| 118/118 [01:36<00:00,  1.22it/s]
Train CrossEntropyLoss : 0.228
Valid CrossEntropyLoss : 0.633
Valid Accuracy : 0.830
100%|██████████| 118/118 [01:35<00:00,  1.23it/s]
Train CrossEntropyLoss : 0.209
Valid CrossEntropyLoss : 0.631
Valid Accuracy : 0.840

Evaluate Network Performance

Below, we have evaluated the performance of our trained network by calculating the accuracy score, classification report and confusion matrix on test predictions. We can notice from the accuracy score that it is a little lower compared to our previous approach. The bidirectional LSTM layer was not able to improve the performance of the network further in this case. We have also plotted the confusion matrix for reference purposes.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
Test Accuracy : 0.8402631578947368
Classification Report :
              precision    recall  f1-score   support

       World       0.85      0.82      0.84      1900
      Sports       0.85      0.97      0.90      1900
    Business       0.80      0.83      0.81      1900
    Sci/Tech       0.87      0.75      0.80      1900

    accuracy                           0.84      7600
   macro avg       0.84      0.84      0.84      7600
weighted avg       0.84      0.84      0.84      7600


Confusion Matrix :
[[1558  169  129   44]
 [  39 1835   21    5]
 [  76   86 1572  166]
 [ 157   70  252 1421]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

[Figure: Normalized confusion matrix]

Explain Network Predictions using LIME Algorithm

In this section, we have explained predictions of the trained network using LIME algorithm. The network correctly predicts the target label as Sci/Tech for the selected text example. The visualization highlights that words like 'management', 'wireless', 'software', 'devices', 'department', 'technology', 'executive', etc are contributing to predicting target label as Sci/Tech.

from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes)

rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Actual :     ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

[Figure: LIME explanation visualization]

Approach 4: Multiple LSTM Layers Network (Max Tokens=50, Embeddings Length=40, LSTM Output=75)

Our approach in this section uses a recurrent neural network with multiple LSTM layers. We have introduced more than one LSTM layer into the architecture of the network to see whether stacking multiple LSTM layers helps improve accuracy. The majority of the code is almost the same as earlier, with the only change being in the architecture of the network.

Define Network

Below, we have defined a network that we'll use for our classification task in this section. The code for the architecture is the same as earlier with only one change: we have set the num_layers parameter in the LSTM() constructor to 3, asking it to create 3 consecutive LSTM layers. That is the only difference in the code; the rest is the same as earlier. As usual, after defining the network, we have initialized it and printed the shape of the weights/biases of its layers.
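As a quick sanity check of what num_layers=3 does, the small sketch below (random data, not the actual network) shows that the LSTM layers are stacked internally while the output shape stays the same as for a single layer with the same hidden size:

from mxnet import nd
from mxnet.gluon import rnn

## Standalone check (random data): num_layers=3 stacks LSTMs internally,
## while the output shape stays (batch_size, max_tokens, hidden_size).
stacked = rnn.LSTM(hidden_size=75, num_layers=3, layout="NTC", input_size=40)
stacked.initialize()
out = stacked(nd.random.uniform(shape=(8, 50, 40)))
print(out.shape)                                    ## (8, 50, 75)
print(len(list(stacked.collect_params().items())))  ## 12 parameter arrays: i2h/h2h weight and bias per layer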

from mxnet.gluon import nn, rnn

embed_len = 40
hidden_dim = 75
n_layers = 3

class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=embed_len)
        self.dense1 = nn.Dense(100, activation="relu")
        self.dense2 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)

        x = self.dense1(x[:, -1])

        return self.dense2(x)

model = LSTMTextClassifier()

model
LSTMTextClassifier(
  (embedding): Embedding(66505 -> 40, float32)
  (lstm): LSTM(40 -> 75, NTC, num_layers=3)
  (dense1): Dense(None -> 100, Activation(relu))
  (dense2): Dense(None -> 4, linear)
)
from mxnet import init, initializer

model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
embedding6_weight         : (66505, 40)
lstm6_l0_i2h_weight       : (300, 40)
lstm6_l0_h2h_weight       : (300, 75)
lstm6_l0_i2h_bias         : (300,)
lstm6_l0_h2h_bias         : (300,)
lstm6_l1_i2h_weight       : (300, 75)
lstm6_l1_h2h_weight       : (300, 75)
lstm6_l1_i2h_bias         : (300,)
lstm6_l1_h2h_bias         : (300,)
lstm6_l2_i2h_weight       : (300, 75)
lstm6_l2_h2h_weight       : (300, 75)
lstm6_l2_i2h_bias         : (300,)
lstm6_l2_h2h_bias         : (300,)
dense10_weight            : (100, 75)
dense10_bias              : (100,)
dense11_weight            : (4, 100)
dense11_bias              : (4,)

Train Network

In this section, we have trained our network using exactly the same settings that we have used for all our approaches. We can notice from the loss and accuracy values that our network is doing a good job at the text classification task.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [02:10<00:00,  1.11s/it]
Train CrossEntropyLoss : 1.171
Valid CrossEntropyLoss : 0.768
Valid Accuracy : 0.660
100%|██████████| 118/118 [02:10<00:00,  1.11s/it]
Train CrossEntropyLoss : 0.614
Valid CrossEntropyLoss : 0.502
Valid Accuracy : 0.823
100%|██████████| 118/118 [02:11<00:00,  1.12s/it]
Train CrossEntropyLoss : 0.413
Valid CrossEntropyLoss : 0.478
Valid Accuracy : 0.839
100%|██████████| 118/118 [02:11<00:00,  1.11s/it]
Train CrossEntropyLoss : 0.357
Valid CrossEntropyLoss : 0.477
Valid Accuracy : 0.843
100%|██████████| 118/118 [02:12<00:00,  1.12s/it]
Train CrossEntropyLoss : 0.323
Valid CrossEntropyLoss : 0.449
Valid Accuracy : 0.854
100%|██████████| 118/118 [02:12<00:00,  1.12s/it]
Train CrossEntropyLoss : 0.301
Valid CrossEntropyLoss : 0.497
Valid Accuracy : 0.847
100%|██████████| 118/118 [02:12<00:00,  1.12s/it]
Train CrossEntropyLoss : 0.297
Valid CrossEntropyLoss : 0.465
Valid Accuracy : 0.849
100%|██████████| 118/118 [02:12<00:00,  1.12s/it]
Train CrossEntropyLoss : 0.272
Valid CrossEntropyLoss : 0.475
Valid Accuracy : 0.849
100%|██████████| 118/118 [02:12<00:00,  1.12s/it]
Train CrossEntropyLoss : 0.252
Valid CrossEntropyLoss : 0.497
Valid Accuracy : 0.852
100%|██████████| 118/118 [02:12<00:00,  1.12s/it]
Train CrossEntropyLoss : 0.246
Valid CrossEntropyLoss : 0.475
Valid Accuracy : 0.853
100%|██████████| 118/118 [02:12<00:00,  1.12s/it]
Train CrossEntropyLoss : 0.251
Valid CrossEntropyLoss : 0.507
Valid Accuracy : 0.842
100%|██████████| 118/118 [02:13<00:00,  1.13s/it]
Train CrossEntropyLoss : 0.234
Valid CrossEntropyLoss : 0.604
Valid Accuracy : 0.822
100%|██████████| 118/118 [02:12<00:00,  1.12s/it]
Train CrossEntropyLoss : 0.219
Valid CrossEntropyLoss : 0.585
Valid Accuracy : 0.841
100%|██████████| 118/118 [02:12<00:00,  1.12s/it]
Train CrossEntropyLoss : 0.207
Valid CrossEntropyLoss : 0.567
Valid Accuracy : 0.849
100%|██████████| 118/118 [02:12<00:00,  1.13s/it]
Train CrossEntropyLoss : 0.204
Valid CrossEntropyLoss : 0.578
Valid Accuracy : 0.845

Evaluate Network Performance

In this section, we have evaluated the performance of our trained network by calculating accuracy score, classification report and confusion matrix metrics on test predictions. We can notice from the accuracy that it is quite good but a little less compared to our second approach. We have also plotted the confusion matrix for reference purposes.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
Test Accuracy : 0.8447368421052631
Classification Report :
              precision    recall  f1-score   support

       World       0.81      0.87      0.84      1900
      Sports       0.90      0.93      0.92      1900
    Business       0.80      0.82      0.81      1900
    Sci/Tech       0.87      0.76      0.81      1900

    accuracy                           0.84      7600
   macro avg       0.85      0.84      0.84      7600
weighted avg       0.85      0.84      0.84      7600


Confusion Matrix :
[[1648   89  124   39]
 [  95 1769   26   10]
 [ 137   36 1551  176]
 [ 149   67  232 1452]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

[Figure: Normalized confusion matrix]

Explain Network Predictions using LIME Algorithm

In this section, we have explained predictions made by our trained network using LIME algorithm. The network correctly predicts the target label as Sci/Tech for the selected text example. The visualization highlights that words like 'wireless', 'software', 'technology', 'management', 'devices', 'remote', etc contributed to predicting target label as Sci/Tech.

from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes)

rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Actual :     ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

[Figure: LIME explanation visualization]

Approach 5: Stacking Multiple LSTM Layers (Max Tokens=50, Embeddings Length=40, LSTM Output=[50,60,75])

Our approach in this section again creates a recurrent neural network with multiple LSTM layers but this time the output units of each LSTM layer are different, unlike the previous approach where all were the same. The majority of the code in the section is the same as earlier with only a change in network architecture.

Define Network

Below, we have defined a network that we'll use for our text classification task in this section. The network has an embedding layer like earlier. Three LSTM layers are defined independently this time, with output units of 50, 60, and 75 respectively. The three LSTM layers are applied one by one to the output of the embedding layer. The output of the third LSTM layer is given to the first dense layer, whose output is given to the second dense layer for processing. The output of the second dense layer is the prediction of the network, as usual.

After defining the network, we initialized it and printed the shape of weights/biases of layers.

from mxnet.gluon import nn, rnn

embed_len = 40
hidden_dim1 = 50
hidden_dim2 = 60
hidden_dim3 = 75
n_layers = 1

class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm1 = rnn.LSTM(hidden_size=hidden_dim1, num_layers=n_layers, layout="NTC", input_size=embed_len)
        self.lstm2 = rnn.LSTM(hidden_size=hidden_dim2, num_layers=n_layers, layout="NTC", input_size=hidden_dim1)
        self.lstm3 = rnn.LSTM(hidden_size=hidden_dim3, num_layers=n_layers, layout="NTC", input_size=hidden_dim2)
        self.dense1 = nn.Dense(100, activation="relu")
        self.dense2 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm1(x)
        x = self.lstm2(x)
        x = self.lstm3(x)

        x = self.dense1(x[:, -1])

        return self.dense2(x)

model = LSTMTextClassifier()

model
LSTMTextClassifier(
  (embedding): Embedding(66505 -> 40, float32)
  (lstm1): LSTM(40 -> 50, NTC)
  (lstm2): LSTM(50 -> 60, NTC)
  (lstm3): LSTM(60 -> 75, NTC)
  (dense1): Dense(None -> 100, Activation(relu))
  (dense2): Dense(None -> 4, linear)
)
from mxnet import init, initializer

model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
embedding8_weight         : (66505, 40)
lstm8_l0_i2h_weight       : (200, 40)
lstm8_l0_h2h_weight       : (200, 50)
lstm8_l0_i2h_bias         : (200,)
lstm8_l0_h2h_bias         : (200,)
lstm9_l0_i2h_weight       : (240, 50)
lstm9_l0_h2h_weight       : (240, 60)
lstm9_l0_i2h_bias         : (240,)
lstm9_l0_h2h_bias         : (240,)
lstm10_l0_i2h_weight      : (300, 60)
lstm10_l0_h2h_weight      : (300, 75)
lstm10_l0_i2h_bias        : (300,)
lstm10_l0_h2h_bias        : (300,)
dense14_weight            : (100, 75)
dense14_bias              : (100,)
dense15_weight            : (4, 100)
dense15_bias              : (4,)

Train Network

In this section, we have trained our network using exactly the same settings that we have used for all our previous approaches. We can notice from the loss and accuracy values that our network is doing a good job at the classification task.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [01:52<00:00,  1.04it/s]
Train CrossEntropyLoss : 1.244
Valid CrossEntropyLoss : 1.022
Valid Accuracy : 0.498
100%|██████████| 118/118 [01:52<00:00,  1.05it/s]
Train CrossEntropyLoss : 0.793
Valid CrossEntropyLoss : 0.534
Valid Accuracy : 0.802
100%|██████████| 118/118 [01:52<00:00,  1.05it/s]
Train CrossEntropyLoss : 0.434
Valid CrossEntropyLoss : 0.443
Valid Accuracy : 0.849
100%|██████████| 118/118 [01:53<00:00,  1.04it/s]
Train CrossEntropyLoss : 0.356
Valid CrossEntropyLoss : 0.469
Valid Accuracy : 0.844
100%|██████████| 118/118 [01:53<00:00,  1.04it/s]
Train CrossEntropyLoss : 0.329
Valid CrossEntropyLoss : 0.456
Valid Accuracy : 0.850
100%|██████████| 118/118 [01:52<00:00,  1.05it/s]
Train CrossEntropyLoss : 0.298
Valid CrossEntropyLoss : 0.447
Valid Accuracy : 0.854
100%|██████████| 118/118 [01:54<00:00,  1.03it/s]
Train CrossEntropyLoss : 0.282
Valid CrossEntropyLoss : 0.521
Valid Accuracy : 0.840
100%|██████████| 118/118 [01:52<00:00,  1.05it/s]
Train CrossEntropyLoss : 0.284
Valid CrossEntropyLoss : 0.485
Valid Accuracy : 0.848
100%|██████████| 118/118 [01:52<00:00,  1.05it/s]
Train CrossEntropyLoss : 0.261
Valid CrossEntropyLoss : 0.482
Valid Accuracy : 0.850
100%|██████████| 118/118 [01:53<00:00,  1.04it/s]
Train CrossEntropyLoss : 0.239
Valid CrossEntropyLoss : 0.501
Valid Accuracy : 0.851
100%|██████████| 118/118 [01:54<00:00,  1.03it/s]
Train CrossEntropyLoss : 0.236
Valid CrossEntropyLoss : 0.521
Valid Accuracy : 0.847
100%|██████████| 118/118 [01:53<00:00,  1.04it/s]
Train CrossEntropyLoss : 0.226
Valid CrossEntropyLoss : 0.602
Valid Accuracy : 0.826
100%|██████████| 118/118 [01:52<00:00,  1.05it/s]
Train CrossEntropyLoss : 0.224
Valid CrossEntropyLoss : 0.541
Valid Accuracy : 0.838
100%|██████████| 118/118 [01:53<00:00,  1.04it/s]
Train CrossEntropyLoss : 0.218
Valid CrossEntropyLoss : 0.512
Valid Accuracy : 0.851
100%|██████████| 118/118 [01:53<00:00,  1.04it/s]
Train CrossEntropyLoss : 0.216
Valid CrossEntropyLoss : 0.608
Valid Accuracy : 0.836

Evaluate Network Performance

In this section, we have evaluated the performance of our network by calculating accuracy score, classification report and confusion matrix metrics on test predictions. The accuracy score is good, though not the best among the approaches we have tried. We have also plotted the confusion matrix for reference purposes.
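The MakePredictions() helper used below was also defined earlier in the tutorial. An assumed sketch of what such a helper returns (actual and predicted class indices collected over the whole loader):

from mxnet import nd

def make_predictions_sketch(model, loader):
    actuals, preds = [], []
    for X_batch, Y_batch in loader:
        out = model(X_batch)                  ## logits of shape (batch, n_classes)
        preds.append(out.argmax(axis=-1))     ## predicted class index per sample
        actuals.append(Y_batch)
    return nd.concat(*actuals, dim=0), nd.concat(*preds, dim=0)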

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
Test Accuracy : 0.8364473684210526
Classification Report :
              precision    recall  f1-score   support

       World       0.75      0.90      0.82      1900
      Sports       0.91      0.92      0.92      1900
    Business       0.81      0.78      0.80      1900
    Sci/Tech       0.89      0.74      0.81      1900

    accuracy                           0.84      7600
   macro avg       0.84      0.84      0.84      7600
weighted avg       0.84      0.84      0.84      7600


Confusion Matrix :
[[1719   74   76   31]
 [ 123 1752   15   10]
 [ 260   28 1484  128]
 [ 187   62  249 1402]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

[Figure: Normalized confusion matrix plot]

Explain Network Predictions using LIME Algorithm

In this section, we have explained predictions made by our network using the LIME algorithm. The network correctly predicts the target label as Sci/Tech for the selected text example. The visualization shows that words like 'management', 'devices', 'wireless', 'software', 'remote', 'technology', etc. are contributing to predicting the target label as Sci/Tech.

from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes)

rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Actual :     ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

[Figure: LIME explanation highlighting words contributing to the Sci/Tech prediction]
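The make_predictions() function passed as classifier_fn above was defined earlier in the tutorial. LIME expects it to map a list of raw texts to an (n_samples, n_classes) array of probabilities; an assumed sketch of such a function using the vocab, tokenizer and pad_seq objects from this tutorial:

from mxnet import nd

def make_predictions_sketch(texts):
    batch = [[vocab(word) for word in tokenizer(text)] for text in texts]
    batch = [pad_seq(tokens) for tokens in batch]     ## pad/trim each sample to max_tokens
    logits = model(nd.array(batch))
    return nd.softmax(logits, axis=-1).asnumpy()      ## (n_samples, n_classes) probabilities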

Approach 6: Multiple Bidirectional LSTM Layers (Max Tokens=50, Embeddings Length=40, LSTM Output=75)

Our approach in this section is the same as our fourth approach, except that we use bidirectional LSTM layers. The majority of the code is the same as earlier; only the network architecture differs.

Define Network

Below, we have defined the network that we'll use for our text classification task in this section. The architecture is the same as in our fourth approach, where we used multiple LSTM layers by setting num_layers to 3. The only difference in this section is that we have set the bidirectional parameter to True in the LSTM() constructor so that it creates bidirectional layers.

After defining the network, we initialized it and printed the shape of weights/biases of layers.

from mxnet.gluon import nn, rnn

embed_len = 40
hidden_dim = 75
n_layers = 3

class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers,
                             layout="NTC", input_size=embed_len,
                             bidirectional=True # Bidirectional RNN
                          )
        self.dense1 = nn.Dense(100, activation="relu")
        self.dense2 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)
        ## The LSTM output feature size is 150 (2*hidden_dim): one set of features from the
        ## forward pass over the sequence and one from the backward pass (bidirectional).
        x = self.dense1(x[:, -1])

        return self.dense2(x)
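Before instantiating the full classifier, a quick illustrative sanity check confirms that setting bidirectional=True doubles the LSTM output feature size to 2*hidden_dim = 150 (check_lstm is a throwaway layer created only for this check):

check_lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC",
                      input_size=embed_len, bidirectional=True)
check_lstm.initialize()
out = check_lstm(nd.random.uniform(shape=(2, 50, embed_len)))   ## (batch, time, features) input
print(out.shape)   ## expected: (2, 50, 150)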

model = LSTMTextClassifier()

model
LSTMTextClassifier(
  (embedding): Embedding(66505 -> 40, float32)
  (lstm): LSTM(40 -> 75, NTC, num_layers=3, bidirectional)
  (dense1): Dense(None -> 100, Activation(relu))
  (dense2): Dense(None -> 4, linear)
)
from mxnet import init, initializer

model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))

for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
embedding10_weight        : (66505, 40)
lstm14_l0_i2h_weight      : (300, 40)
lstm14_l0_h2h_weight      : (300, 75)
lstm14_l0_i2h_bias        : (300,)
lstm14_l0_h2h_bias        : (300,)
lstm14_r0_i2h_weight      : (300, 40)
lstm14_r0_h2h_weight      : (300, 75)
lstm14_r0_i2h_bias        : (300,)
lstm14_r0_h2h_bias        : (300,)
lstm14_l1_i2h_weight      : (300, 150)
lstm14_l1_h2h_weight      : (300, 75)
lstm14_l1_i2h_bias        : (300,)
lstm14_l1_h2h_bias        : (300,)
lstm14_r1_i2h_weight      : (300, 150)
lstm14_r1_h2h_weight      : (300, 75)
lstm14_r1_i2h_bias        : (300,)
lstm14_r1_h2h_bias        : (300,)
lstm14_l2_i2h_weight      : (300, 150)
lstm14_l2_h2h_weight      : (300, 75)
lstm14_l2_i2h_bias        : (300,)
lstm14_l2_h2h_bias        : (300,)
lstm14_r2_i2h_weight      : (300, 150)
lstm14_r2_h2h_weight      : (300, 75)
lstm14_r2_i2h_bias        : (300,)
lstm14_r2_h2h_bias        : (300,)
dense18_weight            : (100, 150)
dense18_bias              : (100,)
dense19_weight            : (4, 100)
dense19_bias              : (4,)
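The bidirectional shapes can be reasoned about the same way: each direction gets its own gate-stacked matrices with 4*hidden_dim = 300 rows, and from the second layer onward the input width is 2*hidden_dim = 150 because the forward and backward outputs are concatenated. A short illustrative check (hd and emb are shorthand for the values used here):

hd, emb = 75, 40
print((4 * hd, emb))        ## (300, 40)  -> layer-0 i2h weight, per direction
print((4 * hd, 2 * hd))     ## (300, 150) -> layer-1/2 i2h weight, input is forward+backward concat
print((100, 2 * hd))        ## (100, 150) -> dense1 weight, classifier head sees 150 features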

Train Network

Here, we have trained our network using exactly the same settings that we have used for all our approaches. We can notice from the loss and accuracy values printed after each epoch that our network is doing a good job at the classification task.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=15
learning_rate = 0.001

model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [05:51<00:00,  2.98s/it]
Train CrossEntropyLoss : 1.240
Valid CrossEntropyLoss : 0.905
Valid Accuracy : 0.591
100%|██████████| 118/118 [05:52<00:00,  2.99s/it]
Train CrossEntropyLoss : 0.710
Valid CrossEntropyLoss : 0.500
Valid Accuracy : 0.818
100%|██████████| 118/118 [05:36<00:00,  2.85s/it]
Train CrossEntropyLoss : 0.410
Valid CrossEntropyLoss : 0.476
Valid Accuracy : 0.841
100%|██████████| 118/118 [05:46<00:00,  2.94s/it]
Train CrossEntropyLoss : 0.340
Valid CrossEntropyLoss : 0.436
Valid Accuracy : 0.857
100%|██████████| 118/118 [05:29<00:00,  2.80s/it]
Train CrossEntropyLoss : 0.303
Valid CrossEntropyLoss : 0.540
Valid Accuracy : 0.828
100%|██████████| 118/118 [05:46<00:00,  2.94s/it]
Train CrossEntropyLoss : 0.295
Valid CrossEntropyLoss : 0.480
Valid Accuracy : 0.850
100%|██████████| 118/118 [05:32<00:00,  2.82s/it]
Train CrossEntropyLoss : 0.281
Valid CrossEntropyLoss : 0.433
Valid Accuracy : 0.857
100%|██████████| 118/118 [05:47<00:00,  2.94s/it]
Train CrossEntropyLoss : 0.261
Valid CrossEntropyLoss : 0.467
Valid Accuracy : 0.854
100%|██████████| 118/118 [05:48<00:00,  2.95s/it]
Train CrossEntropyLoss : 0.249
Valid CrossEntropyLoss : 0.469
Valid Accuracy : 0.854
100%|██████████| 118/118 [05:31<00:00,  2.81s/it]
Train CrossEntropyLoss : 0.246
Valid CrossEntropyLoss : 0.483
Valid Accuracy : 0.845
100%|██████████| 118/118 [05:49<00:00,  2.96s/it]
Train CrossEntropyLoss : 0.253
Valid CrossEntropyLoss : 0.449
Valid Accuracy : 0.852
100%|██████████| 118/118 [05:32<00:00,  2.82s/it]
Train CrossEntropyLoss : 0.234
Valid CrossEntropyLoss : 0.586
Valid Accuracy : 0.838
100%|██████████| 118/118 [05:46<00:00,  2.93s/it]
Train CrossEntropyLoss : 0.210
Valid CrossEntropyLoss : 0.618
Valid Accuracy : 0.838
100%|██████████| 118/118 [05:31<00:00,  2.81s/it]
Train CrossEntropyLoss : 0.200
Valid CrossEntropyLoss : 0.636
Valid Accuracy : 0.838
100%|██████████| 118/118 [05:52<00:00,  2.99s/it]
Train CrossEntropyLoss : 0.189
Valid CrossEntropyLoss : 0.623
Valid Accuracy : 0.832

Evaluate Network Performance

In this section, we have evaluated the performance of our trained network by calculating accuracy score, classification report and confusion matrix metrics on test predictions. We have also plotted the confusion matrix for reference purposes.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Y_actuals, Y_preds = MakePredictions(model, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
Test Accuracy : 0.8322368421052632
Classification Report :
              precision    recall  f1-score   support

       World       0.77      0.88      0.82      1900
      Sports       0.95      0.87      0.91      1900
    Business       0.77      0.83      0.80      1900
    Sci/Tech       0.87      0.75      0.80      1900

    accuracy                           0.83      7600
   macro avg       0.84      0.83      0.83      7600
weighted avg       0.84      0.83      0.83      7600


Confusion Matrix :
[[1675   56  125   44]
 [ 197 1657   27   19]
 [ 165   15 1572  148]
 [ 146   21  312 1421]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

[Figure: Normalized confusion matrix plot]

Explain Network Predictions using LIME Algorithm

In this section, we have explained the prediction made by our network on a random test example using the LIME algorithm. The network correctly predicts the target label as Sci/Tech for the selected text example. The visualization shows that words like 'management', 'technology', 'wireless', 'devices', 'software', 'departments', etc. are contributing to predicting the target label as Sci/Tech.

from lime import lime_text

explainer = lime_text.LimeTextExplainer(class_names=target_classes)

rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))

X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)

print("Actual :     ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])

explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

[Figure: LIME explanation highlighting words contributing to the Sci/Tech prediction]

8. Results Summary and Further Recommendations

Approach                             Max Tokens   Embedding Length   LSTM Output   Test Accuracy (%)
Single LSTM Layer                    25           40                 75            81.03
Single LSTM Layer                    50           40                 75            85.15
Single Bidirectional LSTM Layer      50           40                 75            84.02
Multiple LSTM Layers                 50           40                 75            84.47
Stacking Multiple LSTM Layer         50           40                 50,60,75      83.64
Multiple Bidirectional LSTM Layers   50           40                 75            83.22

Further Recommendations

  • Try different max token sizes.
  • Try different embedding lengths.
  • Try different LSTM layer output units.
  • Try different activation functions.
  • Try different weight initializations.
  • Add more dense layers to the network.
  • Train network for more epochs.
  • Try learning rate schedules (a minimal sketch follows this list).
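For the last item, MXNet ships learning rate schedulers that can be attached to the optimizer; a minimal sketch with illustrative values (FactorScheduler decays the learning rate by a fixed factor every few updates):

from mxnet import gluon, lr_scheduler
from mxnet import optimizer as opt

schedule = lr_scheduler.FactorScheduler(step=500, factor=0.9)   ## multiply lr by 0.9 every 500 updates
adam = opt.Adam(learning_rate=0.001, lr_scheduler=schedule)
trainer = gluon.Trainer(model.collect_params(), adam)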