Recurrent neural networks (RNNs) are commonly used with data that has an internal sequence, such as time series, text, and speech. In these kinds of datasets, the example at any stage depends on previous examples, and the previous few examples are often the best basis for predicting the current one. Traditional neural networks consisting only of dense layers are not good at capturing these kinds of sequences. They have no memory and can't remember the data they saw earlier. Unlike them, RNNs are quite good at remembering sequences. Though in theory vanilla RNNs should be able to capture sequences of any length, in practice they struggle with long sequences because of the vanishing gradient problem (gradients can also explode during training). Hence, a version of RNNs named Long Short-Term Memory (LSTM) was invented, which mitigates the vanishing gradient problem of vanilla RNNs and is quite good at capturing long sequences. Below, we have included an image showing one cell of the LSTM layer. Many such LSTM cells are laid next to each other to create an LSTM layer. An LSTM network can consist of one or more LSTM layers.
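To make the idea of a single LSTM cell more concrete, here is a minimal pure-Python sketch of one time step for a cell with a scalar input and a scalar state. It follows the standard LSTM gate equations. The function name `lstm_step` and the layout of the weight dictionary are our own illustrative assumptions, not MXNet's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for scalar input/state (illustrative only).

    `w` maps each gate name to an (input_weight, hidden_weight, bias) tuple.
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value to write

    c = f * c_prev + i * g           # additive update of the cell state
    h = o * math.tanh(c)             # new hidden state
    return h, c


# With all-zero weights every gate equals 0.5 and the candidate is 0,
# so the previous cell state is simply halved.
gate_names = ("input", "forget", "output", "candidate")
zero_w = {name: (0.0, 0.0, 0.0) for name in gate_names}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, w=zero_w)
print(h, c)
```

The cell state is updated additively (`f * c_prev + i * g`) rather than being passed through a squashing function at every step, which is the usual explanation for why LSTMs carry information across longer sequences than vanilla RNNs.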
As a part of this tutorial, we have explained how to design LSTM networks for text classification tasks using the Python deep learning library MXNet (from Apache). We have used the word embeddings approach for encoding text data. The tutorial tries several ways of using LSTM layers in a network and compares their results at the end. We have also evaluated the performance of the networks by calculating various ML metrics. Apart from this, we have explained the predictions made by the networks using the LIME algorithm.
Below, we have listed important sections of tutorial to give an overview of the material covered.
Below, we have imported the necessary python libraries that we are going to use in this tutorial and printed the versions of them as well.
import mxnet
print("MXNet Version : {}".format(mxnet.__version__))
import gluonnlp
print("GluonNLP Version : {}".format(gluonnlp.__version__))
import torchtext
print("TorchText Version : {}".format(torchtext.__version__))
In this section, we have prepared our data to be given to the neural network for training and evaluation purposes. We are going to use the word embeddings approach to encode text data. We'll follow the steps below to convert text data to real-valued data using this approach.
1. Tokenize the text examples and build a vocabulary that maps each unique token to an integer index.
2. Convert each text example to a list of token indexes using the vocabulary.
3. Map each token index to a real-valued vector (embedding).
The first two steps will be completed in this section, where we have created data loaders that return a list of indexes for each text example. The third step will be implemented in the neural network as an embedding layer that returns embeddings from indexes.
In this section, we have loaded AG NEWS dataset that we are going to use for our classification task. The dataset has text documents for 4 different categories (["World", "Sports", "Business", "Sci/Tech"]) of news. The dataset is already divided into train and test sets. After loading datasets, we have also wrapped them in ArrayDataset object which is a standard MXNet data structure to maintain data.
from mxnet.gluon.data import ArrayDataset
train_dataset, test_dataset = torchtext.datasets.AG_NEWS()
Y_train, X_train = zip(*list(train_dataset))
Y_test, X_test = zip(*list(test_dataset))
train_dataset = ArrayDataset(X_train, Y_train)
test_dataset = ArrayDataset(X_test, Y_test)
Below, we have first defined a tokenizer. The tokenizer is a function that takes a text document as input and returns a list of tokens (words). The vocabulary will maintain all such unique words. We have created the tokenizer using a regular expression that matches runs of one or more consecutive word characters (letters, digits, and underscores). We have wrapped the function with partial() from the functools Python library, although with no arguments bound the wrapper behaves the same as the underlying lambda.
import re
from functools import partial
tokenizer = partial(lambda X: re.findall(r"\w+", X))
tokenizer("Hello, How are you?")
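As a quick check of what the `\w+` pattern actually captures, the sketch below runs the same regular expression on two sentences. The function name `tokenize` and the second sentence are our own examples; note that digits and underscores are kept, while punctuation and currency symbols are dropped.

```python
import re


def tokenize(text):
    # Same pattern as the tutorial's tokenizer: runs of "word" characters.
    return re.findall(r"\w+", text)


plain = tokenize("Hello, How are you?")
mixed = tokenize("Oil hits $50 a barrel in Q3_2004")
print(plain)  # ['Hello', 'How', 'are', 'you']
print(mixed)  # ['Oil', 'hits', '50', 'a', 'barrel', 'in', 'Q3_2004']
```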
Below, we are populating the vocabulary of all unique tokens. In order to create the vocabulary, we need to populate a Counter object with all unique tokens from the datasets along with their respective frequencies. The Counter object is a kind of dictionary available from the Python collections module that maintains token counts. We have started by creating an empty Counter object. Then, we are looping through each text example of the train and test datasets. For the tokens of each text example, we have called the count_tokens() method available from the data sub-module of the gluonnlp Python module. We have also given our Counter object to this method, so it keeps updating the counter with tokens and their respective frequencies. Note that to_lower=True stores the tokens in lowercase.
In the end, we have called Vocab() constructor from gluonnlp with Counter object to create a vocabulary. The Vocab object has a mapping from tokens to indexes. We have also printed the size of the vocabulary at the end.
from collections import Counter
counter = Counter()
for dataset in [train_dataset, test_dataset]:
    for X, Y in dataset:
        gluonnlp.data.count_tokens(tokenizer(X), to_lower=True, counter=counter)
vocab = gluonnlp.Vocab(counter=counter, special_token="<unk>", min_freq=1)
print("Vocabulary Size : {}".format(len(vocab)))
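The vocabulary-building step can be sketched without any library as follows. This is a simplified stand-in, not gluonnlp's implementation: the helper name `build_vocab` and the ordering rule (most frequent first, ties broken alphabetically) are our assumptions, and gluonnlp's `Vocab` may reserve additional special tokens, so its exact indexes can differ.

```python
import re
from collections import Counter


def build_vocab(texts, unknown_token="<unk>"):
    """Map each unique lowercased token to an integer index (index 0 = unknown)."""
    counter = Counter()
    for text in texts:
        counter.update(token.lower() for token in re.findall(r"\w+", text))
    # Most frequent tokens get the smallest indexes; ties are broken alphabetically.
    ordered = sorted(counter, key=lambda tok: (-counter[tok], tok))
    token_to_idx = {unknown_token: 0}
    for tok in ordered:
        token_to_idx[tok] = len(token_to_idx)
    return token_to_idx


toy_vocab = build_vocab(["The cat sat", "the dog sat down"])
print(toy_vocab)
# Tokens that were never counted fall back to the unknown index.
print(toy_vocab.get("bird", toy_vocab["<unk>"]))
```

Because the counts are collected in lowercase, a lookup with a capitalized token such as "The" would not find an entry in this toy vocabulary and would fall back to the unknown index.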
In this section, we have defined a simple vectorization function that takes a batch of data (text examples and their target labels) as input and returns a list of indexes for each text example. This function will be used by the data loaders later for preprocessing each batch of data. The function tokenizes each text example and retrieves the indexes of the tokens using the vocabulary. We have decided to keep a maximum of 25 tokens per text example. Each text example can have a different number of tokens, because examples differ in the number of sentences and each sentence can have a different number of words. For examples that have more than 25 tokens, we have truncated the tokens beyond the first 25, and for examples that have fewer than 25 tokens, we have padded them with 0s (the index of the <unk> token). After mapping tokens to their indexes, we have returned them and the target labels as MXNet ndarray objects.
import gluonnlp.data.batchify as bf
from mxnet import nd
import numpy as np
max_tokens = 25
clip_seq = gluonnlp.data.ClipSequence(max_tokens)
pad_seq = gluonnlp.data.PadSequence(length=max_tokens, pad_val=0, clip=True)
def vectorize(batch):
    X, Y = list(zip(*batch))
    X = [[vocab(word) for word in tokenizer(sample)] for sample in X] ## Map tokens to indexes
    X = [pad_seq(tokens) for tokens in X] ## Bringing all samples to max_tokens length
    return nd.array(X, dtype=np.int32), nd.array(Y, dtype=np.int32) - 1 # Subtracting 1 from labels to bring them in range 0-3 from 1-4
vectorize([["how are you", 1]])
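The padding and truncation behaviour described above can be illustrated with a small pure-Python helper. The name `pad_or_clip` is hypothetical; it is meant to mirror what the tutorial uses `PadSequence(..., clip=True)` for, not to reproduce gluonnlp's code.

```python
def pad_or_clip(indexes, length, pad_val=0):
    """Return exactly `length` items: truncate long inputs, right-pad short ones."""
    clipped = list(indexes[:length])
    return clipped + [pad_val] * (length - len(clipped))


short = pad_or_clip([7, 3, 9], length=5)
long_ = pad_or_clip(list(range(1, 9)), length=5)
print(short)  # [7, 3, 9, 0, 0]
print(long_)  # [1, 2, 3, 4, 5]
```

Fixing the length this way is what allows a whole batch to be stored as one rectangular array of shape (batch_size, max_tokens).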
In this section, we have created train and test data loaders using datasets we created earlier. The data loaders are used to loop through training data in batches during the training process. We have kept batch size at 1024. We have also provided the vectorization function defined in the previous cell to batchify_fn parameter. This function will be applied to each batch of data before giving data to the neural network.
from mxnet.gluon.data import DataLoader
train_loader = DataLoader(train_dataset, batch_size=1024, batchify_fn=vectorize)
test_loader = DataLoader(test_dataset, batch_size=1024, batchify_fn=vectorize)
target_classes = ["World", "Sports", "Business", "Sci/Tech"]
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
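As a rough sanity check on what the loader yields, the sketch below computes the batch shapes for a given dataset size. The helper `batch_shapes` is hypothetical, and the figure of 120,000 training examples for AG NEWS is an assumption that is not stated in this tutorial. The sketch also assumes the last, smaller batch is kept rather than dropped.

```python
import math


def batch_shapes(n_examples, batch_size, max_tokens):
    """Shapes of the index batches, keeping the final partial batch."""
    n_batches = math.ceil(n_examples / batch_size)
    last = n_examples - (n_batches - 1) * batch_size
    return [(batch_size, max_tokens)] * (n_batches - 1) + [(last, max_tokens)]


# Assumed dataset size: 120,000 training examples.
shapes = batch_shapes(n_examples=120_000, batch_size=1024, max_tokens=25)
print(len(shapes), shapes[0], shapes[-1])  # 118 (1024, 25) (192, 25)
```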
Our approach in this section uses Recurrent Neural Network with a single LSTM layer. The network has three layers, embedding layer, LSTM layer, and dense layer. The embedding layer maps token indexes to their embeddings, the LSTM layer processes embeddings to understand the sequence, and the dense layer processes the output of the LSTM layer to generate 4 probabilities (for 4 target classes) per text example. After training this network, we have also evaluated the performance by calculating various ML metrics. Apart from this, we have tried to explain the predictions made by the network using LIME algorithm.
Here, we have defined a network that we'll use for our text classification task. The network consists of 3 layers.
The first layer of the network is the embedding layer. We have created embedding layer using Embedding() constructor available from 'nn' sub-module of 'gluon' sub-module of mxnet library. We have provided it with a length of vocabulary (number of unique tokens) and embedding length (40). The embedding length of 40 means that each token will be assigned a real-valued vector of length 40. When we create this layer, it internally creates a weight matrix of shape (vocab_len, embed_len). When we provide the network with a list of token indexes, it'll retrieve embeddings by indexing this weight matrix with token indexes. The input shape to embedding layer is (batch_size, max_tokens) = (batch_size, 25) and output shape is (batch_size, max_tokens, embed_len) = (batch_size, 25, 40).
The second layer of the network is the LSTM layer. It takes the output of the embedding layer and loops through the embeddings of each text example to process them. We have created the LSTM layer using the LSTM() constructor available from the 'rnn' sub-module of the 'gluon' sub-module of mxnet. We have set the hidden_size of the LSTM layer to 75. We can also stack more than one LSTM layer by providing a count greater than 1 to the num_layers parameter. Here, we have set it to 1 (through the n_layers variable). The input data shape to the LSTM layer is (batch_size, max_tokens, embed_len) and the output shape is (batch_size, max_tokens, hidden_size) = (batch_size, 25, 75).
The third layer is the Dense layer with 4 output units (the same as the number of target classes). Only the LSTM output for the last time step of each example is given to the dense layer, whose output is returned as predictions.
After defining the network, we initialized it and made predictions using random data for verification purposes. We have also printed the shape of weights/biases of layers of the network for information purposes.
We have not covered how to design networks using MXNet in detail here. If you are someone new to MXNet and want to learn how to create a neural network using it then please feel free to check the below link. It'll get you started with the library.
from mxnet.gluon import nn, rnn
embed_len = 40
hidden_dim = 75
n_layers = 1
class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=embed_len)
        self.dense = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)
        return self.dense(x[:, -1])
model = LSTMTextClassifier()
model
from mxnet import init, initializer
model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))
preds.shape
for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
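The embedding lookup described earlier (indexing a weight matrix of shape (vocab_len, embed_len) with token indexes) can be sketched with plain nested lists. The helper names `embed` and `shape` and the toy sizes are our own, chosen only to show how the shape grows from (batch_size, max_tokens) to (batch_size, max_tokens, embed_len).

```python
import random


def embed(batch_indexes, weight):
    """Embedding lookup: replace every token index with its row of the weight matrix."""
    return [[weight[idx] for idx in example] for example in batch_indexes]


def shape(nested):
    """Shape of a regular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


random.seed(0)
vocab_len, embed_len, max_tokens = 10, 4, 5   # toy sizes, not the tutorial's
weight = [[random.random() for _ in range(embed_len)] for _ in range(vocab_len)]

batch = [[1, 2, 3, 0, 0], [4, 4, 9, 8, 7]]    # (batch_size, max_tokens)
embedded = embed(batch, weight)               # (batch_size, max_tokens, embed_len)
print(shape(batch), "->", shape(embedded))    # (2, 5) -> (2, 5, 4)
```

Two occurrences of the same index (such as the repeated 4 in the second example) always receive the same vector, which is why padding positions all share the embedding of index 0.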
In this section, we are training the network that we designed in the previous cell. We have defined a function for training our network. The function takes the trainer object (which holds the network parameters and the optimizer), the train data loader, the validation data loader, and the number of epochs as input. It then executes the training loop once per epoch. For each epoch, it loops through the training data in batches using the train data loader. Using each batch of data, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates the network weights. It keeps track of the loss for each batch and prints the average loss of all batches at the end of each epoch. We have also created helper functions that calculate the validation loss and accuracy.
from mxnet import autograd
from tqdm import tqdm
from sklearn.metrics import accuracy_score
def MakePredictions(model, val_loader):
    Y_actuals, Y_preds = [], []
    for X_batch, Y_batch in val_loader:
        preds = model(X_batch)
        preds = nd.softmax(preds)
        Y_actuals.append(Y_batch)
        Y_preds.append(preds.argmax(axis=-1))
    Y_actuals, Y_preds = nd.concatenate(Y_actuals), nd.concatenate(Y_preds)
    return Y_actuals, Y_preds

def CalcValLoss(model, val_loader):
    losses = []
    for X_batch, Y_batch in val_loader:
        val_loss = loss_func(model(X_batch), Y_batch)
        val_loss = val_loss.mean().asscalar()
        losses.append(val_loss)
    print("Valid CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))

def TrainModelInBatches(trainer, train_loader, val_loader, epochs):
    for i in range(1, epochs+1):
        losses = [] ## Record loss of each batch
        for X_batch, Y_batch in tqdm(train_loader):
            with autograd.record():
                preds = model(X_batch) ## Forward pass to make predictions
                train_loss = loss_func(preds.squeeze(), Y_batch) ## Calculate Loss
            train_loss.backward() ## Calculate Gradients
            train_loss = train_loss.mean().asscalar()
            losses.append(train_loss)
            trainer.step(len(X_batch)) ## Update weights
        print("Train CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))
        CalcValLoss(model, val_loader)
        Y_actuals, Y_preds = MakePredictions(model, val_loader)
        print("Valid Accuracy : {:.3f}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
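To help interpret the loss values printed during training, here is a small pure-Python sketch of softmax cross entropy for a single example. It is a simplified stand-in for what a softmax cross-entropy loss computes on raw network outputs; the function names are our own.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], label=2)
confident_loss = cross_entropy([0.0, 0.0, 8.0, 0.0], label=2)
print(round(uniform_loss, 3))   # 1.386, i.e. ln(4)
print(confident_loss < uniform_loss)
```

With 4 classes, a network that has learned nothing and predicts uniform probabilities has a loss of about ln(4) ≈ 1.386, so losses well below that value indicate the network is learning. Softmax also preserves the ordering of the outputs, so taking the argmax of the raw outputs gives the same predicted class as taking the argmax of the probabilities.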
Below, we are actually training our network using the function we defined in the previous cell. We have first initialized a number of epochs to 15 and the learning rate to 0.001. Then, we have initialized our LSTM text classifier, cross entropy loss, Adam optimizer and Trainer object. At last, we have called training routine with the necessary parameters to perform the training process. We can notice from the loss and accuracy getting printed after each epoch that our network is doing a good job at the text classification task.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=15
learning_rate = 0.001
model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, test_loader, epochs)
In this section, we have evaluated the performance of our trained network by calculating accuracy score, classification report (precision, recall, and f1-score per target class) and confusion matrix metrics on test predictions. We can notice from the accuracy score that our model has done a good job at the classification task. We have calculated ML metrics using functions available from the python library scikit-learn.
The scikit-learn provides functions to calculate many ML metrics. If you are interested in learning about various functions available from sklearn to calculate ML metrics then please check the below link which covers the majority of them in detail.
Apart from calculations, we have also plotted confusion matrix metric using Python library scikit-plot. From the visualization, we can notice that our model is doing quite a good job at classifying text documents in the category Sports compared to other categories.
Please feel free to check the below link if you are new to scikit-plot as it provides visualizations for many ML metrics.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_actuals, Y_preds = MakePredictions(model, test_loader)
print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
normalize=True,
title="Confusion Matrix",
cmap="Reds",
hide_zeros=True,
figsize=(5,5)
);
plt.xticks(rotation=90);
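For readers who want to see how the reported numbers relate to each other, the sketch below builds a confusion matrix by hand and derives accuracy, precision, and recall from it. The labels are made-up toy data and the helper names are our own; scikit-learn's functions remain the ones used in this tutorial.

```python
def toy_confusion_matrix(y_true, y_pred, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def precision_recall(matrix, cls):
    true_pos = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)   # column sum
    actually_cls = sum(matrix[cls])                       # row sum
    precision = true_pos / predicted_as_cls if predicted_as_cls else 0.0
    recall = true_pos / actually_cls if actually_cls else 0.0
    return precision, recall


y_true = [0, 0, 1, 1, 2, 2, 3, 3]   # made-up labels for 4 classes
y_pred = [0, 1, 1, 1, 2, 3, 3, 3]
cm = toy_confusion_matrix(y_true, y_pred, n_classes=4)
accuracy = sum(cm[i][i] for i in range(4)) / len(y_true)
print(cm)
print(accuracy, precision_recall(cm, 1))
```

Accuracy is the sum of the diagonal divided by the number of examples, precision for a class divides its diagonal entry by the column sum, and recall divides it by the row sum.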
In this section, we have explained predictions made by our network using the LIME algorithm. We'll be using the Python library lime, which has an implementation of the algorithm. It lets us create visualizations highlighting the words in a text document that contributed to predicting a particular target label.
If you are someone who is new to the concept of LIME and want to learn about it in depth then we would suggest that you go through the below links.
Below, we have simply loaded text examples from the test dataset.
X_test, Y_test = [], []
for X, Y in test_dataset:
    X_test.append(X)
    Y_test.append(Y-1)
Below, we have first created an instance of LimeTextExplainer which we'll use to create an explanation object later for explaining network predictions.
Then, we have defined a prediction function. The function takes a batch of text examples as input and returns their probabilities predicted by the network. It tokenizes text examples, retrieves their indexes, and gives them to the network to make predictions. It also applies softmax activation function to the output of the network to generate probabilities. We'll use this function later when generating an explanation for the text example.
After defining a function, we randomly selected a text example from the test dataset and made predictions on it using our trained network. Our network correctly predicts the target label as Sci/Tech for the selected text example. Next, we'll explain this prediction.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
def make_predictions(X_batch_text):
    X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_batch_text]
    X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to max_tokens length
    logits = model(nd.array(X_batch, dtype=np.int32))
    preds = nd.softmax(logits)
    return preds.asnumpy()
rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))
X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to max_tokens length
preds = model(nd.array(X_batch)).argmax(axis=-1)
print("Actual : ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
Below, we have first called the explain_instance() method on the LimeTextExplainer object. We have provided the selected text example, the prediction function, and the target label to the method. This method creates an instance of the Explanation object, which has details about the words contributing to the prediction.
Next, we have called show_in_notebook() method on an instance of Explanation object to generate a visualization of the explanation. We can notice from the visualization that words like 'remote', 'software', 'umbrella', 'departments', etc are contributing to predicting the target label as Sci/Tech.
explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
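To give an intuition for why a prediction function that accepts a batch of texts is needed, the sketch below scores words by removing them one at a time and measuring how much the predicted probability drops. This is plain leave-one-out occlusion, a much simpler technique than LIME (which samples many randomly perturbed texts and fits a weighted linear model to them). The toy classifier, its word weights, and the function names are all hypothetical.

```python
import math

TECH_WORDS = {"software": 2.0, "wireless": 1.5, "remote": 1.0}   # hypothetical toy weights


def toy_classifier(texts):
    """Stand-in for a prediction function: returns [P(other), P(sci/tech)] per text."""
    probs = []
    for text in texts:
        score = sum(TECH_WORDS.get(word.lower(), 0.0) for word in text.split())
        p_tech = 1.0 / (1.0 + math.exp(-(score - 1.0)))
        probs.append([1.0 - p_tech, p_tech])
    return probs


def occlusion_importance(text, classifier_fn, label):
    """Drop in P(label) when each word is removed from the text, one at a time."""
    words = text.split()
    base = classifier_fn([text])[0][label]
    variants = [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]
    perturbed = classifier_fn(variants)   # one batched call for all variants
    return {word: base - p[label] for word, p in zip(words, perturbed)}


scores = occlusion_importance("new wireless software for offices", toy_classifier, label=1)
top_word = max(scores, key=scores.get)
print(top_word)
print(scores)
```

Both this sketch and LIME only need black-box access to the classifier: they call it on modified copies of the text, which is why the prediction function must tokenize, pad, and run the network on arbitrary strings.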
Our approach in this section is the same as in the previous section, with the main change being that we use a maximum of 50 tokens per text example this time. The majority of the code is the same as earlier.
Below, we have reinitialized data loaders so that they use new max tokens which are set at 50 this time.
from mxnet.gluon.data import DataLoader
max_tokens = 50
pad_seq = gluonnlp.data.PadSequence(length=max_tokens, pad_val=0, clip=True)
train_loader = DataLoader(train_dataset, batch_size=1024, batchify_fn=vectorize)
test_loader = DataLoader(test_dataset, batch_size=1024, batchify_fn=vectorize)
target_classes = ["World", "Sports", "Business", "Sci/Tech"]
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
In this section, we have defined the network that we'll use for our task. The embedding and LSTM layers are the same as earlier, but note that the code below places an additional dense layer with 100 units and ReLU activation before the final 4-unit output layer. After defining the network, we have also initialized it and printed the shape of weights/biases of layers.
from mxnet.gluon import nn, rnn
embed_len = 40
hidden_dim = 75
n_layers = 1
class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=embed_len)
        self.dense1 = nn.Dense(100, activation="relu")
        self.dense2 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)
        x = self.dense1(x[:, -1])
        return self.dense2(x)
model = LSTMTextClassifier()
model
from mxnet import init, initializer
model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))
for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
In this section, we have trained our network using exactly the same settings that we have used in our previous section. We'll be keeping these settings the same for all approaches so that comparison becomes easier. We can notice from the loss and accuracy values getting printed after each epoch that our network is doing a good job at the text classification task.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=15
learning_rate = 0.001
model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, test_loader, epochs)
In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report and confusion matrix on test predictions. We can notice from the accuracy score that it is better than in our previous approach. Increasing the maximum tokens per text example appears to have helped, although the extra dense layer in this version of the network may also have contributed, so the improvement cannot be attributed to sequence length alone. We have also plotted the confusion matrix for reference purposes.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_actuals, Y_preds = MakePredictions(model, test_loader)
print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
normalize=True,
title="Confusion Matrix",
cmap="Reds",
hide_zeros=True,
figsize=(5,5)
);
plt.xticks(rotation=90);
In this section, we have explained predictions made by our network using LIME algorithm. First, we have randomly selected a text example from the test dataset and our network correctly predicts the target label as Sci/Tech for it. Then, we generated an explanation visualization for the selected text example. The visualization highlights that words like 'management', 'wireless', 'software', 'technology', 'departments', 'remote', etc are contributing to predicting the target label as Sci/Tech which makes sense as they are commonly used words in the technology world.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes)
rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))
X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)
print("Actual : ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
Our approach in this section is exactly the same as our approach in the previous section with one minor difference. We have used bidirectional LSTM layer instead this time. By default, LSTM layers are unidirectional which means that they go through the sequence of data in the forward direction to understand them. In case of the bidirectional LSTM layer, it goes through data in both forward and backward directions to find patterns. The majority of the code is the same as earlier with the only change in the LSTM layer being bidirectional.
Below, we have defined the network that we'll use for the text classification task. The definition of the network is the same as in our previous approach with only one change: the LSTM layer is bidirectional. Inside the LSTM constructor, we have set the bidirectional parameter to True to create a bidirectional LSTM layer. As usual, after defining the network, we have initialized it and printed the shape of weights/biases of layers.
from mxnet.gluon import nn, rnn
embed_len = 40
hidden_dim = 75
n_layers = 1
class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers,
                             layout="NTC", input_size=embed_len,
                             bidirectional=True # Bidirectional LSTM
                             )
        self.dense1 = nn.Dense(100, activation="relu")
        self.dense2 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)
        ## Output shape of LSTM will be 150 (2*hidden_dim) because one for forward seq cycle and one for backward (Bidirectional)
        x = self.dense1(x[:, -1])
        return self.dense2(x)
model = LSTMTextClassifier()
model
from mxnet import init, initializer
model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))
for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
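The sketch below illustrates what the output of a bidirectional layer contains at each time step. It uses a toy "recurrent" pass whose state is simply the tuple of tokens seen so far, so it is not an LSTM. It assumes the common convention that the forward and backward states belonging to the same input position are concatenated; we have not verified MXNet's internals here, and the function names are our own.

```python
def run_direction(tokens, reverse=False):
    """Toy recurrent pass: the state at each step is the tuple of tokens seen so far."""
    order = list(reversed(tokens)) if reverse else list(tokens)
    states, seen = [], ()
    for token in order:
        seen = seen + (token,)
        states.append(seen)
    # Re-align backward states so that states[t] belongs to position t of the input.
    return list(reversed(states)) if reverse else states


def bidirectional(tokens):
    forward = run_direction(tokens)
    backward = run_direction(tokens, reverse=True)
    # Per time step, both directions are paired (2 * hidden_size features in a real LSTM).
    return list(zip(forward, backward))


outputs = bidirectional(["a", "b", "c"])
print(outputs[-1])  # (('a', 'b', 'c'), ('c',))
print(outputs[0])   # (('a',), ('c', 'b', 'a'))
```

Under this convention, slicing the last time step (as `x[:, -1]` does) gives a forward state that has seen the whole sequence but a backward state that has seen only the final token, which may be one reason a bidirectional layer does not automatically improve results with this slicing.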
In this section, we have trained our network using the same settings we have been using for all our approaches. We can notice from the loss and accuracy values getting printed after each epoch that our model is doing a good job at the classification task.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=15
learning_rate = 0.001
model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, test_loader, epochs)
Below, we have evaluated the performance of our trained network by calculating the accuracy score, classification report and confusion matrix on test predictions. We can notice from the accuracy score that it is a little lower compared to our previous approach. The bidirectional LSTM layer was not able to improve the performance of the network further in this case. We have also plotted the confusion matrix for reference purposes.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_actuals, Y_preds = MakePredictions(model, test_loader)
print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy()], [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
normalize=True,
title="Confusion Matrix",
cmap="Reds",
hide_zeros=True,
figsize=(5,5)
);
plt.xticks(rotation=90);
In this section, we have explained predictions of the trained network using LIME algorithm. The network correctly predicts the target label as Sci/Tech for the selected text example. The visualization highlights that words like 'management', 'wireless', 'software', 'devices', 'department', 'technology', 'executive', etc are contributing to predicting target label as Sci/Tech.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes)
rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))
X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)
print("Actual : ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
Our approach in this section uses a recurrent neural network with multiple LSTM layers. We have introduced more than one LSTM layer in the architecture of the network to see whether stacking multiple LSTM layers helps improve accuracy or not. The majority of the code is almost the same as earlier with only a change in the architecture of the network.
Below, we have defined a network that we'll use for our classification task in this section. The code for the architecture is the same as earlier with only one change: we have set the num_layers parameter of the LSTM() constructor to 3, asking it to create 3 consecutive LSTM layers. The rest is the same as earlier. As usual, after defining the network, we have initialized it and printed the shape of weights/biases of layers.
from mxnet.gluon import nn, rnn
embed_len = 40
hidden_dim = 75
n_layers = 3
class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=embed_len)
        self.dense1 = nn.Dense(100, activation="relu")
        self.dense2 = nn.Dense(len(target_classes))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)
        x = self.dense1(x[:, -1])
        return self.dense2(x)
model = LSTMTextClassifier()
model
from mxnet import init, initializer
model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1,len(vocab), shape=(10,max_tokens)))
for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
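When reading the printed weight shapes of a stacked LSTM, it can help to know what to expect. The sketch below computes the shapes for a standard LSTM, where each layer holds four gates' worth of input-to-hidden and hidden-to-hidden weights. The helper `lstm_param_shapes` and the parameter names it generates are illustrative assumptions; the names printed by MXNet may carry different prefixes.

```python
def lstm_param_shapes(input_size, hidden_size, num_layers=1, bidirectional=False):
    """Expected weight/bias shapes per layer for a standard LSTM (4 gates per layer)."""
    directions = ["l", "r"] if bidirectional else ["l"]
    shapes = {}
    layer_input = input_size
    for layer in range(num_layers):
        for d in directions:
            prefix = "{}{}".format(d, layer)
            shapes[prefix + "_i2h_weight"] = (4 * hidden_size, layer_input)
            shapes[prefix + "_h2h_weight"] = (4 * hidden_size, hidden_size)
            shapes[prefix + "_i2h_bias"] = (4 * hidden_size,)
            shapes[prefix + "_h2h_bias"] = (4 * hidden_size,)
        # Deeper layers consume the previous layer's output, not the embeddings.
        layer_input = hidden_size * len(directions)
    return shapes


stacked = lstm_param_shapes(input_size=40, hidden_size=75, num_layers=3)
for name, shp in stacked.items():
    print("{:20s} : {}".format(name, shp))
```

Only the first layer's input-to-hidden weights depend on the embedding length (40); the second and third layers take the 75-dimensional output of the layer below them.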
In this section, we have trained our network using exactly the same settings that we have used for all our approaches. We can notice from the loss and accuracy values that our network is doing a good job at the text classification task.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=15
learning_rate = 0.001
model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, test_loader, epochs)
In this section, we have evaluated the performance of our trained network by calculating accuracy score, classification report and confusion matrix metrics on test predictions. We can notice from the accuracy that it is quite good but a little less compared to our second approach. We have also plotted the confusion matrix for reference purposes.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_actuals, Y_preds = MakePredictions(model, test_loader)
print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy().astype(int)],
                                    [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
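To make the metrics above concrete, here is a small plain-Python sketch of how accuracy and a confusion matrix are computed from label lists. The helper names and the labels are invented for illustration and are not results from this tutorial; scikit-learn's functions are the ones actually used above.

```python
def accuracy(actuals, preds):
    """Fraction of predictions that match the actual labels."""
    matches = sum(1 for a, p in zip(actuals, preds) if a == p)
    return matches / len(actuals)

def confusion_counts(actuals, preds, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actuals, preds):
        matrix[a][p] += 1
    return matrix

# Made-up labels for 4 classes.
y_true = [0, 1, 2, 3, 3, 2, 1, 0]
y_pred = [0, 1, 2, 3, 2, 2, 1, 3]

acc = accuracy(y_true, y_pred)
matrix = confusion_counts(y_true, y_pred, n_classes=4)
print(acc)
for row in matrix:
    print(row)
```

Off-diagonal entries show which classes get confused with each other, which a single accuracy number cannot reveal.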
In this section, we have explained a prediction made by our trained network using the LIME algorithm. The network correctly predicts the target label as Sci/Tech for the selected text example. The visualization highlights that words like 'wireless', 'software', 'technology', 'management', 'devices', 'remote', etc. contributed to predicting the target label as Sci/Tech.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes)
rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))
X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)
print("Actual : ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
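LIME explains a prediction by perturbing the input text (removing words), querying the classifier on the perturbed copies, and fitting a simple local model to the resulting changes in predicted probability. The sketch below is not LIME itself; it is a simpler leave-one-word-out check that conveys the same intuition. The classifier `toy_sci_tech_probability` and its keyword list are hypothetical stand-ins for the trained network.

```python
TECH_WORDS = {"wireless", "software", "technology", "devices"}  # hypothetical

def toy_sci_tech_probability(text):
    """Stand-in classifier: share of words that are tech keywords."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in TECH_WORDS for w in words) / len(words)

def leave_one_out_importance(text, predict):
    """Drop each word in turn and record how much the score falls."""
    words = text.split()
    base = predict(text)
    scores = {}
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        scores[word] = base - predict(perturbed)
    return scores

sample = "new wireless software helps remote offices"
importance = leave_one_out_importance(sample, toy_sci_tech_probability)
for word, drop in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {drop:+.3f}")
```

Words whose removal lowers the score get a positive value, which plays the same role as the highlighted words in the LIME visualization.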
Our approach in this section again creates a recurrent neural network with multiple LSTM layers but this time the output units of each LSTM layer are different, unlike the previous approach where all were the same. The majority of the code in the section is the same as earlier with only a change in network architecture.
Below, we have defined the network that we'll use for our text classification task in this section. The network has an embedding layer like earlier. This time, three LSTM layers are defined independently with 50, 60, and 75 output units respectively. They are applied one after another, starting from the output of the embedding layer, so each layer's input size equals the previous layer's output units. The output of the third LSTM layer at the last time step is given to the first dense layer, whose output is given to the second dense layer. The output of the second dense layer is the network's prediction as usual.
After defining the network, we initialized it and printed the shape of weights/biases of layers.
from mxnet.gluon import nn, rnn
embed_len = 40
hidden_dim1 = 50
hidden_dim2 = 60
hidden_dim3 = 75
n_layers = 1
class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm1 = rnn.LSTM(hidden_size=hidden_dim1, num_layers=n_layers, layout="NTC", input_size=embed_len)
        self.lstm2 = rnn.LSTM(hidden_size=hidden_dim2, num_layers=n_layers, layout="NTC", input_size=hidden_dim1)
        self.lstm3 = rnn.LSTM(hidden_size=hidden_dim3, num_layers=n_layers, layout="NTC", input_size=hidden_dim2)
        self.dense1 = nn.Dense(100, activation="relu")
        self.dense2 = nn.Dense(len(target_classes))
    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm1(x)
        x = self.lstm2(x)
        x = self.lstm3(x)
        x = self.dense1(x[:, -1])
        return self.dense2(x)
model = LSTMTextClassifier()
model
from mxnet import nd, initializer
model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1, len(vocab), shape=(10, max_tokens)))
for key, val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
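The shapes printed by the loop above can be anticipated by hand. The following plain-Python sketch computes them for the three layers, assuming the usual LSTM parameter layout of input-to-hidden and hidden-to-hidden weights with four stacked gate blocks. The helper names are ours, and the parameter keys are shortened compared to the prefixed names MXNet prints.

```python
def lstm_layer_shapes(input_size, hidden_size):
    """Shapes of one unidirectional LSTM layer's parameters.

    The factor 4 comes from the four stacked transforms
    (input, forget, cell candidate, output).
    """
    return {
        "i2h_weight": (4 * hidden_size, input_size),
        "h2h_weight": (4 * hidden_size, hidden_size),
        "i2h_bias": (4 * hidden_size,),
        "h2h_bias": (4 * hidden_size,),
    }

def count_params(shapes):
    """Total number of scalar parameters across all shapes."""
    total = 0
    for shape in shapes.values():
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total

embed_len = 40
hidden_dims = [50, 60, 75]  # output units of the three layers above

input_size = embed_len
totals = []
for hidden in hidden_dims:
    shapes = lstm_layer_shapes(input_size, hidden)
    totals.append(count_params(shapes))
    print(hidden, shapes["i2h_weight"], shapes["h2h_weight"], totals[-1])
    input_size = hidden  # the next layer consumes this layer's output
```

The second dimension of each `i2h_weight` matches the previous layer's output units, which is why `input_size` is set to `hidden_dim1` and `hidden_dim2` for the second and third layers.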
In this section, we have trained our network using exactly the same settings that we have used for all our previous approaches. We can notice from the loss and accuracy values that our network is doing a good job at the classification task.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=15
learning_rate = 0.001
model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, test_loader, epochs)
In this section, we have evaluated the performance of our network by calculating the accuracy score, classification report, and confusion matrix on test predictions. The test accuracy (83.64%) is good but below the best score we have seen so far (85.15%, from the single LSTM layer with 50 tokens). We have also plotted the confusion matrix for reference purposes.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_actuals, Y_preds = MakePredictions(model, test_loader)
print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy().astype(int)],
                                    [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
In this section, we have explained a prediction made by our network using the LIME algorithm. The network correctly predicts the target label as Sci/Tech for the selected text example. The visualization shows that words like 'management', 'devices', 'wireless', 'software', 'remote', 'technology', etc. contribute to predicting the target label as Sci/Tech.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes)
rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))
X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)
print("Actual : ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
Our approach in this section is the same as our fourth approach with the only difference being that we have used bidirectional LSTM layers instead. The majority of the code is the same as earlier with the only difference being network architecture.
Below, we have defined the network that we'll use for our text classification task in this section. The architecture is the same as in our fourth approach, where we used multiple LSTM layers by setting num_layers to 3. The only difference is that we have set the bidirectional parameter to True in the LSTM() constructor so that it creates bidirectional layers. As a result, the LSTM output has 2*hidden_dim = 150 units per time step.
After defining the network, we initialized it and printed the shape of weights/biases of layers.
from mxnet.gluon import nn, rnn
embed_len = 40
hidden_dim = 75
n_layers = 3
class LSTMTextClassifier(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextClassifier, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers,
                             layout="NTC", input_size=embed_len,
                             bidirectional=True # Bidirectional RNN
                             )
        self.dense1 = nn.Dense(100, activation="relu")
        self.dense2 = nn.Dense(len(target_classes))
    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)
        ## LSTM output has 150 (2*hidden_dim) units per time step: 75 from the forward pass and 75 from the backward pass
        x = self.dense1(x[:, -1])
        return self.dense2(x)
model = LSTMTextClassifier()
model
from mxnet import nd, initializer
model.initialize(initializer.Xavier())
preds = model(nd.random.randint(1, len(vocab), shape=(10, max_tokens)))
for key, val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
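For the bidirectional network, the printed shapes differ from the unidirectional case in two ways: every layer has a separate set of weights per direction, and layers after the first receive the concatenated outputs of both directions. The plain-Python sketch below works this out for the settings used above. The function name and dictionary keys are ours, and only weight shapes (not biases) are shown.

```python
def bidirectional_lstm_shapes(input_size, hidden_size, num_layers):
    """Per-layer weight shapes for a stacked bidirectional LSTM."""
    layers = []
    layer_input = input_size
    for _ in range(num_layers):
        per_direction = {
            "i2h_weight": (4 * hidden_size, layer_input),
            "h2h_weight": (4 * hidden_size, hidden_size),
        }
        # Forward and backward directions each have their own weights
        # with identical shapes.
        layers.append({"forward": per_direction, "backward": dict(per_direction)})
        # Outputs of both directions are concatenated for the next layer.
        layer_input = 2 * hidden_size
    return layers, layer_input

layers, output_width = bidirectional_lstm_shapes(
    input_size=40, hidden_size=75, num_layers=3
)
for i, layer in enumerate(layers):
    print(i, layer["forward"]["i2h_weight"], layer["backward"]["i2h_weight"])
print("output units per time step:", output_width)
```

Only the first layer sees the 40-dimensional embeddings; the second and third layers take 150 inputs, and the first dense layer receives 150 values per sample.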
Here, we have trained our network using exactly the same settings that we have been using for all our approaches. We can notice from the loss and accuracy values getting printed after each epoch that our network is doing a good job at the classification task.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=15
learning_rate = 0.001
model = LSTMTextClassifier()
model.initialize(initializer.Xavier())
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, test_loader, epochs)
In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report, and confusion matrix on test predictions. The test accuracy (83.22%) is the lowest among our approaches that use 50 tokens. We have also plotted the confusion matrix for reference purposes.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_actuals, Y_preds = MakePredictions(model, test_loader)
print("Test Accuracy : {}".format(accuracy_score(Y_actuals.asnumpy(), Y_preds.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_actuals.asnumpy(), Y_preds.asnumpy(), target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actuals.asnumpy(), Y_preds.asnumpy()))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actuals.asnumpy().astype(int)],
                                    [target_classes[i] for i in Y_preds.asnumpy().astype(int)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Reds",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
In this section, we have explained the prediction made by our network on a random test example using the LIME algorithm. The network correctly predicts the target label as Sci/Tech for the selected text example. The visualization shows that words like 'management', 'technology', 'wireless', 'devices', 'software', 'departments', etc. contribute to predicting the target label as Sci/Tech.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes)
rng = np.random.RandomState(124)
idx = rng.randint(1, len(X_test))
X_batch = [[vocab(word) for word in tokenizer(sample)] for sample in X_test[idx:idx+1]]
X_batch = [pad_seq(tokens) for tokens in X_batch] ## Bringing all samples to 50 length
preds = model(nd.array(X_batch)).argmax(axis=-1)
print("Actual : ", target_classes[Y_test[idx]])
print("Prediction : ", target_classes[int(preds.asnumpy()[0])])
explanation = explainer.explain_instance(X_test[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
Approach | Max Tokens | Embedding Length | LSTM Output Units | Test Accuracy (%) |
---|---|---|---|---|
Single LSTM Layer | 25 | 40 | 75 | 81.03 |
Single LSTM Layer | 50 | 40 | 75 | 85.15 |
Single Bidirectional LSTM Layer | 50 | 40 | 75 | 84.02 |
Multiple LSTM Layers | 50 | 40 | 75 | 84.47 |
Stacking Multiple LSTM Layers | 50 | 40 | 50,60,75 | 83.64 |
Multiple Bidirectional LSTM Layers | 50 | 40 | 75 | 83.22 |
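To read the table at a glance, the short sketch below ranks the approaches by test accuracy and prints each one's gap to the best score. The numbers are copied from the table above; the variable names are ours.

```python
# (approach, max tokens, test accuracy in %), copied from the table above.
results = [
    ("Single LSTM Layer", 25, 81.03),
    ("Single LSTM Layer", 50, 85.15),
    ("Single Bidirectional LSTM Layer", 50, 84.02),
    ("Multiple LSTM Layers", 50, 84.47),
    ("Stacking Multiple LSTM Layers", 50, 83.64),
    ("Multiple Bidirectional LSTM Layers", 50, 83.22),
]

ranked = sorted(results, key=lambda r: r[2], reverse=True)
best = ranked[0]
for name, max_tokens, acc in ranked:
    gap = best[2] - acc
    print(f"{name:36s} tokens={max_tokens:<3d} acc={acc:.2f} gap={gap:.2f}")
```

In these runs, the single LSTM layer with 50 tokens scores highest, and the largest drop comes from reducing the tokens per sample to 25 rather than from any architecture change. The differences among the 50-token approaches are under two percentage points, so they may not hold across different random initializations.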