Updated On : Apr-13,2022 Time Investment : ~45 mins

PyTorch LSTM For Text Classification Tasks (Word Embeddings)

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) that retains information over long sequences better than a simple RNN. In theory, a traditional RNN can learn dependencies across very long sequences, but in practice it struggles because of the vanishing gradient problem. LSTM, an advanced variant of RNN, mitigates the vanishing gradient problem to an extent and can therefore capture order over much longer sequences. Here, by sequences, we mean data that has an order, like time-series data, speech data, text data, etc. LSTM helps us capture order better than networks made only of dense layers. Below, we have included an image of one cell of LSTM. Inside an LSTM layer, the cell is unrolled across the sequence: the same cell processes one token after another, passing its state along.

As a part of this tutorial, we are going to explain how we can design various LSTM networks using PyTorch to solve a text classification task. We have tried different approaches to using LSTM networks to solve the tasks. The tutorial does not cover the theoretical aspect of LSTM. Please check the below link if you are looking for it.

Recurrent Neural Network (RNN)

We also recommend that readers go through our tutorial on designing PyTorch RNN networks for text classification tasks that use vanilla RNN layers for text classification.

PyTorch RNN For Text Classification Tasks

Below, we have listed important sections of tutorial to give an overview of the material covered.

Important Sections Of Tutorial

Populate Vocabulary
Approach 1: Single LSTM Layer (Tokens Per Text Example=25, Embeddings Length=50, LSTM Output=75)
- Load Dataset And Create Data Loaders
- Define LSTM Network
- Train Network
- Evaluate Network Performance
- Explain Predictions Using LIME Algorithm
Approach 2: Single LSTM Layer (Tokens Per Text Example=50, Embeddings Length=50, LSTM Output=75)
Approach 3: Multiple LSTM Layers (Tokens Per Text Example=50, Embeddings Length=50, LSTM Output=75)
Approach 4: Stacking Multiple LSTM Layers (Tokens Per Text Example=50, Embeddings Length=50, LSTM Output=75)
Approach 5: Multiple Bidirectional LSTM Layers (Tokens Per Text Example=50, Embeddings Length=50, LSTM Output=75)
Results Summary and Suggestions

Below, we have imported the necessary libraries and printed the versions that we have used in our tutorial.

import torch

print("PyTorch Version : {}".format(torch.__version__))

PyTorch Version : 1.9.1+cpu

import torchtext

print("TorchText Version : {}".format(torchtext.__version__))

TorchText Version : 0.10.1

1. Populate Vocabulary

In this section, we have populated a vocabulary with the tokens of the text of the dataset. The vocabulary is a simple mapping from tokens to integer indexes. Index 0 is reserved for the special <UNK> (unknown) token, and the tokens of the dataset are represented by indexes starting from 1.
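To make the token-to-index mapping concrete, here is a minimal pure-Python sketch of the idea. It is not how torchtext implements its vocabulary; the names `toy_corpus`, `token_to_index`, and `lookup` are made up for illustration, and the tie-breaking order is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical toy corpus; each example is already tokenized.
toy_corpus = [["stocks", "fall", "again"], ["team", "wins", "again"]]

# Count token frequencies across all examples.
counts = Counter(token for example in toy_corpus for token in example)

# Index 0 is reserved for the unknown token; real tokens start at 1,
# ordered by descending frequency (ties broken alphabetically here).
token_to_index = {"<UNK>": 0}
for token, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    token_to_index[token] = len(token_to_index)

def lookup(tokens):
    """Map tokens to indexes, falling back to the <UNK> index."""
    unk = token_to_index["<UNK>"]
    return [token_to_index.get(token, unk) for token in tokens]

print(lookup(["team", "wins", "tomorrow"]))  # [4, 5, 0]
```

The word "tomorrow" never appeared in the toy corpus, so it falls back to index 0, just as an unseen word does in the real vocabulary.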

Below, we have first loaded the AG NEWS dataset available from the datasets sub-module of the torchtext library. The dataset has news text examples for four different categories (["World", "Sports", "Business", "Sci/Tech"]). The dataset is already divided into train and test sets.

from torch.utils.data import DataLoader

train_dataset, test_dataset  = torchtext.datasets.AG_NEWS()

train.csv: 29.5MB [00:00, 77.5MB/s]
test.csv: 1.86MB [00:00, 46.8MB/s]

Below, we have first declared a tokenizer. The tokenizer is a function that splits text into a list of tokens. These tokens are generally words but they can be punctuation and symbols as well.

After defining tokenizer, we have populated vocabulary using build_vocab_from_iterator() function available from vocab sub-module of torchtext library. As the name of the function suggests, it creates vocabulary from an iterator. We have created a simple iterator named build_vocabulary() that takes a list of datasets as input. It then loops through each dataset and each text example of the dataset yielding a list of tokens for each text example using a tokenizer. We have called build_vocab_from_iterator() function by giving this iterator with train and test datasets as arguments. The build_vocab_from_iterator() function will populate vocabulary from tokens yielded by this iterator. We have set min_freq parameter to 1 which indicates that we'll keep all words whose word frequency is at least one.

After populating the vocabulary, we have also printed the length of the vocabulary. We have also explained with a simple example how we can convert text to a list of tokens and then a list of indexes.

from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")

def build_vocabulary(datasets):
    for dataset in datasets:
        for _, text in dataset:
            yield tokenizer(text)

vocab = build_vocab_from_iterator(build_vocabulary([train_dataset, test_dataset]), min_freq=1, specials=["<UNK>"])
vocab.set_default_index(vocab["<UNK>"])

len(vocab)

tokens = tokenizer("Hello how are you?, Welcome to CoderzColumn!!")
indexes = vocab(tokens)

tokens, indexes

(['hello',
  'how',
  'are',
  'you',
  '?',
  ',',
  'welcome',
  'to',
  'coderzcolumn',
  '!',
  '!'],
 [12388, 355, 42, 164, 80, 3, 3298, 4, 0, 747, 747])

vocab["<UNK>"] ## Coderzcolumn word is mapped to unknown as it's new and not present in vocabulary

Approach 1: Single LSTM Layer (Tokens Per Text Example=25, Embeddings Length=50, LSTM Output=75)

In our first approach to using LSTM network for the text classification tasks, we have developed a simple neural network with one LSTM layer which has an output length of 75. We have used word embeddings approach for encoding text using vocabulary populated earlier. We have trained the network, evaluated its performance by calculating various ML metrics, and also explained network predictions using LIME algorithm.

Load Dataset And Create Data Loaders

In this section, we have first loaded our datasets (train and test) and then created data loaders from them which will be used during the training process to loop through training data in batches.

Below, we have first loaded our datasets. Then, we have created data loaders using these datasets. We have provided a batch size of 1024 samples per batch. We have also provided vectorization function to collate_fn argument of the DataLoader() constructors. This function takes batches of text and their respective target labels. It then tokenizes these text examples and maps tokens to indexes using our vocabulary. This function will be applied to all batches of data.

We have set max_words to 25 to tell the data loaders that we want to keep exactly 25 tokens per text example. If a text example has fewer than 25 tokens, it is padded with 0s at the end; if it has more, it is truncated after the first 25 tokens. Note that index 0 is also the index of <UNK> in our vocabulary, so padding and unknown tokens share the same embedding vector.

At last, we return torch tensors from the vectorize_batch() function. The function returns indexes for each text example and their respective target labels as torch tensors. Please make a NOTE that we have deducted 1 from the original target labels as they are in range 1-4 and we want labels in range 0-3.

from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset

train_dataset, test_dataset  = torchtext.datasets.AG_NEWS()
train_dataset, test_dataset  = to_map_style_dataset(train_dataset), to_map_style_dataset(test_dataset)

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

max_words = 25

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [vocab(tokenizer(text)) for text in X] ## Tokenize and map tokens to indexes
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.

    return torch.tensor(X, dtype=torch.int32), torch.tensor(Y) - 1 ## We have deducted 1 from target names to get them in range [0,1,2,3] from [1,2,3,4]


train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset , batch_size=1024, collate_fn=vectorize_batch)

for X, Y in train_loader:
    print(X.shape, Y.shape)
    break

torch.Size([1024, 25]) torch.Size([1024])
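The padding and truncation rule used inside vectorize_batch() can be isolated into a small helper to see exactly what happens to short and long examples. This is only a sketch; `pad_or_truncate` and its `pad_index` parameter are hypothetical names that do not appear in the tutorial code.

```python
def pad_or_truncate(indexes, max_len, pad_index=0):
    """Return a list of exactly max_len indexes."""
    if len(indexes) < max_len:
        # Short example: append pad_index until the target length is reached.
        return indexes + [pad_index] * (max_len - len(indexes))
    # Long (or exact-length) example: keep only the first max_len indexes.
    return indexes[:max_len]

print(pad_or_truncate([7, 8, 9], max_len=5))           # [7, 8, 9, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=5))  # [1, 2, 3, 4, 5]
```

Because every example ends up with the same length, a batch can be turned into a single rectangular tensor.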

Define LSTM Network

In this section, we have defined our LSTM network which consists of 3 layers.

Embedding Layer
LSTM Layer
Linear Layer

The Embedding Layer takes the input list of indexes generated by the vectorization function. We have initialized a layer with a number of embeddings equal to the length of vocabulary and embedding length to 50. This initialization will create a weight tensor of shape (vocab_len, embed_len) which has an embedding vector of length 50 for each token of vocabulary. The layer is responsible for mapping the index of each token to a float vector of length 50 because we have set the embedding length to 50. This layer takes tensor of shape (batch_size, max_tokens) and outputs tensor of shape (batch_size, max_tokens, embed_len). Each token gets assigned its respective embedding vector based on the index value by this layer. If you are new to the concept of word embeddings then we recommend that you go through the below tutorial as it'll help you understand it in detail.

Word Embeddings for PyTorch Text Classification Networks
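The shape transformation performed by the embedding layer can be checked on a tiny example. The sizes below (`toy_vocab_len`, `toy_embed_len`) are made up so the snippet runs instantly; they are not the sizes used in the tutorial.

```python
import torch
from torch import nn

# Small, made-up sizes for illustration only.
toy_vocab_len, toy_embed_len = 10, 4
embedding = nn.Embedding(num_embeddings=toy_vocab_len, embedding_dim=toy_embed_len)

# A batch of 2 examples with 3 token indexes each: (batch_size, max_tokens).
batch = torch.tensor([[1, 2, 0], [3, 3, 9]])
vectors = embedding(batch)

print(embedding.weight.shape)  # torch.Size([10, 4])
print(vectors.shape)           # torch.Size([2, 3, 4])

# The lookup simply selects rows of the weight matrix by index.
print(torch.equal(vectors[1, 0], embedding.weight[3]))  # True
```

Each index is replaced by the corresponding row of the weight tensor, which is why one extra dimension of size embed_len appears in the output.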

The LSTM Layer takes embeddings generated by the embedding layer as input. We have initialized the LSTM layer with the number of stacked LSTM layers set to 1, the output/hidden size set to 75, and the input size set to the embedding length. The LSTM layer internally loops through the embeddings of each text example one token at a time, generating an output after each token. It takes an initial hidden state and an initial carry (cell state) as input, which are needed only for the first token. In this tutorial we initialize them with random numbers; if they are omitted, PyTorch initializes them with zeros. The hidden and carry tensors for subsequent tokens are generated by the LSTM itself. The input to LSTM is of shape (batch_size, max_tokens, embed_len) and the output is of shape (batch_size, max_tokens, hidden_dim). Please make a NOTE that we have used only a single LSTM layer in this approach. We'll be using multiple layers in our upcoming examples.
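The network in this tutorial draws the initial hidden state and carry from torch.randn() on every forward pass, which means two calls on the same input can give slightly different outputs. As a point of comparison (not the tutorial's method), the sketch below shows that omitting the initial states is equivalent to passing zero tensors. The layer sizes are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=6, num_layers=1, batch_first=True)
x = torch.randn(2, 5, 4)  # (batch_size, max_tokens, embed_len)

# Omitting the initial states makes PyTorch use zeros.
out_default, (h_default, c_default) = lstm(x)

# Passing explicit zero tensors of shape (num_layers, batch_size, hidden_dim)
# gives the same result.
h0 = torch.zeros(1, 2, 6)
c0 = torch.zeros(1, 2, 6)
out_zeros, _ = lstm(x, (h0, c0))

print(out_default.shape)                       # torch.Size([2, 5, 6])
print(torch.allclose(out_default, out_zeros))  # True
```

Zero initial states make predictions repeatable for a given input, which can be convenient when evaluating or explaining a trained model.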

The last layer of the network is a Linear layer with 4 output units, the same as the number of target classes. It takes the last output of the LSTM layer and returns the prediction of the network. Please take a look at the input given to the Linear layer in the forward pass. The LSTM produces an output for every token of an example, but we pass only the output of the last token of each example, which has generally captured the context of the whole example.
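The indexing `output[:, -1]` picks the output at the last token position for every example in the batch. For a unidirectional LSTM this is the same tensor as the final hidden state of the last layer, which the following sketch verifies on made-up sizes.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=6, num_layers=1, batch_first=True)
x = torch.randn(3, 5, 4)  # (batch_size, max_tokens, embed_len)

output, (hidden, carry) = lstm(x)

# output holds one vector per token: (batch_size, max_tokens, hidden_dim).
last_step = output[:, -1]  # (batch_size, hidden_dim)

print(output.shape)     # torch.Size([3, 5, 6])
print(last_step.shape)  # torch.Size([3, 6])

# For a unidirectional LSTM, the last time step equals the final hidden state.
print(torch.allclose(last_step, hidden[-1]))  # True
```

Either tensor could be fed to the Linear layer in this setting; the tutorial uses the `output[:, -1]` form.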

After defining a network, we initialized it, printed the shape of weights/biases of each layer of the network, and performed a forward pass through random data to verify the network.

Please take a look at the below tutorial if you are new to PyTorch and want to learn how to create a neural network using it first. It'll help you sail through this tutorial faster.

Simple Guide to Create PyTorch Neural Networks using High-level "nn" Module

from torch import nn
from torch.nn import functional as F

embed_len = 50
hidden_dim = 75
n_layers=1

class LSTMClassifier(nn.Module):
    def __init__(self):
        super(LSTMClassifier, self).__init__()
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(target_classes))

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        hidden, carry = torch.randn(n_layers, len(X_batch), hidden_dim), torch.randn(n_layers, len(X_batch), hidden_dim)
        output, (hidden, carry) = self.lstm(embeddings, (hidden, carry))
        return self.linear(output[:,-1])

lstm_classifier = LSTMClassifier()

lstm_classifier

LSTMClassifier(
  (embedding_layer): Embedding(98635, 50)
  (lstm): LSTM(50, 75, batch_first=True)
  (linear): Linear(in_features=75, out_features=4, bias=True)
)

for layer in lstm_classifier.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print()

Layer : Embedding(98635, 50)
Parameters :
torch.Size([98635, 50])

Layer : LSTM(50, 75, batch_first=True)
Parameters :
torch.Size([300, 50])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])

Layer : Linear(in_features=75, out_features=4, bias=True)
Parameters :
torch.Size([4, 75])
torch.Size([4])
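The leading dimension of 300 in the LSTM parameter shapes printed above comes from the four gates of an LSTM cell (input, forget, cell, and output), each of size hidden_dim = 75, whose weights PyTorch stacks into a single tensor. The arithmetic can be reproduced without PyTorch; the helper name `count` is made up for this sketch.

```python
vocab_len, embed_len, hidden_dim, n_classes = 98635, 50, 75, 4
gates = 4  # input, forget, cell, and output gates

# Shapes of the four parameter tensors of a single-layer LSTM.
lstm_shapes = {
    "weight_ih": (gates * hidden_dim, embed_len),
    "weight_hh": (gates * hidden_dim, hidden_dim),
    "bias_ih": (gates * hidden_dim,),
    "bias_hh": (gates * hidden_dim,),
}

def count(shape):
    """Number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

embedding_params = vocab_len * embed_len
lstm_params = sum(count(shape) for shape in lstm_shapes.values())
linear_params = n_classes * hidden_dim + n_classes  # weights plus biases

print(lstm_shapes["weight_ih"], lstm_shapes["weight_hh"])  # (300, 50) (300, 75)
print(embedding_params, lstm_params, linear_params)        # 4931750 38100 304
```

Almost all trainable parameters of this network sit in the embedding layer, because its size grows with the vocabulary length.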

out = lstm_classifier(torch.randint(0, len(vocab), (1024, max_words)))

out.shape

torch.Size([1024, 4])

Train Network

In this section, we have trained a network that we defined in the previous section. In order to train the network, we have defined a function that will perform training when called.

The function takes the model, loss function, optimizer, train data loader, validation data loader, and number of epochs as input. It then executes the training loop once per epoch. For each epoch, it loops through the whole training data in batches using the train data loader. For each batch of data, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates network weights. It records the loss for each batch and prints the average loss of all batches at the end of the epoch. It also calculates validation loss and accuracy using a helper function and prints them as well.

from tqdm import tqdm
from sklearn.metrics import accuracy_score
import gc

def CalcValLossAndAccuracy(model, loss_fn, val_loader):
    with torch.no_grad():
        Y_shuffled, Y_preds, losses = [],[],[]
        for X, Y in val_loader:
            preds = model(X)
            loss = loss_fn(preds, Y)
            losses.append(loss.item())

            Y_shuffled.append(Y)
            Y_preds.append(preds.argmax(dim=-1))

        Y_shuffled = torch.cat(Y_shuffled)
        Y_preds = torch.cat(Y_preds)

        print("Valid Loss : {:.3f}".format(torch.tensor(losses).mean()))
        print("Valid Acc  : {:.3f}".format(accuracy_score(Y_shuffled.detach().numpy(), Y_preds.detach().numpy())))


def TrainModel(model, loss_fn, optimizer, train_loader, val_loader, epochs=10):
    for i in range(1, epochs+1):
        losses = []
        for X, Y in tqdm(train_loader):
            Y_preds = model(X) ## Make Predictions

            loss = loss_fn(Y_preds, Y) ## Calculate Loss
            losses.append(loss.item())

            optimizer.zero_grad() ## Clear previously calculated gradients
            loss.backward() ## Calculates Gradients
            optimizer.step() ## Update network weights.

        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
        CalcValLossAndAccuracy(model, loss_fn, val_loader)

Below, we are actually training our network. We have set the number of epochs to 10 and the learning rate to 0.001. Then, we have initialized the cross entropy loss function, our LSTM text classifier, and the Adam optimizer. At last, we have called our training function with the necessary arguments to perform training. We can notice from the loss and accuracy values printed after each epoch that validation accuracy rises from about 0.69 to about 0.88, so our model is doing a good job at the classification task.

from torch.optim import Adam

epochs = 10
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
lstm_classifier = LSTMClassifier()
optimizer = Adam(lstm_classifier.parameters(), lr=learning_rate)

TrainModel(lstm_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)

100%|██████████| 118/118 [00:30<00:00,  3.92it/s]

Train Loss : 1.151
Valid Loss : 0.812
Valid Acc  : 0.685

100%|██████████| 118/118 [00:28<00:00,  4.08it/s]

Train Loss : 0.659
Valid Loss : 0.550
Valid Acc  : 0.795

100%|██████████| 118/118 [00:28<00:00,  4.08it/s]

Train Loss : 0.487
Valid Loss : 0.468
Valid Acc  : 0.833

100%|██████████| 118/118 [00:28<00:00,  4.07it/s]

Train Loss : 0.406
Valid Loss : 0.420
Valid Acc  : 0.851

100%|██████████| 118/118 [00:30<00:00,  3.90it/s]

Train Loss : 0.352
Valid Loss : 0.397
Valid Acc  : 0.861

100%|██████████| 118/118 [00:31<00:00,  3.80it/s]

Train Loss : 0.313
Valid Loss : 0.366
Valid Acc  : 0.870

100%|██████████| 118/118 [00:31<00:00,  3.79it/s]

Train Loss : 0.282
Valid Loss : 0.354
Valid Acc  : 0.878

100%|██████████| 118/118 [00:32<00:00,  3.63it/s]

Train Loss : 0.257
Valid Loss : 0.354
Valid Acc  : 0.878

100%|██████████| 118/118 [00:31<00:00,  3.79it/s]

Train Loss : 0.234
Valid Loss : 0.350
Valid Acc  : 0.884

100%|██████████| 118/118 [00:31<00:00,  3.70it/s]

Train Loss : 0.214
Valid Loss : 0.353
Valid Acc  : 0.882
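A useful reference point for reading the loss values logged above: cross-entropy for one example is the negative log of the probability assigned to the correct class, so a model that guesses uniformly over four classes scores ln(4), about 1.386. The first-epoch train loss of 1.151 is already below that baseline. The function name and the example probabilities below are made up for illustration.

```python
import math

def cross_entropy(probabilities, target):
    """Cross-entropy of one example: -log of the probability of the true class."""
    return -math.log(probabilities[target])

n_classes = 4

# A model with no information assigns equal probability to every class.
uniform = [1.0 / n_classes] * n_classes
print(round(cross_entropy(uniform, target=2), 3))  # 1.386

# A confident and correct prediction yields a much smaller loss.
confident = [0.05, 0.05, 0.85, 0.05]
print(round(cross_entropy(confident, target=2), 3))  # 0.163
```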

Evaluate Network Performance

In this section, we have evaluated the performance of our network by calculating ML metrics like accuracy, classification report (precision, recall, and f1-score per target class) and confusion matrix on test predictions. We have created a helper function that takes the model and data loader as input and returns predictions. We can notice from the accuracy score that our model seems to have done a good job at the text classification task.

We have used various functions available from scikit-learn to calculate ML Metrics. Please feel free to check the below link if you want to learn about various ML metrics available from sklearn in-depth.

Scikit-Learn - Model Evaluation & Scoring Metrics

We have also created a visualization of the confusion matrix using scikit-plot. We can notice from the visualization that our model seems to be doing a good job for Sports and World categories compared to Business and Sci/Tech.

Please feel free to check the below link if you want to learn about scikit-plot and various ML metrics visualizations available from it.

Scikit-Plot: Visualizing Machine Learning Algorithm Results & Performance Metrics

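def MakePredictions(model, loader):
    Y_shuffled, Y_preds = [], []
    with torch.no_grad(): ## Gradients are not needed for inference, so skip building the autograd graph
        for X, Y in loader:
            preds = model(X)
            Y_preds.append(preds)
            Y_shuffled.append(Y)
    gc.collect()
    Y_preds, Y_shuffled = torch.cat(Y_preds), torch.cat(Y_shuffled)

    ## Returns actual labels and predicted labels (class with the highest probability)
    return Y_shuffled.detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).detach().numpy()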

Y_actual, Y_preds = MakePredictions(lstm_classifier, test_loader)

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))

Test Accuracy : 0.8826315789473684

Classification Report :
              precision    recall  f1-score   support

       World       0.90      0.89      0.89      1900
      Sports       0.92      0.96      0.94      1900
    Business       0.84      0.86      0.85      1900
    Sci/Tech       0.87      0.83      0.85      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600


Confusion Matrix :
[[1685   68   89   58]
 [  35 1822   27   16]
 [  72   42 1633  153]
 [  88   46  198 1568]]
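The accuracy and the per-class precision and recall reported above can all be derived from the confusion matrix, where rows are actual classes and columns are predicted classes. The sketch below recomputes them in plain Python from the matrix printed above; the variable names are made up for illustration.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted).
confusion = [
    [1685, 68, 89, 58],
    [35, 1822, 27, 16],
    [72, 42, 1633, 153],
    [88, 46, 198, 1568],
]
classes = ["World", "Sports", "Business", "Sci/Tech"]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total
print("accuracy", round(accuracy, 4))

for i, name in enumerate(classes):
    # Recall: correct predictions divided by all actual examples of the class (row sum).
    recall = confusion[i][i] / sum(confusion[i])
    # Precision: correct predictions divided by all predictions of the class (column sum).
    precision = confusion[i][i] / sum(row[i] for row in confusion)
    print(name, round(precision, 2), round(recall, 2))
```

The largest off-diagonal entries are between Business and Sci/Tech (153 and 198), which is where the model confuses classes most often.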

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actual], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Predictions Using LIME Algorithm

In this section, we have explained the predictions made by our model using the LIME algorithm, which is commonly used to explain predictions of black-box models such as neural networks and is available through the lime Python library. It lets us create a visualization showing which words contributed to predicting a particular target label/category.

If you are new to LIME and have no background on it then we recommend that you go through the below link to understand it.

How to Use LIME to Understand sklearn Models Predictions?

Below, we have simply loaded samples of the test dataset and their target labels.

X_test_text, Y_test = [], []
for Y, X in test_dataset:
    X_test_text.append(X)
    Y_test.append(Y-1)

len(X_test_text)

In order to explain predictions using LIME, we need to create an instance of LimeTextExplainer first. Then, we need to call explain_instance() method on it to create Explanation object which has explanation details. At last, we need to call show_in_notebook() method on Explanation object to create a visualization showing an explanation that has words highlighted from the text that contributed to predicting a particular target label.

Below, we have first initialized LimeTextExplainer. Then, we have defined a helper function that takes a list of text examples as input and returns their predicted probabilities. The function performs tokenizing and vectorization of text before giving it to the network to make predictions.

Then, we have randomly selected a text example from the test dataset and made a prediction on it using our trained model. Our model correctly predicts the target category as Business for it.

from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True) ## Define Explainer

def make_predictions(X_batch_text): ## Prediction Function
    X = [vocab(tokenizer(text)) for text in X_batch_text]
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.
    logits = lstm_classifier(torch.tensor(X, dtype=torch.int32))
    preds = F.softmax(logits, dim=-1)
    return preds.detach().numpy()

## Randomly Select test example and make prediction on it.
rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = [vocab(tokenizer(text)) for text in X_test_text[idx:idx+1]]
X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.
preds = lstm_classifier(torch.tensor(X, dtype=torch.int32))
preds = F.softmax(preds, dim=-1)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

Prediction :  Business
Actual :      Business

Below, we have called explain_instance() method with the selected text example, the helper function, and the target label of the text example. The method returned an Explanation object on which we have called show_in_notebook() method to generate a visualization explaining the prediction. We can notice from the visualization that words like 'financing', 'united', 'concessions', 'bankruptcy', 'cuts', etc. are contributing to predicting the target label as Business, which makes sense as these are commonly used words in the business world.

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions,
                                         labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
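To build intuition for what a word-level explanation measures, the sketch below scores each word by how much a prediction drops when that word is removed. This is a deliberately simplified stand-in and not LIME's algorithm: LIME samples many random perturbations of the text and fits a local linear model to them. The toy classifier, its keyword list, and both function names are hypothetical.

```python
def toy_business_probability(text):
    """Hypothetical classifier: probability grows with the number of business keywords."""
    keywords = {"bankruptcy", "financing", "cuts"}
    words = text.lower().split()
    hits = sum(word in keywords for word in words)
    return hits / (hits + 1)

def leave_one_out_importance(text, predict):
    """Score each word by how much the prediction drops when it is removed."""
    words = text.split()
    base = predict(text)
    scores = {}
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores[word] = base - predict(reduced)
    return scores

scores = leave_one_out_importance(
    "airline seeks financing after bankruptcy", toy_business_probability
)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:12s} {score:+.3f}")
```

Words whose removal lowers the predicted probability get a positive score, which is the same reading as the highlighted words in the LIME visualization.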

Approach 2: Single LSTM Layer (Tokens Per Text Example=50, Embeddings Length=50, LSTM Output=75)

Our approach in this section is almost the same as our approach in the previous section, as it again uses a single LSTM layer in the network. The only difference is the maximum number of tokens that we keep per text example, which we have increased from 25 to 50. The majority of the code in this section is the same as in the previous section.

Load Dataset And Create Data Loaders

Below, we have again loaded our datasets and created a data loader from it. This time, we have set max_words to 50 to keep the first 50 tokens per text example.

from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset

train_dataset, test_dataset  = torchtext.datasets.AG_NEWS()
train_dataset, test_dataset  = to_map_style_dataset(train_dataset), to_map_style_dataset(test_dataset)

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

max_words = 50

train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset , batch_size=1024, collate_fn=vectorize_batch)

for X, Y in train_loader:
    print(X.shape, Y.shape)
    break

torch.Size([1024, 50]) torch.Size([1024])

Define LSTM Network

Below, we have defined our LSTM network which is exactly the same as our previous approach consisting of three layers (Embedding, LSTM, and Linear).

from torch import nn
from torch.nn import functional as F

embed_len = 50
hidden_dim = 75
n_layers=1

class LSTMClassifier(nn.Module):
    def __init__(self):
        super(LSTMClassifier, self).__init__()
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(target_classes))

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        hidden, carry = torch.randn(n_layers, len(X_batch), hidden_dim), torch.randn(n_layers, len(X_batch), hidden_dim)
        output, (hidden, carry) = self.lstm(embeddings, (hidden, carry))
        return self.linear(output[:,-1])

Train Network

Here, we have trained our network for 10 epochs with a learning rate of 0.001. The settings of training are the same as our previous approach. We'll be training all our networks using the same settings to make comparison easy. We can notice from the loss and accuracy getting printed after each epoch that our model seems to be doing a good job.

from torch.optim import Adam

epochs = 10
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
lstm_classifier = LSTMClassifier()
optimizer = Adam(lstm_classifier.parameters(), lr=learning_rate)

TrainModel(lstm_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)

100%|██████████| 118/118 [01:01<00:00,  1.92it/s]

Train Loss : 1.335
Valid Loss : 1.194
Valid Acc  : 0.461

100%|██████████| 118/118 [01:01<00:00,  1.91it/s]

Train Loss : 0.889
Valid Loss : 0.675
Valid Acc  : 0.745

100%|██████████| 118/118 [00:57<00:00,  2.04it/s]

Train Loss : 0.542
Valid Loss : 0.521
Valid Acc  : 0.818

100%|██████████| 118/118 [00:56<00:00,  2.08it/s]

Train Loss : 0.421
Valid Loss : 0.431
Valid Acc  : 0.850

100%|██████████| 118/118 [01:00<00:00,  1.94it/s]

Train Loss : 0.352
Valid Loss : 0.382
Valid Acc  : 0.868

100%|██████████| 118/118 [00:57<00:00,  2.06it/s]

Train Loss : 0.310
Valid Loss : 0.356
Valid Acc  : 0.877

100%|██████████| 118/118 [00:57<00:00,  2.04it/s]

Train Loss : 0.273
Valid Loss : 0.350
Valid Acc  : 0.881

100%|██████████| 118/118 [00:56<00:00,  2.10it/s]

Train Loss : 0.247
Valid Loss : 0.339
Valid Acc  : 0.884

100%|██████████| 118/118 [00:54<00:00,  2.16it/s]

Train Loss : 0.225
Valid Loss : 0.330
Valid Acc  : 0.888

100%|██████████| 118/118 [00:51<00:00,  2.30it/s]

Train Loss : 0.208
Valid Loss : 0.324
Valid Acc  : 0.895

Evaluate Network Performance

In this section, we have evaluated the performance of our network by calculating accuracy, classification report and confusion matrix metrics on test predictions. We can notice from the accuracy that it has improved a little bit compared to our previous approach. We can notice from the confusion matrix plot that our model is good at classifying text examples of categories Sci/Tech, Sports, and World compared to category Business.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_actual, Y_preds = MakePredictions(lstm_classifier, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))

Test Accuracy : 0.8957894736842106

Classification Report :
              precision    recall  f1-score   support

       World       0.91      0.89      0.90      1900
      Sports       0.93      0.97      0.95      1900
    Business       0.89      0.83      0.86      1900
    Sci/Tech       0.85      0.89      0.87      1900

    accuracy                           0.90      7600
   macro avg       0.90      0.90      0.90      7600
weighted avg       0.90      0.90      0.90      7600


Confusion Matrix :
[[1699   71   64   66]
 [  21 1839   13   27]
 [  82   23 1582  213]
 [  63   34  115 1688]]

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actual], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Predictions Using LIME Algorithm

In this section, we have again tried to explain the prediction made by our trained model on the randomly selected text example from the test dataset using the LIME algorithm. Our model has correctly predicted the target label as Business for the randomly selected text example. From the visualization generated using LIME, we can notice that words like 'bankruptcy', 'employees', 'airlines', 'cuts', 'pensions', 'labor', 'concessions', etc. are contributing to predicting the target label as Business.

from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = [vocab(tokenizer(text)) for text in X_test_text[idx:idx+1]]
X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.
preds = lstm_classifier(torch.tensor(X, dtype=torch.int32))
preds = F.softmax(preds, dim=-1)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions,
                                         labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

Approach 3: Multiple LSTM Layers (Tokens Per Text Example=50, Embeddings Length=50, LSTM Output=75)

So far, both approaches that we tried used a single LSTM layer. Our approach in this section uses 3 LSTM layers in sequence, where each layer takes the output sequence of the previous layer as its input. We have tried this approach to check whether it further helps improve the performance of the network. The majority of the code is the same as our previous approaches with minor changes in network definition.

Define LSTM Network

Below, we have defined the network that we'll be using in this section. The network definition is exactly the same as our previous approaches with only a change in n_layers which is set to 3 to inform LSTM() constructor to create three LSTM layers.

As usual, after defining the network, we have initialized it and printed the shape of weights/biases of layers.

from torch import nn
from torch.nn import functional as F

embed_len = 50
hidden_dim = 75
n_layers=3

class LSTMClassifier(nn.Module):
    def __init__(self):
        super(LSTMClassifier, self).__init__()
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(target_classes))

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        hidden, carry = torch.randn(n_layers, len(X_batch), hidden_dim), torch.randn(n_layers, len(X_batch), hidden_dim)
        output, (hidden, carry) = self.lstm(embeddings, (hidden, carry))
        return self.linear(output[:,-1])

lstm_classifier = LSTMClassifier()

lstm_classifier

LSTMClassifier(
  (embedding_layer): Embedding(98635, 50)
  (lstm): LSTM(50, 75, num_layers=3, batch_first=True)
  (linear): Linear(in_features=75, out_features=4, bias=True)
)
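
To see what num_layers=3 changes, it can help to inspect the tensors that nn.LSTM returns. The following is a minimal sketch on random data, independent of the tutorial's dataset. The batch size of 8 is an arbitrary assumption, and the initial states are left at PyTorch's default of zeros rather than the random states used in the network above.

```python
import torch
from torch import nn

torch.manual_seed(0)

batch_size, seq_len, embed_len, hidden_dim, n_layers = 8, 50, 50, 75, 3
lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim,
               num_layers=n_layers, batch_first=True)

embeddings = torch.randn(batch_size, seq_len, embed_len)
output, (hidden, carry) = lstm(embeddings)  # initial states default to zeros

print(output.shape)  # (batch, seq_len, hidden_dim): top layer, every time step
print(hidden.shape)  # (n_layers, batch, hidden_dim): last time step, every layer
print(carry.shape)   # same shape as hidden

# The last time step of the top layer is what the Linear layer receives.
same = torch.allclose(output[:, -1], hidden[-1])
print(same)
```

Only the top layer's outputs are returned in output; the lower layers are visible only through their final states in hidden and carry.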

for layer in lstm_classifier.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print()

Layer : Embedding(98635, 50)
Parameters :
torch.Size([98635, 50])

Layer : LSTM(50, 75, num_layers=3, batch_first=True)
Parameters :
torch.Size([300, 50])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])
torch.Size([300, 75])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])
torch.Size([300, 75])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])

Layer : Linear(in_features=75, out_features=4, bias=True)
Parameters :
torch.Size([4, 75])
torch.Size([4])
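
The 300 in every LSTM shape above is 4 x 75: PyTorch stores the weights of the four LSTM gates (input, forget, cell, output) stacked in one matrix. As a rough check, the shapes can be reproduced with plain arithmetic. The helper name lstm_param_shapes below is hypothetical and covers only the unidirectional case.

```python
from math import prod

def lstm_param_shapes(input_size, hidden_size, num_layers):
    """List parameter shapes of a unidirectional LSTM, layer by layer."""
    shapes = []
    for layer in range(num_layers):
        # Layers after the first read the previous layer's hidden output.
        layer_input = input_size if layer == 0 else hidden_size
        shapes.append((4 * hidden_size, layer_input))   # input-to-hidden weights
        shapes.append((4 * hidden_size, hidden_size))   # hidden-to-hidden weights
        shapes.append((4 * hidden_size,))               # input-to-hidden bias
        shapes.append((4 * hidden_size,))               # hidden-to-hidden bias
    return shapes

shapes = lstm_param_shapes(input_size=50, hidden_size=75, num_layers=3)
total = sum(prod(shape) for shape in shapes)

for shape in shapes:
    print(shape)
print("Total LSTM parameters :", total)
```

Only the first layer has a (300, 50) input matrix, because it alone reads the 50-dimensional embeddings.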

Train Network¶

Below, we have trained our network using the same settings that we have been using for all our approaches so far. The loss and accuracy printed after each epoch show that the model is learning the task: validation accuracy rises from 0.573 after the first epoch to 0.897 after the last.

from torch.optim import Adam

epochs = 10
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
lstm_classifier = LSTMClassifier()
optimizer = Adam(lstm_classifier.parameters(), lr=learning_rate)

TrainModel(lstm_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)

100%|██████████| 118/118 [02:09<00:00,  1.10s/it]

Train Loss : 1.209
Valid Loss : 0.953
Valid Acc  : 0.573

100%|██████████| 118/118 [02:18<00:00,  1.18s/it]

Train Loss : 0.775
Valid Loss : 0.642
Valid Acc  : 0.739

100%|██████████| 118/118 [02:15<00:00,  1.15s/it]

Train Loss : 0.526
Valid Loss : 0.506
Valid Acc  : 0.818

100%|██████████| 118/118 [02:06<00:00,  1.07s/it]

Train Loss : 0.404
Valid Loss : 0.425
Valid Acc  : 0.847

100%|██████████| 118/118 [02:05<00:00,  1.06s/it]

Train Loss : 0.339
Valid Loss : 0.367
Valid Acc  : 0.873

100%|██████████| 118/118 [02:07<00:00,  1.08s/it]

Train Loss : 0.297
Valid Loss : 0.348
Valid Acc  : 0.884

100%|██████████| 118/118 [02:06<00:00,  1.07s/it]

Train Loss : 0.268
Valid Loss : 0.344
Valid Acc  : 0.882

100%|██████████| 118/118 [02:12<00:00,  1.12s/it]

Train Loss : 0.246
Valid Loss : 0.323
Valid Acc  : 0.890

100%|██████████| 118/118 [02:07<00:00,  1.08s/it]

Train Loss : 0.223
Valid Loss : 0.332
Valid Acc  : 0.889

100%|██████████| 118/118 [02:10<00:00,  1.10s/it]

Train Loss : 0.207
Valid Loss : 0.306
Valid Acc  : 0.897

Evaluate Network Performance¶

In this section, we have evaluated the accuracy, classification report, and confusion matrix on test predictions. The test accuracy (89.51%) is almost the same as in the previous section (89.57%), so the extra layers bring no improvement. The confusion matrix plot generated using scikit-plot shows that the model is most accurate on Sports, while Business and Sci/Tech are the two categories most often confused with each other.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_actual, Y_preds = MakePredictions(lstm_classifier, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))

Test Accuracy : 0.8951315789473684

Classification Report :
              precision    recall  f1-score   support

       World       0.91      0.88      0.89      1900
      Sports       0.94      0.96      0.95      1900
    Business       0.86      0.87      0.87      1900
    Sci/Tech       0.87      0.87      0.87      1900

    accuracy                           0.90      7600
   macro avg       0.90      0.90      0.89      7600
weighted avg       0.90      0.90      0.89      7600


Confusion Matrix :
[[1666   71   80   83]
 [  35 1833   20   12]
 [  64   24 1651  161]
 [  65   18  164 1653]]
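
The figures in the classification report can be re-derived from the confusion matrix alone. The sketch below copies the matrix printed above and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
target_classes = ["World", "Sports", "Business", "Sci/Tech"]
conf_mat = [
    [1666,   71,   80,   83],
    [  35, 1833,   20,   12],
    [  64,   24, 1651,  161],
    [  65,   18,  164, 1653],
]

n = len(conf_mat)
total = sum(sum(row) for row in conf_mat)
correct = sum(conf_mat[i][i] for i in range(n))
accuracy = correct / total

# Recall: correct predictions divided by the row total (actual examples).
recall = [conf_mat[i][i] / sum(conf_mat[i]) for i in range(n)]
# Precision: correct predictions divided by the column total (predicted examples).
precision = [conf_mat[i][i] / sum(conf_mat[r][i] for r in range(n)) for i in range(n)]

print("accuracy : {:.4f}".format(accuracy))
for name, p, r in zip(target_classes, precision, recall):
    print("{:>8} precision={:.2f} recall={:.2f}".format(name, p, r))
```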

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actual], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Predictions Using LIME Algorithm¶

In this section, we have again explained the prediction made by our trained network on a randomly selected test example using the LIME algorithm. Our network correctly predicts the target label as Business for the selected text example. The visualization shows that words like 'bankruptcy', 'employees', 'airlines', 'labor', 'cuts', and 'pensions' contribute to predicting the target label as Business, which makes sense.

from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = [vocab(tokenizer(text)) for text in X_test_text[idx:idx+1]]
X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.
preds = lstm_classifier(torch.tensor(X, dtype=torch.int32))
preds = F.softmax(preds, dim=-1)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions,
                                         labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

Approach 4: Stacking Multiple LSTM Layers (Tokens Per Text Example=50, Embeddings Length=50, LSTM Output=75) ¶

Our approach in this section again uses multiple LSTM layers, but this time the output size of each LSTM layer is different, unlike our previous approach where it was the same for all 3 LSTM layers. The only change in the code is in the network definition. The rest of the code is exactly the same as in our previous approaches.

Define LSTM Network¶

Below, we have defined the network that we'll be using for our task in this section. This time, we have defined three separate LSTM layers with different hidden_size values (50, 60, and 75). The output of the embedding layer is given to the first LSTM layer, whose output is given to the second LSTM layer. The output of the second LSTM layer is given to the third, and the output of the last LSTM layer is given to the Linear layer. Note that we have defined hidden and carry for each LSTM layer separately, because their shapes depend on each layer's hidden size.

After defining the network, we initialized it and printed the shape of weights/biases of layers.

import torch
from torch import nn
from torch.nn import functional as F

embed_len = 50
hidden_dim1 = 50
hidden_dim2 = 60
hidden_dim3 = 75
n_layers=1

class LSTMClassifier(nn.Module):
    def __init__(self):
        super(LSTMClassifier, self).__init__()
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.lstm1 = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim1, num_layers=1, batch_first=True)
        self.lstm2 = nn.LSTM(input_size=hidden_dim1, hidden_size=hidden_dim2, num_layers=1, batch_first=True)
        self.lstm3 = nn.LSTM(input_size=hidden_dim2, hidden_size=hidden_dim3, num_layers=1, batch_first=True)
        self.linear = nn.Linear(hidden_dim3, len(target_classes))

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        hidden, carry = torch.randn(n_layers, len(X_batch), hidden_dim1), torch.randn(n_layers, len(X_batch), hidden_dim1)
        output, (hidden, carry) = self.lstm1(embeddings, (hidden, carry))

        hidden, carry = torch.randn(n_layers, len(X_batch), hidden_dim2), torch.randn(n_layers, len(X_batch), hidden_dim2)
        output, (hidden, carry) = self.lstm2(output, (hidden, carry))

        hidden, carry = torch.randn(n_layers, len(X_batch), hidden_dim3), torch.randn(n_layers, len(X_batch), hidden_dim3)
        output, (hidden, carry) = self.lstm3(output, (hidden, carry))
        return self.linear(output[:,-1])

lstm_classifier = LSTMClassifier()

lstm_classifier

LSTMClassifier(
  (embedding_layer): Embedding(98635, 50)
  (lstm1): LSTM(50, 50, batch_first=True)
  (lstm2): LSTM(50, 60, batch_first=True)
  (lstm3): LSTM(60, 75, batch_first=True)
  (linear): Linear(in_features=75, out_features=4, bias=True)
)
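
The way the feature size changes from one LSTM layer to the next can be traced on random data. This is a minimal sketch, not the tutorial's network: the batch size of 4 is arbitrary, and no initial states are passed, so PyTorch falls back to zeros.

```python
import torch
from torch import nn

batch_size, seq_len = 4, 50   # arbitrary batch size for illustration
sizes = [50, 50, 60, 75]      # embed_len, hidden_dim1, hidden_dim2, hidden_dim3

# Each layer's input_size must equal the previous layer's hidden_size.
layers = [nn.LSTM(input_size=i, hidden_size=o, batch_first=True)
          for i, o in zip(sizes[:-1], sizes[1:])]

x = torch.randn(batch_size, seq_len, sizes[0])
shapes = [tuple(x.shape)]
for lstm in layers:
    x, _ = lstm(x)            # keep the per-step outputs, drop (hidden, carry)
    shapes.append(tuple(x.shape))

print(shapes)
```

The sequence length stays at 50 throughout; only the last dimension changes, ending at the 75 features that the Linear layer expects.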

for layer in lstm_classifier.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print()

Layer : Embedding(98635, 50)
Parameters :
torch.Size([98635, 50])

Layer : LSTM(50, 50, batch_first=True)
Parameters :
torch.Size([200, 50])
torch.Size([200, 50])
torch.Size([200])
torch.Size([200])

Layer : LSTM(50, 60, batch_first=True)
Parameters :
torch.Size([240, 50])
torch.Size([240, 60])
torch.Size([240])
torch.Size([240])

Layer : LSTM(60, 75, batch_first=True)
Parameters :
torch.Size([300, 60])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])

Layer : Linear(in_features=75, out_features=4, bias=True)
Parameters :
torch.Size([4, 75])
torch.Size([4])

Train Network¶

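Below, we have trained our network using the same settings that we have been using for all our approaches. Validation accuracy reaches 0.883 at epoch 8 and ends at 0.878. Validation loss rises in the final epoch (from 0.339 to 0.370) while training loss keeps falling, which can be an early sign of overfitting.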

from torch.optim import Adam

epochs = 10
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
lstm_classifier = LSTMClassifier()
optimizer = Adam(lstm_classifier.parameters(), lr=learning_rate)

TrainModel(lstm_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)

100%|██████████| 118/118 [01:32<00:00,  1.27it/s]

Train Loss : 1.245
Valid Loss : 1.030
Valid Acc  : 0.505

100%|██████████| 118/118 [01:32<00:00,  1.27it/s]

Train Loss : 0.860
Valid Loss : 0.715
Valid Acc  : 0.711

100%|██████████| 118/118 [01:35<00:00,  1.24it/s]

Train Loss : 0.542
Valid Loss : 0.483
Valid Acc  : 0.824

100%|██████████| 118/118 [01:37<00:00,  1.21it/s]

Train Loss : 0.408
Valid Loss : 0.427
Valid Acc  : 0.843

100%|██████████| 118/118 [01:36<00:00,  1.22it/s]

Train Loss : 0.346
Valid Loss : 0.391
Valid Acc  : 0.864

100%|██████████| 118/118 [01:36<00:00,  1.22it/s]

Train Loss : 0.306
Valid Loss : 0.373
Valid Acc  : 0.873

100%|██████████| 118/118 [01:34<00:00,  1.24it/s]

Train Loss : 0.275
Valid Loss : 0.366
Valid Acc  : 0.876

100%|██████████| 118/118 [01:34<00:00,  1.25it/s]

Train Loss : 0.252
Valid Loss : 0.343
Valid Acc  : 0.883

100%|██████████| 118/118 [01:36<00:00,  1.22it/s]

Train Loss : 0.230
Valid Loss : 0.339
Valid Acc  : 0.882

100%|██████████| 118/118 [01:57<00:00,  1.01it/s]

Train Loss : 0.210
Valid Loss : 0.370
Valid Acc  : 0.878
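
In the log above, the validation loss is lowest at epoch 9 and rises at epoch 10. One common way to act on this is to track the best epoch and stop once the loss has not improved for a few epochs. The sketch below applies that idea to the logged values. It is not part of the tutorial's TrainModel function, and the helper name and the patience value are assumptions.

```python
# Validation losses copied from the training log above (epochs 1 to 10).
valid_losses = [1.030, 0.715, 0.483, 0.427, 0.391, 0.373, 0.366, 0.343, 0.339, 0.370]

def best_epoch(losses, patience=2):
    """Return (best_epoch, last_epoch_run), both 1-indexed.

    Stops early once the loss has failed to improve for `patience`
    consecutive epochs; otherwise runs through all epochs.
    """
    best, best_at, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, best_at, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_at, epoch
    return best_at, len(losses)

print(best_epoch(valid_losses))  # (9, 10)
```

In a real training loop, the model weights from the best epoch would be saved and restored before evaluation.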

Evaluate Network Performance¶

Below, we have evaluated the performance of our trained network by calculating accuracy, classification report, and confusion matrix on test predictions. The accuracy is the lowest of all the approaches we have tried so far. The confusion matrix plot shows that the model does a good job classifying text documents of the categories Sci/Tech, Sports, and World, but performs noticeably worse on the Business category, where recall is only 0.76.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_actual, Y_preds = MakePredictions(lstm_classifier, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))

Test Accuracy : 0.8803947368421052

Classification Report :
              precision    recall  f1-score   support

       World       0.86      0.91      0.88      1900
      Sports       0.93      0.97      0.95      1900
    Business       0.90      0.76      0.83      1900
    Sci/Tech       0.83      0.88      0.86      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600


Confusion Matrix :
[[1722   75   50   53]
 [  18 1842    4   36]
 [ 177   24 1447  252]
 [  80   42   98 1680]]
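
The third row of the confusion matrix above shows where the Business examples end up. Normalizing that row, which is what the normalize=True option of the plot does for every row, makes the weak spot explicit. The row values below are copied from the printed matrix.

```python
target_classes = ["World", "Sports", "Business", "Sci/Tech"]
business_row = [177, 24, 1447, 252]   # actual Business examples, by predicted class

total = sum(business_row)
shares = {name: count / total for name, count in zip(target_classes, business_row)}

for name, share in shares.items():
    print("{:>8} : {:.1%}".format(name, share))
```

Most of the misclassified Business examples are predicted as Sci/Tech, followed by World.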

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actual], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Predictions Using LIME Algorithm¶

In this section, we have again explained the prediction made by our network on a randomly selected test example using the LIME algorithm. The network correctly predicts the target category as Business for the selected sample. The visualization shows that words like 'bankruptcy', 'airlines', 'pensions', 'labor', 'employees', 'financing', and 'concessions' contribute to predicting the target category as Business.

from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = [vocab(tokenizer(text)) for text in X_test_text[idx:idx+1]]
X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.
preds = lstm_classifier(torch.tensor(X, dtype=torch.int32))
preds = F.softmax(preds, dim=-1)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions,
                                         labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

Approach 5: Multiple Bidirectional LSTM Layers (Tokens Per Text Example=50, Embeddings Length=50, LSTM Output=75) ¶

The network in this section has the same structure as in our third approach: 3 LSTM layers. This time, however, all LSTM layers are bidirectional. As mentioned earlier, an LSTM layer goes through the tokens of a text example one at a time, and the output for each token feeds into the calculation for the next one. In a normal LSTM layer, this pass happens in only one direction (forward). A bidirectional LSTM layer processes the tokens in both the forward and backward directions, so it can also capture patterns that are visible only when the text is read in reverse.

Define LSTM Network¶

Our network definition in this section is the same as in the third approach, with minor changes. We have set the bidirectional parameter to True in the LSTM() constructor to tell it to create bidirectional LSTM layers. The number of input units of the Linear layer is 2 times the hidden size of the LSTM layer because the output of the LSTM layer now contains the combined output generated in both directions (forward and backward). For the same reason, the first dimension of the initial hidden and carry states is 2*n_layers.

After defining the network, we initialized it and printed the shape of weights/biases of each layer of the network.

import torch
from torch import nn
from torch.nn import functional as F

embed_len = 50
hidden_dim = 75
n_layers=3

class LSTMClassifier(nn.Module):
    def __init__(self):
        super(LSTMClassifier, self).__init__()
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True,
                            bidirectional=True)
        self.linear = nn.Linear(2*hidden_dim, len(target_classes))  ## Input dimension are 2 times hidden dimensions due to bidirectional results

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        hidden, carry = torch.randn(2*n_layers, len(X_batch), hidden_dim), torch.randn(2*n_layers, len(X_batch), hidden_dim)
        output, (hidden, carry) = self.lstm(embeddings, (hidden, carry))
        return self.linear(output[:,-1])

lstm_classifier = LSTMClassifier()

lstm_classifier

LSTMClassifier(
  (embedding_layer): Embedding(98635, 50)
  (lstm): LSTM(50, 75, num_layers=3, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=150, out_features=4, bias=True)
)
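
With a bidirectional layer, the 150 output features at each time step are the forward and backward hidden states placed side by side. The sketch below, on random data with an arbitrary batch size of 4 and default zero initial states, shows where each direction's final state sits. One consequence is that output[:, -1], as used in the network above, pairs the forward state after all tokens with the backward state after only the last token. Concatenating the two final states is one possible alternative. It is not something this tutorial evaluates.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_dim, n_layers = 75, 3
lstm = nn.LSTM(input_size=50, hidden_size=hidden_dim, num_layers=n_layers,
               batch_first=True, bidirectional=True)

x = torch.randn(4, 50, 50)            # (batch, tokens, embed_len)
output, (hidden, carry) = lstm(x)     # zero initial states by default

forward_final = output[:, -1, :hidden_dim]   # forward pass after the last token
backward_final = output[:, 0, hidden_dim:]   # backward pass after the first token

# hidden is ordered by (layer, direction); the last two entries are the top layer.
fwd_matches = torch.allclose(forward_final, hidden[-2])
bwd_matches = torch.allclose(backward_final, hidden[-1])

# Alternative feature vector: both directions after a full pass over the text.
features = torch.cat([hidden[-2], hidden[-1]], dim=-1)
print(output.shape, hidden.shape, features.shape)
print(fwd_matches, bwd_matches)
```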

for layer in lstm_classifier.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print()

Layer : Embedding(98635, 50)
Parameters :
torch.Size([98635, 50])

Layer : LSTM(50, 75, num_layers=3, batch_first=True, bidirectional=True)
Parameters :
torch.Size([300, 50])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])
torch.Size([300, 50])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])
torch.Size([300, 150])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])
torch.Size([300, 150])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])
torch.Size([300, 150])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])
torch.Size([300, 150])
torch.Size([300, 75])
torch.Size([300])
torch.Size([300])

Layer : Linear(in_features=150, out_features=4, bias=True)
Parameters :
torch.Size([4, 150])
torch.Size([4])

Train Network¶

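Below, we have trained our network using the same settings that we have been using for all our approaches. The loss and accuracy values printed after each epoch show that the model is learning well: validation accuracy rises from 0.537 after the first epoch to 0.894 after the last. Each epoch takes roughly 4 to 5 minutes, about twice as long as with the unidirectional 3-layer network from the third approach.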

from torch.optim import Adam

epochs = 10
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
lstm_classifier = LSTMClassifier()
optimizer = Adam(lstm_classifier.parameters(), lr=learning_rate)

TrainModel(lstm_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)

100%|██████████| 118/118 [05:16<00:00,  2.68s/it]

Train Loss : 1.262
Valid Loss : 1.019
Valid Acc  : 0.537

100%|██████████| 118/118 [05:21<00:00,  2.72s/it]

Train Loss : 0.756
Valid Loss : 0.554
Valid Acc  : 0.796

100%|██████████| 118/118 [05:05<00:00,  2.59s/it]

Train Loss : 0.464
Valid Loss : 0.438
Valid Acc  : 0.841

100%|██████████| 118/118 [04:54<00:00,  2.49s/it]

Train Loss : 0.365
Valid Loss : 0.378
Valid Acc  : 0.865

100%|██████████| 118/118 [04:16<00:00,  2.17s/it]

Train Loss : 0.309
Valid Loss : 0.344
Valid Acc  : 0.878

100%|██████████| 118/118 [04:20<00:00,  2.21s/it]

Train Loss : 0.271
Valid Loss : 0.338
Valid Acc  : 0.885

100%|██████████| 118/118 [04:16<00:00,  2.18s/it]

Train Loss : 0.242
Valid Loss : 0.322
Valid Acc  : 0.891

100%|██████████| 118/118 [04:41<00:00,  2.38s/it]

Train Loss : 0.218
Valid Loss : 0.322
Valid Acc  : 0.889

100%|██████████| 118/118 [05:08<00:00,  2.61s/it]

Train Loss : 0.195
Valid Loss : 0.314
Valid Acc  : 0.895

100%|██████████| 118/118 [04:49<00:00,  2.45s/it]

Train Loss : 0.179
Valid Loss : 0.324
Valid Acc  : 0.894

Evaluate Network Performance¶

In this section, we have evaluated the performance of the network as usual by calculating accuracy, classification report, and confusion matrix on test predictions. The model has almost the same accuracy as our model from the second approach. The confusion matrix plot shows that the network performs well on all target categories.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_actual, Y_preds = MakePredictions(lstm_classifier, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))

Test Accuracy : 0.8956578947368421

Classification Report :
              precision    recall  f1-score   support

       World       0.88      0.92      0.90      1900
      Sports       0.97      0.94      0.95      1900
    Business       0.88      0.85      0.86      1900
    Sci/Tech       0.86      0.87      0.87      1900

    accuracy                           0.90      7600
   macro avg       0.90      0.90      0.90      7600
weighted avg       0.90      0.90      0.90      7600


Confusion Matrix :
[[1746   33   67   54]
 [  70 1780   20   30]
 [  88   12 1621  179]
 [  84   16  140 1660]]

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actual], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Predictions Using LIME Algorithm¶

In this section, we have explained the prediction made by our trained network on a randomly selected test example using the LIME algorithm. The network correctly predicts the target label as Business for the selected text example. Words like 'pensions', 'million', 'concessions', 'labor', and 'employees' contribute to predicting the target label as Business.

from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = [vocab(tokenizer(text)) for text in X_test_text[idx:idx+1]]
X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.
preds = lstm_classifier(torch.tensor(X, dtype=torch.int32))
preds = F.softmax(preds, dim=-1)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions,
                                         labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

7. Results Summary and Further Recommendations

The table below summarizes the settings and test accuracy of the different network approaches that we tried.

Approach	Max Tokens	Embedding Length	LSTM Hidden Dimension	Test Accuracy (%)
Single LSTM Layer	25	50	75	88.26
Single LSTM Layer	50	50	75	89.57
Multiple LSTM Layers (3 Layers)	50	50	75	89.51
Stacking Multiple LSTM Layers (3 Layers)	50	50	50,60,75	88.03
Multiple Bidirectional LSTM Layers (3 Layers)	50	50	75	89.56

Further Suggestions¶

Below, we have listed a few suggestions that can be tried to further improve network accuracy.

Try different tokens per text example.
Try different word embedding lengths.
Try different LSTM hidden sizes.
Add a few more linear/dense layers after LSTM layers.
Stack more LSTM layers.
Try different weight initialization techniques.
Try learning rate scheduling.
Train network for more epochs.
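
As an illustration of the learning rate scheduling suggestion, the sketch below uses PyTorch's StepLR scheduler to halve the learning rate every 3 epochs. The step size, the decay factor, and the small Linear stand-in model are assumptions made for the example. They are not settings used or tuned in this tutorial.

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(75, 4)                      # stand-in for the LSTM classifier
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=3, gamma=0.5)  # halve the rate every 3 epochs

lrs = []
for epoch in range(10):
    lrs.append(optimizer.param_groups[0]["lr"])
    # ... one epoch of training would run here ...
    optimizer.step()      # placeholder; a real loop steps once per batch after backward()
    scheduler.step()      # advance the schedule once per epoch

print(lrs)
```

The learning rate starts at 1e-3 and is halved after epochs 3, 6, and 9, so later epochs take smaller steps.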

This ends our small tutorial explaining how we can create a neural network with LSTM layers using PyTorch and use it for text classification tasks. Please feel free to let us know your views in the comments section.

References¶

PyTorch RNN For Text Classification Tasks

Sunny Solanki

