Updated On : May-05,2022 Tags keras, LSTM, test-classi…

Keras: LSTM Networks For Text Classification Tasks

Recurrent Neural Networks (RNNs) are the preferred networks when working with data that has sequences in it, like time-series data, text data, etc. These kinds of datasets have an internal order that cannot be captured by a neural network consisting only of dense layers, because such a network does not take previous elements into consideration when making a prediction for the current element. RNNs capture this order by carrying information from past elements forward, since they influence the current element. Over the years, variants of RNNs like LSTM and GRU have been introduced, which generally perform better than vanilla RNNs. Below, we have included an image of one cell of the LSTM network. In diagrams, many such cells are drawn next to each other in one LSTM layer, one per step of the sequence.

[Image: one cell of an LSTM network]

Please feel free to check the below link for a theoretical overview of Recurrent Neural Networks.

As a part of this tutorial, we have explained various ways to design LSTM networks using the Python deep learning library Keras for solving text classification tasks. The tutorial explains different approaches to using LSTM layers in a network. To encode text data, we have used the word embeddings approach. We have used the AG NEWS dataset available from torchtext for our purpose. After training the networks, we have also evaluated their performance by calculating various ML metrics and explained their predictions using the LIME algorithm. Please feel free to check the below tutorial if you are looking for a guide to using vanilla RNN layers.

Below, we have listed the essential sections of the tutorial to give an overview of the material covered.

Important Sections Of Tutorial

  1. Prepare Data
    • 1.1 Load Datasets
    • 1.2 Tokenize And Vectorize Text Data
  2. Approach 1: Single LSTM Layer Network (Max Tokens=25, Embedding Length=25, LSTM Output=50)
    • Truncate Data to Max Tokens
    • Define Network
    • Compile Network
    • Train Network
    • Evaluate Network Performance
    • Explain Network Predictions using LIME Algorithm
  3. Approach 2: Single LSTM Layer Network (Max Tokens=50, Embedding Length=25, LSTM Output=50)
  4. Approach 3: Single Bidirectional LSTM Layer Network (Max Tokens=50, Embedding Length=25, LSTM Output=50)
  5. Approach 4: Multiple LSTM Layers Network (Max Tokens=50, Embedding Length=25, LSTM Output=50,60,75)
  6. Approach 5: Multiple Bidirectional LSTM Layers Network (Max Tokens=50, Embedding Length=30, LSTM Output=50,60,75)
  7. Results Summary And Further Suggestions

Below, we have imported the necessary Python libraries and printed the versions we used in our tutorial.

In [1]:
from tensorflow import keras

print("Keras Version : {}".format(keras.__version__))
Keras Version : 2.6.0
In [2]:
import torchtext

print("TorchText Version : {}".format(torchtext.__version__))
TorchText Version : 0.10.1

1. Prepare Data

In this section, we prepare the data to be given to the neural network. We are going to use the word embeddings approach to encode text data, where we break text into tokens (words, punctuation marks, etc.) and assign each token a real-valued vector. This encoding is completed in three steps.

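  1. Create a vocabulary of all unique tokens. The vocabulary is a simple mapping from token to integer index. Each token is assigned a unique integer index (the toy vocabulary shown later starts from 0, whereas the Keras Tokenizer used in this tutorial starts from 1 and leaves 0 free for padding).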
  2. Tokenize each text example into tokens and assign indexes to each token using vocabulary.
  3. Map indexes to real-valued vectors.

The first two steps will be completed in this section whereas the third step will be implemented as an embedding layer in the neural network which will map token indexes to their real-valued vectors.
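
The first two steps can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Keras implementation used later: the regex tokenizer, the first-seen index order, and the function names are all assumptions made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(texts):
    """Step 1: map every unique token to an integer index, in first-seen order."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab)
    vocab["<unk>"] = len(vocab)  # reserved index for tokens not seen while building
    return vocab

def encode(text, vocab):
    """Step 2: replace each token with its index, falling back to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]

corpus = ["Stocks fall again.", "Team wins the cup!"]
vocab = build_vocab(corpus)
indexes = encode("Stocks win the cup?", vocab)
print(vocab)
print(indexes)
```

Tokens that were never seen while building the vocabulary ("win" and "?" here) all share the `<unk>` index, so the network cannot tell them apart.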

Below, we have included an image of word embeddings for explanation purposes.

[Image: word embeddings illustration]

1.1 Load Datasets

In this section, we have simply loaded the AG NEWS dataset available from the torchtext library. The dataset has text documents for 4 different categories (["World", "Sports", "Business", "Sci/Tech"]) of news. The labels in the dataset run from 1 to 4, so the code below subtracts 1 from each label to get class indexes from 0 to 3. We'll be using this dataset for our text classification task.

In [3]:
import numpy as np

train_dataset, test_dataset = torchtext.datasets.AG_NEWS()

X_train_text, Y_train = [], []
for Y, X in train_dataset:
    X_train_text.append(X)
    Y_train.append(Y-1)

X_test_text, Y_test = [], []
for Y, X in test_dataset:
    X_test_text.append(X)
    Y_test.append(Y-1)

Y_train, Y_test = np.array(Y_train), np.array(Y_test)

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

len(X_train_text), len(X_test_text)
train.csv: 29.5MB [00:00, 87.0MB/s]
test.csv: 1.86MB [00:00, 32.3MB/s]
Out[3]:
(120000, 7600)

1.2 Tokenize And Vectorize Text Data

In this section, we have completed the first two steps of our word embedding encoding process explained earlier.

First, we have populated a vocabulary of all unique tokens present in datasets (train and test) by creating an instance of Tokenizer and calling fit_on_texts() function on it with train and test text examples.

Then, we have called the texts_to_sequences() method on the Tokenizer object with train and test examples. This step tokenizes each text example into a list of tokens and retrieves their respective indexes from the vocabulary. The output of this step is a list of indexes per text example.

Below, we have explained with one simple example how vectorization works.

text = "Hello, How are you? Where are you planning to go?"

tokens = ['hello', ',', 'how', 'are', 'you', '?', 'where',
            'are', 'you', 'planning', 'to', 'go', '?']

vocab = {
    'hello': 0,
    'bye': 1,
    'how': 2,
    'the': 3,
    'welcome': 4,
    'are': 5,
    'you': 6,
    'to': 7,
    '<unk>': 8,
}

vector = [0,8,2,5,6,8,8,5,6,8,7,8,8]
In [4]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_text+X_test_text)

X_train = tokenizer.texts_to_sequences(X_train_text)
X_test  = tokenizer.texts_to_sequences(X_test_text)

Approach 1: Single LSTM Layer Network (Max Tokens=25, Embedding Length=25, LSTM Output=50)

The first approach that we'll try for our text classification task consists of a neural network that has only one LSTM layer. Apart from the LSTM layer, the network has an embedding layer for encoding purposes and a dense layer for generating probabilities of target classes. We have trained this network and evaluated its performance by calculating various ML metrics. We have also tried to explain network predictions using the LIME algorithm.

Truncate Data to Max Tokens

We have decided that for our approach in this section, we'll keep 25 tokens per text example. A text document can be of any length, so we have used the pad_sequences() function to truncate all text examples that have more than 25 tokens and to pad examples that have fewer than 25 tokens with 0s, bringing every example to a length of 25. The output of this step will be used during the training of the neural network.
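
The effect of padding and truncating at the end of a sequence can be sketched without Keras. The helper below is a hypothetical stand-in written for this illustration; it mimics only the padding="post", truncating="post" behaviour requested in the tutorial's call, not the full pad_sequences() function.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Keep the first maxlen indexes, then right-pad with `value` up to maxlen."""
    kept = sequence[:maxlen]                       # "post" truncation drops the tail
    return kept + [value] * (maxlen - len(kept))   # "post" padding appends the filler

short_example = [12, 7, 301]          # 3 token indexes
long_example = list(range(1, 31))     # 30 token indexes

padded_short = pad_or_truncate(short_example, maxlen=25)
padded_long = pad_or_truncate(long_example, maxlen=25)
print(padded_short)
print(len(padded_long), padded_long[-1])
```

The short example keeps its 3 indexes followed by 22 zeros, while the long example loses its last 5 indexes.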

In [5]:
from keras.preprocessing.sequence import pad_sequences

max_tokens=25

X_train_pad = pad_sequences(X_train, maxlen=max_tokens, padding="post", truncating="post", value=0)
X_test_pad  = pad_sequences(X_test , maxlen=max_tokens, padding="post", truncating="post", value=0)

X_train_pad.shape, X_test_pad.shape
Out[5]:
((120000, 25), (7600, 25))

Define Network

In this section, we have defined the first neural network that we'll be using for our text classification task. The network consists of 3 layers.

  1. Embedding Layer
  2. LSTM Layer
  3. Dense Layer

The first layer of the network is the Embedding layer, which is responsible for mapping token indexes to their respective real-valued vectors. We have defined it using the Embedding() constructor available from the layers sub-module of keras. We have given the constructor the vocabulary length (the number of tokens in the tokenizer's vocabulary plus one, because index 0 is kept for padding) and an embedding length of 25. Initializing this layer creates an internal weight matrix of shape (vocab_len, embed_len) that holds one embedding per token of the vocabulary. When we give a batch of index lists to the network, this layer simply looks up the embedding of each token index in this matrix. The input shape to this layer will be (batch_size, max_tokens) = (batch_size, 25) and the output shape will be (batch_size, max_tokens, embed_len) = (batch_size, 25, 25).
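
The lookup performed by an embedding layer can be illustrated with nested lists. This is a toy sketch with made-up sizes and random weights, not the Keras layer itself. The last line relies on one derived assumption: the 1,800,075 embedding parameters reported in the model summary, divided by the embedding length of 25, imply a vocabulary length of 72,003.

```python
import random

vocab_len, embed_len = 10, 4   # toy sizes; the tutorial uses embed_len = 25
random.seed(0)

# Weight matrix of shape (vocab_len, embed_len): one row per token index.
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embed_len)]
                    for _ in range(vocab_len)]

def embed(batch):
    """Map a (batch_size, max_tokens) batch of indexes to (batch_size, max_tokens, embed_len)."""
    return [[embedding_matrix[index] for index in example] for example in batch]

batch = [[3, 1, 0, 0], [5, 5, 2, 9]]   # batch_size = 2, max_tokens = 4
vectors = embed(batch)
print(len(vectors), len(vectors[0]), len(vectors[0][0]))

# The layer's parameter count is simply the number of entries in the matrix.
print(72003 * 25)
```

Two occurrences of the same index (5 in the second example) receive exactly the same vector, because the layer is a pure table lookup.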

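The output of the embedding layer will be given to the LSTM layer. The LSTM layer is created with 50 output units. It takes input of shape (batch_size, max_tokens, embed_len) and goes through the tokens of each text example one by one to capture the sequence. Because we have not set return_sequences=True, the layer returns only the output of the last step, of shape (batch_size, lstm_out) = (batch_size, 50), which matches the model summary below. With return_sequences=True it would instead return the output of every step, of shape (batch_size, max_tokens, lstm_out) = (batch_size, 25, 50).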
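
To make "goes through tokens one by one" concrete, here is a minimal pure-Python sketch of the standard LSTM cell equations applied to a single example. The sizes are toy values, the weights are random, and the helper names are invented for this illustration; it is not how Keras implements or initializes the layer.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_gate(input_len, units, rng):
    """Random weights for one gate: input kernel W, recurrent kernel U, bias b."""
    return {
        "W": [[rng.uniform(-0.1, 0.1) for _ in range(input_len)] for _ in range(units)],
        "U": [[rng.uniform(-0.1, 0.1) for _ in range(units)] for _ in range(units)],
        "b": [0.0] * units,
    }

def preactivation(gate, x, h):
    """Compute W.x + U.h + b for every unit of one gate."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + b
        for w_row, u_row, b in zip(gate["W"], gate["U"], gate["b"])
    ]

def lstm_forward(sequence, gates, units):
    """Run one example through the cell and return the hidden state at every step."""
    h = [0.0] * units   # hidden state, which is also the layer's output
    c = [0.0] * units   # cell state, the layer's internal memory
    outputs = []
    for x in sequence:  # one embedded token at a time
        i = [sigmoid(v) for v in preactivation(gates["input"], x, h)]
        f = [sigmoid(v) for v in preactivation(gates["forget"], x, h)]
        g = [math.tanh(v) for v in preactivation(gates["candidate"], x, h)]
        o = [sigmoid(v) for v in preactivation(gates["output"], x, h)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(h)
    return outputs

rng = random.Random(0)
max_tokens, embed_len, lstm_out = 5, 3, 4   # toy sizes
gates = {name: make_gate(embed_len, lstm_out, rng)
         for name in ("input", "forget", "candidate", "output")}
sequence = [[rng.uniform(-1, 1) for _ in range(embed_len)] for _ in range(max_tokens)]

all_steps = lstm_forward(sequence, gates, lstm_out)
final_step = all_steps[-1]
print(len(all_steps), len(all_steps[0]))   # every step: (max_tokens, lstm_out)
print(len(final_step))                     # last step only: (lstm_out,)
```

Here `all_steps` corresponds to what a layer with return_sequences=True would hand to the next layer, and `final_step` to the default behaviour used in this approach.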

The output of the LSTM layer is given to a Dense layer that has 4 output units (the same as the number of target classes) and a softmax activation function.

In [6]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding

embed_len = 25
lstm_out = 50

model = Sequential([
                    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embed_len,  input_length=max_tokens),
                    LSTM(lstm_out),
                    Dense(len(target_classes), activation="softmax")
                ])

model.summary()
2022-04-09 07:10:13.421020: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 25, 25)            1800075
_________________________________________________________________
lstm (LSTM)                  (None, 50)                15200
_________________________________________________________________
dense (Dense)                (None, 4)                 204
=================================================================
Total params: 1,815,479
Trainable params: 1,815,479
Non-trainable params: 0
_________________________________________________________________
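
The parameter counts in the summary above can be reproduced by hand, which is a useful sanity check on the architecture. The helper functions below are written for this illustration. The vocabulary length of 72,003 is an assumption derived from the summary (1,800,075 embedding parameters divided by the embedding length of 25), since the tutorial does not print it.

```python
def embedding_params(vocab_len, embed_len):
    # One embed_len-sized vector per token index.
    return vocab_len * embed_len

def lstm_params(input_len, units):
    # 4 gates, each with an input kernel, a recurrent kernel, and a bias vector.
    return 4 * (input_len * units + units * units + units)

def dense_params(input_len, units):
    # One weight per input-output pair plus one bias per output unit.
    return input_len * units + units

counts = {
    "embedding": embedding_params(72003, 25),
    "lstm": lstm_params(25, 50),
    "dense": dense_params(50, 4),
}
total = sum(counts.values())
print(counts, total)
```

Almost all of the parameters sit in the embedding matrix; the LSTM and Dense layers together account for less than 1% of the total.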

Compile Network

Below, we have compiled our network to use the rmsprop optimizer, sparse categorical cross entropy loss, and the accuracy metric.

In [7]:
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
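
The "sparse" variant of the cross entropy loss expects integer class indexes (such as the 0 to 3 labels prepared earlier) rather than one-hot vectors. For a single example, it is the negative log of the probability assigned to the true class. The sketch below works through that formula with hypothetical scores; it is a hand calculation, not the Keras implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]   # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one example: negative log of the probability of the true class."""
    return -math.log(probabilities[label])

probabilities = softmax([2.0, 0.5, 0.1, -1.0])   # hypothetical scores for the 4 classes
loss_if_world = sparse_categorical_crossentropy(0, probabilities)      # label 0 = "World"
loss_if_sci_tech = sparse_categorical_crossentropy(3, probabilities)   # label 3 = "Sci/Tech"
print(probabilities)
print(loss_if_world, loss_if_sci_tech)
```

The loss is small when the network puts most of its probability on the true class and grows quickly when the true class receives little probability.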

Train Network

Here, we have trained our network by calling the fit() method. We have trained the network for 5 epochs with a batch size of 512, using the train data for training and the test data for validation. The loss and accuracy values printed at the end of each epoch show that validation accuracy reaches about 90% by the second epoch and stays there, while training accuracy keeps rising to about 94%, which suggests the network starts to overfit after a couple of epochs.
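
The training log reports 235 steps per epoch. That number follows directly from the dataset size and the batch size, as the short calculation below shows.

```python
import math

train_examples = 120000   # size of the AG NEWS training split loaded earlier
batch_size = 512

steps_per_epoch = math.ceil(train_examples / batch_size)
last_batch_size = train_examples - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)
```

The final batch of each epoch is smaller than the others because 120,000 is not a multiple of 512.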

In [8]:
model.fit(X_train_pad, Y_train, batch_size=512, epochs=5, validation_data=(X_test_pad, Y_test))
2022-04-09 07:10:14.139671: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/5
235/235 [==============================] - 20s 73ms/step - loss: 0.5642 - accuracy: 0.7678 - val_loss: 0.3262 - val_accuracy: 0.8914
Epoch 2/5
235/235 [==============================] - 16s 67ms/step - loss: 0.2668 - accuracy: 0.9118 - val_loss: 0.2843 - val_accuracy: 0.9033
Epoch 3/5
235/235 [==============================] - 16s 70ms/step - loss: 0.2165 - accuracy: 0.9286 - val_loss: 0.2932 - val_accuracy: 0.9017
Epoch 4/5
235/235 [==============================] - 16s 67ms/step - loss: 0.1916 - accuracy: 0.9361 - val_loss: 0.2861 - val_accuracy: 0.9045
Epoch 5/5
235/235 [==============================] - 17s 71ms/step - loss: 0.1757 - accuracy: 0.9412 - val_loss: 0.2928 - val_accuracy: 0.9018
Out[8]:
<keras.callbacks.History at 0x7f355446afd0>

Evaluate Network Performance

In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report (precision, recall and f1-score per target class) and confusion matrix on test predictions. We can notice from the accuracy score (about 90.2%) that our model has done a good job at the text classification task. We have calculated ML metrics using various functions available from scikit-learn.

Please feel free to check the below link if you are interested in learning about various ML metrics available from sklearn.

Apart from the calculation, we have also plotted the confusion matrix. We can notice from the confusion matrix that our model does well for all categories except Business, where recall is lower (0.82) than for the other categories, mainly because 241 Business examples are predicted as Sci/Tech.

We have created a confusion matrix plot using Python library scikit-plot. Please feel free to check the below link if you want to learn about it as it provides plots for many ML metrics.

In [9]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_preds = model.predict(X_test_pad).argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
Test Accuracy : 0.9018421052631579

Classification Report :
              precision    recall  f1-score   support

       World       0.91      0.91      0.91      1900
      Sports       0.95      0.96      0.96      1900
    Business       0.91      0.82      0.86      1900
    Sci/Tech       0.84      0.92      0.88      1900

    accuracy                           0.90      7600
   macro avg       0.90      0.90      0.90      7600
weighted avg       0.90      0.90      0.90      7600


Confusion Matrix :
[[1730   49   63   58]
 [  25 1828   19   28]
 [  75   28 1556  241]
 [  65   20   75 1740]]
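
The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below does this by hand for the matrix printed above, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
target_classes = ["World", "Sports", "Business", "Sci/Tech"]

# Rows are actual classes, columns are predicted classes.
matrix = [
    [1730,   49,   63,   58],
    [  25, 1828,   19,   28],
    [  75,   28, 1556,  241],
    [  65,   20,   75, 1740],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[k][k] for k in range(len(matrix)))
accuracy = correct / total

report = {}
for k, name in enumerate(target_classes):
    predicted_as_k = sum(row[k] for row in matrix)   # column sum
    actually_k = sum(matrix[k])                      # row sum
    report[name] = {
        "precision": matrix[k][k] / predicted_as_k,
        "recall": matrix[k][k] / actually_k,
    }

print(round(accuracy, 4))
for name, scores in report.items():
    print(name, round(scores["precision"], 2), round(scores["recall"], 2))
```

Reading the matrix this way shows where the errors come from: the low Business recall and the low Sci/Tech precision are both driven by the same cell, the Business examples predicted as Sci/Tech.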
In [ ]:
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Network Predictions using LIME Algorithm

In this section, we have tried to explain predictions made by our network using the LIME algorithm. The Python library lime has an implementation of the algorithm. It lets us generate a visualization that highlights the words contributing to the prediction of a particular target label.

Please feel free to check the below links if you are new to LIME and want to learn about it in depth.

In order to explain predictions using lime, we first need to create an instance of LimeTextExplainer which we have done below.

In [11]:
from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

explainer
Out[11]:
<lime.lime_text.LimeTextExplainer at 0x7f3554352890>

Here, we have first declared a function that takes a list of text examples as input and returns their prediction probabilities using our trained model. It tokenizes and vectorizes text examples before giving them to the network to generate probabilities. We'll use this function in the next cell when generating an explanation.

Then, we randomly selected one text example from the test dataset. We have then made predictions on that example using our trained network. Our network correctly predicts the target category as Business for the selected example.

In [12]:
def make_predictions(X_batch_text):
    X = tokenizer.texts_to_sequences(X_batch_text)
    X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
    preds = model.predict(X)
    return preds

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])
Prediction :  Business
Actual :      Business

Below, we have generated an explanation visualization for the selected text example. To do that, we have first called the explain_instance() method on the LimeTextExplainer object. We have provided the selected text example, the prediction function, and the target label to the method. The method returns an Explanation instance.

At last, we have called show_in_notebook() method on the explanation instance to generate the visualization.

We can notice from the visualization that words like 'concessions', 'airlines', 'pensions', 'financing', 'bankruptcy', 'cuts', etc are used for predicting the target category Business which makes sense as they are commonly used words in the business world.

In [ ]:
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
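
The general idea behind this kind of explanation is to perturb the text, ask the classifier for new probabilities, and see which words matter. The sketch below is a much simpler leave-one-word-out approximation, not LIME itself: LIME samples many random perturbations and fits a weighted linear model to them. The toy classifier is a hypothetical stand-in for make_predictions; it follows the same contract of taking a list of strings and returning one list of class probabilities per string.

```python
def toy_classifier(texts):
    """Hypothetical stand-in for make_predictions: [P(other), P(business)] per text."""
    business_words = {"bankruptcy", "pensions", "financing"}
    probabilities = []
    for text in texts:
        hits = sum(word in business_words for word in text.lower().split())
        p_business = min(0.1 + 0.3 * hits, 0.99)
        probabilities.append([1.0 - p_business, p_business])
    return probabilities

def occlusion_importance(text, classifier_fn, label):
    """Score each word by how much the probability of `label` drops when it is removed."""
    words = text.split()
    baseline = classifier_fn([text])[0][label]
    # Build one perturbed text per word and score them all in a single batch.
    perturbed = [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]
    scores = classifier_fn(perturbed)
    return {word: baseline - row[label] for word, row in zip(words, scores)}

importance = occlusion_importance(
    "airline seeks financing after bankruptcy filing", toy_classifier, label=1)
for word, drop in sorted(importance.items(), key=lambda item: -item[1]):
    print(word, round(drop, 2))
```

Words whose removal lowers the probability of the chosen class the most are the ones the classifier relies on, which is the same intuition the LIME visualization conveys.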

Approach 2: Single LSTM Layer Network (Max Tokens=50, Embedding Length=25, LSTM Output=50)

Our approach in this section is exactly the same as our approach in the previous section. The only difference is that we have increased the number of tokens per example from 25 to 50 to see whether it improves accuracy. The majority of the code is exactly the same as in our previous section.

Truncate Data to Max Tokens

Below, we have truncated our text examples to maintain 50 tokens per example by calling pad_sequences() method.

In [14]:
from keras.preprocessing.sequence import pad_sequences

max_tokens=50

X_train_pad = pad_sequences(X_train, maxlen=max_tokens, padding="post", truncating="post", value=0)
X_test_pad  = pad_sequences(X_test , maxlen=max_tokens, padding="post", truncating="post", value=0)

X_train_pad.shape, X_test_pad.shape
Out[14]:
((120000, 50), (7600, 50))

Define Network

Below, we have defined a network that we'll use for our text classification task in this section. The network definition is exactly the same as in the previous section.

In [15]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding

embed_len = 25
lstm_out = 50

model = Sequential([
                    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embed_len,  input_length=max_tokens),
                    LSTM(lstm_out),
                    Dense(len(target_classes), activation="softmax")
                ])

model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 50, 25)            1800075
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                15200
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 204
=================================================================
Total params: 1,815,479
Trainable params: 1,815,479
Non-trainable params: 0
_________________________________________________________________

Compile Network

Here, we have compiled our network to use rmsprop optimizer, cross entropy loss, and accuracy metric.

In [16]:
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Train Network

Below, we have trained our network using exactly the same settings that we had used in the previous section. We can notice from the loss and accuracy values printed after each epoch that validation accuracy reaches about 91.5% by the fifth epoch, slightly above the previous approach.

In [17]:
model.fit(X_train_pad, Y_train, batch_size=512, epochs=5, validation_data=(X_test_pad, Y_test))
Epoch 1/5
235/235 [==============================] - 34s 133ms/step - loss: 0.7517 - accuracy: 0.6653 - val_loss: 0.4097 - val_accuracy: 0.8729
Epoch 2/5
235/235 [==============================] - 33s 141ms/step - loss: 0.2959 - accuracy: 0.9081 - val_loss: 0.2724 - val_accuracy: 0.9139
Epoch 3/5
235/235 [==============================] - 34s 145ms/step - loss: 0.2233 - accuracy: 0.9305 - val_loss: 0.2882 - val_accuracy: 0.9079
Epoch 4/5
235/235 [==============================] - 33s 142ms/step - loss: 0.1928 - accuracy: 0.9400 - val_loss: 0.2718 - val_accuracy: 0.9136
Epoch 5/5
235/235 [==============================] - 33s 141ms/step - loss: 0.1731 - accuracy: 0.9457 - val_loss: 0.2696 - val_accuracy: 0.9147
Out[17]:
<keras.callbacks.History at 0x7f3557aac6d0>

Evaluate Network Performance

In this section, we have evaluated the performance of our trained network as usual by calculating the accuracy score, classification report and confusion matrix on test predictions. We can notice that the accuracy score (about 91.5%) is a little better than that of our previous approach (about 90.2%). We have also created a confusion matrix plot for reference purposes.

In [18]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_preds = model.predict(X_test_pad).argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
Test Accuracy : 0.9147368421052632

Classification Report :
              precision    recall  f1-score   support

       World       0.90      0.92      0.91      1900
      Sports       0.95      0.98      0.97      1900
    Business       0.90      0.86      0.88      1900
    Sci/Tech       0.90      0.89      0.90      1900

    accuracy                           0.91      7600
   macro avg       0.91      0.91      0.91      7600
weighted avg       0.91      0.91      0.91      7600


Confusion Matrix :
[[1752   55   45   48]
 [  11 1868   16    5]
 [ 114   13 1637  136]
 [  66   24  115 1695]]
In [ ]:
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Network Predictions using LIME Algorithm

In this section, we have tried to explain predictions made by our trained network using the LIME algorithm. Our network correctly predicts the target label as Business for a randomly selected text example from the test dataset. The visualization shows that words like 'concessions', 'employees', 'airlines', 'financing', 'bankruptcy', 'pensions', 'million', etc. are contributing to predicting the target label as Business.

In [ ]:
from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

Approach 3: Single Bidirectional LSTM Layer Network (Max Tokens=50, Embedding Length=25, LSTM Output=50)

Our approach in this section again uses a single LSTM layer like our previous approach, but this time a bidirectional LSTM layer is used. By default, LSTM layers are unidirectional, which means that they process sequences in the forward direction only. A bidirectional LSTM layer processes sequences in both forward and backward directions to find patterns. This can sometimes help improve network performance. The majority of the code is the same as in our earlier approaches, with only a change in network architecture.
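
The idea of reading a sequence in both directions and joining the results can be shown with a toy recurrence. The running-sum "layer" below is a deliberately simple stand-in for an LSTM, and the two decay values are arbitrary placeholders for the separate weights that each direction has in a real bidirectional layer.

```python
def run_recurrence(sequence, decay):
    """Toy recurrent layer: the state is a decayed running sum of the inputs seen so far."""
    state = 0.0
    for x in sequence:
        state = decay * state + x
    return state

def bidirectional(sequence):
    """Concatenate the final states of a forward pass and a backward pass."""
    forward_state = run_recurrence(sequence, decay=0.5)
    backward_state = run_recurrence(list(reversed(sequence)), decay=0.25)
    return [forward_state, backward_state]

sequence = [1.0, 0.0, 0.0, 4.0]
combined = bidirectional(sequence)
print(combined)
```

The forward state is dominated by the end of the sequence and the backward state by its start, so the concatenated output is twice as wide and carries both views.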

Define Network

Below, we have defined the network that we'll use for our text classification task in this section. We can create a bidirectional LSTM layer by wrapping a normal LSTM layer in the Bidirectional() constructor. The output of the bidirectional LSTM layer will be of shape (batch_size, 2 x lstm_out) = (batch_size, 100) because the outputs of the forward and backward passes are concatenated. With return_sequences=True on the wrapped LSTM layer, the shape would instead be (batch_size, max_tokens, 2 x lstm_out) = (batch_size, 50, 100).

In [21]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, Bidirectional

embed_len = 25
lstm_out = 50

model = Sequential([
                    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embed_len,  input_length=max_tokens),
                    Bidirectional(LSTM(lstm_out)),
                    Dense(len(target_classes), activation="softmax")
                ])

model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 50, 25)            1800075
_________________________________________________________________
bidirectional (Bidirectional (None, 100)               30400
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 404
=================================================================
Total params: 1,830,879
Trainable params: 1,830,879
Non-trainable params: 0
_________________________________________________________________

Compile Network

Below, we have compiled our network to use the rmsprop optimizer, cross entropy loss, and the accuracy metric.

In [22]:
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Train Network

Here, we have trained our network using the same settings we have been using for all our approaches. We can notice from the loss and accuracy values printed after each epoch that validation accuracy again reaches about 91.5%, while each epoch takes roughly twice as long as in the previous approach (about 60s versus 33s).

In [23]:
model.fit(X_train_pad, Y_train, batch_size=512, epochs=5, validation_data=(X_test_pad, Y_test))
Epoch 1/5
235/235 [==============================] - 67s 264ms/step - loss: 0.5461 - accuracy: 0.8048 - val_loss: 0.3109 - val_accuracy: 0.8983
Epoch 2/5
235/235 [==============================] - 61s 261ms/step - loss: 0.2677 - accuracy: 0.9141 - val_loss: 0.3373 - val_accuracy: 0.8916
Epoch 3/5
235/235 [==============================] - 60s 257ms/step - loss: 0.2170 - accuracy: 0.9309 - val_loss: 0.2894 - val_accuracy: 0.9045
Epoch 4/5
235/235 [==============================] - 60s 257ms/step - loss: 0.1884 - accuracy: 0.9384 - val_loss: 0.2530 - val_accuracy: 0.9155
Epoch 5/5
235/235 [==============================] - 62s 263ms/step - loss: 0.1674 - accuracy: 0.9458 - val_loss: 0.2543 - val_accuracy: 0.9154
Out[23]:
<keras.callbacks.History at 0x7f355764c410>

Evaluate Network Performance

In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report and confusion matrix on test predictions. We can notice from the accuracy score (about 91.5%) that it is the highest of our approaches so far, though only marginally ahead of Approach 2 (0.9154 versus 0.9147). We have also plotted the confusion matrix for reference purposes.

In [24]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_preds = model.predict(X_test_pad).argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
Test Accuracy : 0.9153947368421053

Classification Report :
              precision    recall  f1-score   support

       World       0.95      0.89      0.92      1900
      Sports       0.96      0.98      0.97      1900
    Business       0.89      0.87      0.88      1900
    Sci/Tech       0.86      0.92      0.89      1900

    accuracy                           0.92      7600
   macro avg       0.92      0.92      0.92      7600
weighted avg       0.92      0.92      0.92      7600


Confusion Matrix :
[[1687   54   86   73]
 [  11 1856   11   22]
 [  41    6 1659  194]
 [  33   12  100 1755]]
In [ ]:
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Network Predictions using LIME Algorithm

In this section, we have explained predictions made by our trained network using LIME algorithm. The network correctly predicts the target label as Business for the selected test example. The visualization created using LIME shows that words like 'airlines', 'concessions', 'bankruptcy', 'cuts', 'pensions', 'labor', 'financing', etc are contributing to predicting target label as Business.

In [ ]:
from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

Approach 4: Multiple LSTM Layers Network (Max Tokens=50, Embedding Length=25, LSTM Output=50,60,75)

Our approach in this section creates a recurrent neural network consisting of multiple LSTM layers. The network uses 3 LSTM layers with different numbers of output units. The majority of the code in this section is the same as earlier, with the only difference being the network architecture.

Define Network

Below, we have defined the network that we have used for our text classification task. The network consists of 5 layers. The embedding layer is followed by 3 LSTM layers with output unit sizes of 50, 60, and 75 respectively. The first two LSTM layers set return_sequences=True so that they pass the output of every step to the next LSTM layer, which expects a sequence as input. The output of the last LSTM layer is given to the Dense layer, whose output is the prediction of the network.
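
How shapes and parameter counts change through the stack can be traced with a few lines of arithmetic. This is a hand calculation under the layer sizes stated above, using the standard LSTM parameter formula; the helper name and the "batch" placeholder are choices made for this illustration.

```python
def lstm_params(input_len, units):
    # 4 gates, each with an input kernel, a recurrent kernel, and a bias vector.
    return 4 * (input_len * units + units * units + units)

max_tokens, embed_len = 50, 25
layers = [(50, True), (60, True), (75, False)]   # (units, return_sequences)

trace = []
input_len = embed_len
for units, return_sequences in layers:
    # A layer that returns sequences keeps the time axis; the last one drops it.
    shape = ("batch", max_tokens, units) if return_sequences else ("batch", units)
    trace.append((shape, lstm_params(input_len, units)))
    input_len = units   # the next layer consumes this layer's output features

for shape, params in trace:
    print(shape, params)
```

Each LSTM layer's input size is the previous layer's number of units, which is why the parameter count grows from layer to layer even though the sequence length stays at 50.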

In [27]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding

embed_len = 25
lstm_out1 = 50
lstm_out2 = 60
lstm_out3 = 75

model = Sequential([
                    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embed_len,  input_length=max_tokens),
                    LSTM(lstm_out1, return_sequences=True),
                    LSTM(lstm_out2, return_sequences=True),
                    LSTM(lstm_out3),
                    Dense(len(target_classes), activation="softmax")
                ])

model.summary()
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_3 (Embedding)      (None, 50, 25)            1800075
_________________________________________________________________
lstm_3 (LSTM)                (None, 50, 50)            15200
_________________________________________________________________
lstm_4 (LSTM)                (None, 50, 60)            26640
_________________________________________________________________
lstm_5 (LSTM)                (None, 75)                40800
_________________________________________________________________
dense_3 (Dense)              (None, 4)                 304
=================================================================
Total params: 1,883,019
Trainable params: 1,883,019
Non-trainable params: 0
_________________________________________________________________
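The parameter counts in the summary above can be checked by hand. The sketch below assumes the usual LSTM layout of four gates, each with input weights, recurrent weights, and one bias vector; the helper names are ours, and the vocabulary size is inferred from the embedding row rather than read from the tokenizer.

```python
def lstm_params(input_dim, units):
    # 4 gates (input, forget, cell candidate, output), each with
    # input weights, recurrent weights, and a bias vector.
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

embed_len, n_classes = 25, 4
vocab_size = 1800075 // embed_len  # implied by the embedding row of the summary

counts = {
    "embedding_3": vocab_size * embed_len,
    "lstm_3": lstm_params(25, 50),   # input: 25-dim embeddings
    "lstm_4": lstm_params(50, 60),   # input: 50 units from lstm_3
    "lstm_5": lstm_params(60, 75),   # input: 60 units from lstm_4
    "dense_3": dense_params(75, n_classes),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:<12} {n:>9,}")
print(f"{'total':<12} {total:>9,}")
```

Almost all of the parameters sit in the embedding layer, so the three LSTM layers add comparatively little to the model size.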

Compile Network

Here, we have compiled our network to use the RMSProp optimizer, sparse categorical cross-entropy loss, and the accuracy metric.

In [28]:
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
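The "sparse" variant of the cross-entropy loss expects integer class labels rather than one-hot vectors. As a minimal sketch of the quantity being minimized (our own pure-Python helper, not the Keras implementation, with made-up probabilities), the loss is the mean negative log of the probability assigned to the true class:

```python
import math

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(probability the model assigned to the true class)."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

# Hypothetical softmax outputs for two examples over four classes.
y_prob = [[0.10, 0.10, 0.70, 0.10],   # fairly confident in class 2
          [0.25, 0.25, 0.25, 0.25]]   # completely unsure
y_true = [2, 0]                       # integer labels, as in Y_train / Y_test

loss = sparse_categorical_crossentropy(y_true, y_prob)
print(round(loss, 4))
```

A confident correct prediction contributes a small loss, while the uniform guess contributes log(4), which is the loss of an untrained four-class model.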

Train Network

In this section, we have trained our network using the same settings that we have been using for all our approaches. Validation accuracy climbs to 91.66% by epoch 4, but drops to 89.37% in epoch 5 while the training loss keeps falling, which suggests the network starts to overfit in the final epoch.

In [29]:
model.fit(X_train_pad, Y_train, batch_size=512, epochs=5, validation_data=(X_test_pad, Y_test))
Epoch 1/5
235/235 [==============================] - 131s 525ms/step - loss: 0.5945 - accuracy: 0.7334 - val_loss: 0.3677 - val_accuracy: 0.8754
Epoch 2/5
235/235 [==============================] - 121s 515ms/step - loss: 0.2770 - accuracy: 0.9107 - val_loss: 0.2765 - val_accuracy: 0.9116
Epoch 3/5
235/235 [==============================] - 125s 532ms/step - loss: 0.2186 - accuracy: 0.9307 - val_loss: 0.2631 - val_accuracy: 0.9157
Epoch 4/5
235/235 [==============================] - 121s 517ms/step - loss: 0.1883 - accuracy: 0.9398 - val_loss: 0.2605 - val_accuracy: 0.9166
Epoch 5/5
235/235 [==============================] - 124s 527ms/step - loss: 0.1687 - accuracy: 0.9455 - val_loss: 0.3279 - val_accuracy: 0.8937
Out[29]:
<keras.callbacks.History at 0x7f34fc159750>
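Because the last epoch is not the best one here, it can help to check which epoch actually performed best on the validation data. The sketch below uses the validation numbers copied from the log above; `best_epoch` is a hypothetical helper, and the dictionary mimics the shape of the `history` attribute on the object returned by `model.fit()`.

```python
# Validation metrics copied from the training log above.
history = {
    "val_loss":     [0.3677, 0.2765, 0.2631, 0.2605, 0.3279],
    "val_accuracy": [0.8754, 0.9116, 0.9157, 0.9166, 0.8937],
}

def best_epoch(history, metric="val_loss", mode="min"):
    """Return (1-based epoch number, value) of the best epoch for a metric."""
    values = history[metric]
    pick = min if mode == "min" else max
    idx = pick(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

print(best_epoch(history, "val_loss", "min"))
print(best_epoch(history, "val_accuracy", "max"))
```

Both metrics point to epoch 4. Keras also ships an `EarlyStopping` callback with a `restore_best_weights` option that could keep those weights automatically; we have not used it in this tutorial.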

Evaluate Network Performance

In this section, we have evaluated the performance of the network by calculating the accuracy score, classification report, and confusion matrix on the test predictions. The test accuracy of about 89.4% is a little lower than that of our single LSTM layer networks, which is surprising as we had expected that stacking more layers would improve accuracy further. Most of the gap comes from the Business class, whose recall is only 0.78 because 360 Business examples are predicted as Sci/Tech.

In [30]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_preds = model.predict(X_test_pad).argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
Test Accuracy : 0.8936842105263157

Classification Report :
              precision    recall  f1-score   support

       World       0.96      0.86      0.90      1900
      Sports       0.92      0.98      0.95      1900
    Business       0.93      0.78      0.85      1900
    Sci/Tech       0.80      0.96      0.87      1900

    accuracy                           0.89      7600
   macro avg       0.90      0.89      0.89      7600
weighted avg       0.90      0.89      0.89      7600


Confusion Matrix :
[[1630  120   73   77]
 [   3 1865    6   26]
 [  41   17 1482  360]
 [  29   19   37 1815]]
In [ ]:
import scikitplot as skplt
import matplotlib.pyplot as plt

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
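The precision and recall values in the classification report follow directly from the confusion matrix printed above, where rows are actual classes and columns are predicted classes. A small sketch that recomputes them, with the matrix copied from the output and the class order taken from the report:

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = [[1630,  120,   73,   77],
      [   3, 1865,    6,   26],
      [  41,   17, 1482,  360],
      [  29,   19,   37, 1815]]
classes = ["World", "Sports", "Business", "Sci/Tech"]

n = len(cm)
col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]

report = {}
for i, name in enumerate(classes):
    precision = cm[i][i] / col_sums[i]   # correct / everything predicted as this class
    recall = cm[i][i] / sum(cm[i])       # correct / everything that truly is this class
    report[name] = (precision, recall)
    print(f"{name:<9} precision={precision:.2f} recall={recall:.2f}")

accuracy = sum(cm[i][i] for i in range(n)) / sum(map(sum, cm))
print(f"accuracy={accuracy:.4f}")
```

The low Business recall and the low Sci/Tech precision are two views of the same 360 misclassified examples.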


Explain Network Predictions using LIME Algorithm

In this section, we have explained a prediction of the network using the LIME algorithm. The network correctly predicts the target label as Business for the selected text example from the test dataset. The visualization created using LIME highlights that words like 'concessions', 'airlines', 'cuts', 'bankruptcy', 'pensions', 'financing', and 'labor' contribute to predicting the target label as Business.

In [ ]:
from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()


Approach 5: Multiple Bidirectional LSTM Layers Network (Max Tokens=50, Embedding Length=30, LSTM Output=50,60,75)

Our approach in this section is the same as our approach in the previous section with two changes. It again uses a recurrent neural network with multiple LSTM layers, but all LSTM layers are bidirectional this time, and the embedding length is increased from 25 to 30. The rest of the code is the same as in our earlier sections.

Define Network

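Below, we have defined the network that we'll use for our text classification task in this section. The network architecture is the same as in the previous section, except that each LSTM layer is wrapped in the Bidirectional() wrapper. Each wrapped layer runs one LSTM forward and one backward over the sequence and concatenates their outputs, which is why the output sizes in the summary below are doubled (100, 120, and 150).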

In [33]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, Bidirectional  ## LSTM (not SimpleRNN) is the layer wrapped below.

embed_len = 30
lstm_out1 = 50
lstm_out2 = 60
lstm_out3 = 75

model = Sequential([
                    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embed_len,  input_length=max_tokens),
                    Bidirectional(LSTM(lstm_out1, return_sequences=True)),
                    Bidirectional(LSTM(lstm_out2, return_sequences=True)),
                    Bidirectional(LSTM(lstm_out3)),
                    Dense(len(target_classes), activation="softmax")
                ])

model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_4 (Embedding)      (None, 50, 30)            2160090
_________________________________________________________________
bidirectional_1 (Bidirection (None, 50, 100)           32400
_________________________________________________________________
bidirectional_2 (Bidirection (None, 50, 120)           77280
_________________________________________________________________
bidirectional_3 (Bidirection (None, 150)               117600
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 604
=================================================================
Total params: 2,387,974
Trainable params: 2,387,974
Non-trainable params: 0
_________________________________________________________________
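The doubled output sizes and parameter counts in the summary above can be reproduced with a short calculation. This sketch assumes the standard four-gate LSTM parameter layout and that the forward and backward outputs are concatenated; the helper names are ours.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * ((input_dim + units + 1) * units)

def bilstm_params(input_dim, units):
    # The forward and backward LSTMs each keep their own set of weights.
    return 2 * lstm_params(input_dim, units)

# (input size, units per direction) for the three bidirectional layers.
# Each layer outputs 2 * units values, which become the next layer's input size.
layers = [(30, 50), (2 * 50, 60), (2 * 60, 75)]
bi_counts = [bilstm_params(inp, units) for inp, units in layers]

dense_count = (2 * 75) * 4 + 4   # 150 inputs, 4 classes, plus biases
total = 2160090 + sum(bi_counts) + dense_count  # embedding count taken from the summary

print(bi_counts, dense_count, total)
```

Wrapping the layers roughly doubles their parameter count, and the wider inputs to the second and third layers add a little more on top.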

Compile Network

Here, we have compiled our network to use the RMSProp optimizer, sparse categorical cross-entropy loss, and the accuracy metric.

In [34]:
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Train Network

In this section, we have trained our network using the same settings that we have used for all our previous approaches. Validation accuracy peaks at 91.64% in epoch 3 and ends at 91.01% after epoch 5, while validation loss rises from 0.2528 to 0.2807 over the same epochs. Each epoch takes roughly twice as long as in the previous approach.

In [35]:
model.fit(X_train_pad, Y_train, batch_size=512, epochs=5, validation_data=(X_test_pad, Y_test))
Epoch 1/5
235/235 [==============================] - 265s 1s/step - loss: 0.4645 - accuracy: 0.8214 - val_loss: 0.3048 - val_accuracy: 0.9005
Epoch 2/5
235/235 [==============================] - 250s 1s/step - loss: 0.2267 - accuracy: 0.9259 - val_loss: 0.2575 - val_accuracy: 0.9147
Epoch 3/5
235/235 [==============================] - 243s 1s/step - loss: 0.1847 - accuracy: 0.9392 - val_loss: 0.2528 - val_accuracy: 0.9164
Epoch 4/5
235/235 [==============================] - 255s 1s/step - loss: 0.1618 - accuracy: 0.9463 - val_loss: 0.2564 - val_accuracy: 0.9161
Epoch 5/5
235/235 [==============================] - 251s 1s/step - loss: 0.1459 - accuracy: 0.9514 - val_loss: 0.2807 - val_accuracy: 0.9101
Out[35]:
<keras.callbacks.History at 0x7f34d4429710>

Evaluate Network Performance

In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report, and confusion matrix on test predictions. The test accuracy of about 91.0% is higher than that of the stacked unidirectional network from the previous section, and Business recall improves from 0.78 to 0.86. We have also plotted the confusion matrix for reference purposes.

In [36]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_preds = model.predict(X_test_pad).argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
Test Accuracy : 0.9101315789473684

Classification Report :
              precision    recall  f1-score   support

       World       0.90      0.92      0.91      1900
      Sports       0.94      0.98      0.96      1900
    Business       0.89      0.86      0.88      1900
    Sci/Tech       0.90      0.88      0.89      1900

    accuracy                           0.91      7600
   macro avg       0.91      0.91      0.91      7600
weighted avg       0.91      0.91      0.91      7600


Confusion Matrix :
[[1750   52   58   40]
 [  18 1863   14    5]
 [ 106   17 1639  138]
 [  68   43  124 1665]]
In [ ]:
import scikitplot as skplt
import matplotlib.pyplot as plt

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
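To see which classes this network still mixes up, we can rank the off-diagonal cells of the confusion matrix printed above. The sketch below copies that matrix and uses the class order from the classification report; it is only a convenience for reading the matrix.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = [[1750,   52,   58,   40],
      [  18, 1863,   14,    5],
      [ 106,   17, 1639,  138],
      [  68,   43,  124, 1665]]
classes = ["World", "Sports", "Business", "Sci/Tech"]

# Collect every (count, actual, predicted) cell where the prediction was wrong.
confusions = [
    (cm[t][p], classes[t], classes[p])
    for t in range(len(cm))
    for p in range(len(cm))
    if t != p
]
top3 = sorted(confusions, reverse=True)[:3]

for count, actual, predicted in top3:
    print(f"{count:>4} {actual} examples predicted as {predicted}")
```

Business and Sci/Tech remain the most frequently confused pair, in both directions, but far less so than in the previous approach.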


Explain Network Predictions using LIME Algorithm

In this section, we have explained the prediction made by our network on a random test example using the LIME algorithm. The network correctly predicts the target label as Business for the selected sample. The visualization shows that words like 'concessions', 'financing', 'pensions', 'labor', and 'cuts' contribute to predicting the target label as Business.

In [ ]:
from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()


7. Results Summary And Further Suggestions

Approach                                    Max Tokens  Embedding Length  LSTM Output  Test Accuracy (%)
Single LSTM Layer Network                   25          25                50           90.18
Single LSTM Layer Network                   50          25                50           91.47
Single Bidirectional LSTM Layer Network     50          25                50           91.53
Multiple LSTM Layers Network                50          25                50,60,75     89.37
Multiple Bidirectional LSTM Layers Network  50          30                50,60,75     91.01

Further Recommendations

  • Try different max tokens.
  • Try different embedding lengths.
  • Try different LSTM layer output units.
  • Train network for more epochs.
  • Try different optimizers (e.g., Adam).
  • Try different weight initialization for networks.
  • Try stacking a different number of LSTM layers.
  • Try learning rate schedulers.
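
As a starting point for the optimizer and learning rate scheduler suggestions above, here is a minimal sketch. The schedule function and its halving rule are our own illustrative choices, not settings we have tested in this tutorial; the Keras imports are kept inside a helper so the schedule itself can be previewed without Keras installed.

```python
def halve_every_two_epochs(epoch, lr):
    """Schedule with the (epoch, lr) signature accepted by Keras's LearningRateScheduler."""
    if epoch > 0 and epoch % 2 == 0:
        return lr * 0.5
    return lr

def make_training_options(initial_lr=1e-3):
    # Hypothetical helper: returns an Adam optimizer and a scheduler callback.
    from keras.callbacks import LearningRateScheduler
    from keras.optimizers import Adam
    return Adam(learning_rate=initial_lr), [LearningRateScheduler(halve_every_two_epochs)]

# Preview the learning rate the schedule would produce over 5 epochs.
lr = 1e-3
rates = []
for epoch in range(5):
    lr = halve_every_two_epochs(epoch, lr)
    rates.append(lr)
print(rates)
```

The optimizer returned by make_training_options() can be passed to model.compile(optimizer=...) and the callback list to model.fit(..., callbacks=...).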
Sunny Solanki
