Updated On : May-03,2022 Time Investment : ~30 mins

Keras: RNNs For Text Classification Tasks

When we work with sequential data such as time series, text, or speech, each element of a sequence depends on the elements that came before it. To predict the output for a text example, a model therefore benefits from taking the earlier tokens into account. A plain feed-forward network made only of dense layers cannot capture this ordering because it has no memory of earlier inputs. Recurrent neural networks (RNNs) were developed to work with such data. RNNs and their variants (LSTM and GRU) are well suited to datasets like time series, text, and speech.

As a part of this tutorial, we have explained how to create RNNs for text classification tasks using the Python deep learning library Keras. We have tried different versions of RNNs for solving the task. For encoding text data, we have used the word embeddings approach. After training the RNNs, we have also evaluated their performance by calculating various ML metrics and explained their predictions using the LIME algorithm.

In this tutorial, we have primarily created RNNs with vanilla Recurrent layer. If you are looking for a guide on a variant of RNN that consists of LSTM layers then please feel free to check the below link.

Keras: LSTM Networks For Text Classification Tasks

Below, we have listed essential sections of tutorial to give an overview of the material covered.

Important Sections Of Tutorial

Prepare Data
- 1.1 Load Dataset
- 1.2 Tokenize And Vectorize Text Data
Approach 1: Single Recurrent Layer Network (Max Tokens=25, Embedding Length=30, RNN Output=50)
- Truncate Data At Max Tokens
- Define Network
- Compile Network
- Train Network
- Evaluate Network Performance
- Explain Network Predictions using LIME Algorithm
Approach 2: Single Recurrent Layer Network (Max Tokens=50, Embedding Length=30, RNN Output=50)
Approach 3: Single Bidirectional Recurrent Layer Network (Max Tokens=50, Embedding Length=30, RNN Output=50)
Approach 4: Stacking Multiple Recurrent Layers (Max Tokens=50, Embedding Length=30, RNN Output=50,60,75)
Approach 5: Stacking Multiple Bidirectional Recurrent Layers (Max Tokens=50, Embedding Length=30, RNN Output=50,60,75)
Summarized Results And Further Recommendations

Below, we have imported the necessary libraries and printed the versions we used in our tutorial.

from tensorflow import keras

print("Keras Version : {}".format(keras.__version__))

Keras Version : 2.6.0

import torchtext

print("TorchText Version : {}".format(torchtext.__version__))

TorchText Version : 0.10.1

1. Prepare Data

In this section, we are preparing data to be given to the neural network. As we said earlier, we are going to use word embeddings approach for encoding text data. In order to encode text data using this approach, we need to follow 3 steps.

Create a vocabulary of all unique tokens in the corpus. The vocabulary is a simple mapping from token to integer index, with each token assigned a unique index. The Keras Tokenizer used in this tutorial numbers tokens from 1, which leaves index 0 free for padding.
Tokenize each text example into tokens and assign indexes to each token using vocabulary.
Map indexes to real-valued vectors.

The first two steps will be completed in this section. The third step will be implemented through a neural network where we'll have an embedding layer that will map token indexes to their respective real-valued vectors (embeddings).
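
The first two steps can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the helper names (simple_tokenize, build_vocab, encode) are hypothetical, whitespace splitting stands in for a real tokenizer, and the toy corpus is made up.

```python
from collections import Counter

def simple_tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return text.lower().split()

def build_vocab(corpus):
    # Count tokens and assign indexes by descending frequency.
    # Numbering starts at 1 so that 0 stays free for padding.
    counts = Counter(tok for doc in corpus for tok in simple_tokenize(doc))
    return {tok: idx for idx, (tok, _) in enumerate(counts.most_common(), start=1)}

def encode(text, vocab):
    # Tokens missing from the vocabulary are skipped in this sketch.
    return [vocab[tok] for tok in simple_tokenize(text) if tok in vocab]

corpus = ["stocks rise as markets rally", "markets fall as stocks slip"]
vocab = build_vocab(corpus)
encoded = encode("stocks rally as markets fall", vocab)

print(vocab)
print(encoded)
```

The third step, turning each index into a vector, is left to the embedding layer of the network.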

Below, we have included one image showing how word embeddings look.

1.1 Load Dataset

In this section, we have simply loaded AG NEWS dataset available from torchtext which we'll use for our tutorial. The dataset has text documents for four different categories (["World", "Sports", "Business", "Sci/Tech"]) of news.

import numpy as np

train_dataset, test_dataset = torchtext.datasets.AG_NEWS()

X_train_text, Y_train = [], []
for Y, X in train_dataset:
    X_train_text.append(X)
    Y_train.append(Y-1)

X_test_text, Y_test = [], []
for Y, X in test_dataset:
    X_test_text.append(X)
    Y_test.append(Y-1)

Y_train, Y_test = np.array(Y_train), np.array(Y_test)

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

len(X_train_text), len(X_test_text)

train.csv: 29.5MB [00:00, 88.4MB/s]
test.csv: 1.86MB [00:00, 61.4MB/s]

(120000, 7600)

1.2 Tokenize And Vectorize Text Data

In this section, we have completed two steps of encoding mentioned earlier.

First, we have created an instance of Tokenizer() and called fit_on_texts() method on it with train and test text documents. This call will tokenize each text example and populate vocabulary inside of Tokenizer object.

Next, we have called texts_to_sequences() method on the Tokenizer object with train and test datasets. It tokenizes each text example and looks up the indexes of its tokens. The output of this call is a list with one list of token indexes per text example.

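Below, we have explained with a simple example how one text example is mapped to indexes using a vocabulary. The example is conceptual: with its default settings, the Keras Tokenizer lowercases text, strips punctuation, and skips out-of-vocabulary words instead of mapping them to an '<unk>' index.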

text = "Hello, How are you? Where are you planning to go?"

tokens = ['hello', ',', 'how', 'are', 'you', '?', 'where',
            'are', 'you', 'planning', 'to', 'go', '?']

vocab = {
    'hello': 0,
    'bye': 1,
    'how': 2,
    'the': 3,
    'welcome': 4,
    'are': 5,
    'you': 6,
    'to': 7,
    '<unk>': 8,
}

vector = [0,8,2,5,6,8,8,5,6,8,7,8,8]

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_text+X_test_text)

X_train = tokenizer.texts_to_sequences(X_train_text)
X_test  = tokenizer.texts_to_sequences(X_test_text)

Approach 1: Single Recurrent Layer Network (Max Tokens=25, Embedding Length=30, RNN Output=50)

Our first approach uses a single Recurrent layer for the text classification tasks. The network consists of an embedding layer, recurrent layer, and dense layer. The embedding layer is responsible for generating embeddings of input token indexes which will be given to the recurrent layer for processing. The output of the recurrent layer will be given to the dense layer for generating probabilities of target classes.

Truncate Data At Max Tokens

Earlier, when we converted our text examples to token indexes, each example ended up with a different length because each text document has a different number of tokens (words). For our approach in this section, we have decided to use 25 tokens per text example. To achieve this, we have called pad_sequences() function on our vectorized train and test datasets. This function makes sure that every example has exactly 25 token indexes. Examples with more than 25 tokens are truncated after the 25th token, and examples with fewer than 25 tokens are padded with 0s at the end to bring their length to 25.

from keras.preprocessing.sequence import pad_sequences

max_tokens=25

X_train_pad = pad_sequences(X_train, maxlen=max_tokens, padding="post", truncating="post", value=0)
X_test_pad  = pad_sequences(X_test , maxlen=max_tokens, padding="post", truncating="post", value=0)

X_train_pad.shape, X_test_pad.shape

((120000, 25), (7600, 25))
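
The padding and truncation rule can be mimicked in plain Python to see what happens to a single example. The pad_post helper below is a hypothetical stand-in written for illustration, not the Keras implementation, and it only covers the padding="post", truncating="post" case used above.

```python
def pad_post(seq, maxlen, value=0):
    # Keep the first `maxlen` items (post-truncation), then right-pad with `value`.
    trimmed = list(seq[:maxlen])
    return trimmed + [value] * (maxlen - len(trimmed))

short_example = [12, 7, 431]          # made-up token indexes
long_example = list(range(1, 31))     # 30 made-up token indexes

padded_short = pad_post(short_example, maxlen=25)
padded_long = pad_post(long_example, maxlen=25)

print(padded_short)
print(len(padded_short), len(padded_long))
```

Both results have length 25: the short example gains 22 trailing zeros and the long one loses its last 5 indexes.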

Define Network

In this section, we have defined the RNN that we'll use for our text classification task. Our network consists of 3 layers.

Embedding Layer
Recurrent Layer
Dense Layer

The embedding layer is the first layer of the network. It takes a list of token indexes as input and returns their respective embeddings. We create the embedding layer using Embedding() constructor available from layers sub-module of keras. The layer takes the vocabulary size and embedding length as input and creates a weight matrix of shape (vocab_length, embed_len). Here vocab_length is len(tokenizer.word_index)+1 because index 0 is reserved for padding. This weight matrix holds one embedding per token index, and the embedding layer is simply responsible for mapping token indexes to their respective rows of this matrix. It takes input of shape (batch_size, max_tokens) = (batch_size, 25) and returns output of shape (batch_size, max_tokens, embed_len) = (batch_size, 25, 30).
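
Conceptually, the embedding lookup is just row indexing into the weight matrix. The NumPy sketch below illustrates the shapes involved; the vocabulary size of 1000 and the random matrix are toy assumptions, whereas the real layer learns its matrix during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_len, embed_len, max_tokens = 1000, 30, 25   # vocab_len is a toy value
embedding_matrix = rng.normal(size=(vocab_len, embed_len))

# Two fake examples of 25 token indexes each.
batch = rng.integers(low=0, high=vocab_len, size=(2, max_tokens))

# Integer-array indexing returns one row of the matrix per token index.
embedded = embedding_matrix[batch]

print(batch.shape, "->", embedded.shape)
```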

The output of the embedding layer is given to the recurrent layer for processing. We have created the recurrent layer using SimpleRNN() constructor of keras with 50 output units. The recurrent layer loops through the token embeddings of a text example one at a time, updating its hidden state at each step. It takes input of shape (batch_size, max_tokens, embed_len) and, because return_sequences is False by default, returns only the final hidden state of shape (batch_size, rnn_out) = (batch_size, 50).
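
To make the recurrence concrete, here is a minimal NumPy sketch of a forward pass through a simple recurrent layer. It assumes the tanh update that SimpleRNN uses by default, h_t = tanh(x_t W + h_{t-1} U + b), and uses random weights in place of trained ones; the function name simple_rnn_forward is hypothetical.

```python
import numpy as np

def simple_rnn_forward(x, kernel, recurrent_kernel, bias):
    # x: (batch_size, timesteps, input_dim)
    batch_size, timesteps, _ = x.shape
    h = np.zeros((batch_size, recurrent_kernel.shape[0]))  # initial state is all zeros
    for t in range(timesteps):
        # The new state mixes the current token embedding with the previous state.
        h = np.tanh(x[:, t, :] @ kernel + h @ recurrent_kernel + bias)
    return h  # only the last state is returned

rng = np.random.default_rng(0)
embed_len, rnn_out, max_tokens = 30, 50, 25

kernel = rng.normal(scale=0.1, size=(embed_len, rnn_out))
recurrent_kernel = rng.normal(scale=0.1, size=(rnn_out, rnn_out))
bias = np.zeros(rnn_out)

x = rng.normal(size=(4, max_tokens, embed_len))   # stand-in for embedded tokens
last_state = simple_rnn_forward(x, kernel, recurrent_kernel, bias)

n_params = kernel.size + recurrent_kernel.size + bias.size
print(last_state.shape, n_params)
```

The result has shape (4, 50), one final hidden state per example, and the three weight arrays hold 4050 values in total, which agrees with the SimpleRNN row of the model summary.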

The output of the Recurrent layer is given to Dense layer which has 4 output units that are the same as the number of target classes. The dense layer applies softmax activation to output to create probabilities.

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Embedding

embed_len = 30
rnn_out = 50

model = Sequential([
                    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embed_len,  input_length=max_tokens),
                    SimpleRNN(rnn_out),
                    Dense(len(target_classes), activation="softmax")
                ])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 25, 30)            2160090
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 50)                4050
_________________________________________________________________
dense (Dense)                (None, 4)                 204
=================================================================
Total params: 2,164,344
Trainable params: 2,164,344
Non-trainable params: 0
_________________________________________________________________
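
The parameter counts in the summary can be reproduced with a little arithmetic. In the sketch below the vocabulary size is not taken from the tokenizer but inferred from the summary itself (2,160,090 embedding weights divided by 30 values per token), so treat it as an assumption.

```python
embed_len, rnn_out, n_classes = 30, 50, 4

# Vocabulary size inferred from the summary: 2,160,090 weights / 30 values per token.
vocab_len = 2160090 // embed_len

embedding_params = vocab_len * embed_len
rnn_params = (embed_len + rnn_out + 1) * rnn_out   # input weights + recurrent weights + bias
dense_params = (rnn_out + 1) * n_classes           # weights + bias

total = embedding_params + rnn_params + dense_params
print(vocab_len, embedding_params, rnn_params, dense_params, total)
```

Almost all of the weights sit in the embedding matrix; the recurrent and dense layers together account for only 4,254 parameters.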

Compile Network

In this section, we have compiled our network to use RMSProp optimizer, cross entropy loss, and accuracy metric.

model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
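
For intuition about what the loss measures, the NumPy sketch below computes softmax probabilities and the sparse categorical cross entropy for two made-up examples. "Sparse" here means that labels are integer class ids rather than one-hot vectors. The function names and logit values are illustrative assumptions, not Keras internals.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(y_true, probs):
    # y_true holds integer class ids; pick the probability of the true class.
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.log(picked)

logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 0.0, 3.0, 0.0]])   # made-up scores for 4 classes
y_true = np.array([0, 2])

probs = softmax(logits)
losses = sparse_categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1), losses.mean())
```

The loss is small when the network assigns a high probability to the true class and grows as that probability approaches zero.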

Train Network

Here, we have trained our network by calling fit() method with a batch size of 512 for 5 epochs. We have provided test data as validation data for monitoring accuracy during training. The loss and accuracy values printed after each epoch show training accuracy rising steadily to about 95%. Validation accuracy peaks at about 89.6% in epoch 3 and then falls to 86.75% by epoch 5 while validation loss rises, which suggests the network starts to overfit in the later epochs.
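
The batch size and epoch count determine how many weight updates training performs. A quick arithmetic check, using only the dataset size and settings from this tutorial, reproduces the 235 steps per epoch that appear in the training log.

```python
import math

n_train, batch_size, epochs = 120000, 512, 5

steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch is smaller
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)
```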

model.fit(X_train_pad, Y_train, batch_size=512, epochs=5, validation_data=(X_test_pad, Y_test))

Epoch 1/5
235/235 [==============================] - 10s 35ms/step - loss: 0.5648 - accuracy: 0.7921 - val_loss: 0.3859 - val_accuracy: 0.8721
Epoch 2/5
235/235 [==============================] - 8s 36ms/step - loss: 0.2796 - accuracy: 0.9077 - val_loss: 0.3439 - val_accuracy: 0.8842
Epoch 3/5
235/235 [==============================] - 8s 33ms/step - loss: 0.2191 - accuracy: 0.9280 - val_loss: 0.3134 - val_accuracy: 0.8959
Epoch 4/5
235/235 [==============================] - 8s 33ms/step - loss: 0.1814 - accuracy: 0.9402 - val_loss: 0.3434 - val_accuracy: 0.8907
Epoch 5/5
235/235 [==============================] - 8s 33ms/step - loss: 0.1496 - accuracy: 0.9518 - val_loss: 0.3938 - val_accuracy: 0.8675

<keras.callbacks.History at 0x7fd1fa859210>

Evaluate Network Performance

In this section, we have evaluated the performance of our trained network by calculating accuracy score, classification report, and confusion matrix on test predictions. The network reaches a test accuracy of 86.75%. We have calculated these metrics using functions available from scikit-learn.

If you want to learn about various ML metrics available from sklearn then please check the below link which covers the majority of them in-depth.

Scikit-Learn - Model Evaluation & Scoring Metrics

Apart from calculations, we have also plotted a confusion matrix for a better understanding of performance. We can notice from the visualization that the network has the highest recall for the Business category (about 90%). The classification report shows that Business also has the lowest precision (0.80), largely because 269 Sci/Tech documents are predicted as Business.

We have created a confusion matrix plot using Python library scikit-plot. It provides visualization for many ML metrics. Please feel free to check the below link if you want to know about it in detail.

Scikit-Plot: Visualizing Machine Learning Algorithm Results & Performance Metrics

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_preds = model.predict(X_test_pad).argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))

Test Accuracy : 0.8675

Classification Report :
              precision    recall  f1-score   support

       World       0.84      0.88      0.86      1900
      Sports       0.97      0.86      0.91      1900
    Business       0.80      0.90      0.85      1900
    Sci/Tech       0.88      0.83      0.85      1900

    accuracy                           0.87      7600
   macro avg       0.87      0.87      0.87      7600
weighted avg       0.87      0.87      0.87      7600


Confusion Matrix :
[[1681   26  116   77]
 [ 213 1627   44   16]
 [  51   14 1712  123]
 [  48   10  269 1573]]
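
The per-class numbers in the classification report can be recomputed directly from the confusion matrix, which helps when reading it. The sketch below copies the matrix printed above and assumes the scikit-learn convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes.
cm = np.array([[1681,   26,  116,   77],
               [ 213, 1627,   44,   16],
               [  51,   14, 1712,  123],
               [  48,   10,  269, 1573]])
target_classes = ["World", "Sports", "Business", "Sci/Tech"]

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)      # correct / all actual members of the class
precision = true_positives / cm.sum(axis=0)   # correct / all predictions of the class
accuracy = true_positives.sum() / cm.sum()

for name, p, r in zip(target_classes, precision, recall):
    print(f"{name:<10} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```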

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Network Predictions using LIME Algorithm

In this section, we have tried to explain the predictions made by our trained network using the LIME algorithm. We have used a Python library named lime which provides an implementation of the algorithm. It lets us create a visualization that highlights the words of a text document that contributed to predicting a particular target label.

If you are new to the concept of LIME and want to learn about it in depth then we recommend that you go through the below links as it'll help you better understand it.

In order to create an explanation using LIME, we first need to create an instance of LimeTextExplainer which we have created below.

from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

explainer

<lime.lime_text.LimeTextExplainer at 0x7fd1d8337c90>

Below, we have first created a simple prediction function that takes a list of text examples as input and returns their prediction probabilities. Later on, we'll use this function to generate explanations.

After defining a function, we randomly selected a text example from the test dataset. Then, we have made predictions on it using our trained network. Our network correctly predicts the target label as Business for the selected text example. Next, we'll create an explanation for this selected example.

def make_predictions(X_batch_text):
    X = tokenizer.texts_to_sequences(X_batch_text)
    X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
    preds = model.predict(X)
    return preds

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

Prediction :  Business
Actual :      Business
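
The prediction function handed to LIME has to accept a list of raw text strings and return an array of shape (n_texts, n_classes) holding class probabilities. The sketch below checks that contract with a toy keyword-based classifier; the keyword lists and the name toy_predict_proba are hypothetical and stand in for the trained network only to show the expected input and output shapes.

```python
import numpy as np

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

# Hypothetical keyword lists, used only to produce some class scores.
keywords = {
    "World": ["election", "minister"],
    "Sports": ["match", "coach"],
    "Business": ["airlines", "bankruptcy"],
    "Sci/Tech": ["software", "chip"],
}

def toy_predict_proba(texts):
    # One row per text, one column per class; rows are normalized to sum to 1.
    scores = np.array([[1 + sum(word in text.lower() for word in keywords[name])
                        for name in target_classes]
                       for text in texts], dtype=float)
    return scores / scores.sum(axis=1, keepdims=True)

probs = toy_predict_proba(["Airlines seek bankruptcy financing",
                           "Coach praises team after match"])
print(probs.shape)
print(probs.argmax(axis=1))
```

make_predictions() follows the same contract: it takes raw strings, applies the tokenizer and padding, and returns the softmax output of the network.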

Below, we have first called explain_instance() method on LimeTextExplainer to create an Explanation object. We have provided a text example, prediction function, and target label to the method for creating an explanation. Then, we have called show_in_notebook() method on the explanation object to create the visualization. From the visualization, we can notice that words like 'concessions', 'financing', 'airlines', 'pensions', 'bankruptcy', etc. contribute to predicting the target label as Business.

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

Approach 2: Single Recurrent Layer Network (Max Tokens=50, Embedding Length=30, RNN Output=50)

Our approach in this section is precisely the same as our previous approach with only a difference in the maximum tokens we have used per text example. We have increased the token count to 50 per text example. The network architecture and rest of the code are exactly the same as the previous approach. We are doing this experiment to see whether increasing the tokens count per text example helps us improve accuracy further or not.

Truncate Data At Max Tokens

In this section, we have simply truncated our vectorized datasets to keep 50 tokens per text example.

from keras.preprocessing.sequence import pad_sequences

max_tokens=50

X_train_pad = pad_sequences(X_train, maxlen=max_tokens, padding="post", truncating="post", value=0)
X_test_pad  = pad_sequences(X_test , maxlen=max_tokens, padding="post", truncating="post", value=0)

X_train_pad.shape, X_test_pad.shape

((120000, 50), (7600, 50))

Define Network

Here, we have redefined the network that we'll use for the classification task. It is an exact copy of the network we used in the previous approach; only the input length changes from 25 to 50 tokens.

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Embedding

embed_len = 30
rnn_out = 50

model = Sequential([
                    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embed_len,  input_length=max_tokens),
                    SimpleRNN(rnn_out),
                    Dense(len(target_classes), activation="softmax")
                ])

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 50, 30)            2160090
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 50)                4050
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 204
=================================================================
Total params: 2,164,344
Trainable params: 2,164,344
Non-trainable params: 0
_________________________________________________________________

Compile Network

Here, we have compiled the network to use RMSProp optimizer, cross entropy loss, and accuracy metric.

model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Train Network

Below, we have trained our network using the same settings that we had used in the previous section. The loss and accuracy values printed after each epoch show validation accuracy reaching about 89.9% in epochs 3 and 4 before dropping to 87.8% in the final epoch, while training accuracy keeps rising.

model.fit(X_train_pad, Y_train, batch_size=512, epochs=5, validation_data=(X_test_pad, Y_test))

Epoch 1/5
235/235 [==============================] - 15s 60ms/step - loss: 0.6415 - accuracy: 0.7711 - val_loss: 0.3759 - val_accuracy: 0.8828
Epoch 2/5
235/235 [==============================] - 15s 63ms/step - loss: 0.3240 - accuracy: 0.8973 - val_loss: 0.3286 - val_accuracy: 0.8942
Epoch 3/5
235/235 [==============================] - 14s 60ms/step - loss: 0.2514 - accuracy: 0.9213 - val_loss: 0.3318 - val_accuracy: 0.8989
Epoch 4/5
235/235 [==============================] - 14s 60ms/step - loss: 0.2122 - accuracy: 0.9330 - val_loss: 0.3433 - val_accuracy: 0.8988
Epoch 5/5
235/235 [==============================] - 14s 62ms/step - loss: 0.1810 - accuracy: 0.9425 - val_loss: 0.4012 - val_accuracy: 0.8778

<keras.callbacks.History at 0x7fd1cdcc6790>

Evaluate Network Performance

In this section, we have evaluated the performance of the trained network by calculating accuracy score, classification report, and confusion matrix on test predictions. The test accuracy of 87.78% is a small improvement over the 86.75% of our previous approach. We have also plotted the confusion matrix for reference purposes. From the confusion matrix plot, we can notice that our network classifies Sports (recall 0.97) and World (recall 0.91) documents better than Sci/Tech and Business documents (recall 0.81 each).

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_preds = model.predict(X_test_pad).argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))

Test Accuracy : 0.8777631578947368

Classification Report :
              precision    recall  f1-score   support

       World       0.85      0.91      0.88      1900
      Sports       0.88      0.97      0.93      1900
    Business       0.87      0.81      0.84      1900
    Sci/Tech       0.91      0.81      0.86      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600


Confusion Matrix :
[[1733   78   55   34]
 [  13 1850   28    9]
 [ 164   80 1546  110]
 [ 130   86  142 1542]]

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Network Predictions using LIME Algorithm

In this section, we have explained predictions made by our trained network using LIME algorithm. The network correctly predicts the target category as Business. The visualization shows that words like 'airlines', 'united', 'seeks', 'cuts', 'pensions', 'million', 'labor', etc are contributing to predicting target label as Business.

from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

Approach 3: Single Bidirectional Recurrent Layer Network (Max Tokens=50, Embedding Length=30, RNN Output=50)

Our approach in this section again consists of a single recurrent layer, but the recurrent layer is bidirectional this time. By default, recurrent layers in Keras are unidirectional, which means that they process a sequence in the forward direction from beginning to end. A bidirectional recurrent layer processes the sequence in both forward and backward directions to find patterns in both directions and, by default, concatenates the two results, which is why the layer outputs 100 values per example instead of 50. The majority of the code in this section is the same as the previous one, with the only difference being the usage of the bidirectional recurrent layer.
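
The idea can be sketched in NumPy as two independent recurrent passes whose final states are concatenated. This is a simplified illustration with random weights and hypothetical helper names (rnn_last_state, init_weights); it assumes the default tanh recurrence and concatenation as the merge step.

```python
import numpy as np

def rnn_last_state(x, kernel, recurrent_kernel, bias):
    # Plain tanh recurrence over the time axis; returns the final hidden state.
    h = np.zeros((x.shape[0], recurrent_kernel.shape[0]))
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ kernel + h @ recurrent_kernel + bias)
    return h

def init_weights(rng, input_dim, units):
    # Random stand-ins for trained weights: (kernel, recurrent_kernel, bias).
    return (rng.normal(scale=0.1, size=(input_dim, units)),
            rng.normal(scale=0.1, size=(units, units)),
            np.zeros(units))

rng = np.random.default_rng(0)
embed_len, rnn_out, max_tokens = 30, 50, 50
x = rng.normal(size=(4, max_tokens, embed_len))   # stand-in for embedded tokens

forward_weights = init_weights(rng, embed_len, rnn_out)
backward_weights = init_weights(rng, embed_len, rnn_out)   # separate set of weights

forward_state = rnn_last_state(x, *forward_weights)
backward_state = rnn_last_state(x[:, ::-1, :], *backward_weights)  # reversed token order

merged = np.concatenate([forward_state, backward_state], axis=-1)
n_params = sum(w.size for w in forward_weights + backward_weights)
print(merged.shape, n_params)
```

Because each direction has its own weights, the parameter count doubles from 4050 to 8100, and the dense layer that follows receives 100 inputs instead of 50.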

Define Network

Below, we have defined the network that we'll use in this section. It has the same architecture as our previous approach with the only difference that we have wrapped the recurrent layer inside of Bidirectional() constructor to make it bidirectional.

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Embedding, Bidirectional

embed_len = 30
rnn_out = 50

model = Sequential([
                    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embed_len,  input_length=max_tokens),
                    Bidirectional(SimpleRNN(rnn_out)),
                    Dense(len(target_classes), activation="softmax")
                ])

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 50, 30)            2160090
_________________________________________________________________
bidirectional (Bidirectional (None, 100)               8100
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 404
=================================================================
Total params: 2,168,594
Trainable params: 2,168,594
Non-trainable params: 0
_________________________________________________________________

Compile Network

Here, we have compiled the network to use RMSProp as an optimizer, cross entropy as a loss function, and accuracy as an evaluation metric.

model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Train Network

Below, we have trained our network using the same settings we have been using for all our sections. The accuracy and loss values printed after each epoch show validation accuracy peaking at about 91.2% in epoch 3 and finishing at 90.4%, higher than in both previous approaches.

model.fit(X_train_pad, Y_train, batch_size=512, epochs=5, validation_data=(X_test_pad, Y_test))

Epoch 1/5
235/235 [==============================] - 26s 104ms/step - loss: 0.6045 - accuracy: 0.7780 - val_loss: 0.3466 - val_accuracy: 0.8851
Epoch 2/5
235/235 [==============================] - 24s 104ms/step - loss: 0.2616 - accuracy: 0.9153 - val_loss: 0.2922 - val_accuracy: 0.9016
Epoch 3/5
235/235 [==============================] - 26s 110ms/step - loss: 0.2014 - accuracy: 0.9341 - val_loss: 0.2692 - val_accuracy: 0.9117
Epoch 4/5
235/235 [==============================] - 26s 110ms/step - loss: 0.1670 - accuracy: 0.9448 - val_loss: 0.2781 - val_accuracy: 0.9113
Epoch 5/5
235/235 [==============================] - 26s 111ms/step - loss: 0.1397 - accuracy: 0.9538 - val_loss: 0.2960 - val_accuracy: 0.9042

<keras.callbacks.History at 0x7fd1fb254d90>

Evaluate Network Performance

In this section, we have evaluated the performance of our trained network by calculating accuracy score, classification report, and confusion matrix. The test accuracy of 90.42% is noticeably better than that of our previous approaches (86.75% and 87.78%). We have also plotted the confusion matrix for reference purposes.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_preds = model.predict(X_test_pad).argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))

Test Accuracy : 0.9042105263157895

Classification Report :
              precision    recall  f1-score   support

       World       0.90      0.92      0.91      1900
      Sports       0.96      0.97      0.97      1900
    Business       0.86      0.88      0.87      1900
    Sci/Tech       0.91      0.84      0.87      1900

    accuracy                           0.90      7600
   macro avg       0.90      0.90      0.90      7600
weighted avg       0.90      0.90      0.90      7600


Confusion Matrix :
[[1755   52   51   42]
 [  25 1851   15    9]
 [ 101   18 1664  117]
 [  70   15  213 1602]]

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Network Predictions using LIME Algorithm

In this section, we have explained predictions made by our network using LIME. The network correctly predicts the target label as Business for the selected text example from the test set. The visualization shows that words like 'airlines', 'pensions', 'bankruptcy', 'concessions', 'labor', 'financing', etc. contribute to predicting the target label as Business.

from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

Approach 4: Stacking Multiple Recurrent Layers (Max Tokens=50, Embedding Length=30, RNN Output=50,60,75)

Our approach in this section uses RNN with multiple recurrent layers. We have stacked multiple recurrent layers to check whether they improve accuracy over single recurrent layers or not. The majority of the code in this section is the same as our earlier sections with the only difference being network architecture.

Define Network

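Below, we have defined the network that we'll use for our task in this section. The network consists of a single embedding layer followed by 3 recurrent layers and one dense layer. The recurrent layers have 50, 60, and 75 output units respectively. The first two recurrent layers set return_sequences=True so that they pass the hidden state of every token, not just the last one, to the next recurrent layer.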
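
The effect of return_sequences on the shapes flowing through a stack of recurrent layers can be illustrated with a small NumPy sketch. The simple_rnn helper below is hypothetical, omits the bias term for brevity, and uses random weights, so only the shapes are meaningful.

```python
import numpy as np

def simple_rnn(x, units, rng, return_sequences=False):
    # Random weights stand in for trained ones; the bias is omitted in this sketch.
    kernel = rng.normal(scale=0.1, size=(x.shape[-1], units))
    recurrent_kernel = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros((x.shape[0], units))
    states = []
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ kernel + h @ recurrent_kernel)
        states.append(h)
    # Either every per-token state, or only the last one.
    return np.stack(states, axis=1) if return_sequences else h

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 50, 30))   # (batch, max_tokens, embed_len)

out1 = simple_rnn(x, 50, rng, return_sequences=True)
out2 = simple_rnn(out1, 60, rng, return_sequences=True)
out3 = simple_rnn(out2, 75, rng)

print(out1.shape, out2.shape, out3.shape)
```

Each intermediate layer keeps the time axis so the next layer still has a sequence to loop over, while the last layer collapses it to a single vector for the dense layer.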

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Embedding

embed_len = 30
rnn_out1 = 50
rnn_out2 = 60
rnn_out3 = 75

model = Sequential([
                    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embed_len,  input_length=max_tokens),
                    SimpleRNN(rnn_out1, return_sequences=True),
                    SimpleRNN(rnn_out2, return_sequences=True),
                    SimpleRNN(rnn_out3),
                    Dense(len(target_classes), activation="softmax")
                ])

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_3 (Embedding)      (None, 50, 30)            2160090
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, 50, 50)            4050
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, 50, 60)            6660
_________________________________________________________________
simple_rnn_5 (SimpleRNN)     (None, 75)                10200
_________________________________________________________________
dense_3 (Dense)              (None, 4)                 304
=================================================================
Total params: 2,181,304
Trainable params: 2,181,304
Non-trainable params: 0
_________________________________________________________________
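The parameter counts in the summary above can be reproduced by hand. The sketch below assumes the standard SimpleRNN layout (an input kernel, a recurrent kernel, and a bias per layer); the helper names are ours and are not part of Keras.

```python
def simple_rnn_params(input_dim, units):
    # input kernel (input_dim x units) + recurrent kernel (units x units) + bias (units)
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # weight matrix (input_dim x units) + bias (units)
    return units * (input_dim + 1)

embed_len = 30
rnn_units = [50, 60, 75]
n_classes = 4
embedding_params = 2160090  # copied from the model summary above

counts = []
input_dim = embed_len
for units in rnn_units:
    counts.append(simple_rnn_params(input_dim, units))
    input_dim = units  # the next layer reads this layer's output

dense = dense_params(input_dim, n_classes)
total = embedding_params + sum(counts) + dense

print(counts, dense, total)
```

Almost all of the parameters sit in the embedding layer; the three recurrent layers together add only about 21 thousand.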

Compile Network¶

Here, we have compiled our network to use RMSProp as an optimizer, cross entropy loss, and accuracy as an evaluation metric.

model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Train Network¶

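Below, we have trained our network using the same settings that we have used for all our previous approaches. The training loss falls steadily, but the validation loss is lowest at epoch 2 (0.3584) and rises to 0.4852 by epoch 5, while validation accuracy drops from 88.12% to 83.63%. This suggests that the network starts overfitting after the second epoch.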

model.fit(X_train_pad, Y_train, batch_size=512, epochs=5, validation_data=(X_test_pad, Y_test))

Epoch 1/5
235/235 [==============================] - 47s 188ms/step - loss: 0.7341 - accuracy: 0.7042 - val_loss: 0.4671 - val_accuracy: 0.8430
Epoch 2/5
235/235 [==============================] - 43s 185ms/step - loss: 0.3372 - accuracy: 0.8889 - val_loss: 0.3584 - val_accuracy: 0.8812
Epoch 3/5
235/235 [==============================] - 43s 182ms/step - loss: 0.2613 - accuracy: 0.9143 - val_loss: 0.3772 - val_accuracy: 0.8701
Epoch 4/5
235/235 [==============================] - 46s 195ms/step - loss: 0.2173 - accuracy: 0.9282 - val_loss: 0.3688 - val_accuracy: 0.8784
Epoch 5/5
235/235 [==============================] - 44s 186ms/step - loss: 0.1800 - accuracy: 0.9403 - val_loss: 0.4852 - val_accuracy: 0.8363

<keras.callbacks.History at 0x7fd1cec87210>
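One quick way to read the log above is to look for the epoch with the lowest validation loss. The sketch below simply copies the five validation values from the log into lists; it does not retrain anything.

```python
# Validation values copied from the training log above (epochs 1 to 5).
val_loss = [0.4671, 0.3584, 0.3772, 0.3688, 0.4852]
val_acc = [0.8430, 0.8812, 0.8701, 0.8784, 0.8363]

# Index of the smallest validation loss, converted to a 1-based epoch number.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# How much validation accuracy was lost by training until the last epoch.
drop = val_acc[best_epoch - 1] - val_acc[-1]

print("Best epoch :", best_epoch)
print("Accuracy lost by epoch 5 : {:.4f}".format(drop))
```

If we wanted Keras to stop at that point automatically, one option would be the EarlyStopping callback monitoring val_loss, though we have not used it in this tutorial.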

Evaluate Network Performance¶

In this section, we have evaluated the performance of our network by calculating accuracy, classification report and confusion matrix metrics on test predictions. We can notice from the accuracy score (83.63%) that it is the lowest of all the approaches we have tried so far. This is surprising, as we had expected that stacking multiple recurrent layers would give the network more power to capture patterns, but it turned out otherwise. The classification report shows that the World category suffers the most, with a recall of only 0.72. We have also plotted the confusion matrix for reference purposes.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_preds = model.predict(X_test_pad).argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))

Test Accuracy : 0.8363157894736842

Classification Report :
              precision    recall  f1-score   support

       World       0.93      0.72      0.81      1900
      Sports       0.85      0.96      0.91      1900
    Business       0.84      0.77      0.80      1900
    Sci/Tech       0.76      0.89      0.82      1900

    accuracy                           0.84      7600
   macro avg       0.84      0.84      0.83      7600
weighted avg       0.84      0.84      0.83      7600


Confusion Matrix :
[[1370  282   89  159]
 [  44 1832   15    9]
 [  36   19 1469  376]
 [  27   11  177 1685]]
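The numbers in the classification report can be derived directly from the confusion matrix, where rows are actual classes and columns are predicted classes. Below is a small sketch that copies the matrix printed above and recomputes accuracy, precision, and recall with NumPy.

```python
import numpy as np

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = np.array([[1370,  282,   89,  159],
               [  44, 1832,   15,    9],
               [  36,   19, 1469,  376],
               [  27,   11,  177, 1685]])

correct = cm.diagonal()
recall = correct / cm.sum(axis=1)     # correct / all examples of that actual class
precision = correct / cm.sum(axis=0)  # correct / all examples predicted as that class
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(target_classes, precision, recall):
    print("{:<10} precision={:.2f} recall={:.2f}".format(name, p, r))
print("accuracy={:.4f}".format(accuracy))
```

Reading the first row, most of the missed World examples (282 of them) were predicted as Sports, which explains the low recall for that class.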

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Network Predictions using LIME Algorithm¶

In this section, we have explained the network's prediction using LIME. The network correctly predicts the target label as Business for the selected text example. According to the visualization, words like 'united', 'further', etc. contribute to predicting the target label as Business.

from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
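The make_predictions function passed as classifier_fn is defined earlier in the tutorial. LIME calls it with a list of perturbed strings and expects back a 2-D array of class probabilities with one row per string. The sketch below illustrates only that contract: toy_classifier_fn and the keyword lists are hypothetical stand-ins for the trained network, chosen so the example runs without Keras.

```python
import numpy as np

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

# Hypothetical keyword lists; they stand in for a trained model in this sketch.
toy_keywords = {
    "World": ["minister", "election"],
    "Sports": ["match", "coach"],
    "Business": ["shares", "profit"],
    "Sci/Tech": ["software", "chip"],
}

def toy_classifier_fn(texts):
    """Take a list of strings, return probabilities of shape (len(texts), n_classes)."""
    scores = np.zeros((len(texts), len(target_classes)))
    for row, text in enumerate(texts):
        words = text.lower().split()
        for col, label in enumerate(target_classes):
            scores[row, col] = sum(words.count(k) for k in toy_keywords[label])
    # Softmax so that every row sums to 1, like the network's output layer.
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

probs = toy_classifier_fn(["Shares rose after the profit report",
                           "The coach praised the match"])
labels = [target_classes[i] for i in probs.argmax(axis=1)]

print(probs.shape)
print(labels)
```

Any function with this input and output shape can be handed to explain_instance, which is why the same explainer code works unchanged for every network in this tutorial.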

Approach 5: Stacking Multiple Bidirectional Recurrent Layers (Max Tokens=50, Embedding Length=30, RNN Output=50,60,75) ¶

Our approach in this section is the same as our previous approach with one change. We are again using multiple RNN layers in the network, but this time all of them are bidirectional. Most of the code is the same as in our previous section; only the network architecture changes.

Define Network¶

Below, we have defined the network that we'll be using in this section. The network architecture is the same as in the previous approach, with the only change being that we have wrapped all RNN layers in the Bidirectional() wrapper. Each wrapped layer processes the sequence in both the forward and backward directions and, by default, concatenates the two outputs, which is why the output sizes in the summary below double to 100, 120, and 150.

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Embedding, Bidirectional

embed_len = 30
rnn_out1 = 50
rnn_out2 = 60
rnn_out3 = 75

model = Sequential([
                    Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=embed_len,  input_length=max_tokens),
                    Bidirectional(SimpleRNN(rnn_out1, return_sequences=True)),
                    Bidirectional(SimpleRNN(rnn_out2, return_sequences=True)),
                    Bidirectional(SimpleRNN(rnn_out3)),
                    Dense(len(target_classes), activation="softmax")
                ])

model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_4 (Embedding)      (None, 50, 30)            2160090
_________________________________________________________________
bidirectional_1 (Bidirection (None, 50, 100)           8100
_________________________________________________________________
bidirectional_2 (Bidirection (None, 50, 120)           19320
_________________________________________________________________
bidirectional_3 (Bidirection (None, 150)               29400
_________________________________________________________________
dense_4 (Dense)              (None, 4)                 604
=================================================================
Total params: 2,217,514
Trainable params: 2,217,514
Non-trainable params: 0
_________________________________________________________________
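To see where the doubled output size and the 8,100 parameters of the first bidirectional layer come from, here is a minimal NumPy sketch of a bidirectional simple RNN. It assumes tanh activation and concatenation of the two directions (the Keras defaults); the function names and random weights are ours, for illustration only.

```python
import numpy as np

def simple_rnn_forward(x, W, U, b):
    """x has shape (time, input_dim). Returns every hidden state, shape (time, units)."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ W + h @ U + b)
        states.append(h)
    return np.stack(states)

def init_weights(rng, input_dim, units):
    # input kernel, recurrent kernel, bias
    return (rng.normal(size=(input_dim, units)) * 0.1,
            rng.normal(size=(units, units)) * 0.1,
            np.zeros(units))

rng = np.random.default_rng(0)
time_steps, input_dim, units = 50, 30, 50  # max tokens, embedding length, RNN output
x = rng.normal(size=(time_steps, input_dim))  # one embedded text example

fwd_weights = init_weights(rng, input_dim, units)
bwd_weights = init_weights(rng, input_dim, units)  # the backward copy has its own weights

fwd = simple_rnn_forward(x, *fwd_weights)
# The backward copy reads the sequence reversed; flip its output back so time steps align.
bwd = simple_rnn_forward(x[::-1], *bwd_weights)[::-1]
out = np.concatenate([fwd, bwd], axis=-1)

n_params = sum(w.size for w in fwd_weights + bwd_weights)
print(out.shape, n_params)
```

Because the next layer receives 100 features per time step instead of 50, its own parameter count grows as well, which accounts for the larger totals in the summary above.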

Compile Network¶

Here, we have compiled our network to use RMSProp optimizer, cross entropy loss, and accuracy metric.

model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Train Network¶

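Below, we have trained our network using the same settings that we have been using for all our previous approaches. Validation accuracy reaches about 90.5% by the fifth epoch, although the validation loss stops improving after epoch 2 and the gap to the training accuracy (96.88%) keeps widening, which hints at some overfitting. Each epoch also takes roughly twice as long as in the previous approach because every layer now runs in both directions.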

model.fit(X_train_pad, Y_train, batch_size=512, epochs=5, validation_data=(X_test_pad, Y_test))

Epoch 1/5
235/235 [==============================] - 100s 400ms/step - loss: 0.5647 - accuracy: 0.7799 - val_loss: 0.3728 - val_accuracy: 0.8778
Epoch 2/5
235/235 [==============================] - 94s 401ms/step - loss: 0.2366 - accuracy: 0.9212 - val_loss: 0.2955 - val_accuracy: 0.9013
Epoch 3/5
235/235 [==============================] - 94s 400ms/step - loss: 0.1808 - accuracy: 0.9398 - val_loss: 0.3041 - val_accuracy: 0.9029
Epoch 4/5
235/235 [==============================] - 95s 404ms/step - loss: 0.1369 - accuracy: 0.9545 - val_loss: 0.3520 - val_accuracy: 0.8887
Epoch 5/5
235/235 [==============================] - 96s 407ms/step - loss: 0.0958 - accuracy: 0.9688 - val_loss: 0.3478 - val_accuracy: 0.9054

<keras.callbacks.History at 0x7fd1a86bfd10>

Evaluate Network Performance¶

In this section, we have evaluated the performance of the trained network as usual by calculating the accuracy score, classification report and confusion matrix metrics. We can notice from the accuracy score that it is the highest of all our approaches, though only marginally above the single bidirectional layer network of Approach 3. Apart from the metrics calculations, we have also plotted the confusion matrix for reference purposes.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_preds = model.predict(X_test_pad).argmax(axis=-1)

print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))

Test Accuracy : 0.9053947368421053

Classification Report :
              precision    recall  f1-score   support

       World       0.91      0.91      0.91      1900
      Sports       0.94      0.98      0.96      1900
    Business       0.87      0.88      0.88      1900
    Sci/Tech       0.90      0.85      0.88      1900

    accuracy                           0.91      7600
   macro avg       0.91      0.91      0.90      7600
weighted avg       0.91      0.91      0.90      7600


Confusion Matrix :
[[1720   57   72   51]
 [  14 1869   10    7]
 [  76   30 1680  114]
 [  83   34  171 1612]]
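A confusion matrix is also useful for finding which pairs of classes the network mixes up most often. The sketch below copies the matrix printed above, zeroes out the diagonal, and lists the three largest remaining entries.

```python
import numpy as np

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = np.array([[1720,   57,   72,   51],
               [  14, 1869,   10,    7],
               [  76,   30, 1680,  114],
               [  83,   34,  171, 1612]])

# Keep only the mistakes by zeroing the correctly classified counts.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)

# Flat indices of the three largest error counts, largest first.
order = np.argsort(off_diag, axis=None)[::-1][:3]

top = []
for flat in order:
    actual, predicted = np.unravel_index(flat, cm.shape)
    top.append((target_classes[actual], target_classes[predicted], int(cm[actual, predicted])))

for actual, predicted, count in top:
    print("{} predicted as {} : {}".format(actual, predicted, count))
```

The two largest error counts are between Business and Sci/Tech in both directions, which matches the lower f1-scores of those two classes in the report.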

from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);

Explain Network Predictions using LIME Algorithm¶

Here, we have tried to explain the predictions of our trained network using the LIME algorithm. Our network correctly predicts the target category as Business for the selected text example. The visualization created using LIME shows that words like 'concessions', 'financing', 'employees', 'cuts', 'bankruptcy', 'pensions', etc. contribute to predicting the target label as Business, which makes sense as they are commonly used words in the business world.

from lime import lime_text
import numpy as np

explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)

rng = np.random.RandomState(1)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)

print("Prediction : ", target_classes[preds.argmax()])
print("Actual :     ", target_classes[Y_test[idx]])

explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()

7. Summarized Results And Further Recommendations ¶

Below, we have summarized all approaches we tried and their performance on the test dataset.

Approach	Max Tokens	Embedding Length	RNN Output	Test Accuracy (%)
Single Recurrent Layer Network	25	30	50	86.75
Single Recurrent Layer Network	50	30	50	87.77
Single Bidirectional Recurrent Layer Network	50	30	50	90.42
Stacking Multiple Recurrent Layers	50	30	50,60,75	83.63
Stacking Multiple Bidirectional Recurrent Layers	50	30	50,60,75	90.53
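The table above can be ranked in a few lines of Python to make the comparison easier to read. The accuracies below are copied from the table; the shortened approach labels are ours.

```python
# (approach, test accuracy in %) copied from the summary table above.
results = [
    ("1: Single RNN layer, 25 tokens", 86.75),
    ("2: Single RNN layer, 50 tokens", 87.77),
    ("3: Single bidirectional RNN layer", 90.42),
    ("4: Stacked RNN layers", 83.63),
    ("5: Stacked bidirectional RNN layers", 90.53),
]

baseline = results[0][1]  # Approach 1 is used as the reference point
ranked = sorted(results, key=lambda r: r[1], reverse=True)

for name, acc in ranked:
    print("{:<38} {:6.2f}  ({:+.2f} vs. approach 1)".format(name, acc, acc - baseline))
```

Both bidirectional networks land at the top, and the difference between them (about 0.1 percentage points) is small enough that a single training run cannot tell us which one is really better.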

Further Suggestions¶

Try different max tokens.
Try different embedding lengths.
Try different RNN layer output units.
Train network for more epochs.
Try different optimizers (e.g., Adam).
Try different weight initializations for the network.
Try stacking a different number of RNN layers.
Try learning rate schedulers.
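As an illustration of the last suggestion, below is a minimal sketch of a step-decay schedule written as a plain Python function. The function name, the decay factor, and the step size are our own choices and have not been tuned for this dataset. The function takes (epoch, lr) and returns a new learning rate, which is the form the Keras LearningRateScheduler callback accepts as its schedule argument.

```python
def step_decay(epoch, lr, drop=0.5, every=2):
    """Halve the learning rate every `every` epochs (epoch is zero-based)."""
    if epoch > 0 and epoch % every == 0:
        return lr * drop
    return lr

# Simulate six epochs starting from 0.001, the default RMSprop learning rate in Keras.
lr = 0.001
history = []
for epoch in range(6):
    lr = step_decay(epoch, lr)
    history.append(lr)

print(history)
```

Since the networks above reached their lowest validation loss around the second epoch, lowering the learning rate from that point on is one plausible thing to try, though we have not verified that it improves the results.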

Sunny Solanki

keras, RNNs, text-classification
