Updated On : Jul-01,2022 Tags mxnet, text-generation, …

MXNet: Text Generation using LSTM Networks (Character-based RNNs)

Text generation is an NLP task in which a trained network produces new text. Networks used for text generation are referred to as language models, and the process of training them is referred to as language modeling. Other NLP tasks such as translation, text summarization, speech-to-text, and conversational systems (chatbots) also rely on language modeling. Nowadays, deep neural networks are commonly used for language modeling. Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU, etc.) are a popular choice for text generation tasks. The reason is that language has structure: word order matters for correct grammar and for forming sentences. RNNs are quite good at capturing and remembering such sequences and hence are commonly used for language modeling tasks.

As a part of this tutorial, we explain how to create Recurrent Neural Networks (RNNs) consisting of LSTM layers using the Python deep learning library MXNet for text generation tasks. We use a character-based approach, which means we give the network a specified number of characters and make it predict the next character. Each character is encoded as its integer index in the vocabulary. We use the WikiText-2 dataset available from torchtext, which contains text from well-curated Wikipedia articles. We also have another tutorial on text generation using MXNet that uses character embeddings for encoding text data.

Please note that these language models are quite slow to train on a CPU, hence we recommend using a GPU for the task. We have used a GPU for training in this tutorial (see the device selection after the library imports below).

Below, we have listed essential sections of the tutorial to give an overview of the material covered.

Important Sections Of Tutorial

  1. Prepare Data
    • 1.1 Load Data
    • 1.2 Populate Vocabulary
    • 1.3 Reorganize Data for Task
    • 1.4 Create Dataset and Data Loader
  2. Define Network
  3. Train Network
  4. Generate Text
  5. Train Network More
  6. Generate Text
  7. Train Even More
  8. Generate Text
  9. Further Recommendations

Below, we have imported the necessary libraries and printed the versions that we have used in our tutorials.

import mxnet

print("MXNet Version : {}".format(mxnet.__version__))
MXNet Version : 1.9.0
import gluonnlp

print("GluonNLP Version : {}".format(gluonnlp.__version__))
GluonNLP Version : 0.10.0
import torchtext

print("TorchText Version : {}".format(torchtext.__version__))
TorchText Version : 0.10.1
device = mxnet.gpu() if mxnet.test_utils.list_gpus() else mxnet.cpu()

device
gpu(0)

1. Prepare Data

In this section, we are preparing data for our text generation task. As discussed earlier, we'll be using character-based approach for the task. We'll be designing a network that takes 100 characters of data as input and predicts the next character. In order to train the network, we'll organize data following the below steps.

  1. Load dataset.
  2. Loop through each text example of the dataset and create a vocabulary of unique characters. A vocabulary is a simple mapping from a character to an integer index. Each character is assigned a unique index starting from 0.
  3. Move the window of size 100 through text examples of data to create data features (X) and target values (Y). To explain with an example,
    • Characters 1-100 will be data features (X) and character 101 will be the target value (Y).
    • Move the window by one character.
    • Characters 2-101 will be data features (X) and character 102 will be target value (Y).
    • Move the window by one character.
    • Characters 3-102 will be data features (X) and character 103 will be target value (Y).
    • Move the window by one character.
    • and so on.
  4. After creating data features and target values, retrieve character index for characters present in them using vocabulary.
  5. Create data loaders to train the network.

The output of the 5th step will be used to train the network. It's okay if you don't understand the steps exactly, they will become clear as we implement them below.
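The steps above can be sketched on a tiny string in plain Python. This is only an illustration under simplifying assumptions: the names make_windows and toy_vocab are hypothetical, the window size is 5 instead of 100, and the vocabulary is built by hand rather than with gluonnlp.

```python
def make_windows(text, seq_length):
    """Return (X, Y): every run of seq_length characters and the character that follows it."""
    text = text.lower()
    X, Y = [], []
    for i in range(len(text) - seq_length):
        X.append(list(text[i:i + seq_length]))  # characters i .. i+seq_length-1
        Y.append(text[i + seq_length])          # the next character is the target
    return X, Y

# Step 2: vocabulary of unique characters, indexed from 0 in order of first appearance.
sample = "hello, how are you?"
toy_vocab = {}
for ch in sample:
    toy_vocab.setdefault(ch, len(toy_vocab))

# Step 3: slide a window of 5 characters over the text, one character at a time.
X_chars, Y_chars = make_windows(sample, seq_length=5)

# Step 4: replace characters with their vocabulary indexes.
X_idx = [[toy_vocab[ch] for ch in window] for window in X_chars]
Y_idx = [toy_vocab[ch] for ch in Y_chars]

print("".join(X_chars[0]), "->", repr(Y_chars[0]))
print(X_idx[0], "->", Y_idx[0])
```

A text of length n yields n - seq_length examples, which is why the real dataset grows to well over a million examples even though only a part of the text is used.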

1.1 Load Data

In this section, we have loaded the WikiText-2 dataset from the torchtext library. The dataset is already divided into train, validation, and test sets. We'll be using only the train set for our purpose. It has 36,718 text entries, as shown by the length printed below. Each entry is a line of text (such as a heading or a paragraph) taken from Wikipedia articles.

from mxnet.gluon.data import ArrayDataset

train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()

X_train = list(train_dataset)

train_dataset = ArrayDataset(X_train)
wikitext-2-v1.zip: 100%|██████████| 4.48M/4.48M [00:00<00:00, 10.2MB/s]
len(train_dataset)
36718

1.2 Populate Vocabulary

In this section, we have populated the vocabulary of unique characters. To do so, we have used the count_tokens() helper function available from gluonnlp, an NLP helper library for MXNet. We have first created a Counter object from the collections module of Python. It is a dictionary-like object that maintains characters and their counts. After defining the counter, we loop through each text example of the dataset and call count_tokens() with the list of characters of the example and the counter object. The to_lower=True argument lowercases characters before counting. Each call updates the counter with the frequency of characters. After completion of the loop, the Counter object holds all characters of the dataset and their counts.

Then, we have created a vocabulary by calling the Vocab() constructor available from gluonnlp with the Counter object. This Vocab object holds our vocabulary of characters. Note that it reserves the first four indexes for the special tokens <unk>, <pad>, <bos>, and <eos>, so character indexes start at 4. We have printed the number of entries present in the vocabulary as well as the vocabulary contents.
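To make the idea concrete, here is a plain-Python approximation of what counting characters and building a vocabulary amounts to. It is a sketch, not gluonnlp's implementation: build_char_vocab is a hypothetical helper, and the ordering of equally frequent characters may differ from what Vocab() produces.

```python
from collections import Counter

# Special tokens placed first, mirroring indexes 0-3 of the vocabulary printed in this tutorial.
RESERVED = ["<unk>", "<pad>", "<bos>", "<eos>"]

def build_char_vocab(texts):
    """Count lowercase characters and assign indexes by descending frequency."""
    counter = Counter()
    for text in texts:
        counter.update(text.lower())      # iterating a string yields its characters
    ordered = [ch for ch, _ in counter.most_common()]
    idx_to_token = RESERVED + ordered
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return token_to_idx, idx_to_token

token_to_idx, idx_to_token = build_char_vocab(["The cat", "the hat"])

# Characters never seen while counting fall back to the <unk> index.
unk = token_to_idx["<unk>"]
encoded = [token_to_idx.get(ch, unk) for ch in "that?"]
print(encoded)
```

Frequent characters get small indexes, which matches the printed vocabulary where the space character and 'e' come right after the special tokens.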

from collections import Counter

counter = Counter()

for dataset in [train_dataset, ]:
    for X in dataset:
        gluonnlp.data.count_tokens(list(X), to_lower=True, counter=counter)

vocab = gluonnlp.Vocab(counter=counter, min_freq=1)

print("Vocabulary Size : {}".format(len(vocab)))
print(vocab.token_to_idx)
Vocabulary Size : 247
{'<unk>': 0, '<pad>': 1, '<bos>': 2, '<eos>': 3, ' ': 4, 'e': 5, 't': 6, 'a': 7, 'n': 8, 'i': 9, 'o': 10, 'r': 11, 's': 12, 'h': 13, 'd': 14, 'l': 15, 'u': 16, 'c': 17, 'm': 18, 'f': 19, 'p': 20, 'g': 21, 'w': 22, 'b': 23, 'y': 24, 'k': 25, ',': 26, '.': 27, 'v': 28, '<': 29, '>': 30, '@': 31, '\n': 32, '1': 33, '0': 34, '=': 35, '"': 36, '2': 37, "'": 38, '9': 39, '-': 40, 'j': 41, 'x': 42, ')': 43, '(': 44, '3': 45, '5': 46, '8': 47, '4': 48, '6': 49, '7': 50, 'z': 51, 'q': 52, ';': 53, '–': 54, ':': 55, '—': 56, '/': 57, '%': 58, '$': 59, '[': 60, ']': 61, 'é': 62, '&': 63, '!': 64, '’': 65, 'í': 66, 'á': 67, 'ā': 68, '°': 69, '£': 70, '?': 71, 'ó': 72, '+': 73, '#': 74, '−': 75, 'š': 76, 'ö': 77, 'ō': 78, 'è': 79, '×': 80, 'ü': 81, 'ä': 82, '“': 83, 'ʻ': 84, 'ś': 85, '”': 86, 'ć': 87, 'ł': 88, 'ø': 89, 'ç': 90, '₹': 91, 'ã': 92, 'µ': 93, 'ì': 94, 'ư': 95, '→': 96, '\ufeff': 97, 'ñ': 98, '…': 99, 'æ': 100, 'ơ': 101, 'å': 102, '⁄': 103, '☉': 104, '*': 105, '‘': 106, '~': 107, 'ú': 108, 'î': 109, '²': 110, 'ë': 111, 'ệ': 112, 'ī': 113, 'α': 114, 'à': 115, '^': 116, 'ễ': 117, '¥': 118, 'ô': 119, 'ă': 120, 'ū': 121, '♯': 122, 'ê': 123, '‑': 124, 'ỳ': 125, 'đ': 126, 'μ': 127, '्': 128, '≤': 129, 'ل': 130, 'ṃ': 131, '†': 132, '~': 133, '€': 134, '±': 135, 'ė': 136, 'ž': 137, 'β': 138, '〈': 139, '〉': 140, '・': 141, '½': 142, 'û': 143, 'č': 144, 'γ': 145, 'с': 146, 'ṭ': 147, 'ị': 148, '„': 149, '♭': 150, 'â': 151, '̃': 152, 'ا': 153, 'ه': 154, '჻': 155, 'ṅ': 156, 'ầ': 157, 'ớ': 158, '′': 159, '⅓': 160, '大': 161, '空': 162, '¡': 163, '³': 164, '·': 165, 'ş': 166, 'ح': 167, 'ص': 168, 'ن': 169, 'ვ': 170, 'ი': 171, 'კ': 172, 'ო': 173, 'ხ': 174, 'ჯ': 175, 'ḥ': 176, 'ṯ': 177, 'ả': 178, 'ấ': 179, '″': 180, '火': 181, '礮': 182, '\\': 183, '`': 184, '|': 185, '§': 186, 'ò': 187, 'þ': 188, 'ń': 189, 'ų': 190, 'ż': 191, 'ʿ': 192, 'κ': 193, 'а': 194, 'в': 195, 'е': 196, 'к': 197, 'о': 198, 'т': 199, 'я': 200, 'ก': 201, 'ง': 202, 'ณ': 203, 'ต': 204, 'ม': 205, 'ย': 206, 'ร': 207, 'ล': 
208, 'ั': 209, 'า': 210, 'ิ': 211, '่': 212, '์': 213, 'გ': 214, 'დ': 215, 'ზ': 216, 'რ': 217, 'ს': 218, 'უ': 219, 'ც': 220, 'ძ': 221, 'წ': 222, 'ṣ': 223, 'ắ': 224, 'ử': 225, '₤': 226, '⅔': 227, 'の': 228, 'ァ': 229, 'ア': 230, 'キ': 231, 'ス': 232, 'ッ': 233, 'ト': 234, 'プ': 235, 'ュ': 236, 'リ': 237, 'ル': 238, 'ヴ': 239, '動': 240, '場': 241, '戦': 242, '攻': 243, '機': 244, '殻': 245, '隊': 246}

1.3 Reorganize Data for Task

In this section, we have organized our dataset for training purposes. We loop through the text examples of the train dataset one by one and move a window of size 100 through each example, as discussed earlier. This gives us data features (X_train) and target values (Y_train) made of characters. We then look up the index of each character in the vocabulary for both data features (X_train) and target values (Y_train). Now our dataset consists of character indexes, which will be given to the network for training. We have also converted the arrays to MXNet ndarrays, as required by MXNet networks.

Please note that we have used only the first 7,500 text examples from the dataset for training and not the whole dataset. The reason is that it would take a lot of time to train the network on all examples.

Below, we have tried to explain the process with one simple example.

## Illustration only: a toy vocabulary and a short window, not the real pipeline
vocab = {
'h':1,
'e':2,
'l':3,
'o':4,
' ':5,
',':6,
'w':7,
'a':8,
'r':9,
'y':10,
'u':11,
'?':12,
'c':13,
'm':14,
't':15,
'd':16,
'z':17,
'n':18
}

text_example = "Hello, How are you? Welcome to coderzcolumn?" ## Lowercased before windowing
seq_length = 10

X_train = [
            ['h','e','l','l','o',',',' ', 'h','o','w'],
            ['e','l','l','o',',',' ', 'h','o','w',' '],
            ['l','l','o',',',' ', 'h','o','w', ' ', 'a'],
            ['l','o',',',' ', 'h','o','w',' ', 'a', 'r'],
            ...
            ['d','e','r','z','c','o','l', 'u','m','n']
            ]
Y_train = [' ','a','r','e',..., '?'] ## Character that follows each window

X_train_vectorized = [
                        [1,2,3,3,4,6,5,1,4,7],
                        [2,3,3,4,6,5,1,4,7,5],
                        [3,3,4,6,5,1,4,7,5,8],
                        [3,4,6,5,1,4,7,5,8,9],
                        ...
                        [16,2,9,17,13,4,3,11,14,18]
                     ]
Y_train_vectorized = [5,8,9,2,..., 12]
%%time

import gluonnlp.data.batchify as bf
from mxnet import nd
import numpy as np

train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()
train_dataset = ArrayDataset(list(train_dataset))

seq_length = 100 ## Network Hyperparameter to tune
X_train, Y_train = [], []

for text in list(train_dataset)[:7500]:
    for i in range(0, len(text)-seq_length):
        inp_seq = list(text[i:i+seq_length].lower())
        out_seq = text[i+seq_length].lower()
        X_train.append(vocab(inp_seq)) ## Retrieve character index
        Y_train.append(vocab[out_seq]) ## Retrieve character index

X_train, Y_train = nd.array(X_train, dtype=np.float32), nd.array(Y_train)

X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1) ## Extra dimension is added for LSTM layer

X_train.shape, Y_train.shape
CPU times: user 1min 15s, sys: 1.32 s, total: 1min 17s
Wall time: 1min 17s
((1781323, 100, 1), (1781323,))

1.4 Create Dataset and Data Loader

In this section, we have created the dataset and data loader using the data feature and target value ndarrays we created in the previous step. The data loader will help us loop through training data in batches. We have set shuffle to False, so batches are served in the order in which the windows were created. The character order inside each example is fixed by the window itself, so it is preserved regardless of this setting. The batch size is set to 1024.

from mxnet.gluon.data import DataLoader, ArrayDataset

vectorized_train_dataset = ArrayDataset(X_train, Y_train)

train_loader = DataLoader(vectorized_train_dataset, batch_size=1024, shuffle=False)
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
(1024, 100, 1) (1024,)
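As a quick sanity check, the number of batches per epoch follows directly from the dataset size and the batch size. The sketch below assumes the loader keeps the final, smaller batch, which is consistent with the 1740 iterations per epoch shown in the training progress bars later in this tutorial.

```python
import math

num_examples = 1_781_323   # first dimension of X_train printed earlier
batch_size = 1024

# Every full batch holds 1024 examples; the remainder forms one last, smaller batch.
num_batches = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (num_batches - 1) * batch_size

print(num_batches, last_batch_size)
```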

2. Define Network

In this section, we have defined the LSTM network that we'll use for our task. As we'll be predicting one of the vocabulary characters as output, our task can be treated as a classification task. The network is simple and consists of 3 layers.

  1. LSTM Layer
  2. LSTM Layer
  3. Dense Layer

The first two layers of our network are LSTM layers. We have created them using the LSTM() constructor available from the rnn sub-module of the gluon sub-module of MXNet. We have set the hidden (output) size of the LSTM layers to 256. By setting the num_layers parameter to 2, we have asked the constructor to stack two LSTM layers. The input shape to the first LSTM layer will be (batch_size, seq_length, 1) = (batch_size, 100, 1) and its output shape will be (batch_size, seq_length, hidden_size) = (batch_size, 100, 256). The output of the first LSTM layer is given to the second LSTM layer, which also produces an output of shape (batch_size, 100, 256). The LSTM layers process the sequence of character indexes step by step. For each example, they go through the 100 characters and produce a final output that summarizes what they have seen in this 100-character sequence.

Only the output of the second LSTM layer at the last time step, of shape (batch_size, 256), is given to the dense layer (x[:, -1] in the code). The dense layer has as many output units as there are entries in the vocabulary. Its output has shape (batch_size, vocab_len) and holds one unnormalized score (logit) per vocabulary character; this is the prediction of the network.

After defining the network, we initialized it and performed a forward pass through it for verification purposes. We have also printed the shape of weights/biases of layers of network for information purposes.

We have not discussed the LSTM layer in detail here. If you are new to it, we recommend going through our separate tutorial that uses an LSTM network for a text classification task; it will help you understand LSTM layers a little better.

If you are new to MXNet and want to learn how to create neural networks using it, please check our tutorial on that topic in your free time.

from mxnet.gluon import nn, rnn
from mxnet import gluon

hidden_dim = 256
n_layers = 2

class LSTMTextGenerator(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextGenerator, self).__init__(**kwargs)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=1)
        self.dense = nn.Dense(len(vocab))

    def forward(self, x):
        x = self.lstm(x)
        return self.dense(x[:, -1])

model = LSTMTextGenerator()

model
LSTMTextGenerator(
  (lstm): LSTM(1 -> 256, NTC, num_layers=2)
  (dense): Dense(None -> 247, linear)
)
from mxnet import init, initializer

model.initialize(initializer.Xavier(), ctx=mxnet.Context(device))
preds = model(nd.random.randn(10,seq_length,1, ctx=device))

preds.shape
[05:24:04] ../src/base.cc:79: cuDNN lib mismatch: linked-against version 8005 != compiled-against version 8004.  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
(10, 247)
for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
lstm0_l0_i2h_weight       : (1024, 1)
lstm0_l0_h2h_weight       : (1024, 256)
lstm0_l0_i2h_bias         : (1024,)
lstm0_l0_h2h_bias         : (1024,)
lstm0_l1_i2h_weight       : (1024, 256)
lstm0_l1_h2h_weight       : (1024, 256)
lstm0_l1_i2h_bias         : (1024,)
lstm0_l1_h2h_bias         : (1024,)
dense0_weight             : (247, 256)
dense0_bias               : (247,)
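The printed shapes follow a simple pattern: each LSTM layer stacks the weights of its four gates along the first axis, hence 4 * 256 = 1024 rows. The sketch below reproduces the shapes and counts the parameters in plain Python; lstm_param_shapes is a hypothetical helper written for this illustration, not an MXNet function.

```python
from math import prod

def lstm_param_shapes(input_size, hidden_size, num_layers):
    """Shapes of a stacked LSTM's parameters; each layer stacks its 4 gates along the first axis."""
    shapes = {}
    for layer in range(num_layers):
        # The first layer sees the raw input; deeper layers see the previous layer's output.
        in_dim = input_size if layer == 0 else hidden_size
        shapes[f"l{layer}_i2h_weight"] = (4 * hidden_size, in_dim)
        shapes[f"l{layer}_h2h_weight"] = (4 * hidden_size, hidden_size)
        shapes[f"l{layer}_i2h_bias"] = (4 * hidden_size,)
        shapes[f"l{layer}_h2h_bias"] = (4 * hidden_size,)
    return shapes

shapes = lstm_param_shapes(input_size=1, hidden_size=256, num_layers=2)
shapes["dense_weight"] = (247, 256)   # vocabulary size x hidden size
shapes["dense_bias"] = (247,)

for name, shape in shapes.items():
    print("{:20s} : {}".format(name, shape))

total_params = sum(prod(shape) for shape in shapes.values())
print("Total parameters :", total_params)
```

Counting parameters this way gives a rough feel for how the model size changes when trying a different hidden size or more layers.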

3. Train Network

In this section, we are training our network. We have designed a function that performs the training process. The function takes the trainer object, the train data loader, and the number of epochs as input; it uses the model and loss_func variables defined at the notebook level. It then executes the training loop for the given number of epochs. During each epoch, it loops through the training data in batches. For each batch, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates the network parameters. The function records the loss of each batch and prints the average loss of all batches at the end of every fifth epoch.

from mxnet import autograd
from tqdm import tqdm

def TrainModelInBatches(trainer, train_loader, epochs):
    for i in range(1, epochs+1):
        losses = [] ## Record loss of each batch
        for X_batch, Y_batch in tqdm(train_loader):
            with autograd.record():
                preds = model(X_batch.as_in_context(device)) ## Forward pass to make predictions
                train_loss = loss_func(preds.squeeze(), Y_batch.as_in_context(device)) ## Calculate Loss
            train_loss.backward() ## Calculate Gradients

            train_loss = train_loss.mean().asscalar()
            losses.append(train_loss)

            trainer.step(len(X_batch)) ## Update weights

        if (i%5)==0:
            print("Train CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))

Below, we are actually training the network using the function designed in the previous cell. We have set the number of epochs to 50 and the learning rate to 0.001. Then, we have initialized our model, cross-entropy loss, Adam optimizer, and Trainer object (with the network parameters). Finally, we have called our training routine with the necessary parameters. The loss, printed every five epochs, decreases steadily (from 1.852 to 1.289), which suggests that our model is improving. Next, we'll generate some text using it.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=50
learning_rate = 0.001

model = LSTMTextGenerator()
model.initialize(initializer.Xavier(), ctx=mxnet.Context(device))
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, epochs)
100%|██████████| 1740/1740 [03:13<00:00,  8.99it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.93it/s]
100%|██████████| 1740/1740 [03:15<00:00,  8.89it/s]
100%|██████████| 1740/1740 [03:15<00:00,  8.90it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.86it/s]
Train CrossEntropyLoss : 1.852
100%|██████████| 1740/1740 [03:15<00:00,  8.90it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.86it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.79it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.86it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.94it/s]
Train CrossEntropyLoss : 1.637
100%|██████████| 1740/1740 [03:16<00:00,  8.84it/s]
100%|██████████| 1740/1740 [03:15<00:00,  8.90it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.87it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.83it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.86it/s]
Train CrossEntropyLoss : 1.525
100%|██████████| 1740/1740 [03:16<00:00,  8.84it/s]
100%|██████████| 1740/1740 [03:15<00:00,  8.90it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.98it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.99it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.83it/s]
Train CrossEntropyLoss : 1.456
100%|██████████| 1740/1740 [03:14<00:00,  8.95it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.82it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.99it/s]
100%|██████████| 1740/1740 [03:15<00:00,  8.91it/s]
100%|██████████| 1740/1740 [03:18<00:00,  8.76it/s]
Train CrossEntropyLoss : 1.408
100%|██████████| 1740/1740 [03:16<00:00,  8.85it/s]
100%|██████████| 1740/1740 [03:18<00:00,  8.76it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.86it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.85it/s]
100%|██████████| 1740/1740 [03:19<00:00,  8.73it/s]
Train CrossEntropyLoss : 1.373
100%|██████████| 1740/1740 [03:16<00:00,  8.83it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.84it/s]
100%|██████████| 1740/1740 [03:19<00:00,  8.73it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.99it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.81it/s]
Train CrossEntropyLoss : 1.346
100%|██████████| 1740/1740 [03:16<00:00,  8.85it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.98it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.83it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.84it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.84it/s]
Train CrossEntropyLoss : 1.323
100%|██████████| 1740/1740 [03:14<00:00,  8.97it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.80it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.84it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.00it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.80it/s]
Train CrossEntropyLoss : 1.305
100%|██████████| 1740/1740 [03:17<00:00,  8.81it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.96it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.86it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.80it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.85it/s]
Train CrossEntropyLoss : 1.289
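To put the printed loss values in context, it helps to see what softmax cross-entropy measures for a single example: the negative log of the probability assigned to the correct character. The sketch below is a plain-Python illustration (softmax_cross_entropy is a hypothetical helper, not the MXNet loss class). A network that spreads its probability evenly over all 247 vocabulary entries would score ln(247), roughly 5.51, so a loss near 1.3 is far better than guessing.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

vocab_size = 247

# Equal logits mean a uniform prediction, so the loss is ln(vocab_size).
uniform_loss = softmax_cross_entropy([0.0] * vocab_size, target_index=5)
print("Loss of a uniform guess : {:.3f}".format(uniform_loss))

# A mean loss L corresponds to a geometric-mean probability exp(-L) for the correct character.
avg_prob = math.exp(-1.289)
print("Typical probability of the correct character : {:.3f}".format(avg_prob))
```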

4. Generate Text

In this section, we are generating text using our trained model. To start with, we have randomly selected an example from our dataset. We have printed the characters of our example. Then, we have executed a loop of generating new characters. The loop generates 100 new characters. For the first iteration, our selected example is given to the model to make the prediction. The predicted character is added at the end of the sequence and the first character is removed from the sequence to keep the sequence length of 100 as required by our model. This modified sequence with a predicted character added at the end of the sequence becomes an input to the model for the next iteration. This process is repeated for each iteration where we generate a new character, add it at the end of the sequence and remove the existing first character from the sequence. After generating 100 new characters, we have also printed them.

We can notice from the generated text that our model has learned to form words and there are no spelling mistakes. The generated text looks like English, though it does not make much sense. The output also gets stuck repeating the same phrase. This is expected to some degree: the generation loop always picks the most probable character (argmax), so the same window always yields the same continuation. We'll train the network for more epochs to see whether we can further improve results.

import random

random.seed(123)
idx = random.randint(0, len(X_train) - 1) ## randint includes both endpoints
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()

print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))

generated_text = []
for i in range(100):
    X_batch = nd.array(pattern, dtype=np.float32).reshape(1, seq_length, 1) ## Design Batch
    preds = model(X_batch.as_in_context(device)) ## Make Prediction
    predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.

print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
Initial Pattern : 1987 – 88 season where he was named the ihl 's co @-@ rookie of the year and most valuable player af
Generated Text : ter the command of the command of the command of the command of the command of the command of the co
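The sliding-window update used by the generation loop can be looked at in isolation. The sketch below replaces the trained network with a hypothetical stand-in, toy_predictor, so that it runs without MXNet; only the window bookkeeping mirrors the loop in this section.

```python
def generate(predict_next, seed, num_tokens):
    """Repeatedly predict one token, append it, and drop the oldest token from the window."""
    window = list(seed)
    generated = []
    for _ in range(num_tokens):
        nxt = predict_next(window)     # a deterministic choice, like argmax in the tutorial
        generated.append(nxt)
        window = window[1:] + [nxt]    # keep the window length constant
    return generated

def toy_predictor(window):
    """Hypothetical stand-in for the network: next token is (last token + 1) modulo 5."""
    return (window[-1] + 1) % 5

out = generate(toy_predictor, seed=[0, 1, 2], num_tokens=6)
print(out)  # [3, 4, 0, 1, 2, 3]
```

Because the predictor is deterministic, the window eventually returns to a state it has already visited and the output cycles. The same mechanism is a plausible reason why the generated text above keeps repeating "the command of".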

5. Train Network More

In this section, we have trained the network for another 50 epochs, this time with a lower learning rate of 0.0003. We can notice from the printed loss values that our network is improving further, as the loss keeps decreasing (from 1.270 to 1.214). Next, we'll test it by generating new text.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=50
learning_rate = 0.0003

optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, epochs)
100%|██████████| 1740/1740 [03:13<00:00,  8.98it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.80it/s]
100%|██████████| 1740/1740 [03:18<00:00,  8.78it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.98it/s]
100%|██████████| 1740/1740 [03:18<00:00,  8.76it/s]
Train CrossEntropyLoss : 1.270
100%|██████████| 1740/1740 [03:18<00:00,  8.76it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.93it/s]
100%|██████████| 1740/1740 [03:19<00:00,  8.73it/s]
100%|██████████| 1740/1740 [03:18<00:00,  8.76it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.98it/s]
Train CrossEntropyLoss : 1.261
100%|██████████| 1740/1740 [03:19<00:00,  8.72it/s]
100%|██████████| 1740/1740 [03:19<00:00,  8.71it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.96it/s]
100%|██████████| 1740/1740 [03:19<00:00,  8.70it/s]
100%|██████████| 1740/1740 [03:20<00:00,  8.68it/s]
Train CrossEntropyLoss : 1.253
100%|██████████| 1740/1740 [03:14<00:00,  8.96it/s]
100%|██████████| 1740/1740 [03:19<00:00,  8.71it/s]
100%|██████████| 1740/1740 [03:12<00:00,  9.02it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.83it/s]
100%|██████████| 1740/1740 [03:15<00:00,  8.88it/s]
Train CrossEntropyLoss : 1.246
100%|██████████| 1740/1740 [03:12<00:00,  9.04it/s]
100%|██████████| 1740/1740 [03:18<00:00,  8.75it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.00it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.98it/s]
100%|██████████| 1740/1740 [03:20<00:00,  8.67it/s]
Train CrossEntropyLoss : 1.240
100%|██████████| 1740/1740 [03:14<00:00,  8.96it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.97it/s]
100%|██████████| 1740/1740 [03:21<00:00,  8.65it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.96it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.97it/s]
Train CrossEntropyLoss : 1.234
100%|██████████| 1740/1740 [03:20<00:00,  8.67it/s]
100%|██████████| 1740/1740 [03:12<00:00,  9.04it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.86it/s]
100%|██████████| 1740/1740 [03:17<00:00,  8.80it/s]
100%|██████████| 1740/1740 [03:12<00:00,  9.03it/s]
Train CrossEntropyLoss : 1.229
100%|██████████| 1740/1740 [03:19<00:00,  8.73it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.93it/s]
100%|██████████| 1740/1740 [03:12<00:00,  9.03it/s]
100%|██████████| 1740/1740 [03:21<00:00,  8.65it/s]
100%|██████████| 1740/1740 [03:12<00:00,  9.03it/s]
Train CrossEntropyLoss : 1.224
100%|██████████| 1740/1740 [03:12<00:00,  9.03it/s]
100%|██████████| 1740/1740 [03:21<00:00,  8.63it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.00it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.00it/s]
100%|██████████| 1740/1740 [03:21<00:00,  8.63it/s]
Train CrossEntropyLoss : 1.219
100%|██████████| 1740/1740 [03:12<00:00,  9.04it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.01it/s]
100%|██████████| 1740/1740 [03:21<00:00,  8.66it/s]
100%|██████████| 1740/1740 [03:12<00:00,  9.05it/s]
100%|██████████| 1740/1740 [03:12<00:00,  9.03it/s]
Train CrossEntropyLoss : 1.214

6. Generate Text

In this section, we have again generated text using our further-trained model. The code for text generation is the same as earlier, and we have started with the same example. We can notice from the generated text that it is a little better compared to last time: it generates a wider variety of words, and there are still no spelling mistakes. However, the output still falls into a repeating phrase. We can train it for a few more epochs to check whether it helps.

import random

random.seed(123)
idx = random.randint(0, len(X_train) - 1) ## randint includes both endpoints
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()

print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))

generated_text = []
for i in range(100):
    X_batch = nd.array(pattern, dtype=np.float32).reshape(1, seq_length, 1) ## Design Batch
    preds = model(X_batch.as_in_context(device)) ## Make Prediction
    predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.

print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
Initial Pattern : 1987 – 88 season where he was named the ihl 's co @-@ rookie of the year and most valuable player af
Generated Text : ter the second position of the south of the country of the country of the country of the country of

7. Train Even More

In this section, we have trained the network for another 50 epochs with a learning rate of 0.0001. Please make a note that we have reduced the learning rate a second time. The loss values indicate that our network has improved further. Next, we'll test it by generating new text.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=50
learning_rate = 0.0001

optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, epochs)
100%|██████████| 1740/1740 [03:13<00:00,  9.01it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.97it/s]
100%|██████████| 1740/1740 [03:23<00:00,  8.55it/s]
100%|██████████| 1740/1740 [03:12<00:00,  9.02it/s]
100%|██████████| 1740/1740 [03:12<00:00,  9.03it/s]
Train CrossEntropyLoss : 1.214
100%|██████████| 1740/1740 [03:23<00:00,  8.55it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.01it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.00it/s]
100%|██████████| 1740/1740 [03:23<00:00,  8.55it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.01it/s]
Train CrossEntropyLoss : 1.211
100%|██████████| 1740/1740 [03:13<00:00,  8.99it/s]
100%|██████████| 1740/1740 [03:23<00:00,  8.55it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.00it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.99it/s]
100%|██████████| 1740/1740 [03:26<00:00,  8.42it/s]
Train CrossEntropyLoss : 1.209
100%|██████████| 1740/1740 [03:15<00:00,  8.89it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.96it/s]
100%|██████████| 1740/1740 [03:24<00:00,  8.51it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.97it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.96it/s]
Train CrossEntropyLoss : 1.206
100%|██████████| 1740/1740 [03:24<00:00,  8.50it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.00it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.98it/s]
100%|██████████| 1740/1740 [03:25<00:00,  8.46it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.99it/s]
Train CrossEntropyLoss : 1.204
100%|██████████| 1740/1740 [03:14<00:00,  8.97it/s]
100%|██████████| 1740/1740 [03:25<00:00,  8.46it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.97it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.98it/s]
100%|██████████| 1740/1740 [03:26<00:00,  8.42it/s]
Train CrossEntropyLoss : 1.202
100%|██████████| 1740/1740 [03:14<00:00,  8.95it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.98it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.99it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.99it/s]
100%|██████████| 1740/1740 [03:18<00:00,  8.76it/s]
Train CrossEntropyLoss : 1.200
100%|██████████| 1740/1740 [03:22<00:00,  8.58it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.97it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.00it/s]
100%|██████████| 1740/1740 [03:13<00:00,  9.00it/s]
100%|██████████| 1740/1740 [03:13<00:00,  8.97it/s]
Train CrossEntropyLoss : 1.198
100%|██████████| 1740/1740 [03:27<00:00,  8.39it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.96it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.95it/s]
100%|██████████| 1740/1740 [03:15<00:00,  8.89it/s]
100%|██████████| 1740/1740 [03:14<00:00,  8.96it/s]
Train CrossEntropyLoss : 1.196
100%|██████████| 1740/1740 [03:18<00:00,  8.76it/s]
100%|██████████| 1740/1740 [03:26<00:00,  8.43it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.84it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.84it/s]
100%|██████████| 1740/1740 [03:16<00:00,  8.84it/s]
Train CrossEntropyLoss : 1.194
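The three training runs lower the learning rate by hand (0.001, then 0.0003, then 0.0001, for 50 epochs each). The same piecewise-constant schedule can be written as a function of the epoch number. This is a plain-Python sketch with a hypothetical helper name, learning_rate_for_epoch; it is not wired into the Trainer used in this tutorial.

```python
def learning_rate_for_epoch(epoch, boundaries=(50, 100), rates=(0.001, 0.0003, 0.0001)):
    """Piecewise-constant schedule: rates[i] applies until the epoch reaches boundaries[i]."""
    for boundary, rate in zip(boundaries, rates):
        if epoch < boundary:
            return rate
    return rates[-1]   # past the last boundary, keep the smallest rate

# Epochs are counted from 0 here, so 150 epochs cover the three runs of 50 epochs each.
schedule = [learning_rate_for_epoch(epoch) for epoch in range(150)]
print(schedule[0], schedule[50], schedule[100])
```

MXNet also ships an lr_scheduler module with ready-made schedules; its documentation describes how to attach one to an optimizer so that a single training call can cover all three phases.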

8. Generate Text

Here, we have again generated new text using our trained network. Our logic to generate text is the same as earlier and it starts with the same example. We can notice from the generated text that it is quite different compared to earlier. It is generating punctuation marks and a line break this time. Most words are spelled correctly, although a garbled fragment ("9@ kear") appears after the line break. Next, we have given a few recommendations on how we can improve text generation models further.

import random

random.seed(123)
idx = random.randint(0, len(X_train) - 1) ## randint includes both endpoints
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()

print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))

generated_text = []
for i in range(100):
    X_batch = nd.array(pattern, dtype=np.float32).reshape(1, seq_length, 1) ## Design Batch
    preds = model(X_batch.as_in_context(device)) ## Make Prediction
    predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.

print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
Initial Pattern : 1987 – 88 season where he was named the ihl 's co @-@ rookie of the year and most valuable player af
Generated Text : ter the country , and the construction of the south of the country .
9@ kear the country , and the

9. Further Recommendations

  1. Train the network for more epochs.
  2. Try different sequence lengths. We tried a sequence length of 100.
  3. Try different encoding approaches to encode text like character embeddings, etc.
  4. Try the n-gram/word-based model instead of the character-based model.
  5. Try different LSTM layer output sizes.
  6. Try adding more LSTM layers. This can increase training time.
  7. Try learning rate schedulers.
  8. Add a little randomness to the prediction of the next character.
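As an illustration of the last recommendation, one common way to add randomness is temperature sampling: instead of always taking the most probable character, draw the next character from the softmax distribution of the network's scores. The sketch below is plain Python and uses a hypothetical helper, sample_with_temperature; it was not part of the experiments in this tutorial.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Draw an index from softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                               # subtract the maximum for numerical stability
    weights = [math.exp(z - m) for z in scaled]   # unnormalized softmax probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)          # seeded generator for repeatable draws
logits = [2.0, 1.0, 0.1]        # made-up scores for a 3-character vocabulary

draws = [sample_with_temperature(logits, temperature=0.8, rng=rng) for _ in range(1000)]
print({index: draws.count(index) for index in range(len(logits))})

# A very low temperature behaves like argmax.
greedy = sample_with_temperature(logits, temperature=1e-3, rng=rng)
print(greedy)
```

In the generation loop of this tutorial, the argmax call could be swapped for such a sampler applied to the scores of the single example in the batch. Lower temperatures stay close to greedy decoding, while higher ones produce more varied but less reliable text.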
Sunny Solanki
