Updated On : Jul-03,2022 Tags mxnet, text-generation, …

Guide to Text Generation using MXNet RNNs (LSTM) (Character Embeddings)

Text generation is a language modeling task in which we train a language model to produce new text. It is an active area of research in NLP. Such a language model needs to learn the structure of the language (grammar, spelling, etc.) in order to generate new text. Language models like these are used for tasks such as text translation, conversational systems (chatbots), speech-to-text, and text summarization. Nowadays, deep neural networks are commonly developed for text generation tasks, and a frequently used architecture is the Recurrent Neural Network (RNN). RNNs and their variants (LSTM, GRU, etc.) are quite good at remembering sequences in data. Since text generation requires the model to understand the sequence of words/characters in order to predict the next word/character, these models are a good choice.

As a part of this tutorial, we'll explain how to create Recurrent Neural Networks (RNNs) consisting of LSTM layers using the Python deep learning library MXNet to solve a text generation task. We'll use a character-based approach to text generation, where we give the network a few characters and ask it to generate the next character after them. To encode text, we'll combine bag of words with character embeddings. We have used the Wikipedia dataset (WikiText2) available from torchtext to train our network. It has a list of well-curated Wikipedia articles. Apart from this one, we have one more tutorial on text generation using MXNet where we haven't used character embeddings to encode data and have only used a bag of words. Please feel free to check the below link if it interests you.

Below, we have listed important sections of the Tutorial to give an overview of the material covered.

Important Sections Of Tutorial

  1. Prepare Data
    • 1.1 Load Data
    • 1.2 Populate Vocabulary
    • 1.3 Organize Data for Task
    • 1.4 Create Data Loader
  2. Define Network
  3. Train Network
  4. Generate Text
  5. Train Network More
  6. Generate Text
  7. Train Even More
  8. Generate Text
  9. Further Suggestions

Below, we have imported the necessary Python libraries and printed the versions that we have used in our tutorial.

import mxnet

print("MXNet Version : {}".format(mxnet.__version__))
MXNet Version : 1.9.0
import gluonnlp

print("GluonNLP Version : {}".format(gluonnlp.__version__))
GluonNLP Version : 0.10.0
import torchtext

print("TorchText Version : {}".format(torchtext.__version__))
TorchText Version : 0.10.1
device = mxnet.gpu() if mxnet.test_utils.list_gpus() else mxnet.cpu()

device
gpu(0)

1. Prepare Data

In this section, we prepare the data that will be given to the network for training. We have decided to give the network 100 characters and ask it to predict the next character after them. As explained earlier, we'll be using character embeddings as our encoding method. We'll follow the steps below to prepare the data.

  1. Load Data
  2. Loop through each text example of data and create a vocabulary of unique characters. A vocabulary is a simple mapping from a character to an integer index. Each character is assigned a unique integer index starting from 0.
  3. Move the window of size 100 through the text of each text example taking characters inside the window as data features (X) and a character coming after the window as the target value (Y). To explain with an example,
    • Characters 1-100 will be data features (X) and character 101 will be the target value (Y).
    • Move the window by 1 position.
    • Characters 2-101 will be data features (X) and character 102 will be target value (Y).
    • Move the window by 1 position.
    • Characters 3-102 will be data features (X) and character 103 will be target value (Y).
    • and so on.
  4. Retrieve integer indexing for characters present in data features (X) and target values (Y) using our populated vocabulary. After completion of this step, data features (X) and target values (Y) will be arrays of integers.
  5. Assign a real-valued vector (character embeddings) to each integer of data features (X). This step will be performed by embedding layer of the neural network.

In short, we'll first map characters to integer indexes and then map integer indexes to character embeddings. The steps up to the integer-index mapping are performed in this section. The mapping of an integer index to its character embedding (a real-valued vector) is performed by the embedding layer of the network. An embedding is a real-valued vector assigned to a character (through its integer index), which gives the character more representation power than a bare index.
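The windowing and index-mapping steps above can be sketched in plain Python. This is only a minimal illustration on a made-up string with a window of 10 instead of 100; the helper name `make_windows` and the toy vocabulary are assumptions for this example and are not part of the tutorial's code.

```python
def make_windows(text, seq_length):
    """Slide a window of `seq_length` characters over `text`, one step at a time."""
    text = text.lower()
    features, targets = [], []
    for i in range(len(text) - seq_length):
        features.append(list(text[i:i + seq_length]))  # characters inside the window
        targets.append(text[i + seq_length])           # character right after the window
    return features, targets

text = "Hello, How are you?"
X, Y = make_windows(text, seq_length=10)

# Toy vocabulary: every distinct character gets an integer index (sorted for reproducibility).
toy_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text.lower())))}
X_idx = [[toy_vocab[ch] for ch in window] for window in X]
Y_idx = [toy_vocab[ch] for ch in Y]

print(len(X), "windows")
print("".join(X[0]), "->", repr(Y[0]))
print(X_idx[0], "->", Y_idx[0])
```

A text of n characters yields n - seq_length samples, so texts that are not longer than the window contribute nothing.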

Below, we have included an image that shows word embeddings. Character embeddings are almost the same, the only difference being that we assign a real-valued vector to an individual character rather than to a word.

[Image: MXNet: Text Generation using Character Embeddings]

1.1 Load Data

In this section, we have simply loaded the Wikipedia dataset that we are going to use for our case. The dataset has well-curated Wikipedia articles and is available from the torchtext library. It is already divided into train, validation, and test sets. We'll be using the train set for our purposes, which has ~36k text examples.

from mxnet.gluon.data import ArrayDataset

train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()

X_train = list(train_dataset)

train_dataset = ArrayDataset(X_train)
wikitext-2-v1.zip: 100%|██████████| 4.48M/4.48M [00:00<00:00, 10.1MB/s]
len(train_dataset)
36718

1.2 Populate Vocabulary

In this section, we are populating the vocabulary of unique characters. To populate the vocabulary, we have used the count_tokens() helper function available from the gluonnlp library. The function takes a list of characters and a Counter object as input and updates the Counter object with the counts of the characters present in the list. The Counter object is available from the Python collections module and lets us maintain counts of keys. We initially create an empty Counter object. We then loop through each text example of the train dataset, calling count_tokens() on the list of characters of that example (lowercased through to_lower=True). After the loop completes, the Counter object holds the count of each character present in the train dataset. We can then create the vocabulary by calling the Vocab() constructor with the Counter object.

After the creation of the vocabulary, we have printed its content and size as well. We can see how each character is assigned a unique integer index, with indexes 0-3 taken by the special tokens <unk>, <pad>, <bos>, and <eos>.

from collections import Counter

counter = Counter()

for dataset in [train_dataset, ]:
    for X in dataset:
        gluonnlp.data.count_tokens(list(X), to_lower=True, counter=counter)

vocab = gluonnlp.Vocab(counter=counter, min_freq=1)

print("Vocabulary Size : {}".format(len(vocab)))
print(vocab.token_to_idx)
Vocabulary Size : 247
{'<unk>': 0, '<pad>': 1, '<bos>': 2, '<eos>': 3, ' ': 4, 'e': 5, 't': 6, 'a': 7, 'n': 8, 'i': 9, 'o': 10, 'r': 11, 's': 12, 'h': 13, 'd': 14, 'l': 15, 'u': 16, 'c': 17, 'm': 18, 'f': 19, 'p': 20, 'g': 21, 'w': 22, 'b': 23, 'y': 24, 'k': 25, ',': 26, '.': 27, 'v': 28, '<': 29, '>': 30, '@': 31, '\n': 32, '1': 33, '0': 34, '=': 35, '"': 36, '2': 37, "'": 38, '9': 39, '-': 40, 'j': 41, 'x': 42, ')': 43, '(': 44, '3': 45, '5': 46, '8': 47, '4': 48, '6': 49, '7': 50, 'z': 51, 'q': 52, ';': 53, '–': 54, ':': 55, '—': 56, '/': 57, '%': 58, '$': 59, '[': 60, ']': 61, 'é': 62, '&': 63, '!': 64, '’': 65, 'í': 66, 'á': 67, 'ā': 68, '°': 69, '£': 70, '?': 71, 'ó': 72, '+': 73, '#': 74, '−': 75, 'š': 76, 'ö': 77, 'ō': 78, 'è': 79, '×': 80, 'ü': 81, 'ä': 82, '“': 83, 'ʻ': 84, 'ś': 85, '”': 86, 'ć': 87, 'ł': 88, 'ø': 89, 'ç': 90, '₹': 91, 'ã': 92, 'µ': 93, 'ì': 94, 'ư': 95, '→': 96, '\ufeff': 97, 'ñ': 98, '…': 99, 'æ': 100, 'ơ': 101, 'å': 102, '⁄': 103, '☉': 104, '*': 105, '‘': 106, '~': 107, 'ú': 108, 'î': 109, '²': 110, 'ë': 111, 'ệ': 112, 'ī': 113, 'α': 114, 'à': 115, '^': 116, 'ễ': 117, '¥': 118, 'ô': 119, 'ă': 120, 'ū': 121, '♯': 122, 'ê': 123, '‑': 124, 'ỳ': 125, 'đ': 126, 'μ': 127, '्': 128, '≤': 129, 'ل': 130, 'ṃ': 131, '†': 132, '~': 133, '€': 134, '±': 135, 'ė': 136, 'ž': 137, 'β': 138, '〈': 139, '〉': 140, '・': 141, '½': 142, 'û': 143, 'č': 144, 'γ': 145, 'с': 146, 'ṭ': 147, 'ị': 148, '„': 149, '♭': 150, 'â': 151, '̃': 152, 'ا': 153, 'ه': 154, '჻': 155, 'ṅ': 156, 'ầ': 157, 'ớ': 158, '′': 159, '⅓': 160, '大': 161, '空': 162, '¡': 163, '³': 164, '·': 165, 'ş': 166, 'ح': 167, 'ص': 168, 'ن': 169, 'ვ': 170, 'ი': 171, 'კ': 172, 'ო': 173, 'ხ': 174, 'ჯ': 175, 'ḥ': 176, 'ṯ': 177, 'ả': 178, 'ấ': 179, '″': 180, '火': 181, '礮': 182, '\\': 183, '`': 184, '|': 185, '§': 186, 'ò': 187, 'þ': 188, 'ń': 189, 'ų': 190, 'ż': 191, 'ʿ': 192, 'κ': 193, 'а': 194, 'в': 195, 'е': 196, 'к': 197, 'о': 198, 'т': 199, 'я': 200, 'ก': 201, 'ง': 202, 'ณ': 203, 'ต': 204, 'ม': 205, 'ย': 206, 'ร': 207, 'ล': 
208, 'ั': 209, 'า': 210, 'ิ': 211, '่': 212, '์': 213, 'გ': 214, 'დ': 215, 'ზ': 216, 'რ': 217, 'ს': 218, 'უ': 219, 'ც': 220, 'ძ': 221, 'წ': 222, 'ṣ': 223, 'ắ': 224, 'ử': 225, '₤': 226, '⅔': 227, 'の': 228, 'ァ': 229, 'ア': 230, 'キ': 231, 'ス': 232, 'ッ': 233, 'ト': 234, 'プ': 235, 'ュ': 236, 'リ': 237, 'ル': 238, 'ヴ': 239, '動': 240, '場': 241, '戦': 242, '攻': 243, '機': 244, '殻': 245, '隊': 246}
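To make the counting and indexing idea concrete, here is a plain-Python approximation of what the cell above does. It is a sketch, not gluonnlp's implementation: the function name `build_char_vocab`, the sample texts, and the alphabetical tie-breaking are all assumptions made for this illustration.

```python
from collections import Counter

RESERVED = ["<unk>", "<pad>", "<bos>", "<eos>"]  # special tokens seen at indexes 0-3 above

def build_char_vocab(texts, min_freq=1):
    """Count lowercased characters and assign indexes, most frequent first."""
    counter = Counter()
    for text in texts:
        counter.update(text.lower())  # updating with a string counts its characters
    # Assumption: ties are broken alphabetically; gluonnlp's own tie-breaking may differ.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    tokens = RESERVED + [ch for ch, freq in ordered if freq >= min_freq]
    return {tok: idx for idx, tok in enumerate(tokens)}, counter

sample_texts = ["The cat sat.", "A cat ate."]  # made-up stand-in for the dataset
token_to_idx, counts = build_char_vocab(sample_texts)

print(counts.most_common(3))
print(token_to_idx)
```

Raising `min_freq` in this sketch drops rare characters from the mapping, which is one way to keep a character vocabulary small.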

1.3 Organize Data for Task

In this section, we are organizing our data for the neural network. We have set the sequence length to 100 characters as discussed earlier: we'll give 100 characters to the network and make it predict the next character.

To prepare the data for training, we loop through the text examples of the train dataset, moving a window of size 100 through each of them. The characters that fall inside the window are added to the data features (X_train) and the character right after the window is added to the target values (Y_train). Both are converted to integer indexes using our populated vocabulary as they are added. After the loop completes, we have two arrays (X_train, Y_train) of integer indexes, which we convert to MXNet nd arrays as required by MXNet networks.

Please note that we have not used all the text examples present in the train dataset, only the first 6000, as using all of them would take a lot of training time.

Below we have explained with a simple example how the process works.

## Illustrative example only (lowercased text, window of 10 characters)
vocab = {
'h':1,
'e':2,
'l':3,
'o':4,
' ':5,
',':6,
'w':7,
'a':8,
'r':9,
'y':10,
'u':11,
'?':12,
'c':13,
'm':14,
't':15,
'd':16,
'z':17,
'n':18
}

text_example = "Hello, How are you? Welcome to coderzcolumn?"
seq_length = 10

X_train = [
            ['h','e','l','l','o',',',' ','h','o','w'],
            ['e','l','l','o',',',' ','h','o','w',' '],
            ['l','l','o',',',' ','h','o','w',' ','a'],
            ['l','o',',',' ','h','o','w',' ','a','r'],
            ...
            ['d','e','r','z','c','o','l','u','m','n']
            ]
Y_train = [' ','a','r','e',..., '?']

X_train_vectorized = [
                        [1,2,3,3,4,6,5,1,4,7],
                        [2,3,3,4,6,5,1,4,7,5],
                        [3,3,4,6,5,1,4,7,5,8],
                        [3,4,6,5,1,4,7,5,8,9],
                        ...
                        [16,2,9,17,13,4,3,11,14,18]
                     ]
Y_train_vectorized = [5,8,9,2,..., 12]
%%time

import gluonnlp.data.batchify as bf
from mxnet import nd
import numpy as np

train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2() ## Loading dataset again
train_dataset = ArrayDataset(list(train_dataset))

seq_length = 100 ## Network Hyperparameter to tune
X_train, Y_train = [], []

for text in list(train_dataset)[:6000]:
    for i in range(0, len(text)-seq_length):
        inp_seq = list(text[i:i+seq_length].lower())
        out_seq = text[i+seq_length].lower()
        X_train.append(vocab(inp_seq))
        Y_train.append(vocab[out_seq])

X_train, Y_train = nd.array(X_train, dtype=np.int32), nd.array(Y_train)

X_train.shape, Y_train.shape
CPU times: user 57.9 s, sys: 1.03 s, total: 58.9 s
Wall time: 59.1 s
((1377719, 100), (1377719,))

1.4 Create Data Loader

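In this section, we have simply wrapped the nd arrays (X_train, Y_train) inside ArrayDataset and then created a data loader from the dataset object. This data loader lets us loop through the training data easily in batches. We have set the batch size to 1024. We have kept the shuffle argument as False, so batches are served in the order in which the windows were created. Since each sample already carries its own 100-character window and target, shuffling would not disturb the character order inside a sample, so it is a setting worth experimenting with.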

from mxnet.gluon.data import DataLoader, ArrayDataset

vectorized_train_dataset = ArrayDataset(X_train, Y_train)

train_loader = DataLoader(vectorized_train_dataset, batch_size=1024, shuffle=False)
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
(1024, 100) (1024,)
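As a quick sanity check on the loader, the number of batches per epoch follows directly from the sample count and the batch size. The short calculation below uses the shapes printed earlier and assumes the final, smaller batch is kept rather than dropped.

```python
import math

n_samples = 1377719   # rows in X_train, from the shape printed earlier
batch_size = 1024

n_batches = math.ceil(n_samples / batch_size)               # the last batch may be smaller
last_batch_size = n_samples - (n_batches - 1) * batch_size  # samples left for the final batch

print(n_batches, last_batch_size)
```

The batch count obtained this way agrees with the 1346 iterations per epoch reported by the progress bars during training.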

2. Define Network

In this section, we have defined a network that we'll use for our task. Our task will be a classification task as we'll be predicting one of the vocabulary character indexes. The network consists of 4 layers.

  1. Embedding Layer
  2. LSTM Layer
  3. LSTM Layer
  4. Dense Layer

The first layer of our network is the embedding layer, which we have created using the Embedding() constructor available from the "nn" sub-module of MXNet. We have provided the vocabulary length and the embedding length to the constructor. The embedding length in our case is 50, which means that each character index will be assigned a real-valued vector of length 50. The constructor creates a real-valued weight matrix of shape (vocab_len, embedding_len). For each character index, the embedding is retrieved by simply using the index to look up a row of this weight matrix. The input shape of the embedding layer is (batch_size, seq_len) = (batch_size, 100) and the output shape is (batch_size, seq_len, embed_len) = (batch_size, 100, 50).
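The lookup performed by an embedding layer can be imitated with nested Python lists. The sketch below is not how MXNet implements it; the random table stands in for weights that training would learn, and the names `embedding_table` and `embed` are assumptions for this illustration.

```python
import random

random.seed(0)
vocab_len, embed_len = 247, 50

# Embedding table: one row of `embed_len` floats per character index.
# (Random values stand in for weights that training would learn.)
embedding_table = [[random.uniform(-0.1, 0.1) for _ in range(embed_len)]
                   for _ in range(vocab_len)]

def embed(batch_of_indexes):
    """Replace every integer index with its row from the embedding table."""
    return [[embedding_table[idx] for idx in sequence] for sequence in batch_of_indexes]

# A batch of 4 sequences, each holding 100 random character indexes: shape (4, 100).
batch = [[random.randrange(vocab_len) for _ in range(100)] for _ in range(4)]
embedded = embed(batch)

# batch_size, seq_len, embed_len
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

Two positions that hold the same character index receive the same vector, which is why the table needs only one row per vocabulary entry.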

The output of the embedding layer is given to the first LSTM layer for processing. We have created the LSTM layers using the LSTM() constructor. The hidden size is set to 256 and the num_layers parameter is set to 2, instructing the constructor to stack two LSTM layers. The first LSTM layer transforms data from shape (batch_size, 100, 50) to (batch_size, 100, 256). Its output is given to the second LSTM layer, which has the same output shape.

The output of the second LSTM layer at the last time step, of shape (batch_size, 256), is given to a dense layer that has as many output units as there are characters in the vocabulary, because we want it to predict the next character. The output of the dense layer is the prediction of our network.

After defining the network, we initialized it, made predictions with random data for verification, and printed the shape of weight/biases of each layer.

Please make a NOTE that we have not discussed LSTM layers in detail in this tutorial as we have assumed that the reader has background on it. If you want to know about LSTM then we recommend that you go through the below tutorial which covers it in detail. It can help you with this tutorial.

If you are new to MXNet and want to learn how to create neural networks using it then we suggest that you go through the below tutorial.

from mxnet.gluon import nn, rnn

embed_len = 50
hidden_dim = 256
n_layers = 2

class LSTMTextGenerator(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextGenerator, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=embed_len)
        self.dense = nn.Dense(len(vocab))

    def forward(self, x):
        x = self.embedding(x)
        x = self.lstm(x)
        return self.dense(x[:, -1])

model = LSTMTextGenerator()

model
LSTMTextGenerator(
  (embedding): Embedding(247 -> 50, float32)
  (lstm): LSTM(50 -> 256, NTC, num_layers=2)
  (dense): Dense(None -> 247, linear)
)
from mxnet import init, initializer

model.initialize(initializer.Xavier(), ctx=mxnet.Context(device))
preds = model(nd.random.randint(0, len(vocab), shape=(10, seq_length)).as_in_context(device)) ## Random but valid character indexes

preds.shape
(10, 247)
for key,val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
embedding0_weight         : (247, 50)
lstm0_l0_i2h_weight       : (1024, 50)
lstm0_l0_h2h_weight       : (1024, 256)
lstm0_l0_i2h_bias         : (1024,)
lstm0_l0_h2h_bias         : (1024,)
lstm0_l1_i2h_weight       : (1024, 256)
lstm0_l1_h2h_weight       : (1024, 256)
lstm0_l1_i2h_bias         : (1024,)
lstm0_l1_h2h_bias         : (1024,)
dense0_weight             : (247, 256)
dense0_bias               : (247,)
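The 1024 that appears in the LSTM shapes above is 4 x 256: an LSTM layer keeps one block of weights for each of its input, forget, and output gates plus one for the candidate cell update. The sketch below recomputes the printed shapes from the three hyperparameters and totals the parameters; the helper names and the shortened parameter names are assumptions for this illustration.

```python
def lstm_layer_shapes(input_size, hidden_size):
    """Shapes of one LSTM layer's parameters; the factor 4 covers the gate blocks."""
    gates = 4 * hidden_size
    return {
        "i2h_weight": (gates, input_size),
        "h2h_weight": (gates, hidden_size),
        "i2h_bias": (gates,),
        "h2h_bias": (gates,),
    }

def count(shape):
    """Number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

vocab_len, embed_len, hidden_dim = 247, 50, 256

shapes = {"embedding_weight": (vocab_len, embed_len)}
shapes.update({"l0_" + k: v for k, v in lstm_layer_shapes(embed_len, hidden_dim).items()})
shapes.update({"l1_" + k: v for k, v in lstm_layer_shapes(hidden_dim, hidden_dim).items()})
shapes.update({"dense_weight": (vocab_len, hidden_dim), "dense_bias": (vocab_len,)})

for name, shape in shapes.items():
    print("{:20s} : {}".format(name, shape))

total_params = sum(count(s) for s in shapes.values())
print("Total parameters :", total_params)
```

Most of the parameters sit in the two LSTM layers, so changing hidden_dim affects model size far more than changing embed_len.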

3. Train Network

In this section, we are training our network. To train the network, we have designed a simple function that performs the training process. The function takes a Trainer object (which holds the network parameters), the train data loader, and the number of epochs as input. It executes the training loop for the given number of epochs. During each epoch, it loops through the training data in batches using the train data loader. For each batch, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates the network parameters. The loss of each batch is recorded, and the average loss over all batches of an epoch is printed after every fifth epoch.

from mxnet import autograd
from tqdm import tqdm
from sklearn.metrics import accuracy_score

def TrainModelInBatches(trainer, train_loader, epochs):
    for i in range(1, epochs+1):
        losses = [] ## Record loss of each batch
        for X_batch, Y_batch in tqdm(train_loader):
            with autograd.record():
                preds = model(X_batch.as_in_context(device)) ## Forward pass to make predictions
                train_loss = loss_func(preds.squeeze(), Y_batch.as_in_context(device)) ## Calculate Loss
            train_loss.backward() ## Calculate Gradients

            train_loss = train_loss.mean().asscalar()
            losses.append(train_loss)

            trainer.step(len(X_batch)) ## Update weights

        if (i%5)==0:
            print("Train CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))

Below, we are training our network using the training routine defined in the previous cell. We have set the number of epochs to 50 and the learning rate to 0.001. Then, we have initialized our model, the cross-entropy loss function, the Adam optimizer, and the Trainer object. The Trainer object holds the network parameters and is responsible for updating them once gradients have been computed for each batch. Finally, we have called our training routine with the necessary parameters to perform the training process.

We can notice from the loss values printed every five epochs that the network seems to be doing a good job at the task, as the loss decreases steadily from 1.395 to 0.978.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=50
learning_rate = 0.001

model = LSTMTextGenerator()
model.initialize(initializer.Xavier(), ctx=mxnet.Context(device))
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, epochs)
100%|██████████| 1346/1346 [02:44<00:00,  8.19it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.18it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.14it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.15it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.14it/s]
Train CrossEntropyLoss : 1.395
100%|██████████| 1346/1346 [02:45<00:00,  8.13it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.11it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.11it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.14it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.07it/s]
Train CrossEntropyLoss : 1.235
100%|██████████| 1346/1346 [02:46<00:00,  8.09it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.08it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.15it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.12it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.12it/s]
Train CrossEntropyLoss : 1.159
100%|██████████| 1346/1346 [02:45<00:00,  8.13it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.06it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.06it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.09it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.11it/s]
Train CrossEntropyLoss : 1.109
100%|██████████| 1346/1346 [02:47<00:00,  8.03it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.14it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.15it/s]
100%|██████████| 1346/1346 [02:47<00:00,  8.06it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.12it/s]
Train CrossEntropyLoss : 1.073
100%|██████████| 1346/1346 [02:46<00:00,  8.09it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.09it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.07it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.07it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.11it/s]
Train CrossEntropyLoss : 1.047
100%|██████████| 1346/1346 [02:45<00:00,  8.11it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.12it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.08it/s]
100%|██████████| 1346/1346 [02:48<00:00,  7.98it/s]
100%|██████████| 1346/1346 [02:47<00:00,  8.03it/s]
Train CrossEntropyLoss : 1.025
100%|██████████| 1346/1346 [02:45<00:00,  8.11it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.12it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.10it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.12it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.09it/s]
Train CrossEntropyLoss : 1.007
100%|██████████| 1346/1346 [02:47<00:00,  8.05it/s]
100%|██████████| 1346/1346 [02:47<00:00,  8.01it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.06it/s]
100%|██████████| 1346/1346 [02:49<00:00,  7.96it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
Train CrossEntropyLoss : 0.991
100%|██████████| 1346/1346 [02:46<00:00,  8.07it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.12it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.14it/s]
100%|██████████| 1346/1346 [02:49<00:00,  7.96it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.20it/s]
Train CrossEntropyLoss : 0.978
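To put the printed loss values in perspective, it helps to compare them with the loss of a model that guesses uniformly over all 247 characters. The sketch below assumes the loss is an average softmax cross-entropy measured with the natural logarithm; the helper name `avg_prob_of_correct_char` is made up for this illustration.

```python
import math

vocab_len = 247

# Cross-entropy of a model that assigns equal probability to every character.
uniform_loss = -math.log(1.0 / vocab_len)

def avg_prob_of_correct_char(cross_entropy):
    """Geometric-mean probability assigned to the true next character."""
    return math.exp(-cross_entropy)

for loss_value in (uniform_loss, 1.395, 0.978):
    print("loss {:.3f} -> probability {:.3f}".format(
        loss_value, avg_prob_of_correct_char(loss_value)))
```

A lower average loss therefore means the network puts more probability on the character that actually comes next, and the uniform baseline shows how far the trained values are from blind guessing.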

4. Generate Text

In this section, we are generating text using our trained network to see how it is performing. We have randomly selected a sequence of 100 characters from the train data we prepared earlier and then generated the next 100 characters with the trained network in a loop. During the first iteration, we give the selected sequence to the network to predict the next character. The predicted character index is appended to the end of the sequence and the first index is removed, which keeps the sequence length at 100. This modified sequence becomes the input for the second iteration, whose prediction is again appended while the first index is dropped, and so on for all iterations. In the end, the 100 generated characters are printed.

We can see that the generated text looks like English, though it does not make much sense. The model spells words properly and is also able to generate punctuation marks. Overall, the result is good. We'll train the network further to see whether the results improve.

import random

random.seed(123)
idx = random.randint(0, len(X_train)-1) ## randint includes both end points
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()

print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))

generated_text = []
for i in range(100):
    X_batch = nd.array(pattern, dtype=np.int32).reshape(1, seq_length) ## Design Batch
    preds = model(X_batch.as_in_context(device)) ## Make Prediction
    predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.

print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
Initial Pattern : 1987 – 88 season where he was named the ihl 's co @-@ rookie of the year and most valuable player af
Generated Text : ter the series . the song was intensified as a master of the series , and the storm for the series ,
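Since the same generation loop is reused after every training round, it can be handy to see it as a small reusable function. The sketch below is framework-free: `generate` and `toy_predictor` are hypothetical names, and the toy predictor merely stands in for the trained network so the sliding of the pattern is easy to follow.

```python
def generate(predict_next, pattern, n_chars):
    """Greedy generation: predict one index, append it, drop the oldest index, repeat."""
    pattern = list(pattern)          # work on a copy so the caller's seed is untouched
    generated = []
    for _ in range(n_chars):
        next_index = predict_next(pattern)
        generated.append(next_index)
        pattern = pattern[1:] + [next_index]   # window length stays constant
    return generated

# Hypothetical stand-in for the trained network: predicts (last index + 1) modulo 5.
def toy_predictor(pattern):
    return (pattern[-1] + 1) % 5

seed = [0, 1, 2]
out = generate(toy_predictor, seed, n_chars=6)
print(out)
```

With the tutorial's network, `predict_next` could wrap the lines that build X_batch, call the model, and take the argmax. Because the most likely character is always chosen, the same seed always produces the same text.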

5. Train Network More

In this section, we are training the network for another 50 epochs with a learning rate of 0.0003. We can notice from the printed loss values that they keep decreasing (from 0.952 to 0.828), which is a sign of improvement. Next, we'll check the performance of the network by generating new text.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=50
learning_rate = 0.0003

optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, epochs)
100%|██████████| 1346/1346 [02:46<00:00,  8.10it/s]
100%|██████████| 1346/1346 [02:48<00:00,  8.01it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.19it/s]
100%|██████████| 1346/1346 [02:47<00:00,  8.03it/s]
100%|██████████| 1346/1346 [02:47<00:00,  8.05it/s]
Train CrossEntropyLoss : 0.952
100%|██████████| 1346/1346 [02:44<00:00,  8.18it/s]
100%|██████████| 1346/1346 [02:47<00:00,  8.02it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.18it/s]
100%|██████████| 1346/1346 [02:48<00:00,  7.99it/s]
100%|██████████| 1346/1346 [02:47<00:00,  8.04it/s]
Train CrossEntropyLoss : 0.925
100%|██████████| 1346/1346 [02:44<00:00,  8.17it/s]
100%|██████████| 1346/1346 [02:48<00:00,  8.00it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:48<00:00,  7.99it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.16it/s]
Train CrossEntropyLoss : 0.906
100%|██████████| 1346/1346 [02:46<00:00,  8.07it/s]
100%|██████████| 1346/1346 [02:48<00:00,  8.00it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.19it/s]
100%|██████████| 1346/1346 [02:48<00:00,  7.99it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
Train CrossEntropyLoss : 0.890
100%|██████████| 1346/1346 [02:48<00:00,  8.00it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.08it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.11it/s]
100%|██████████| 1346/1346 [02:48<00:00,  7.98it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
Train CrossEntropyLoss : 0.878
100%|██████████| 1346/1346 [02:49<00:00,  7.96it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:48<00:00,  7.97it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.06it/s]
100%|██████████| 1346/1346 [02:49<00:00,  7.94it/s]
Train CrossEntropyLoss : 0.865
100%|██████████| 1346/1346 [02:50<00:00,  7.89it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.21it/s]
100%|██████████| 1346/1346 [02:49<00:00,  7.93it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.19it/s]
100%|██████████| 1346/1346 [02:50<00:00,  7.90it/s]
Train CrossEntropyLoss : 0.854
100%|██████████| 1346/1346 [02:44<00:00,  8.19it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.19it/s]
100%|██████████| 1346/1346 [02:49<00:00,  7.93it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.18it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.18it/s]
Train CrossEntropyLoss : 0.846
100%|██████████| 1346/1346 [02:49<00:00,  7.95it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.15it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.21it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.19it/s]
100%|██████████| 1346/1346 [02:50<00:00,  7.89it/s]
Train CrossEntropyLoss : 0.836
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.21it/s]
100%|██████████| 1346/1346 [02:50<00:00,  7.92it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.20it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
Train CrossEntropyLoss : 0.828

6. Generate Text

In this section, we have again generated text using our network, which has now been trained for 50 more epochs. The code to generate text is exactly the same as earlier, and we have started with the same text sequence. The generated text seems satisfactory: it does not make much sense, but it looks like English. The model is generating mostly proper words this time as well. We'll train the network further to see whether it improves.

import random

random.seed(123)
idx = random.randint(0, len(X_train)-1) ## randint includes both end points
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()

print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))

generated_text = []
for i in range(100):
    X_batch = nd.array(pattern, dtype=np.int32).reshape(1, seq_length) ## Design Batch
    preds = model(X_batch.as_in_context(device)) ## Make Prediction
    predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.

print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
Initial Pattern : 1987 – 88 season where he was named the ihl 's co @-@ rookie of the year and most valuable player af
Generated Text : ter the first time in the south african @-@ than @-@ three .
everthel producers , in which the song

7. Train Even More

In this section, we have trained the network for another 50 epochs with a learning rate of 0.0001. We can notice from the loss values, which fall from 0.864 to 0.792 during this round, that the network seems to be improving. We'll check the performance next by generating text.

from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer

epochs=50
learning_rate = 0.0001

optimizer = optimizer.Adam(learning_rate=learning_rate)

trainer = gluon.Trainer(model.collect_params(), optimizer)

TrainModelInBatches(trainer, train_loader, epochs)
100%|██████████| 1346/1346 [02:43<00:00,  8.23it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.21it/s]
100%|██████████| 1346/1346 [02:45<00:00,  8.15it/s]
100%|██████████| 1346/1346 [02:49<00:00,  7.95it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.23it/s]
Train CrossEntropyLoss : 0.864
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:51<00:00,  7.87it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.23it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.21it/s]
100%|██████████| 1346/1346 [02:50<00:00,  7.90it/s]
Train CrossEntropyLoss : 0.847
100%|██████████| 1346/1346 [02:44<00:00,  8.19it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.19it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.20it/s]
100%|██████████| 1346/1346 [02:50<00:00,  7.89it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
Train CrossEntropyLoss : 0.837
100%|██████████| 1346/1346 [02:43<00:00,  8.23it/s]
100%|██████████| 1346/1346 [02:50<00:00,  7.87it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:47<00:00,  8.03it/s]
Train CrossEntropyLoss : 0.828
100%|██████████| 1346/1346 [02:46<00:00,  8.07it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.24it/s]
100%|██████████| 1346/1346 [02:51<00:00,  7.83it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.23it/s]
Train CrossEntropyLoss : 0.821
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:50<00:00,  7.88it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.18it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
Train CrossEntropyLoss : 0.814
100%|██████████| 1346/1346 [02:52<00:00,  7.80it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.23it/s]
100%|██████████| 1346/1346 [02:52<00:00,  7.82it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.24it/s]
Train CrossEntropyLoss : 0.809
100%|██████████| 1346/1346 [02:46<00:00,  8.11it/s]
100%|██████████| 1346/1346 [02:47<00:00,  8.06it/s]
100%|██████████| 1346/1346 [02:51<00:00,  7.86it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.23it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.20it/s]
Train CrossEntropyLoss : 0.802
100%|██████████| 1346/1346 [02:52<00:00,  7.78it/s]
100%|██████████| 1346/1346 [02:44<00:00,  8.21it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
100%|██████████| 1346/1346 [02:46<00:00,  8.08it/s]
100%|██████████| 1346/1346 [02:49<00:00,  7.93it/s]
Train CrossEntropyLoss : 0.797
100%|██████████| 1346/1346 [02:43<00:00,  8.23it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.23it/s]
100%|██████████| 1346/1346 [02:53<00:00,  7.77it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.23it/s]
100%|██████████| 1346/1346 [02:43<00:00,  8.22it/s]
Train CrossEntropyLoss : 0.792
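
Setting a new learning rate by hand for each training stage, as we did above, can also be expressed as a step-decay schedule. Below is a minimal pure-Python sketch of that idea. The function name `step_decay`, the starting rate of 0.001, and the choice to drop every 50 epochs are illustrative assumptions, not values taken from this tutorial.

```python
def step_decay(base_lr, epoch, drop_every=50, factor=0.1):
    """Return the learning rate for a 0-indexed epoch under step decay.

    All defaults here are hypothetical; tune them for your own run.
    """
    n_drops = epoch // drop_every          # how many times the rate has been cut so far
    return base_lr * (factor ** n_drops)   # multiply by `factor` once per cut


# Print the rate at a few epochs around the first drop (assumed base rate 0.001)
for epoch in (0, 49, 50, 99):
    print(epoch, step_decay(0.001, epoch))
```

With Gluon, a value computed this way could be applied between epochs through `trainer.set_learning_rate(...)`, provided the training loop is written so that it exposes the epoch boundary.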

8. Generate Text

In this section, we generate text using our network, with the same code we used for all previous text generations. The generated text now looks like English and the words are correctly spelled, although the passage as a whole still does not make much sense. Next, we list a few suggestions on how network performance can be improved further.

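import random
import numpy as np
from mxnet import nd

random.seed(123)
idx = random.randint(0, len(X_train) - 1) ## randint includes both ends, so subtract 1 to stay in range
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()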

print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))

generated_text = []
for i in range(100):
    X_batch = nd.array(pattern, dtype=np.int32).reshape(1, seq_length) ## Design Batch
    preds = model(X_batch.as_in_context(device)) ## Make Prediction
    predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.

print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
Initial Pattern : 1987 – 88 season where he was named the ihl 's co @-@ rookie of the year and most valuable player af
Generated Text : ter the second conference officers and the song was a christmas " and " striker of the storm began t
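
The generation loop above follows a sliding-window pattern: predict one token, append it to the window, and drop the oldest token so the window length stays fixed. The sketch below isolates that pattern in plain Python. The names `generate`, `predict_next`, and `toy_predict` are hypothetical, and `toy_predict` is only a stand-in for the trained network.

```python
def generate(predict_next, seed, n_steps):
    """Greedy generation with a fixed-length sliding window.

    predict_next: hypothetical callable that maps a list of token indices
                  to the index of the next token (stands in for the model).
    """
    window = list(seed)                   # copy so the caller's seed is not modified
    generated = []
    for _ in range(n_steps):
        next_idx = predict_next(window)   # one prediction per step
        generated.append(next_idx)
        window = window[1:] + [next_idx]  # drop oldest token, append newest
    return generated


def toy_predict(window):
    """Toy stand-in for the network: next token is (last token + 1) mod 5."""
    return (window[-1] + 1) % 5


result = generate(toy_predict, seed=[0, 1, 2], n_steps=4)
print(result)  # [3, 4, 0, 1]
```

Wrapping the loop in a function like this makes it easy to swap the greedy `argmax` step for a different selection rule without touching the window handling.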

9. Further Suggestions

  1. Train the network for more epochs.
  2. Try different sequence lengths. We used a sequence length of 100 characters.
  3. Try predicting more than one next character per step. Our network predicts just one next character at a time.
  4. Try different LSTM output sizes.
  5. Add more LSTM layers. This can increase training time drastically, so weigh the cost before trying it.
  6. Try an n-gram/word-based model instead of the character-based one.
  7. Try learning rate schedulers.
  8. Add a little randomness to the prediction of the next character.
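
One common way to add randomness to the next-character prediction is temperature sampling, which draws the next index from a softmax distribution instead of always taking the `argmax`. The sketch below is a minimal pure-Python illustration and not the method used in this tutorial. It assumes the network output is a list of raw scores (logits) over the vocabulary; the function name `sample_with_temperature` and the example scores are hypothetical.

```python
import math
import random


def sample_with_temperature(scores, temperature=1.0, rng=random):
    """Sample an index from unnormalized scores after temperature scaling.

    Low temperatures approach greedy argmax; high temperatures approach
    uniform sampling. Assumes `scores` are logits, not probabilities.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    max_s = max(scaled)                              # subtract max for numerical stability
    weights = [math.exp(s - max_s) for s in scaled]  # unnormalized softmax weights
    # random.choices normalizes the weights internally
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)            # seeded generator for repeatable draws
scores = [2.0, 1.0, 0.1]          # made-up scores for a 3-token vocabulary
picks = [sample_with_temperature(scores, 0.5, rng) for _ in range(1000)]
print(picks.count(0) / len(picks))  # fraction of draws that chose index 0
```

In the generation loop, a function like this would replace the `argmax` call, with the model output first converted to a plain list of scores. If the network emits probabilities rather than logits, take their logarithm before scaling.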
Sunny Solanki
