Text generation is a type of language modeling task where we create a language model to generate new text. It is an active area of research in NLP. A language model for this task needs to understand the structure of language (grammar, spelling, etc.) in order to generate new text. Language models like these are used for tasks like text translation, conversational systems (chatbots), speech-to-text, text summarization, etc. Nowadays, deep neural networks are commonly developed for text generation tasks. The most commonly used architecture for text generation is the Recurrent Neural Network (RNN). RNNs and their variants (LSTM, GRU, etc.) are quite good at remembering sequences in data. Since text generation also requires understanding the sequence of words/characters in order to predict new ones, these models are a good choice.
As a part of this tutorial, we'll explain how to create a Recurrent Neural Network (RNN) consisting of LSTM layers using the Python deep learning library MXNet to solve a text generation task. We'll use a character-based approach, where we give the network a few characters and ask it to generate the next character after them. To encode text, we'll map characters to integer indexes using a vocabulary and then map those indexes to character embeddings. We have used the Wikipedia dataset available from torchtext to train our network. It contains well-curated Wikipedia articles. Apart from this one, we have one more tutorial on text generation using MXNet where we haven't used character embeddings to encode data; we have only used a bag of words. Please feel free to check the below link if it interests you.
Below, we have listed the important sections of the tutorial to give an overview of the material covered.
Below, we have imported the necessary Python libraries and printed the versions that we have used in our tutorial.
import mxnet
print("MXNet Version : {}".format(mxnet.__version__))
import gluonnlp
print("GluonNLP Version : {}".format(gluonnlp.__version__))
import torchtext
print("TorchText Version : {}".format(torchtext.__version__))
device = mxnet.gpu() if mxnet.test_utils.list_gpus() else mxnet.cpu()
device
In this section, we are preparing the data that will be given to the network for training. We have decided that we'll give the network 100 characters and ask it to predict the next character after them. As explained earlier, we'll be using character embeddings as our encoding method. We'll be following the below steps to prepare the data.
In short, we'll first map characters to integer indexes and then map the integer indexes to character embeddings. The steps up to mapping characters to integer indexes will be performed in this section. The mapping of an integer index to its character embedding (a real-valued vector) will be performed by the embedding layer of the network. An embedding is a real-valued vector assigned to a character (through its integer index) which gives it more representation power.
Below, we have included an image that shows word embeddings. Character embeddings work almost the same way, with the only difference being that we assign a real-valued vector to an individual character.
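To make the idea concrete, below is a minimal sketch of how an embedding layer turns character indexes into real-valued vectors. The vocabulary size of 5 and embedding length of 4 used here are made-up values for illustration only; the actual network defined later uses our real vocabulary size and an embedding length of 50.

from mxnet import nd
from mxnet.gluon import nn

char_embedding = nn.Embedding(input_dim=5, output_dim=4) ## Illustrative sizes: 5 possible characters, 4-dim embeddings
char_embedding.initialize()

char_indexes = nd.array([2, 0, 3]) ## Integer indexes of three characters
print(char_embedding(char_indexes).shape) ## (3, 4) -> one 4-dim vector per character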
In this section, we have simply loaded the Wikipedia dataset that we are going to use for our case. The dataset has well-curated Wikipedia articles and is available from the torchtext library. It is already divided into train, validation, and test sets. We'll be using the train set, which has ~36k text documents.
from mxnet.gluon.data import ArrayDataset
train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()
X_train = list(train_dataset)
train_dataset = ArrayDataset(X_train)
len(train_dataset)
In this section, we are populating a vocabulary of unique characters. To populate the vocabulary, we have used the count_tokens() helper function available from the gluonnlp library. The function takes a list of characters and a Counter object as input and updates the Counter with the counts of the characters present in the list. The Counter object comes from Python's collections module and lets us maintain counts per key. We first create an empty Counter object, then loop through each text example of the train dataset, calling count_tokens() on the example's list of characters. Each call to count_tokens() updates the Counter with the character counts of that example. After the loop completes, the Counter object holds the count of every character present in the train dataset. We can then create the vocabulary by calling the Vocab() constructor with the Counter object.
After creating the vocabulary, we have printed its contents and its size. We can see how each character is assigned a unique integer index.
from collections import Counter
counter = Counter()
for dataset in [train_dataset, ]:
    for X in dataset:
        gluonnlp.data.count_tokens(list(X), to_lower=True, counter=counter)
vocab = gluonnlp.Vocab(counter=counter, min_freq=1)
print("Vocabulary Size : {}".format(len(vocab)))
print(vocab.token_to_idx)
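Below is a small illustrative example of how the populated vocabulary maps characters to integer indexes and back; we'll use both directions later when vectorizing data and decoding predictions. The exact indexes printed depend on the dataset contents.

sample_chars = list("hello")
sample_indexes = vocab(sample_chars) ## Characters -> integer indexes
print(sample_indexes)
print(vocab.to_tokens(sample_indexes)) ## Integer indexes -> characters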
In this section, we are organizing our data for the neural network. We have set the sequence length to 100 characters as discussed earlier. We'll give the network 100 characters and make it predict the next character after them.
To prepare the data for training, we loop through the text examples of the train dataset, moving a window of size 100 through each of them. We add the characters that fall inside the window to the data features (X_train) and the character immediately after the window to the target values (Y_train). We then retrieve the integer indexes of the characters present in the data features (X_train) and target values (Y_train) using our populated vocabulary. After the loop completes, we have two arrays (X_train, Y_train) of integer indexes. We have converted these arrays to MXNet nd arrays as required by MXNet networks.
Please make a note that we have not used all of the text documents present in the train dataset, as that would take a lot of training time.
Below, we have explained how the process works with a simple example.
vocab = {
'h':1,
'e':2,
'l':3,
'o':4,
' ':5,
',':6,
'w':7,
'a':8,
'r':9,
'y':10,
'u':11,
'?':12,
'c':13,
'm':14,
't':15,
'd':16,
'z':17,
'n':18
}
text_example = "Hello, How are you? Welcome to coderzcolumn?"
seq_length = 10
X_train = [
['h','e','l','l','o',',',' ', 'h','o','w'],
['e','l','l','o',',',' ', 'h','o','w',' '],
['l','l','o',',',' ', 'h','o','w', ' ', 'a'],
['l','o',',',' ', 'h','o','w',' ', 'a', 'r'],
...
['d','e','r','z','c','o','l', 'u','m','n']
]
Y_train = [' ', 'a', 'r', 'e', ..., '?']
X_train_vectorized = [
[1,2,3,3,4,6,5,1,4,7],
[2,3,3,4,6,5,1,4,7,5],
[3,3,4,6,5,1,4,7,5,8],
...
[16,2,9,17,13,4,3,11,14,18]
]
Y_train_vectorized = [5, 8, 9, 2, ...., 12]
%%time
import gluonnlp.data.batchify as bf
from mxnet import nd
import numpy as np
train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2() ## Loading dataset again
train_dataset = ArrayDataset(list(train_dataset))
seq_length = 100 ## Network Hyperparameter to tune
X_train, Y_train = [], []
for text in list(train_dataset)[:6000]:
    for i in range(0, len(text)-seq_length):
        inp_seq = list(text[i:i+seq_length].lower())
        out_seq = text[i+seq_length].lower()

        X_train.append(vocab(inp_seq)) ## Characters -> integer indexes
        Y_train.append(vocab[out_seq]) ## Target character -> integer index
X_train, Y_train = nd.array(X_train, dtype=np.int32), nd.array(Y_train)
X_train.shape, Y_train.shape
In this section, we have simply wrapped the nd arrays (X_train, Y_train) inside an ArrayDataset and then created a data loader from the dataset object. This data loader lets us easily loop through the training data in batches. We have set the batch size to 1024 and the shuffle argument to False because we don't want to disturb the character sequence.
from mxnet.gluon.data import DataLoader, ArrayDataset
vectorized_train_dataset = ArrayDataset(X_train, Y_train)
train_loader = DataLoader(vectorized_train_dataset, batch_size=1024, shuffle=False)
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
In this section, we have defined the network that we'll use for our task. Our task is a classification task, as we'll be predicting one of the vocabulary's character indexes. The network consists of 4 layers.
The first layer of our network is the embedding layer, which we have created using the Embedding() constructor available from the "nn" sub-module of MXNet. We have provided the vocabulary length and the embedding length to the constructor. The embedding length in our case is 50, which means that each character index will be assigned a real-valued vector of length 50. The constructor creates a real-valued weight matrix of shape (vocab_len, embedding_len). For each character index, its embedding (real-valued vector) is retrieved by simply integer-indexing this weight matrix. The input shape of the embedding layer is (batch_size, seq_len) = (batch_size, 100) and the output shape is (batch_size, seq_len, embed_len) = (batch_size, 100, 50).
The output of the embedding layer is given to the first LSTM layer for processing. We have created the LSTM layers using the LSTM() constructor. The hidden size of the LSTM layers is set to 256, and the num_layers parameter is set to 2, instructing the constructor to stack two LSTM layers. The first LSTM layer transforms data from shape (batch_size, 100, 50) to (batch_size, 100, 256). Its output is given to the second LSTM layer, which has the same output shape as the first.
The output of the last time step of the second LSTM layer is given to a dense layer, which has as many output units as there are characters in the vocabulary because we want it to predict the next character. The output of the dense layer is the prediction of our network.
After defining the network, we initialized it, made predictions with random data for verification, and printed the shape of weight/biases of each layer.
Please make a NOTE that we have not discussed LSTM layers in detail in this tutorial, as we have assumed that the reader has background knowledge of them. If you want to learn about LSTM, then we recommend that you go through the below tutorial, which covers it in detail. It can help you with this tutorial.
If you are new to MXNet and want to learn how to create neural networks using it then we suggest that you go through the below tutorial.
from mxnet.gluon import nn, rnn
embed_len = 50
hidden_dim = 256
n_layers = 2
class LSTMTextGenerator(nn.Block):
    def __init__(self, **kwargs):
        super(LSTMTextGenerator, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_len)
        self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=embed_len)
        self.dense = nn.Dense(len(vocab))

    def forward(self, x):
        x = self.embedding(x) ## (batch_size, seq_len) -> (batch_size, seq_len, embed_len)
        x = self.lstm(x)      ## (batch_size, seq_len, embed_len) -> (batch_size, seq_len, hidden_dim)
        return self.dense(x[:, -1]) ## Last time step's output -> (batch_size, vocab_len)
model = LSTMTextGenerator()
model
from mxnet import init, initializer
model.initialize(initializer.Xavier(), ctx=mxnet.Context(device))
preds = model(nd.random.randn(10,seq_length).as_in_context(device))
preds.shape
for key, val in model.collect_params().items():
    print("{:25s} : {}".format(key, val.shape))
In this section, we are training our network. To train the network, we have designed a simple function that performs the training process. The function takes the Trainer object (which holds the network parameters), the train data loader, and a number of epochs as input. It executes a training loop for the given number of epochs. During each epoch, it loops through the training data in batches using the train data loader. For each batch, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates the network parameters. The loss of each batch is recorded, and the average loss across batches is printed every few epochs.
from mxnet import autograd
from tqdm import tqdm
from sklearn.metrics import accuracy_score
def TrainModelInBatches(trainer, train_loader, epochs):
    for i in range(1, epochs+1):
        losses = [] ## Record loss of each batch
        for X_batch, Y_batch in tqdm(train_loader):
            with autograd.record():
                preds = model(X_batch.as_in_context(device)) ## Forward pass to make predictions
                train_loss = loss_func(preds.squeeze(), Y_batch.as_in_context(device)) ## Calculate Loss

            train_loss.backward() ## Calculate Gradients

            train_loss = train_loss.mean().asscalar()
            losses.append(train_loss)

            trainer.step(len(X_batch)) ## Update weights

        if (i%5)==0:
            print("Train CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))
Below, we are actually training our network using the training routine defined in the previous cell. We have set the number of epochs to 50 and the learning rate to 0.001. Then, we have initialized our model, the cross-entropy loss function, the Adam optimizer, and the Trainer object. The Trainer object holds the network parameters and is responsible for updating them after gradients are calculated. At last, we have called our training routine with the necessary parameters to perform the training process.
We can notice from the loss values being printed that our network seems to be doing a good job at the task, as the loss is decreasing steadily.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=50
learning_rate = 0.001
model = LSTMTextGenerator()
model.initialize(initializer.Xavier(), ctx=mxnet.Context(device))
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, epochs)
In this section, we are generating text using our trained network to see how it performs. We have randomly selected a 100-character sequence from the train data we prepared earlier and then generated 100 new characters after it using a loop. During the first iteration of the loop, we give the selected sequence to the network to predict the next character. The predicted character index is appended to the end of the sequence and the first index is removed, keeping the sequence length constant at 100. This modified sequence becomes the network's input for the second iteration. The output of the second iteration is again appended to the end of the sequence while the first character is removed, and the result becomes the input for the third iteration. This process is repeated for all iterations. In the end, the 100 generated characters are printed.
We can see that the generated text looks like English language text, though it does not make much sense. The model is spelling words properly and is also able to generate punctuation marks. Overall, the result is good. We'll train the network further to see whether it improves the results.
import random
random.seed(123)
idx = random.randint(0, len(X_train))
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))
generated_text = []
for i in range(100):
    X_batch = nd.array(pattern, dtype=np.int32).reshape(1, seq_length) ## Design Batch
    preds = model(X_batch.as_in_context(device)) ## Make Prediction
    predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result

    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring it back to seq_length length.
print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
In this section, we are training the network for another 50 epochs with a learning rate of 0.0003. We can notice from the printed loss values that they are decreasing, which is a sign of improvement. Next, we'll check the performance of the network by generating new text.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=50
learning_rate = 0.0003
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, epochs)
In this section, we have again generated text using our network, which has now been trained for 50 more epochs. The code to generate text is exactly the same as earlier, and we have started with the same text sequence we used before. The text generated by the network seems satisfactory; though it does not make much sense, it looks like English language text. The model is generating proper words this time as well. We'll train the network further to see whether it keeps improving.
import random
random.seed(123)
idx = random.randint(0, len(X_train))
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))
generated_text = []
for i in range(100):
    X_batch = nd.array(pattern, dtype=np.int32).reshape(1, seq_length) ## Design Batch
    preds = model(X_batch.as_in_context(device)) ## Make Prediction
    predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result

    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring it back to seq_length length.
print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
In this section, we have trained the network for another 50 epochs with a learning rate of 0.0001. We can notice from the loss values that the network still seems to be improving. We'll check its performance next by generating text.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=50
learning_rate = 0.0001
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, epochs)
In this section, we have generated text using our network one more time, with the same generation code we have used for all previous text generations. We can notice that the network is generating text properly. The generated text still does not make much sense, but it looks like English language text and the words are spelled correctly. Next, we have given a few suggestions on how the network's performance can be improved further.
import random
random.seed(123)
idx = random.randint(0, len(X_train))
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))
generated_text = []
for i in range(100):
    X_batch = nd.array(pattern, dtype=np.int32).reshape(1, seq_length) ## Design Batch
    preds = model(X_batch.as_in_context(device)) ## Make Prediction
    predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result

    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring it back to seq_length length.
print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))