Text generation is one of the most important and complicated tasks of Natural Language Processing (NLP). It requires us to understand the underlying structure of language in order to form meaningful sentences. Nowadays, language models are built using deep neural networks, which are good at text generation tasks. Some researchers also refer to the text generation task as a language modeling task, as it requires us to create a language model that understands language and then uses that knowledge to generate meaningful content. Language models have applications like machine translation, conversational systems (chatbots), text summarization, speech-to-text, etc. Deep learning models that involve RNN layers (vanilla RNN, LSTM, GRU, etc.) are generally preferred for language modeling tasks. The reason is that these layers are good at capturing the sequential structure found in data, unlike dense layers, which is also why they are commonly used for tasks involving time-series data.
As a part of our tutorial, we'll create a language model by building a Recurrent Neural Network consisting of LSTM layers in PyTorch for a text generation task. A text generation model generally takes a list of tokens (characters/n-grams/words) as input and predicts the next token (character/n-gram/word) in the sequence as output. We have used a character-based approach for our case, which means that our network takes a list of characters as input and returns the character that it thinks should come next. We can also design models that take a list of words as input and predict the next word. For encoding text data, we have used the character embeddings approach, which assigns a real-valued vector to each token (character). We have used the Wikipedia dataset available from the Python library torchtext for training our network. We have another tutorial on text generation using PyTorch which does not use character embeddings and is based on only a bag of words. Please feel free to check it from the below link.
Please make a NOTE that language models are generally big and require training for many epochs, hence we recommend using a GPU for training them. It'll be hard to train the language model on a CPU.
Below, we have listed important sections of the tutorial to give an overview of the material covered.
Below, we have imported the necessary Python libraries that we have used in our tutorial and printed their versions.
import torch
print("PyTorch Version : {}".format(torch.__version__))
import torchtext
print("TorchText Version : {}".format(torchtext.__version__))
device = "cuda" if torch.cuda.is_available() else "cpu"
device
import gc
In this section, we are preparing our dataset for giving it to the neural network for training. As we said earlier, we'll use a character-based approach, which means that we'll give a pre-decided length of characters to the network and ask it to predict the next character after that sequence. We have decided to use a sequence length of 100 characters, which the network takes as input to predict the character that comes after them. We have used the embeddings approach for encoding text data, where we assign a real-valued vector of a specified length to each unique character.
In order to prepare data for the network, we have followed the below steps.

1. Load the dataset.
2. Populate a vocabulary of all unique characters present in the dataset.
3. Move a window of 100 characters through each text example, taking the 100 characters as data features (X) and the next character as the target value (Y).
4. Map each character to its integer index using the vocabulary.
5. Assign a real-valued vector (embedding) to each character index.
Steps 1-4 mentioned above will be completed in this section. Step 5 will be implemented in the neural network using an embedding layer which will assign unique embeddings to each character index.
Below, we have included an image of word embeddings. Character embeddings are exactly the same as word embeddings, with the only difference being that we assign a real-valued vector (embedding) to each character instead of each word.
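To make the idea concrete, here is a minimal standalone sketch of character embeddings. The four-character vocabulary and the embedding length of 4 are hypothetical, chosen just for illustration.

import torch
from torch import nn

## Hypothetical mini-vocabulary mapping characters to indexes.
char_to_idx = {'h': 0, 'e': 1, 'l': 2, 'o': 3}

## Embedding layer holding one trainable 4-dimensional vector per character.
embedding = nn.Embedding(num_embeddings=len(char_to_idx), embedding_dim=4)

## Encode "hello" as indexes and retrieve the embedding vector of each character.
indexes = torch.tensor([char_to_idx[ch] for ch in "hello"])
vectors = embedding(indexes)

print(vectors.shape) ## torch.Size([5, 4]) - one vector per character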
In this section, we have loaded our Wikipedia dataset that we are going to use for our task. The dataset has well-curated Wikipedia articles. The main dataset is already divided into the train, validation, and test sets. We'll be using only the train set for our purpose. The training dataset has ~36k text examples.
train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()
In this section, we have populated a vocabulary of all unique characters present in our dataset. In order to populate the vocabulary, we have used the build_vocab_from_iterator() function. This function takes as input a Python iterator that returns a list of tokens on each call. We have created a function named build_vocabulary() which will act as our iterator. It takes datasets as input and loops through each example of the dataset, yielding the list of characters of each example. We have done special handling of the <unk> token so that it does not get broken into individual characters.

After populating the vocabulary, we printed its length. We have also printed the vocabulary to show the unique characters present in it. Later on, we'll use this vocabulary to map each character to its index (e.g., <unk> will be mapped to index 0, ' ' will be mapped to index 1, character 'e' will be mapped to index 2, character 't' will be mapped to index 3, and so on).
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
def build_vocabulary(datasets):
    for dataset in datasets:
        for text in dataset:
            if "<unk>" in text:
                texts = text.split("<unk>")
                total = list(texts[0].lower())
                for t in texts[1:]:
                    total.extend(["<unk>", ] + list(t.lower()))
                yield total
            else:
                yield list(text.lower())
vocab = build_vocab_from_iterator(build_vocabulary([train_dataset, ]), min_freq=1, specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
len(vocab)
print(vocab.get_itos())
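As a quick sanity check of the mapping described above, we can call the vocabulary directly to convert characters to indexes and back. The exact index values printed depend on character frequencies in the dataset, so treat them as illustrative.

sample = list("hello")
indexes = vocab(sample) ## Map characters to their vocabulary indexes
print(indexes)
print(vocab.lookup_tokens(indexes)) ## Map indexes back to characters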
In this section, we have organized our dataset into the proper shape for the neural network. We are looping through each text example, moving a window of 100 characters through it as we explained at the beginning of the section. We have used a limited number of text examples to complete training faster, otherwise it could take a lot of time to train the network. As we take 100 characters as data features (X_train) and the next character as the target value (Y_train), we also retrieve their indexes from the vocabulary.

After we have looped through all text examples moving a window of 100 characters through them, we converted the final data to torch tensors. PyTorch networks work on torch tensors, hence we need to transform the data from lists to tensors.
Below, we have tried to explain the process with a simple example.
vocab = {
    'h':1,
    'e':2,
    'l':3,
    'o':4,
    ' ':5,
    ',':6,
    'w':7,
    'a':8,
    'r':9,
    'y':10,
    'u':11,
    '?':12,
    'c':13,
    'm':14,
    't':15,
    'd':16,
    'z':17,
    'n':18
}
text_example = "Hello, How are you? Welcome to coderzcolumn?"
seq_length = 10
X_train = [
    ['h','e','l','l','o',',',' ','h','o','w'],
    ['e','l','l','o',',',' ','h','o','w',' '],
    ['l','l','o',',',' ','h','o','w',' ','a'],
    ['l','o',',',' ','h','o','w',' ','a','r'],
    ...
    ['d','e','r','z','c','o','l','u','m','n']
]

Y_train = [' ','a','r','e',' ',..., '?']

X_train_vectorized = [
    [1,2,3,3,4,6,5,1,4,7],
    [2,3,3,4,6,5,1,4,7,5],
    [3,3,4,6,5,1,4,7,5,8],
    [3,4,6,5,1,4,7,5,8,9],
    ...
    [16,2,9,17,13,4,3,11,14,18]
]

Y_train_vectorized = [5,8,9,2,5,...., 12]
%%time
train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()
seq_length = 100 ## Network Hyperparameter to tune
X_train, Y_train = [], []
for text in list(train_dataset)[:7500]:
    for i in range(0, len(text)-seq_length):
        inp_seq = list(text[i:i+seq_length].lower())
        out_seq = text[i+seq_length].lower()
        X_train.append(vocab(inp_seq)) ## Retrieve indexes for characters from vocab
        Y_train.append(vocab[out_seq]) ## Retrieve index for character from vocab
X_train, Y_train = torch.tensor(X_train, dtype=torch.int32), torch.tensor(Y_train)
X_train.shape, Y_train.shape
In this section, we have simply wrapped our data features (X_train) and target values (Y_train) in a tensor dataset and created a data loader from it. The data loader will help us loop through the data in batches during the training process. We have set a batch size of 1024, so each call returns 1024 examples and their target values.
from torch.utils.data import DataLoader, TensorDataset
vectorized_train_dataset = TensorDataset(X_train, Y_train)
train_loader = DataLoader(vectorized_train_dataset, batch_size=1024, shuffle=False)
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
gc.collect()
In this section, we have defined the network that we'll use for our text generation task. The task is treated as a classification task because we are predicting one of the characters from the vocabulary. The network consists of four layers.
The first layer of our network is the embedding layer. We have created it using the Embedding() constructor, providing the vocabulary length as the number of embeddings and an embedding length of 100. This will create a matrix of shape (vocab_len, 100) which will be set as the weight matrix of the layer. The layer takes a list of indexes as input and retrieves embeddings for them by indexing the weight matrix. The input shape to the layer is (batch_size, seq_length) = (batch_size, 100) and the output shape will be (batch_size, seq_length, embed_len) = (batch_size, 100, 100).
The output of the embedding layer will be given to the first LSTM layer, which has a hidden dimension size of 256 and processes the data sequence. The output shape of the first LSTM layer is (batch_size, seq_length, hidden_size) = (batch_size, 100, 256).

The output of the first LSTM layer will be given to the second LSTM layer, which also has a hidden dimension size of 256 and processes the data sequence the same way. The output shape of the second LSTM layer is (batch_size, seq_length, hidden_size) = (batch_size, 100, 256).

The output of the second LSTM layer at the last time step will be given to the linear layer, which has vocab_len output units. It transforms the data shape to (batch_size, vocab_len). The output of the linear layer is the prediction of our network.
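As a quick illustration of the shape flow described above, here is a small standalone sketch using dummy data, assuming the layer sizes mentioned; the batch size of 32 is arbitrary.

from torch import nn

## Illustrative shape trace with dummy data (batch size of 32 is arbitrary).
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=100)
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2, batch_first=True)
linear = nn.Linear(256, len(vocab))

x = torch.randint(0, len(vocab), (32, 100)) ## (batch_size, seq_length)
e = embed(x)                                ## (32, 100, 100) = (batch_size, seq_length, embed_len)
o, (h, c) = lstm(e)                         ## (32, 100, 256) = (batch_size, seq_length, hidden_dim)
out = linear(o[:, -1])                      ## (32, vocab_len) - prediction from last time step
print(out.shape)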
After defining the network, we have initialized it and printed the shapes of the weights/biases of its layers. We have also performed a forward pass through the network using a few data examples for verification purposes.
Please make a NOTE that we have not explained how LSTM internally processes a sequence of data or PyTorch network design in detail here. If you are someone new to PyTorch and LSTM then we recommend that you go through the below links in your free time to understand them better.
from torch import nn
from torch.nn import functional as F
embed_len = 100
hidden_dim = 256
n_layers = 2
class LSTMTextGenerator(nn.Module):
    def __init__(self):
        super(LSTMTextGenerator, self).__init__()
        self.word_embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(vocab))

    def forward(self, X_batch):
        embeddings = self.word_embedding(X_batch)
        hidden, carry = torch.randn(n_layers, len(X_batch), hidden_dim).to(device), torch.randn(n_layers, len(X_batch), hidden_dim).to(device)
        output, (hidden, carry) = self.lstm(embeddings, (hidden, carry))
        return self.linear(output[:,-1])
text_generator = LSTMTextGenerator().to(device)
text_generator
for layer in text_generator.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print()
out = text_generator(torch.randint(0, len(vocab), (1024, seq_length)).to(device))
out.shape
In this section, we are training our network to generate text. We have designed a simple function which we'll use for training. The function takes the network, loss function, optimizer, train data loader, and the number of epochs as input. It then executes the training loop for the given number of epochs. For each epoch, it loops through the training data in batches using the train data loader. For each batch, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates the network parameters using the gradients. It records the loss of each batch and prints the average loss across batches at the end of every fifth epoch.
from tqdm import tqdm
from sklearn.metrics import accuracy_score
import gc
def TrainModel(model, loss_fn, optimizer, train_loader, epochs=10):
    for i in range(1, epochs+1):
        losses = []
        for X, Y in tqdm(train_loader):
            Y_preds = model(X.to(device))

            loss = loss_fn(Y_preds, Y.to(device))
            losses.append(loss.item())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if (i%5) == 0:
            print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
Below, we are actually training our network using the function defined in the previous cell. We have initialized the number of epochs to 25 and the learning rate to 0.001. Then, we have initialized the cross entropy loss (classification task), the LSTM network, and the Adam optimizer. At last, we have called our training routine with the necessary parameters to perform training. We can notice from the loss values printed after every 5 epochs that our network seems to be improving over time.
%%time
from torch.optim import Adam
epochs = 25
learning_rate = 1e-3
loss_fn = nn.CrossEntropyLoss().to(device)
text_generator = LSTMTextGenerator().to(device)
optimizer = Adam(text_generator.parameters(), lr=learning_rate)
TrainModel(text_generator, loss_fn, optimizer, train_loader, epochs)
In this section, we are trying to generate text using our trained network to see how it performs. We first randomly selected a sequence of characters from our train data and printed it. Then we loop 100 times, generating a new character on each iteration. For the first iteration, the originally selected sequence is input to the network and it predicts a new character. We append this character to the end of our sequence and drop the first character before the next iteration. This process is repeated for 100 iterations, each time appending the newly predicted character and removing the first one so the sequence stays at 100 characters. After generating 100 new characters, we have printed them as well.
We can notice from the results that they look like English language sentences. The network is spelling words correctly; there are no spelling errors, which is good. The sentences do not make much sense, but the network has learned to generate correctly spelled words. The network seems to have become a little deterministic, as some words are repeated. This can be avoided by adding a little randomness to the output of the network so that it generates different words. Overall, the results look promising for a network trained for only 25 epochs.
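One common way to add such randomness is to sample the next character from the softmax distribution over the network's outputs instead of always taking the argmax, optionally scaled by a temperature. Below is a minimal sketch of this idea (not used in this tutorial; the temperature value of 0.8 is an arbitrary choice).

import torch
from torch.nn import functional as F

def sample_next_index(preds, temperature=0.8):
    ## Lower temperature -> more deterministic output, higher -> more random.
    probs = F.softmax(preds / temperature, dim=-1)
    ## Draw one index from the resulting probability distribution.
    return torch.multinomial(probs, num_samples=1).item()

## Usage: in the generation loop below, replace the argmax line with
## predicted_index = sample_next_index(preds[0])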
import random
random.seed(123)
idx = random.randint(0, len(X_train))
pattern = X_train[idx].numpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.lookup_tokens(pattern))))
generated_text = []
for i in range(100):
    X_batch = torch.tensor(pattern, dtype=torch.int32).reshape(1, seq_length) ## Design Batch
    preds = text_generator(X_batch.to(device)) ## Make Prediction
    predicted_index = preds.argmax(dim=-1).cpu().numpy()[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Drop first token to bring pattern back to seq_length
print("Generated Text : {}".format("".join(vocab.lookup_tokens(generated_text))))
In this section, we have trained our network for another 50 epochs to check whether it helps improve the results further. We have also reduced the learning rate from 0.001 to 0.0003. We can notice from the loss values printed after every 5 epochs that the network is improving further. Next, we'll check the performance.
epochs = 50
learning_rate = 3e-4
optimizer = Adam(text_generator.parameters(), lr=learning_rate)
TrainModel(text_generator, loss_fn, optimizer, train_loader, epochs)
In this section, we have again generated 100 characters using our trained model, starting from the same example that we used earlier. We can notice that this time the network generates a greater variety of words. It even generates punctuation marks and also added a newline character ('\n'). The model still seems a little deterministic due to repeated words, but the results are a little better compared to earlier. We'll train the model even further to check whether it helps or not.
import random
random.seed(123)
idx = random.randint(0, len(X_train))
pattern = X_train[idx].numpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.lookup_tokens(pattern))))
generated_text = []
for i in range(100):
    X_batch = torch.tensor(pattern, dtype=torch.int32).reshape(1, seq_length) ## Design Batch
    preds = text_generator(X_batch.to(device)) ## Make Prediction
    predicted_index = preds.argmax(dim=-1).cpu().numpy()[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Drop first token to bring pattern back to seq_length
print("Generated Text : {}".format("".join(vocab.lookup_tokens(generated_text))))
In this section, we have trained our network for another 50 epochs, reducing the learning rate from 0.0003 to 0.0001. We can notice from the loss values printed after every 5 epochs that the network still seems to be improving. Next, we'll test the model.
epochs = 50
learning_rate = 1e-4
optimizer = Adam(text_generator.parameters(), lr=learning_rate)
TrainModel(text_generator, loss_fn, optimizer, train_loader, epochs)
In this section, we have generated 100 new characters using our trained model, again using the same example as a starting point. We can notice from the generated text that the model seems to be generating decent text. A few words are repeated, but the overall text looks like English language text. There are no spelling errors, and the model generates punctuation marks as well. Next, we have suggested a few tips to further improve model performance.
import random
random.seed(123)
idx = random.randint(0, len(X_train))
pattern = X_train[idx].numpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.lookup_tokens(pattern))))
generated_text = []
for i in range(100):
    X_batch = torch.tensor(pattern, dtype=torch.int32).reshape(1, seq_length) ## Design Batch
    preds = text_generator(X_batch.to(device)) ## Make Prediction
    predicted_index = preds.argmax(dim=-1).cpu().numpy()[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Drop first token to bring pattern back to seq_length
print("Generated Text : {}".format("".join(vocab.lookup_tokens(generated_text))))
This ends our small tutorial explaining how to create LSTM networks using PyTorch that use the character embeddings text encoding approach for text generation tasks. Please feel free to contact us if you have questions.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.