Updated On: Apr-01, 2022 · Time Investment: ~45 mins

How to Use GloVe Word Embeddings With PyTorch Networks?

Word embeddings are one of the most commonly used approaches nowadays when training deep neural networks on text data. Word embeddings let us represent a single token/word with a vector of real values; each word/token gets its own vector of floats. This generally improves model accuracy because a vector of numbers captures the meaning and context of a word/token better than a single number (word frequency, Tf-Idf, etc.). We can train word embeddings ourselves if we have a big dataset with a lot of words. We have already covered in detail how we can train a neural network using randomly initialized word embeddings.

If we have a small dataset, then rather than initializing and training our own word embeddings, we can use pre-trained word embeddings such as GloVe, FastText, word2vec, etc. These embeddings were trained for other tasks, but they have captured the meaning of words/tokens, so we can reuse them for our task. They cover millions of words/tokens, so the majority of the words in our dataset are likely to be present in them.


As a part of this tutorial, we'll explain how we can use GloVe (Global Vectors) embeddings with a PyTorch network for a text classification task. There are various versions of GloVe embeddings, created with an unsupervised learning algorithm trained on large corpora such as Wikipedia, Common Crawl, and Twitter text. We have used the AG NEWS dataset for our task and will retrieve embeddings for the words of the dataset from GloVe.

Below, we have listed the important sections of the tutorial to give an overview of the material covered.

Important Sections Of Tutorial

  1. Approach 1: GloVe '840B' (Embeddings Length=300, Tokens per Text Example=25)
  2. Approach 2: GloVe '840B' (Embeddings Length=300, Tokens per Text Example=50)
  3. Approach 3: GloVe '42B' (Embeddings Length=300, Tokens per Text Example=50)
  4. Approach 4: GloVe '840B' Averaged (Embeddings Length=300, Tokens per Text Example=50)
  5. Approach 5: GloVe '840B' Summed (Embeddings Length=300, Tokens per Text Example=50)

Below, we have loaded the necessary libraries and printed the versions that we have used in our tutorial.

import torch

print("PyTorch Version : {}".format(torch.__version__))
PyTorch Version : 1.9.1+cpu
import torchtext

print("Torch Text Version : {}".format(torchtext.__version__))
Torch Text Version : 0.10.1

Approach 1: GloVe '840B' (Embeddings Length=300, Tokens per Text Example=25)

As a part of our first approach, we'll use GloVe 840B embeddings. It has embeddings for 2.2 million unique tokens, and each embedding vector has a length of 300. There are different types of GloVe embeddings available from Stanford. Please check the link below for a list of available embedding types.
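As a rough reference, the GloVe variants that torchtext's GloVe class can typically download are sketched below (names and dimensions listed to the best of our knowledge; please verify against the official Stanford GloVe page before relying on them).

## Commonly available pre-trained GloVe variants (name -> supported dims),
## as typically exposed by torchtext's GloVe class:
##   '6B'          : 50, 100, 200, 300   (Wikipedia + Gigaword)
##   '42B'         : 300                 (Common Crawl, ~1.9M tokens)
##   '840B'        : 300                 (Common Crawl, ~2.2M tokens)
##   'twitter.27B' : 25, 50, 100, 200    (Twitter)
## Example (a much smaller download than '840B'):
## glove_6b_100 = GloVe(name='6B', dim=100)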

For our approach in this section, we have decided to keep a maximum of 25 tokens/words per text example and we'll look for embeddings of these tokens in GloVe embeddings.

Create Tokenizer

Below, we have loaded a simple tokenizer available from the torchtext.data module, which we'll use for our text classification task. The tokenizer is a function that takes a text document as input and generates a list of tokens.

from torchtext.data import get_tokenizer

tokenizer = get_tokenizer("basic_english") ## We'll use tokenizer available from PyTorch

tokenizer("Hello, How are you?")
['hello', ',', 'how', 'are', 'you', '?']

Load GloVe '840B' Embeddings

The torchtext module provides us with a class named GloVe which can be used to load GloVe embeddings. It is available from the vocab module of torchtext. We need to provide the embedding name and dimensions to it. There are embeddings of different dimensions (50, 100, 200, 300, etc.) available.

Once we have loaded GloVe embeddings by creating an instance of GloVe, we can call its get_vecs_by_tokens() method with a list of tokens. It'll return embeddings for all tokens given to it. We have explained its usage with simple examples below.

from torchtext.vocab import GloVe

global_vectors = GloVe(name='840B', dim=300)
.vector_cache/glove.840B.300d.zip: 2.18GB [06:53, 5.26MB/s]
100%|█████████▉| 2196016/2196017 [03:32<00:00, 10346.35it/s]
embeddings = global_vectors.get_vecs_by_tokens(tokenizer("Hello, How are you?"), lower_case_backup=True)

embeddings.shape
torch.Size([6, 300])
global_vectors.get_vecs_by_tokens([""], lower_case_backup=True)
tensor([[0., 0., 0.,  ..., 0., 0., 0.]])

(The empty-string token is not present in GloVe, so get_vecs_by_tokens() returns a 1 x 300 tensor of zeros for it; this is the vector our padding tokens will map to.)
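To get a feel for what these vectors capture, below is a small optional check (not part of the main flow of the tutorial) comparing cosine similarities between a few word vectors; related words should score noticeably higher than unrelated ones, though the exact values depend on the embeddings.

from torch.nn import functional as F

## Optional sanity check: semantically related words should have a higher
## cosine similarity than unrelated ones.
vecs = global_vectors.get_vecs_by_tokens(["king", "queen", "banana"], lower_case_backup=True)

print(F.cosine_similarity(vecs[0:1], vecs[1:2]).item())  ## king vs queen  (relatively high)
print(F.cosine_similarity(vecs[0:1], vecs[2:3]).item())  ## king vs banana (relatively low)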

Load Datasets And Create Data Loaders

In this section, we have loaded our AG NEWS dataset and created data loaders from it. The dataset has text documents for 4 different categories of news (["World", "Sports", "Business", "Sci/Tech"]). We can load the dataset by calling the AG_NEWS() function from the datasets module of torchtext. It returns the train and test datasets separately. After loading the datasets, we have created data loaders for them that will be used during training. We have set the batch size to 1024 for the data loaders.

When creating data loaders, we have given a function to the collate_fn parameter of the DataLoader() constructor. This function is applied to each batch, and its return value becomes the vectorized batch. The function loops through each text document of the batch and tokenizes it. During tokenization, it makes sure that we keep 25 tokens per text example: examples with fewer than 25 tokens are padded with empty-string tokens, and examples with more than 25 tokens are truncated to 25 tokens. It then retrieves GloVe embeddings for the tokens of the batch. At last, we have put the embeddings of the tokens of each text example next to each other (flattened them) and returned them along with their target labels converted to torch tensors. We have also subtracted 1 from the target labels because in the dataset they are in the range 1-4 and we need them in the range 0-3.

from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset

max_words = 25
embed_len = 300

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [tokenizer(x) for x in X]
    X = [tokens+[""] * (max_words-len(tokens))  if len(tokens)<max_words else tokens[:max_words] for tokens in X]
    X_tensor = torch.zeros(len(batch), max_words, embed_len)
    for i, tokens in enumerate(X):
        X_tensor[i] = global_vectors.get_vecs_by_tokens(tokens)
    return X_tensor.reshape(len(batch), -1), torch.tensor(Y) - 1 ## Subtracted 1 from labels to bring in range [0,1,2,3] from [1,2,3,4]

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

train_dataset, test_dataset  = torchtext.datasets.AG_NEWS()
train_dataset, test_dataset = to_map_style_dataset(train_dataset), to_map_style_dataset(test_dataset)

train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch)
test_loader  = DataLoader(test_dataset, batch_size=1024, collate_fn=vectorize_batch)
train.csv: 29.5MB [00:00, 52.8MB/s]
test.csv: 1.86MB [00:00, 33.8MB/s]
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
torch.Size([1024, 7500]) torch.Size([1024])
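Each flattened example has 25 x 300 = 7,500 values, which is why the batch shape is (1024, 7500); this also matches the max_words*embed_len input size of the first linear layer defined below.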

Define Network

In this section, we have defined the network that we'll use for classifying our text documents. The network consists of 4 linear layers with 256, 128, 64, and 4 output units respectively. We have applied ReLU activation after each linear layer except the last one. We have defined the network using the Sequential API of PyTorch.

Please feel free to check the below tutorial if you want some background on how to create neural networks using PyTorch.

from torch import nn
from torch.nn import functional as F

class EmbeddingClassifier(nn.Module):
    def __init__(self):
        super(EmbeddingClassifier, self).__init__()
        self.seq = nn.Sequential(
            nn.Linear(max_words*embed_len, 256),
            nn.ReLU(),

            nn.Linear(256,128),
            nn.ReLU(),

            nn.Linear(128,64),
            nn.ReLU(),

            nn.Linear(64, len(target_classes)),
        )

    def forward(self, X_batch):
        return self.seq(X_batch)
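As a quick optional sanity check (not in the original notebook), we can instantiate the network and feed it a random batch to confirm that the output shape is (batch_size, 4).

classifier = EmbeddingClassifier()

preds = classifier(torch.randn(2, max_words * embed_len))  ## Random batch of 2 examples

print(preds.shape)  ## Expected: torch.Size([2, 4])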

Train Network

In this section, we have trained our network. To train the network, we have defined a helper function. The function takes the model, loss function, optimizer, train loader, validation loader, and the number of epochs as input. It then executes the training process for that number of epochs. During each epoch, it loops through the training data in batches using the training data loader we created earlier. For each batch, it performs a forward pass to make predictions, calculates the loss (using predictions and actual target labels), calculates gradients, and at last updates the network parameters using those gradients. It records the loss for each batch and prints the average loss across batches; note that, as written, the training loss and validation metrics are printed only every fifth epoch. We have also created another helper function that calculates the loss and accuracy of the trained model on the validation dataset.

from tqdm import tqdm
from sklearn.metrics import accuracy_score
import gc

def CalcValLossAndAccuracy(model, loss_fn, val_loader):
    with torch.no_grad():
        Y_shuffled, Y_preds, losses = [],[],[]
        for X, Y in val_loader:
            preds = model(X)
            loss = loss_fn(preds, Y)
            losses.append(loss.item())

            Y_shuffled.append(Y)
            Y_preds.append(preds.argmax(dim=-1))

        Y_shuffled = torch.cat(Y_shuffled)
        Y_preds = torch.cat(Y_preds)

        print("Valid Loss : {:.3f}".format(torch.tensor(losses).mean()))
        print("Valid Acc  : {:.3f}".format(accuracy_score(Y_shuffled.detach().numpy(), Y_preds.detach().numpy())))

def TrainModel(model, loss_fn, optimizer, train_loader, val_loader, epochs=10):
    for i in range(1, epochs+1):
        losses = []
        for X, Y in tqdm(train_loader):
            Y_preds = model(X)

            loss = loss_fn(Y_preds, Y)
            losses.append(loss.item())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if i%5==0:
            print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
            CalcValLossAndAccuracy(model, loss_fn, val_loader)

Below, we have trained our network using the function we designed in the previous cell. We have set the number of epochs to 25 and the learning rate to 0.001. Then, we have initialized the cross-entropy loss, our text classification network, and the Adam optimizer. At last, we have called our training routine with the necessary parameters to perform training. We can notice from the loss and accuracy values getting printed that our model is doing a good job.

from torch.optim import Adam

epochs = 25
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
embed_classifier = EmbeddingClassifier()
optimizer = Adam(embed_classifier.parameters(), lr=learning_rate)

TrainModel(embed_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:22<00:00,  5.24it/s]
100%|██████████| 118/118 [00:22<00:00,  5.22it/s]
100%|██████████| 118/118 [00:22<00:00,  5.17it/s]
100%|██████████| 118/118 [00:23<00:00,  5.02it/s]
100%|██████████| 118/118 [00:23<00:00,  5.10it/s]
Train Loss : 0.162
Valid Loss : 0.477
Valid Acc  : 0.861
100%|██████████| 118/118 [00:22<00:00,  5.31it/s]
100%|██████████| 118/118 [00:22<00:00,  5.23it/s]
100%|██████████| 118/118 [00:22<00:00,  5.32it/s]
100%|██████████| 118/118 [00:22<00:00,  5.30it/s]
100%|██████████| 118/118 [00:22<00:00,  5.25it/s]
Train Loss : 0.068
Valid Loss : 0.736
Valid Acc  : 0.842
100%|██████████| 118/118 [00:22<00:00,  5.22it/s]
100%|██████████| 118/118 [00:22<00:00,  5.28it/s]
100%|██████████| 118/118 [00:22<00:00,  5.27it/s]
100%|██████████| 118/118 [00:22<00:00,  5.17it/s]
100%|██████████| 118/118 [00:26<00:00,  4.51it/s]
Train Loss : 0.013
Valid Loss : 0.823
Valid Acc  : 0.882
100%|██████████| 118/118 [00:22<00:00,  5.23it/s]
100%|██████████| 118/118 [00:22<00:00,  5.16it/s]
100%|██████████| 118/118 [00:22<00:00,  5.15it/s]
100%|██████████| 118/118 [00:22<00:00,  5.23it/s]
100%|██████████| 118/118 [00:22<00:00,  5.17it/s]
Train Loss : 0.004
Valid Loss : 0.845
Valid Acc  : 0.882
100%|██████████| 118/118 [00:23<00:00,  5.11it/s]
100%|██████████| 118/118 [00:22<00:00,  5.17it/s]
100%|██████████| 118/118 [00:22<00:00,  5.29it/s]
100%|██████████| 118/118 [00:22<00:00,  5.14it/s]
100%|██████████| 118/118 [00:23<00:00,  5.12it/s]
Train Loss : 0.002
Valid Loss : 0.903
Valid Acc  : 0.883

Evaluate Network Performance

Here, we have evaluated the network performance by calculating the accuracy, classification report (precision, recall, and f1-score per target class), and confusion matrix for test predictions. We have created a helper function that takes the model and loader objects as input and returns predictions. We can notice from the accuracy that our model is doing quite a good job at classifying text documents.

In the cell after that, we have also plotted the confusion matrix for test predictions using the scikit-plot library. From the plot, we can notice that our model is doing better for the Sports and World categories compared to the Business and Sci/Tech categories. If you are interested in learning about the various ML metric plots available from scikit-plot, please feel free to check the link below.

def MakePredictions(model, loader):
    Y_shuffled, Y_preds = [], []
    for X, Y in loader:
        preds = model(X)
        Y_preds.append(preds)
        Y_shuffled.append(Y)
    gc.collect()
    Y_preds, Y_shuffled = torch.cat(Y_preds), torch.cat(Y_shuffled)

    return Y_shuffled.detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).detach().numpy()

Y_actual, Y_preds = MakePredictions(embed_classifier, test_loader)
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))
Test Accuracy : 0.8831578947368421

Classification Report :
              precision    recall  f1-score   support

       World       0.90      0.89      0.90      1900
      Sports       0.94      0.95      0.95      1900
    Business       0.84      0.84      0.84      1900
    Sci/Tech       0.84      0.85      0.85      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600


Confusion Matrix :
[[1689   61   80   70]
 [  25 1809   37   29]
 [  78   25 1597  200]
 [  76   27  180 1617]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actual], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
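Below is a hypothetical helper, not part of the original tutorial, sketching how the trained network could classify a brand-new headline by reusing the tokenizer, GloVe vectors, and padding logic defined above (the helper name and the sample headline are our own).

def predict_news_category(text, model=embed_classifier):
    ## Tokenize and pad/truncate to max_words, mirroring vectorize_batch() above.
    tokens = tokenizer(text)
    tokens = tokens + [""] * (max_words - len(tokens)) if len(tokens) < max_words else tokens[:max_words]
    X = global_vectors.get_vecs_by_tokens(tokens, lower_case_backup=True).reshape(1, -1)  ## (1, max_words*embed_len)
    with torch.no_grad():
        probs = F.softmax(model(X), dim=-1)
    return target_classes[probs.argmax(dim=-1).item()]

print(predict_news_category("Stock markets rally after strong quarterly earnings"))  ## Likely 'Business'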


Approach 2: GloVe '840B' (Embeddings Length=300, Tokens per Text Example=50)

Our approach in this section is almost the same as in the previous section, with the only difference that we are using 50 tokens per text example here, unlike our previous approach where we used 25 tokens per text example. We are still using the same GloVe 840B word embeddings. The majority of the code in this section is the same as in the previous section.

Load Datasets And Create Data Loaders

Below, we have loaded datasets and defined data loaders. We have set the max tokens per text example at the beginning to 50.

from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset

max_words = 50
embed_len = 300

train_dataset, test_dataset  = torchtext.datasets.AG_NEWS()
train_dataset, test_dataset = to_map_style_dataset(train_dataset), to_map_style_dataset(test_dataset)

train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch)
test_loader  = DataLoader(test_dataset, batch_size=1024, collate_fn=vectorize_batch)
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
torch.Size([1024, 15000]) torch.Size([1024])

Define Network

Below, we have again defined our network which has exactly the same structure as our network from the previous section. The only difference is the input length to the first layer which is 15000 (50 * 300) this time.

from torch import nn
from torch.nn import functional as F

class EmbeddingClassifier(nn.Module):
    def __init__(self):
        super(EmbeddingClassifier, self).__init__()
        self.seq = nn.Sequential(
            nn.Linear(max_words*embed_len, 256),
            nn.ReLU(),

            nn.Linear(256,128),
            nn.ReLU(),

            nn.Linear(128,64),
            nn.ReLU(),

            nn.Linear(64, len(target_classes)),
        )

    def forward(self, X_batch):
        return self.seq(X_batch)

Train Network

Here, we have trained our network using exactly the same settings that we have used in our previous section. We can notice from the loss and accuracy getting printed after each epoch that the model is doing a good job.

from torch.optim import Adam

epochs = 25
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
embed_classifier = EmbeddingClassifier()
optimizer = Adam(embed_classifier.parameters(), lr=learning_rate)

TrainModel(embed_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:42<00:00,  2.74it/s]
100%|██████████| 118/118 [00:42<00:00,  2.78it/s]
100%|██████████| 118/118 [00:42<00:00,  2.74it/s]
100%|██████████| 118/118 [00:42<00:00,  2.79it/s]
100%|██████████| 118/118 [00:42<00:00,  2.78it/s]
Train Loss : 0.118
Valid Loss : 0.452
Valid Acc  : 0.878
100%|██████████| 118/118 [00:42<00:00,  2.76it/s]
100%|██████████| 118/118 [00:43<00:00,  2.74it/s]
100%|██████████| 118/118 [00:47<00:00,  2.46it/s]
100%|██████████| 118/118 [00:43<00:00,  2.70it/s]
100%|██████████| 118/118 [00:43<00:00,  2.69it/s]
Train Loss : 0.045
Valid Loss : 0.635
Valid Acc  : 0.887
100%|██████████| 118/118 [00:43<00:00,  2.72it/s]
100%|██████████| 118/118 [00:43<00:00,  2.68it/s]
100%|██████████| 118/118 [00:44<00:00,  2.68it/s]
100%|██████████| 118/118 [00:43<00:00,  2.72it/s]
100%|██████████| 118/118 [00:44<00:00,  2.66it/s]
Train Loss : 0.003
Valid Loss : 0.852
Valid Acc  : 0.890
100%|██████████| 118/118 [00:43<00:00,  2.73it/s]
100%|██████████| 118/118 [00:44<00:00,  2.67it/s]
100%|██████████| 118/118 [00:43<00:00,  2.72it/s]
100%|██████████| 118/118 [00:43<00:00,  2.72it/s]
100%|██████████| 118/118 [00:44<00:00,  2.67it/s]
Train Loss : 0.001
Valid Loss : 0.905
Valid Acc  : 0.889
100%|██████████| 118/118 [00:47<00:00,  2.50it/s]
100%|██████████| 118/118 [00:43<00:00,  2.72it/s]
100%|██████████| 118/118 [00:43<00:00,  2.69it/s]
100%|██████████| 118/118 [00:43<00:00,  2.69it/s]
100%|██████████| 118/118 [00:43<00:00,  2.70it/s]
Train Loss : 0.001
Valid Loss : 0.961
Valid Acc  : 0.890

Evaluate Network Performance

In this section, we have evaluated the network performance by calculating various metrics like our previous approach. We can notice from the test accuracy that there is very little improvement in accuracy by changing the approach to use 50 words per text example compared to 25 words.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_actual, Y_preds = MakePredictions(embed_classifier, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))
Test Accuracy : 0.8897368421052632

Classification Report :
              precision    recall  f1-score   support

       World       0.91      0.89      0.90      1900
      Sports       0.95      0.96      0.95      1900
    Business       0.84      0.86      0.85      1900
    Sci/Tech       0.86      0.85      0.85      1900

    accuracy                           0.89      7600
   macro avg       0.89      0.89      0.89      7600
weighted avg       0.89      0.89      0.89      7600


Confusion Matrix :
[[1696   57   78   69]
 [  27 1828   32   13]
 [  67   23 1626  184]
 [  70   21  197 1612]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actual], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);


Approach 3: GloVe '42B' (Embeddings Length=300, Tokens per Text Example=50)

Our approach in this section is almost exactly the same as in the previous section, with the only difference that we are using GloVe 42B word embeddings this time instead of 840B embeddings. We have again used 50 tokens per text example. The GloVe 42B embeddings cover 1.9 million unique tokens.

Load GloVe '42B' Embeddings

Below, we have loaded GloVe 42B word embeddings using GloVe() constructor.

from torchtext.vocab import GloVe

global_vectors = GloVe(name='42B', dim=300)
.vector_cache/glove.42B.300d.zip: 1.88GB [05:56, 5.26MB/s]
100%|█████████▉| 1917493/1917494 [03:07<00:00, 10213.34it/s]
embeddings = global_vectors.get_vecs_by_tokens(tokenizer("Hello, How are you?"), lower_case_backup=True)

embeddings.shape
torch.Size([6, 300])

Load Datasets And Create Data Loaders

In this section, we have again loaded datasets and created data loaders from them. The new data loaders will now use GloVe 42B word embeddings to vectorize text data.

from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset

max_words = 50
embed_len = 300

train_dataset, test_dataset  = torchtext.datasets.AG_NEWS()
train_dataset, test_dataset = to_map_style_dataset(train_dataset), to_map_style_dataset(test_dataset)

train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch)
test_loader  = DataLoader(test_dataset, batch_size=1024, collate_fn=vectorize_batch)

Define Network

Below, we have again defined a network that we'll use for our text classification task. It has exactly the same code as our previous example.

from torch import nn
from torch.nn import functional as F

class EmbeddingClassifier(nn.Module):
    def __init__(self):
        super(EmbeddingClassifier, self).__init__()
        self.seq = nn.Sequential(
            nn.Linear(max_words*embed_len, 256),
            nn.ReLU(),

            nn.Linear(256,128),
            nn.ReLU(),

            nn.Linear(128,64),
            nn.ReLU(),

            nn.Linear(64, len(target_classes)),
        )

    def forward(self, X_batch):
        return self.seq(X_batch)

Train Network

Here, we have trained our network again using the same settings we had used in our previous approaches. The loss and accuracy values getting printed indicate that our model is doing a good job.

from torch.optim import Adam

epochs = 25
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
embed_classifier = EmbeddingClassifier()
optimizer = Adam(embed_classifier.parameters(), lr=learning_rate)

TrainModel(embed_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:48<00:00,  2.41it/s]
100%|██████████| 118/118 [00:43<00:00,  2.71it/s]
100%|██████████| 118/118 [00:44<00:00,  2.65it/s]
100%|██████████| 118/118 [00:43<00:00,  2.69it/s]
100%|██████████| 118/118 [00:43<00:00,  2.70it/s]
Train Loss : 0.141
Valid Loss : 0.352
Valid Acc  : 0.891
100%|██████████| 118/118 [00:45<00:00,  2.61it/s]
100%|██████████| 118/118 [00:44<00:00,  2.63it/s]
100%|██████████| 118/118 [00:52<00:00,  2.24it/s]
100%|██████████| 118/118 [00:45<00:00,  2.58it/s]
100%|██████████| 118/118 [00:44<00:00,  2.63it/s]
Train Loss : 0.056
Valid Loss : 0.712
Valid Acc  : 0.870
100%|██████████| 118/118 [00:44<00:00,  2.64it/s]
100%|██████████| 118/118 [00:45<00:00,  2.57it/s]
100%|██████████| 118/118 [00:45<00:00,  2.62it/s]
100%|██████████| 118/118 [00:45<00:00,  2.61it/s]
100%|██████████| 118/118 [00:45<00:00,  2.57it/s]
Train Loss : 0.004
Valid Loss : 0.764
Valid Acc  : 0.894
100%|██████████| 118/118 [00:44<00:00,  2.63it/s]
100%|██████████| 118/118 [00:45<00:00,  2.61it/s]
100%|██████████| 118/118 [00:46<00:00,  2.56it/s]
100%|██████████| 118/118 [00:44<00:00,  2.63it/s]
100%|██████████| 118/118 [00:45<00:00,  2.61it/s]
Train Loss : 0.002
Valid Loss : 0.761
Valid Acc  : 0.896
100%|██████████| 118/118 [00:50<00:00,  2.32it/s]
100%|██████████| 118/118 [00:45<00:00,  2.62it/s]
100%|██████████| 118/118 [00:46<00:00,  2.56it/s]
100%|██████████| 118/118 [00:45<00:00,  2.61it/s]
100%|██████████| 118/118 [00:45<00:00,  2.60it/s]
Train Loss : 0.001
Valid Loss : 0.807
Valid Acc  : 0.894

Evaluate Network Performance

Below, we have calculated various ML metrics to evaluate the performance of our network as usual, and we have also plotted the confusion matrix. We can notice from the results that there is a slight improvement in accuracy even though we have used GloVe 42B embeddings, which cover fewer tokens compared to GloVe 840B embeddings.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_actual, Y_preds = MakePredictions(embed_classifier, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))
Test Accuracy : 0.8944736842105263

Classification Report :
              precision    recall  f1-score   support

       World       0.91      0.90      0.90      1900
      Sports       0.96      0.97      0.96      1900
    Business       0.84      0.86      0.85      1900
    Sci/Tech       0.87      0.85      0.86      1900

    accuracy                           0.89      7600
   macro avg       0.89      0.89      0.89      7600
weighted avg       0.89      0.89      0.89      7600


Confusion Matrix :
[[1716   42   86   56]
 [  30 1840   22    8]
 [  73   22 1626  179]
 [  75   13  196 1616]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actual], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);


Approach 4: GloVe '840B' Averaged (Embeddings Length=300, Tokens per Text Example=50)

Our approach in this section again uses GloVe 840B embeddings and 50 tokens per text example. The main difference is the way we handle embeddings per text example. Until now, all our approaches kept the embeddings of all tokens and laid them next to each other to create a single big vector for each text example. In our previous example, where we kept 50 tokens per text example with an embedding length of 300 per token, the flattened vector had 50 x 300 = 15,000 values.

But in this approach, we have made a minor change to how we handle embeddings per text example: we take the average of the embeddings of all tokens of a text example. Since we average embeddings of length 300 across 50 tokens, we end up with a single vector of length 300 per text example.
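To make the difference concrete, below is a tiny illustrative shape check (with dummy values, not actual dataset batches) comparing flattening, averaging, and summing token embeddings.

dummy = torch.randn(4, 50, 300)        ## (batch, max_words, embed_len) with dummy values

print(dummy.reshape(4, -1).shape)      ## torch.Size([4, 15000]) -- flattening (Approaches 1-3)
print(dummy.mean(dim=1).shape)         ## torch.Size([4, 300])   -- averaging (this approach)
print(dummy.sum(dim=1).shape)          ## torch.Size([4, 300])   -- summing (Approach 5)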

Load GloVe '840B' Embeddings

Below, we have again loaded GloVe 840B word embeddings which we'll use in this section.

from torchtext.vocab import GloVe

global_vectors = GloVe(name='840B', dim=300)

Load Datasets And Create Data Loaders

Here, we have again loaded our datasets and created data loaders. The only difference is in the last line of the vectorization function that we pass to the collate_fn parameter: we have called the mean() function to take an average of the embeddings of the tokens of each text example. The rest of the code is the same as earlier. This returns averaged embeddings for each batch.

from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset

max_words = 50
embed_len = 300

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [tokenizer(x) for x in X]
    X = [tokens+[""] * (max_words-len(tokens))  if len(tokens)<max_words else tokens[:max_words] for tokens in X]
    X_tensor = torch.zeros(len(batch), max_words, embed_len)
    for i, tokens in enumerate(X):
        X_tensor[i] = global_vectors.get_vecs_by_tokens(tokens)
    return X_tensor.mean(dim=1), torch.tensor(Y) - 1 ## Averaging embeddings across all words of the text document

train_dataset, test_dataset  = torchtext.datasets.AG_NEWS()
train_dataset, test_dataset = to_map_style_dataset(train_dataset), to_map_style_dataset(test_dataset)

train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch)
test_loader  = DataLoader(test_dataset, batch_size=1024, collate_fn=vectorize_batch)
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
torch.Size([1024, 300]) torch.Size([1024])

Define Network

Below, we have defined a network that we'll use for our task in this section. The network has the same structure as the networks we are using till now. The only difference is the input shape.

from torch import nn
from torch.nn import functional as F

class EmbeddingClassifier(nn.Module):
    def __init__(self):
        super(EmbeddingClassifier, self).__init__()
        self.seq = nn.Sequential(
            nn.Linear(embed_len, 256),
            nn.ReLU(),

            nn.Linear(256,128),
            nn.ReLU(),

            nn.Linear(128,64),
            nn.ReLU(),

            nn.Linear(64, len(target_classes)),
        )

    def forward(self, X_batch):
        return self.seq(X_batch)

Train Network

Here, we have trained our neural network using the same settings that we are using for all our previous approaches. We can notice from the loss and accuracy getting printed after each epoch that the model is getting quite good at the text classification task.

from torch.optim import Adam

epochs = 25
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
embed_classifier = EmbeddingClassifier()
optimizer = Adam(embed_classifier.parameters(), lr=learning_rate)

TrainModel(embed_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:25<00:00,  4.70it/s]
100%|██████████| 118/118 [00:25<00:00,  4.61it/s]
100%|██████████| 118/118 [00:25<00:00,  4.55it/s]
100%|██████████| 118/118 [00:25<00:00,  4.63it/s]
100%|██████████| 118/118 [00:25<00:00,  4.63it/s]
Train Loss : 0.282
Valid Loss : 0.297
Valid Acc  : 0.898
100%|██████████| 118/118 [00:25<00:00,  4.55it/s]
100%|██████████| 118/118 [00:25<00:00,  4.56it/s]
100%|██████████| 118/118 [00:27<00:00,  4.36it/s]
100%|██████████| 118/118 [00:25<00:00,  4.66it/s]
100%|██████████| 118/118 [00:25<00:00,  4.63it/s]
Train Loss : 0.245
Valid Loss : 0.279
Valid Acc  : 0.904
100%|██████████| 118/118 [00:35<00:00,  3.36it/s]
100%|██████████| 118/118 [00:25<00:00,  4.62it/s]
100%|██████████| 118/118 [00:25<00:00,  4.57it/s]
100%|██████████| 118/118 [00:25<00:00,  4.59it/s]
100%|██████████| 118/118 [00:25<00:00,  4.62it/s]
Train Loss : 0.219
Valid Loss : 0.270
Valid Acc  : 0.908
100%|██████████| 118/118 [00:25<00:00,  4.61it/s]
100%|██████████| 118/118 [00:25<00:00,  4.55it/s]
100%|██████████| 118/118 [00:25<00:00,  4.56it/s]
100%|██████████| 118/118 [00:25<00:00,  4.62it/s]
100%|██████████| 118/118 [00:26<00:00,  4.53it/s]
Train Loss : 0.196
Valid Loss : 0.264
Valid Acc  : 0.913
100%|██████████| 118/118 [00:25<00:00,  4.68it/s]
100%|██████████| 118/118 [00:25<00:00,  4.60it/s]
100%|██████████| 118/118 [00:26<00:00,  4.53it/s]
100%|██████████| 118/118 [00:25<00:00,  4.60it/s]
100%|██████████| 118/118 [00:25<00:00,  4.63it/s]
Train Loss : 0.175
Valid Loss : 0.267
Valid Acc  : 0.914

Evaluate Network Performance

Here, we have evaluated various ML metrics as usual to check network performance. We can notice from the results that test accuracy is the highest of all the approaches we tried till now.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_actual, Y_preds = MakePredictions(embed_classifier, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))
Test Accuracy : 0.9142105263157895

Classification Report :
              precision    recall  f1-score   support

       World       0.90      0.93      0.92      1900
      Sports       0.97      0.98      0.97      1900
    Business       0.88      0.88      0.88      1900
    Sci/Tech       0.91      0.86      0.89      1900

    accuracy                           0.91      7600
   macro avg       0.91      0.91      0.91      7600
weighted avg       0.91      0.91      0.91      7600


Confusion Matrix :
[[1767   34   64   35]
 [  21 1861   14    4]
 [  86   13 1677  124]
 [  83   14  160 1643]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actual], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);


Approach 5: GloVe '840B' Summed (Embeddings Length=300, Tokens per Text Example=50)

Our approach in this section is almost exactly the same as in the previous section, with one minor change. We have again used GloVe 840B word embeddings and 50 tokens per text example. The main difference is that we sum up the embeddings of the tokens of each text example instead of averaging them as in the previous approach.

Load Datasets And Create Data Loaders

Below, we have again loaded datasets and created data loaders from them. There is a minor change in the definition of the vectorization function. The last line of the function uses sum() function to sum up embeddings. The rest of the code is the same as earlier.

from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset

max_words = 50
embed_len = 300

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [tokenizer(x) for x in X]
    X = [tokens+[""] * (max_words-len(tokens))  if len(tokens)<max_words else tokens[:max_words] for tokens in X]
    X_tensor = torch.zeros(len(batch), max_words, embed_len)
    for i, tokens in enumerate(X):
        X_tensor[i] = global_vectors.get_vecs_by_tokens(tokens)
    return X_tensor.sum(dim=1), torch.tensor(Y) - 1 ## Summing embeddings across all words of the text document

train_dataset, test_dataset  = torchtext.datasets.AG_NEWS()
train_dataset, test_dataset = to_map_style_dataset(train_dataset), to_map_style_dataset(test_dataset)

train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch)
test_loader  = DataLoader(test_dataset, batch_size=1024, collate_fn=vectorize_batch)
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
torch.Size([1024, 300]) torch.Size([1024])

Define Network

Below, we have again defined the network that we'll use for our task. It has exactly the same structure as our network from the previous section.

from torch import nn
from torch.nn import functional as F

class EmbeddingClassifier(nn.Module):
    def __init__(self):
        super(EmbeddingClassifier, self).__init__()
        self.seq = nn.Sequential(
            nn.Linear(embed_len, 256),
            nn.ReLU(),

            nn.Linear(256,128),
            nn.ReLU(),

            nn.Linear(128,64),
            nn.ReLU(),

            nn.Linear(64, len(target_classes)),
        )

    def forward(self, X_batch):
        return self.seq(X_batch)

Train Network

Below, we have trained our network using the same settings we have used for all our previous approaches. The loss and accuracy values getting printed at the end of each epoch indicate that the model is doing quite a good job at classifying text documents.

from torch.optim import Adam

epochs = 25
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
embed_classifier = EmbeddingClassifier()
optimizer = Adam(embed_classifier.parameters(), lr=learning_rate)

TrainModel(embed_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)
100%|██████████| 118/118 [00:25<00:00,  4.66it/s]
100%|██████████| 118/118 [00:25<00:00,  4.61it/s]
100%|██████████| 118/118 [00:25<00:00,  4.64it/s]
100%|██████████| 118/118 [00:24<00:00,  4.73it/s]
100%|██████████| 118/118 [00:25<00:00,  4.57it/s]
Train Loss : 0.240
Valid Loss : 0.281
Valid Acc  : 0.904
100%|██████████| 118/118 [00:29<00:00,  3.94it/s]
100%|██████████| 118/118 [00:26<00:00,  4.53it/s]
100%|██████████| 118/118 [00:25<00:00,  4.60it/s]
100%|██████████| 118/118 [00:25<00:00,  4.61it/s]
100%|██████████| 118/118 [00:25<00:00,  4.55it/s]
Train Loss : 0.189
Valid Loss : 0.274
Valid Acc  : 0.911
100%|██████████| 118/118 [00:25<00:00,  4.57it/s]
100%|██████████| 118/118 [00:25<00:00,  4.60it/s]
100%|██████████| 118/118 [00:25<00:00,  4.63it/s]
100%|██████████| 118/118 [00:25<00:00,  4.59it/s]
100%|██████████| 118/118 [00:25<00:00,  4.62it/s]
Train Loss : 0.155
Valid Loss : 0.289
Valid Acc  : 0.911
100%|██████████| 118/118 [00:25<00:00,  4.67it/s]
100%|██████████| 118/118 [00:26<00:00,  4.41it/s]
100%|██████████| 118/118 [00:25<00:00,  4.60it/s]
100%|██████████| 118/118 [00:25<00:00,  4.60it/s]
100%|██████████| 118/118 [00:26<00:00,  4.52it/s]
Train Loss : 0.124
Valid Loss : 0.343
Valid Acc  : 0.908
100%|██████████| 118/118 [00:25<00:00,  4.60it/s]
100%|██████████| 118/118 [00:25<00:00,  4.59it/s]
100%|██████████| 118/118 [00:25<00:00,  4.64it/s]
100%|██████████| 118/118 [00:26<00:00,  4.50it/s]
100%|██████████| 118/118 [00:25<00:00,  4.54it/s]
Train Loss : 0.118
Valid Loss : 0.369
Valid Acc  : 0.902

Evaluate Network Performance

Here, we have again calculated various ML metrics on test predictions and plotted the confusion matrix as usual to evaluate network performance. The accuracy shows that this approach is better than our first three approaches, but a little lower than our previous approach, which averaged embeddings.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Y_actual, Y_preds = MakePredictions(embed_classifier, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))
Test Accuracy : 0.9019736842105263

Classification Report :
              precision    recall  f1-score   support

       World       0.92      0.91      0.91      1900
      Sports       0.96      0.97      0.96      1900
    Business       0.82      0.92      0.87      1900
    Sci/Tech       0.92      0.80      0.86      1900

    accuracy                           0.90      7600
   macro avg       0.90      0.90      0.90      7600
weighted avg       0.90      0.90      0.90      7600


Confusion Matrix :
[[1732   50   80   38]
 [  19 1852   23    6]
 [  53   15 1752   80]
 [  82   22  277 1519]]
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np

skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_actual], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Purples",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);


This ends our small tutorial explaining how we can use pre-trained embeddings like GloVe for text classification tasks with PyTorch networks. Of the five approaches we tried, averaging GloVe 840B embeddings over 50 tokens gave the best test accuracy (~91.4%), followed by summing embeddings (~90.2%), while flattening embeddings gave roughly 88-89%. Please feel free to let us know your views in the comments section.
