Word embeddings are one of the most popular approaches for encoding text as real-valued data for machine learning tasks. With the word embeddings approach, we first break text into a list of tokens (words) and then assign a real-valued vector (embedding) of a pre-decided length to each token. Other encoding approaches like one-hot, word frequency, TF-IDF, etc., assign just a single scalar to each token, which limits the representation power of the tokens: a single scalar can only capture a single notion of a word. Assigning a real-valued vector to each token gives it far more representation power. The same word can have different meanings in different contexts and different relationships with other words, and these nuances are better captured by embeddings.
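For intuition, here is a tiny, purely illustrative sketch (the words and numbers are made up, not taken from any trained model) of the difference between a single-number encoding and an embedding vector per token.

## Purely illustrative values -- not from any trained model.
one_number_per_token = {"car": 0, "truck": 1, "banana": 2}  ## e.g., a simple index / one-hot id

embedding_per_token = {
    "car":    [0.21, -0.43, 0.77, 0.05],   ## a made-up 4-dimensional embedding
    "truck":  [0.19, -0.40, 0.71, 0.02],   ## related words can end up with similar vectors
    "banana": [-0.60, 0.33, -0.12, 0.88],  ## unrelated words point in other directions
}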
GloVe (Global Vectors) is an unsupervised learning algorithm trained on a large corpus of text to capture the meaning of words by generating word embeddings for them. These word embeddings can then be reused by other ML tasks that have smaller datasets. Using pre-trained GloVe embeddings saves us the time needed to generate and train our own embeddings, which might not even be feasible with small datasets. There are different versions of GloVe embeddings, released by Stanford researchers, with different embedding lengths. Please feel free to check the link below if you want to learn more about GloVe.
As a part of this tutorial, we have designed neural networks using the Python deep learning library Keras (TensorFlow) that use GloVe word embeddings (840B.300d) for a text classification task. We have tried different ways of using the embeddings and recorded the results for comparison. We have also calculated various ML metrics to evaluate the performance of our networks. Apart from the metrics, we have also explained the predictions made by our networks using the LIME (Local Interpretable Model-Agnostic Explanations) algorithm.
Below, we have listed the important sections of the tutorial to give an overview of the material covered.
Below, we have imported the Python deep learning library Keras (TensorFlow) and printed the version used in this tutorial.
import tensorflow
from tensorflow import keras
print("Keras Version : {}".format(keras.__version__))
In this section, we are preparing the data to be given to the neural network for training. We'll follow the steps below to use GloVe word embeddings for the text classification task.

1. Tokenize the text examples and populate a vocabulary of all unique tokens.
2. Map the tokens of each text example to their vocabulary indexes and pad/truncate to a fixed length.
3. Build an embedding matrix that contains the GloVe embedding for each token in the vocabulary.
4. Use the token indexes to retrieve embeddings from this matrix (handled by the network's Embedding layer).
So basically, we first tokenize the text examples, populate a vocabulary, and retrieve the token indexes for the tokens of each text example. We can then retrieve GloVe embeddings for these token indexes by indexing the embedding matrix.
The first three steps mentioned above will be completed in this section, whereas the fourth step will be implemented in the neural network as the Embedding layer. The GloVe embedding matrix will be set as the weight matrix of the first layer of the network (the embedding layer), and this layer will retrieve embeddings from the weight matrix based on the input token indexes.
Below, we have included an image that explains word embeddings.
In this step, we have simply downloaded the GloVe 840B.300d word embeddings from the Stanford website as a zip file and then unzipped it. This set of embeddings contains vectors of length 300 for 2.2 million tokens.
!wget https://nlp.stanford.edu/data/glove.840B.300d.zip
!unzip glove.840B.300d.zip
In this section, we are simply loading the GloVe embeddings into memory from the file. We have created a simple dictionary whose keys are tokens (words) and whose values are embeddings.
%%time
import numpy as np
glove_embeddings = {}
with open("glove.840B.300d.txt") as f:
for line in f:
try:
line = line.split()
glove_embeddings[line[0]] = np.array(line[1:], dtype=np.float32)
except:
continue
embeddings = glove_embeddings["the"]
embeddings.shape, embeddings.dtype
In this section, we have loaded the dataset that we are going to use for our text classification task. We'll be using the newsgroups dataset available from scikit-learn. The dataset has ~20k text documents for 20 different categories of news. We have loaded text documents for four categories for our purpose. The dataset is already divided into train and test sets for our convenience.
import numpy as np
from sklearn import datasets
import gc
all_categories = ['alt.atheism','comp.graphics','comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware',
                  'comp.sys.mac.hardware','comp.windows.x', 'misc.forsale','rec.autos','rec.motorcycles',
                  'rec.sport.baseball','rec.sport.hockey','sci.crypt','sci.electronics','sci.med',
                  'sci.space','soc.religion.christian','talk.politics.guns','talk.politics.mideast',
                  'talk.politics.misc','talk.religion.misc']
target_classes = ['comp.sys.ibm.pc.hardware','rec.autos','rec.sport.hockey','talk.politics.mideast']
X_train_text, Y_train = datasets.fetch_20newsgroups(subset="train", categories=target_classes, return_X_y=True)
X_test_text , Y_test = datasets.fetch_20newsgroups(subset="test", categories=target_classes, return_X_y=True)
classes = np.unique(Y_train)
mapping = dict(zip(classes, target_classes))
len(X_train_text), len(X_test_text), classes, mapping
In this section, we have implemented the first two steps of our encoding process which we had explained at the beginning.
First, we have created an instance of Tokenizer and called its fit_on_texts() method, providing it with the train and test examples. A call to this method internally populates a vocabulary of all unique tokens inside the tokenizer object.
Next, we have called texts_to_sequences() method on Tokenizer object with train and test text examples. This method will tokenize each text example into tokens and then retrieve indexes of those tokens from the vocabulary.
Now, each of our text examples has a different length and hence a different number of tokens (words). We have decided to keep a maximum of 50 tokens per text example. To do this, we have called the pad_sequences() function on the lists of token indexes. This function makes sure that each example has exactly 50 token indexes: examples with more than 50 tokens are truncated to 50, and those with fewer than 50 tokens are padded with 0s to bring them to length 50.
So we have turned each text example into a list of token indexes. We'll later use these indexes to integer-index the embedding matrix and retrieve the embeddings of the tokens.
In the cells following the code cell below, we have printed the number of tokens in the vocabulary.
Below, we have illustrated with a simple example how a text example is vectorized.
text = "Hello, How are you? Where are you planning to go?"
tokens = ['hello', ',', 'how', 'are', 'you', '?', 'where',
'are', 'you', 'planning', 'to', 'go', '?']
vocab = {
'hello': 0,
'bye': 1,
'how': 2,
'the': 3,
'welcome': 4,
'are': 5,
'you': 6,
'to': 7,
'<unk>': 8,
}
vector = [0,8,2,4,6,8,8,5,6,8,7,8,8]
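To connect this toy example with the padding step described above, below is a minimal sketch (assuming a target length of 15 instead of the 50 we actually use, purely to keep the output short) of what pad_sequences() does to such a vector.

from keras.preprocessing.sequence import pad_sequences

toy_vector = [0, 8, 2, 5, 6, 8, 8, 5, 6, 8, 7, 8, 8]  ## 13 token indexes

## Pad at the end ("post") with 0s up to length 15; longer sequences would be truncated at the end.
padded = pad_sequences([toy_vector], maxlen=15, padding="post", truncating="post", value=0.)
print(padded)  ## [[0 8 2 5 6 8 8 5 6 8 7 8 8 0 0]]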
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
max_tokens = 50 ## Hyperparameter
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_text+X_test_text)
## Vectorizing data to keep 50 words per sample.
X_train_vect = pad_sequences(tokenizer.texts_to_sequences(X_train_text), maxlen=max_tokens, padding="post", truncating="post", value=0.)
X_test_vect = pad_sequences(tokenizer.texts_to_sequences(X_test_text), maxlen=max_tokens, padding="post", truncating="post", value=0.)
print(X_train_vect[:3])
X_train_vect.shape, X_test_vect.shape
print("Vocab Size : {}".format(len(tokenizer.word_index)))
## Which word has index 13 in the vocabulary?
print(tokenizer.index_word[13])
## How many times does it appear in the first text document?
print(X_train_text[0]) ## It appears 2 times.
In this section, we have implemented the third step of our encoding process that we explained earlier. We simply loop through our vocabulary and retrieve the GloVe embedding for each token. We then stack the embeddings of all vocabulary tokens into one big matrix. The matrix has shape (vocab_len, embed_len). The embedding length in our case is 300, as mentioned earlier.
We can now retrieve a GloVe embedding from this matrix by integer-indexing it with the token's index. For example, if the index of the token 'the' in our vocabulary is 1, then we can simply index the embedding matrix as 'word_embeddings[1]' to retrieve the embedding of 'the'.
%%time
embed_len = 300
word_embeddings = np.zeros((len(tokenizer.index_word)+1, embed_len))
## For each vocabulary token, copy its GloVe vector into the matrix row at its index.
## Tokens without a GloVe entry keep a zero vector.
for idx, word in tokenizer.index_word.items():
    word_embeddings[idx] = glove_embeddings.get(word, np.zeros(embed_len))
word_embeddings[1][:10]
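As a small optional sanity check (not part of the original flow, and assuming the token 'the' is present both in the tokenizer vocabulary and in the GloVe dictionary), we can verify that a row of our embedding matrix matches the vector loaded from the GloVe file.

## Optional sanity check: the matrix row for a token should equal its GloVe vector.
token = "the"
idx = tokenizer.word_index[token]                       ## Index of the token in the vocabulary
row = word_embeddings[idx]                              ## Corresponding row of our embedding matrix
ref = glove_embeddings.get(token, np.zeros(embed_len))  ## Vector loaded from the GloVe file
print(np.allclose(row, ref))                            ## Expected to print True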
Our first approach flattens GloVe embeddings and processes them through dense layers to make predictions. We have defined a simple network of one embedding layer and 3 dense layers for our text classification task. After training the network, we have also calculated ML metrics for evaluating the performance of the network.
In this section, we have defined a network that we'll use for our text classification task. The network consists of one embedding layer and 3 dense layers.
The first layer of the network is the embedding layer. We have created it using the Embedding() constructor, providing the vocabulary length as the number of tokens. The embedding length is set to 300. We have also set the GloVe embedding matrix we created earlier as the weights of the layer. We don't want to update these embeddings, hence we have set the trainable parameter to False, which prevents the weights from being updated. This layer is responsible for mapping token indexes to their respective embeddings. The shape of the input to the embedding layer is (batch_size, max_tokens) = (batch_size, 50) and the output shape is (batch_size, max_tokens, embed_len) = (batch_size, 50, 300).
The output of the embedding layer is flattened, which transforms the data shape from (batch_size, 50, 300) to (batch_size, 50 x 300) = (batch_size, 15000).
The flattened output is given to a dense layer with 128 output units, changing the shape from (batch_size, 15000) to (batch_size, 128). A relu activation is applied to the output of this dense layer.
The output of the first dense layer is given to the second dense layer, which has 64 output units, transforming the shape from (batch_size, 128) to (batch_size, 64). It also applies relu activation to its output.
The output of the second dense layer is given to the third and last dense layer of the network, which has 4 output units (the same as the number of target classes). This transforms the shape from (batch_size, 64) to (batch_size, 4). The last layer applies softmax activation to convert its outputs to probabilities.
After defining the network, we have also printed a summary of layers' output shapes and count of parameters.
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten
model = Sequential([
    Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=embed_len,
              input_length=max_tokens, trainable=False, weights=[word_embeddings]),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(64, activation="relu"),
    Dense(len(target_classes), activation="softmax")
])
model.summary()
model.weights[0][1][:10], word_embeddings[1][:10]
Here, we have compiled our network to use Adam optimizer, cross entropy loss, and accuracy metric.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
In this section, we have trained our network for 8 epochs with a batch size of 32 using the fit() method. We have provided the test dataset as a validation dataset for validation purposes. We can notice from the loss and accuracy values printed after each epoch that our network is doing a good job at the text classification task.
model.fit(X_train_vect, Y_train, batch_size=32, epochs=8, validation_data=(X_test_vect, Y_test))
In this section, we have evaluated the performance of our network by calculating accuracy score, classification report (precision, recall, and f1-score per target class) and confusion matrix metrics on test predictions. We can notice from the accuracy score that our network has done a good job at the classification task.
Here, we have calculated ML metrics using functions available from the Python library Scikit-learn. If you are interested in learning about various ML metrics available from sklearn then please check the below link which covers the majority of them in detail.
Apart from the calculations, we have also created a visualization of the confusion matrix. This lets us better understand which target classes the network handles well and which it handles poorly. We can notice from the visualization that the accuracy is almost the same for all 4 categories.
We have created confusion matrix visualization using Python library scikit-plot. It provides visualization for many ML metrics. Please feel free to check the below link if you are interested in it.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Y_preds = model.predict(X_test_vect).argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
In this section, we have explained the prediction made by our network using LIME (Local Interpretable Model-Agnostic Explanations) algorithm. The algorithm is commonly used to explain black-box ML models like deep neural networks. Currently, the Python library lime provides an implementation of the algorithm. We'll use it to create a visualization that highlights words that contributed to predicting a particular target label on a given text example.
If you are someone who is new to the concept of LIME and want to understand it better then we would recommend that you go through the below links in your spare time.
In order to create the explanation visualization using lime, we need to create an instance of LimeTextExplainer, which we have done below. We'll use it later.
from lime import lime_text
import numpy as np
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
explainer
Below, we have first created a prediction function. The function takes a batch of text examples and returns their probabilities predicted by the network. We'll use this function later to create an explanation.
Next, we randomly selected a text example from test data and made predictions on it using our trained network. We can notice that our network correctly predicts the target label as 'rec.autos' for the selected text example. Now, we'll create an explanation visualization for this text example.
def make_predictions(X_batch_text):
    ## Vectorize the batch of raw text examples the same way as the training data.
    X = tokenizer.texts_to_sequences(X_batch_text)
    X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bring all samples to max_tokens length.
    preds = model.predict(X)
    return preds
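As a quick optional check of this helper (not part of the original flow), we can call it on a couple of test examples and confirm that it returns one probability per target class.

## Optional check: the helper returns a (n_examples, n_classes) array of probabilities.
sample_probs = make_predictions(X_test_text[:2])
print(sample_probs.shape)        ## (2, 4) -- 2 examples, 4 target classes
print(sample_probs.sum(axis=1))  ## Each row sums to ~1.0 because of the softmax output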
rng = np.random.RandomState(12)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)
print("Prediction : ", target_classes[preds.argmax()])
print("Actual : ", target_classes[Y_test[idx]])
Below, we have first created an Explanation object by calling the explain_instance() method on the LimeTextExplainer instance. We provided the method with the selected text example, the prediction function we defined earlier, and the target label. Then, we have called the show_in_notebook() method on the explanation object to create the visualization.
We can notice from the visualization that words like 'cars', 'Datsun', 'Opel', 'owners', etc are contributing to predicting the target label as 'rec.autos'. This makes sense as these are commonly used words in the auto industry.
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions,
                                         labels=Y_test[idx:idx+1], num_features=15)
explanation.show_in_notebook()
Our approach in this section has a minor change in network architecture compared to the previous one: the network takes the average of the token embeddings of each text example instead of flattening them. The majority of the code is the same as in the previous section, with the only change being the network architecture.
Below, we have defined the network that we'll use for our text classification task. The network as usual consists of one embedding layer and three dense layers. The only major difference from the previous section is that the output of the embedding layer is averaged over the token dimension. This step transforms the data shape from (batch_size, max_tokens, embed_len) = (batch_size, 50, 300) to (batch_size, embed_len) = (batch_size, 300). This output is then given to the first dense layer and onward through the network.
After defining the network, we have printed a summary of layers output shapes and layer parameter counts.
from keras.models import Model
from keras.layers import Dense, Embedding, Input
inputs = Input(shape=(max_tokens, ))
embeddings_layer = Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=embed_len,
                             input_length=max_tokens, trainable=False, weights=[word_embeddings])
dense1 = Dense(128, activation="relu")
dense2 = Dense(64, activation="relu")
dense3 = Dense(len(target_classes), activation="softmax")
x = embeddings_layer(inputs)
x = tensorflow.reduce_mean(x, axis=1) ### Averaged embeddings of tokens of each example
x = dense1(x)
x = dense2(x)
outputs = dense3(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()
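As a side note on this design choice, the same averaging step could also be expressed with Keras's built-in GlobalAveragePooling1D layer instead of calling tensorflow.reduce_mean() directly. The sketch below is an equivalent alternative (reusing the inputs and embeddings_layer defined above), not the code used in this section.

## Equivalent alternative (not used above): average the token embeddings with a Keras layer.
from keras.layers import GlobalAveragePooling1D

x_alt = embeddings_layer(inputs)           ## (batch_size, 50, 300)
x_alt = GlobalAveragePooling1D()(x_alt)    ## (batch_size, 300), same as reduce_mean over axis=1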
Here, we have compiled the network to use the Adam optimizer, cross entropy loss, and accuracy metric.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
Now, we have trained our network for 8 epochs with a batch size of 32 using fit() method. We have provided test data as validation data for validation purposes. We can notice from the loss and accuracy values that our network is doing a good job at classification tasks.
model.fit(X_train_vect, Y_train, batch_size=32, epochs=8, validation_data=(X_test_vect, Y_test))
In this section, we have evaluated the performance of our trained network by calculating the accuracy score, classification report, and confusion matrix on test predictions. We can notice that the accuracy is noticeably better than with our previous approach. We have also plotted the confusion matrix, which also shows an improvement in classifying text documents into the individual categories.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Y_preds = model.predict(X_test_vect).argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
In this section, we have explained the network's predictions using the LIME algorithm. Our trained network correctly predicts the target label 'rec.autos' for the randomly selected text example from the test data. The visualization shows that words like 'cars', 'Datsun', 'Nissan', 'Opel', 'mileage', etc. contribute to predicting the target category 'rec.autos'.
from lime import lime_text
import numpy as np
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
rng = np.random.RandomState(12)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)
print("Prediction : ", target_classes[preds.argmax()])
print("Actual : ", target_classes[Y_test[idx]])
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1], num_features=15)
explanation.show_in_notebook()
Our approach in this section is the same as the previous one; the only minor difference is that we take the sum of the embeddings instead of the average. The majority of the code is the same as earlier, with the only difference being in the network architecture.
Below, we have defined the network that we'll use for our task in this section. It has the same architecture as in the previous section, the only difference being that we take the sum of the token embeddings (using reduce_sum()) on the output of the embedding layer instead of the average. The rest of the code is exactly the same.
from keras.models import Model
from keras.layers import Dense, Embedding, Input
inputs = Input(shape=(max_tokens, ))
embeddings_layer = Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=embed_len,
                             input_length=max_tokens, trainable=False, weights=[word_embeddings])
dense1 = Dense(128, activation="relu")
dense2 = Dense(64, activation="relu")
dense3 = Dense(len(target_classes), activation="softmax")
x = embeddings_layer(inputs)
x = tensorflow.reduce_sum(x, axis=1) ### Summed embeddings of tokens of each example
x = dense1(x)
x = dense2(x)
outputs = dense3(x)
model = Model(inputs=inputs, outputs=outputs)
model.summary()
Here, we have compiled a network to use Adam optimizer, cross entropy loss, and accuracy metric.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
Below, we have trained our network using the same settings that we have been using for all our approaches. The loss and accuracy values getting printed after each epoch hint that the network is doing a good job at the classification task.
model.fit(X_train_vect, Y_train, batch_size=32, epochs=8, validation_data=(X_test_vect, Y_test))
In this section, we have evaluated the performance of the network by calculating ML metrics like accuracy score, classification report and confusion matrix on test predictions. We can notice from the accuracy score that it is the highest of all our approaches. We have also plotted the confusion matrix metric for reference purposes.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Y_preds = model.predict(X_test_vect).argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_test, Y_preds, target_names=target_classes))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, Y_preds))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
import numpy as np
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test], [target_classes[i] for i in Y_preds],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5)
                                    );
plt.xticks(rotation=90);
In this section, we have explained predictions made by our network using LIME algorithm. The network correctly predicts target label as 'rec.autos' for selected text example from test dataset. The visualization shows that words like 'cars', 'datsun', 'nissan', 'mileage', 'opel', 'sell', 'rust', '240Z', '260Z', '280Z', '280ZX' etc are contributing to predicting target category as 'rec.autos'.
from lime import lime_text
import numpy as np
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
rng = np.random.RandomState(12)
idx = rng.randint(1, len(X_test_text))
X = tokenizer.texts_to_sequences(X_test_text[idx:idx+1])
X = pad_sequences(X, maxlen=max_tokens, padding="post", truncating="post", value=0) ## Bringing all samples to max_tokens length.
preds = model.predict(X)
print("Prediction : ", target_classes[preds.argmax()])
print("Actual : ", target_classes[Y_test[idx]])
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1], num_features=15)
explanation.show_in_notebook()
| Approach | Max Tokens | Embedding Length | Test Accuracy (%) |
|---|---|---|---|
| GloVe 840B Embeddings Flattened | 50 | 300 | 87.33 |
| GloVe 840B Embeddings Averaged | 50 | 300 | 90.14 |
| GloVe 840B Embeddings Summed | 50 | 300 | 90.85 |
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.