Machine learning libraries expect input in the form of floats, and of fixed length/dimensions as well. But in real life we face data in different forms like text, images, audio, video, etc. We need to find a way to represent these forms of data as floats to be able to train learning algorithms on them. In this tutorial, we'll discuss how to convert free-form text of variable length into an array of floats (a process generally called feature extraction).
We'll start with a simple method for representing text data called bag of words.
Here, we'll assume the data comes to us as a single string per instance (spam mail, book, news article, etc.). We'll split each instance into a list of tokens based on whitespace and then lowercase each word. We'll repeat this process for each instance in the dataset. At the end of the process, we'll have quite a big vocabulary of words collected from all instances.
Now, for each of our samples, we can check which vocabulary words appear in it. We'll represent each string as a single vector whose length equals the vocabulary size; entries for words present in that string will hold their counts (or simply 1), and all other entries will be 0. We'll repeat the process for each instance of data.
At the end of the process, we'll end up with an array of size (number_of_samples x vocabulary_size), which will be quite sparse because the vocabulary contains all possible words and each sentence contains only a few of them.
It's called bag-of-words because the order of words is lost entirely.
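To make the idea concrete, below is a minimal pure-Python sketch of the process described above (the helper name build_bow and the tiny example corpus are ours; scikit-learn's CountVectorizer, used later in this tutorial, does all of this for us).
def build_bow(docs):
    ## Tokenize: lowercase each document and split it on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    ## Build a sorted vocabulary of all unique tokens across documents.
    vocab = sorted({token for tokens in tokenized for token in tokens})
    index = {token: i for i, token in enumerate(vocab)}
    ## Represent each document as a vector of token counts over the vocabulary.
    vectors = []
    for tokens in tokenized:
        vec = [0] * len(vocab)
        for token in tokens:
            vec[index[token]] += 1
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = build_bow(["good movie", "not a good movie", "a good day"])
print(vocab)    ## ['a', 'day', 'good', 'movie', 'not']
print(vectors)  ## [[0, 0, 1, 1, 0], [1, 0, 1, 1, 1], [1, 1, 1, 0, 0]]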
We'll start by importing necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import collections
import warnings
import re
import sklearn
warnings.filterwarnings("ignore")
%matplotlib inline
Below we have created a sample dataset of 3 strings which we'll use for explanation purposes.
X = ['Welcome to coderzcolumn. We will help you learn python',
'Lets start our day by learning something new',
'Learn from tutorials, learn from blogs. Keep learning till life ends. Its a long journey']
We'll be using a simple CountVectorizer provided by scikit-learn for converting our list of strings into count vectors based on the vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(X)
Once we have initialized the model and trained it with the train data, we can transform data to floats using the transform() method. We can also use the fit_transform() method of the object to perform fitting and transforming in one step if we want to combine them.
vectorized_input = vectorizer.transform(X)
vectorized_input.shape, vectorized_input, type(vectorized_input)
Make a note that the transformed input is returned as a SciPy sparse array stored in CSR (Compressed Sparse Row) format. One can easily convert such an array to a NumPy array and back.
vectorized_input.todense(), vectorized_input.toarray()
The CountVectorizer class has an attribute named vocabulary_ which maps each word in the corpus vocabulary to its feature index; the indices follow the alphabetical order of the words.
print(vectorizer.vocabulary_)
print(vectorizer.get_feature_names())
We can transform the sparse array back to lists of tokens using the inverse_transform() method, but the order of words in the original sentences is lost.
print(vectorizer.inverse_transform(vectorized_input))
Below we list other important parameters available in the CountVectorizer model which can help us with various purposes when extracting features from text data.
input - It accepts one of the string values from the list ['content', 'filename', 'file']. 'content' expects a list of strings/bytes as input, 'filename' expects a list of file names as input, and 'file' expects a list of file objects as input. default='content'
encoding - If a list of bytes, or files opened in binary mode, is given as input, then this parameter is used to decode the data. default='utf-8'
decode_error - It accepts a string from the list ['strict', 'ignore', 'replace']. 'strict' makes the vectorizer raise an error when a byte sequence fails to decode, 'ignore' skips characters where decoding errors occur, and 'replace' replaces them with a suitable replacement character. default='strict'
preprocessor - It accepts a callable or None as value. We can create our own preprocessor function which takes a string as input and performs preprocessing according to our needs, e.g. lemmatization, stemming, etc. default=None
tokenizer - It accepts a callable or None as value. We can define our own function which splits text into words according to our needs. It's only used when analyzer='word'. default=None
def user_defined_preprocessor(sample):
    """
    sample: refers to one sample (string) of data.
    returns: the string with special characters removed, i.e. only word
             characters, separated by single spaces.
    """
    return ' '.join(re.findall(r'\w+', sample)) ## \w matches [a-zA-Z0-9_] characters in data.
def user_defined_tokenizer(sample):
    """
    sample: refers to one sample (string) of data.
    returns: a list of words obtained by lowercasing the string and
             splitting it on single spaces.
    """
    return sample.lower().split(' ')
vectorizer = CountVectorizer(preprocessor=user_defined_preprocessor, tokenizer=user_defined_tokenizer)
transformed_X = vectorizer.fit_transform(X)
print('Vocabulary', vectorizer.vocabulary_)
We can access the preprocessor and tokenizer using the build_preprocessor() and build_tokenizer() methods, and via the preprocessor and tokenizer attributes of the CountVectorizer object.
print(help(vectorizer.build_preprocessor()))
print()
print(help(vectorizer.build_tokenizer()))
print()
print('Preprocessor : ', vectorizer.preprocessor)
print()
print('Tokenizer : ',vectorizer.tokenizer)
stop_words - It accepts the string 'english', a list of words, or None as value. These words are removed during tokenization and hence won't appear in the final vocabulary. It's only applied when analyzer='word'. default=None
token_pattern - The regular expression that decides what counts as one token (word). It's only applied when analyzer='word'. default='(?u)\\b\\w\\w+\\b'
vectorizer = CountVectorizer(stop_words=['we','you', 'it','its','to','a','an', 'the', 'by'])
transformed_X = vectorizer.fit_transform(X)
print('Vocabulary : ', vectorizer.vocabulary_)
vectorizer = CountVectorizer(stop_words='english')
transformed_X = vectorizer.fit_transform(X)
print('Vocabulary : ', vectorizer.vocabulary_)
vectorizer = CountVectorizer(stop_words=None)
transformed_X = vectorizer.fit_transform(X)
print('Vocabulary : ', vectorizer.vocabulary_)
Please note above how the vocabulary changes with the different stop_words values.
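The token_pattern parameter described above can be customized in the same way. Below is a small illustrative example of ours: the default pattern keeps only tokens of two or more word characters (so single-character words like "a" are dropped), while the alternative pattern we pass here keeps them as well.
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b') ## keep single-character tokens too
transformed_X = vectorizer.fit_transform(X)
print('Vocabulary : ', vectorizer.vocabulary_)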
ngram_range - It accepts a tuple (min_n, max_n) giving the minimum and maximum n-gram sizes to consider. It's explained in more depth further below in the tutorial. default=(1, 1)
analyzer - It accepts a string from the list ['word', 'char', 'char_wb'] as value. It decides what should be considered one token (a word or a character). default='word'
vectorizer = CountVectorizer(ngram_range=(2,3), analyzer='word', stop_words='english')
transformed_X = vectorizer.fit_transform(X)
print('Vocabulary : ', vectorizer.vocabulary_)
Please note above that we have kept only 2-word and 3-word sequences as tokens while removing stop words.
vectorizer = CountVectorizer(ngram_range=(4, 5), analyzer='char', stop_words='english')
transformed_X = vectorizer.fit_transform(X)
print('Vocabulary : ', list(vectorizer.vocabulary_.items())[:20])
Please note above that we have kept only 4-character and 5-character sequences as tokens (as mentioned earlier, stop_words has no effect when analyzer='char'). We are printing only 20 tokens to prevent the output from flooding.
max_features - It accepts an int or None as value. If an integer is provided, only that many of the most frequent tokens across the corpus are kept. If the vocabulary parameter described below is given, this parameter is ignored. default=None
vocabulary - It accepts a mapping (dict) or an iterable as value. A mapping should be a dictionary with tokens as keys and indices as values; an iterable should simply be an iterable over tokens. default=None
Note: The document frequency of a token is the total number of documents that contain that token.
vectorizer = CountVectorizer(max_features=10)
transformed_X = vectorizer.fit_transform(X)
print('Vocabulary : ', list(vectorizer.vocabulary_.items())[:20])
vectorizer = CountVectorizer(vocabulary={'welcome':0,'tutorials':1, 'blogs':2})
transformed_X = vectorizer.fit_transform(X)
print('Vocabulary : ', list(vectorizer.vocabulary_.items())[:20])
min_df - It accepts a float in the range [0.0, 1.0] or an int. It ignores all tokens whose document frequency is strictly lower than the given threshold (a float is interpreted as a proportion of documents, an int as an absolute count). default=1
max_df - It accepts a float in the range [0.0, 1.0] or an int. It ignores all tokens whose document frequency is strictly higher than the given threshold. default=1.0
vectorizer = CountVectorizer(min_df=0.25, max_df=0.75)
transformed_X = vectorizer.fit_transform(X)
print('Vocabulary : ', list(vectorizer.vocabulary_.items())[:20])
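As a quick sanity check on the document frequencies that min_df and max_df operate on, we can compute them ourselves from a plain count matrix (this verification snippet and the variable plain_vectorizer are our own, not part of the scikit-learn API).
plain_vectorizer = CountVectorizer()
counts = plain_vectorizer.fit_transform(X).toarray()
doc_freq = (counts > 0).sum(axis=0) ## number of documents each token appears in
for token, idx in sorted(plain_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print('%-12s document frequency = %d' % (token, doc_freq[idx]))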
tf-idf (term frequency-inverse document frequency) is a transformation applied to bag-of-words counts. It's a kind of scaling that often helps downstream models train and perform better.
The main idea behind this scaling is to down-weight words that occur in many documents, because such words have less influence on natural language processing tasks like document classification. It puts more emphasis on rarer words, giving them more weight than frequently occurring ones.
Below we'll explain step by step how tf-idf is computed, though scikit-learn also has a direct implementation for it.
Raw Term Frequency - tf(t,d): We already explained raw term frequency above, along with scikit-learn's CountVectorizer implementation used to get it.
Normalized Term Frequency: The raw term-frequency vector $v$ is normalized using l2-normalization, i.e. divided by its length $||v||_2$ (Euclidean norm).
$ v_{norm} = \dfrac {v} {||v||_2} = \dfrac {v} {(\sum_{i=1}^{n} v_i^{2})^{1/2}} $
document frequency - df(d,t): It represents the total number of documents that contain the term t.
inverse document frequency - idf(t): The formula for idf, based on document frequency, is given below.
$ idf(t) = {\log_{} \dfrac {n_d} {df(d,t)}} + 1 $
smooth_idf: Scikit-learn's transformers have a parameter called smooth_idf which changes the idf formula above into the one below.
$ idf(t) = {\log_{} \dfrac {1+n_d} {1+df(d,t)}} + 1 $
tf-idf: The final formula for tf-idf, based on the terms above, is given below.
$ tf{-}idf(t,d) = tf(t,d) * idf(t) $
tf = vectorized_input.toarray()
normalized_tf = tf[2] / np.sqrt(np.sum(tf[2]**2))
print('Normalized Term Frequency of 3rd sample : \n',normalized_tf)
We can also get the normalized term frequency using scikit-learn's class called TfidfTransformer.
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=False, norm='l2', smooth_idf=False)
tf_normalized = tfidf.fit_transform(tf).toarray()
print('Normalized Term Frequency of 3rd sample : \n', tf_normalized[2])
## Verify idf/tf-idf values by hand using the unsmoothed formula above.
n_docs = len(X)
## "welcome" appears once in the 1st sample and in 1 document overall.
tf_welcome = 1
df_welcome = 1
inverse_df_welcome = (np.log(n_docs / df_welcome) + 1)
print('tf-idf of "welcome" : ', tf_welcome * inverse_df_welcome)
## "learn" appears twice in the 3rd sample and in 2 documents overall.
tf_learn = 2
df_learn = 2
inverse_df_learn = (np.log(n_docs / df_learn) + 1)
print('tf-idf of "learn" : ', tf_learn * inverse_df_learn)
Let’s use scikit-learn's TfidfTransformer with no normalization and no smoothing of idf.
tfidf = TfidfTransformer(norm=None,smooth_idf=False,use_idf=True)
tf_idf = tfidf.fit_transform(tf).toarray()
tfidf.idf_
tf_idf[2]
Let’s use the TfidfVectorizer
class of scikit-learn for generating tf-idfs.
Note: TfidfTransformer works on the term-frequency array generated by CountVectorizer, whereas TfidfVectorizer works directly on the original list of strings.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(norm=None,smooth_idf=False,use_idf=True)
tf_idf = tfidf_vect.fit_transform(X).toarray()
tf_idf[2]
tfidf_vect.idf_
## Verify the values by hand again, this time with the smoothed idf formula (TfidfVectorizer smooths idf by default).
n_docs = len(X)
## "welcome" appears once in the 1st sample and in 1 document overall.
tf_welcome = 1
df_welcome = 1
inverse_df_welcome = (np.log((1+n_docs) / (1+df_welcome)) + 1)
print('tf-idf of "welcome" : ', tf_welcome * inverse_df_welcome)
## "learn" appears twice in the 3rd sample and in 2 documents overall.
tf_learn = 2
df_learn = 2
inverse_df_learn = (np.log((1+n_docs) / (1+df_learn)) + 1)
print('tf-idf of "learn" : ', tf_learn * inverse_df_learn)
tfidf_vect = TfidfVectorizer(norm=None) ## Tfidf with no normalization. It'll be using idf and smoothing of idf though.
tf_idf = tfidf_vect.fit_transform(X).toarray()
tf_idf[2]
tfidf_vect = TfidfVectorizer() ## tfidf with l2 normalization, using idf and smoothing idf as well
tf_idf = tfidf_vect.fit_transform(X).toarray()
tf_idf[2]
TfidfVectorizer has most of its parameters the same as those of CountVectorizer, which we explained in-depth above. One can try the parameter values explained above with TfidfVectorizer as well to check the results. The parameters specific to TfidfVectorizer (norm, use_idf, smooth_idf) have already been demonstrated above with examples.
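As a quick illustration (an example of ours combining parameters already covered above), TfidfVectorizer accepts the same tokenization-related arguments as CountVectorizer.
tfidf_vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=10)
tf_idf = tfidf_vect.fit_transform(X).toarray()
print('Vocabulary : ', tfidf_vect.vocabulary_)
print(tf_idf[2])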
Till now we have discussed only one-word tokens (1-grams, i.e. unigrams) and completely discarded the order of words. But this might not always be right, as we sometimes need to consider word order (for example, "not" can invert the meaning of a sentence).
A simple way to retain some word order is to use n-grams. N-grams look not at single words but at sequences of n neighboring words.
2-grams consist of all neighboring 2-word sequences, with an overlap of 1 word; 3-grams consist of all neighboring 3-word sequences, with an overlap of 2 words.
Which "n" to use in n-grams depends on the application and can be treated as a hyperparameter of the algorithm to be tuned.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) ## N-gram with min length of 2 and max length of 2
bigram_vectorizer.fit(X)
print(bigram_vectorizer.get_feature_names())
bigram_vectorizer.transform(X).toarray()
print(bigram_vectorizer.vocabulary_)
We can include both unigrams and bigrams in tokenization.
gram_vectorizer = CountVectorizer(ngram_range=(1, 2))
gram_vectorizer.fit(X)
print(gram_vectorizer.get_feature_names())
gram_vectorizer.transform(X).toarray()
print(gram_vectorizer.vocabulary_)
Sometimes we want to create tokens from character sequences of a particular length. Character n-grams are commonly used in language identification.
We can use analyzer="char" for generating character n-grams with CountVectorizer, as described above.
char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
char_vectorizer.fit(X)
print(char_vectorizer.get_feature_names())
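The analyzer='char_wb' option mentioned earlier is similar, but it builds character n-grams only from text inside word boundaries (n-grams at the edges of words are padded with spaces). Here is a short sketch of it for comparison.
char_wb_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char_wb")
char_wb_vectorizer.fit(X)
print(list(char_wb_vectorizer.vocabulary_.items())[:20]) ## print only first 20 tokens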
We'll now perform an SMS spam classification task using a dataset from the UCI ML repository. It'll help us walk through the whole process of text feature extraction, training a model, evaluating it, and visualizing the results.
We'll first download the data from the UCI ML repository and then perform classification after reading the file.
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip
with open('SMSSpamCollection') as f:
    data = [line.strip().split('\t') for line in f.readlines()]
y, text = zip(*data)
collections.Counter(y)
Let's split our data into training (75%) and test (25%) sets.
from sklearn.model_selection import train_test_split
text_train, text_test, y_train, y_test = train_test_split(text, y,
random_state=42,
test_size=0.25,
stratify=y)
Let’s create unigram features for each sample of the train and test sets. We'll use both the count vectorizer and the tf-idf vectorizer for comparison.
%%time
count_vectorizer = CountVectorizer()
count_vectorizer.fit(text_train)
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(text_train)
X_train_cnt = count_vectorizer.transform(text_train)
X_test_cnt = count_vectorizer.transform(text_test)
X_train_tfidf = tfidf_vectorizer.transform(text_train)
X_test_tfidf = tfidf_vectorizer.transform(text_test)
print(X_train_cnt.shape, X_test_cnt.shape, X_train_tfidf.shape, X_test_tfidf.shape)
Let's define a simple logistic regression model for classification and fit it.
from sklearn.linear_model import LogisticRegression
clf_cnt = LogisticRegression()
print(clf_cnt.fit(X_train_cnt, y_train))
clf_tfidf = LogisticRegression()
print(clf_tfidf.fit(X_train_tfidf, y_train))
Let's evaluate model accuracy on the train data.
print('Count Vectorizer Performance : %.3f'%clf_cnt.score(X_train_cnt, y_train))
print('TfIdf Vectorizer Performance : %.3f'%clf_tfidf.score(X_train_tfidf, y_train))
Let's evaluate model accuracy on the test data.
print('Count Vectorizer Performance : %.3f'%clf_cnt.score(X_test_cnt, y_test))
print('TfIdf Vectorizer Performance : %.3f'%clf_tfidf.score(X_test_tfidf, y_test))
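Earlier we noted that the n-gram range can be treated as a hyperparameter to tune. Below is a minimal sketch of our own (not part of the original workflow) that tunes it with a Pipeline and GridSearchCV on the SMS data; the parameter grid is purely illustrative.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([('vectorizer', TfidfVectorizer()),
                     ('classifier', LogisticRegression())])
param_grid = {'vectorizer__ngram_range': [(1, 1), (1, 2)]} ## illustrative grid of n-gram ranges
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(text_train, y_train)
print('Best Params : ', grid.best_params_)
print('Test Accuracy : %.3f' % grid.score(text_test, y_test))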
Let’s visualize the 20 most important features and the 20 least important ones according to the logistic regression coefficients. The least important features will mostly be words repeated in almost every other sentence, like "the", "me", "at", etc., since such words are needed for basic sentence formation. The most important words will be ones that occur far less often than those.
coef = clf_cnt.coef_.ravel()
positive_coeffs = np.argsort(coef)[-20:]
negative_coeffs = np.argsort(coef)[:20]
interesting_coeffs = np.hstack([negative_coeffs, positive_coeffs])
# plot them
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(18, 8))
    colors = ["red" if c < 0 else "green" for c in coef[interesting_coeffs]]
    plt.bar(np.arange(2 * 20), coef[interesting_coeffs], color=colors)
    feature_names = np.array(count_vectorizer.get_feature_names())
    ## Align tick positions with the bar positions (0 .. 39).
    plt.xticks(np.arange(2 * 20), feature_names[interesting_coeffs], rotation=60, ha="right")
    plt.ylabel('Feature Importance')
    plt.xlabel('Feature')
    plt.title('CountVectorizer Features Importance')
coef = clf_tfidf.coef_.ravel()
positive_coeffs = np.argsort(coef)[-20:]
negative_coeffs = np.argsort(coef)[:20]
interesting_coeffs = np.hstack([negative_coeffs, positive_coeffs])
# plot them
with plt.style.context(('seaborn', 'ggplot')):
    plt.figure(figsize=(18, 8))
    colors = ["red" if c < 0 else "green" for c in coef[interesting_coeffs]]
    plt.bar(np.arange(2 * 20), coef[interesting_coeffs], color=colors)
    feature_names = np.array(tfidf_vectorizer.get_feature_names())
    ## Align tick positions with the bar positions (0 .. 39).
    plt.xticks(np.arange(2 * 20), feature_names[interesting_coeffs], rotation=60, ha="right")
    plt.ylabel('Feature Importance')
    plt.xlabel('Feature')
    plt.title('TfidfVectorizer Features Importance')
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.