transfer-learning, pytorch

Overview

Transfer learning is the practice of taking a neural network trained on a large dataset for one task and reusing that pre-trained model for another task with somewhat similar data, instead of training a model from scratch.

It generally refers to transferring knowledge from one model to another model with somewhat similar requirements, so that we can reuse the first model's weights. This approach is useful when we don't have enough data or compute resources to train a model from scratch.

Types of transfer learning:

  • Convnet as fixed feature extractor: We take a pre-trained network like VGG, replace its last fully connected layer with a new layer sized for our task, and train only that layer while keeping all previous layers frozen. We reuse almost the whole architecture of VGG except the last fully connected layer. (A minimal sketch contrasting this with finetuning follows this list.)
  • Finetuning the convnet: We again replace the last fully connected layer of a pre-trained network like VGG, but then train the whole network for a few epochs at a low learning rate to finetune it for the new task. The conv layers are trained here as well. We can keep a few of the earlier conv layers fixed and train only the last few, or train all of them; the earlier conv layers capture basic shapes like lines and circles, which transfer well. Here too we reuse almost the whole architecture of VGG-like conv models.
  • Pretrained weights only: We take the conv layer weights released from training on a large dataset like ImageNet and load them into matching layers of our own conv model. Here we don't reuse the architecture; we only copy the conv layer weights of VGG-like models into our model's layers.
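
To make the difference between the first two approaches concrete, here is a minimal PyTorch sketch (assuming a hypothetical 120-class target task; the layer index used is that of torchvision's VGG16):

In [ ]:
import torch.nn as nn
from torchvision import models

## 1. Convnet as fixed feature extractor: freeze every pretrained weight
##    and train only the newly attached head.
extractor = models.vgg16(pretrained=True)
for param in extractor.parameters():
    param.requires_grad = False                 ## all pretrained layers frozen
extractor.classifier[6] = nn.Linear(4096, 120)  ## new layer is trainable by default

## 2. Finetuning: replace the head but keep the whole network trainable,
##    then train everything with a small learning rate, e.g.
##    torch.optim.SGD(finetuned.parameters(), lr=1e-4).
finetuned = models.vgg16(pretrained=True)
finetuned.classifier[6] = nn.Linear(4096, 120)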

How do we decide when to do transfer learning?

Several factors decide whether to apply transfer learning and at what level:

  • New dataset is similar to the original but small - use the convnet as a fixed feature extractor.
  • New dataset is similar to the original and large - finetune the convnet. Train the whole network a little at a low learning rate.
  • New dataset is different from the original and small - use the convnet as a fixed feature extractor.
  • New dataset is different from the original and large - design your own conv net and train it, but it's better to initialize its conv layer weights with weights from an existing well-performing model (a minimal sketch of this weight copying follows this list).
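
For the last case, a hedged sketch of copying pretrained conv weights into your own architecture: the custom SmallNet below is hypothetical, and only its first two conv layers are shaped to match VGG16's features[0] and features[2], so only those weights can be copied over.

In [ ]:
import torch
import torch.nn as nn
from torchvision import models

class SmallNet(nn.Module):
    """Hypothetical custom model whose first two conv layers match VGG16's."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)   ## same shape as vgg.features[0]
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)  ## same shape as vgg.features[2]
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)

vgg16 = models.vgg16(pretrained=True)
net = SmallNet()
with torch.no_grad():  ## copy weights without recording gradient history
    net.conv1.weight.copy_(vgg16.features[0].weight)
    net.conv1.bias.copy_(vgg16.features[0].bias)
    net.conv2.weight.copy_(vgg16.features[2].weight)
    net.conv2.bias.copy_(vgg16.features[2].bias)
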
In [ ]:
#!pip install --upgrade pip
#!pip install torch torchvision

Importing all necessary libraries

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms, models

import matplotlib.pyplot as plt
import numpy as np

import os
import shutil
import glob

print(torch.__version__)
device  = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Cuda available : '+ str((device == 'cuda')))
1.0.0
Cuda available : True

Downloading Dataset and Extracting

We'll be using the Dog Breeds image dataset provided by Stanford. It has 20k+ images across 120 dog breed categories.

We'll download it and extract it in the current directory.

In [2]:
%%time
!wget http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar
!tar -xf images.tar
#!cp -r Images dogs
!rm images.tar
%ls
--2019-03-03 10:39:04--  http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar
Resolving vision.stanford.edu (vision.stanford.edu)... 171.64.68.10
Connecting to vision.stanford.edu (vision.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 793579520 (757M) [application/x-tar]
Saving to: ‘images.tar’

images.tar          100%[===================>] 756.82M  60.6MB/s    in 13s

2019-03-03 10:39:17 (58.8 MB/s) - ‘images.tar’ saved [793579520/793579520]

Images/  __notebook_source__.ipynb
CPU times: user 412 ms, sys: 168 ms, total: 580 ms
Wall time: 21.1 s

Convnet as fixed feature extractor example

We'll first try the convnet as a fixed feature extractor. We'll modify the last Linear layer to output 120 class scores per image and set all layers of the network as non-trainable except that last layer. We'll then train the network for a few epochs, adjusting the learning rate along the way, to make it predict the correct outputs for our task.

Creating subdirectories and copying data

The function below creates a new destination directory and then creates train, val and test subdirectories inside it.

It then copies 80% of each category's images into the train subfolder, 10% into the val subfolder and 10% into the test subfolder.

Note: the train, val and test subfolders have the same structure (one subfolder per dog breed) as the original source folder, except that the number of images inside each breed subfolder follows the proportions mentioned in the previous sentence.

In [3]:
%%time

def create_ml_file_structure_and_copy_files(src, dest):
    os.makedirs(os.path.join(dest,'train'), exist_ok=True)
    os.makedirs(os.path.join(dest,'val'), exist_ok=True)
    os.makedirs(os.path.join(dest,'test'), exist_ok=True)

    for directory in os.listdir(src):
        os.makedirs(os.path.join(dest,'train',directory), exist_ok=True)
        os.makedirs(os.path.join(dest,'val',directory), exist_ok=True)
        os.makedirs(os.path.join(dest,'test',directory), exist_ok=True)
        init_path = os.path.join(src, directory)
        all_files = os.listdir(init_path)
        n = len(all_files)
        for file in all_files[:int(0.8*n)]:
            shutil.copy(os.path.join(src,directory,file),os.path.join(dest,'train',directory))
        for file in all_files[int(0.8*n):int(0.9*n)]:
            shutil.copy(os.path.join(src,directory,file),os.path.join(dest,'val',directory))
        for file in all_files[int(0.9*n):]:
            shutil.copy(os.path.join(src,directory,file),os.path.join(dest,'test',directory))

create_ml_file_structure_and_copy_files('Images','dogs')
CPU times: user 1.34 s, sys: 2.15 s, total: 3.49 s
Wall time: 8.74 s

Below we print image counts for the different folders to verify that the original image count matches the combined image count of the train, val and test subfolders.

We also verify that the whole folder structure was created properly.

In [4]:
print('List of subdirs in Images folder : %d'%len(os.listdir('Images')))
print('List of subdirs in dogs/train folder : %d'%len(os.listdir('dogs/train')))
print('List of subdirs in dogs/val folder : %d'%len(os.listdir('dogs/val')))
print('List of subdirs in dogs/test folder : %d'%len(os.listdir('dogs/test')))
print('List of JPGs in original Images directory : %d'%len(glob.glob('Images/*/*.jpg')))
print('List of JPGs in dogs sub directories : %d'%len(glob.glob('dogs/*/*/*.jpg')))
List of subdirs in Images folder : 120
List of subdirs in dogs/train folder : 120
List of subdirs in dogs/val folder : 120
List of subdirs in dogs/test folder : 120
List of JPGs in original Images directory : 20580
List of JPGs in dogs sub directories : 20580

Initializing DataLoaders

PyTorch provides a separate package named torchvision for image datasets, pretrained models and batch image manipulation.

torchvision provides default data loaders for loading images from folders, and transforms that apply various operations to images such as resizing, cropping and converting to tensors.

We'll load images from folders using the ImageFolder dataset creator and apply 4 transformations to all images: resize to 256x256, crop the center 224x224 region, convert to a PyTorch tensor, and normalize (subtract the per-channel mean and divide by the standard deviation).

We need to crop images to 224x224 and normalize them with these particular statistics because the VGG network we'll be using was trained on ImageNet with exactly this preprocessing, and it performs well only under the same settings.

Also note that we shuffle the training and validation images but not the test images. num_workers refers to the number of parallel worker processes used for data loading.

In [5]:
data_transform = transforms.Compose([
                                        transforms.Resize(256),
                                        transforms.CenterCrop(224),
                                        transforms.ToTensor(),
                                        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
                                    ])
dsets = {}
dsets['train'] = datasets.ImageFolder('dogs/train', transform=data_transform)
dsets['val'] = datasets.ImageFolder('dogs/val', transform=data_transform)
dsets['test'] = datasets.ImageFolder('dogs/test', transform=data_transform)
loaders = {}
loaders['train'] = torch.utils.data.DataLoader(dsets['train'], batch_size=8, shuffle=True,num_workers=4)
loaders['val'] = torch.utils.data.DataLoader(dsets['val'], batch_size=8, shuffle=True,num_workers=4)
loaders['test'] = torch.utils.data.DataLoader(dsets['test'], batch_size=1, shuffle=False,num_workers=4)

Below we define a few mappings we'll need later. The model will give us the index of the most probable of the 120 dog breeds, and we need to translate that index back to a breed name, for which we build a dictionary.

When we create datasets from image folders, ImageFolder provides a dictionary mapping dog breed names to indexes, which we invert to get an index-to-breed-name dictionary. We also store the breed names in the dog_breeds variable.

In [6]:
dog_breeds = dsets['train'].classes
dog_breeds_to_idx = dsets['train'].class_to_idx
idx_to_dog_breed = {idx: breed for breed, idx in dog_breeds_to_idx.items()}

Below we loop through the train data loader to verify the sizes of the image tensors.

They should have the format batch_size x channels x height x width, which is the layout PyTorch models expect for image tensors.

In [7]:
for i,(images,labels) in enumerate(loaders['train']):
    if i == 3:
        break
    print(images.size(),labels.size())
torch.Size([8, 3, 224, 224]) torch.Size([8])
torch.Size([8, 3, 224, 224]) torch.Size([8])
torch.Size([8, 3, 224, 224]) torch.Size([8])

Visualizing Train Images

Below we visualize the first batch of 8 images from the training dataset to get an idea of the data.

In [8]:
images,labels = next(iter(loaders['train']))
inp = torchvision.utils.make_grid(images)
print('Type of Image : '+ str(type(inp)))
inp = inp.numpy().transpose(1,2,0)
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
image = std * inp + mean
image = image.clip(0, 1) ## Un-normalize and clip to the valid [0, 1] range expected by imshow.
plt.figure(figsize=(25,5))
plt.imshow(image)
plt.title(str([idx_to_dog_breed[label].split('-')[1] for label in labels.numpy()]))
plt.xticks([])
plt.yticks([])
None
Type of Image : <class 'torch.Tensor'>

Loading Model with Pretrained Weights

torchvision also provides a collection of image classification models with pretrained weights.

We'll be using the VGG neural network, which was the first runner-up at ILSVRC (ImageNet Large Scale Visual Recognition Competition) 2014. It was designed by the Visual Geometry Group of Oxford University.

VGG has a quite simple structure compared to other competition-winning networks while still performing very well. It popularized the use of small 3x3 convolutions, and many later models build on that idea, which makes it worth exploring.

In [9]:
vgg = models.vgg16(pretrained=True)
for param in vgg.parameters():
    param.requires_grad = False ## Freeze all pretrained layers.
vgg.classifier[6] = nn.Linear(4096, len(dog_breeds)) ## Replacement layer; its parameters are trainable by default.
vgg = vgg.to(device)
vgg
Downloading: "https://download.pytorch.org/models/vgg16-397923af.pth" to /tmp/.torch/models/vgg16-397923af.pth
100%|██████████| 553433881/553433881 [00:29<00:00, 19036114.42it/s]
Out[9]:
VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): ReLU(inplace)
    (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU(inplace)
    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU(inplace)
    (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): ReLU(inplace)
    (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): ReLU(inplace)
    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU(inplace)
    (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096, bias=True)
    (1): ReLU(inplace)
    (2): Dropout(p=0.5)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU(inplace)
    (5): Dropout(p=0.5)
    (6): Linear(in_features=4096, out_features=120, bias=True)
  )
)

Initializing Loss Function and Optimization Function

Below we define the CrossEntropyLoss function we'll be optimizing, and a Stochastic Gradient Descent optimizer initialized with the parameters of the VGG network and a learning rate of 0.001. Although we pass all parameters to the optimizer, only the unfrozen last layer receives gradients, so it is the only part that gets updated.

In [10]:
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(params=vgg.parameters(), lr = 0.001)

Model Training

Below we define the function we'll be using for training; it takes the number of epochs to run. One epoch is one pass through the whole training data to train the network followed by one pass through the validation data to check accuracy.

In [11]:
def train(epochs):
    for epoch in range(epochs):
        for phase in ['train', 'val']:
            if phase == 'train':
                vgg.train() ## Train mode activates layers like Dropout and BatchNorm.
            else:
                vgg.eval() ## Eval mode de-activates layers like Dropout and BatchNorm.

            total_loss = 0.0
            correct_preds = 0

            for i, (images, labels) in enumerate(loaders[phase]):
                images, labels = images.to(device), labels.to(device) ## Move tensors to the GPU if one is available.
                optimizer.zero_grad() ## Reset the parameter gradients at the start of each batch.
                with torch.set_grad_enabled(phase == 'train'): ## Track gradients only during the train phase.
                    results = vgg(images) ## Forward pass through the batch of images.
                    _, predictions = torch.max(results, 1) ## Indexes of the highest class scores for each image.
                    loss = loss_function(results, labels) ## Loss is computed on the raw class scores, not the argmax indexes.

                    if phase == 'train':
                        loss.backward() ## Backpropagation: calculates gradients for each weight parameter.
                        optimizer.step() ## Updates weights based on those gradients and the learning rate set above.

                total_loss += loss.item()
                correct_preds += torch.sum(predictions == labels) ## Count correct predictions.

            print('Epoch : %d'%(epoch+1))
            print('Stage : %s'%phase)
            print('Loss : %f'%total_loss)
            print('Accuracy : %f'% (correct_preds.item() / len(dsets[phase])))
            print('-'*100)

Below we train the model for 3 epochs and check the accuracy on the training and validation data.

In [12]:
%time train(3)
Epoch : 1
Stage : train
Loss : 4201.561909
Accuracy : 0.596845
----------------------------------------------------------------------------------------------------
Epoch : 1
Stage : val
Loss : 244.480238
Accuracy : 0.832359
----------------------------------------------------------------------------------------------------
Epoch : 2
Stage : train
Loss : 1820.296348
Accuracy : 0.796443
----------------------------------------------------------------------------------------------------
Epoch : 2
Stage : val
Loss : 174.814241
Accuracy : 0.848441
----------------------------------------------------------------------------------------------------
Epoch : 3
Stage : train
Loss : 1452.731233
Accuracy : 0.822025
----------------------------------------------------------------------------------------------------
Epoch : 3
Stage : val
Loss : 152.199616
Accuracy : 0.855750
----------------------------------------------------------------------------------------------------
CPU times: user 11min 32s, sys: 10min 19s, total: 21min 51s
Wall time: 22min

We run training for 2 more epochs to check whether accuracy still improves.

In [13]:
%time train(2)
Epoch : 1
Stage : train
Loss : 1299.218669
Accuracy : 0.827202
----------------------------------------------------------------------------------------------------
Epoch : 1
Stage : val
Loss : 140.284341
Accuracy : 0.852827
----------------------------------------------------------------------------------------------------
Epoch : 2
Stage : train
Loss : 1198.354018
Accuracy : 0.836582
----------------------------------------------------------------------------------------------------
Epoch : 2
Stage : val
Loss : 132.881917
Accuracy : 0.859649
----------------------------------------------------------------------------------------------------
CPU times: user 7min 39s, sys: 6min 56s, total: 14min 35s
Wall time: 14min 41s

We train for 2 more epochs with the same learning rate to check whether accuracy is still improving.

In [14]:
%time train(2)
Epoch : 1
Stage : train
Loss : 1123.551571
Accuracy : 0.845231
----------------------------------------------------------------------------------------------------
Epoch : 1
Stage : val
Loss : 126.995190
Accuracy : 0.863548
----------------------------------------------------------------------------------------------------
Epoch : 2
Stage : train
Loss : 1080.140666
Accuracy : 0.848033
----------------------------------------------------------------------------------------------------
Epoch : 2
Stage : val
Loss : 124.079382
Accuracy : 0.861598
----------------------------------------------------------------------------------------------------
CPU times: user 7min 41s, sys: 6min 55s, total: 14min 36s
Wall time: 14min 43s

We noticed above that accuracy stopped improving, so we reduce the learning rate and then train for 2 more epochs to check whether that helps.
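
As a side note, instead of changing the learning rate by hand, torch.optim.lr_scheduler can lower it on a schedule; a minimal sketch (the step_size and gamma values here are just illustrative):

In [ ]:
from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=7, gamma=0.1)  ## multiply lr by 0.1 every 7 epochs
## then call scheduler.step() once per epoch inside the training loop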

In [15]:
for param_group in optimizer.param_groups: ## Setting optimizer.lr directly has no effect; the lr lives in param_groups.
    param_group['lr'] = 0.0001
%time train(2)
Epoch : 1
Stage : train
Loss : 1026.826890
Accuracy : 0.855768
----------------------------------------------------------------------------------------------------
Epoch : 1
Stage : val
Loss : 119.996681
Accuracy : 0.863548
----------------------------------------------------------------------------------------------------
Epoch : 2
Stage : train
Loss : 993.452745
Accuracy : 0.857230
----------------------------------------------------------------------------------------------------
Epoch : 2
Stage : val
Loss : 119.978386
Accuracy : 0.864522
----------------------------------------------------------------------------------------------------
CPU times: user 7min 38s, sys: 6min 48s, total: 14min 27s
Wall time: 14min 39s

Model Testing

We now have a validation accuracy of around 86%. We'll now evaluate the model on the test data we kept aside for the final round of testing.

In [16]:
def test():
    vgg.eval() ## Make sure Dropout stays de-activated during testing.
    with torch.no_grad(): ## Disable gradient tracking, since we don't need gradients during testing.
        correct = 0
        for images, labels in loaders['test']:
            images, labels = images.to(device), labels.to(device)

            predictions = vgg(images)
            _, preds = torch.max(predictions, 1)
            correct += torch.sum(preds == labels)
        print('Test Set Accuracy : %f'%(correct.item() / len(dsets['test'])))

%time test()
Test Set Accuracy : 0.863981
CPU times: user 31.4 s, sys: 23.5 s, total: 55 s
Wall time: 56.5 s

Visualizing Predictions on Test Data

Below we define a function that visualizes the first 40 test images along with their actual and predicted breeds.

In [17]:
def visualizing_predictions_on_test_data():
    plt.figure(figsize=(22,28))
    with torch.no_grad():
        for i, (image,label) in enumerate(loaders['test']):
            if i == 40:
                break
            plt.subplot(8,5,i+1)
            image,label = image.to(device), label.to(device)

            prediction = vgg(image)
            _, pred = torch.max(prediction,1)
            img = image.to('cpu').numpy()[0].transpose(1,2,0)
            mean = np.array([0.485, 0.456, 0.406])
            std = np.array([0.229, 0.224, 0.225])
            img = std * img + mean
            plt.imshow(img.clip(0.0,1.0))
            plt.title('Actual : %s,\nPredicted : %s'%(idx_to_dog_breed[int(label.item())].split('-')[1], idx_to_dog_breed[int(pred.item())].split('-')[1]))
            plt.xticks([])
            plt.yticks([])

visualizing_predictions_on_test_data()