When training neural networks, we generally keep the learning rate constant throughout the whole training process. However, research has shown that changing the learning rate over time can improve the performance of the network. There are various formulas for decreasing the learning rate over time, or even cycling it up and down, to squeeze a bit more accuracy out of the network. This process of adjusting the learning rate during training is generally referred to as learning rate scheduling or learning rate annealing.
As a part of this tutorial, we have explained with examples how we can perform learning rate scheduling with mxnet networks. MXNet provides many learning rate schedulers, and we'll explore several of them as a part of this tutorial. We have used the Fashion MNIST dataset for our purpose and trained a simple Convolutional Neural Network (CNN) on it. For training, we have used the SGD optimizer with various learning rate schedulers from mxnet. We have also created various visualizations showing how the learning rate changes during training to give an idea of how each scheduler works. We assume that the reader has a little background in mxnet. Please feel free to check the below links if you want to learn how to create a CNN using mxnet.
To give an overview of the material covered: we first train the CNN with a constant learning rate, then repeat the training with FactorScheduler, MultiFactorScheduler, PolyScheduler, and CosineScheduler, and finally with a custom scheduler that combines several cosine schedulers.
Below, we have imported mxnet and printed the version that we have used in our tutorial.
import mxnet
print("MXNet Version : {}".format(mxnet.__version__))
Below, we have loaded the Fashion MNIST dataset, which is available from keras. The dataset has grayscale images of shape (28, 28) pixels for 10 different fashion items. It is already divided into train (60k images) and test (10k images) sets. After loading the datasets, we have converted them from numpy arrays to mxnet NDArrays as required by mxnet networks, added a channel dimension, and scaled pixel values to the [0, 1] range. Below we have included a table that maps each label index to its class name.
Label | Description |
---|---|
0 | T-shirt/top |
1 | Trouser |
2 | Pullover |
3 | Dress |
4 | Coat |
5 | Sandal |
6 | Shirt |
7 | Sneaker |
8 | Bag |
9 | Ankle boot |
from tensorflow import keras
from sklearn.model_selection import train_test_split
from mxnet import nd
import numpy as np
(X_train, Y_train), (X_test, Y_test) = keras.datasets.fashion_mnist.load_data()
X_train, X_test, Y_train, Y_test = nd.array(X_train, dtype=np.float32),\
nd.array(X_test, dtype=np.float32),\
nd.array(Y_train, dtype=np.float32),\
nd.array(Y_test, dtype=np.float32)
X_train, X_test = X_train.reshape(-1,1,28,28), X_test.reshape(-1,1,28,28)
X_train, X_test = X_train/255.0, X_test/255.0
classes = np.unique(Y_train.asnumpy())
class_labels = ["T-shirt/top","Trouser","Pullover","Dress","Coat","Sandal","Shirt","Sneaker","Bag","Ankle boot"]
mapping = dict(zip(classes, class_labels))
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
In this section, we have defined a convolutional neural network that we'll use to classify images. The network has 2 convolution layers and one dense layer. The two convolution layers have 32 and 16 output channels respectively and both have a kernel of shape (3,3). Both convolution layers apply relu activation function to the output. The output of the second convolution layer is flattened and given to the dense layer as input. The dense layer has 10 output units (same as the target classes).
After defining the network, we have initialized it and made predictions using it for verification purposes.
from mxnet.gluon import nn
class CNN(nn.Block):
def __init__(self, **kwargs):
super(CNN, self).__init__(**kwargs)
self.conv1 = nn.Conv2D(channels=32, kernel_size=(3,3), activation="relu", padding=(1,1))
self.conv2 = nn.Conv2D(channels=16, kernel_size=(3,3), activation="relu", padding=(1,1))
self.flatten = nn.Flatten()
self.linear = nn.Dense(len(classes))
def forward(self, x):
x = self.conv1(x)
x = self.conv2(x)
x = self.flatten(x)
logits = self.linear(x)
return logits #nd.softmax(logits)
model = CNN()
model
from mxnet import initializer
model.initialize(initializer.Xavier())
preds = model(X_train[:5])
preds.shape
In this section, we have trained our network using a constant learning rate. Below, we have created a function that we'll use throughout the tutorial for training the network. The function takes the trainer object, training data (X, Y), validation data (X_val, Y_val), the number of epochs, and the batch size as input. It then runs the training loop for the given number of epochs. In each epoch, it loops through the whole training data in batches. For each batch, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates the network parameters. We accumulate the training loss for each batch and print the average training loss at the end of each epoch. We also calculate and print the validation loss at the end of each epoch.
from mxnet import autograd
from tqdm import tqdm
def TrainModelInBatches(trainer, X, Y, X_val, Y_val, epochs, batch_size=32):
for i in range(1, epochs+1):
batches = nd.arange((X.shape[0]//batch_size)+1) ### Batch Indices
losses = [] ## Record loss of each batch
for batch in tqdm(batches):
batch = batch.asscalar()
if batch != batches[-1]:
start, end = int(batch*batch_size), int(batch*batch_size+batch_size)
else:
start, end = int(batch*batch_size), None
X_batch, Y_batch = X[start:end], Y[start:end] ## Single batch of data
with autograd.record():
preds = model(X_batch) ## Forward pass to make predictions
train_loss = loss_func(preds.squeeze(), Y_batch) ## Calculate Loss
train_loss.backward() ## Calculate Gradients
train_loss = train_loss.mean().asscalar()
losses.append(train_loss)
trainer.step(len(X_batch)) ## Update weights
print("Train CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))
val_loss = loss_func(model(X_val), Y_val)
print("Valid CrossEntropyLoss : {:.3f}".format(val_loss.mean().asscalar()))
In the below cell, we are training our network using the function defined in the previous cell. We have first initialized the batch size to 256, the number of epochs to 25, and the learning rate to 0.001. Then, we have initialized the network, loss function, optimizer, and trainer object. At last, we have called our training function to perform training. We can notice from the loss values printed after each epoch that our model is doing a good job.
from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
batch_size=256
epochs=25
learning_rate = 0.001
model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
grad_descent = optimizer.SGD(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), grad_descent)
TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
Below, we have made predictions on test data using our trained model. Then, we have calculated accuracy and a classification report on test predictions.
Below we have calculated metrics using functions available from scikit-learn. Please feel free to check the below link if you want to learn about various ML metrics available from scikit-learn.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
In this section, we have trained our network using SGD with a factor learning rate scheduler. It multiplies the current learning rate by a given factor after every fixed number of steps to generate a new learning rate. We can create a factor scheduler using the FactorScheduler() constructor available from the lr_scheduler sub-module of mxnet. Below are its important parameters.

- base_lr - The initial learning rate.
- step - The number of updates after which the learning rate is reduced.
- factor - The value by which the learning rate is multiplied at each reduction.
- stop_factor_lr - The minimum learning rate; the learning rate is not reduced below this value.
- warmup_steps, warmup_begin_lr, warmup_mode - Optional warm-up settings that ramp the learning rate from warmup_begin_lr up to base_lr over the first warmup_steps updates, either linearly ("linear") or by holding warmup_begin_lr constant ("constant").
The scheduler uses the below formula to anneal the learning rate.
base_lr * pow(factor, floor(num_update/step))
In our case, we have initialized FactorScheduler with an initial learning rate of 0.001, a step interval of 1000, a factor of 0.9, and a minimum learning rate of 1e-6. Training will start with a learning rate of 0.001 and multiply it by 0.9 after every 1000 steps.
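To make the formula concrete, the short snippet below (our own addition, not part of the original training code) evaluates it for a few step counts with the settings we use; the values it prints should match what FactorScheduler returns for the same steps.

import math

base_lr, factor, step = 0.001, 0.9, 1000
for num_update in [0, 500, 1000, 2000, 5000]:
    lr = base_lr * math.pow(factor, math.floor(num_update / step))
    print(num_update, round(lr, 8))

## Prints 0.001 at steps 0 and 500, 0.0009 at step 1000, 0.00081 at step 2000,
## and 0.001 * 0.9**5 = 0.00059049 at step 5000. FactorScheduler additionally
## stops lowering the learning rate once it reaches stop_factor_lr.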
from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
from mxnet import lr_scheduler
batch_size=256
epochs=25
learning_rate = 0.001
model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
steps = (X_train.shape[0]//batch_size)*epochs + epochs
scheduler = lr_scheduler.FactorScheduler(step=1000, factor=0.9, stop_factor_lr=1e-6,
base_lr=learning_rate)
grad_descent = optimizer.SGD(lr_scheduler=scheduler)
trainer = gluon.Trainer(model.collect_params(), grad_descent)
TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
In this cell, we have evaluated the performance of the network by calculating accuracy and classification report.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
In the next few cells, we have plotted how the learning rate will change during training if we use FactorScheduler with different settings. This helps us better understand how it works internally.
import matplotlib.pyplot as plt
scheduler = lr_scheduler.FactorScheduler(step=1000, factor=0.9,
stop_factor_lr=1e-6, base_lr=learning_rate)
lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Factor Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");
import matplotlib.pyplot as plt
scheduler = lr_scheduler.FactorScheduler(step=1000, factor=0.9, stop_factor_lr=1e-6,
base_lr=learning_rate, warmup_steps=200,
warmup_begin_lr=0.0009)
lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Factor Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");
import matplotlib.pyplot as plt
scheduler = lr_scheduler.FactorScheduler(step=1000, factor=0.9, stop_factor_lr=1e-6,
base_lr=learning_rate, warmup_steps=200,
warmup_begin_lr=0.0009, warmup_mode="constant")
lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Factor Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");
In this section, we have trained our network using SGD with a multi-factor learning rate scheduler. We can create a multi-factor scheduler using the MultiFactorScheduler() constructor. Below are its important parameters.

- base_lr - The initial learning rate.
- step - A list of step numbers at which the learning rate is reduced.
- factor - The value by which the learning rate is multiplied at each of those steps.

In our case, we have initialized MultiFactorScheduler() with the step parameter set to [1000,2000,3000], the factor parameter set to 0.9, and the base learning rate set to 0.001. This will keep the learning rate at 0.001 for the first 1000 steps, multiply it by 0.9 at step 1000 (giving 0.0009 until step 2000), multiply it by 0.9 again at step 2000 (giving 0.00081 until step 3000), and multiply it by 0.9 once more at step 3000 (giving 0.000729 for all remaining steps).
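As a quick sanity check (this snippet is our own addition), we can query the scheduler at a few increasing step numbers, exactly as the plotting code later in this section does, and confirm the behaviour described above.

from mxnet import lr_scheduler

sched = lr_scheduler.MultiFactorScheduler(step=[1000, 2000, 3000], factor=0.9, base_lr=0.001)
for i in [0, 999, 1001, 2001, 3001, 5000]:
    print(i, sched(i))  ## stays at 0.001, then drops to ~0.0009, ~0.00081 and ~0.000729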
In the next cell, we have also evaluated the performance of the network by calculating accuracy and classification report metrics.
In the cell after metrics calculation, we have also plotted how the learning rate will change during training if we use a multi-factor scheduler to anneal it.
from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
from mxnet import lr_scheduler
batch_size=256
epochs=25
learning_rate = 0.001
model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
steps = (X_train.shape[0]//batch_size)*epochs + epochs
scheduler = lr_scheduler.MultiFactorScheduler(step=[1000,2000,3000], factor=0.9, base_lr=learning_rate)
grad_descent = optimizer.SGD(lr_scheduler=scheduler)
trainer = gluon.Trainer(model.collect_params(), grad_descent)
TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
import matplotlib.pyplot as plt
scheduler = lr_scheduler.MultiFactorScheduler(step=[1000,2000,3000], factor=0.9, base_lr=learning_rate)
lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Multi Factor Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");
In this section, we have trained our network using SGD with the polynomial scheduler. We can create a polynomial scheduler using the PolyScheduler() constructor available from the lr_scheduler sub-module. Below are its important parameters.

- base_lr - The initial learning rate.
- max_update - The total number of updates over which the learning rate is annealed.
- pwr - The power of the polynomial used for decay.
- final_lr - The learning rate at the end of annealing.

In our case, we have created a polynomial scheduler with max_update set to the total number of training batches, power set to 2.5, base learning rate set to 0.001, and final learning rate set to 1e-5. After training the network, we have also evaluated accuracy and the classification report on test predictions.

In the cells after the accuracy calculation, we have plotted charts showing how the learning rate changes during training if we use a polynomial scheduler with different settings. If the power is greater than 1, the decay curve is convex (it falls quickly at first and flattens out); if the power is less than 1, the curve is concave (it falls slowly at first and drops sharply near the end).
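For reference, the decay roughly follows the formula sketched below. This is our own reading of PolyScheduler's behaviour (an assumption based on the MXNet documentation, not code from the tutorial), but it makes the convex/concave distinction easy to see.

## Sketch of polynomial decay: interpolate from base_lr down to final_lr
## using (1 - t/max_update) raised to the power pwr.
def poly_lr(t, max_update, base_lr=0.001, final_lr=1e-5, pwr=2.5):
    t = min(t, max_update)
    return final_lr + (base_lr - final_lr) * (1 - t / max_update) ** pwr

## With pwr > 1 the learning rate falls quickly at first and flattens out later
## (convex curve); with pwr < 1 it falls slowly at first and drops sharply near
## the end (concave curve). Here max_update=5000 is just an arbitrary example.
for t in [0, 1000, 2500, 4000, 5000]:
    print(t, round(poly_lr(t, 5000, pwr=2.5), 7), round(poly_lr(t, 5000, pwr=0.5), 7))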
from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
from mxnet import lr_scheduler
batch_size=256
epochs=25
learning_rate = 0.001
model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
steps = (X_train.shape[0]//batch_size)*epochs + epochs
scheduler = lr_scheduler.PolyScheduler(max_update=steps, pwr=2.5, base_lr=learning_rate, final_lr=1e-5)
grad_descent = optimizer.SGD(lr_scheduler=scheduler)
trainer = gluon.Trainer(model.collect_params(), grad_descent)
TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
import matplotlib.pyplot as plt
scheduler = lr_scheduler.PolyScheduler(max_update=steps, pwr=2.5, base_lr=learning_rate, final_lr=1e-5)
lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Polynomial Scheduler (pwr=2.5)')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");
import matplotlib.pyplot as plt
scheduler = lr_scheduler.PolyScheduler(max_update=steps, pwr=0.5, base_lr=learning_rate, final_lr=1e-5)
lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Polynomial Scheduler (pwr=0.5)')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");
In this section, we have trained the network using SGD with a cosine scheduler. This scheduler anneals the learning rate following a cosine curve. We can create an instance of the cosine scheduler using the CosineScheduler() constructor available from the lr_scheduler sub-module. Below are its important parameters.

- base_lr - The initial learning rate.
- max_update - The total number of updates over which the learning rate is annealed.
- final_lr - The learning rate at the end of annealing.

In our case, we have created a cosine scheduler with an initial learning rate of 0.001, a final learning rate of 1e-5, and max_update set to the total number of training batches. After completion of training, we have evaluated accuracy and the classification report on test predictions as usual.
In the cell after accuracy calculation, we have plotted how the learning rate changes during training if we use a cosine scheduler.
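For intuition, cosine annealing roughly follows the half-cosine formula sketched below. Again, this is our own approximation of CosineScheduler's behaviour for illustration (using an arbitrary max_update of 5000), not code from the tutorial.

import math

## Sketch of cosine annealing: follow half a cosine wave from base_lr
## down to final_lr over max_update steps.
def cosine_lr(t, max_update, base_lr=0.001, final_lr=1e-5):
    t = min(t, max_update)
    return final_lr + (base_lr - final_lr) * (1 + math.cos(math.pi * t / max_update)) / 2

for t in [0, 1250, 2500, 3750, 5000]:
    print(t, round(cosine_lr(t, max_update=5000), 7))
## The decay is slow at the start, fastest in the middle, and slow again near the end.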
from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
from mxnet import lr_scheduler
batch_size=256
epochs=25
learning_rate = 0.001
model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
steps = (X_train.shape[0]//batch_size)*epochs + epochs
scheduler = lr_scheduler.CosineScheduler(max_update=steps, base_lr=learning_rate, final_lr=1e-5)
grad_descent = optimizer.SGD(lr_scheduler=scheduler)
trainer = gluon.Trainer(model.collect_params(), grad_descent)
TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
import matplotlib.pyplot as plt
scheduler = lr_scheduler.CosineScheduler(max_update=steps, base_lr=learning_rate, final_lr=1e-5)
lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Cosine Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");
In this section, we have explained how we can create a custom scheduler. A custom scheduler is simply a class that implements the following two methods.

- __init__() - This method holds the logic needed to initialize the scheduler.
- __call__() - This method takes the iteration/step number as input and returns the learning rate for that iteration/step.

In our case below, we have created a scheduler that takes two parameters as input: the initial learning rate and a boundaries parameter. The boundaries parameter accepts a list of integers specifying the step boundaries at which the learning rate schedule changes. The scheduler then creates cosine schedulers based on the length of the boundaries list. If boundaries has 3 integers, it creates 4 cosine schedulers; if it has 4 integers, it creates 5. The first cosine scheduler starts at the initial learning rate and anneals it to half of that value. The second cosine scheduler starts at the final learning rate of the previous scheduler and anneals it to half of that value. The same process continues for all subsequent schedulers, halving the learning rate each time.
The scheduler below can also be considered an example of how we can combine multiple schedulers in mxnet.

In the cells below, we have defined our custom scheduler and then inspected the cosine schedulers it creates for an example configuration.
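Before the full implementation, here is a minimal, hypothetical sketch of the same two-method interface (our own addition, not from the original tutorial). Any object like this that can be called with a step number can be passed to optimizer.SGD(lr_scheduler=...), just like the built-in schedulers.

class ExponentialDecay:
    """Minimal custom scheduler: decay the base learning rate by gamma at every step."""
    def __init__(self, base_lr=0.001, gamma=0.9995):
        self.base_lr = base_lr
        self.gamma = gamma

    def __call__(self, iteration):
        ## Return the learning rate to use for this iteration/step.
        return self.base_lr * (self.gamma ** iteration)

## Example: ExponentialDecay()(1000) gives roughly 0.001 * 0.9995**1000 ≈ 0.00061.

With that interface in mind, the full CustomScheduler used in this section is defined below.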
class CustomScheduler:
def __init__(self, base_lr=0.001, boundaries=None):
self.base_lr = base_lr
self.boundaries = boundaries
if boundaries:
self.schedulers = [lr_scheduler.CosineScheduler(max_update=self.boundaries[0], base_lr=self.base_lr, final_lr=self.base_lr/2)]
self.base_lr = self.base_lr / 2
for i in range(1, len(self.boundaries)):
k = self.boundaries[i]-self.boundaries[i-1]
scheduler = lr_scheduler.CosineScheduler(max_update=k, base_lr=self.base_lr, final_lr=self.base_lr/2)
self.schedulers.append(scheduler)
self.base_lr = self.base_lr/2
scheduler = lr_scheduler.CosineScheduler(max_update=2000, base_lr=self.base_lr, final_lr=self.base_lr/2)
self.schedulers.append(scheduler)
else:
self.schedulers = [lr_scheduler.CosineScheduler(max_update=1000, base_lr=self.base_lr, final_lr=self.base_lr/2)]
def __call__(self, iteration):
if self.boundaries:
if iteration <= self.boundaries[0]:
return self.schedulers[0](iteration)
elif iteration > self.boundaries[-1]:
return self.schedulers[-1](iteration-self.boundaries[-1])
else:
for i in range(1, len(self.boundaries)):
if iteration > self.boundaries[i-1] and iteration <= self.boundaries[i]:
return self.schedulers[i](iteration-self.boundaries[i-1])
else:
return self.schedulers[-1](iteration)
scheduler = CustomScheduler(base_lr=0.001, boundaries=[1000,2000,3000,4000])
scheduler.schedulers
for s in scheduler.schedulers:
print(s.base_lr, s.final_lr)
In the below cell, we have initialized our custom scheduler with an initial learning rate of 0.001 and boundaries set to [1000,2000,3000,4000]. This creates 5 cosine schedulers that anneal the learning rate as per the boundaries parameter.

In the cells after the metrics calculation, we have plotted charts showing how the learning rate will change during training if we use our custom scheduler, both with the boundaries above and without any boundaries.
from mxnet import gluon
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
from mxnet import lr_scheduler
batch_size=256
epochs=25
learning_rate = 0.001
model = CNN()
model.initialize()
loss_func = loss.SoftmaxCrossEntropyLoss()
steps = (X_train.shape[0]//batch_size)*epochs + epochs
scheduler = CustomScheduler(base_lr=0.001, boundaries=[1000,2000,3000,4000])
grad_descent = optimizer.SGD(lr_scheduler=scheduler)
trainer = gluon.Trainer(model.collect_params(), grad_descent)
TrainModelInBatches(trainer, X_train, Y_train, X_test, Y_test, epochs, batch_size=batch_size)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Y_test_preds = model(X_test)
Y_test_preds = Y_test_preds.argmax(axis=-1)
print("Test Accuracy : {}".format(accuracy_score(Y_test_preds.asnumpy(), Y_test.asnumpy())))
print("Classification Report : ")
print(classification_report(Y_test_preds.asnumpy(), Y_test.asnumpy(), target_names=class_labels))
import matplotlib.pyplot as plt
scheduler = CustomScheduler(base_lr=0.001, boundaries=[1000,2000,3000,4000])
lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Custom Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");
import matplotlib.pyplot as plt
scheduler = CustomScheduler(base_lr=0.003)
lrs = [scheduler(i) for i in range(steps)]
plt.scatter(range(steps), lrs)
plt.title('Custom Scheduler')
plt.xlabel("Steps")
plt.ylabel("Learning Rate");
This ends our small tutorial explaining how we can use the learning rate schedulers available from mxnet to anneal the learning rate during training. Please feel free to let us know your views in the comments section.