Updated On : Jan-30,2022 Time Investment : ~30 mins

Simple Guide to Learning Rate Schedules for Keras Networks

When training Keras networks in Python with optimizers like stochastic gradient descent (SGD), the learning rate stays constant throughout the training process by default. This works in many scenarios. But as we get closer to an optimum, reducing the learning rate over time can help the optimizer settle and can improve the performance of the model. There are various ways to reduce the learning rate during training. This is commonly referred to as learning rate scheduling or learning rate annealing. Keras provides several learning rate schedulers that we can use to anneal the learning rate over time.

As a part of this tutorial, we'll discuss the various learning rate schedulers available from keras, and we'll explain how to implement a custom scheduler if the existing schedulers do not satisfy your requirements. We have used the Fashion MNIST dataset for our tutorial and have trained a simple convolutional neural network on it to explain various schedulers.

Below, we have loaded keras and printed the version of it that we'll use in our tutorial.

import tensorflow as tf
from tensorflow import keras

print("Keras Version : {}".format(keras.__version__))
Keras Version : 2.6.0

Load Fashion MNIST Dataset

In this section, we have loaded the Fashion MNIST dataset available from the keras datasets module. The dataset has grayscale images of size (28,28) pixels for 10 different fashion items. The dataset is already divided into train (60k images) and test (10k images) sets. The below table shows the mapping from index to fashion item names.

Label Description
0 T-shirt/top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot
from tensorflow.keras import datasets
import numpy as np
from tensorflow.keras.utils import to_categorical

(X_train, Y_train), (X_test, Y_test) = datasets.fashion_mnist.load_data()

X_train,X_test = X_train.reshape(-1,28,28,1), X_test.reshape(-1,28,28,1)

classes =  np.unique(Y_train)  ## Collect class labels before one-hot encoding

Y_train, Y_test = to_categorical(Y_train), to_categorical(Y_test)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
40960/29515 [=========================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
26435584/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
16384/5148 [===============================================================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step
4431872/4422102 [==============================] - 0s 0us/step
((60000, 28, 28, 1), (10000, 28, 28, 1), (60000, 10), (10000, 10))
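
The label table above can also be kept as a small lookup in code. Below is a minimal sketch in plain Python, assuming labels are one-hot encoded as in the loading cell. The names `LABEL_NAMES` and `label_name` are hypothetical helpers introduced only for illustration.

```python
# Fashion MNIST label names, in index order (taken from the table above).
LABEL_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def label_name(row):
    """Return the item name for a one-hot (or probability) row of length 10."""
    # Index of the largest entry, equivalent to an argmax.
    index = max(range(len(row)), key=lambda i: row[i])
    return LABEL_NAMES[index]

print(label_name([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]))
```

The same helper should also work on a row of predicted probabilities, since it only looks for the largest entry.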

Define CNN Model

In this section, we have defined the CNN that we'll be using for our classification task when explaining various learning rate schedules. We have created a simple function that creates a new neural network and returns it each time it is called. The neural network has a simple architecture of 2 convolution layers followed by one dense layer. The convolution layers have 32 and 16 filters respectively. Both convolution layers apply kernels of shape (3,3) on their input. We have applied relu (rectified linear unit) activation after each convolution layer. Then, we have flattened the output of the second convolution layer and fed it into a dense layer that has 10 output units (same as the number of classes). The output of the dense layer is converted to probabilities by applying the softmax activation function.

from tensorflow.keras import Sequential
from tensorflow.keras import layers

def create_model():
    return Sequential([
                    layers.Conv2D(filters=32, kernel_size=(3,3), padding="same", activation="relu", input_shape=(28,28,1)),
                    layers.Conv2D(filters=16, kernel_size=(3,3), padding="same", activation="relu"),

                    layers.Flatten(),
                    layers.Dense(10, activation="softmax")
                    ])

model = create_model()
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 28, 28, 32)        320
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 16)        4624
_________________________________________________________________
flatten (Flatten)            (None, 12544)             0
_________________________________________________________________
dense (Dense)                (None, 10)                125450
=================================================================
Total params: 130,394
Trainable params: 130,394
Non-trainable params: 0
_________________________________________________________________
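
As a quick sanity check, the parameter counts in the summary above can be reproduced by hand. The sketch below uses the usual counting rule (weights plus one bias per filter or unit). The helper names `conv2d_params` and `dense_params` are hypothetical and are not part of keras.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # Each filter has kernel_h * kernel_w * in_channels weights plus one bias.
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_features, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return (in_features + 1) * units

conv1 = conv2d_params(3, 3, 1, 32)
conv2 = conv2d_params(3, 3, 32, 16)
flat = 28 * 28 * 16          # padding="same" keeps the 28x28 spatial size
dense = dense_params(flat, 10)

print(conv1, conv2, flat, dense, conv1 + conv2 + dense)
# 320 4624 12544 125450 130394
```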

1. Constant Learning Rate With SGD

In this section, we have trained our CNN using SGD which has a constant learning rate. We have set the learning rate to a constant value of 0.001. We have trained the network for only 5 epochs.

from tensorflow.keras.optimizers import SGD

grad_descent = keras.optimizers.SGD(learning_rate=0.001)

model.compile(optimizer=grad_descent, loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(x=X_train, y=Y_train, batch_size=64, epochs=5, validation_data=(X_test,Y_test))
2022-02-01 11:36:12.071172: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/5
938/938 [==============================] - 20s 20ms/step - loss: 0.7852 - accuracy: 0.8004 - val_loss: 0.4542 - val_accuracy: 0.8410
Epoch 2/5
938/938 [==============================] - 20s 21ms/step - loss: 0.3958 - accuracy: 0.8616 - val_loss: 0.4336 - val_accuracy: 0.8423
Epoch 3/5
938/938 [==============================] - 20s 21ms/step - loss: 0.3506 - accuracy: 0.8763 - val_loss: 0.3900 - val_accuracy: 0.8636
Epoch 4/5
938/938 [==============================] - 19s 20ms/step - loss: 0.3242 - accuracy: 0.8848 - val_loss: 0.3697 - val_accuracy: 0.8727
Epoch 5/5
938/938 [==============================] - 20s 21ms/step - loss: 0.3064 - accuracy: 0.8905 - val_loss: 0.3617 - val_accuracy: 0.8709
<keras.callbacks.History at 0x7f13c1d77790>

2. SGD With Decay

In this section, we are training the neural network again with SGD, but this time we have provided a decay rate as well. It'll decay the learning rate after every step (batch) by following the below formula.

learning_rate = learning_rate / (1. + decay * local_step)
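
To get a feel for the formula, here is a minimal sketch in plain Python that evaluates it at the end of each epoch. It assumes 938 steps per epoch (60000 images in batches of 64, as the training logs report) and the decay value of lr/epochs used in this section's code. The function name `time_based_decay` is a hypothetical helper, not a keras API.

```python
def time_based_decay(initial_lr, decay, step):
    # Same form as the formula above; `step` counts optimizer updates (batches).
    return initial_lr / (1.0 + decay * step)

initial_lr = 0.001
decay = initial_lr / 5       # 0.0002, i.e. lr/epochs as in this section
steps_per_epoch = 938        # assumption: ceil(60000 / 64)

for epoch in range(1, 6):
    step = epoch * steps_per_epoch
    print(epoch, round(time_based_decay(initial_lr, decay, step), 6))
```

Under these assumptions the learning rate falls to roughly half of its initial value by the end of the fifth epoch, because the decay is applied per step and not per epoch.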

In the next cell after training, we have also printed the code that has logic to handle decay in the keras codebase. We have used the python inspect module for retrieving code.

model = create_model()  ## Create Model

epochs = 5
lr = 0.001

grad_descent = keras.optimizers.SGD(learning_rate=lr, decay=lr/epochs)

model.compile(optimizer=grad_descent, loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(x=X_train, y=Y_train, batch_size=64, epochs=epochs, validation_data=(X_test,Y_test))
Epoch 1/5
938/938 [==============================] - 20s 20ms/step - loss: 0.7287 - accuracy: 0.7914 - val_loss: 0.4900 - val_accuracy: 0.8247
Epoch 2/5
938/938 [==============================] - 20s 21ms/step - loss: 0.4139 - accuracy: 0.8533 - val_loss: 0.4528 - val_accuracy: 0.8384
Epoch 3/5
938/938 [==============================] - 20s 21ms/step - loss: 0.3738 - accuracy: 0.8692 - val_loss: 0.4038 - val_accuracy: 0.8610
Epoch 4/5
938/938 [==============================] - 19s 21ms/step - loss: 0.3488 - accuracy: 0.8770 - val_loss: 0.3811 - val_accuracy: 0.8707
Epoch 5/5
938/938 [==============================] - 20s 21ms/step - loss: 0.3304 - accuracy: 0.8836 - val_loss: 0.3763 - val_accuracy: 0.8704
<keras.callbacks.History at 0x7f13c1daec10>
import inspect

print("====== SGD Source ====================================")
print(inspect.getsource(keras.optimizers.SGD)[:510])
print()
print("====== Decay Learning Rate Method ====================")
print(inspect.getsource(keras.optimizers.Optimizer._decayed_lr))
====== SGD Source ====================================
class SGD(optimizer_v2.OptimizerV2):
  r"""Gradient descent (with momentum) optimizer.

  Update rule for parameter `w` with gradient `g` when `momentum` is 0:

  ```python
  w = w - learning_rate * g
  ```

  Update rule when `momentum` is larger than 0:

  ```python
  velocity = momentum * velocity - learning_rate * g
  w = w + velocity
  ```

  When `nesterov=True`, this rule becomes:

  ```python
  velocity = momentum * velocity - learning_rate * g
  w = w + momentum * velocity - learning_rate * g
  `

====== Decay Learning Rate Method ====================
  def _decayed_lr(self, var_dtype):
    """Get decayed learning rate as a Tensor with dtype=var_dtype."""
    lr_t = self._get_hyper("learning_rate", var_dtype)
    if isinstance(lr_t, learning_rate_schedule.LearningRateSchedule):
      local_step = tf.cast(self.iterations, var_dtype)
      lr_t = tf.cast(lr_t(local_step), var_dtype)
    if self._initial_decay > 0.:
      local_step = tf.cast(self.iterations, var_dtype)
      decay_t = tf.cast(self._initial_decay, var_dtype)
      lr_t = lr_t / (1. + decay_t * local_step)
    return lr_t

3. Exponential Decay

In this section, we have trained our network using SGD with exponential decay. We can create an instance of exponential decay using ExponentialDecay constructor available from keras.optimizers.schedules module. It has the below-mentioned important parameters.

  • initial_learning_rate - This parameter accepts the initial learning rate of the optimizer.
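  • decay_steps - Number of steps over which the learning rate is multiplied by decay_rate once. Here, one step refers to the execution of one batch of data.
  • decay_rate - The float value by which the learning rate is multiplied every decay_steps steps.
  • staircase - This parameter accepts a boolean value. If set to True, step / decay_steps is rounded down to an integer, so the learning rate drops at discrete intervals (like a staircase) instead of decaying continuously.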

The decayed learning rate is calculated using the below formula.

learning_rate = initial_learning_rate * decay_rate ^ (step / decay_steps)

Below, we have created an exponential decay scheduler with an initial learning rate of 0.001, 500 decay steps, and a decay rate of 0.98. As staircase is set to True, this scheduler will decay the learning rate after every 500 steps/batches. We have provided the scheduler to SGD through its learning_rate parameter.

In the next cell after the training cell, we have also retrieved the learning rate for 5000 steps and plotted them to give an idea of how the learning rate will change during our training process. In our case, the dataset has 60k training images and we have used 64 samples per batch, which brings the number of steps per epoch to 938 (~1000). As we are training for 5 epochs, the total steps will be 4690 (~5000).
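
The formula and the step arithmetic above can be sketched in plain Python without keras. The function `exponential_decay` below is a hypothetical helper that mirrors the formula. The staircase handling (rounding step / decay_steps down) is our reading of the parameter description, not code taken from the library.

```python
import math

def exponential_decay(step, initial_lr=0.001, decay_steps=500,
                      decay_rate=0.98, staircase=True):
    exponent = step / decay_steps
    if staircase:
        # Staircase mode keeps the rate flat within each block of decay_steps.
        exponent = math.floor(exponent)
    return initial_lr * decay_rate ** exponent

steps_per_epoch = math.ceil(60000 / 64)   # batches per epoch
total_steps = 5 * steps_per_epoch         # batches over 5 epochs

for step in (0, 499, 500, 1000, total_steps):
    print(step, exponential_decay(step))
```

With staircase enabled, steps 0 and 499 give the same learning rate and the first drop happens at step 500.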

from tensorflow.keras.optimizers.schedules import ExponentialDecay

model = create_model()  ## Create Model

epochs = 5
lr = 0.001

lr_schedule = ExponentialDecay(lr, decay_steps=500, decay_rate=0.98, staircase=True)

grad_descent = keras.optimizers.SGD(learning_rate=lr_schedule)

model.compile(optimizer=grad_descent, loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(x=X_train, y=Y_train, batch_size=64, epochs=epochs, validation_data=(X_test,Y_test))
Epoch 1/5
938/938 [==============================] - 19s 20ms/step - loss: 1.0060 - accuracy: 0.7919 - val_loss: 0.9842 - val_accuracy: 0.7015
Epoch 2/5
938/938 [==============================] - 20s 22ms/step - loss: 0.4625 - accuracy: 0.8374 - val_loss: 0.4404 - val_accuracy: 0.8420
Epoch 3/5
938/938 [==============================] - 20s 21ms/step - loss: 0.3861 - accuracy: 0.8642 - val_loss: 0.3994 - val_accuracy: 0.8553
Epoch 4/5
938/938 [==============================] - 19s 21ms/step - loss: 0.3535 - accuracy: 0.8758 - val_loss: 0.3800 - val_accuracy: 0.8653
Epoch 5/5
938/938 [==============================] - 20s 21ms/step - loss: 0.3327 - accuracy: 0.8830 - val_loss: 0.4079 - val_accuracy: 0.8584
<keras.callbacks.History at 0x7f13b8dfb290>
import matplotlib.pyplot as plt

lrs = [lr_schedule(step) for step in range(5000)]

plt.scatter(range(5000), lrs);
plt.title("ExponentialDecay");
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

4. Piecewise Constant Decay

In this section, we are training our network using SGD with a piecewise constant decay scheduler. We can create an instance of piece-wise constant decay scheduler using PiecewiseConstantDecay() constructor available from keras.optimizers.schedules module. It has the below-mentioned parameters.

  • boundaries - List of integers specifying boundaries for which learning rate will be constant. This parameter will divide the training process based on a number of steps provided in it and will use the learning rate according to values parameter. It'll become clear when we explain it below with an example.
  • values - List of learning rate values for boundaries specified using boundaries parameter. It'll have one value more than boundaries.

In our case, we have set boundaries to [1000,2000,3000] and values to [0.003,0.002,0.001,0.0001]. We know that our training process has ~5000 steps as we explained earlier. This assigns a learning rate of 0.003 to the first 1000 steps, a learning rate of 0.002 to steps from 1000 to 2000, a learning rate of 0.001 to steps from 2000 to 3000, and a learning rate of 0.0001 to steps beyond 3000.
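
The lookup described above can be sketched in a few lines of plain Python. The function `piecewise_constant` is a hypothetical helper. We assume that a step exactly equal to a boundary still uses the earlier value, which is our understanding of how keras treats boundaries.

```python
import bisect

def piecewise_constant(step, boundaries, values):
    # There is one more value than boundaries: values[i] covers interval i.
    assert len(values) == len(boundaries) + 1
    # bisect_left counts the boundaries strictly below `step`, so a step equal
    # to a boundary still uses the earlier value (assumed to match keras).
    return values[bisect.bisect_left(boundaries, step)]

boundaries = [1000, 2000, 3000]
values = [0.003, 0.002, 0.001, 0.0001]

for step in (0, 1000, 1001, 2500, 4000):
    print(step, piecewise_constant(step, boundaries, values))
```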

Later on, in the next cell, we have also displayed a plot showing how the learning rate will change during our training of ~5000 steps.

from tensorflow.keras.optimizers.schedules import PiecewiseConstantDecay

model = create_model()  ## Create Model

epochs = 5

lr_schedule = PiecewiseConstantDecay(boundaries=[1000, 2000, 3000], values=[0.003,0.002,0.001, 0.0001])

grad_descent = keras.optimizers.SGD(learning_rate=lr_schedule)

model.compile(optimizer=grad_descent, loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(x=X_train, y=Y_train, batch_size=64, epochs=epochs, validation_data=(X_test,Y_test))
Epoch 1/5
938/938 [==============================] - 20s 21ms/step - loss: 0.8737 - accuracy: 0.8013 - val_loss: 0.4679 - val_accuracy: 0.8399
Epoch 2/5
938/938 [==============================] - 20s 21ms/step - loss: 0.3894 - accuracy: 0.8633 - val_loss: 0.4020 - val_accuracy: 0.8570
Epoch 3/5
938/938 [==============================] - 20s 21ms/step - loss: 0.3491 - accuracy: 0.8774 - val_loss: 0.3789 - val_accuracy: 0.8677
Epoch 4/5
938/938 [==============================] - 20s 22ms/step - loss: 0.3281 - accuracy: 0.8848 - val_loss: 0.3677 - val_accuracy: 0.8735
Epoch 5/5
938/938 [==============================] - 19s 20ms/step - loss: 0.3209 - accuracy: 0.8869 - val_loss: 0.3680 - val_accuracy: 0.8720
<keras.callbacks.History at 0x7f13b9536b50>
import matplotlib.pyplot as plt

lrs = [lr_schedule(step) for step in range(5000)]

plt.scatter(range(5000), lrs);
plt.title("PiecewiseConstantDecay");
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

5. Polynomial Decay

In this section, we have trained our network using SGD with polynomial decay. We can create an instance of polynomial decay using PolynomialDecay() constructor available from keras.optimizers.schedules module. It has the below-mentioned parameters.

  • initial_learning_rate - This is the initial learning rate of the training.
  • decay_steps - Total number of steps for which to decay learning rate.
  • end_learning_rate - Final learning rate below which learning rate should not go.
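  • power - Float exponent of the polynomial used to calculate the decayed learning rate. If we provide a value less than 1 then the curve of the learning rate will be concave, a value of 1 gives a linear decay, and a value greater than 1 gives a convex curve (see below plot).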

It uses the below formula to calculate the learning rate at any step.

def decayed_learning_rate(step):
  step = min(step, decay_steps)
  return ((initial_learning_rate - end_learning_rate) *
          (1 - step / decay_steps) ** (power)
         ) + end_learning_rate

In our case, we have used an initial learning rate of 0.003, 5000 decay steps, an end learning rate of 0.001, and a power value of 1.5.
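
Below is a minimal plain-Python sketch of the same formula, using the values passed to PolynomialDecay in this section's code (0.003, 5000, 0.001 and power 1.5) as defaults. The function name `polynomial_decay` is a hypothetical helper and not a keras API.

```python
def polynomial_decay(step, initial_lr=0.003, decay_steps=5000,
                     end_lr=0.001, power=1.5):
    # Steps beyond decay_steps are clamped, so the rate never goes below end_lr.
    step = min(step, decay_steps)
    fraction_left = 1 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr

for step in (0, 1250, 2500, 5000, 10000):
    print(step, polynomial_decay(step))
```

The learning rate starts at the initial value, reaches the end value at step 5000, and stays there for any later step.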

from tensorflow.keras.optimizers.schedules import PolynomialDecay

model = create_model()  ## Create Model

epochs = 5

lr_schedule = PolynomialDecay(initial_learning_rate=0.003, decay_steps=5000, end_learning_rate=0.001, power=1.5)

grad_descent = keras.optimizers.SGD(learning_rate=lr_schedule)

model.compile(optimizer=grad_descent, loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(x=X_train, y=Y_train, batch_size=64, epochs=epochs, validation_data=(X_test,Y_test))
Epoch 1/5
938/938 [==============================] - 20s 21ms/step - loss: 1.7322 - accuracy: 0.7739 - val_loss: 0.5223 - val_accuracy: 0.8155
Epoch 2/5
938/938 [==============================] - 20s 22ms/step - loss: 0.4352 - accuracy: 0.8478 - val_loss: 0.4764 - val_accuracy: 0.8400
Epoch 3/5
938/938 [==============================] - 19s 20ms/step - loss: 0.3893 - accuracy: 0.8629 - val_loss: 0.4126 - val_accuracy: 0.8581
Epoch 4/5
938/938 [==============================] - 20s 22ms/step - loss: 0.3647 - accuracy: 0.8703 - val_loss: 0.3938 - val_accuracy: 0.8655
Epoch 5/5
938/938 [==============================] - 19s 20ms/step - loss: 0.3459 - accuracy: 0.8782 - val_loss: 0.3922 - val_accuracy: 0.8626
<keras.callbacks.History at 0x7f13b91f0dd0>
import matplotlib.pyplot as plt

lrs = [lr_schedule(step) for step in range(5000)]

plt.scatter(range(5000), lrs);
plt.title("PolynomialDecay");
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

6. Inverse Time Decay

In this section, we are training our network using SGD with an inverse time decay scheduler. We can create an instance of inverse time decay scheduler using InverseTimeDecay() constructor available from keras.optimizers.schedules module. It has the below-mentioned important parameters.

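  • initial_learning_rate - It's a float value specifying the initial learning rate.
  • decay_steps - It's an integer specifying how often (in steps) the decay is applied.
  • decay_rate - It's a float value specifying the decay rate.
  • staircase - It's a boolean value. If set to True, step / decay_steps is rounded down to an integer, so the learning rate changes at discrete intervals instead of continuously.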

The below formula is used to calculate the learning rate at any step.

def decayed_learning_rate(step):
    return initial_learning_rate / (1 + decay_rate * step / decay_steps)

We have created an inverse time decay scheduler with an initial learning rate of 0.003, decay steps of 100, and a decay rate of 0.5. We have left staircase at its default value of False. We have also plotted how the learning rate will change during the training process in the next cell.
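
The formula above can be tried out in plain Python as well. The sketch below uses this section's values (0.003, 100 and 0.5) as defaults. The function name `inverse_time_decay` is a hypothetical helper, and the staircase branch reflects our understanding of the parameter rather than library code.

```python
import math

def inverse_time_decay(step, initial_lr=0.003, decay_steps=100,
                       decay_rate=0.5, staircase=False):
    ratio = step / decay_steps
    if staircase:
        # Apply the decay only at whole multiples of decay_steps.
        ratio = math.floor(ratio)
    return initial_lr / (1 + decay_rate * ratio)

for step in (0, 100, 200, 1000, 4690):
    print(step, inverse_time_decay(step))
```

With these values the learning rate is halved after only 200 steps, so most of the decay happens early in training.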

from tensorflow.keras.optimizers.schedules import InverseTimeDecay

model = create_model()  ## Create Model

epochs = 5

lr_schedule = InverseTimeDecay(initial_learning_rate=0.003, decay_steps=100, decay_rate=0.5)

grad_descent = keras.optimizers.SGD(learning_rate=lr_schedule)

model.compile(optimizer=grad_descent, loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(x=X_train, y=Y_train, batch_size=64, epochs=epochs, validation_data=(X_test,Y_test))
Epoch 1/5
938/938 [==============================] - 19s 20ms/step - loss: 0.7673 - accuracy: 0.8149 - val_loss: 0.4685 - val_accuracy: 0.8385
Epoch 2/5
938/938 [==============================] - 21s 22ms/step - loss: 0.4150 - accuracy: 0.8562 - val_loss: 0.4334 - val_accuracy: 0.8489
Epoch 3/5
938/938 [==============================] - 19s 20ms/step - loss: 0.3942 - accuracy: 0.8637 - val_loss: 0.4240 - val_accuracy: 0.8518
Epoch 4/5
938/938 [==============================] - 21s 22ms/step - loss: 0.3830 - accuracy: 0.8678 - val_loss: 0.4168 - val_accuracy: 0.8545
Epoch 5/5
938/938 [==============================] - 20s 21ms/step - loss: 0.3761 - accuracy: 0.8692 - val_loss: 0.4100 - val_accuracy: 0.8583
<keras.callbacks.History at 0x7f13b92bb350>
import matplotlib.pyplot as plt

lrs = [lr_schedule(step) for step in range(5000)]

plt.scatter(range(5000), lrs);
plt.title("InverseTimeDecay");
plt.xlabel("Steps")
plt.ylabel("Learning Rate");

7. Custom Learning Rate Scheduler

In this section, we have explained how we can create a learning rate scheduler of our own. In order to create a learning rate scheduler, we need to create a function that takes as input the epoch number and the current learning rate and then returns a new learning rate. Then, we need to wrap this function inside of the LearningRateScheduler callback available from keras.callbacks module. We can provide this callback to the fit() method and it'll call this function at the start of each epoch to set the learning rate for that epoch. Please make a NOTE that this changes the learning rate once per epoch and not for individual steps.
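
As an illustration of the function signature described above, here is a minimal sketch of a different schedule that halves the learning rate every 2 epochs. The function `step_decay` and its `drop` and `every` parameters are hypothetical and chosen only for this example. The loop simulates the calls the callback would make, so it runs without keras.

```python
def step_decay(epoch, current_lr, drop=0.5, every=2):
    """Halve the learning rate every `every` epochs (hypothetical schedule)."""
    # `epoch` is zero-based; leave the rate untouched on the first epoch.
    if epoch > 0 and epoch % every == 0:
        return current_lr * drop
    return current_lr

# Simulate one call per epoch, feeding each result back in as the current rate.
lr = 0.001
history = []
for epoch in range(5):
    lr = step_decay(epoch, lr)
    history.append(lr)

print(history)
```

A function like this could then be wrapped as LearningRateScheduler(step_decay) and passed to fit() through the callbacks parameter.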

We have also plotted in the next cell how the learning rate will change over time after each epoch.

def custome_lr_scheduler(epoch, current_lr):
    return current_lr / 3
from tensorflow.keras.callbacks import LearningRateScheduler

model = create_model()  ## Create Model

epochs = 5

lr_schedule = LearningRateScheduler(custome_lr_scheduler)

grad_descent = keras.optimizers.SGD(learning_rate=0.001)

model.compile(optimizer=grad_descent, loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(x=X_train, y=Y_train, batch_size=64, epochs=epochs, validation_data=(X_test,Y_test), callbacks=[lr_schedule])
Epoch 1/5
938/938 [==============================] - 20s 21ms/step - loss: 0.8253 - accuracy: 0.7764 - val_loss: 0.5426 - val_accuracy: 0.8137
Epoch 2/5
938/938 [==============================] - 20s 22ms/step - loss: 0.4630 - accuracy: 0.8425 - val_loss: 0.4866 - val_accuracy: 0.8342
Epoch 3/5
938/938 [==============================] - 20s 21ms/step - loss: 0.4391 - accuracy: 0.8509 - val_loss: 0.4743 - val_accuracy: 0.8404
Epoch 4/5
938/938 [==============================] - 20s 21ms/step - loss: 0.4322 - accuracy: 0.8525 - val_loss: 0.4722 - val_accuracy: 0.8407
Epoch 5/5
938/938 [==============================] - 19s 20ms/step - loss: 0.4300 - accuracy: 0.8535 - val_loss: 0.4708 - val_accuracy: 0.8396
<keras.callbacks.History at 0x7f13b9674150>
import matplotlib.pyplot as plt

current_lr = 0.001
lrs = []

## The callback calls the function at the start of every epoch, including the first one.
for epoch in range(5):
    current_lr = custome_lr_scheduler(epoch, current_lr)
    lrs.append(current_lr)

plt.scatter(range(5), lrs);
plt.title("Custom Learning Rate Schedule");
plt.xlabel("Epochs")
plt.ylabel("Learning Rate");

This ends our small tutorial explaining how we can use learning rate schedulers available from keras to anneal the learning rate during the training process. We also explained how we can create our own custom scheduler. Please feel free to let us know your views in the comments section.

Sunny Solanki
