Updated On : Jan-27,2022 Tags keras, callbacks
Simple Guide To Keras Callbacks

Keras is one of the most commonly used Python deep learning libraries for designing neural networks. A large part of its popularity comes from its API, one of the easiest to work with, which automates many tasks that developers would otherwise have to code themselves. One such task is training the neural network. With other Python deep learning libraries (such as TensorFlow, PyTorch, MXNet, Flax (JAX), etc.), developers typically write the training code themselves. In keras, developers just call the fit() method to train the network, and it can also calculate loss and metrics on validation data. This frees developers from writing training loops, which can get messy and sometimes introduce bugs. Though this makes the developer's job much easier, there are situations where we need to perform tasks before or after epochs/steps (like logging results/metrics, saving model weights, modifying the learning rate, stopping training, etc.). With other deep learning libraries, we design the training loop ourselves, so we can add this kind of functionality directly.

But how do we perform these kinds of tasks when using keras, where a single line of code trains the neural network?

The answer to this question is keras callbacks. Keras lets us run functions at various stages of training: before training starts, after training completes, before and after each epoch, and before and after each batch. Keras provides ready-made callbacks for commonly performed tasks like logging results/metrics, changing the learning rate, saving the model/weights, etc. It also lets us create custom callbacks if none of the existing callbacks satisfy our requirements.
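To make the idea of "stages of training" concrete, below is a minimal plain-Python sketch of how a training loop could invoke callback hooks. It does not use keras at all; toy_fit() and RecordingCallback are hypothetical names invented for this illustration, and the real fit() method does considerably more.

```python
# Plain-Python sketch of callback dispatch. All names here are hypothetical;
# this is NOT how keras is implemented, only an illustration of the hook order.
class RecordingCallback:
    """Records the name of every hook that gets called."""

    def __init__(self):
        self.events = []

    def on_train_begin(self):
        self.events.append("train_begin")

    def on_epoch_begin(self, epoch):
        self.events.append(f"epoch_begin:{epoch}")

    def on_batch_begin(self, batch):
        self.events.append(f"batch_begin:{batch}")

    def on_batch_end(self, batch):
        self.events.append(f"batch_end:{batch}")

    def on_epoch_end(self, epoch):
        self.events.append(f"epoch_end:{epoch}")

    def on_train_end(self):
        self.events.append("train_end")


def toy_fit(callbacks, epochs, batches_per_epoch):
    """Toy training loop that only calls the hooks at each stage."""
    for cb in callbacks:
        cb.on_train_begin()
    for epoch in range(epochs):
        for cb in callbacks:
            cb.on_epoch_begin(epoch)
        for batch in range(batches_per_epoch):
            for cb in callbacks:
                cb.on_batch_begin(batch)
            # A real loop would run one optimization step here.
            for cb in callbacks:
                cb.on_batch_end(batch)
        for cb in callbacks:
            cb.on_epoch_end(epoch)
    for cb in callbacks:
        cb.on_train_end()


recorder = RecordingCallback()
toy_fit([recorder], epochs=1, batches_per_epoch=2)
print(recorder.events)
```

The printed list shows the hooks firing in the order described above: once around the whole run, once around each epoch, and once around each batch.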

As a part of this tutorial, we'll discuss how to use the existing keras callbacks and how to create our own custom callback when the existing ones are not enough. We have used the Fashion MNIST dataset for our tutorial and have trained a simple CNN on it to explain callbacks.

Below, we have imported keras and printed the version of it that we have used in our tutorial.

In [1]:
import tensorflow
from tensorflow import keras

print("Keras Version : {}".format(keras.__version__))
Keras Version : 2.6.0

0. Load Dataset

In this section, we have loaded the Fashion MNIST dataset, which is available as a part of the keras package. The dataset has images of 10 different fashion items. It is already divided into train (60k images) and test (10k images) sets. The images are 28 x 28 pixels, and we reshape them to (28, 28, 1) to add the channel dimension expected by convolution layers. The table below shows the mapping from label index to item name. In the cell after the next one, we have also displayed the first five training images.

Label Description
0 T-shirt/top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot
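The labels loaded below are integers, so a small lookup helps when reading predictions. The snippet below is a minimal sketch; class_names and label_to_name() are our own hypothetical names, with the strings copied from the table above.

```python
# Index-to-name mapping copied from the table above.
# class_names and label_to_name are hypothetical helper names.
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]


def label_to_name(label):
    """Translate an integer label (0-9) into its item name."""
    return class_names[int(label)]


print(label_to_name(9))
```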
In [2]:
from tensorflow.keras import datasets

(X_train, Y_train), (X_test, Y_test) = datasets.fashion_mnist.load_data()

X_train, X_test = X_train.reshape(-1,28,28,1), X_test.reshape(-1,28,28,1)

X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
40960/29515 [=========================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
26435584/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
16384/5148 [===============================================================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step
4431872/4422102 [==============================] - 0s 0us/step
Out[2]:
((60000, 28, 28, 1), (60000,), (10000, 28, 28, 1), (10000,))
In [ ]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(10,5))
plt.imshow(np.hstack(X_train[:5]), cmap="gray");

1. Decrease Learning Rate (If no performance improvements for few epochs)

In this section, we'll introduce a callback that will help us reduce the learning rate if the metric/loss that it is monitoring is not improving.

Create Neural Network

Below, we have created a simple neural network with 2 convolution layers and one dense layer. The convolution layers have 32 and 16 filters, respectively, and both apply kernels of size (3,3) to their input. We have applied relu (rectified linear unit) activation to the output of both convolution layers. We have wrapped the model definition in a function that returns a new model each time it is called. We'll be reusing this function in each of our upcoming sections.

In [4]:
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

def create_model():
    return Sequential([
                    layers.Conv2D(filters=32, kernel_size=(3,3), padding="same", activation="relu",
                                  input_shape=(28,28,1)),
                    #layers.Conv2D(filters=32, kernel_size=(3,3), padding="same", activation="relu"),
                    layers.Conv2D(filters=16, kernel_size=(3,3), padding="same", activation="relu"),

                    layers.Flatten(),
                    layers.Dense(10, activation="softmax")
                    ])

model = create_model()

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 28, 28, 32)        320
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 16)        4624
_________________________________________________________________
flatten (Flatten)            (None, 12544)             0
_________________________________________________________________
dense (Dense)                (None, 10)                125450
=================================================================
Total params: 130,394
Trainable params: 130,394
Non-trainable params: 0
_________________________________________________________________
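The parameter counts in the summary above can be checked by hand. The sketch below is plain arithmetic, not keras code; conv2d_params() and dense_params() are hypothetical helper names. Each convolution filter has one weight per kernel position per input channel plus one bias, and each dense unit has one weight per input plus one bias.

```python
# Hand-check of the "Param #" column in the model summary above.
# conv2d_params and dense_params are hypothetical helper names.
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # (weights per filter + 1 bias) * number of filters
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(inputs, units):
    # (weights per unit + 1 bias) * number of units
    return (inputs + 1) * units


conv1 = conv2d_params(3, 3, 1, 32)    # first Conv2D, 1 input channel
conv2 = conv2d_params(3, 3, 32, 16)   # second Conv2D, 32 input channels
flat = 28 * 28 * 16                   # padding="same" keeps the 28x28 size
dense = dense_params(flat, 10)        # Dense layer with 10 units
total = conv1 + conv2 + dense

print(conv1, conv2, flat, dense, total)
```

The results match the summary: 320, 4624, a flattened size of 12544, 125450, and 130,394 parameters in total.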

Compile Network

In this section, we have compiled our model. We have set SGD (stochastic gradient descent) as our optimizer with a learning rate of 0.001, 'sparse categorical crossentropy' as our loss, and accuracy as our metric.

In [5]:
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.001), loss="sparse_categorical_crossentropy", metrics=["accuracy"])
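The "sparse" in the loss name means that labels are plain integers (0-9) rather than one-hot vectors. For a single sample, the loss is the negative log of the probability the network assigns to the true class. Below is a simplified plain-Python sketch of that idea; the function name is our own, the probabilities are made up, and the keras version additionally averages over the batch and guards against log(0).

```python
import math


def sparse_categorical_crossentropy_single(probabilities, label):
    """Loss for one sample: -log(probability assigned to the true class).

    Simplified illustration only; hypothetical helper, not the keras function.
    """
    return -math.log(probabilities[label])


# Made-up softmax output over 3 classes (not from the tutorial's model).
probs = [0.7, 0.2, 0.1]

loss_if_label_0 = sparse_categorical_crossentropy_single(probs, 0)
loss_if_label_2 = sparse_categorical_crossentropy_single(probs, 2)

print(round(loss_if_label_0, 4), round(loss_if_label_2, 4))
```

The loss is small when the true class receives high probability and grows quickly when it receives low probability.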

Maximize for Validation Accuracy

In this section, we are training our network for 15 epochs. We have used a callback named ReduceLROnPlateau available from keras. We initialize callbacks and then pass them as a list to the callbacks argument of the fit(), evaluate(), and predict() methods.

The ReduceLROnPlateau callback monitors a metric and reduces the learning rate when that metric stops improving. In our case, we have asked it to monitor validation accuracy. We have set the patience parameter to 3, which tells it to reduce the learning rate if validation accuracy has not improved for 3 consecutive epochs. It multiplies the learning rate by the number given in the factor parameter, which is 0.5 in our case, so it halves the learning rate. The min_delta parameter is the minimum change that counts as an improvement; here, validation accuracy must exceed the best value seen so far by more than 0.01, otherwise the epoch counts towards patience. The mode parameter is 'auto' by default, in which case the direction (minimize or maximize) is inferred from the name of the monitored metric. The other two values are 'min' and 'max', where we explicitly specify whether the metric should be minimized or maximized. We can specify a lower bound for the learning rate using the min_lr parameter. We have also set verbose to 1, which logs a message whenever the callback changes the learning rate.

We can notice from the results below that the learning rate is reduced twice, after epochs 9 and 12.

In [6]:
from tensorflow.keras import callbacks

lr_reduce_max = callbacks.ReduceLROnPlateau(monitor="val_accuracy", factor=0.5,
                                            patience=3, verbose=1, mode="max",
                                            min_delta=0.01 ,min_lr=0.0001)

model.fit(X_train, Y_train, batch_size=256, epochs=15, validation_data=(X_test,Y_test), callbacks=[lr_reduce_max])
2022-02-02 10:58:13.715635: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/15
235/235 [==============================] - 19s 78ms/step - loss: 0.9489 - accuracy: 0.7326 - val_loss: 0.6154 - val_accuracy: 0.7944
Epoch 2/15
235/235 [==============================] - 19s 80ms/step - loss: 0.4941 - accuracy: 0.8321 - val_loss: 0.5071 - val_accuracy: 0.8295
Epoch 3/15
235/235 [==============================] - 18s 77ms/step - loss: 0.4355 - accuracy: 0.8509 - val_loss: 0.4965 - val_accuracy: 0.8287
Epoch 4/15
235/235 [==============================] - 19s 80ms/step - loss: 0.4009 - accuracy: 0.8619 - val_loss: 0.4382 - val_accuracy: 0.8484
Epoch 5/15
235/235 [==============================] - 19s 79ms/step - loss: 0.3779 - accuracy: 0.8695 - val_loss: 0.4213 - val_accuracy: 0.8558
Epoch 6/15
235/235 [==============================] - 19s 82ms/step - loss: 0.3584 - accuracy: 0.8749 - val_loss: 0.4068 - val_accuracy: 0.8622
Epoch 7/15
235/235 [==============================] - 19s 82ms/step - loss: 0.3443 - accuracy: 0.8798 - val_loss: 0.3968 - val_accuracy: 0.8624
Epoch 8/15
235/235 [==============================] - 18s 77ms/step - loss: 0.3313 - accuracy: 0.8840 - val_loss: 0.3978 - val_accuracy: 0.8597
Epoch 9/15
235/235 [==============================] - 19s 82ms/step - loss: 0.3214 - accuracy: 0.8884 - val_loss: 0.3800 - val_accuracy: 0.8690

Epoch 00009: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 10/15
235/235 [==============================] - 19s 79ms/step - loss: 0.3081 - accuracy: 0.8917 - val_loss: 0.3751 - val_accuracy: 0.8700
Epoch 11/15
235/235 [==============================] - 19s 80ms/step - loss: 0.3036 - accuracy: 0.8927 - val_loss: 0.3747 - val_accuracy: 0.8701
Epoch 12/15
235/235 [==============================] - 19s 81ms/step - loss: 0.2994 - accuracy: 0.8943 - val_loss: 0.3723 - val_accuracy: 0.8705

Epoch 00012: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 13/15
235/235 [==============================] - 19s 81ms/step - loss: 0.2938 - accuracy: 0.8963 - val_loss: 0.3669 - val_accuracy: 0.8729
Epoch 14/15
235/235 [==============================] - 19s 82ms/step - loss: 0.2917 - accuracy: 0.8968 - val_loss: 0.3662 - val_accuracy: 0.8743
Epoch 15/15
235/235 [==============================] - 18s 77ms/step - loss: 0.2899 - accuracy: 0.8975 - val_loss: 0.3665 - val_accuracy: 0.8739
Out[6]:
<keras.callbacks.History at 0x7f2f846e57d0>
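To see why the reductions land on epochs 9 and 12, we can replay the plateau rule on the val_accuracy values from the log above. The function below is our own simplified plain-Python restatement of that rule for a metric that should be maximized; simulate_reduce_on_plateau() is a hypothetical name, it ignores the cooldown option, and it is not the keras implementation.

```python
def simulate_reduce_on_plateau(values, lr, factor, patience, min_delta, min_lr):
    """Replay a simplified plateau rule for a metric to maximize.

    Returns (list of 1-based epochs where lr was reduced, final lr).
    Hypothetical helper for illustration; ignores cooldown.
    """
    best = float("-inf")
    wait = 0
    reduced_at = []
    for epoch, current in enumerate(values, start=1):
        if current > best + min_delta:
            # Improvement by more than min_delta: remember it, reset counter.
            best = current
            wait = 0
        else:
            wait += 1
            if wait >= patience and lr > min_lr:
                lr = max(lr * factor, min_lr)
                reduced_at.append(epoch)
                wait = 0
    return reduced_at, lr


# val_accuracy per epoch, copied from the training log above.
val_accuracy = [0.7944, 0.8295, 0.8287, 0.8484, 0.8558, 0.8622, 0.8624, 0.8597,
                0.8690, 0.8700, 0.8701, 0.8705, 0.8729, 0.8743, 0.8739]

reduced_at, final_lr = simulate_reduce_on_plateau(
    val_accuracy, lr=0.001, factor=0.5, patience=3, min_delta=0.01, min_lr=0.0001)

print(reduced_at, final_lr)
```

After epoch 6 the best value is 0.8622, and none of epochs 7-9 beats it by more than 0.01, so the first reduction happens after epoch 9. Epochs 10-12 also fail to beat it, which triggers the second reduction.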

Minimize for Validation Loss

Below, we have explained another example of using ReduceLROnPlateau. This time, we are monitoring validation loss for minimization.

In [7]:
lr_reduce_min = callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3,
                                            verbose=1, mode="min", min_delta=0.01 ,min_lr=0.0001)

model.fit(X_train, Y_train, batch_size=256, epochs=10, validation_data=(X_test,Y_test), callbacks=[lr_reduce_min])
Epoch 1/10
235/235 [==============================] - 18s 77ms/step - loss: 0.2881 - accuracy: 0.8975 - val_loss: 0.3630 - val_accuracy: 0.8749
Epoch 2/10
235/235 [==============================] - 20s 84ms/step - loss: 0.2860 - accuracy: 0.8991 - val_loss: 0.3621 - val_accuracy: 0.8737
Epoch 3/10
235/235 [==============================] - 18s 77ms/step - loss: 0.2842 - accuracy: 0.8993 - val_loss: 0.3616 - val_accuracy: 0.8765
Epoch 4/10
235/235 [==============================] - 20s 84ms/step - loss: 0.2824 - accuracy: 0.9007 - val_loss: 0.3604 - val_accuracy: 0.8746

Epoch 00004: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.
Epoch 5/10
235/235 [==============================] - 18s 78ms/step - loss: 0.2799 - accuracy: 0.9014 - val_loss: 0.3591 - val_accuracy: 0.8748
Epoch 6/10
235/235 [==============================] - 20s 84ms/step - loss: 0.2790 - accuracy: 0.9011 - val_loss: 0.3584 - val_accuracy: 0.8761
Epoch 7/10
235/235 [==============================] - 19s 80ms/step - loss: 0.2781 - accuracy: 0.9015 - val_loss: 0.3581 - val_accuracy: 0.8760

Epoch 00007: ReduceLROnPlateau reducing learning rate to 0.0001.
Epoch 8/10
235/235 [==============================] - 19s 82ms/step - loss: 0.2771 - accuracy: 0.9021 - val_loss: 0.3578 - val_accuracy: 0.8765
Epoch 9/10
235/235 [==============================] - 20s 85ms/step - loss: 0.2765 - accuracy: 0.9024 - val_loss: 0.3579 - val_accuracy: 0.8751
Epoch 10/10
235/235 [==============================] - 18s 77ms/step - loss: 0.2757 - accuracy: 0.9029 - val_loss: 0.3577 - val_accuracy: 0.8745
Out[7]:
<keras.callbacks.History at 0x7f2f804b0090>

2. Stop Training Early (If no performance improvements for few epochs)

In this section, we have introduced another callback that will stop training if there is no improvement in a metric that it is monitoring.

Create and Compile Network

Below, we have created a model using the function we designed earlier and compiled it.

In [8]:
from tensorflow.keras.optimizers import SGD

model = create_model()

model.compile(optimizer=SGD(learning_rate=0.001), loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Early Stop Training If Validation Accuracy Not Improving

In this section, we are training our neural network for 10 epochs with an early stop callback. We can create the callback using the EarlyStopping constructor available from keras. We need to provide the metric to monitor. We can also specify by how much the metric must change to count as an improvement (min_delta), the number of epochs without improvement to wait before stopping training (patience), and whether to maximize or minimize the metric (mode).

In our case, we are monitoring validation accuracy; if 3 consecutive epochs pass without it improving by more than 0.05 over the best value seen so far, training stops. We can notice that training stopped after 7 epochs even though we asked it to run for 10 epochs.

In [9]:
early_stop = callbacks.EarlyStopping(monitor="val_accuracy", min_delta=0.05, patience=3, verbose=1, mode="max")

model.fit(X_train, Y_train, batch_size=256, epochs=10, validation_data=(X_test,Y_test), callbacks=[early_stop])
Epoch 1/10
235/235 [==============================] - 18s 77ms/step - loss: 1.1041 - accuracy: 0.7506 - val_loss: 0.5957 - val_accuracy: 0.7947
Epoch 2/10
235/235 [==============================] - 20s 85ms/step - loss: 0.4861 - accuracy: 0.8320 - val_loss: 0.5044 - val_accuracy: 0.8272
Epoch 3/10
235/235 [==============================] - 19s 81ms/step - loss: 0.4354 - accuracy: 0.8480 - val_loss: 0.4647 - val_accuracy: 0.8387
Epoch 4/10
235/235 [==============================] - 22s 95ms/step - loss: 0.4066 - accuracy: 0.8571 - val_loss: 0.4460 - val_accuracy: 0.8487
Epoch 5/10
235/235 [==============================] - 19s 79ms/step - loss: 0.3852 - accuracy: 0.8645 - val_loss: 0.4718 - val_accuracy: 0.8428
Epoch 6/10
235/235 [==============================] - 20s 85ms/step - loss: 0.3702 - accuracy: 0.8695 - val_loss: 0.4104 - val_accuracy: 0.8593
Epoch 7/10
235/235 [==============================] - 19s 80ms/step - loss: 0.3572 - accuracy: 0.8730 - val_loss: 0.3960 - val_accuracy: 0.8633
Epoch 00007: early stopping
Out[9]:
<keras.callbacks.History at 0x7f2f404fd250>
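The stop at epoch 7 can be traced by hand from the log above. Below is a simplified plain-Python sketch of the patience rule for a metric that should be maximized; simulate_early_stopping() is a hypothetical name of our own, and it leaves out options such as baseline and restore_best_weights.

```python
def simulate_early_stopping(values, patience, min_delta):
    """Return the 1-based epoch at which training would stop, or None.

    Hypothetical helper for illustration; handles only the "max" case.
    """
    best = float("-inf")
    wait = 0
    for epoch, current in enumerate(values, start=1):
        if current - min_delta > best:
            # Improved by more than min_delta over the best value so far.
            best = current
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# val_accuracy per epoch, copied from the training log above.
val_accuracy = [0.7947, 0.8272, 0.8387, 0.8487, 0.8428, 0.8593, 0.8633]

stopped_at = simulate_early_stopping(val_accuracy, patience=3, min_delta=0.05)
print(stopped_at)
```

Epoch 4 (0.8487) is the only epoch after the first that beats the best value by more than 0.05, so the counter restarts there and runs out three epochs later.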

Early Stop Training If Validation Loss Not Decreasing

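Below, we have used the EarlyStopping callback again, this time monitoring validation loss. If validation loss does not decrease by more than 0.01 for 3 consecutive epochs, training is stopped. We have also set baseline to 0.4, so that training stops if validation loss shows no improvement over that value, and restore_best_weights to True, which restores the model weights from the epoch with the best monitored value when training stops. Note that this call continues training the model from the previous cell rather than a fresh one. We can notice that training stopped after 8 epochs even though we asked it to run for 10 epochs.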

In [10]:
early_stop = callbacks.EarlyStopping(monitor="val_loss", min_delta=0.01, patience=3, verbose=1, mode="min", baseline=0.4, restore_best_weights=True)

model.fit(X_train, Y_train, batch_size=256, epochs=10, validation_data=(X_test,Y_test), callbacks=[early_stop])
Epoch 1/10
235/235 [==============================] - 19s 80ms/step - loss: 0.3456 - accuracy: 0.8773 - val_loss: 0.4016 - val_accuracy: 0.8594
Epoch 2/10
235/235 [==============================] - 20s 86ms/step - loss: 0.3357 - accuracy: 0.8813 - val_loss: 0.3795 - val_accuracy: 0.8685
Epoch 3/10
235/235 [==============================] - 19s 80ms/step - loss: 0.3283 - accuracy: 0.8838 - val_loss: 0.3854 - val_accuracy: 0.8646
Epoch 4/10
235/235 [==============================] - 18s 77ms/step - loss: 0.3196 - accuracy: 0.8870 - val_loss: 0.3714 - val_accuracy: 0.8710
Epoch 5/10
235/235 [==============================] - 19s 80ms/step - loss: 0.3135 - accuracy: 0.8896 - val_loss: 0.3653 - val_accuracy: 0.8711
Epoch 6/10
235/235 [==============================] - 21s 90ms/step - loss: 0.3070 - accuracy: 0.8924 - val_loss: 0.3676 - val_accuracy: 0.8743
Epoch 7/10
235/235 [==============================] - 18s 76ms/step - loss: 0.2997 - accuracy: 0.8947 - val_loss: 0.3739 - val_accuracy: 0.8673
Epoch 8/10
235/235 [==============================] - 19s 79ms/step - loss: 0.2959 - accuracy: 0.8947 - val_loss: 0.3564 - val_accuracy: 0.8771
Restoring model weights from the end of the best epoch.
Epoch 00008: early stopping
Out[10]:
<keras.callbacks.History at 0x7f2f60f2ff90>

3. Checkpoint Model and Weights

In this section, we explain a callback that saves the model and its weights during training. We can then later load any of the saved models, such as the best-performing one.

Create and Compile Network

In this section, we have initialized our network and compiled it as usual.

In [11]:
from tensorflow.keras.optimizers import SGD

model = create_model()

model.compile(optimizer=SGD(learning_rate=0.001), loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Train Model And Save Checkpoint

In this section, we are training our neural network for 10 epochs with a model-saving callback. We can create a model checkpoint callback using the ModelCheckpoint constructor. It accepts a file path where the model will be saved during training. It lets us save the model at the end of each epoch, only when the monitored metric is the best seen so far, or after a specified number of batches. We can also use string formatting placeholders in the filename.

In our case, we are saving the model at the end of every epoch. Because save_best_only is left at its default of False, a file is written even when validation accuracy drops, as the output below shows for epoch 9. To save only when validation accuracy improves, set save_best_only=True, as we do in the weights-only example later in this section. The save_freq parameter can also accept an integer, in which case the model is saved after that many batches. By default, the whole model state along with its weights is saved. Later on, we have explained how we can save only weights.

In [12]:
from tensorflow.keras import callbacks

checkpoint = callbacks.ModelCheckpoint(filepath="/home/sunny/fashion_mnist_conv/model-{epoch:02d}-{val_accuracy:.2f}.hdf5",
                                       monitor="val_accuracy", verbose=1, mode="max", save_freq="epoch")
lr_reduce_max = callbacks.ReduceLROnPlateau(monitor="val_accuracy",
                                            factor=0.5, patience=3, verbose=1, mode="max",
                                            min_delta=0.05 ,min_lr=0.0001)

model.fit(X_train, Y_train, batch_size=256, epochs=10, validation_data=(X_test,Y_test),
          callbacks=[lr_reduce_max, checkpoint])
Epoch 1/10
235/235 [==============================] - 19s 81ms/step - loss: 1.0749 - accuracy: 0.7404 - val_loss: 0.5951 - val_accuracy: 0.7875

Epoch 00001: saving model to /home/sunny/fashion_mnist_conv/model-01-0.79.hdf5
Epoch 2/10
235/235 [==============================] - 19s 79ms/step - loss: 0.4771 - accuracy: 0.8335 - val_loss: 0.4879 - val_accuracy: 0.8302

Epoch 00002: saving model to /home/sunny/fashion_mnist_conv/model-02-0.83.hdf5
Epoch 3/10
235/235 [==============================] - 18s 76ms/step - loss: 0.4223 - accuracy: 0.8531 - val_loss: 0.4474 - val_accuracy: 0.8420

Epoch 00003: saving model to /home/sunny/fashion_mnist_conv/model-03-0.84.hdf5
Epoch 4/10
235/235 [==============================] - 21s 88ms/step - loss: 0.3915 - accuracy: 0.8638 - val_loss: 0.4208 - val_accuracy: 0.8529

Epoch 00004: saving model to /home/sunny/fashion_mnist_conv/model-04-0.85.hdf5
Epoch 5/10
235/235 [==============================] - 18s 76ms/step - loss: 0.3681 - accuracy: 0.8706 - val_loss: 0.4181 - val_accuracy: 0.8559

Epoch 00005: saving model to /home/sunny/fashion_mnist_conv/model-05-0.86.hdf5
Epoch 6/10
235/235 [==============================] - 19s 80ms/step - loss: 0.3525 - accuracy: 0.8757 - val_loss: 0.3936 - val_accuracy: 0.8616

Epoch 00006: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.

Epoch 00006: saving model to /home/sunny/fashion_mnist_conv/model-06-0.86.hdf5
Epoch 7/10
235/235 [==============================] - 19s 81ms/step - loss: 0.3355 - accuracy: 0.8817 - val_loss: 0.3851 - val_accuracy: 0.8641

Epoch 00007: saving model to /home/sunny/fashion_mnist_conv/model-07-0.86.hdf5
Epoch 8/10
235/235 [==============================] - 20s 85ms/step - loss: 0.3290 - accuracy: 0.8837 - val_loss: 0.3811 - val_accuracy: 0.8662

Epoch 00008: saving model to /home/sunny/fashion_mnist_conv/model-08-0.87.hdf5
Epoch 9/10
235/235 [==============================] - 19s 79ms/step - loss: 0.3236 - accuracy: 0.8855 - val_loss: 0.3855 - val_accuracy: 0.8630

Epoch 00009: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.

Epoch 00009: saving model to /home/sunny/fashion_mnist_conv/model-09-0.86.hdf5
Epoch 10/10
235/235 [==============================] - 18s 78ms/step - loss: 0.3168 - accuracy: 0.8883 - val_loss: 0.3746 - val_accuracy: 0.8675

Epoch 00010: saving model to /home/sunny/fashion_mnist_conv/model-10-0.87.hdf5
Out[12]:
<keras.callbacks.History at 0x7f2f80cc7ad0>
In [13]:
%ls /home/sunny/fashion_mnist_conv/
model-01-0.79.hdf5  model-04-0.85.hdf5  model-07-0.86.hdf5  model-10-0.87.hdf5
model-02-0.83.hdf5  model-05-0.86.hdf5  model-08-0.87.hdf5
model-03-0.84.hdf5  model-06-0.86.hdf5  model-09-0.86.hdf5
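The file names listed above come from filling the filepath template with the epoch number and the logged metrics. The snippet below reproduces this with plain str.format(), using the epoch 9 values from the training log. That keras passes a 1-based epoch number together with the logs to the template is our understanding of its behaviour, and the output above is consistent with it.

```python
# Template copied from the filepath argument (directory part omitted).
template = "model-{epoch:02d}-{val_accuracy:.2f}.hdf5"

# Metrics for epoch 9, copied from the training log above.
logs = {"loss": 0.3236, "accuracy": 0.8855,
        "val_loss": 0.3855, "val_accuracy": 0.8630}

# Unused keys in logs are simply ignored by str.format().
filename = template.format(epoch=9, **logs)
print(filename)
```

The {epoch:02d} placeholder zero-pads the epoch to two digits and {val_accuracy:.2f} rounds the metric to two decimals, which is why several files share the same accuracy suffix.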
In [14]:
model.evaluate(X_test, Y_test)
313/313 [==============================] - 1s 5ms/step - loss: 0.3746 - accuracy: 0.8675
Out[14]:
[0.374594122171402, 0.8675000071525574]

Loading Model From Checkpoint

In this section, we are loading one of the models that were saved during training. After loading the model, we also check its accuracy. We can load a model using the load_model() function available from the keras.models module.

In [15]:
from tensorflow.keras.models import load_model
import os

model_files = os.listdir("/home/sunny/fashion_mnist_conv/")
model_files = [f for f in model_files if "model" in f]

print("Loading Model : {}".format(model_files[-1]))

loaded_model1 = load_model(os.path.join("/home/sunny/fashion_mnist_conv/", model_files[-1]))

loaded_model1.evaluate(X_test, Y_test)
Loading Model : model-09-0.86.hdf5
313/313 [==============================] - 2s 4ms/step - loss: 0.3855 - accuracy: 0.8630
Out[15]:
[0.3854518234729767, 0.8629999756813049]
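The cell above loaded the epoch 9 model rather than the epoch 10 one because os.listdir() returns names in arbitrary order, so model_files[-1] is not guaranteed to be the latest checkpoint. One way to choose deterministically is to parse the epoch number out of each file name. The sketch below works on a hard-coded list of the file names shown earlier; epoch_of() is a hypothetical helper and assumes the model-{epoch}-{val_accuracy}.hdf5 naming used in this tutorial.

```python
# File names from the directory listing above, deliberately out of order
# to mimic what os.listdir() may return.
model_files = ["model-03-0.84.hdf5", "model-10-0.87.hdf5", "model-01-0.79.hdf5",
               "model-07-0.86.hdf5", "model-09-0.86.hdf5", "model-02-0.83.hdf5",
               "model-05-0.86.hdf5", "model-08-0.87.hdf5", "model-04-0.85.hdf5",
               "model-06-0.86.hdf5"]


def epoch_of(filename):
    """Extract the epoch number, e.g. "model-09-0.86.hdf5" -> 9."""
    return int(filename.split("-")[1])


latest = max(model_files, key=epoch_of)
print(latest)
```

The selected name can then be joined with the checkpoint directory and passed to load_model() exactly as in the cell above.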

Train Model And Save Checkpoint

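In this section, we have created a callback that saves only the weights of the model instead of the whole model (save_weights_only=True). We have also set save_best_only=True, so a file is written only when validation accuracy improves on the best value seen so far.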

In [16]:
checkpoint = callbacks.ModelCheckpoint(filepath="/home/sunny/fashion_mnist_conv/weights-{epoch:02d}-{val_accuracy:.2f}.hdf5",
                                       monitor="val_accuracy",
                                       save_best_only=True, save_weights_only=True,
                                       verbose=1, mode="max", save_freq="epoch")

model.fit(X_train, Y_train, batch_size=256, epochs=5, validation_data=(X_test,Y_test), callbacks=[checkpoint])
Epoch 1/5
235/235 [==============================] - 19s 79ms/step - loss: 0.3139 - accuracy: 0.8888 - val_loss: 0.3701 - val_accuracy: 0.8703

Epoch 00001: val_accuracy improved from -inf to 0.87030, saving model to /home/sunny/fashion_mnist_conv/weights-01-0.87.hdf5
Epoch 2/5
235/235 [==============================] - 18s 76ms/step - loss: 0.3113 - accuracy: 0.8898 - val_loss: 0.3701 - val_accuracy: 0.8698

Epoch 00002: val_accuracy did not improve from 0.87030
Epoch 3/5
235/235 [==============================] - 19s 79ms/step - loss: 0.3088 - accuracy: 0.8905 - val_loss: 0.3691 - val_accuracy: 0.8714

Epoch 00003: val_accuracy improved from 0.87030 to 0.87140, saving model to /home/sunny/fashion_mnist_conv/weights-03-0.87.hdf5
Epoch 4/5
235/235 [==============================] - 22s 92ms/step - loss: 0.3064 - accuracy: 0.8915 - val_loss: 0.3662 - val_accuracy: 0.8718

Epoch 00004: val_accuracy improved from 0.87140 to 0.87180, saving model to /home/sunny/fashion_mnist_conv/weights-04-0.87.hdf5
Epoch 5/5
235/235 [==============================] - 18s 77ms/step - loss: 0.3038 - accuracy: 0.8924 - val_loss: 0.3652 - val_accuracy: 0.8720

Epoch 00005: val_accuracy improved from 0.87180 to 0.87200, saving model to /home/sunny/fashion_mnist_conv/weights-05-0.87.hdf5
Out[16]:
<keras.callbacks.History at 0x7f2ff07900d0>
In [17]:
%ls /home/sunny/fashion_mnist_conv/
model-01-0.79.hdf5  model-06-0.86.hdf5  weights-01-0.87.hdf5
model-02-0.83.hdf5  model-07-0.86.hdf5  weights-03-0.87.hdf5
model-03-0.84.hdf5  model-08-0.87.hdf5  weights-04-0.87.hdf5
model-04-0.85.hdf5  model-09-0.86.hdf5  weights-05-0.87.hdf5
model-05-0.86.hdf5  model-10-0.87.hdf5

Load Model From Checkpoint

In this section, we are loading a model from saved weights only. To do so, we first need to recreate the model architecture; we also compile it so that we can call evaluate() on it. Then, we call the load_weights() method to load the weights from the file.

After loading the model, we have evaluated it on test data to check accuracy.

In [18]:
import os

weights_files = os.listdir("/home/sunny/fashion_mnist_conv/")
weights_files = [f for f in weights_files if "weights" in f]

print("Loading Weights : {}".format(weights_files[-1]))

loaded_model2 = create_model()

loaded_model2.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

loaded_model2.load_weights(os.path.join("/home/sunny/fashion_mnist_conv/",weights_files[-1]))

loaded_model2.evaluate(X_test, Y_test)
Loading Weights : weights-03-0.87.hdf5
313/313 [==============================] - 2s 5ms/step - loss: 0.3691 - accuracy: 0.8714
Out[18]:
[0.36911898851394653, 0.871399998664856]

4. Log Loss and Metrics to CSV File

In this section, we have explained a callback that lets us log loss and metrics to a CSV file.

Create and Compile Network

We have initialized our network and compiled it as usual.

In [19]:
from tensorflow.keras.optimizers import SGD

model = create_model()

model.compile(optimizer=SGD(learning_rate=0.001), loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Train Model and Log Metrics to CSV File

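In this section, we are training our network with a callback that saves loss and metric values to a CSV file. We can create the callback using the CSVLogger() constructor. We just need to give it the name of the file in which the details will be stored. We have also set append to True, which appends to the file if it already exists instead of overwriting it. The lr column in the resulting file appears because the ReduceLROnPlateau callback adds the current learning rate to the epoch logs.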

In [20]:
from tensorflow.keras import callbacks

csv_logger = callbacks.CSVLogger("/home/sunny/model.csv", append=True)
lr_reduce_max = callbacks.ReduceLROnPlateau(monitor="val_accuracy",
                                            factor=0.5, patience=3, verbose=1, mode="max",
                                            min_delta=0.05 ,min_lr=0.0001)

model.fit(X_train, Y_train, batch_size=256, epochs=10, validation_data=(X_test,Y_test),
          callbacks=[lr_reduce_max, csv_logger])
Epoch 1/10
235/235 [==============================] - 21s 87ms/step - loss: 1.6341 - accuracy: 0.7475 - val_loss: 0.5668 - val_accuracy: 0.8096
Epoch 2/10
235/235 [==============================] - 19s 80ms/step - loss: 0.4857 - accuracy: 0.8334 - val_loss: 0.4954 - val_accuracy: 0.8274
Epoch 3/10
235/235 [==============================] - 18s 77ms/step - loss: 0.4271 - accuracy: 0.8514 - val_loss: 0.4558 - val_accuracy: 0.8451
Epoch 4/10
235/235 [==============================] - 22s 92ms/step - loss: 0.3933 - accuracy: 0.8622 - val_loss: 0.4297 - val_accuracy: 0.8543

Epoch 00004: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 5/10
235/235 [==============================] - 18s 77ms/step - loss: 0.3685 - accuracy: 0.8712 - val_loss: 0.4168 - val_accuracy: 0.8556
Epoch 6/10
235/235 [==============================] - 19s 80ms/step - loss: 0.3579 - accuracy: 0.8755 - val_loss: 0.4079 - val_accuracy: 0.8583
Epoch 7/10
235/235 [==============================] - 20s 85ms/step - loss: 0.3491 - accuracy: 0.8782 - val_loss: 0.3969 - val_accuracy: 0.8635
Epoch 8/10
235/235 [==============================] - 20s 86ms/step - loss: 0.3407 - accuracy: 0.8809 - val_loss: 0.3944 - val_accuracy: 0.8631
Epoch 9/10
235/235 [==============================] - 19s 80ms/step - loss: 0.3331 - accuracy: 0.8834 - val_loss: 0.3899 - val_accuracy: 0.8658
Epoch 10/10
235/235 [==============================] - 18s 77ms/step - loss: 0.3269 - accuracy: 0.8856 - val_loss: 0.3842 - val_accuracy: 0.8667

Epoch 00010: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Out[20]:
<keras.callbacks.History at 0x7f2ff06b87d0>

Below, we have loaded the CSV file and printed the result logged by callback.

In [21]:
import pandas as pd

pd.read_csv("/home/sunny/model.csv")
Out[21]:
   epoch  accuracy      loss      lr  val_accuracy  val_loss
0      0  0.747483  1.634120  0.0010        0.8096  0.566825
1      1  0.833367  0.485673  0.0010        0.8274  0.495377
2      2  0.851417  0.427076  0.0010        0.8451  0.455792
3      3  0.862200  0.393324  0.0010        0.8543  0.429729
4      4  0.871200  0.368476  0.0005        0.8556  0.416805
5      5  0.875533  0.357913  0.0005        0.8583  0.407943
6      6  0.878233  0.349084  0.0005        0.8635  0.396940
7      7  0.880933  0.340656  0.0005        0.8631  0.394440
8      8  0.883400  0.333100  0.0005        0.8658  0.389867
9      9  0.885583  0.326920  0.0005        0.8667  0.384159
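Since the log is an ordinary CSV file, it can also be processed without pandas. The sketch below uses only the standard library to find the epoch with the highest validation accuracy. To stay self-contained it reads the same rows from an in-memory string (an assumption made for this example) instead of the file on disk. Note that the epoch column is zero-based.

```python
import csv
import io

# Same rows as the logged file above, embedded so the example is self-contained.
csv_text = """epoch,accuracy,loss,lr,val_accuracy,val_loss
0,0.747483,1.634120,0.0010,0.8096,0.566825
1,0.833367,0.485673,0.0010,0.8274,0.495377
2,0.851417,0.427076,0.0010,0.8451,0.455792
3,0.862200,0.393324,0.0010,0.8543,0.429729
4,0.871200,0.368476,0.0005,0.8556,0.416805
5,0.875533,0.357913,0.0005,0.8583,0.407943
6,0.878233,0.349084,0.0005,0.8635,0.396940
7,0.880933,0.340656,0.0005,0.8631,0.394440
8,0.883400,0.333100,0.0005,0.8658,0.389867
9,0.885583,0.326920,0.0005,0.8667,0.384159
"""

# DictReader yields one dict per row; every value is a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))

best_row = max(rows, key=lambda row: float(row["val_accuracy"]))
best_epoch = int(best_row["epoch"]) + 1  # convert zero-based index to 1-based

print(best_epoch, best_row["val_accuracy"])
```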

5. Tensorboard

In this section, we have explained another callback that logs training details for later use by TensorBoard. TensorBoard is a tool created by the TensorFlow team that lets us analyze and visualize various training metrics.

Create and Compile Network

We have initialized our neural network and compiled it as usual.

In [22]:
from tensorflow.keras.optimizers import SGD

model = create_model()

model.compile(optimizer=SGD(learning_rate=0.001), loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Logs For Tensorboard

Below, we are training our neural network for 10 epochs with a callback that logs details for TensorBoard. We can create the callback using the TensorBoard constructor. We need to provide a directory path where it will write its logs. The histogram_freq parameter, if set, computes activation and weight histograms every that many epochs. The update_freq parameter accepts either a string ('epoch' or 'batch') or an integer and controls whether losses/metrics are logged after each epoch or each batch; if we provide an integer, it logs them every that many batches. We have also set write_graph to True, which logs the model graph for visualization. There is also a parameter named write_images which, if set to True, logs model weights so that they can be visualized as images.
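A common practice, not used in this tutorial, is to give each training run its own timestamped sub-directory so that runs do not mix in the same log folder. Below is a small standard-library sketch; make_run_dir() and the "logs" base directory are hypothetical choices for illustration.

```python
import os
from datetime import datetime


def make_run_dir(base_dir, now=None):
    """Build a per-run log directory path such as logs/20220202-112206.

    Hypothetical helper; it only builds the path and does not create it.
    """
    now = now or datetime.now()
    return os.path.join(base_dir, now.strftime("%Y%m%d-%H%M%S"))


# A fixed timestamp is used here so the result is reproducible.
run_dir = make_run_dir("logs", datetime(2022, 2, 2, 11, 22, 6))
print(run_dir)

# The path would then be passed as the first argument of the TensorBoard
# callback, in place of the fixed "/home/sunny/logs" used in this tutorial.
```

Pointing the tensorboard command's --logdir option at the base directory then shows every run side by side.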

In [23]:
from tensorflow.keras import callbacks

tensorboard_logs = callbacks.TensorBoard("/home/sunny/logs", histogram_freq=1, write_graph=True,
                                         update_freq="epoch")

model.fit(X_train, Y_train, batch_size=256, epochs=10, validation_data=(X_test,Y_test),
          callbacks=[tensorboard_logs])
2022-02-02 11:22:06.634756: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2022-02-02 11:22:06.635031: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2022-02-02 11:22:06.636209: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
Epoch 1/10
  3/235 [..............................] - ETA: 22s - loss: 57.3017 - accuracy: 0.1419
2022-02-02 11:22:07.128706: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2022-02-02 11:22:07.128862: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2022-02-02 11:22:07.205040: I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
2022-02-02 11:22:07.211415: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2022-02-02 11:22:07.224040: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: /home/sunny/logs/train/plugins/profile/2022_02_02_11_22_07

2022-02-02 11:22:07.225520: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to /home/sunny/logs/train/plugins/profile/2022_02_02_11_22_07/a69fc6089fd3.trace.json.gz
2022-02-02 11:22:07.240666: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: /home/sunny/logs/train/plugins/profile/2022_02_02_11_22_07

2022-02-02 11:22:07.241652: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for memory_profile.json.gz to /home/sunny/logs/train/plugins/profile/2022_02_02_11_22_07/a69fc6089fd3.memory_profile.json.gz
2022-02-02 11:22:07.242281: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: /home/sunny/logs/train/plugins/profile/2022_02_02_11_22_07
Dumped tool data for xplane.pb to /home/sunny/logs/train/plugins/profile/2022_02_02_11_22_07/a69fc6089fd3.xplane.pb
Dumped tool data for overview_page.pb to /home/sunny/logs/train/plugins/profile/2022_02_02_11_22_07/a69fc6089fd3.overview_page.pb
Dumped tool data for input_pipeline.pb to /home/sunny/logs/train/plugins/profile/2022_02_02_11_22_07/a69fc6089fd3.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to /home/sunny/logs/train/plugins/profile/2022_02_02_11_22_07/a69fc6089fd3.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to /home/sunny/logs/train/plugins/profile/2022_02_02_11_22_07/a69fc6089fd3.kernel_stats.pb

235/235 [==============================] - 19s 78ms/step - loss: 1.5504 - accuracy: 0.7147 - val_loss: 0.5920 - val_accuracy: 0.7912
Epoch 2/10
235/235 [==============================] - 19s 80ms/step - loss: 0.5131 - accuracy: 0.8203 - val_loss: 0.5202 - val_accuracy: 0.8137
Epoch 3/10
235/235 [==============================] - 18s 77ms/step - loss: 0.4548 - accuracy: 0.8403 - val_loss: 0.4744 - val_accuracy: 0.8345
Epoch 4/10
235/235 [==============================] - 22s 94ms/step - loss: 0.4223 - accuracy: 0.8524 - val_loss: 0.4607 - val_accuracy: 0.8416
Epoch 5/10
235/235 [==============================] - 18s 78ms/step - loss: 0.4006 - accuracy: 0.8597 - val_loss: 0.4411 - val_accuracy: 0.8424
Epoch 6/10
235/235 [==============================] - 19s 80ms/step - loss: 0.3836 - accuracy: 0.8663 - val_loss: 0.4215 - val_accuracy: 0.8507
Epoch 7/10
235/235 [==============================] - 19s 80ms/step - loss: 0.3699 - accuracy: 0.8703 - val_loss: 0.4121 - val_accuracy: 0.8565
Epoch 8/10
235/235 [==============================] - 22s 94ms/step - loss: 0.3585 - accuracy: 0.8739 - val_loss: 0.4263 - val_accuracy: 0.8497
Epoch 9/10
235/235 [==============================] - 19s 80ms/step - loss: 0.3488 - accuracy: 0.8775 - val_loss: 0.4012 - val_accuracy: 0.8604
Epoch 10/10
235/235 [==============================] - 18s 77ms/step - loss: 0.3398 - accuracy: 0.8798 - val_loss: 0.3922 - val_accuracy: 0.8624
Out[23]:
<keras.callbacks.History at 0x7f2ff07295d0>
In [24]:
%ls /home/sunny/logs
train/  validation/
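
The example above logs once per epoch. As a rough sketch of the integer form of update_freq described earlier, the snippet below trains a tiny model on synthetic data and logs losses every 4 batches. The data, the tiny_model network and the temporary log directory are assumptions made only to keep the example small and self-contained. We have also set profile_batch=0, which turns off the profiler whose messages appear in the output above.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Tiny synthetic regression data, used only to keep the sketch fast.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4)).astype("float32")
y = x.sum(axis=1, keepdims=True).astype("float32")

tiny_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
tiny_model.compile(optimizer="sgd", loss="mse")

# Hypothetical temporary log directory (the tutorial uses /home/sunny/logs).
log_dir = os.path.join(tempfile.mkdtemp(), "logs")

tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,  # weight histograms after every epoch
    update_freq=4,     # log losses/metrics every 4 batches
    profile_batch=0,   # disable profiling
)

tiny_model.fit(x, y, batch_size=8, epochs=2, validation_data=(x, y),
               verbose=0, callbacks=[tb_callback])

# List what the callback wrote into the log directory.
created = sorted(os.listdir(log_dir))
print(created)
```

Pointing tensorboard at log_dir should then show the loss curve with more than one point per epoch.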

Below, we have first loaded tensorboard as an extension in the jupyter notebook. Then, we have started it by giving the directory where the logs are stored.

The %ls, %load_ext and %tensorboard commands are jupyter notebook magic commands. Jupyter notebook has many other magic commands that can help developers.

In [ ]:
%load_ext tensorboard

%tensorboard --logdir=/home/sunny/logs

6. Custom Callback

In this section, we have explained how we can create our own custom callback if none of the existing callbacks available from keras satisfies our requirements.

Create and Compile Model

Below, we have initialized our model and compiled it as usual.

In [26]:
from tensorflow.keras.optimizers import SGD

model = create_model()

model.compile(optimizer=SGD(learning_rate=0.001), loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Define Custom Callback

We can create a callback by extending the Callback class available from keras and implementing only the methods we need. Below, we have listed all the methods that can be implemented. The methods with the word 'train' in their name are executed during training (fit() call), those with 'test' are executed during evaluation (evaluate() call), and those with 'predict' are executed during prediction (predict() call).

We can perform operations before the start of training, after the end of training, before the start of an epoch, after the end of an epoch, before the start of a batch, and after the end of a batch. The methods on_epoch_begin() and on_epoch_end() execute particular steps before and after each epoch during training. There are separate methods for batches. Each method has a parameter named logs that holds a dictionary of the loss and metric values available at that point.

In our case, we have implemented the on_train_begin() method to print a message when training starts and the on_train_end() method to save the model after completion of training. The on_epoch_begin() method simply prints the learning rate that will be used for that epoch. The on_epoch_end() method halves the learning rate after completion of the epoch and saves the model as well.

Please make a NOTE that we can access the model object through self.model, as we have done below in the callback implementation. This lets us perform many tasks that require access to the model, such as updating or normalizing weights from a callback.

In [27]:
import tensorflow
from tensorflow.keras.callbacks import Callback

class CustomCallback(Callback):
    def on_train_begin(self, logs=None):
        print("Training Started")
    def on_train_end(self, logs=None):
        self.model.save("/home/sunny/convnet.hdf5")
    def on_test_begin(self, logs=None):
        pass
    def on_test_end(self, logs=None):
        pass
    def on_predict_begin(self, logs=None):
        pass
    def on_predict_end(self, logs=None):
        pass

    def on_epoch_begin(self, epoch, logs=None):
        current_lr = tensorflow.keras.backend.get_value(self.model.optimizer.learning_rate)
        print("Epoch Learning Rate : {}".format(current_lr))

    def on_epoch_end(self, epoch, logs=None):
        self.model.optimizer.learning_rate = self.model.optimizer.learning_rate / 2

        self.model.save("/home/sunny/convnet/model-{}.hdf5".format(epoch+1))

    def on_train_batch_begin(self, batch, logs=None):
        pass
    def on_train_batch_end(self, batch, logs=None):
        pass
    def on_test_batch_begin(self, batch, logs=None):
        pass
    def on_test_batch_end(self, batch, logs=None):
        pass
    def on_predict_batch_begin(self, batch, logs=None):
        pass
    def on_predict_batch_end(self, batch, logs=None):
        pass

Below, we are training our model for 10 epochs with our custom callback. We can notice from the output that the learning rate is printed at the beginning of each epoch. We have then listed the directory contents, and we can notice that a model is saved after each epoch as well as after completion of training.

In [28]:
from tensorflow.keras import callbacks

custom_callback = CustomCallback()

model.fit(X_train, Y_train, batch_size=256, epochs=10, validation_data=(X_test,Y_test), callbacks=[custom_callback])
Training Started
Epoch 1/10
Epoch Learning Rate : 0.0010000000474974513
235/235 [==============================] - 19s 78ms/step - loss: 1.3887 - accuracy: 0.6982 - val_loss: 0.5888 - val_accuracy: 0.8013
Epoch 2/10
Epoch Learning Rate : 0.0005000000237487257
235/235 [==============================] - 19s 80ms/step - loss: 0.5334 - accuracy: 0.8155 - val_loss: 0.5351 - val_accuracy: 0.8155
Epoch 3/10
Epoch Learning Rate : 0.0002500000118743628
235/235 [==============================] - 19s 80ms/step - loss: 0.4953 - accuracy: 0.8280 - val_loss: 0.5120 - val_accuracy: 0.8243
Epoch 4/10
Epoch Learning Rate : 0.0001250000059371814
235/235 [==============================] - 18s 77ms/step - loss: 0.4805 - accuracy: 0.8321 - val_loss: 0.5044 - val_accuracy: 0.8278
Epoch 5/10
Epoch Learning Rate : 6.25000029685907e-05
235/235 [==============================] - 19s 80ms/step - loss: 0.4737 - accuracy: 0.8345 - val_loss: 0.5005 - val_accuracy: 0.8288
Epoch 6/10
Epoch Learning Rate : 3.125000148429535e-05
235/235 [==============================] - 18s 77ms/step - loss: 0.4705 - accuracy: 0.8359 - val_loss: 0.4987 - val_accuracy: 0.8296
Epoch 7/10
Epoch Learning Rate : 1.5625000742147677e-05
235/235 [==============================] - 22s 95ms/step - loss: 0.4689 - accuracy: 0.8364 - val_loss: 0.4979 - val_accuracy: 0.8293
Epoch 8/10
Epoch Learning Rate : 7.812500371073838e-06
235/235 [==============================] - 20s 83ms/step - loss: 0.4681 - accuracy: 0.8365 - val_loss: 0.4975 - val_accuracy: 0.8290
Epoch 9/10
Epoch Learning Rate : 3.906250185536919e-06
235/235 [==============================] - 18s 77ms/step - loss: 0.4677 - accuracy: 0.8367 - val_loss: 0.4973 - val_accuracy: 0.8296
Epoch 10/10
Epoch Learning Rate : 1.9531250927684596e-06
235/235 [==============================] - 19s 80ms/step - loss: 0.4675 - accuracy: 0.8368 - val_loss: 0.4972 - val_accuracy: 0.8295
Out[28]:
<keras.callbacks.History at 0x7f2ff054bdd0>
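
The printed learning rates follow directly from halving the rate after every epoch: the rate used in epoch n (counting from 1) is the initial rate divided by 2^(n-1). The short sketch below reproduces that sequence in plain Python; the names initial_lr and rates are our own. The values printed by keras above differ in the trailing digits because the optimizer stores the learning rate as a 32-bit float.

```python
initial_lr = 0.001  # learning rate passed to SGD above

# Rate in effect during epoch n (1-based) when it is halved after every epoch.
rates = [initial_lr / 2 ** (n - 1) for n in range(1, 11)]

for n, rate in enumerate(rates, start=1):
    print("Epoch {:2d} : {:.12f}".format(n, rate))
```

By the tenth epoch the rate is roughly 2e-06, which is consistent with the loss and accuracy barely changing in the last few epochs of the output above.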
In [29]:
%ls /home/sunny/convnet
model-1.hdf5   model-2.hdf5  model-4.hdf5  model-6.hdf5  model-8.hdf5
model-10.hdf5  model-3.hdf5  model-5.hdf5  model-7.hdf5  model-9.hdf5
In [30]:
model.evaluate(X_test,Y_test)
313/313 [==============================] - 2s 6ms/step - loss: 0.4972 - accuracy: 0.8295
Out[30]:
[0.4971538186073303, 0.8295000195503235]

Below, we have loaded the model saved after completion of training and evaluated it on the test set.

In [31]:
from tensorflow.keras.models import load_model

print("Loading Model : convnet.hdf5")

loaded_model = load_model("/home/sunny/convnet.hdf5")

loaded_model.evaluate(X_test, Y_test)
Loading Model : convnet.hdf5
313/313 [==============================] - 2s 4ms/step - loss: 0.4972 - accuracy: 0.8295
Out[31]:
[0.4971538186073303, 0.8295000195503235]
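
The callback in this section did not use the logs parameter. As a small sketch of how it could be used, the hypothetical LossHistory callback below stores the training loss reported at the end of each epoch and asks keras to stop training once the loss falls below a threshold, by setting self.model.stop_training. The class name, the target_loss parameter, the synthetic data and the tiny_model network are all assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf


class LossHistory(tf.keras.callbacks.Callback):
    """Hypothetical callback: records the loss per epoch and can stop early."""

    def __init__(self, target_loss):
        super().__init__()
        self.target_loss = target_loss
        self.losses = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        loss = logs.get("loss")  # training loss reported for this epoch
        self.losses.append(loss)
        # self.model is attached by keras before training starts.
        if loss is not None and loss < self.target_loss:
            self.model.stop_training = True


# Tiny synthetic regression data, used only to keep the sketch fast.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4)).astype("float32")
y = x.sum(axis=1, keepdims=True).astype("float32")

tiny_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
tiny_model.compile(optimizer="sgd", loss="mse")

# A threshold of 0.0 is never reached, so all 3 epochs run here.
history_cb = LossHistory(target_loss=0.0)
tiny_model.fit(x, y, batch_size=8, epochs=3, verbose=0, callbacks=[history_cb])

print(history_cb.losses)
```

Passing a positive target_loss would end training at the first epoch whose loss falls below it.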

7. Other Available Callbacks

In this section, we have listed other callbacks available from keras that we have not covered in this tutorial as they are self-explanatory. We have included a small explanation for them below.

  • ProgbarLogger - This callback prints metrics to standard output.
  • TerminateOnNaN - This callback will halt training if loss becomes NaN.
  • BackupAndRestore - This callback is used to resume training if it has been halted during a call to fit().
  • LearningRateScheduler - This callback lets us modify the learning rate at the beginning of each epoch using a schedule function that we provide. We have a separate tutorial explaining learning rate schedules.
  • RemoteMonitor - This callback can help us stream events to a remote server.
  • LambdaCallback(on_epoch_begin=None, on_epoch_end=None, on_batch_begin=None, on_batch_end=None, on_train_begin=None, on_train_end=None) - This callback lets us create custom callbacks on the fly by passing functions to the parameters of the callback.
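
As a minimal sketch of the last item, the snippet below uses LambdaCallback to record the epoch index at the end of every epoch without defining a class. The synthetic data, the tiny_model network and the seen_epochs list are assumptions made only for illustration.

```python
import numpy as np
import tensorflow as tf

# Tiny synthetic regression data, used only to keep the sketch fast.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4)).astype("float32")
y = x.sum(axis=1, keepdims=True).astype("float32")

tiny_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
tiny_model.compile(optimizer="sgd", loss="mse")

seen_epochs = []

# The function receives the same arguments as Callback.on_epoch_end().
epoch_logger = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: seen_epochs.append(epoch)
)

tiny_model.fit(x, y, batch_size=8, epochs=3, verbose=0, callbacks=[epoch_logger])

print(seen_epochs)
```

Epoch indices passed to callbacks start at 0, so three epochs are recorded as 0, 1 and 2.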

This ends our small tutorial explaining how we can use callbacks available from keras and create custom callbacks if needed. Please feel free to let us know your views in the comments section.

Sunny Solanki