Updated On : Sep-18,2021 Tags optuna, hyperparameters-optimization
Simple Guide to Optuna for Hyperparameters Optimization/Tuning


Machine learning is a branch of artificial intelligence that focuses on designing algorithms that can automate a task by learning from data or from experience. Machine learning algorithms are nowadays used in many fields, for tasks like object detection, image classification, house price prediction, email classification, and many more. The majority of machine learning algorithms have a number of parameters whose values need to be tuned in order to get good results. These parameters are generally referred to as hyperparameters. Typical hyperparameters are the penalty used by an algorithm (l1 or l2), the number of layers in a neural network, activation functions, the learning rate, the optimization algorithm (SGD, Adam, etc.), and so on.

When designing an ML algorithm, we try different combinations of these hyperparameters to get good results. We want a generalized algorithm that can work well in different conditions. As models have grown more complex, the number of hyperparameters has grown with them. Trying many different combinations of all hyperparameters can take a lot of time (sometimes even days if there is a lot of data), even on powerful computers. Libraries like scikit-learn provide an implementation of grid search, which tries every combination from the given lists of hyperparameter values, and of random search, which tries a random subset of all possible combinations. Grid search can take a lot of time if there are many combinations, and random search can miss combinations that would have given good results. We need a way to concentrate on hyperparameter settings that are likely to give good results.
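The difference between the two search styles can be sketched in plain Python. The search space below is hypothetical and chosen only for illustration; no model is trained, we only count how many combinations each approach would have to evaluate.

```python
import itertools
import random

# Hypothetical search space (values chosen only for illustration).
search_space = {
    "alpha": [0.01, 0.1, 1.0, 10.0],
    "fit_intercept": [True, False],
    "solver": ["svd", "lsqr", "sag"],
}

names = list(search_space)

# Grid search: enumerate every combination (4 * 2 * 3 = 24 here).
grid = [dict(zip(names, values)) for values in itertools.product(*search_space.values())]

# Random search: evaluate only a fixed budget of randomly chosen combinations.
rng = random.Random(0)
random_subset = rng.sample(grid, k=6)

print("Grid search would evaluate  :", len(grid), "combinations")
print("Random search would evaluate:", len(random_subset), "combinations")
```

Adding one more hyperparameter multiplies the size of the grid, whereas the random search budget stays fixed but may skip the best combination.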

Optuna is a framework designed specifically for hyperparameter optimization. Optuna helps us find the best hyperparameters for our algorithm faster, and it works with the majority of popular ML libraries like scikit-learn, xgboost, PyTorch, TensorFlow, skorch, lightgbm, Keras, fast-ai, etc.

Optuna Strategy for Optimization

Optuna uses the two strategies below for finding the best hyperparameter combination.

  1. Sampling Strategy - It uses a sampling algorithm to choose which hyperparameter combination to try next. It concentrates on regions of the search space where hyperparameters are giving good results and ignores the others, which saves time.
  2. Pruning Strategy - It uses a pruning strategy that keeps checking algorithm performance during training and prunes (terminates) training for a particular hyperparameter combination if it's not giving good results. This also saves time.

As a part of this tutorial, we'll explain how we can use optuna to find the best hyperparameter settings faster. We'll be using scikit-learn and its algorithms for explanation purposes. This tutorial will get you started with Optuna. We have tried to explain the usage of it with simple and easy-to-understand examples.

Steps to use Optuna

Below we have listed steps that will be most commonly followed to use optuna.

  1. Create an objective function.
    • This function will have logic for creating a model, training it, and evaluating it on the validation dataset. After evaluation, it should return a single value which is generally the output of the evaluation metric (accuracy, MSE, etc.) and needs to be minimized/maximized.
    • This function takes as input a single parameter which is an instance of Trial class. This object has details about one combination of hyperparameters with which the ML algorithm will be executed.
  2. Create a Study object.
  3. Call the optimize() method on Study, passing the objective function created in the first step, to find the best hyperparameter combination. It'll execute the objective function multiple times, each time with a different Trial instance holding a different hyperparameter combination.

Optuna is based on the concept of Study and Trial.

  • The trial is one combination of hyperparameters that will be tried with an algorithm.
  • The study is the process of trying different combinations of hyperparameters to find the one combination that gives the best results. The study generally consists of many trials.

This ends our small introduction to Optuna. We'll now start explaining the usage with examples.

We'll start by importing optuna.

In [2]:
import optuna

print("Optuna Version : {}".format(optuna.__version__))
Optuna Version : 2.9.1

Minimize Simple Line Formula

As a part of this section, we'll introduce how we can use Optuna to find the parameter value that minimizes the output of a simple line function, 5x-21. More precisely, we want to find the value of x at which the output of 5x-21 is as close to 0 as possible. This is a simple function and we can easily calculate the answer by hand (x = 4.2), but we'll let optuna suggest values of parameter x that minimize the function.

We'll be following the steps which we had discussed earlier at the beginning of the tutorial.

We'll start by creating an objective function that takes as input an instance of Trial and returns the value which we want to minimize/maximize. In our case, we want to minimize the value of the line function 5x-21. We have wrapped the line formula in the abs function because we want to bring the output as close to 0 as possible. If we don't use abs then the most negative output of the line formula will be treated as the minimum.

Our hyperparameter for this line formula is x. We want to find the value of x at which the formula abs(5x-21) is minimum. We'll be using methods of the Trial instance to suggest values of hyperparameter x for this purpose.
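The effect of abs can be checked without Optuna. The sketch below simply evaluates the formula on a handful of evenly spaced candidate values of x between 0 and 5 (the 0.1 spacing is an arbitrary choice) and compares which x is picked with and without abs.

```python
def line(x):
    return 5 * x - 21


# Candidate values of x in the range [0, 5], spaced 0.1 apart.
candidates = [i / 10 for i in range(51)]

# Minimizing the raw formula picks the most negative output...
x_raw = min(candidates, key=line)
# ...while minimizing abs() picks the x whose output is closest to 0.
x_abs = min(candidates, key=lambda x: abs(line(x)))

print("Raw formula : x = {}, output = {}".format(x_raw, line(x_raw)))
print("With abs()  : x = {}, output = {}".format(x_abs, abs(line(x_abs))))
```

Without abs the search is pulled towards the lower bound of the range, while with abs it is pulled towards the root of the line.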


Important Methods of Trial Instance

  • suggest_float(name,low,high,step=None,log=False) - This method takes as input the hyperparameter name and its low and high values. It then suggests float values in the range [low, high]. We can specify step if we want suggested values to be spaced by that step size. We can also set log to True to sample values from the log domain, which is useful when a parameter spans several orders of magnitude (e.g. learning rate). The step and log parameters can't be used together.
  • suggest_int(name,low,high,step=1,log=False) - This method works like suggest_float() with the only difference being that it suggests integer values.
  • suggest_uniform(name,low,high) - This method takes the parameter name and the low & high values of the parameter. It then suggests values sampled uniformly from that range. Later Optuna releases (3.0 onwards) deprecate it in favor of suggest_float().
  • suggest_categorical(name, choices) - This method takes as input the parameter name and a list of values of that parameter that we want to try. It's generally used for hyperparameters that take one of a fixed set of values (solver names, boolean flags, etc.).

We have declared hyperparameter x using the suggest_float() method inside of the objective function. This will make sure that values of x are suggested as floats in the range 0-5. A new value of x will be suggested each time the objective function is called with a Trial instance.

In [2]:
def objective(trial):
    x = trial.suggest_float("x", 0, 5)
    return abs(5*x - 21)

In this cell, we have created an instance of Study using the create_study() method. This object will be used to try different hyperparameter combinations. In our case, different values of x will be tried by this study object.


  • create_study(study_name=None,direction=None,sampler=None,pruner=None,storage=None,load_if_exists=False) - This method creates an instance of Study which will be used for optimization. All of its parameters are optional.
    • The study_name parameter accepts a string specifying the name of the study.
    • The direction parameter accepts the string 'minimize' if we want to minimize the output of the objective function, else 'maximize'. By default, the objective function is minimized.
    • The sampler parameter accepts an instance of Sampler specifying which sampling strategy to use for selecting hyperparameter combinations. Below is a list of samplers available with Optuna.
      • TPESampler - By default this sampler will be used if none is provided. It uses a Tree-structured Parzen Estimator algorithm for selecting hyperparameter combinations.
      • CmaEsSampler - This sampler uses the cmaes library for selecting hyperparameter combinations. It's an implementation of the covariance matrix adaptation evolution strategy (CMA-ES) algorithm.
      • NSGAIISampler - This is a multi-objective sampler based on the NSGA-II algorithm.
      • MOTPESampler - This is a multi-objective sampler based on the MOTPE algorithm.
      • GridSampler - This sampler works like scikit-learn's grid search and tries all combinations from a given grid.
      • RandomSampler - This sampler works like scikit-learn's random search and randomly selects combinations from the search space.
    • The pruner parameter accepts an instance of Pruner which will be used to prune a particular trial of objective function if it's not giving good results before it completes. Below is a list of pruners available with Optuna.
      • MedianPruner - By default this pruner will be used if none is provided. It uses the median stopping rule to prune trials.
      • NopPruner - This pruner will not perform pruning.
      • PercentilePruner - This pruner keeps the specified percentile of trials at each step and prunes the rest.
      • SuccessiveHalvingPruner - This pruner uses an asynchronous successive halving algorithm for pruning trials.
      • HyperbandPruner - It uses a hyperband algorithm for pruning.
      • ThresholdPruner - It uses a certain threshold to prune trials.
      • PatientPruner - This pruner wraps another pruner with particular tolerance and prunes trial based on it.

We'll be using the default sampler and pruner available in our examples of this tutorial.


Once we have created a Study object, we can instruct it to try different values of hyperparameters and find the combination which gives us the best result. In our case, it'll try to find the best value of x which minimizes the line formula. We also need to give it a number of trials to perform.


Important Methods of Study Object

  • optimize(func,n_trials=None,timeout=None,n_jobs=1,catch=(),show_progress_bar=False) - This method takes as input the objective function and tries different combinations of hyperparameters with it. Its important parameters are listed below.
    • The n_trials parameter accepts an integer specifying the number of trials to execute.
    • The timeout parameter accepts a float specifying the number of seconds after which the study stops. It'll keep trying different combinations until the specified number of seconds has passed.
    • The n_jobs parameter accepts an integer specifying the number of trials to run in parallel. If we set it to -1 then the number of parallel jobs is set to the CPU count of the computer.
    • The catch parameter accepts a tuple of exception classes. The study will continue with other trials if one of the exceptions specified here is raised during the execution of the objective function. By default, it's empty, which means that any exception raised in the objective function will halt the study.
    • The show_progress_bar parameter accepts a boolean value which if set to True will show the progress bar of the study.
    • Note that storage is a parameter of create_study(), not of optimize(). It accepts a database URL where trial results will be saved. This is useful when we want to run trials in parallel on many different computers, as they'll communicate and divide trials using this database.

Below we have instructed Study object to try the objective function 10 times so that it'll try 10 different values of parameter x and will keep track of formula output for each try. We can then retrieve which value gave the best result using Study object attributes.

In [3]:
study1 = optuna.create_study(study_name="MinimizeFunction")
study1.optimize(objective, n_trials=10)
[I 2021-09-17 07:06:51,088] A new study created in memory with name: MinimizeFunction
[I 2021-09-17 07:06:51,094] Trial 0 finished with value: 13.841006675454917 and parameters: {'x': 1.4317986649090164}. Best is trial 0 with value: 13.841006675454917.
[I 2021-09-17 07:06:51,096] Trial 1 finished with value: 19.763318552570876 and parameters: {'x': 0.24733628948582498}. Best is trial 0 with value: 13.841006675454917.
[I 2021-09-17 07:06:51,099] Trial 2 finished with value: 12.105439771280711 and parameters: {'x': 1.7789120457438579}. Best is trial 2 with value: 12.105439771280711.
[I 2021-09-17 07:06:51,102] Trial 3 finished with value: 1.359435784302029 and parameters: {'x': 4.471887156860406}. Best is trial 3 with value: 1.359435784302029.
[I 2021-09-17 07:06:51,104] Trial 4 finished with value: 2.0720136928330817 and parameters: {'x': 4.614402738566616}. Best is trial 3 with value: 1.359435784302029.
[I 2021-09-17 07:06:51,107] Trial 5 finished with value: 2.6231990654329707 and parameters: {'x': 3.675360186913406}. Best is trial 3 with value: 1.359435784302029.
[I 2021-09-17 07:06:51,110] Trial 6 finished with value: 15.83207170250891 and parameters: {'x': 1.0335856594982178}. Best is trial 3 with value: 1.359435784302029.
[I 2021-09-17 07:06:51,113] Trial 7 finished with value: 1.016466974164011 and parameters: {'x': 4.403293394832803}. Best is trial 7 with value: 1.016466974164011.
[I 2021-09-17 07:06:51,116] Trial 8 finished with value: 1.504496584376195 and parameters: {'x': 3.899100683124761}. Best is trial 7 with value: 1.016466974164011.
[I 2021-09-17 07:06:51,119] Trial 9 finished with value: 9.916125912917185 and parameters: {'x': 2.216774817416563}. Best is trial 7 with value: 1.016466974164011.

Study object has a list of important attributes which can be used to find out the best parameter settings and result details once the study completes executing all trials.

The best_params attribute has a dictionary specifying a value of each hyperparameter that gave the best results (minimum value of an objective function in this case).

Below we have printed the best value of x which gave the minimum value for our line formula.

In [4]:
best_params = study1.best_params

best_params
Out[4]:
{'x': 4.403293394832803}
In [5]:
found_x = best_params["x"]
print("Found x: {}, (5*x - 21): {}".format(found_x, (5*found_x - 21)))
Found x: 4.403293394832803, (5*x - 21): 1.016466974164011

The best_value attribute has a value of the best result. It'll be holding the minimum value of the line formula that we got after performing different trials.

In [6]:
study1.best_value
Out[6]:
1.016466974164011

The best_trial attribute has an instance of FrozenTrial which has details about a trial that gave the best results. It has information about a trial state which is COMPLETE in this case. The trial state can be failed or pruned as well.

In [7]:
study1.best_trial
Out[7]:
FrozenTrial(number=7, values=[1.016466974164011], datetime_start=datetime.datetime(2021, 9, 17, 7, 6, 51, 112203), datetime_complete=datetime.datetime(2021, 9, 17, 7, 6, 51, 112935), params={'x': 4.403293394832803}, distributions={'x': UniformDistribution(high=5.0, low=0.0)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=7, state=TrialState.COMPLETE, value=None)

The trials attribute has a list of FrozenTrial instances holding information about each individual trial and their states.

In [8]:
print("Total Trials : {}".format(len(study1.trials)))
Total Trials : 10

The trials_dataframe() method returns pandas dataframe summarizing all trials of study.

In [9]:
study1.trials_dataframe()
Out[9]:
   number      value              datetime_start           datetime_complete                duration  params_x     state
0       0  13.841007  2021-09-17 07:06:51.092090  2021-09-17 07:06:51.093569  0 days 00:00:00.001479  1.431799  COMPLETE
1       1  19.763319  2021-09-17 07:06:51.095844  2021-09-17 07:06:51.096511  0 days 00:00:00.000667  0.247336  COMPLETE
2       2  12.105440  2021-09-17 07:06:51.098360  2021-09-17 07:06:51.099039  0 days 00:00:00.000679  1.778912  COMPLETE
3       3   1.359436  2021-09-17 07:06:51.101022  2021-09-17 07:06:51.101690  0 days 00:00:00.000668  4.471887  COMPLETE
4       4   2.072014  2021-09-17 07:06:51.103664  2021-09-17 07:06:51.104350  0 days 00:00:00.000686  4.614403  COMPLETE
5       5   2.623199  2021-09-17 07:06:51.106561  2021-09-17 07:06:51.107281  0 days 00:00:00.000720  3.675360  COMPLETE
6       6  15.832072  2021-09-17 07:06:51.109402  2021-09-17 07:06:51.110122  0 days 00:00:00.000720  1.033586  COMPLETE
7       7   1.016467  2021-09-17 07:06:51.112203  2021-09-17 07:06:51.112935  0 days 00:00:00.000732  4.403293  COMPLETE
8       8   1.504497  2021-09-17 07:06:51.115067  2021-09-17 07:06:51.115816  0 days 00:00:00.000749  3.899101  COMPLETE
9       9   9.916126  2021-09-17 07:06:51.118023  2021-09-17 07:06:51.118741  0 days 00:00:00.000718  2.216775  COMPLETE

We can continue our trials further by calling the optimize() method again on the same study. If we are not satisfied with the results of the initial trials, calling optimize() again runs the requested number of additional trials on top of the existing ones to improve the results further.

Below we have executed 15 more trials using optimize().

In [10]:
study1.optimize(objective, n_trials=15)
[I 2021-09-17 07:06:51,978] Trial 10 finished with value: 4.52979130620173 and parameters: {'x': 3.2940417387596543}. Best is trial 7 with value: 1.016466974164011.
[I 2021-09-17 07:06:51,983] Trial 11 finished with value: 3.807081636611514 and parameters: {'x': 4.961416327322302}. Best is trial 7 with value: 1.016466974164011.
[I 2021-09-17 07:06:51,990] Trial 12 finished with value: 6.35847995222232 and parameters: {'x': 2.928304009555536}. Best is trial 7 with value: 1.016466974164011.
[I 2021-09-17 07:06:51,997] Trial 13 finished with value: 0.16666595745332557 and parameters: {'x': 4.233333191490665}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,000] Trial 14 finished with value: 0.7946278771336708 and parameters: {'x': 4.0410744245732655}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,005] Trial 15 finished with value: 7.03633128623189 and parameters: {'x': 2.792733742753622}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,010] Trial 16 finished with value: 1.4948023423391064 and parameters: {'x': 3.901039531532179}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,014] Trial 17 finished with value: 4.100580715523694 and parameters: {'x': 3.379883856895261}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,017] Trial 18 finished with value: 9.587812097922953 and parameters: {'x': 2.2824375804154093}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,023] Trial 19 finished with value: 0.5429195542121477 and parameters: {'x': 4.091416089157571}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,028] Trial 20 finished with value: 4.907654147905518 and parameters: {'x': 3.2184691704188966}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,034] Trial 21 finished with value: 0.38395126138827607 and parameters: {'x': 4.123209747722345}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,040] Trial 22 finished with value: 3.9635596452450024 and parameters: {'x': 4.992711929049}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,045] Trial 23 finished with value: 0.29619340688321216 and parameters: {'x': 4.259238681376642}. Best is trial 13 with value: 0.16666595745332557.
[I 2021-09-17 07:06:52,049] Trial 24 finished with value: 1.1121650571943604 and parameters: {'x': 4.422433011438872}. Best is trial 13 with value: 0.16666595745332557.

Below we have printed the best parameter and the best value after trying 15 more trials.

In [11]:
best_params = study1.best_params

best_params
Out[11]:
{'x': 4.233333191490665}
In [12]:
found_x = best_params["x"]
print("Found x: {}, (5*x - 21): {}".format(found_x, (5*found_x - 21)))

print("Total Trials : {}".format(len(study1.trials)))
Found x: 4.233333191490665, (5*x - 21): 0.16666595745332557
Total Trials : 25

Regression (Ridge)

As a part of this section, we'll explain how we can use optuna with scikit-learn estimators. We'll be working on a regression problem and try to solve it using ridge regression. We'll be using the Boston housing dataset available from scikit-learn for our purpose. We'll start by importing all necessary libraries and functions that we'll be using throughout this section. We'll also compare the results of optuna with results of grid search and random search of scikit-learn.

In [4]:
import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.metrics import mean_squared_error

import pandas as pd
import numpy as np

import warnings

warnings.filterwarnings("ignore")

Below we have loaded the Boston housing dataset available from scikit-learn. It has information about houses like the average number of rooms per dwelling, property tax, the crime rate in the area, etc. We'll be predicting the median value of houses in thousands of dollars. We have loaded the dataset and saved it in a data frame for display purposes. Note that load_boston() was deprecated in scikit-learn 1.0 and removed in version 1.2, so this cell requires an older scikit-learn release.

We have stored 13 features of data into variable X and target value in variable Y.

In [14]:
boston = datasets.load_boston()

X,Y = boston.data, boston.target

boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)

boston_df["HousePrice"] = boston.target

boston_df
Out[14]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT HousePrice
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9

506 rows × 14 columns

Below we have divided data into train (80%) and test (20%) sets using train_test_split() scikit-learn function.

In [15]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.80, random_state=123)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
Out[15]:
((404, 13), (102, 13), (404,), (102,))

Below we have declared the objective function that we'll be using for our purpose. We have declared 4 hyperparameters that we'll be optimizing.

  • alpha
  • fit_intercept
  • tol
  • solver

We have used suggest_float() to suggest floating-point values for hyperparameters alpha and tol. We have used suggest_categorical() to suggest categorical values for hyperparameters fit_intercept and solver. Values of hyperparameters will be selected from ranges suggested by these methods during each trial of the optuna study.

We have then created a model with these parameters and fitted it on the training data. Finally, we have calculated the mean squared error (MSE) on the test data and returned it. We'll be minimizing this MSE during the study.

In [16]:
def objective(trial):
    alpha = trial.suggest_float("alpha", 0, 10)
    intercept = trial.suggest_categorical("fit_intercept", [True, False])
    tol = trial.suggest_float("tol", 0.001, 0.01, log=True)
    solver = trial.suggest_categorical("solver", ["auto", "svd","cholesky", "lsqr", "saga", "sag"])

    ## Create Model
    regressor = Ridge(alpha=alpha,fit_intercept=intercept,tol=tol,solver=solver)
    ## Fit Model
    regressor.fit(X_train, Y_train)

    return mean_squared_error(Y_test, regressor.predict(X_test))

Below we have created an instance of Study and run 15 trials of the objective function that we created in the previous cell. This will try to find the best hyperparameter settings that minimize MSE on test data using TPESampler of optuna.

In [17]:
%%time

study2 = optuna.create_study(study_name="RidgeRegression")
study2.optimize(objective, n_trials=15)
[I 2021-09-17 07:06:52,922] A new study created in memory with name: RidgeRegression
[I 2021-09-17 07:06:53,137] Trial 0 finished with value: 32.64447951723032 and parameters: {'alpha': 1.867480452457675, 'fit_intercept': False, 'tol': 0.00556052962595254, 'solver': 'svd'}. Best is trial 0 with value: 32.64447951723032.
[I 2021-09-17 07:06:53,166] Trial 1 finished with value: 32.68980260938696 and parameters: {'alpha': 7.278974327802254, 'fit_intercept': False, 'tol': 0.0010713988518477274, 'solver': 'auto'}. Best is trial 0 with value: 32.64447951723032.
[I 2021-09-17 07:06:53,200] Trial 2 finished with value: 47.296449938193476 and parameters: {'alpha': 1.1523506771478653, 'fit_intercept': False, 'tol': 0.0036028261758664004, 'solver': 'saga'}. Best is trial 0 with value: 32.64447951723032.
[I 2021-09-17 07:06:53,205] Trial 3 finished with value: 28.73396591701217 and parameters: {'alpha': 5.313845395415553, 'fit_intercept': True, 'tol': 0.002148432418111609, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,207] Trial 4 finished with value: 32.66661576509439 and parameters: {'alpha': 3.0398934979591132, 'fit_intercept': False, 'tol': 0.003023111779978088, 'solver': 'auto'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,214] Trial 5 finished with value: 33.58357226126697 and parameters: {'alpha': 3.0978137052532495, 'fit_intercept': True, 'tol': 0.005942102414029408, 'solver': 'sag'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,228] Trial 6 finished with value: 48.62188956451509 and parameters: {'alpha': 2.876265443215188, 'fit_intercept': False, 'tol': 0.004563590007643111, 'solver': 'saga'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,235] Trial 7 finished with value: 35.198720293987286 and parameters: {'alpha': 0.26558710738021185, 'fit_intercept': True, 'tol': 0.007069246250713737, 'solver': 'saga'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,238] Trial 8 finished with value: 46.42478093403398 and parameters: {'alpha': 5.357296052762834, 'fit_intercept': False, 'tol': 0.0037751409287642063, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,267] Trial 9 finished with value: 29.032403073948693 and parameters: {'alpha': 0.5783886723257092, 'fit_intercept': True, 'tol': 0.00128362461455875, 'solver': 'sag'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,274] Trial 10 finished with value: 28.739088682086184 and parameters: {'alpha': 8.258784559432177, 'fit_intercept': True, 'tol': 0.002166033853218858, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,281] Trial 11 finished with value: 28.74145197340928 and parameters: {'alpha': 9.607618399810457, 'fit_intercept': True, 'tol': 0.002149992494088231, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,289] Trial 12 finished with value: 28.737263040518396 and parameters: {'alpha': 7.2126373868035705, 'fit_intercept': True, 'tol': 0.0018929045090597943, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,296] Trial 13 finished with value: 29.709277592704538 and parameters: {'alpha': 5.7322444147230085, 'fit_intercept': True, 'tol': 0.0016580512349069821, 'solver': 'cholesky'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,303] Trial 14 finished with value: 28.736517064344486 and parameters: {'alpha': 6.784107207413218, 'fit_intercept': True, 'tol': 0.002486887883603407, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
CPU times: user 163 ms, sys: 8.21 ms, total: 171 ms
Wall time: 382 ms

Below we have printed hyperparameters combination that gave the least MSE. We have then created a ridge regression model using the best hyperparameters that we found out using optuna. We have evaluated the performance of the model on train and test set by evaluating MSE on both.

In [18]:
print("Best Params : {}".format(study2.best_params))

print("\nBest MSE : {}".format(study2.best_value))
Best Params : {'alpha': 5.313845395415553, 'fit_intercept': True, 'tol': 0.002148432418111609, 'solver': 'lsqr'}

Best MSE : 28.73396591701217
In [19]:
ridge = Ridge(**study2.best_params)

ridge.fit(X_train, Y_train)

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))
print("Ridge Regression MSE on Test  Dataset : {}".format(mean_squared_error(Y_test, ridge.predict(X_test))))
Ridge Regression MSE on Train Dataset : 26.041719809438128
Ridge Regression MSE on Test  Dataset : 28.73396591701217

Here, we have created a ridge regression model with default parameters for comparison purposes. We have fitted the default model on the train data and then evaluated it on both train and test sets.

In [20]:
ridge = Ridge()

ridge.fit(X_train, Y_train)

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))
print("Ridge Regression MSE on Test  Dataset : {}".format(mean_squared_error(Y_test, ridge.predict(X_test))))
Ridge Regression MSE on Train Dataset : 20.82386585083267
Ridge Regression MSE on Test  Dataset : 28.932169896813704

Now we'll optimize the objective function again for 10 more trials to check whether the results improve further. We have printed the best parameter settings and MSE after these 10 trials. The new trials build on the 15 trials that we performed earlier as a part of this study, so the sampler continues to search in the regions where it got good results (least MSE) earlier.

In [21]:
%%time

study2.optimize(objective, n_trials=10)
[I 2021-09-17 07:06:53,747] Trial 15 finished with value: 35.247931728664014 and parameters: {'alpha': 4.509417803878234, 'fit_intercept': True, 'tol': 0.009959449012008523, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,758] Trial 16 finished with value: 29.75149488375915 and parameters: {'alpha': 6.886178108780481, 'fit_intercept': True, 'tol': 0.0028285638127246316, 'solver': 'cholesky'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,767] Trial 17 finished with value: 29.63207821459252 and parameters: {'alpha': 4.444617399664175, 'fit_intercept': True, 'tol': 0.0015349667709175423, 'solver': 'svd'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,775] Trial 18 finished with value: 28.73546022692276 and parameters: {'alpha': 6.175932369070919, 'fit_intercept': True, 'tol': 0.0023950631917306784, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,782] Trial 19 finished with value: 29.223660109328904 and parameters: {'alpha': 8.726114972182415, 'fit_intercept': True, 'tol': 0.0014227610367588625, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,792] Trial 20 finished with value: 28.734970758324675 and parameters: {'alpha': 5.893831757917285, 'fit_intercept': True, 'tol': 0.0025549409904053215, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,800] Trial 21 finished with value: 28.735028681367677 and parameters: {'alpha': 5.927229350751814, 'fit_intercept': True, 'tol': 0.002534883104768401, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
[I 2021-09-17 07:06:53,809] Trial 22 finished with value: 28.732108565206946 and parameters: {'alpha': 4.238734413985723, 'fit_intercept': True, 'tol': 0.003364829406180997, 'solver': 'lsqr'}. Best is trial 22 with value: 28.732108565206946.
[I 2021-09-17 07:06:53,818] Trial 23 finished with value: 33.36295505570488 and parameters: {'alpha': 3.9770986508681387, 'fit_intercept': True, 'tol': 0.0034584998459242433, 'solver': 'lsqr'}. Best is trial 22 with value: 28.732108565206946.
[I 2021-09-17 07:06:53,826] Trial 24 finished with value: 33.361383040039605 and parameters: {'alpha': 4.873770097079565, 'fit_intercept': True, 'tol': 0.004135478064574861, 'solver': 'lsqr'}. Best is trial 22 with value: 28.732108565206946.
CPU times: user 83.6 ms, sys: 7.69 ms, total: 91.3 ms
Wall time: 89.6 ms

Below we have printed the best parameters and the MSE found after the additional 10 trials. The improvement is marginal (28.7321 vs 28.7340 earlier). We have also trained a model using these settings and evaluated it.

In [22]:
print("Best Params : {}".format(study2.best_params))

print("\nBest MSE : {}".format(study2.best_value))
Best Params : {'alpha': 4.238734413985723, 'fit_intercept': True, 'tol': 0.003364829406180997, 'solver': 'lsqr'}

Best MSE : 28.732108565206946
In [23]:
ridge = Ridge(**study2.best_params)

ridge.fit(X_train, Y_train)

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))
print("Ridge Regression MSE on Test  Dataset : {}".format(mean_squared_error(Y_test, ridge.predict(X_test))))
Ridge Regression MSE on Train Dataset : 26.04165999021236
Ridge Regression MSE on Test  Dataset : 28.732108565206946

As a part of this section, we'll compare optuna with the grid search algorithm of scikit-learn. We'll try the same parameter settings that we tried with optuna, but we'll use grid search to explore them. The grid search algorithm tries all possible combinations rather than looking in the direction of good results.

If you are interested in learning about how to use grid search from scikit-learn then please feel free to check our tutorial on the same.

Grid Search without Parallelization

Below we are trying grid search without any kind of parallelization. We have performed a grid search on training data first. Later on, we have created a model with the best parameter setting that grid search found. We have also evaluated the model on train and test sets to verify the results. We have chosen the same hyperparameter ranges that we used with optuna. In total, the grid search below tries 3000 different combinations of hyperparameters (25-alpha x 2-fit_intercept x 10-tol x 6-solver), and because cv is set to 5, each combination is fitted 5 times.

We can notice from the output that the test MSE (28.19) is close to the one we got with optuna (28.73), while the train MSE is lower (20.68 vs 26.04). The time taken by grid search, however, is a lot more: about 47 seconds, compared to roughly 90 ms for the 10 optuna trials above.

In [24]:
%%time

param_grid = {"alpha" : np.linspace(0, 10, 25),
              "fit_intercept": [True, False],
              "tol": np.linspace(0.001, 0.01,10),
              "solver": ["auto", "svd","cholesky", "lsqr", "saga", "sag"]
             }

grid = GridSearchCV(Ridge(), param_grid, cv=5)

grid.fit(X_train, Y_train)

grid.best_params_
CPU times: user 46.9 s, sys: 59.2 ms, total: 47 s
Wall time: 47 s
Out[24]:
{'alpha': 0.0, 'fit_intercept': True, 'solver': 'svd', 'tol': 0.001}
In [25]:
ridge = Ridge(**grid.best_params_)

ridge.fit(X_train, Y_train)

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))
print("Ridge Regression MSE on Test  Dataset : {}".format(mean_squared_error(Y_test, ridge.predict(X_test))))
Ridge Regression MSE on Train Dataset : 20.67710794781513
Ridge Regression MSE on Test  Dataset : 28.19248575846956
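The figure of 3000 combinations quoted for this grid can be checked without running the search. The sketch below uses ParameterGrid from scikit-learn, which enumerates the same combinations that GridSearchCV iterates over; the variable names n_combinations and cv_folds are only illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {"alpha": np.linspace(0, 10, 25),
              "fit_intercept": [True, False],
              "tol": np.linspace(0.001, 0.01, 10),
              "solver": ["auto", "svd", "cholesky", "lsqr", "saga", "sag"]}

# Number of distinct hyperparameter combinations in the grid
n_combinations = len(ParameterGrid(param_grid))

cv_folds = 5  # same value that is passed as cv to GridSearchCV

print("Combinations          :", n_combinations)             # 3000
print("Cross-validation fits :", n_combinations * cv_folds)  # 15000
```

With the default refit=True, GridSearchCV performs one more fit on the whole training set at the end, which is small next to the 15000 cross-validation fits.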

Grid Search with Parallelization

Below we have run the same grid search as in the previous step, but this time with parallelization to check whether there is any improvement in speed. The only change in the code is that we have set the n_jobs parameter to -1, instructing it to use all cores of the computer. We can notice from the output that the grid search now completes in a little under half the time (21.4 s vs 47 s), but even with parallelization it takes a lot more time than optuna.

In [26]:
%%time

param_grid = {"alpha" : np.linspace(0, 10, 25),
              "fit_intercept": [True, False],
              "tol": np.linspace(0.001, 0.01,10),
              "solver": ["auto", "svd","cholesky", "lsqr", "saga", "sag"]
             }

grid = GridSearchCV(Ridge(), param_grid, cv=5, n_jobs=-1)

grid.fit(X_train, Y_train)

grid.best_params_
CPU times: user 2.29 s, sys: 37.6 ms, total: 2.32 s
Wall time: 21.4 s
Out[26]:
{'alpha': 0.0, 'fit_intercept': True, 'solver': 'svd', 'tol': 0.001}
In [27]:
ridge = Ridge(**grid.best_params_)

ridge.fit(X_train, Y_train)

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))
print("Ridge Regression MSE on Test  Dataset : {}".format(mean_squared_error(Y_test, ridge.predict(X_test))))
Ridge Regression MSE on Train Dataset : 20.67710794781513
Ridge Regression MSE on Test  Dataset : 28.19248575846956

As a part of this section, we'll compare the performance of optuna with that of the random search algorithm of scikit-learn for hyperparameter optimization. We have covered how to use random search in the same tutorial in which we discussed grid search, mentioned above.

Random Search without Parallelization

Below we have used random search with the same range of hyperparameter values that we used with optuna. We have instructed the random search algorithm to run 25 iterations so that it tries 25 different randomly chosen hyperparameter settings. We chose 25 because we had run optuna earlier for 25 trials (15 first and then 10).

We can notice that random search completes much faster than grid search because it tries only 25 hyperparameter combinations whereas grid search was trying 3000. Compared with optuna, it still takes more time, and the results are almost the same. Optuna could run even faster if we used parallelization with it by setting n_jobs to -1 when optimizing the objective function.

In [28]:
%%time

param_grid = {"alpha" : np.linspace(0, 10, 25),
              "fit_intercept": [True, False],
              "tol": np.linspace(0.001, 0.01,10),
              "solver": ["auto", "svd","cholesky", "lsqr", "saga", "sag"]
             }

grid = RandomizedSearchCV(Ridge(), param_grid, cv=5, n_iter=25, random_state=123)

grid.fit(X_train, Y_train)

grid.best_params_
CPU times: user 386 ms, sys: 7 µs, total: 386 ms
Wall time: 384 ms
Out[28]:
{'tol': 0.003, 'solver': 'svd', 'fit_intercept': True, 'alpha': 3.75}
In [29]:
ridge = Ridge(**grid.best_params_)

ridge.fit(X_train, Y_train)

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))
print("Ridge Regression MSE on Test  Dataset : {}".format(mean_squared_error(Y_test, ridge.predict(X_test))))
Ridge Regression MSE on Train Dataset : 21.096131514541256
Ridge Regression MSE on Test  Dataset : 29.569638216510906

Random Search with Parallelization

Below we have run random search with parallelization by setting n_jobs to -1. It now runs faster than the non-parallelized version (230 ms vs 384 ms wall time) but is still slower than optuna. The selected parameters are the same as in the non-parallelized run, and the test MSE is slightly worse than the one we got with optuna.

In [30]:
%%time

param_grid = {"alpha" : np.linspace(0, 10, 25),
              "fit_intercept": [True, False],
              "tol": np.linspace(0.001, 0.01,10),
              "solver": ["auto", "svd","cholesky", "lsqr", "saga", "sag"]
             }

grid = RandomizedSearchCV(Ridge(), param_grid, cv=5, n_iter=25, n_jobs=-1, random_state=123)

grid.fit(X_train, Y_train)

grid.best_params_
CPU times: user 75 ms, sys: 7.9 ms, total: 82.9 ms
Wall time: 230 ms
Out[30]:
{'tol': 0.003, 'solver': 'svd', 'fit_intercept': True, 'alpha': 3.75}
In [31]:
ridge = Ridge(**grid.best_params_)

ridge.fit(X_train, Y_train)

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))
print("Ridge Regression MSE on Test  Dataset : {}".format(mean_squared_error(Y_test, ridge.predict(X_test))))
Ridge Regression MSE on Train Dataset : 21.096131514541256
Ridge Regression MSE on Test  Dataset : 29.569638216510906

Classification (Logistic Regression)

As a part of this section, we'll explain how we can use Optuna for classification problems. We'll be using the wine dataset available from scikit-learn for our purpose. It has information about various ingredients of wines for three different categories of wines. We'll be using logistic regression for explanation purposes and try to find out the best hyperparameters combination that gives the best accuracy.

Below we have loaded the wine dataset from scikit-learn. We have created a pandas data frame from wine data for the display purpose of its features and target (Wine Type) variables.

We have loaded information about wine features into a variable named X and information about wine type into a target variable named Y.

In [32]:
wine = datasets.load_wine()

X,Y = wine.data, wine.target

wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)

wine_df["WineType"] = wine.target

wine_df
Out[32]:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline WineType
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0 0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 13.71 5.65 2.45 20.5 95.0 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740.0 2
174 13.40 3.91 2.48 23.0 102.0 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750.0 2
175 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835.0 2
176 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840.0 2
177 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560.0 2

178 rows × 14 columns

Below we have divided the wine dataset into the train (80%) and test (20%) sets.

In [33]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.80, stratify=Y, random_state=123)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
Out[33]:
((142, 13), (36, 13), (142,), (36,))

Below we have created an objective function for our classification problem. We'll be optimizing 5 different hyperparameters of a logistic regression model.

  • penalty
  • tol
  • C
  • fit_intercept
  • solver

We have used suggest_categorical() method of Trial instance for suggesting categorical values for hyperparameters penalty, fit_intercept and solver. We have used suggest_float() method for suggesting float values for hyperparameters tol and C. Values of hyperparameters will be selected from ranges suggested by these methods during each trial of the optuna study.

We have then created a logistic regression model using these hyperparameter values. We have trained the model on train data and evaluated it on test data. The evaluation returns accuracy in this case, which tells us what fraction of test labels our model predicted correctly.

In [34]:
def objective(trial):
    penalty = trial.suggest_categorical("penalty", ["l1", "l2"])
    tol = trial.suggest_float("tol", 0.0001, 0.01, log=True)
    C = trial.suggest_float("C", 1.0, 10.0, log=True)
    intercept = trial.suggest_categorical("fit_intercept", [True, False])
    solver = trial.suggest_categorical("solver", ["liblinear", "saga"])

    ## Create Model
    classifier = LogisticRegression(penalty=penalty,
                                    tol=tol,
                                    C=C,
                                    fit_intercept=intercept,
                                    solver=solver,
                                    multi_class="auto",
                                   )
    ## Fit Model
    classifier.fit(X_train, Y_train)

    return classifier.score(X_test, Y_test)

Below we have created an instance of Study with the name LogisticRegression. We have set direction to maximize this time because we want to maximize accuracy which is the output of an objective function. The default value of parameter direction is minimize which will minimize the output of an objective function. It was used during the regression section where we wanted to minimize MSE.

We have then used the study object and instructed it to run the objective function for 15 trials with different hyperparameter combinations. It'll try 15 combinations and store information about each.

In [35]:
%%time

study3 = optuna.create_study(study_name="LogisticRegression", direction="maximize")
study3.optimize(objective, n_trials=15)
[I 2021-09-17 07:08:04,282] A new study created in memory with name: LogisticRegression
[I 2021-09-17 07:08:04,323] Trial 0 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.00044329282917357363, 'C': 4.227876357919739, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,348] Trial 1 finished with value: 0.6944444444444444 and parameters: {'penalty': 'l2', 'tol': 0.00027839176368309095, 'C': 4.045876618765414, 'fit_intercept': False, 'solver': 'saga'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,371] Trial 2 finished with value: 0.6944444444444444 and parameters: {'penalty': 'l1', 'tol': 0.0017661636335698377, 'C': 3.5819218693443595, 'fit_intercept': True, 'solver': 'saga'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,383] Trial 3 finished with value: 0.6944444444444444 and parameters: {'penalty': 'l1', 'tol': 0.0011113283180062658, 'C': 2.8253108009422, 'fit_intercept': True, 'solver': 'saga'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,392] Trial 4 finished with value: 0.6944444444444444 and parameters: {'penalty': 'l2', 'tol': 0.0002531325014708905, 'C': 6.80115440151418, 'fit_intercept': False, 'solver': 'saga'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,404] Trial 5 finished with value: 0.6944444444444444 and parameters: {'penalty': 'l1', 'tol': 0.00604775404136257, 'C': 4.9420276682995, 'fit_intercept': False, 'solver': 'saga'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,425] Trial 6 finished with value: 1.0 and parameters: {'penalty': 'l1', 'tol': 0.0007052999354687168, 'C': 6.581866524287295, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,429] Trial 7 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.0012112077504395891, 'C': 1.6592762781687822, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,435] Trial 8 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.00022492234875997168, 'C': 1.1435410531610801, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,444] Trial 9 finished with value: 1.0 and parameters: {'penalty': 'l1', 'tol': 0.008955179306703085, 'C': 5.369124344110566, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,461] Trial 10 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.00011707817689005435, 'C': 9.966257153734214, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,480] Trial 11 finished with value: 0.9722222222222222 and parameters: {'penalty': 'l1', 'tol': 0.0005947383488025259, 'C': 2.4953987523032026, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,513] Trial 12 finished with value: 1.0 and parameters: {'penalty': 'l1', 'tol': 0.0005160908089876725, 'C': 7.737327943929015, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,522] Trial 13 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.0021703601568192937, 'C': 6.551578644832481, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,548] Trial 14 finished with value: 0.9722222222222222 and parameters: {'penalty': 'l1', 'tol': 0.0005102712159548399, 'C': 2.062790817916783, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
CPU times: user 235 ms, sys: 8.18 ms, total: 243 ms
Wall time: 267 ms

Below we have printed the best hyperparameter settings, which gave the best accuracy on test data.

In [36]:
print("Best Params : {}".format(study3.best_params))

print("\nBest Accuracy : {}".format(study3.best_value))
Best Params : {'penalty': 'l2', 'tol': 0.00044329282917357363, 'C': 4.227876357919739, 'fit_intercept': False, 'solver': 'liblinear'}

Best Accuracy : 1.0

Below we have created a logistic regression model with the best parameters that we found using optuna. We have then trained it and evaluated it on train and test datasets.

We have also created a logistic regression model with default parameters to compare its performance with the model tuned by optuna. The results are the same here because the dataset that we have used is small. On real-world problems with a lot of data, hyperparameters tuned with optuna can be expected to beat the default settings more clearly.

In [37]:
classifier = LogisticRegression(**study3.best_params, multi_class="auto")

classifier.fit(X_train, Y_train)

print("Logistic Regression Accuracy on Train Dataset : {}".format(classifier.score(X_train, Y_train)))
print("Logistic Regression Accuracy on Test  Dataset : {}".format(classifier.score(X_test, Y_test)))
Logistic Regression Accuracy on Train Dataset : 0.971830985915493
Logistic Regression Accuracy on Test  Dataset : 1.0
In [38]:
classifier = LogisticRegression(multi_class="auto")

classifier.fit(X_train, Y_train)

print("Logistic Regression Accuracy on Train Dataset : {}".format(classifier.score(X_train, Y_train)))
print("Logistic Regression Accuracy on Test  Dataset : {}".format(classifier.score(X_test, Y_test)))
Logistic Regression Accuracy on Train Dataset : 0.971830985915493
Logistic Regression Accuracy on Test  Dataset : 1.0

Below we have instructed the study object to optimize the objective function for 10 more trials to check whether it's improving the results further or not.

In [39]:
%%time

study3.optimize(objective, n_trials=10)
[I 2021-09-17 07:08:04,799] Trial 15 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.003713115494739006, 'C': 4.490437667436356, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,847] Trial 16 finished with value: 0.9166666666666666 and parameters: {'penalty': 'l1', 'tol': 0.00010380694985183597, 'C': 8.808456215349976, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,856] Trial 17 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.0026196376974885567, 'C': 5.796925147588712, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,869] Trial 18 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.005315093180966793, 'C': 4.097645651629723, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,882] Trial 19 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.0037396033965007795, 'C': 3.242087494925973, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,896] Trial 20 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.0020803724890759096, 'C': 5.79752759831944, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,912] Trial 21 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.0035788434709027894, 'C': 3.937534837955347, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,926] Trial 22 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.006037830492622762, 'C': 3.2088399089814894, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,938] Trial 23 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.0014480851208891485, 'C': 2.337383118296276, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,952] Trial 24 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.0008729768443963656, 'C': 4.874010459610394, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
CPU times: user 177 ms, sys: 105 µs, total: 177 ms
Wall time: 172 ms

We have then printed the results after another 10 trials. The best trial is unchanged: it is still trial 0, with an accuracy of 1.0.

In [40]:
print("Best Params : {}".format(study3.best_params))

print("\nBest Accuracy : {}".format(study3.best_value))
Best Params : {'penalty': 'l2', 'tol': 0.00044329282917357363, 'C': 4.227876357919739, 'fit_intercept': False, 'solver': 'liblinear'}

Best Accuracy : 1.0
In [41]:
classifier = LogisticRegression(**study3.best_params, multi_class="auto")

classifier.fit(X_train, Y_train)

print("Logistic Regression Accuracy on Train Dataset : {}".format(classifier.score(X_train, Y_train)))
print("Logistic Regression Accuracy on Test  Dataset : {}".format(classifier.score(X_test, Y_test)))
Logistic Regression Accuracy on Train Dataset : 0.971830985915493
Logistic Regression Accuracy on Test  Dataset : 1.0

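As a part of this section, we are comparing the grid search algorithm with optuna. We are trying the same ranges for each hyperparameter that we tried when using optuna, which gives 2000 combinations (2-penalty x 25-C x 2-fit_intercept x 10-tol x 2-solver), each fitted with 5-fold cross-validation. We can notice after training and evaluation that the test accuracy is the same, but the time taken by grid search (about 1 minute) is a lot more than that of optuna (well under a second for 25 trials).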

In [42]:
%%time

param_grid = {
              "penalty": ["l1", "l2"],
              "C" : np.linspace(1, 10.0, 25),
              "fit_intercept": [True, False],
              "tol": np.linspace(0.0001, 0.01,10),
              "solver": ["liblinear", "saga"]
             }

grid = GridSearchCV(LogisticRegression(multi_class="auto", max_iter=1000), param_grid, cv=5)

grid.fit(X_train, Y_train)

grid.best_params_
CPU times: user 1min 2s, sys: 13.5 ms, total: 1min 2s
Wall time: 1min 2s
Out[42]:
{'C': 7.375,
 'fit_intercept': True,
 'penalty': 'l1',
 'solver': 'liblinear',
 'tol': 0.0034}
In [43]:
classifier = LogisticRegression(**grid.best_params_, multi_class="auto")

classifier.fit(X_train, Y_train)

print("Logistic Regression Accuracy on Train Dataset : {}".format(classifier.score(X_train, Y_train)))
print("Logistic Regression Accuracy on Test  Dataset : {}".format(classifier.score(X_test, Y_test)))
Logistic Regression Accuracy on Train Dataset : 0.9929577464788732
Logistic Regression Accuracy on Test  Dataset : 1.0

Below we have compared the random search algorithm with optuna. We have run random search for 25 iterations, which tries 25 different hyperparameter combinations on the data. We can notice from the output that the accuracy is almost the same as the one we got with optuna. The random search runs much faster than grid search because it does not try all possible combinations of hyperparameters, but it is still slower than optuna.

We can come to the conclusion that optuna finds a good hyperparameter combination in considerably less time than random search and grid search. This can increase the productivity of ML practitioners a lot, as it saves time that would have been spent trying all possible settings rather than concentrating on the ones that give good results and ignoring the others.

In [44]:
%%time

param_grid = {
              "penalty": ["l1", "l2"],
              "C" : np.linspace(1, 10.0, 25),
              "fit_intercept": [True, False],
              "tol": np.linspace(0.0001, 0.01,10),
              "solver": ["liblinear", "saga"]
             }

grid = RandomizedSearchCV(LogisticRegression(multi_class="auto", max_iter=1000), param_grid, cv=5, n_iter=25, random_state=123)

grid.fit(X_train, Y_train)

grid.best_params_
CPU times: user 879 ms, sys: 0 ns, total: 879 ms
Wall time: 878 ms
Out[44]:
{'tol': 0.0067,
 'solver': 'liblinear',
 'penalty': 'l1',
 'fit_intercept': True,
 'C': 9.625}
In [45]:
classifier = LogisticRegression(**grid.best_params_, multi_class="auto")

classifier.fit(X_train, Y_train)

print("Logistic Regression Accuracy on Train Dataset : {}".format(classifier.score(X_train, Y_train)))
print("Logistic Regression Accuracy on Test  Dataset : {}".format(classifier.score(X_test, Y_test)))
Logistic Regression Accuracy on Train Dataset : 0.9929577464788732
Logistic Regression Accuracy on Test  Dataset : 1.0

Pruning Underperforming Hyperparameter Settings Early

As a part of this section, we'll explain how we can instruct Optuna to prune trials that are not performing well during the study process.

A typical machine learning algorithm deals with a lot of data, in which case training does not complete in one go as it did in our earlier examples, where the datasets were small enough to fit in the main memory of the computer. Real-world problems generally have a lot of data, and the training process goes through the whole dataset in batches of samples. Many neural networks even go through a dataset more than once during the training process.

When going through data in batches, or looping through the same data more than once during a particular trial of a study, we can check the performance of the model on a set-aside validation or test set. If it's not performing well, the trial can be pruned before it completes to save time and resources for other trials of the study. Whether to prune a particular trial or not is decided by the pruning algorithm of Optuna.

We'll be using the California housing dataset available from scikit-learn as a part of this section. We'll be training a multi-layer perceptron available from scikit-learn on this dataset in batches.

California housing dataset has information about houses (average bedrooms, the population of an area, house age, etc) in California and their median house price. The median house price will be the target variable that our ML algorithm will be predicting. It'll be a regression problem.

Below we have loaded the California housing dataset which is available from scikit-learn. It's a big dataset compared to our previous datasets, with 20,640 entries. We have stored the dataset in a pandas dataframe for display purposes. We have stored housing features in variable X and our target variable (median house price) in variable Y.

In [46]:
calif_housing = datasets.fetch_california_housing()

X, Y = calif_housing.data, calif_housing.target

calif_housing_df = pd.DataFrame(calif_housing.data, columns=calif_housing.feature_names)

calif_housing_df["MedianHousePrice"] = calif_housing.target

calif_housing_df
Out[46]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedianHousePrice
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422
... ... ... ... ... ... ... ... ... ...
20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48 -121.09 0.781
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49 -121.21 0.771
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 -121.22 0.923
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 -121.32 0.847
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 -121.24 0.894

20640 rows × 9 columns

We have then divided the dataset into training (90%) and test (10%) sets as usual.

In [47]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.90, random_state=123)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
Out[47]:
((18576, 8), (2064, 8), (18576,), (2064,))

Below we have reshaped our training dataset so that its entries are batches of samples. Each entry is a batch of 16 samples, which gives 1161 batches (18576 / 16). The ML model will loop through the training data in batches, with 16 samples fed to it for training each time. The test dataset is left as it is.

In [48]:
X_train_batched, Y_train_batched = X_train.reshape(-1,16,8), Y_train.reshape(-1,16)

X_train_batched.shape, Y_train_batched.shape
Out[48]:
((1161, 16, 8), (1161, 16))
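The reshape above works because 18576 is an exact multiple of 16. When the number of samples is not divisible by the batch size, reshape raises an error. A minimal sketch of an alternative is given below, assuming a hypothetical helper named iter_batches that slices the arrays and lets the last batch be smaller.

```python
import numpy as np

def iter_batches(X, Y, batch_size=16):
    """Yield consecutive (X_batch, Y_batch) pairs; the last batch may be smaller."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], Y[start:start + batch_size]

# Small random arrays, used only to demonstrate the helper
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(100, 8))
Y_demo = rng.normal(size=100)

batches = list(iter_batches(X_demo, Y_demo, batch_size=16))

print("Number of batches :", len(batches))          # 7
print("First batch shape :", batches[0][0].shape)   # (16, 8)
print("Last batch shape  :", batches[-1][0].shape)  # (4, 8)
```

For the training data used here, this helper would produce the same 1161 batches of 16 samples as the reshape.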

Below we have created an objective function that we'll be using for our purpose. We'll be using the neural network algorithm available from scikit-learn for our purpose. We'll be optimizing the values of the below hyperparameters of the model.

  • hidden_layer_sizes
  • activation
  • learning_rate
  • learning_rate_init

Our objective function uses the suggest_categorical() method of the Trial instance for suggesting categorical values for hyperparameters hidden_layer_sizes, activation and learning_rate. We have used the suggest_float() method for suggesting float values for hyperparameter learning_rate_init. After that, we have initialized the model with these parameters.

The training process this time consists of a loop where we go through the training data and each time partially fit the model to a single batch of training data. We calculate the mean squared error on test data after each batch as well.

In order to prune underperforming trials, we have introduced a few extra lines of code. Inside the training loop, we call the report() method of the Trial instance, which takes as input the value that we are optimizing and the current step number. We then check whether this particular trial should be pruned using the should_prune() method of the Trial instance. If this method returns True, we raise the TrialPruned exception, which informs the Study instance that this trial should be pruned and not trained any further. This saves the time and resources that would otherwise have been spent on a trial that is unlikely to perform well.

The default pruning algorithm of the study instance is MedianPruner, which decides whether to prune a particular trial or not. Based on the decision taken by this algorithm, the should_prune() method returns True or False. MedianPruner takes its decision by comparing the MSE values that the current trial reported through report() with the median of the values that earlier trials reported at the same step.

In [49]:
from sklearn.neural_network import MLPRegressor

def objective(trial):
    hidden_layers = trial.suggest_categorical("hidden_layer_sizes", [(50,100),(100,100),(50,75,100),(25,50,75,100)])
    activation = trial.suggest_categorical("activation", ["relu", "identity"])
    #solver = trial.suggest_categorical("solver", ["sgd", "adam"])
    learning_rate = trial.suggest_categorical("learning_rate", ['constant', 'invscaling', 'adaptive'])
    learning_rate_init = trial.suggest_float("learning_rate_init", 0.001, 0.01)

    ## Create Model
    mlp_regressor = MLPRegressor(
                            hidden_layer_sizes=hidden_layers,
                            activation=activation,
                            #solver=solver,
                            learning_rate=learning_rate,
                            learning_rate_init=learning_rate_init,
                            #early_stopping=True
                            )
    ## Fit Model
    for i, (X_batch, Y_batch) in enumerate(zip(X_train_batched,Y_train_batched)):
        mlp_regressor.partial_fit(X_batch, Y_batch)

        mse = mean_squared_error(Y_test, mlp_regressor.predict(X_test))

        trial.report(mse, i+1)

        if trial.should_prune():
            raise optuna.TrialPruned()

    return mse

Below we have created an instance of Study and run 15 trials to minimize the output of the objective function (MSE on the test dataset).

This time, we can notice from the output that several trials were pruned during the study because the pruner judged them unlikely to perform well.

In [50]:
%%time

study4 = optuna.create_study(study_name="MLPRegressor")
study4.optimize(objective, n_trials=15)
[I 2021-09-17 07:09:09,083] A new study created in memory with name: MLPRegressor
[I 2021-09-17 07:09:11,468] Trial 0 finished with value: 5.430809559222708 and parameters: {'hidden_layer_sizes': (100, 100), 'activation': 'identity', 'learning_rate': 'invscaling', 'learning_rate_init': 0.006063649495374317}. Best is trial 0 with value: 5.430809559222708.
[I 2021-09-17 07:09:16,623] Trial 1 finished with value: 2.6933340702103807 and parameters: {'hidden_layer_sizes': (25, 50, 75, 100), 'activation': 'relu', 'learning_rate': 'adaptive', 'learning_rate_init': 0.006593756750422365}. Best is trial 1 with value: 2.6933340702103807.
[I 2021-09-17 07:09:20,239] Trial 2 finished with value: 1.465392838316882 and parameters: {'hidden_layer_sizes': (100, 100), 'activation': 'relu', 'learning_rate': 'constant', 'learning_rate_init': 0.0035340788684059556}. Best is trial 2 with value: 1.465392838316882.
[I 2021-09-17 07:09:23,682] Trial 3 finished with value: 2.0013989256305473 and parameters: {'hidden_layer_sizes': (50, 75, 100), 'activation': 'relu', 'learning_rate': 'adaptive', 'learning_rate_init': 0.006840653092104957}. Best is trial 2 with value: 1.465392838316882.
[I 2021-09-17 07:09:31,306] Trial 4 finished with value: 1.6047483038968486 and parameters: {'hidden_layer_sizes': (25, 50, 75, 100), 'activation': 'identity', 'learning_rate': 'adaptive', 'learning_rate_init': 0.004993568961591413}. Best is trial 2 with value: 1.465392838316882.
[I 2021-09-17 07:09:31,481] Trial 5 pruned.
[I 2021-09-17 07:09:31,510] Trial 6 pruned.
[I 2021-09-17 07:09:31,527] Trial 7 pruned.
[I 2021-09-17 07:09:31,551] Trial 8 pruned.
[I 2021-09-17 07:09:31,557] Trial 9 pruned.
[I 2021-09-17 07:09:31,582] Trial 10 pruned.
[I 2021-09-17 07:09:31,594] Trial 11 pruned.
[I 2021-09-17 07:09:31,612] Trial 12 pruned.
[I 2021-09-17 07:09:31,830] Trial 13 pruned.
[I 2021-09-17 07:09:31,859] Trial 14 pruned.
CPU times: user 54.8 s, sys: 27.1 s, total: 1min 21s
Wall time: 22.8 s

Below we have printed the best parameter settings and the least MSE that we got using those parameters.

In [51]:
print("Best Params : {}".format(study4.best_params))

print("\nBest MSE : {}".format(study4.best_value))
Best Params : {'hidden_layer_sizes': (100, 100), 'activation': 'relu', 'learning_rate': 'constant', 'learning_rate_init': 0.0035340788684059556}

Best MSE : 1.465392838316882

Below we have printed the count of total trials, trials that were pruned, and trials that were completed successfully. We have used the state of the trial to determine whether they completed or got pruned. We can notice that 10 trials were pruned out of a total of 15 trials.

In [52]:
print("Total Trials : {}".format(len(study4.trials)))
print("Finished Trials : {}".format(len([t for t in study4.trials if t.state == optuna.trial.TrialState.COMPLETE])))
print("Pruned Trials : {}".format(len([t for t in study4.trials if t.state == optuna.trial.TrialState.PRUNED])))
Total Trials : 15
Finished Trials : 5
Pruned Trials : 10

Below we have trained a multi-layer perceptron with the best parameter settings found by Optuna and evaluated it by calculating MSE on both the train and test datasets.

For comparison, we have also trained a multi-layer perceptron with default parameter settings. In this run the default model reaches a slightly lower test MSE (0.648 vs 0.777), so this short 15-trial search has not yet beaten the defaults.

In [53]:
mlp_regressor = MLPRegressor(**study4.best_params, random_state=123)

mlp_regressor.fit(X_train, Y_train)

print("MLP Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, mlp_regressor.predict(X_train))))
print("MLP Regression MSE on Test  Dataset : {}".format(mean_squared_error(Y_test, mlp_regressor.predict(X_test))))
MLP Regression MSE on Train Dataset : 0.7297391637455242
MLP Regression MSE on Test  Dataset : 0.7772869137936619
In [54]:
mlp_regressor = MLPRegressor(random_state=123)

mlp_regressor.fit(X_train, Y_train)

print("MLP Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, mlp_regressor.predict(X_train))))
print("MLP Regression MSE on Test  Dataset : {}".format(mean_squared_error(Y_test, mlp_regressor.predict(X_test))))
MLP Regression MSE on Train Dataset : 0.6106332788212708
MLP Regression MSE on Test  Dataset : 0.6483518877964272

Below we have instructed the study instance to run 5 more trials of the objective function. We are doing this to check whether it can find a hyperparameters combination that gives better results than those from our earlier 15 trials.

In [55]:
%%time

study4.optimize(objective, n_trials=5)
[I 2021-09-17 07:09:38,756] Trial 15 pruned.
[I 2021-09-17 07:09:39,356] Trial 16 pruned.
[I 2021-09-17 07:09:39,378] Trial 17 pruned.
[I 2021-09-17 07:09:39,401] Trial 18 pruned.
[I 2021-09-17 07:09:39,438] Trial 19 pruned.
CPU times: user 1.48 s, sys: 695 ms, total: 2.18 s
Wall time: 698 ms

Below we have printed the best hyperparameter settings as usual. All 5 additional trials were pruned, so the best parameters and the best MSE are unchanged. We have then trained the model again with the best hyperparameter settings for verification purposes.

In [56]:
print("Best Params : {}".format(study4.best_params))

print("\nBest MSE : {}".format(study4.best_value))
Best Params : {'hidden_layer_sizes': (100, 100), 'activation': 'relu', 'learning_rate': 'constant', 'learning_rate_init': 0.0035340788684059556}

Best MSE : 1.465392838316882
In [57]:
mlp_regressor = MLPRegressor(**study4.best_params,random_state=123)

mlp_regressor.fit(X_train, Y_train)

print("MLP Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, mlp_regressor.predict(X_train))))
print("MLP Regression MSE on Test  Dataset : {}".format(mean_squared_error(Y_test, mlp_regressor.predict(X_test))))
MLP Regression MSE on Train Dataset : 0.7297391637455242
MLP Regression MSE on Test  Dataset : 0.7772869137936619

Visualizations

As a part of this section, we'll explore various visualizations available through Optuna which can help us make better decisions. They give us insight into various hyperparameters and their impact on model performance.

We'll start by checking whether visualization support is available using the is_available() function. It checks whether a suitable version of plotly is installed for creating visualizations. The matplotlib backend has its own check, optuna.visualization.matplotlib.is_available().

In [58]:
optuna.visualization.is_available()
Out[58]:
True

Optimization History Plot

The first chart that we'll introduce is the optimization history chart. It plots the trial number on the X-axis and the objective value that we got for each trial on the Y-axis.

We can use this chart to check whether hyperparameters optimization is going in the right direction. In our regression task, the value of the objective function (MSE) should decrease over time, and in the classification task, the value of the objective function (accuracy) should increase.


  • plot_optimization_history(study, target_name='Objective Value') - This function takes a Study object as input and plots the optimization history chart using plotly. The target_name parameter sets the label used for the objective value that we were trying to minimize/maximize.
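Besides one dot per trial, the optimization history chart also draws a "best value" line. As a rough sketch of what that line represents for a minimization study, the snippet below computes the running minimum over the MSE values of the five completed multi-layer perceptron trials (rounded from the log earlier in this tutorial). The variable names are chosen for this illustration only.

```python
from itertools import accumulate

# MSE of the five completed trials (trials 0-4), rounded from the study log
mse_per_trial = [5.4308, 2.6933, 1.4654, 2.0014, 1.6047]

# Running minimum: the best (lowest) MSE seen up to and including each trial
best_so_far = list(accumulate(mse_per_trial, min))

for number, (mse, best) in enumerate(zip(mse_per_trial, best_so_far)):
    print(f"trial {number}: mse={mse:.4f}  best so far={best:.4f}")
```

The best value stops improving after trial 2, which matches the "Best is trial 2" messages in the log.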

Below we have plotted the optimization history chart using the study object that we created during the classification section of this tutorial.

In [ ]:
optuna.visualization.plot_optimization_history(study3, target_name="Accuracy")

Below we have plotted the optimization history chart using the study object that we created in the multi-layer perceptron section. We can notice from the output that the best MSE decreases as the number of trials increases. This suggests that Optuna was searching for hyperparameter combinations in the right direction.

In [ ]:
optuna.visualization.plot_optimization_history(study4, target_name="MSE of Median House Prices")

Below we have plotted an optimization history plot using matplotlib. Optuna provides most of its charts with a matplotlib backend as well.

In [ ]:
optuna.visualization.matplotlib.plot_optimization_history(study4, target_name="MSE of Median House Prices");

Parameter Importance Plot

The second chart that we'll plot is a bar chart of the importance of each hyperparameter that was tried during the study. This can help us understand which hyperparameters contribute most towards minimizing/maximizing the objective value.


  • plot_param_importances(study, target_name='Objective Value') - This function takes a Study object as input and plots a bar chart of hyperparameter importances from it.

Below we have plotted the hyperparameter importance chart using the study object from the multi-layer perceptron section. We can notice that learning_rate is ranked as the most important parameter to optimize, followed by learning_rate_init, activation, and hidden_layer_sizes.

In [ ]:
optuna.visualization.plot_param_importances(study4, target_name="MSE of Median House Prices")

Below we have plotted the hyperparameter importance chart using the study object from the classification section. From the chart, solver appears to be the most important hyperparameter to tune.

In [ ]:
optuna.visualization.plot_param_importances(study3, target_name="Accuracy")

HyperParameters Relationship Contour Plot

As a part of this section, we'll introduce a contour chart of the relationship between hyperparameters. It shows, as a contour plot, how the objective value varies across different combinations of hyperparameter values.


  • plot_contour(study, params=None, target_name='Objective Value') - This function takes a Study object as input and returns a contour chart showing the relationship between all pairs of hyperparameters. We can provide the params parameter with a list of hyperparameters to restrict the chart to the ones we want to compare.

Below we have plotted a contour plot using the study object from the multi-layer perceptron section, showing the relationship between the hyperparameters hidden_layer_sizes and activation. We can notice from the chart that the value of the objective function is lower where hidden_layer_sizes is set to (50,75,100) and activation is set to identity.

In [ ]:
optuna.visualization.plot_contour(study4, params=["hidden_layer_sizes", "activation"],
                                  target_name="MSE of Median House Prices"
                                 )

Below we have plotted another contour chart showing the relationship between learning_rate and learning_rate_init.

In [ ]:
optuna.visualization.plot_contour(study4, params=["learning_rate", "learning_rate_init"],
                                  target_name="MSE of Median House Prices"
                                 )

Below we have plotted our third contour chart using the study object from the classification section. We have used 3 hyperparameters (penalty, C, and solver) this time. This creates a 3x3 grid of panels in which each off-diagonal panel shows the relationship between one pair of hyperparameters.

In [ ]:
optuna.visualization.plot_contour(study3, params=["penalty", "C", "solver"],
                                  target_name="Accuracy"
                                 )

HyperParameters Combinations and Objective Value Relationship Parallel Coordinates Plot

As part of this section, we'll introduce a parallel coordinates chart of the parameter combinations that lead to particular values of the objective function. The chart has one vertical axis for each hyperparameter that we tried using Optuna, marked with the values tried for that hyperparameter. Each trial is drawn as a line connecting its values across these axes, so one line represents one combination of hyperparameters. The color of the line comes from a colormap that represents the objective value obtained with that combination. The first vertical axis represents the actual values of the objective function that we were trying to minimize/maximize.

Optuna provides plot_parallel_coordinate() function for this purpose.


  • plot_parallel_coordinate(study, target_name='Objective Value') - This function takes a Study instance as input and creates a parallel coordinates chart showing the relationship between hyperparameter combinations and the objective value.

Below we have created a parallel coordinates plot using our Study instance from the multi-layer perceptron section, where we minimized the MSE of median house prices. The chart shows different combinations of hyperparameters and their relationship with MSE.

In [ ]:
optuna.visualization.plot_parallel_coordinate(study4, target_name="MSE of Median House Prices")

Below we have created another parallel coordinates chart using Study object from the classification section.

In [ ]:
optuna.visualization.plot_parallel_coordinate(study3, target_name="Accuracy")

HyperParameters Combination and Objective Value Relationship Slice Plot

As a part of this section, we'll introduce a slice chart that shows the relationship between a hyperparameter's value and the objective value. It has the hyperparameter value on the X-axis and the objective value on the Y-axis. Each dot is one trial, and the color of the dot encodes the trial number, so we can see at which point of the study each value was tried.

Optuna provides plot_slice() function for this purpose.


  • plot_slice(study, params=None, target_name='Objective Value') - This function takes a Study instance and a list of hyperparameter names as input and creates a slice plot from them. The slice plot consists of a list of charts where each individual chart represents the relationship between one hyperparameter and the objective value.

Below we have created a slice plot from the Study object of the multi-layer perceptron section for the hyperparameters learning_rate and learning_rate_init. The dot color shows which trial produced each hyperparameter value and the MSE shown on the Y-axis.

In [ ]:
optuna.visualization.plot_slice(study4, params=["learning_rate", "learning_rate_init"],
                                  target_name="MSE of Median House Prices")

Below we have created a slice plot using the Study instance of the classification section. We have included the hyperparameters penalty, C, and solver in the chart. It shows the accuracy obtained for each hyperparameter value, with the dot color indicating the trial number.

In [ ]:
optuna.visualization.plot_slice(study3, params=["penalty", "C", "solver"],
                                  target_name="Accuracy"
                                 )

Intermediate Values of Trials

As a part of this section, we'll introduce a chart that shows the progress of all trials of the study. The chart has one line per trial showing how the objective value progresses (increases/decreases) during the training process of that trial. This can be useful for analyzing trial progress and understanding why a particular set of trials was pruned. Optuna provides a function named plot_intermediate_values() for the creation of this chart.


  • plot_intermediate_values(study) - This function takes a Study instance as input and plots a line chart where each line represents the progress of an individual trial of the study. The X-axis of the chart represents the step number within the trial and the Y-axis represents the intermediate objective value.

The lines will generally decrease where we are trying to minimize the objective value (MSE) and increase where we are trying to maximize it (accuracy). Some trials have entries up to the last step, while others stop partway through. Lines that stop early belong to trials that were deemed underperforming by the pruner and pruned before completion.

Below we have created an intermediate objective values chart of trials using Study object from the multi-layer perceptron section.

In [ ]:
fig = optuna.visualization.plot_intermediate_values(study4)

fig

Below we have recreated the previous chart using matplotlib.

In [ ]:
optuna.visualization.matplotlib.plot_intermediate_values(study4);

Empirical Distribution Function Plot

As a part of this section, we'll introduce the empirical cumulative distribution function (eCDF) of the objective value. The chart consists of a single step line. The X-axis represents the objective value that we are trying to minimize/maximize and the Y-axis represents cumulative probability. The cumulative probability at any point on the line is the fraction of trials whose objective value is less than or equal to the objective value at that point.

To explain it with an example, let’s say we take a point on the line where the cumulative probability is 0.80 and the objective value is 2.7. Then, of all the trials completed as a part of the study, 80% have an objective value of at most 2.7.


  • plot_edf(study, target_name='Objective Value') - This function takes a Study instance as input and creates an eCDF chart of the objective value.

Below we have created an eCDF chart from the Study instance of the multi-layer perceptron section. We can notice that MSE ranges from 0 to a little over 5.0.
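The cumulative probability can also be computed by hand. The short sketch below evaluates the eCDF at a threshold using the MSE values of the five completed multi-layer perceptron trials (rounded from the log earlier in this tutorial). The helper name ecdf is made up for this illustration and is not part of Optuna.

```python
def ecdf(values, threshold):
    """Fraction of values that are less than or equal to threshold."""
    return sum(v <= threshold for v in values) / len(values)


# MSE of the five completed trials, rounded from the study log
mse_completed = [5.4308, 2.6933, 1.4654, 2.0014, 1.6047]

print(ecdf(mse_completed, 2.7))  # 4 of the 5 values are <= 2.7
print(ecdf(mse_completed, 1.0))  # no value is <= 1.0
```

With these values, four of the five completed trials have an MSE of at most 2.7, which gives the cumulative probability of 0.80 used in the example above.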

In [ ]:
optuna.visualization.plot_edf(study4, target_name="MSE of Median House Prices")

Below we have created an eCDF chart of the objective value using the Study instance from the classification section. The objective value there was accuracy, hence the X-axis values lie in the range 0-1.

In [ ]:
optuna.visualization.plot_edf(study3, target_name="Accuracy")

Optuna Logging

As a part of this section, we'll introduce a few functions which can be used to handle logging messages generated by Optuna.

Optuna by default displays all logging messages of level INFO and above. We can modify this default logging level. Optuna provides two functions for checking and modifying logging levels.


  • get_verbosity() - This function returns the current logging level.
  • set_verbosity(level) - This function sets the logging level to the given value.

If you are interested in learning about logging in Python, please feel free to check our tutorial on it. It tries to explain the topic with simple and easy-to-understand examples.

Below we have printed the default logging level of Optuna, which is INFO as we said above.

In [75]:
optuna.logging.get_verbosity()
Out[75]:
20
In [76]:
optuna.logging.INFO
Out[76]:
20

Below we have modified the logging level to WARNING from INFO. This will now suppress all messages with level INFO and below. It'll now only print all messages with level WARNING and above.

In [77]:
optuna.logging.set_verbosity(optuna.logging.WARNING)
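The numeric values printed for optuna.logging.INFO (20) and optuna.logging.WARNING (30) are the same as the levels of Python's standard logging module. As a stdlib-only sketch of how such a threshold filters messages, the snippet below sets a logger to WARNING and checks which levels would still be emitted. The logger name verbosity_demo is a made-up name for this illustration and has nothing to do with Optuna's own logger.

```python
import logging

# Python's standard levels: INFO is 20 and WARNING is 30
print(logging.INFO, logging.WARNING)

demo_logger = logging.getLogger("verbosity_demo")  # hypothetical logger name
demo_logger.setLevel(logging.WARNING)

# A message is emitted only if its level is at or above the threshold
for level in (logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR):
    print(logging.getLevelName(level), demo_logger.isEnabledFor(level))
```

With the threshold at WARNING, DEBUG and INFO messages are dropped while WARNING and ERROR messages still get through.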

Below we have run our study object from the multi-layer perceptron section for 5 more trials. We can notice that the info messages about individual trials which were displayed earlier are now suppressed.

In [78]:
%%time

study4.optimize(objective, n_trials=5)
CPU times: user 13.3 s, sys: 5.96 s, total: 19.2 s
Wall time: 5.62 s
In [79]:
optuna.logging.WARNING
Out[79]:
30

This ends our small tutorial explaining how we can use Optuna with scikit-learn models. We also covered various visualizations provided by Optuna as a part of this tutorial. Please feel free to let us know your views in the comments section.

Sunny Solanki