LightGBM is a framework that provides an implementation of gradient boosted decision trees. Gradient boosted decision trees is a type of gradient boosting machine that uses decision trees as the estimators of an ensemble. An ensemble consists of many weak models/estimators (decision trees) whose predictions are combined to make the final prediction.
It was created by a team of researchers and developers at Microsoft.
LightGBM is known for its fast training speed and low memory usage.
LightGBM provides APIs for C, Python, and R.
LightGBM also provides a CLI (Command Line Interface) that lets us use the library from the command line.
LightGBM estimators provide a large set of hyperparameters to tune the model. It also has a large set of optimization/loss functions and evaluation metrics already implemented.
As a part of this tutorial, we have explained how to use the Python library LightGBM to solve machine learning tasks (regression and classification). The tutorial covers the majority of the library's Python API with simple and easy-to-understand examples.
Apart from training models and making predictions, it explains many different concepts like cross-validation, saving & loading models, visualizing feature importances, stopping training early to avoid overfitting, creating custom loss functions, creating custom evaluation metrics, using callbacks, etc.
All our examples have lightgbm models trained on toy datasets (structured - tabular) available from scikit-learn.
The main aim of this tutorial is to make readers aware of the majority of functionalities available through lightgbm and get them started with the framework.
Below, we have listed the important sections of the tutorial to give an overview of the material covered. We know that the list is long, but you can skip the sections that cover theory or repeat examples of concepts explained elsewhere. We have included a NOTE in those sections so you can skip them to complete the tutorial faster. You can then refer to them in your free time or as needed.
We'll start by importing the necessary Python libraries and printing the versions that we have used in our tutorial.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 50)
import lightgbm as lgb
import sklearn
print("LightGBM Version : ", lgb.__version__)
print("Scikit-Learn Version : ", sklearn.__version__)
We'll be using the three datasets mentioned below, which are available from sklearn, for explanation purposes throughout this tutorial.
We have loaded all three datasets one by one below. We have printed the description of each dataset, which gives an overview of its features and size. We have also loaded each dataset as a pandas DataFrame and displayed the first few samples.
from sklearn.datasets import load_boston
boston = load_boston()
for line in boston.DESCR.split("\n")[5:29]:
    print(line)
boston_df = pd.DataFrame(data=boston.data, columns = boston.feature_names)
boston_df["Price"] = boston.target
boston_df.head()
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
for line in breast_cancer.DESCR.split("\n")[5:31]:
    print(line)
breast_cancer_df = pd.DataFrame(data=breast_cancer.data, columns = breast_cancer.feature_names)
breast_cancer_df["TumorType"] = breast_cancer.target
breast_cancer_df.head()
from sklearn.datasets import load_wine
wine = load_wine()
for line in wine.DESCR.split("\n")[5:29]:
    print(line)
wine_df = pd.DataFrame(data=wine.data, columns = wine.feature_names)
wine_df["WineType"] = wine.target
wine_df.head()
LightGBM provides four different estimators to perform classification and regression tasks.
The simplest way to create an estimator in lightgbm is by using the train() method. It takes as input a dictionary of estimator parameters and a training dataset. It then trains the estimator and returns an object of type Booster, which is a trained estimator that can be used to make future predictions.
Below are some of the important parameters of the train() method.
Dataset is lightgbm's internal data structure for holding data and labels. Below are the important parameters of the class.
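To give a feel for how these pieces fit together, below is a minimal sketch (our own illustrative snippet, with made-up variable names and untuned values) that wraps the Boston data in a Dataset and hands it to train() using several commonly used arguments.

demo_ds = lgb.Dataset(
    data=boston.data,                            # numpy array, pandas DataFrame, or file path
    label=boston.target,                         # target values
    feature_name=boston.feature_names.tolist(),  # optional feature/column names
    free_raw_data=False                          # keep raw data in memory after construction
)

demo_booster = lgb.train(
    params={"objective": "regression", "verbosity": -1},  # parameter dictionary
    train_set=demo_ds,                                     # Dataset to train on
    num_boost_round=10,                                    # number of boosting iterations
    valid_sets=[demo_ds],                                  # datasets evaluated after each round
    valid_names=["train"],                                 # names for the valid_sets
    verbose_eval=False                                     # silence per-round evaluation logs
)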
The first problem that we'll solve using lightgbm is a simple regression problem based on the Boston housing dataset which we loaded earlier. We have divided the dataset into train/test sets and created Dataset instances out of them. We have then called the lightgbm.train() method, giving it the train and validation sets. We have set the number of boosting rounds to 10, hence it'll create 10 boosted trees to solve the problem. After training completes, it returns an instance of type Booster which we can later use to make predictions. As we have given the validation set as input, it'll print the validation l2 score after each iteration of training. Please make a note that by default lightgbm minimizes the l2 loss for regression problems.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
train_dataset = lgb.Dataset(X_train, Y_train, feature_name=boston.feature_names.tolist())
test_dataset = lgb.Dataset(X_test, Y_test, feature_name=boston.feature_names.tolist())
booster = lgb.train({"objective": "regression"},
train_set=train_dataset, valid_sets=(test_dataset,),
num_boost_round=10)
Below we have made predictions on the train and test data using the trained booster. We have then calculated the R2 metric for both using sklearn. Please make a note that the predict() method accepts a numpy array, pandas dataframe, scipy sparse matrix, or h2o data table's frame as input for making predictions.
If you are interested in learning the list of available metrics in scikit-learn then please feel free to check our tutorial on the same.
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
The predict() method has a few important parameters which can be used to make different kinds of predictions.
If you are interested in learning about SHAP values, then check our tutorial on the SHAP package, which lets us visualize SHAP values in different ways to understand the model's predictions.
idxs = booster.predict(X_test, pred_leaf=True)
print("Shape : ", idxs.shape)
idxs
shap_vals = booster.predict(X_test, pred_contrib=True)
print("Shape : ", shap_vals.shape)
print("\nShap Values of 0th Sample : ", shap_vals[0])
print("\nPrediction of 0th using SHAP Values : ", shap_vals[0].sum())
print("Actual Prediction of 0th Sample : ", test_preds[0])
We can call the num_trees() method on the booster instance to get the number of trees in the ensemble. Please make a note that if we don't stop training early, the number of trees will be the same as num_boost_round; if training is stopped early, it will differ from num_boost_round. We have explained later in this tutorial how we can stop training when the ensemble's performance stops improving on the validation set.
booster.num_trees()
The booster instance has another important method named feature_importance() which returns the importance of features based on either the gain or the split counts of the trees.
booster.feature_importance(importance_type="gain")
booster.feature_importance(importance_type="split")
In this section, we have explained how we can use the train() method to create a booster for a binary classification problem. We train the model on the breast cancer dataset and later evaluate its accuracy using a metric from sklearn. We have set the objective to binary to inform the train() method that we'll be giving it data for a binary classification problem. We have also set the verbosity parameter to -1 in order to suppress training messages. It'll still print validation set evaluation results, which can be turned off by setting the verbose_eval parameter to False.
Please make a note that for classification problems the predict() method of the booster returns probabilities. We have included logic to convert the probabilities to the target class.
LightGBM evaluates the binary log loss by default on the validation set for binary classification problems. We can set the metric parameter in the dictionary given to the train() method to any metric name available with lightgbm, and it'll evaluate that metric instead. We'll later explain the list of metrics available with lightgbm.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
train_dataset = lgb.Dataset(X_train, Y_train, feature_name=breast_cancer.feature_names.tolist())
test_dataset = lgb.Dataset(X_test, Y_test, feature_name=breast_cancer.feature_names.tolist())
booster = lgb.train({"objective": "binary", "verbosity": -1},
train_set=train_dataset, valid_sets=(test_dataset,),
num_boost_round=10)
from sklearn.metrics import accuracy_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
test_preds = [1 if pred > 0.5 else 0 for pred in test_preds]
train_preds = [1 if pred > 0.5 else 0 for pred in train_preds]
print("\nTest Accuracy Score : %.2f"%accuracy_score(Y_test, test_preds))
print("Train Accuracy Score : %.2f"%accuracy_score(Y_train, train_preds))
NOTE: Please feel free to skip this section if you are in a hurry and have understood how to use LightGBM for classification tasks from our previous binary classification example.
As a part of this section, we have explained how we can use the train() method for multi-class classification problems. We are using it on the wine dataset, which has three different types of wine as the target variable. We have set the objective to multiclass. We need to provide the num_class parameter with an integer specifying the number of classes whenever we use the method for multi-class classification problems.
The predict() method returns the probability of each class for multi-class problems. We have included logic to select the class with the maximum probability as the prediction.
LightGBM evaluates the multi-class log loss by default on the validation set for multi-class classification problems.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(wine.data, wine.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
train_dataset = lgb.Dataset(X_train, Y_train, feature_name=wine.feature_names)
test_dataset = lgb.Dataset(X_test, Y_test, feature_name=wine.feature_names)
booster = lgb.train({"objective": "multiclass", "num_class":3, "verbosity": -1},
train_set=train_dataset, valid_sets=(test_dataset,),
num_boost_round=10)
from sklearn.metrics import accuracy_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
test_preds = np.argmax(test_preds, axis=1)
train_preds = np.argmax(train_preds, axis=1)
print("\nTest Accuracy Score : %.2f"%accuracy_score(Y_test, test_preds))
print("Train Accuracy Score : %.2f"%accuracy_score(Y_train, train_preds))
NOTE: Please feel free to skip this section if you are in a hurry. It is a theoretical section listing parameters of the "train()" method. You can refer to them later as you need to tweak the model.
We'll now list the important parameters of lightgbm which can be provided in a dictionary when calling the train() method. We can provide the same parameters to the estimators (LGBMModel, LGBMRegressor, and LGBMClassifier) readily available in lightgbm, with the only difference that we don't provide them as a dictionary but pass them directly when creating an instance. We'll be introducing those estimators from the next section onwards.
Please make a NOTE that this is not the full list of parameters available with lightgbm, but only a selection of the important ones. If you are interested in learning about all parameters, then please feel free to check the link below.
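As a rough guide, the sketch below shows what a typical parameter dictionary might look like; every key is a standard lightgbm parameter, but the values are purely illustrative and not tuned for any of our datasets.

params = {
    "objective": "regression",      # learning task / loss to optimize
    "metric": ["rmse", "l1"],       # evaluation metric(s) computed on validation sets
    "boosting_type": "gbdt",        # gbdt, dart, goss or rf
    "num_leaves": 31,               # maximum number of leaves per tree
    "max_depth": -1,                # -1 means no depth limit
    "learning_rate": 0.1,           # shrinkage rate
    "min_data_in_leaf": 20,         # minimum number of samples per leaf
    "feature_fraction": 0.9,        # fraction of features considered per tree
    "bagging_fraction": 0.8,        # fraction of rows sampled per iteration
    "bagging_freq": 5,              # perform bagging every k iterations
    "lambda_l1": 0.0,               # L1 regularization
    "lambda_l2": 0.0,               # L2 regularization
    "verbosity": -1                 # suppress training logs
}

# With the native API the dictionary is passed to train(), e.g. lgb.train(params, train_set=...).
# With sklearn-like estimators the same names become keyword arguments, e.g.
# lgb.LGBMRegressor(num_leaves=31, learning_rate=0.1, n_estimators=100).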
The LGBMModel class is a wrapper around the Booster class that provides a scikit-learn-like API for training and prediction in lightgbm. It lets us create an estimator object with a list of parameters as input. We can then call the fit() method with train data for training and the predict() method for making predictions. The parameters which we gave as a dictionary to the params argument of train() can now be given directly to the constructor of LGBMModel. LGBMModel lets us perform both classification and regression tasks by specifying the objective of the task.
Below we have explained with a simple example how we can use LGBMModel to perform regression tasks with the Boston housing data. We have first created an instance of LGBMModel with the objective set to regression and the number of trees set to 10. The n_estimators parameter is an alias of the num_boost_round parameter of the train() method.
We have then called the fit() method to train the model, giving it the train data. Please make a note that it accepts numpy arrays as input and not a lightgbm Dataset object. We have also given a dataset to be used as an evaluation set and metrics to be evaluated on it. The parameters of the fit() method are almost the same as those of the train() method.
At last, we have called the predict() method to make predictions.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective="regression", n_estimators=10,)
booster.fit(X_train, Y_train, eval_set=[(X_test, Y_test),], eval_metric="rmse")
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
Below we have explained with a simple example how we can use LGBMModel for classification tasks. We have trained the model on the breast cancer dataset. Please make a note that the predict() method returns probabilities. We have included logic to calculate the class from the probabilities.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective="binary", n_estimators=10,)
booster.fit(X_train, Y_train, eval_set=[(X_test, Y_test),])
from sklearn.metrics import accuracy_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
test_preds = [1 if pred > 0.5 else 0 for pred in test_preds]
train_preds = [1 if pred > 0.5 else 0 for pred in train_preds]
print("\nTest Accuracy Score : %.2f"%accuracy_score(Y_test, test_preds))
print("Train Accuracy Score : %.2f"%accuracy_score(Y_train, train_preds))
LGBMRegressor is another wrapper estimator around the Booster class provided by lightgbm which has the same API as sklearn estimators. As its name suggests, it's designed for regression tasks. LGBMRegressor is almost the same as LGBMModel, with the only difference that it's designed solely for regression tasks. Below we have explained the usage of LGBMRegressor with a simple example using the Boston housing dataset. Please make a note that LGBMRegressor provides a score() method which evaluates the R2 score for us, which we have been computing with sklearn's metric function until now.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMRegressor(objective="regression_l2", n_estimators=10,)
booster.fit(X_train, Y_train, eval_set=[(X_test, Y_test),], eval_metric=["rmse", "l2", "l1"])
print("\nTest R2 Score : %.2f"%booster.score(X_train, Y_train))
print("Train R2 Score : %.2f"%booster.score(X_test, Y_test))
LGBMClassifier is one more wrapper estimator around the Booster class that provides a sklearn-like API for classification tasks. It works exactly like LGBMModel but for only classification tasks. It also provides a score() method which evaluates the accuracy of data passed to it.
Please make a note that LGBMClassifier predicts actual class labels for the classification tasks with the predict() method. It provides the predict_proba() method if we want probabilities of target classes.
Below we have explained with a simple example how we can use LGBMClassifier for binary classification tasks. We have explained its usage with the breast cancer dataset.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMClassifier(objective="binary", n_estimators=10)
booster.fit(X_train, Y_train, eval_set=[(X_test, Y_test),])
print("\nTest Accuracy Score : %.2f"%booster.score(X_test, Y_test))
print("Train Accuracy Score : %.2f"%booster.score(X_train, Y_train))
NOTE: Please feel free to skip this section if you are in a hurry and have understood how to use LightGBM for classification tasks from our previous binary classification example.
Below we have explained the usage of LGBMClassifier for multi-class classification tasks using the Wine classification dataset.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(wine.data, wine.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMClassifier(objective="multiclassova", n_estimators=10, num_class=3)
booster.fit(X_train, Y_train, eval_set=[(X_test, Y_test),])
print("\nTest Accuracy Score : %.2f"%booster.score(X_test, Y_test))
print("Train Accuracy Score : %.2f"%booster.score(X_train, Y_train))
Please make a note that LGBMModel, LGBMRegressor, and LGBMClassifier provide an attribute named 'booster_' which returns an instance of the Booster class, which we can save to disk after training and later load for prediction.
booster.booster_
We'll now explain how we can save a trained model to disk for later use in prediction. Lightgbm provides the below-mentioned methods for saving and loading models.
Below we have explained with simple examples how we can use the above-mentioned methods to save models to disk and then load them.
Please make a note that in order to save a model trained using LGBMModel, LGBMRegressor, or LGBMClassifier, we first need to get its Booster instance using the booster_ attribute of the estimator and then save it. LGBMModel, LGBMRegressor, and LGBMClassifier do not provide saving and loading functionality; it's only available on the Booster instance.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
train_dataset = lgb.Dataset(X_train, Y_train, feature_name=boston.feature_names.tolist())
test_dataset = lgb.Dataset(X_test, Y_test, feature_name=boston.feature_names.tolist())
booster = lgb.train({"objective": "regression", "verbosity": -1},
train_set=train_dataset, valid_sets=(test_dataset,),
verbose_eval=False,
feature_name=boston.feature_names.tolist(),
num_boost_round=10)
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
booster.save_model("lgb.model")
loaded_booster = lgb.Booster(model_file="lgb.model")
loaded_booster
from sklearn.metrics import r2_score
test_preds = loaded_booster.predict(X_test)
train_preds = loaded_booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
model_as_str = booster.model_to_string()
with open("booster2.model", "w") as f:
    f.write(model_as_str)
model_str = open("booster2.model").read()
booster_frm_str = lgb.Booster(model_str = model_str)
booster_frm_str
from sklearn.metrics import r2_score
test_preds = booster_frm_str.predict(X_test)
train_preds = booster_frm_str.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
Lightgbm lets us perform cross-validation using the cv() method. It accepts model parameters as a dictionary like the train() method, along with a dataset on which to perform cross-validation. It performs 5-fold cross-validation by default; we can change the number of folds by setting the nfold parameter. It also accepts sklearn data splitters like KFold, StratifiedKFold, ShuffleSplit, and StratifiedShuffleSplit, which can be provided through the folds parameter of the method.
The cv() method returns a dictionary with the mean and standard deviation of the evaluation metric for each round of training. We can also ask the method to return an instance of CVBooster by setting the return_cvbooster parameter to True. The CVBooster object holds the boosters trained on each cross-validation fold.
X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target)
train_dataset = lgb.Dataset(X_train, Y_train, feature_name=breast_cancer.feature_names.tolist())
test_dataset = lgb.Dataset(X_test, Y_test, feature_name=breast_cancer.feature_names.tolist())
lgb.cv({"objective": "binary", "verbosity": -1},
train_set=test_dataset, num_boost_round=10,
nfold=5, stratified=True, shuffle=True,
verbose_eval=True)
from sklearn.model_selection import StratifiedShuffleSplit
cv_output = lgb.cv({"objective": "binary", "verbosity": -1},
train_set=test_dataset, num_boost_round=10,
metrics=["auc", "average_precision"],
folds=StratifiedShuffleSplit(n_splits=3),
verbose_eval=True,
return_cvbooster=True)
for key, val in cv_output.items():
print("\n" + key, " : ", val)
cvbooster = cv_output['cvbooster']
cvbooster.boosters
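The CVBooster mainly wraps the per-fold boosters. One simple way to use it for prediction (our own approach, not a built-in lightgbm method) is to average the predictions of the fold boosters:

fold_preds = np.array([b.predict(X_test) for b in cvbooster.boosters])  # one row per fold
avg_preds = fold_preds.mean(axis=0)                                     # average across folds
print("Averaged Fold Predictions Shape : ", avg_preds.shape)
print("First 5 Averaged Probabilities  : ", avg_preds[:5])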
Lightgbm provides the below-mentioned plotting functions.
The plot_importance() method accepts a booster instance and plots feature importances from it. Below we have created a feature importance plot using the booster trained earlier for the regression task. The method has a parameter named importance_type: when set to the string 'split' (the default), it plots the number of times each feature was used for splitting, and when set to 'gain', it plots the total gain of the splits that use the feature. The plot_importance() method has another important parameter, max_num_features, which accepts an integer specifying how many features to include in the plot; only that many top features will be shown.
lgb.plot_importance(booster, figsize=(8,6));
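Below is a small variation of the same plot that uses gain-based importance and limits the chart to the top 5 features (importance_type and max_num_features are standard plot_importance() parameters).

lgb.plot_importance(booster, importance_type="gain", max_num_features=5, figsize=(8,6));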
The plot_metric() method plots the results of an evaluation metric. We need to give the method a booster instance in order to plot an evaluation metric evaluated on the evaluation dataset.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective="regression", n_estimators=10,)
booster.fit(X_train, Y_train,
eval_set=[(X_test, Y_test),], eval_metric="rmse", eval_names = ["Validation Set"],
feature_name=boston.feature_names.tolist()
)
lgb.plot_metric(booster, figsize=(8,6));
lgb.plot_metric(booster, metric="rmse", figsize=(8,6));
The plot_split_value_histogram() method takes as input a booster instance and a feature name/index. It then plots a split value histogram for the feature.
lgb.plot_split_value_histogram(booster, feature="LSTAT", figsize=(8,6));
The plot_tree() method lets us plot an individual tree of the ensemble. We need to give it a booster instance and the index of the tree which we want to plot.
lgb.plot_tree(booster, tree_index = 1, figsize=(20,12));
Early stopping is a process where we stop training if the evaluation metric evaluated on the evaluation dataset has not improved for a specified number of rounds. Lightgbm provides a parameter named early_stopping_rounds as a part of the train() method as well as the fit() method of lightgbm's sklearn-like estimators. This parameter accepts an integer specifying that training should stop if the evaluation metric has not improved for that many rounds.
Please make a note that we need an evaluation dataset for this to work, as it's based on the evaluation metric computed on that dataset.
Below we have explained the usage of the parameter early_stopping_rounds for regression and classification tasks with simple examples.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
train_dataset = lgb.Dataset(X_train, Y_train, feature_name=boston.feature_names.tolist())
test_dataset = lgb.Dataset(X_test, Y_test, feature_name=boston.feature_names.tolist())
booster = lgb.train({"objective": "regression", "verbosity": -1, "metric": "rmse"},
train_set=train_dataset, valid_sets=(test_dataset,),
early_stopping_rounds=5,
num_boost_round=100)
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective="binary", n_estimators=100, metric="auc")
booster.fit(X_train, Y_train,
eval_set=[(X_test, Y_test),],
early_stopping_rounds=3)
from sklearn.metrics import accuracy_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
test_preds = [1 if pred > 0.5 else 0 for pred in test_preds]
train_preds = [1 if pred > 0.5 else 0 for pred in train_preds]
print("\nTest Accuracy Score : %.2f"%accuracy_score(Y_test, test_preds))
print("Train Accuracy Score : %.2f"%accuracy_score(Y_train, train_preds))
Lightgbm also provides early stopping functionality through the early_stopping() callback function. We can give the number of rounds to the early_stopping() function and pass the resulting callback to the callbacks parameter of the train()/fit() method. We have explained callbacks in an upcoming section.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective="binary", n_estimators=100, metric="auc")
booster.fit(X_train, Y_train,
eval_set=[(X_test, Y_test),],
callbacks=[lgb.early_stopping(3)]
)
from sklearn.metrics import accuracy_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
test_preds = [1 if pred > 0.5 else 0 for pred in test_preds]
train_preds = [1 if pred > 0.5 else 0 for pred in train_preds]
print("\nTest Accuracy Score : %.2f"%accuracy_score(Y_test, test_preds))
print("Train Accuracy Score : %.2f"%accuracy_score(Y_train, train_preds))
When lightgbm trains the trees of the ensemble on a dataset, each internal node of a tree represents a condition on the value of some feature. When we make predictions using an individual tree, we start from the root node, check the feature condition specified at that node against our sample's feature values, and follow the corresponding branch. Repeating this at each node, we follow a particular path to a leaf of the tree, which gives the final prediction. By default, there is no restriction on which features can appear together along such a path. This process of making a decision by going through tree nodes and checking feature conditions is called feature interaction, because the predictor reaches a particular node only after evaluating the condition of the previous node. Lightgbm lets us define restrictions on which features are allowed to interact with which other features. We can give lists of feature indices, and only the features within the same list will be allowed to interact with one another; they won't be allowed to interact with features outside their list, and this restriction is enforced when creating trees during training.
Below we have explained with a simple example how we can force a feature interaction constraint on an estimator in lightgbm. Lightgbm estimators provide a parameter named interaction_constraints which accepts a list of lists, where each inner list holds the indices of features that are allowed to interact with one another.
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")
train_dataset = lgb.Dataset(X_train, Y_train, feature_name=boston.feature_names.tolist())
test_dataset = lgb.Dataset(X_test, Y_test, feature_name=boston.feature_names.tolist())
booster = lgb.train({"objective": "regression", "verbosity": -1, "metric": "rmse",
'interaction_constraints':[[0,1,2,11,12], [3, 4],[6,10], [5,9], [7,8]]},
train_set=train_dataset, valid_sets=(test_dataset,),
num_boost_round=10)
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective="regression", n_estimators=10,
interaction_constraints = [[0,1,2,11,12], [3, 4],[6,10], [5,9], [7,8]])
booster.fit(X_train, Y_train,
eval_set=[(X_test, Y_test),], eval_metric="rmse",
)
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
Lightgbm lets us specify monotonic constraints on a model, which state whether an individual feature has an increasing, decreasing, or no relationship with the target value. The values -1, 0, and 1 force the model to impose a decreasing, no, and increasing relationship, respectively, between the feature and the target. We can provide a list with the same length as the number of features, specifying 1, 0, or -1 for each feature, using the monotone_constraints parameter. We have explained below with a simple example how we can enforce monotonic constraints in lightgbm.
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")
train_dataset = lgb.Dataset(X_train, Y_train, feature_name=boston.feature_names.tolist())
test_dataset = lgb.Dataset(X_test, Y_test, feature_name=boston.feature_names.tolist())
booster = lgb.train({"objective": "regression", "verbosity": -1, "metric": "rmse",
'monotone_constraints':(1,0,1,-1,1,0,1,0,-1,1,1, -1, 1)},
train_set=train_dataset, valid_sets=(test_dataset,),
num_boost_round=10)
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective="regression", n_estimators=10,
monotone_constraints = (1,0,1,-1,1,0,1,0,-1,1,1, -1, 1))
booster.fit(X_train, Y_train,
eval_set=[(X_test, Y_test),], eval_metric="rmse",
)
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
Lightgbm lets us define a custom objective function as well. We need to define a function that takes the actual labels and predictions as input and returns the first derivative (gradient) and second derivative (hessian) of the loss function evaluated on them. We can give this custom objective/loss function to the objective parameter of the sklearn-like estimators, which call it with the signature (y_true, y_pred). If we are using the train() method instead, the function goes to the fobj parameter and is called with the predictions and the train Dataset.
Below we have designed a mean squared error objective function. We have then given this function to the objective parameter of LGBMModel as a demonstration.
def first_grad(y_true, y_pred):
    '''First derivative (gradient) of squared error loss with respect to predictions.'''
    return 2 * (y_pred - y_true)

def second_grad(y_true, y_pred):
    '''Second derivative (hessian) of squared error loss with respect to predictions.'''
    return np.full(len(y_pred), 2.0)

def mean_sqaured_error(y_true, y_pred):
    '''Custom squared error objective for sklearn-like estimators: objective(y_true, y_pred) -> grad, hess.'''
    grad = first_grad(y_true, y_pred)
    hess = second_grad(y_true, y_pred)
    return grad, hess
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective=mean_sqaured_error, n_estimators=10,)
booster.fit(X_train, Y_train, eval_set=[(X_test, Y_test),], eval_metric="rmse")
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
Lightgbm lets us define our own evaluation metric if we don't want to use the evaluation metrics available with lightgbm. We need to define a function that takes predictions and actual target values as input and returns a tuple of the metric name, the metric value, and a boolean specifying whether higher is better. The boolean should be True if we want the metric value to be maximized and False if we want it to be minimized.
We need to give a reference to this function as the value of the feval parameter if we are using the train() method. If we are using a sklearn-like estimator, then we need to give this function to the eval_metric parameter of the fit() method.
Below we have explained with simple examples how we can use custom evaluation metrics with lightgbm.
def mean_absolute_error(preds, dmat):
    '''Custom MAE evaluation metric: returns (metric name, metric value, is_higher_better).'''
    actuals = dmat.get_label() if isinstance(dmat, lgb.Dataset) else dmat
    err = np.abs(actuals - preds).mean()
    is_higher_better = False
    return "MAE", err, is_higher_better
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")
train_dataset = lgb.Dataset(X_train, Y_train, feature_name=boston.feature_names.tolist())
test_dataset = lgb.Dataset(X_test, Y_test, feature_name=boston.feature_names.tolist())
booster = lgb.train({"objective": "regression", "verbosity": -1, "metric": "rmse"},
feval=mean_absolute_error,
train_set=train_dataset, valid_sets=(test_dataset,),
num_boost_round=10)
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective=mean_sqaured_error, n_estimators=10,)
booster.fit(X_train, Y_train, eval_set=[(X_test, Y_test),], eval_metric=mean_absolute_error)
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
Lightgbm provides users with a list of callback functions for different purposes, which get executed after each iteration of training. Below is a list of the callback functions available with lightgbm:
The callbacks parameter which is available with the train() method and the fit() method of estimators accepts a list of callback functions.
Below we have explained with simple examples how we can use different callback functions. The early_stopping() callback function has already been covered in the early stopping section of this tutorial.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective=mean_sqaured_error, n_estimators=10,)
booster.fit(X_train, Y_train,
eval_set=[(X_test, Y_test),], eval_metric="rmse", verbose=False,
callbacks=[lgb.callback.print_evaluation(period=3)])
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective=mean_sqaured_error, n_estimators=10,)
evals_results = {}
booster.fit(X_train, Y_train,
eval_set=[(X_test, Y_test),], eval_metric="rmse", verbose=False,
callbacks=[lgb.print_evaluation(period=3), lgb.record_evaluation(evals_results)])
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
print("Evaluation Results : ", evals_results)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target)
print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
booster = lgb.LGBMModel(objective=mean_sqaured_error, n_estimators=10,)
booster.fit(X_train, Y_train,
eval_set=[(X_test, Y_test),], eval_metric="rmse",
callbacks=[lgb.reset_parameter(learning_rate=np.linspace(0.1,1,10).tolist())])
from sklearn.metrics import r2_score
test_preds = booster.predict(X_test)
train_preds = booster.predict(X_train)
print("\nTest R2 Score : %.2f"%r2_score(Y_test, test_preds))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds))
This ends our tutorial explaining the Python API of LightGBM end to end.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.