Updated On : Nov-23,2020 Tags xgboost
XGBoost - An In-Depth Guide [Python]

XGBoost - An In-Depth Guide [Python]

Table of Contents

XGBoost (Extreme Gradient Boosting)

Xgboost is a machine learning library that implements the gradient boosting trees concept. It's designed to be quite fast compared to the implementation available in sklearn. Xgboost lets us handle a large amount of data that can have samples in billions with ease. It can run in parallel and distributed environments to speed up the training process. The distributed algorithm can be useful if data does not fit into to main memory of the machine. Currently, it has support for dask to run the algorithm in a distributed environment. Xgboost even supports running an algorithm on GPU with simple configuration which will complete quite fast compared to when run on CPU. Xgboost provides API in C, C++, Python, R, Java, Julia, Ruby, and Swift. Xgboost code can be run on a distributed environment like AWS YARN, Hadoop, etc. It even provides an interface to run the algorithm from the command line/shell. Apart from this, xgboost provides support for controlling feature interactions, custom evaluation functions, callbacks during training, monotonic constraints, etc. As a part of this tutorial, we'll explain the API of xgboost and its various features through different examples. We'll try to cover the majority of features available from xgboost to make this tutorial a short reference to master xgboost API.

We'll start by importing the necessary libraries which we'll use as a part of this tutorial.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import warnings

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 50)

import xgboost as xgb
import sklearn

print("XGB Version          : ", xgb.__version__)
print("Scikit-Learn Version : ", sklearn.__version__)
XGB Version          :  1.2.1
Scikit-Learn Version :  0.21.2

Load Datasets

We’ll be using the below-mentioned three different datasets which are available from sklearn as a part of this tutorial for explanation purposes.

  • Boston Housing Dataset: It's a regression problem dataset which has information about a various attribute of houses in Boston and their price in dollar. This will be used for regression tasks.
  • Breast Cancer Dataset: It's a classification dataset which has information about two different types of tumor. It'll be used for explaining binary classification tasks.
  • Wine Dataset - It's a classification dataset which has information about ingredients used in three different types of wines. It'll be used for explaining multi-class classification tasks.

We have loaded all three datasets mentioned one by one below. We are printing descriptions of datasets which gives us an overview of dataset features and size. We have even loaded each dataset as a pandas data frame and displayed the first few samples of data.

Boston Housing Dataset

In [2]:
from sklearn.datasets import load_boston

boston = load_boston()

for line in boston.DESCR.split("\n")[5:29]:
    print(line)

boston_df = pd.DataFrame(data=boston.data, columns = boston.feature_names)
boston_df["Price"] = boston.target

boston_df.head()
**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

Out[2]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT Price
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

Breast Cancer Dataset

In [3]:
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

for line in breast_cancer.DESCR.split("\n")[5:31]:
    print(line)

breast_cancer_df = pd.DataFrame(data=breast_cancer.data, columns = breast_cancer.feature_names)
breast_cancer_df["TumorType"] = breast_cancer.target

breast_cancer_df.head()
**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign
Out[3]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compactness error concavity error concave points error symmetry error fractal dimension error worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension TumorType
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

Wine Dataset

In [4]:
from sklearn.datasets import load_wine

wine = load_wine()

for line in wine.DESCR.split("\n")[5:29]:
    print(line)

wine_df = pd.DataFrame(data=wine.data, columns = wine.feature_names)
wine_df["WineType"] = wine.target

wine_df.head()
**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2

Out[4]:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline WineType
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0 0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0 0

Core API

As a part of this section, we'll explain the core API of xgboost which will have an explanation for different machine learning estimators available with the library. We'll even explain the parameters of these estimators as well as important attributes and methods available through them.

Booster: Regression

We'll start with the creation of a simple estimator for the regression task of predicting prices of houses in Boston. We'll explain how we can use API to create an estimator with default parameters which will just work fine. We'll then explain various parameters available for a different purpose.

We'll first divide Boston dataset into train (90%) and test (10%) datasets using sklearn's function train_test_split().

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
Out[5]:
((455, 13), (51, 13), (455,), (51,))

Xgboost default API only accepts a dataset that is wrapped in DMatrix. DMatrix is an internal data structure of xgboost which wraps data features and labels both into it. It's designed to be efficient and fastens the training process.

DMatrix()

We can create a DMatrix instance by setting a list of the below parameters. Only the data parameter is required and all others are optional.

  • data - This parameter accepts one of the below as input which has values for data features.
    • pandas dataframe
    • numpy array
    • scipy sparse matrix
    • path to libsvm format text file
    • libsvm format text
  • label - It accepts a numpy array of pandas data frame containing labels of the dataset.
  • missing - It accepts float value in the dataset which should be treated as a float value. The default is None mean np.nan is considered missing.
  • feature_names - It accepts a list of string specifying feature names of data.
  • feature_types - It accepts a list of string specifying feature data types.
  • nthread - It accepts integer specifying the number of threads to use when loading data. The value of -1 uses all available threads on the system.

Below we have created train DMatrix and test DMatrix using numpy arrays of features data and labels. We have also passed feature names to the constructor.

In [6]:
dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=boston.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=boston.feature_names)

dmat_train, dmat_test
Out[6]:
(<xgboost.core.DMatrix at 0x7f7416cf0240>,
 <xgboost.core.DMatrix at 0x7f7416cf0208>)

The simplest way of creating a booster using xgboost is by calling the train() method of xgboost. The train() method returns an instance of class xgboost.core.Booster after training is completed. We need to pass parameters for boosting algorithm as a dictionary to train method.

train()

Below we have given a list of important parameters of the train() method. Only params and dtrain are required and all other parameters are optional and have default values set to them.

  • params - It accepts a dictionary of gradient boosting algorithm parameters. We can give it an even empty dictionary and it'll take the default value for all parameters. By default, it'll consider the task to be a regression task and will calculate RMSE loss. We need to specify at least an objective function if we want it to consider a classification task for the data.
  • dtrain - It accepts DMatrix instances of train data.
  • num_boost_round - It accepts integer specifying the number of rounds of the training process. The algorithm will iterate over whole training data that many times.
  • evals - We can provide a list of tuples specifying datasets to be used for evaluation when performing training. We have passed our train and test datasets as evaluation sets hence RMSE for each will be printed after all iterations.
  • obj - We can give customized objective function which will be maximized/minimized when training algorithm.
  • feval - We can give a customized evaluation function which will be used to evaluate datasets given to evals.
  • maximize - It accepts a boolean specifying whether to maximize or minimize our objective/loss function.
  • early_stopping_rounds - It accepts an integer which instructs the algorithm to stop training if the last eval set in the list has not improved for that many rounds. If the objective/loss of the last eval dataset has not improved for that many consecutive rounds of training then the training process will stop. This parameter requires us to provide an evals parameter for it to work.
  • evals_result - We can provide an empty dictionary to this parameter and it'll store evaluation results into it.
  • verbose_eval - It accepts bool or integer specifying whether to print evaluation results. The integer value greater than 0 will print evaluation results at every that many iterations.
  • callbacks - It accepts a list of callbacks that are applied at the end of each iteration of the training process.

Below we have called the train() method of xgboost by passing it a few parameters for boosting algorithm, train data for training, and evaluation set of training and test dataset on which evaluation after each iteration will happen.

In [7]:
booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:squarederror'},
                    dmat_train,
                    evals=[(dmat_train, "train"), (dmat_test, "test")])

booster
[0]	train-rmse:3.94894	test-rmse:3.59159
[1]	train-rmse:3.37195	test-rmse:3.26373
[2]	train-rmse:3.09769	test-rmse:3.12218
[3]	train-rmse:2.78200	test-rmse:2.94107
[4]	train-rmse:2.53499	test-rmse:2.75222
[5]	train-rmse:2.37140	test-rmse:2.78515
[6]	train-rmse:2.23286	test-rmse:2.64519
[7]	train-rmse:2.16047	test-rmse:2.64290
[8]	train-rmse:2.03129	test-rmse:2.58895
[9]	train-rmse:1.96511	test-rmse:2.61442
Out[7]:
<xgboost.core.Booster at 0x7f7416cedcf8>

We can use the predict() method of booster instance to predict labels for data passed to it. The predict() method requires us to pass the DMatrix instance only.

predict()

The predict method provides a list of the below important parameters that can be useful in a different situation.

  • data - It accepts DMatrix of feature values.
  • ntree_limit - It accepts integer specifying the number of trees to use from the total tree to make a prediction. The default is 0 which means to use all trees.
  • pred_leaf - It accepts boolean which is set to True returns array of size n_samples x n_trees where each entry is an index of leaf in a tree which was used for prediction. The entry (0,1) refers to an index of leaf for the 2nd tree which was used to make a prediction for the first sample. The default is False.
  • pred_contribs - It accepts boolean which if set to True returns an array of size n_sample x n_features+1 where each entry specifies contributions of features in making a final prediction for that sample. It's referred to as SHAP values. If we add all values for a particular sample then we can get the actual prediction. The default is False.
  • pred_interactions - It accepts boolean which if set to True returns array of size n_sample x n_features+1 xn_features+1 indicating features SHAP interaction values for each sample.

Below we have created a data frame showing the first 10 actual test labels and 10 predicted labels for test data.

In [8]:
pd.DataFrame({ "Actuals":Y_test[:10], "Prediction":booster.predict(dmat_test)[:10]})
Out[8]:
Actuals Prediction
0 23.6 25.580267
1 32.4 31.743393
2 13.6 13.508162
3 22.8 23.470869
4 16.1 13.658171
5 20.0 22.350372
6 17.8 17.217281
7 14.0 14.332675
8 19.6 20.501831
9 16.8 20.756474

Below we have retrieved shap values for our test samples. We have even summed up shap values for each sample to calculate the final prediction which is the same as the actual prediction printed above.

If you are interested in learning about the SHAP python library which provides various methods for calculating SHAP values and different types of plots to interpret them then please feel free to check our tutorial on the same.

In [9]:
shap_values = booster.predict(dmat_test, pred_contribs=True)

print("SHAP Values Size : ", shap_values.shape)

print("\nSample SHAP Values : ",shap_values[0])
print("\nSumming SHAP Values for Prediction : ",shap_values.sum(axis=1)[:5]) # First 5 preds are only printed
SHAP Values Size :  (51, 14)

Sample SHAP Values :  [ 5.31424880e-01  0.00000000e+00  3.62157822e-04  1.90089308e-02
  1.01445103e+00 -2.51514196e+00 -5.74439168e-01  2.83589065e-01
  4.30885423e-03 -2.59072632e-01  3.69396627e-01  1.22908555e-01
  3.89043856e+00  2.26930332e+01]

Summing SHAP Values for Prediction :  [25.580269 31.743395 13.508162 23.470871 13.658173]
In [10]:
booster.predict(dmat_test, pred_leaf=True)[:5]
Out[10]:
array([[ 8, 11, 11, 13, 11,  8, 11,  5,  9, 13],
       [ 8,  7, 11,  8, 12,  8,  8,  5,  7, 13],
       [13, 11, 11, 12,  7, 13, 11,  6, 13, 11],
       [ 8, 11, 11, 13,  7, 12,  8,  5,  7, 13],
       [14, 11, 11, 14, 11, 13, 11,  5, 13, 14]], dtype=int32)
In [11]:
shap_interactions = booster.predict(dmat_test, pred_interactions=True)

print("SHAP Interactions Size : ", shap_interactions.shape)
SHAP Interactions Size :  (51, 14, 14)

We can explicitly evaluate the dataset using a trained booster instance with the help of the eval() method. It'll evaluate the dataset and return an objective function value for it. below we are using the eval() method on the train and test DMatrix to get RMSE for both.

In [12]:
print("Train RMSE : ",booster.eval(dmat_train))
print("Test  RMSE : ",booster.eval(dmat_test))
Train RMSE :  [0]	eval-rmse:1.965108
Test  RMSE :  [0]	eval-rmse:2.614419

Below we have evaluated the R2 score for train and test datasets using the r2_score() function of sklearn. We have then evaluated the R2 score based on using only 5 trees from the ensemble rather than using all trees.

In [13]:
from sklearn.metrics import r2_score

print("Test  R2 Score : %.2f"%r2_score(Y_test, booster.predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, booster.predict(dmat_train)))
Test  R2 Score : 0.89
Train R2 Score : 0.96
In [14]:
print("Number of Trees in Ensemble : ",booster.best_ntree_limit)

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, booster.predict(dmat_test, ntree_limit=5)))
print("Train R2 Score : %.2f"%r2_score(Y_train, booster.predict(dmat_train, ntree_limit=5)))
Number of Trees in Ensemble :  10

Test  R2 Score : 0.88
Train R2 Score : 0.93

plot_importance()

The xgboost provides functionality that lets us print feature importance. We need to pass our booster instance to the method and it'll plot feature importance bar chart using matplotlib. The plot_importance() method has an important parameter named importance_type which accepts one of the below mentioned 3 string values to plot feature importance in three different ways.

  • weight - It plots the number of times a feature appears in a tree. This is the default value.
  • gain - It plots the average gain of splits that uses the feature.
  • cover - It plots the average coverage of splits for each feature.
In [ ]:
with plt.style.context("ggplot"):
    fig = plt.figure(figsize=(9,6))
    ax = fig.add_subplot(111)
    xgb.plotting.plot_importance(booster, ax=ax, height=0.6, importance_type="weight")

XGBoost - An In-Depth Guide

plot_tree()

Xgboost also lets us plot the individual trees in the ensemble of trees using the plot_tree() method. It accepts booster instance and index of a tree which we want to plot. Below we have plotted the 10th tree of an ensemble. Please make a note that indexing starts at 0.

In [ ]:
with plt.style.context("ggplot"):
    fig = plt.figure(figsize=(25,10))
    ax = fig.add_subplot(111)
    xgb.plotting.plot_tree(booster, ax=ax, num_trees=9)

XGBoost - An In-Depth Guide

get_split_value_histogram()

The get_split_value_histogram() method returns histogram of splits for feature values. Below we have created split values histogram for feature LSTAT of data. It gives us value and how many times a split has happened at that value.

In [17]:
booster.get_split_value_histogram("LSTAT")
Out[17]:
SplitValue Count
0 7.182500 2.0
1 9.530000 1.0
2 11.877500 4.0
3 16.572500 2.0
4 18.920001 1.0
5 21.267501 1.0
6 30.657501 1.0
7 33.005001 1.0

trees_to_dataframe()

The trees_to_dataframe() method will dump information on trees used in an ensemble as a pandas dataframe. It'll have information on each tree-like individual node ids, feature name, and its values used for a split at each node, gain at each node, cover at each node, etc.

In [18]:
booster.trees_to_dataframe()
Out[18]:
Tree Node ID Feature Split Yes No Missing Gain Cover
0 0 0 0-0 LSTAT 9.72500 0-1 0-2 0-1 16866.609400 455.0
1 0 1 0-1 RM 6.94100 0-3 0-4 0-3 6006.859380 196.0
2 0 2 0-2 LSTAT 16.21500 0-5 0-6 0-5 2317.007810 259.0
3 0 3 0-3 DIS 1.48495 0-7 0-8 0-7 562.812500 129.0
4 0 4 0-4 RM 7.43700 0-9 0-10 0-9 496.929688 67.0
... ... ... ... ... ... ... ... ... ... ...
133 9 10 9-10 Leaf NaN NaN NaN NaN 0.277203 2.0
134 9 11 9-11 Leaf NaN NaN NaN NaN 0.477180 104.0
135 9 12 9-12 Leaf NaN NaN NaN NaN -1.080109 11.0
136 9 13 9-13 Leaf NaN NaN NaN NaN 0.046793 249.0
137 9 14 9-14 Leaf NaN NaN NaN NaN -0.831588 78.0

138 rows × 10 columns

Important Parameters of Boosting

Below we have given a list of important parameters of the boosting algorithm which we can pass as a dictionary to the params parameter.

  • booster - It specifies which gradient boosting algorithm to use for training. Below is a list of possible options.
    • gbtree - It’s a tree-based algorithm. Default.
    • gblinear - It’s a linear function based algorithm.
    • dart - It’s a tree-based algorithm.
  • eta - It accepts float [0,1] specifying learning rate for training process. Default = 0.3
  • tree_method - It accepts string specifying tree construction algorithm. Below is a list of possible options.

    • auto - It automatically decides the algorithm based on dataset size. For the small datasets, it uses exact and for larger datasets approx.
    • exact - It specifies the exact greedy algorithm. It tries all possible splits to create trees.
    • approx - It’s an approximate greedy algorithm that uses quantile sketch and gradient histogram.
    • hist - It’s an approximate greedy algorithm optimized using a faster histogram.
    • gpu_hist - Its a GPU implementation of hist.
  • max_depth - It accepts integer specifying the maximum depth of the tree. The default is 6.

  • gamma - It accepts float specifying minimum loss required to make a further partition on a particular node of the tree during training. The default is 0.
  • subsample - It accepts float in the range (0,1] specifying sub-sample ration of training samples. The value of 0.5 will result in taking half of the sample randomly before training starts which can help prevent overfitting.
  • sampling_method - This parameter accepts one of the below string as a sampling method to draw sub-samples.

    • uniform - Default
    • gradient_based
  • lambda - It accepts float specifying L2 regularization term on weights. The default is 1.

  • alpha - It accepts float specifying L1 regularization term on weights. The default is 0.
  • max_bin - It accepts integer specifying the number of bins to bucket continuous features. The default is 256. The more value improves split quality at the expense of more computation time.
  • monotone_constraints - It accepts tuple of integers of length n_features. Each entry in tuple has a value either 1,0 or -1 specifying increasing, none, or decreasing monotone relation of a feature with the target. It only works with tree_method set to one of the exact, hist or gpu_hist.
  • interaction_constraints - It accepts a list of the list each individual list represents indexes of features that are allowed to interact when creating a tree to make the final prediction. If we don't provide this constraint then all features are allowed to interact with one another. We can restrict feature interaction using this parameter.
  • tweedie_variance_power - It accepts float in the range (1,2) that controls variance of Tweedie distribution. The default value is 1.5.
  • objective - It accepts string specifying objective/loss function to use for training. The default value is reg:squarederror. Below are some of the commonly used values. Please visit this link to check a list of all objective functions available.

    • reg:squarederror
    • reg:squaredlogerror
    • reg:logistic - Logistic Regression
    • binary:logistic - Logistic Regression for Binary Classification. Outputs probability.
    • multi:softmax - Multi-Class classification using softmax function.
    • multi:softprob - It’s the same as softmax but outputs probability.
    • reg:tweedie - It’s tweedie regression with log-link.
  • eval_metric - It accepts string value specifying metric which will be used to evaluate evaluation sets passed to evals parameter. Below is a list of commonly used values. Please visit this link to check a list of all evaluation metrics available.

    • rmse - Root Mean Squared Error
    • rmsle - Root Mean Squared Log Error
    • mae - Mean Absolute Error
    • logloss - Negative Log-likelihood
    • auc - Area Under Curve ROC
    • error - Binary Classification error rate (no_wrong_preds/total_samples).
  • num_class - It's an integer specifying number of class for multi-class classification problem. We need to provide this when the objective is set to multi:softmax or multi:softprob.
  • nthread - It specifies the number of threads to use to run xgboost.
  • verbosity - It accepts one of the below integers for printing messages during training.
    • 0 - Silent
    • 1 - Warning
    • 2 - Info
    • 3 - Debug

Please make a note that this is not a list of all parameters for estimator but a list of important parameters that are commonly tuned by practitioners. Please visit the below link to know about all possible parameters available with xgboost.

Booster: Tweedie Regression

Below we have explained an example of how we can use tweedie regression on Boston housing data. We have trained the model using tweedie regression and then evaluated RMSE and R2 scores on both train and test datasets.

In [19]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=boston.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=boston.feature_names)

tweedie_booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:tweedie', 'tree_method':'hist', 'nthread':4},
                    dmat_train,
                    evals=[(dmat_train, "train"), (dmat_test, "test")])

print("\nTrain RMSE : ",tweedie_booster.eval(dmat_train))
print("Test  RMSE : ",tweedie_booster.eval(dmat_test))

from sklearn.metrics import r2_score

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, tweedie_booster.predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, tweedie_booster.predict(dmat_train)))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	train-tweedie-nloglik@1.5:28.32970	test-tweedie-nloglik@1.5:26.66488
[1]	train-tweedie-nloglik@1.5:19.30740	test-tweedie-nloglik@1.5:18.58394
[2]	train-tweedie-nloglik@1.5:18.72894	test-tweedie-nloglik@1.5:18.14010
[3]	train-tweedie-nloglik@1.5:18.71592	test-tweedie-nloglik@1.5:18.13065
[4]	train-tweedie-nloglik@1.5:18.70913	test-tweedie-nloglik@1.5:18.12305
[5]	train-tweedie-nloglik@1.5:18.70438	test-tweedie-nloglik@1.5:18.12354
[6]	train-tweedie-nloglik@1.5:18.70052	test-tweedie-nloglik@1.5:18.11985
[7]	train-tweedie-nloglik@1.5:18.69816	test-tweedie-nloglik@1.5:18.12131
[8]	train-tweedie-nloglik@1.5:18.69564	test-tweedie-nloglik@1.5:18.12422
[9]	train-tweedie-nloglik@1.5:18.69303	test-tweedie-nloglik@1.5:18.12833

Train RMSE :  [0]	eval-tweedie-nloglik@1.5:18.693033
Test  RMSE :  [0]	eval-tweedie-nloglik@1.5:18.128325

Test  R2 Score : 0.90
Train R2 Score : 0.95

Booster: Binary Classification

As a part of this section, we have explained how we can use the train() method to train booster for the binary classification task of classifying breast cancer tumor type. Please make a note that we have used binary:logistic as our objective function hence the output of the predict() method of the booster will be a probability. We have included logic to convert probabilities into class. We have then calculated accuracy, confusion matrix, and classification report for test data.

In [20]:
X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target, train_size=0.90, stratify=breast_cancer.target, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=breast_cancer.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=breast_cancer.feature_names)

booster = xgb.train({'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'},
                    dmat_train,
                    evals=[(dmat_train, "train"), (dmat_test, "test")])

print("\nTrain RMSE : ",booster.eval(dmat_train))
print("Test  RMSE : ",booster.eval(dmat_test))

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

train_preds = [1 if pred>0.5 else 0 for pred in booster.predict(data=dmat_train)]
test_preds = [1 if pred>0.5 else 0 for pred in booster.predict(data=dmat_test)]

print("\nTest  Accuracy : %.2f"%accuracy_score(Y_test, test_preds))
print("Train Accuracy : %.2f"%accuracy_score(Y_train, train_preds))

print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, test_preds))

print("\nClassification Report : ")
print(classification_report(Y_test, test_preds))
Train/Test Sizes :  (512, 30) (57, 30) (512,) (57,)

[0]	train-error:0.05273	test-error:0.10526
[1]	train-error:0.02344	test-error:0.07018
[2]	train-error:0.01953	test-error:0.07018
[3]	train-error:0.02148	test-error:0.05263
[4]	train-error:0.00977	test-error:0.05263
[5]	train-error:0.00781	test-error:0.05263
[6]	train-error:0.00586	test-error:0.07018
[7]	train-error:0.00195	test-error:0.03509
[8]	train-error:0.00195	test-error:0.08772
[9]	train-error:0.00195	test-error:0.03509

Train RMSE :  [0]	eval-error:0.001953
Test  RMSE :  [0]	eval-error:0.035088

Test  Accuracy : 0.96
Train Accuracy : 1.00

Confusion Matrix :
[[20  1]
 [ 1 35]]

Classification Report :
              precision    recall  f1-score   support

           0       0.95      0.95      0.95        21
           1       0.97      0.97      0.97        36

    accuracy                           0.96        57
   macro avg       0.96      0.96      0.96        57
weighted avg       0.96      0.96      0.96        57

Below we have plotted feature importance for booster trained on breast cancer dataset. We have plotted the average gain of splits that uses the feature. Please feel free to look at the data frame retrieved using the trees_to_dataframe() method.

In [ ]:
with plt.style.context("ggplot"):
    fig = plt.figure(figsize=(9,6))
    ax = fig.add_subplot(111)
    xgb.plotting.plot_importance(booster, ax=ax, height=0.6, importance_type="gain")

XGBoost - An In-Depth Guide

We have now plotted 3rd tree from the ensemble below.

In [ ]:
with plt.style.context("ggplot"):
    fig = plt.figure(figsize=(15,10))
    ax = fig.add_subplot(111)
    xgb.plotting.plot_tree(booster, ax=ax, num_trees=2)

XGBoost - An In-Depth Guide

Booster: Multi-Class Classification

As a part of this section, we have explained how we can use the train() method for multi-class classification problems. We have used it to generate booster trained on wine classification train dataset. We have then evaluated the accuracy, confusion matrix, and classification report on the test dataset.

We have then plotted the feature importance bar chart and first decision tree.

In [23]:
X_train, X_test, Y_train, Y_test = train_test_split(wine.data, wine.target, train_size=0.80, stratify=wine.target, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=wine.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=wine.feature_names)

booster = xgb.train({'max_depth': 5, 'eta': 1, 'objective': 'multi:softmax', 'num_class':3},
                    dmat_train,
                    evals=[(dmat_train, "train"), (dmat_test, "test")])

print("\nTrain RMSE : ",booster.eval(dmat_train))
print("Test  RMSE : ",booster.eval(dmat_test))

from sklearn.metrics import accuracy_score

print("\nTest  Accuracy : %.2f"%accuracy_score(Y_test, booster.predict(data=dmat_test)))
print("Train Accuracy : %.2f"%accuracy_score(Y_train, booster.predict(data=dmat_train)))

print("\nConfusion Matrix : ")
print(confusion_matrix(Y_test, booster.predict(data=dmat_test)))

print("\nClassification Report : ")
print(classification_report(Y_test, booster.predict(data=dmat_test)))
Train/Test Sizes :  (142, 13) (36, 13) (142,) (36,)

[0]	train-merror:0.00000	test-merror:0.05556
[1]	train-merror:0.00000	test-merror:0.05556
[2]	train-merror:0.00000	test-merror:0.02778
[3]	train-merror:0.00000	test-merror:0.02778
[4]	train-merror:0.00000	test-merror:0.00000
[5]	train-merror:0.00000	test-merror:0.02778
[6]	train-merror:0.00000	test-merror:0.00000
[7]	train-merror:0.00000	test-merror:0.02778
[8]	train-merror:0.00000	test-merror:0.02778
[9]	train-merror:0.00000	test-merror:0.02778

Train RMSE :  [0]	eval-merror:0.000000
Test  RMSE :  [0]	eval-merror:0.027778

Test  Accuracy : 0.97
Train Accuracy : 1.00

Confusion Matrix :
[[12  0  0]
 [ 0 14  0]
 [ 0  1  9]]

Classification Report :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.93      1.00      0.97        14
           2       1.00      0.90      0.95        10

    accuracy                           0.97        36
   macro avg       0.98      0.97      0.97        36
weighted avg       0.97      0.97      0.97        36

In [ ]:
with plt.style.context("ggplot"):
    fig = plt.figure(figsize=(9,6))
    ax = fig.add_subplot(111)
    xgb.plotting.plot_importance(booster, ax=ax, height=0.6, importance_type="weight")

XGBoost - An In-Depth Guide

In [ ]:
with plt.style.context("ggplot"):
    fig = plt.figure(figsize=(20,10))
    ax = fig.add_subplot(111)
    xgb.plotting.plot_tree(booster, ax=ax, num_trees=1)

XGBoost - An In-Depth Guide

Saving and Loading Trained Model

As a part of this section, we have explained how we can save the trained xgboost model to disk and then load it to make predictions again in the future.

Below is a list of available methods that can be used to save the model in a different format.

  • save_model(file_name) - It saves model in xgboost internal format.
  • save_config() - It outputs booster configuration as JSON string which can be saved to json file. We can load the booster later using the same parameter configuration using this file.
  • save_raw() - It returns the bytearray object which is the current memory representation of a booster instance.

Below is a list of available methods that can be used to load the saved model.

  • load_model(file_name) - It accepts file name or bytearray from which trained model can be loaded.
  • load_config() - It accepts JSON string generated by save_config() to load model with same configuration.

Below we have saved our multi-class classification model which we created in the previous example. We have then reloaded the model and made predictions using it for verification.

In [26]:
booster.save_model("multiclass_classification.model")
In [27]:
loaded_booster =  xgb.Booster()
loaded_booster
Out[27]:
<xgboost.core.Booster at 0x7f7460ae5780>
In [28]:
loaded_booster.load_model("multiclass_classification.model")
In [29]:
pd.DataFrame({"Preds":booster.predict(dmat_test)[:5], "Loaded Model Preds":loaded_booster.predict(dmat_test)[:5]})
Out[29]:
Preds Loaded Model Preds
0 0.0 0.0
1 2.0 2.0
2 0.0 0.0
3 1.0 1.0
4 1.0 1.0

We can even load the model by using the Booster() class giving it the file name as a part of the model_file parameter.

In [30]:
loaded_booster1 =  xgb.Booster(model_file="multiclass_classification.model")

pd.DataFrame({"Preds":booster.predict(dmat_test)[:5], "Loaded Model Preds":loaded_booster1.predict(dmat_test)[:5]})
Out[30]:
Preds Loaded Model Preds
0 0.0 0.0
1 2.0 2.0
2 0.0 0.0
3 1.0 1.0
4 1.0 1.0

Cross Validation

Xgboost lets us perform cross-validation on our dataset as well using the cv() method. The cv() method has almost the same parameters as that of the train() method with few extra parameters as mentioned below.

  • nfold - It accepts integer specifying the number of folds to create from the dataset. The default is 3.
  • folds - It accepts sklearn KFold, StratifiedKFold, ShuffleSplitor StratifiedShuffleSplit instance.
  • metrics - It accepts list of metrics to evaluate.

Below we have performed cross-validation on the full Boston dataset for 10 rounds and 5 folds.

In [31]:
dmat_train = xgb.DMatrix(boston.data, boston.target, feature_names=boston.feature_names)

xgb.cv({'max_depth': 5, 'eta': 1, 'objective': 'reg:squarederror'}, dmat_train, num_boost_round=10, nfold=5)
Out[31]:
train-rmse-mean train-rmse-std test-rmse-mean test-rmse-std
0 3.822509 0.207240 5.041322 0.709767
1 2.650213 0.148219 4.740111 0.665003
2 2.179826 0.093208 4.509482 0.673848
3 1.828081 0.080050 4.392651 0.675589
4 1.512701 0.056885 4.294071 0.450255
5 1.316335 0.024188 4.285634 0.437991
6 1.114541 0.050333 4.350706 0.418353
7 0.961021 0.054002 4.387452 0.434030
8 0.869441 0.049707 4.403180 0.421032
9 0.790959 0.063960 4.389377 0.445109

Below we have again performed cross-validation on the Boston dataset but this time we have passed sklearn ShufflSplit for creating folds. It creates 10 fold of randomly shuffled data.

In [32]:
from sklearn.model_selection import KFold, ShuffleSplit

shuffle_split = ShuffleSplit(random_state=123)

dmat_train = xgb.DMatrix(boston.data, boston.target, feature_names=boston.feature_names)

xgb.cv({'max_depth': 5, 'eta': 1, 'objective': 'reg:squaredlogerror'}, dmat_train, folds=shuffle_split)
Out[32]:
train-rmsle-mean train-rmsle-std test-rmsle-mean test-rmsle-std
0 2.166785 0.006862 2.179977 0.064457
1 1.660311 0.006184 1.673285 0.062628
2 1.203496 0.005193 1.216123 0.059105
3 0.825498 0.003802 0.837564 0.051902
4 0.558500 0.003576 0.570968 0.040899
5 0.405827 0.004899 0.418653 0.036919
6 0.334471 0.008007 0.348108 0.039197
7 0.308396 0.010793 0.320564 0.043359
8 0.301879 0.011445 0.312817 0.046100
9 0.300311 0.011576 0.310560 0.047579

Below we have performed cross-validation on the breast cancer dataset. We have informed the cv() method to evaluate log loss, AUC, and error metrics for each iteration.

In [33]:
dmat_train = xgb.DMatrix(breast_cancer.data,
                         breast_cancer.target,
                         feature_names=breast_cancer.feature_names)

xgb.cv({'max_depth': 3, 'eta': 1, 'objective': 'binary:logitraw'},
       dmat_train, stratified=breast_cancer.target, nfold=5, metrics=["auc", "logloss", "error"])
Out[33]:
train-auc-mean train-auc-std train-logloss-mean train-logloss-std train-error-mean train-error-std test-auc-mean test-auc-std test-logloss-mean test-logloss-std test-error-mean test-error-std
0 0.982511 0.003957 0.814563 0.177615 0.036921 0.009734 0.950616 0.018239 2.359281 0.590475 0.077384 0.007155
1 0.995670 0.002256 0.407830 0.056408 0.020655 0.002281 0.971417 0.015519 1.840956 0.755282 0.065117 0.016566
2 0.998588 0.000816 0.251895 0.053079 0.010540 0.003212 0.981969 0.012894 1.461477 0.831429 0.054652 0.023516
3 0.999566 0.000313 0.119303 0.108406 0.006146 0.002904 0.986684 0.013078 1.383230 0.909166 0.044125 0.024508
4 0.999913 0.000132 0.055061 0.069058 0.004390 0.003395 0.985918 0.013166 1.323538 0.760395 0.047665 0.027874
5 0.999975 0.000049 0.017522 0.033633 0.000877 0.001754 0.987003 0.012294 1.381233 0.761902 0.044125 0.022510
6 0.999992 0.000016 0.001264 0.002502 0.000877 0.001754 0.987900 0.009896 1.309201 0.746346 0.038815 0.019172
7 1.000000 0.000000 0.000359 0.000674 0.000439 0.000877 0.989445 0.007545 1.248183 0.666860 0.040554 0.016580
8 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.990555 0.007242 1.244237 0.668620 0.038815 0.021483
9 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.990723 0.007753 1.371507 0.702382 0.040554 0.019206

Sklearn API

Xgboost provides estimators that have almost the same API like that of sklearn estimators. This helps developers with sklearn background to grasp the usage of xgboost faster. It even lets us use the xgboost model with sklearn's grid search functionality. As a part of this section, we'll explain 4 estimators available from xgboost which has the same API as sklearn's estimators.

  • XGBRegressor
  • XGBClassifier
  • XGBRFRegressor
  • XGBRFClassifier

XGBRegressor

The XGBRegressor is an estimator that is used for regression problems. It has a default objective function as reg:squarederror. It has a list of parameters that we gave as a dictionary to the train() method. We pass those parameters to the constructor of XGBRegressor directly.

Below we have trained XGBRegressor on Boston train data and then calculated R2 score on test and train dataset both. The score() method is available as a part of estimators which has sklearn like API. The score() method will return the R2 score for regression tasks and accuracy for classification tasks.

In [34]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

xgb_regressor = xgb.XGBRegressor()

xgb_regressor.fit(X_train, Y_train, eval_set=[(X_test, Y_test)], eval_metric="mae", verbose=10)

print("Test  R2 Score : %.2f"%xgb_regressor.score(X_test, Y_test))
print("Train R2 Score : %.2f"%xgb_regressor.score(X_train, Y_train))
[0]	validation_0-mae:14.61328
[10]	validation_0-mae:1.86316
[20]	validation_0-mae:1.70020
[30]	validation_0-mae:1.62740
[40]	validation_0-mae:1.63325
[50]	validation_0-mae:1.62120
[60]	validation_0-mae:1.61760
[70]	validation_0-mae:1.62004
[80]	validation_0-mae:1.61866
[90]	validation_0-mae:1.62278
[99]	validation_0-mae:1.62320
Test  R2 Score : 0.93
Train R2 Score : 1.00
In [35]:
xgb_regressor.predict(X_test)[:5]
Out[35]:
array([24.521688, 29.77457 , 14.518701, 22.433651, 17.031559],
      dtype=float32)

Below we have printed the number of estimators which model used by default, max depth of each tree, and feature importance of individual features.

In [36]:
print("Default Number of Estimators : ",xgb_regressor.n_estimators)
print("Default Max Depth of Trees   : ", xgb_regressor.max_depth)
print("Feature Importances : ")
pd.DataFrame([xgb_regressor.feature_importances_], columns=boston.feature_names)
Default Number of Estimators :  100
Default Max Depth of Trees   :  None
Feature Importances :
Out[36]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.011552 0.001155 0.014551 0.00315 0.043485 0.242339 0.011518 0.056501 0.010146 0.032733 0.062321 0.012791 0.497758

We have now explained how we perform a grid search with XGBRegressor. We have tried different values of parameters n_estimators, max_depth, and eta to find the best performing values. We have then plotted grid search results as well.

In [37]:
%%time

from sklearn.model_selection import GridSearchCV

params = {
        'n_estimators': [50,100],
        'max_depth': [None, 3, 5, 7, 9],
        'eta': [0.5, 1, 2, 3]
        }
grid_search = GridSearchCV(xgb.XGBRegressor(), params, n_jobs=-1)

grid_search.fit(X_train, Y_train)

print("Test  R2 Score : %.2f"%grid_search.score(X_test, Y_test))
print("Train R2 Score : %.2f"%grid_search.score(X_train, Y_train))

print("Best Params : ", grid_search.best_params_)
print("Feature Importances : ")
pd.DataFrame([grid_search.best_estimator_.feature_importances_], columns=boston.feature_names)
Test  R2 Score : 0.91
Train R2 Score : 1.00
Best Params :  {'eta': 0.5, 'max_depth': 5, 'n_estimators': 50}
Feature Importances :
CPU times: user 652 ms, sys: 85.1 ms, total: 738 ms
Wall time: 3.39 s
/home/sunny/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py:813: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
Out[37]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.011688 0.003513 0.011462 0.002225 0.035786 0.147807 0.009096 0.040782 0.007239 0.026909 0.057599 0.014461 0.631432
In [38]:
grid_search_results = pd.DataFrame(grid_search.cv_results_)
print("Grid Search Size : ", grid_search_results.shape)
grid_search_results.head()
Grid Search Size :  (40, 14)
Out[38]:
mean_fit_time std_fit_time mean_score_time std_score_time param_eta param_max_depth param_n_estimators params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.051536 0.018275 0.006071 0.006235 0.5 None 50 {'eta': 0.5, 'max_depth': None, 'n_estimators'... 0.887726 0.849283 0.879015 0.871992 0.016472 6
1 0.056037 0.022982 0.001785 0.000207 0.5 None 100 {'eta': 0.5, 'max_depth': None, 'n_estimators'... 0.887756 0.849392 0.879027 0.872043 0.016434 5
2 0.012587 0.000226 0.001103 0.000048 0.5 3 50 {'eta': 0.5, 'max_depth': 3, 'n_estimators': 50} 0.865275 0.870843 0.884393 0.873480 0.008021 3
3 0.023356 0.000537 0.001254 0.000021 0.5 3 100 {'eta': 0.5, 'max_depth': 3, 'n_estimators': 100} 0.867816 0.869590 0.880927 0.872760 0.005802 4
4 0.019933 0.001143 0.001214 0.000030 0.5 5 50 {'eta': 0.5, 'max_depth': 5, 'n_estimators': 50} 0.880565 0.874491 0.872728 0.875935 0.003357 1
In [39]:
xgb_regressor.get_booster() ## We can get Booster object using this method from sklearn estimators
Out[39]:
<xgboost.core.Booster at 0x7f7460a810b8>

XGBClassifier

The XGBClassifier is an estimator that is used for classification tasks. It has the default objective function binary:logistic. We can pass the same parameters which we can pass to the train() method's params parameter as a dictionary to the constructor of XGBClassifier. We can get actual predictions using predict() method and probabilities using predict_proba() method. It even provides a score() method which lets us calculate the accuracy of the model on passed data.

Below we have trained XGBClassifier on the breast cancer train dataset. We have then evaluated accuracy on train and test datasets. We have also printed the first few predictions and probabilities.

In [40]:
X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target,
                                                    stratify=breast_cancer.target,
                                                    train_size=0.90, random_state=42)

xgb_classif = xgb.XGBClassifier()

xgb_classif.fit(X_train, Y_train, eval_set=[(X_test, Y_test)], eval_metric="auc" , verbose=10)

print("Test  Accuracy Score : %.2f"%xgb_classif.score(X_test, Y_test))
print("Train Accuracy Score : %.2f"%xgb_classif.score(X_train, Y_train))
[0]	validation_0-auc:0.97685
[10]	validation_0-auc:0.99339
[20]	validation_0-auc:0.99206
[30]	validation_0-auc:0.99206
[40]	validation_0-auc:0.98809
[50]	validation_0-auc:0.98809
[60]	validation_0-auc:0.98809
[70]	validation_0-auc:0.98809
[80]	validation_0-auc:0.98809
[90]	validation_0-auc:0.98809
[99]	validation_0-auc:0.98942
Test  Accuracy Score : 0.96
Train Accuracy Score : 1.00
In [41]:
xgb_classif.predict(X_test)[:5]
Out[41]:
array([0, 1, 1, 0, 0])
In [42]:
print("Probabilities : ")
print(xgb_classif.predict_proba(X_test)[:5])
print("\nPrediction From Probabilities : ")
print(np.argmax(xgb_classif.predict_proba(X_test)[:5], axis=1))
Probabilities :
[[9.9962151e-01 3.7849427e-04]
 [7.4094534e-04 9.9925905e-01]
 [7.4838996e-03 9.9251610e-01]
 [9.9939799e-01 6.0198107e-04]
 [9.9195606e-01 8.0439411e-03]]

Prediction From Probabilities :
[0 1 1 0 0]
In [43]:
print("Default Number of Estimators : ",xgb_classif.n_estimators)
print("Default Max Depth of Trees   : ", xgb_classif.max_depth)
print("Feature Importances : ")
pd.DataFrame([xgb_classif.feature_importances_], columns=breast_cancer.feature_names)
Default Number of Estimators :  100
Default Max Depth of Trees   :  None
Feature Importances :
Out[43]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compactness error concavity error concave points error symmetry error fractal dimension error worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 0.007508 0.018245 0.0 0.01407 0.005634 0.00113 0.00268 0.034978 0.001187 0.005131 0.01308 0.002079 0.0 0.005953 0.001658 0.006462 0.0 0.001915 0.000841 0.004829 0.449676 0.020751 0.218758 0.033382 0.007877 0.0 0.009276 0.124308 0.002986 0.005606

Below we have explained how we can use XGBClassifier with sklearn's grid search functionality to try a list of parameters to find the best parameter settings.

In [44]:
%%time

from sklearn.model_selection import GridSearchCV

params = {
        'n_estimators': [50,100,150,200,300,500],
        'max_depth': [None, 3, 5, 7, 9],
        'eta': [0.5, 1, 2, 3]
        }
grid_search = GridSearchCV(xgb.XGBClassifier(), params, n_jobs=-1, cv=5)

grid_search.fit(X_train, Y_train)

print("Test  Accuracy Score : %.2f"%grid_search.score(X_test, Y_test))
print("Train Accuracy Score : %.2f"%grid_search.score(X_train, Y_train))

print("Best Params : ", grid_search.best_params_)
print("Feature Importances : ")
pd.DataFrame([grid_search.best_estimator_.feature_importances_], columns=breast_cancer.feature_names)
Test  Accuracy Score : 0.98
Train Accuracy Score : 1.00
Best Params :  {'eta': 1, 'max_depth': None, 'n_estimators': 50}
Feature Importances :
CPU times: user 353 ms, sys: 3.51 ms, total: 357 ms
Wall time: 6.9 s
/home/sunny/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py:813: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
Out[44]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compactness error concavity error concave points error symmetry error fractal dimension error worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 0.00544 0.014877 0.0 0.0 0.002808 0.00063 0.000494 0.052355 0.002261 0.000968 0.007919 0.002445 0.0 0.004377 0.000712 0.002277 0.001108 0.000309 0.000304 0.001782 0.462069 0.015647 0.268949 0.004184 0.002715 0.007969 0.021782 0.111749 0.003871 0.0

XGBRFRegressor

The XGBRFRegressor is a random forest implementation based on decision trees for regression tasks. It has almost exactly the same API as that of XGBRegressor. We have explained below the usage of it on the Boston housing dataset.

In [45]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

xgb_rf_regressor = xgb.XGBRFRegressor()

xgb_rf_regressor.fit(X_train, Y_train)

print("Test  R2 Score : %.2f"%xgb_rf_regressor.score(X_test, Y_test))
print("Train R2 Score : %.2f"%xgb_rf_regressor.score(X_train, Y_train))
Test  R2 Score : 0.87
Train R2 Score : 0.96
In [46]:
print("Default Number of Estimators : ",xgb_rf_regressor.n_estimators)
print("Default Max Depth of Trees   : ", xgb_rf_regressor.max_depth)
print("Feature Importances : ")
pd.DataFrame([xgb_rf_regressor.feature_importances_], columns=boston.feature_names)
Default Number of Estimators :  100
Default Max Depth of Trees   :  None
Feature Importances :
Out[46]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.027662 0.004731 0.04277 0.00403 0.08146 0.335912 0.015528 0.091729 0.010167 0.028006 0.039975 0.013578 0.304454
In [47]:
%%time

from sklearn.model_selection import GridSearchCV

params = {
        'n_estimators': [50,100,150,200,300,500],
        'max_depth': [None, 3, 5, 7, 9],
        'eta': [0.5, 1, 2, 3]
        }
grid_search = GridSearchCV(xgb.XGBRFRegressor(), params, n_jobs=-1, cv=5)

grid_search.fit(X_train, Y_train)

print("Test  R2 Score : %.2f"%grid_search.score(X_test, Y_test))
print("Train R2 Score : %.2f"%grid_search.score(X_train, Y_train))

print("Best Params : ", grid_search.best_params_)
print("Feature Importances : ")
pd.DataFrame([grid_search.best_estimator_.feature_importances_], columns=boston.feature_names)
Test  R2 Score : 0.88
Train R2 Score : 0.99
Best Params :  {'eta': 0.5, 'max_depth': 9, 'n_estimators': 100}
Feature Importances :
CPU times: user 1.42 s, sys: 33.9 ms, total: 1.45 s
Wall time: 14.4 s
Out[47]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.01547 0.0042 0.028811 0.004552 0.063577 0.299841 0.014431 0.089211 0.013981 0.035951 0.046551 0.014287 0.369136

XGBRFClassifier

The XGBRFClassifier is a random forest implementation based on decision trees for classification tasks. It has almost exactly the same API as that of XGBClassifier. We have explained below the usage of it on the breast cancer dataset.

In [48]:
X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target,
                                                    stratify=breast_cancer.target,
                                                    train_size=0.90, random_state=42)

xgb_rf_classif = xgb.XGBRFClassifier()

xgb_rf_classif.fit(X_train, Y_train)

print("Test  Accuracy Score : %.2f"%xgb_rf_classif.score(X_test, Y_test))
print("Train Accuracy Score : %.2f"%xgb_rf_classif.score(X_train, Y_train))
Test  Accuracy Score : 0.95
Train Accuracy Score : 0.99
In [49]:
print("Default Number of Estimators : ",xgb_rf_classif.n_estimators)
print("Default Max Depth of Trees   : ", xgb_rf_classif.max_depth)
print("Feature Importances : ")
pd.DataFrame([xgb_rf_classif.feature_importances_], columns=breast_cancer.feature_names)
Default Number of Estimators :  100
Default Max Depth of Trees   :  None
Feature Importances :
Out[49]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compactness error concavity error concave points error symmetry error fractal dimension error worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 0.003356 0.009529 0.009581 0.010629 0.006968 0.007762 0.009291 0.090449 0.00362 0.010503 0.004347 0.015605 0.00586 0.006079 0.001386 0.015551 0.007671 0.007575 0.005983 0.006225 0.157929 0.013715 0.232325 0.161624 0.01187 0.027882 0.01665 0.096775 0.017094 0.026166
In [50]:
%%time

from sklearn.model_selection import GridSearchCV

params = {
        'n_estimators': [50,100,150,200,300,500],
        'max_depth': [None, 3, 5, 7, 9],
        'eta': [0.5, 1, 2, 3]
        }
grid_search = GridSearchCV(xgb.XGBRFClassifier(), params, n_jobs=-1, cv=5)

grid_search.fit(X_train, Y_train)

print("Test  Accuracy Score : %.2f"%grid_search.score(X_test, Y_test))
print("Train Accuracy Score : %.2f"%grid_search.score(X_train, Y_train))

print("Best Params : ", grid_search.best_params_)
print("Feature Importances : ")
pd.DataFrame([grid_search.best_estimator_.feature_importances_], columns=breast_cancer.feature_names)
Test  Accuracy Score : 0.95
Train Accuracy Score : 0.99
Best Params :  {'eta': 0.5, 'max_depth': None, 'n_estimators': 150}
Feature Importances :
CPU times: user 836 ms, sys: 3.11 ms, total: 840 ms
Wall time: 25.4 s
Out[50]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compactness error concavity error concave points error symmetry error fractal dimension error worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 0.003453 0.009951 0.008917 0.010582 0.00692 0.007947 0.010473 0.067403 0.003711 0.010675 0.00435 0.014345 0.005435 0.006293 0.001694 0.013137 0.006933 0.00861 0.006133 0.005835 0.139899 0.01372 0.261298 0.164143 0.012468 0.033611 0.015302 0.103061 0.016882 0.026823

Early Stop Training to Avoid Overfitting

Xgboost provides us with an option that lets us stop the training process if training loss is not improving for some specified number of iterations. We can specify the early_stopping_rounds parameter in the train() method to some integer and it'll stop training if training loss is not improved for that many rounds of training.

Below we have instructed train() method to train for 20 rounds using num_boost_round parameter and early_stopping_rounds is set to 5. The train() method will stop training if training loss is not improved for 5 sequential rounds of training.

In [51]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=boston.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=boston.feature_names)

tweedie_booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:tweedie'},
                    dmat_train,
                    evals=[(dmat_train, "train"), (dmat_test, "test")],
                    num_boost_round=20,
                    early_stopping_rounds=5)

print("\nTrain RMSE : ",tweedie_booster.eval(dmat_train))
print("Test  RMSE : ",tweedie_booster.eval(dmat_test))

from sklearn.metrics import r2_score

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, tweedie_booster.predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, tweedie_booster.predict(dmat_train)))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	train-tweedie-nloglik@1.5:28.32970	test-tweedie-nloglik@1.5:26.66488
Multiple eval metrics have been passed: 'test-tweedie-nloglik@1.5' will be used for early stopping.

Will train until test-tweedie-nloglik@1.5 hasn't improved in 5 rounds.
[1]	train-tweedie-nloglik@1.5:19.30211	test-tweedie-nloglik@1.5:18.59277
[2]	train-tweedie-nloglik@1.5:18.73162	test-tweedie-nloglik@1.5:18.15825
[3]	train-tweedie-nloglik@1.5:18.71630	test-tweedie-nloglik@1.5:18.14867
[4]	train-tweedie-nloglik@1.5:18.70620	test-tweedie-nloglik@1.5:18.14152
[5]	train-tweedie-nloglik@1.5:18.70218	test-tweedie-nloglik@1.5:18.13844
[6]	train-tweedie-nloglik@1.5:18.69798	test-tweedie-nloglik@1.5:18.13924
[7]	train-tweedie-nloglik@1.5:18.69527	test-tweedie-nloglik@1.5:18.13543
[8]	train-tweedie-nloglik@1.5:18.69322	test-tweedie-nloglik@1.5:18.12734
[9]	train-tweedie-nloglik@1.5:18.69212	test-tweedie-nloglik@1.5:18.12620
[10]	train-tweedie-nloglik@1.5:18.69033	test-tweedie-nloglik@1.5:18.12475
[11]	train-tweedie-nloglik@1.5:18.68892	test-tweedie-nloglik@1.5:18.12602
[12]	train-tweedie-nloglik@1.5:18.68768	test-tweedie-nloglik@1.5:18.12688
[13]	train-tweedie-nloglik@1.5:18.68724	test-tweedie-nloglik@1.5:18.12563
[14]	train-tweedie-nloglik@1.5:18.68605	test-tweedie-nloglik@1.5:18.12537
[15]	train-tweedie-nloglik@1.5:18.68555	test-tweedie-nloglik@1.5:18.12502
Stopping. Best iteration:
[10]	train-tweedie-nloglik@1.5:18.69033	test-tweedie-nloglik@1.5:18.12475


Train RMSE :  [0]	eval-tweedie-nloglik@1.5:18.685551
Test  RMSE :  [0]	eval-tweedie-nloglik@1.5:18.125015

Test  R2 Score : 0.91
Train R2 Score : 0.97

We have below explained how we can early stop training with XGBRegressor.

In [52]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

xgb_regressor = xgb.XGBRegressor(max_depth=3, eta=1, objective='reg:tweedie')

xgb_regressor.fit(X_train, Y_train,
                  eval_set=[(X_test, Y_test)], eval_metric="rmse",
                  early_stopping_rounds=5, verbose=5)

print("Test  R2 Score : %.2f"%xgb_regressor.score(X_test, Y_test))
print("Train R2 Score : %.2f"%xgb_regressor.score(X_train, Y_train))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	validation_0-rmse:19.40547
Will train until validation_0-rmse hasn't improved in 5 rounds.
[5]	validation_0-rmse:3.15521
[10]	validation_0-rmse:2.48459
[15]	validation_0-rmse:2.40736
[20]	validation_0-rmse:2.40103
[25]	validation_0-rmse:2.46336
Stopping. Best iteration:
[21]	validation_0-rmse:2.38972

Test  R2 Score : 0.91
Train R2 Score : 0.98

Feature Interaction Constraints

When xgboost creates a tree during the training process it takes into consideration all feature interactions. In a decision tree, we have nodes where each node represents a decision to be made based on a particular value of the feature. The next node will be based on the feature value mentioned in the previous node. By default, all features can be present in any node of the decision tree. We can force xgboost to keep a list of features in subsequent nodes by giving it a list of indices of features in the dataset. We can give list of list to interaction_constraints parameter of train() method. Here an individual list is a list of feature indices that should only interact with one another and not with other features.

Please feel free to go through this link to get in-depth details about feature interaction constraints in xgboost.

Below we have kept features 0,1,2,11 and 12 into one list hence these features will interact with one another when creating a tree but not with other features hence tree will have only these features. The Same goes for other lists.

In [53]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=boston.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=boston.feature_names)

tweedie_booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:tweedie',
                             'tree_method':'hist', 'nthread':4,
                             'interaction_constraints':[[0,1,2,11,12], [3, 4],[6,10], [5,9], [7,8]]},
                    dmat_train,
                    evals=[(dmat_train, "train"), (dmat_test, "test")])

print("\nTrain RMSE : ",tweedie_booster.eval(dmat_train))
print("Test  RMSE : ",tweedie_booster.eval(dmat_test))

from sklearn.metrics import r2_score

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, tweedie_booster.predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, tweedie_booster.predict(dmat_train)))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	train-tweedie-nloglik@1.5:28.32970	test-tweedie-nloglik@1.5:26.66487
[1]	train-tweedie-nloglik@1.5:19.31425	test-tweedie-nloglik@1.5:18.56556
[2]	train-tweedie-nloglik@1.5:18.75812	test-tweedie-nloglik@1.5:18.15946
[3]	train-tweedie-nloglik@1.5:18.73723	test-tweedie-nloglik@1.5:18.18582
[4]	train-tweedie-nloglik@1.5:18.72045	test-tweedie-nloglik@1.5:18.18661
[5]	train-tweedie-nloglik@1.5:18.71539	test-tweedie-nloglik@1.5:18.18151
[6]	train-tweedie-nloglik@1.5:18.71035	test-tweedie-nloglik@1.5:18.16513
[7]	train-tweedie-nloglik@1.5:18.70579	test-tweedie-nloglik@1.5:18.15365
[8]	train-tweedie-nloglik@1.5:18.70398	test-tweedie-nloglik@1.5:18.15155
[9]	train-tweedie-nloglik@1.5:18.69945	test-tweedie-nloglik@1.5:18.15776

Train RMSE :  [0]	eval-tweedie-nloglik@1.5:18.699446
Test  RMSE :  [0]	eval-tweedie-nloglik@1.5:18.157764

Test  R2 Score : 0.78
Train R2 Score : 0.94

Below we have explained how we can use feature interaction constraint with XGBRegressor.

In [54]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

xgb_regressor = xgb.XGBRegressor(max_depth=3, eta=1, objective='reg:tweedie',
                                 interaction_constraints=[[0,1,2,11,12], [3, 4],[6,10], [5,9], [7,8]])

xgb_regressor.fit(X_train, Y_train,
                  eval_set=[(X_test, Y_test)], eval_metric="rmse",
                  early_stopping_rounds=5, verbose=1)

print("Test  R2 Score : %.2f"%xgb_regressor.score(X_test, Y_test))
print("Train R2 Score : %.2f"%xgb_regressor.score(X_train, Y_train))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	validation_0-rmse:19.40547
Will train until validation_0-rmse hasn't improved in 5 rounds.
[1]	validation_0-rmse:9.42298
[2]	validation_0-rmse:3.54240
[3]	validation_0-rmse:4.78789
[4]	validation_0-rmse:5.06168
[5]	validation_0-rmse:5.65695
[6]	validation_0-rmse:5.04555
[7]	validation_0-rmse:4.97560
Stopping. Best iteration:
[2]	validation_0-rmse:3.54240

Test  R2 Score : 0.80
Train R2 Score : 0.79

Monotonic Constraints

The monotonic constraints let us specify increasing, decreasing, or no monotone relation of the feature with the target. We can specify a monotone value of 1,0 or -1 for each feature to show the increasing, none, and decreasing relation of the feature with the target by setting the monotone_constraints parameter. Below we have explained the usage of monotonic constraints for regression problems using the Boston dataset.

Please feel free to check this link to better understand monotonic constraints.

In [55]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=boston.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=boston.feature_names)

tweedie_booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:tweedie',
                             'tree_method':'hist', 'nthread':4,
                             'monotone_constraints':(1,0,1,-1,1,0,1,0,-1,1,1, -1, 1)},
                    dmat_train,
                    evals=[(dmat_train, "train"), (dmat_test, "test")])

print("\nTrain RMSE : ",tweedie_booster.eval(dmat_train))
print("Test  RMSE : ",tweedie_booster.eval(dmat_test))

from sklearn.metrics import r2_score

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, tweedie_booster.predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, tweedie_booster.predict(dmat_train)))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	train-tweedie-nloglik@1.5:28.33915	test-tweedie-nloglik@1.5:26.59607
[1]	train-tweedie-nloglik@1.5:19.35218	test-tweedie-nloglik@1.5:18.59604
[2]	train-tweedie-nloglik@1.5:18.79351	test-tweedie-nloglik@1.5:18.18378
[3]	train-tweedie-nloglik@1.5:18.75948	test-tweedie-nloglik@1.5:18.15357
[4]	train-tweedie-nloglik@1.5:18.75133	test-tweedie-nloglik@1.5:18.15053
[5]	train-tweedie-nloglik@1.5:18.74689	test-tweedie-nloglik@1.5:18.14669
[6]	train-tweedie-nloglik@1.5:18.74240	test-tweedie-nloglik@1.5:18.14553
[7]	train-tweedie-nloglik@1.5:18.73603	test-tweedie-nloglik@1.5:18.16522
[8]	train-tweedie-nloglik@1.5:18.73118	test-tweedie-nloglik@1.5:18.16656
[9]	train-tweedie-nloglik@1.5:18.72748	test-tweedie-nloglik@1.5:18.16702

Train RMSE :  [0]	eval-tweedie-nloglik@1.5:18.727484
Test  RMSE :  [0]	eval-tweedie-nloglik@1.5:18.167023

Test  R2 Score : 0.82
Train R2 Score : 0.88
In [56]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

xgb_regressor = xgb.XGBRegressor(max_depth=3, eta=1, objective='reg:tweedie',
                                 monotone_constraints=(1,0,1,-1,1,0,1,0,-1,1,1, -1, 1))

xgb_regressor.fit(X_train, Y_train,
                  eval_set=[(X_test, Y_test)], eval_metric="rmse",
                  early_stopping_rounds=5, verbose=1)

print("Test  R2 Score : %.2f"%xgb_regressor.score(X_test, Y_test))
print("Train R2 Score : %.2f"%xgb_regressor.score(X_train, Y_train))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	validation_0-rmse:19.38521
Will train until validation_0-rmse hasn't improved in 5 rounds.
[1]	validation_0-rmse:9.61720
[2]	validation_0-rmse:4.16700
[3]	validation_0-rmse:3.83074
[4]	validation_0-rmse:4.35182
[5]	validation_0-rmse:4.52082
[6]	validation_0-rmse:4.44078
[7]	validation_0-rmse:4.20142
[8]	validation_0-rmse:4.24515
Stopping. Best iteration:
[3]	validation_0-rmse:3.83074

Test  R2 Score : 0.76
Train R2 Score : 0.78

Custom Objective/Loss Function

As a part of this section, we have explained how we can use a custom objective/loss function with xgboost. We'll be giving input to loss function list of predicted values and actual target values. It'll then return a list of the first derivative and second derivative of loss function for that values. Below we have created the mean squared error loss function and explained its usage with a simple example. We need to pass a reference to function to the objective parameter of an estimator.

In [57]:
def first_grad(predt, dtrain):
    '''Compute the first derivative for mean squared error.'''
    y = dtrain.get_label() if isinstance(dtrain, xgb.DMatrix) else dtrain
    return 2*(y-predt)

def second_grad(predt, dtrain):
    '''Compute the second derivative for mean squared error.'''
    y = dtrain.get_label() if isinstance(dtrain, xgb.DMatrix) else dtrain
    return [1] * len(predt)

def mean_sqaured_error(predt, dtrain):
    ''''Mean squared error function.'''
    predt[predt < -1] = -1 + 1e-6
    grad = first_grad(predt, dtrain)
    hess = second_grad(predt, dtrain)
    return grad, hess
In [58]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

xgb_regressor = xgb.XGBRegressor(max_depth=3, eta=1, objective=mean_sqaured_error)  ## Custom Evaluation Function

xgb_regressor.fit(X_train, Y_train,
                  eval_set=[(X_test, Y_test)], eval_metric="mae",
                  early_stopping_rounds=5,
                  verbose=10)

print("\nTest  R2 Score : %.2f"%xgb_regressor.score(X_test, Y_test))
print("Train R2 Score : %.2f"%xgb_regressor.score(X_train, Y_train))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	validation_0-mae:19.54971
Will train until validation_0-mae hasn't improved in 5 rounds.
[10]	validation_0-mae:15.58860
[20]	validation_0-mae:13.57056
[30]	validation_0-mae:11.92066
[40]	validation_0-mae:10.05207
[50]	validation_0-mae:9.34793
[60]	validation_0-mae:8.23015
[70]	validation_0-mae:7.45686
[80]	validation_0-mae:7.02828
[90]	validation_0-mae:6.30696
[99]	validation_0-mae:5.31595

Test  R2 Score : 0.27
Train R2 Score : 0.63

Custom Evaluation Functions

Xgboost lets us create our custom evaluation function as well. The function should accept predictions and DMatrix instances as parameters and then calculate metrics based on predictions and actual target values. We have created simple mean_absolute_error() for explanation purpose.

We can pass function reference to the feval parameter of train() to use it on the evaluation dataset.

In [59]:
def mean_absolute_error(preds, dmat):
    actuals = dmat.get_label()
    err = (actuals - preds).sum()
    return "MAE", err
In [60]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=boston.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=boston.feature_names)

booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:squarederror'},
                    dmat_train,
                    evals=[(dmat_test, "test")],
                    feval=mean_absolute_error, ## Custom Evaluation Function
                    num_boost_round=10,
                    early_stopping_rounds=5)

print("\nTrain RMSE : ",booster.eval(dmat_train))
print("Test  RMSE : ",booster.eval(dmat_test))

from sklearn.metrics import r2_score

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, booster.predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, booster.predict(dmat_train)))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	test-rmse:3.59159	test-MAE:26.53242
Multiple eval metrics have been passed: 'test-MAE' will be used for early stopping.

Will train until test-MAE hasn't improved in 5 rounds.
[1]	test-rmse:3.26373	test-MAE:15.24190
[2]	test-rmse:3.12218	test-MAE:18.01450
[3]	test-rmse:2.94107	test-MAE:4.76002
[4]	test-rmse:2.75222	test-MAE:2.55075
[5]	test-rmse:2.78515	test-MAE:-3.35190
[6]	test-rmse:2.64519	test-MAE:-2.30084
[7]	test-rmse:2.64290	test-MAE:-2.01779
[8]	test-rmse:2.58895	test-MAE:-9.34707
[9]	test-rmse:2.61442	test-MAE:-5.63952

Train RMSE :  [0]	eval-rmse:1.965108
Test  RMSE :  [0]	eval-rmse:2.614419

Test  R2 Score : 0.89
Train R2 Score : 0.96

Below we have explained how we can use custom evaluation metrics with XGBRegressor. We need to set the eval_metric parameter of the fit() method with reference to the custom evaluation function.

In [61]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

xgb_regressor = xgb.XGBRegressor(max_depth=3, eta=1, objective='reg:squarederror')

xgb_regressor.fit(X_train, Y_train,
                  eval_set=[(X_test, Y_test)], eval_metric=mean_absolute_error,
                  early_stopping_rounds=5,
                  verbose=5)

print("Test  R2 Score : %.2f"%xgb_regressor.score(X_test, Y_test))
print("Train R2 Score : %.2f"%xgb_regressor.score(X_train, Y_train))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	validation_0-rmse:3.59159	validation_0-MAE:26.53242
Multiple eval metrics have been passed: 'validation_0-MAE' will be used for early stopping.

Will train until validation_0-MAE hasn't improved in 5 rounds.
[5]	validation_0-rmse:2.78515	validation_0-MAE:-3.35190
[10]	validation_0-rmse:2.57606	validation_0-MAE:-16.31485
[15]	validation_0-rmse:2.47347	validation_0-MAE:-23.60214
[20]	validation_0-rmse:2.43278	validation_0-MAE:-19.31291
Stopping. Best iteration:
[16]	validation_0-rmse:2.49018	validation_0-MAE:-24.94875

Test  R2 Score : 0.90
Train R2 Score : 0.97

Callbacks

Xgboost provides us with a list of callback functions for a different purpose which gets executed after each iteration of training. Below is a list of available callbacks with xgboost as a part of the callback module.

  • early_stop - It accepts integer specifying whether to stop training if evaluation metric results on last evaluation set are not improved for that many iterations.
  • print_evaluation - It accepts integer values specifying how often to print evaluation results. Evaluation metric results are printed at every that many iterations as specified.
  • record_evaluation - It accepts a dictionary in which evaluation results will be recorded.
  • reset_learning_rate - It lets us reset the learning rate after each iteration of training. It accepts an array of size the same as the number of iterations or callback returning the new learning rate for each iteration.

We need to provide a list of callbacks to the callbacks parameter for their execution after each iteration.

Below we have explained usage of early_stop(), print_evaluation() and record_evaluation() callbacks for regression task.

In [62]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=boston.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=boston.feature_names)


early_stop_execution = xgb.callback.early_stop(5)
print_eval = xgb.callback.print_evaluation(1)
eval_results = {}
eval_results_callback = xgb.callback.record_evaluation(eval_results)

tweedie_booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:tweedie'},
                    dmat_train,
                    evals=[(dmat_test, "test")],
                    num_boost_round=25,
                    verbose_eval=False,
                    callbacks=[early_stop_execution, print_eval, eval_results_callback])

print("Evaluation Results : ", eval_results)

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, tweedie_booster.predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, tweedie_booster.predict(dmat_train)))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

Will train until test-tweedie-nloglik@1.5 hasn't improved in 5 rounds.
[0]	test-tweedie-nloglik@1.5:26.66488
[1]	test-tweedie-nloglik@1.5:18.59277
[2]	test-tweedie-nloglik@1.5:18.15825
[3]	test-tweedie-nloglik@1.5:18.14867
[4]	test-tweedie-nloglik@1.5:18.14152
[5]	test-tweedie-nloglik@1.5:18.13844
[6]	test-tweedie-nloglik@1.5:18.13924
[7]	test-tweedie-nloglik@1.5:18.13543
[8]	test-tweedie-nloglik@1.5:18.12734
[9]	test-tweedie-nloglik@1.5:18.12620
[10]	test-tweedie-nloglik@1.5:18.12475
[11]	test-tweedie-nloglik@1.5:18.12602
[12]	test-tweedie-nloglik@1.5:18.12688
[13]	test-tweedie-nloglik@1.5:18.12563
[14]	test-tweedie-nloglik@1.5:18.12537
Stopping. Best iteration:
[10]	test-tweedie-nloglik@1.5:18.12475

Evaluation Results :  {'test': {'tweedie-nloglik@1.5': [26.664877, 18.592772, 18.158253, 18.148666, 18.141518, 18.138443, 18.139244, 18.135429, 18.127338, 18.126202, 18.124754, 18.12602, 18.126877, 18.125628, 18.12537]}}

Test  R2 Score : 0.91
Train R2 Score : 0.97

Below we have again explained the same three callbacks with XGBRegressor. This time we need to pass a list of callback functions to the callbacks parameter of the fit() method.

In [63]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

xgb_regressor = xgb.XGBRegressor(max_depth=3, eta=1, objective='reg:squarederror')

early_stop_execution = xgb.callback.early_stop(5)
print_eval = xgb.callback.print_evaluation(5)
eval_results = {}
eval_results_callback = xgb.callback.record_evaluation(eval_results)

xgb_regressor.fit(X_train, Y_train,
                  eval_set=[(X_test, Y_test)],
                  verbose=False,
                  callbacks = [early_stop_execution, print_eval, eval_results_callback]
                  )

print("Evaluation Results : ", eval_results)

print("\nTest  R2 Score : %.2f"%xgb_regressor.score(X_test, Y_test))
print("Train R2 Score : %.2f"%xgb_regressor.score(X_train, Y_train))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

Will train until validation_0-rmse hasn't improved in 5 rounds.
[0]	validation_0-rmse:3.59159
[5]	validation_0-rmse:2.78515
[10]	validation_0-rmse:2.57606
[15]	validation_0-rmse:2.47347
[20]	validation_0-rmse:2.43278
[25]	validation_0-rmse:2.39191
[30]	validation_0-rmse:2.44904
Stopping. Best iteration:
[26]	validation_0-rmse:2.37548

Evaluation Results :  {'validation_0': {'rmse': [3.591595, 3.263732, 3.122178, 2.941069, 2.752223, 2.785148, 2.64519, 2.642898, 2.588952, 2.614419, 2.576064, 2.52405, 2.521241, 2.548234, 2.546718, 2.473467, 2.490181, 2.460504, 2.394174, 2.426881, 2.432777, 2.421911, 2.420417, 2.379785, 2.382656, 2.39191, 2.375476, 2.386967, 2.440467, 2.443965, 2.449045]}}

Test  R2 Score : 0.90
Train R2 Score : 0.99

As a part of this example, we have explained how we can use the reset_learning_rate() callback. We have first called the reset_learning_rate() function with an array of size 15 which is the same as the number of iterations of our training process. The array starts from 0.1 till 1.5 increasing the learning rate by 0.1 each time.

In [64]:
reset_learning_rate = xgb.callback.reset_learning_rate(list(np.linspace(0.1,1.5, num=15)))

tweedie_booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:tweedie'},
                    dmat_train,
                    evals=[(dmat_test, "test")],
                    num_boost_round=15,
                    callbacks=[reset_learning_rate])

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, tweedie_booster.predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, tweedie_booster.predict(dmat_train)))
[0]	test-tweedie-nloglik@1.5:26.66487
[1]	test-tweedie-nloglik@1.5:25.31718
[2]	test-tweedie-nloglik@1.5:23.11685
[3]	test-tweedie-nloglik@1.5:20.87608
[4]	test-tweedie-nloglik@1.5:19.25085
[5]	test-tweedie-nloglik@1.5:18.44966
[6]	test-tweedie-nloglik@1.5:18.19030
[7]	test-tweedie-nloglik@1.5:18.14023
[8]	test-tweedie-nloglik@1.5:18.13443
[9]	test-tweedie-nloglik@1.5:18.13133
[10]	test-tweedie-nloglik@1.5:18.13147
[11]	test-tweedie-nloglik@1.5:18.13265
[12]	test-tweedie-nloglik@1.5:18.13475
[13]	test-tweedie-nloglik@1.5:18.13259
[14]	test-tweedie-nloglik@1.5:18.13398

Test  R2 Score : 0.87
Train R2 Score : 0.97

We have now explained another example demonstrating usage of the reset_learning_rate() callback. This time we have created a function named calculate_learning_rate() which will be passed to reset_learning_rate() callback. The function takes as input two integers (current boosting round index and a total number of boosting rounds) and returns the learning rate for that boosting round. We have then passed the callback created to the callbacks parameter.

In [65]:
def calculate_learning_rate(boosting_round, num_boost_round):
    lrs = list(np.linspace(0.1,1.5, num=num_boost_round))
    return lrs[boosting_round]

reset_learning_rate = xgb.callback.reset_learning_rate(calculate_learning_rate)

tweedie_booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:tweedie'},
                    dmat_train,
                    evals=[(dmat_test, "test")],
                    num_boost_round=15,
                    callbacks=[reset_learning_rate])

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, tweedie_booster.predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, tweedie_booster.predict(dmat_train)))
[0]	test-tweedie-nloglik@1.5:26.66487
[1]	test-tweedie-nloglik@1.5:25.31718
[2]	test-tweedie-nloglik@1.5:23.11685
[3]	test-tweedie-nloglik@1.5:20.87608
[4]	test-tweedie-nloglik@1.5:19.25085
[5]	test-tweedie-nloglik@1.5:18.44966
[6]	test-tweedie-nloglik@1.5:18.19030
[7]	test-tweedie-nloglik@1.5:18.14023
[8]	test-tweedie-nloglik@1.5:18.13443
[9]	test-tweedie-nloglik@1.5:18.13133
[10]	test-tweedie-nloglik@1.5:18.13147
[11]	test-tweedie-nloglik@1.5:18.13265
[12]	test-tweedie-nloglik@1.5:18.13475
[13]	test-tweedie-nloglik@1.5:18.13258
[14]	test-tweedie-nloglik@1.5:18.13398

Test  R2 Score : 0.87
Train R2 Score : 0.97

Dask Backend for Distributed Training

Xgboost provides support for using dask as a backend for training gradient boosting algorithm in a distributed environment. Xgboost has a module named dask which has a list of data structures and estimators for using with dask.

Dask has a simple structure where we have below mentioned three main components.

  • Scheduler - Dask distributed environment has one scheduler which handles communication between clients and workers. It’s even responsible for distributing work to worker nodes.
  • Clients - We can have more than one client instances which can be used to submit tasks to the scheduler.
  • Workers - These are actual nodes(processes/machines) which runs task.

In order to use dask, we need to create a client that will be used to communicate with the scheduler. This tutorial is run on a single PC and not on a distributed environment with multiple nodes. When we create an instance of dask client without giving the IP address and port of scheduler, it'll create a cluster on the local machine itself. Below we have created a small cluster of 4 workers using the Client() constructor.

If you are interested in learning about dask then please feel free to check our tutorials on the same. It has information about creating a dask distributed environment as well on an actual cluster with multiple machines.

Please make a note that using xgboost on the dask distributed environment requires a little background of dask to make it work correctly.

In [66]:
print("Dask Installed ?", xgb.dask.DASK_INSTALLED)

client = xgb.dask.Client(n_workers=4, threads_per_worker=4)
Dask Installed ? True
In [ ]:
client

XGBoost - An In-Depth Guide

In [ ]:
xgb.dask.get_client()

XGBoost - An In-Depth Guide

Below we have divided the Boston housing dataset into the train (90%) and test (10%) sets. We have then converted the normal numpy array to the dask array using the da module of the xgboost.dask module. The da module provides access to the dask.array module. We can even use the dask.array module to create arrays and it'll work fine. The xgboost estimators available through the xgboost.dask module accepts either the dask array or dask dataframe. The dask.dataframe module is available in xgboost as xgboost.dask.dd.

In [69]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

X_train_d, X_test_d, Y_train_d, Y_test_d = xgb.dask.da.array(X_train), xgb.dask.da.array(X_test), xgb.dask.da.array(Y_train), xgb.dask.da.array(Y_test)

X_train_d
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

Out[69]:
Array Chunk
Bytes 47.32 kB 47.32 kB
Shape (455, 13) (455, 13)
Count 1 Tasks 1 Chunks
Type float64 numpy.ndarray
13 455

Below we have explained how we can run the xgboost algorithm in a dask distributed environment. The dask module has its own DaskDMatrix data structure which is almost the same as DMatrix but requires client instance as the first argument followed by data arrays containing features values and target labels.

The train() method available through the dask module requires us to pass the client instance first before the parameters dictionary. Everything else is the same as the train() method available directly. It runs training in a distributed environment and returns a dictionary with two components (Booster instance and training history). We can then call the predict() method of the dask module by giving it the client, booster instance, and DaskDMatrix dataset. It'll return a lazy instance on which we need to call compute() to evaluate it and return the actual result. We have calculated the R2 score on train and test datasets at last.

In [70]:
dmat_train_dask = xgb.dask.DaskDMatrix(client, X_train_d, Y_train_d, feature_names=boston.feature_names)
dmat_test_dask = xgb.dask.DaskDMatrix(client, X_test_d, Y_test_d, feature_names=boston.feature_names)

reg_booster = xgb.dask.train(client, {'max_depth': 3, 'eta': 1, 'objective': 'reg:squarederror'},
                             dmat_train_dask,
                             evals=[(dmat_train_dask, "train"), (dmat_test_dask, "test")])

print(reg_booster)

from sklearn.metrics import r2_score

test_preds = xgb.dask.predict(client, reg_booster["booster"], dmat_test_dask)
train_preds = xgb.dask.predict(client, reg_booster["booster"], dmat_train_dask)

print("\nType of Predictions : ",test_preds)

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, test_preds.compute()))
print("Train R2 Score : %.2f"%r2_score(Y_train, train_preds.compute()))
{'booster': <xgboost.core.Booster object at 0x7f73ec6a4748>, 'history': {'train': {'rmse': [4.037745, 3.383986, 3.151301, 2.850509, 2.742785, 2.55956, 2.460093, 2.290408, 2.148991, 2.067703]}, 'test': {'rmse': [3.581157, 3.032583, 2.976439, 2.762226, 2.773082, 2.730096, 2.779377, 2.83347, 2.83619, 2.845627]}}}

Type of Predictions :  dask.array<from-value, shape=(51,), dtype=float32, chunksize=(51,), chunktype=numpy.ndarray>

Test  R2 Score : 0.87
Train R2 Score : 0.95
In [73]:
dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=boston.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=boston.feature_names)

print("\nTrain RMSE : ",reg_booster["booster"].eval(dmat_train))
print("Test  RMSE : ",reg_booster["booster"].eval(dmat_test))

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, reg_booster["booster"].predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, reg_booster["booster"].predict(dmat_train)))
Train RMSE :  [0]	eval-rmse:2.067703
Test  RMSE :  [0]	eval-rmse:2.845627

Test  R2 Score : 0.87
Train R2 Score : 0.95

Below we have explained how we can use the DaskXGBRegressor() estimator for regression task with the Boston housing dataset. It has the same API as that of XGBRegressor(). We have first created a client instance and then used it as context to call all other methods which will require the usage of dask distributed environment.

In [74]:
client = xgb.dask.Client(n_workers=4, threads_per_worker=4)

with client:
    X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

    print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

    X_train_d, X_test_d, Y_train_d, Y_test_d = xgb.dask.da.array(X_train), xgb.dask.da.array(X_test), xgb.dask.da.array(Y_train), xgb.dask.da.array(Y_test)

    xgb_dask_regressor = xgb.dask.DaskXGBRegressor()

    xgb_dask_regressor.fit(X_train_d, Y_train_d)

    print("Test  R2 Score : %.2f"%xgb_dask_regressor.score(X_test_d, Y_test_d))
    print("Train R2 Score : %.2f"%xgb_dask_regressor.score(X_train_d, Y_train_d))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

Test  R2 Score : 0.93
Train R2 Score : 1.00

As the last example of using xgboost with dask, we have explained how we can use the DaskXGBClassifier() estimator for classification tasks. The majority of things are almost the same as normal API with differences like using the client to communicate to dask cluster, wrapping data into dask data structures, and calling compute() on lazy instances to actually run a task on a cluster to get results.

In [75]:
client = xgb.dask.Client(n_workers=4, threads_per_worker=4)

with client:
    X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target, train_size=0.90,
                                                        stratify=breast_cancer.target,
                                                        random_state=42)

    print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

    X_train_d, X_test_d, Y_train_d, Y_test_d = xgb.dask.da.array(X_train), xgb.dask.da.array(X_test), xgb.dask.da.array(Y_train, dtype="int64"), xgb.dask.da.array(Y_test, dtype="int64")

    xgb_dask_classif = xgb.dask.DaskXGBClassifier()

    xgb_dask_classif.fit(X_train_d, Y_train_d)

    train_preds = xgb_dask_classif.predict(X_train_d)
    test_preds = xgb_dask_classif.predict(X_test_d)

    print("Test  Accuracy Score : %.2f"%accuracy_score(Y_test, test_preds.compute()))
    print("Train Accuracy Score : %.2f"%accuracy_score(Y_train, train_preds.compute()))

    test_preds_proba = xgb_dask_classif.predict_proba(X_test_d)

    print("\nType of Preds Proba Result : ",type(test_preds_proba))

    test_preds_proba = test_preds_proba.compute()

test_preds_proba[:5]
Train/Test Sizes :  (512, 30) (57, 30) (512,) (57,)

Test  Accuracy Score : 0.96
Train Accuracy Score : 1.00

Type of Preds Proba Result :  <class 'dask.array.core.Array'>
Out[75]:
array([3.9076293e-04, 9.9919409e-01, 9.9410325e-01, 4.0606453e-04,
       8.1864735e-03], dtype=float32)

GPU Support

Xgboost provides support for running algorithms on GPU as well. It takes the addition of two simple parameters in order to instruct xgboost to shift training from CPU to GPU. The tree_method parameter has a value named gpu_hist which will let us run our same code on GPU. We can also provide GPU id by setting the gpu_id parameter if we have more than one GPU available.

Below we have run the same code from our previous example but now on GPU by setting tree_method asgpu_hist and gpu_id to 0.

In [74]:
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, train_size=0.90, random_state=42)

print("Train/Test Sizes : ", X_train.shape, X_test.shape, Y_train.shape, Y_test.shape, "\n")

dmat_train = xgb.DMatrix(X_train, Y_train, feature_names=boston.feature_names)
dmat_test = xgb.DMatrix(X_test, Y_test, feature_names=boston.feature_names)

tweedie_booster = xgb.train({'max_depth': 3, 'eta': 1, 'objective': 'reg:tweedie',
                             'tree_method':'gpu_hist', 'gpu_id':0},
                    dmat_train,
                    evals=[(dmat_train, "train"), (dmat_test, "test")])

print("\nTrain RMSE : ",tweedie_booster.eval(dmat_train))
print("Test  RMSE : ",tweedie_booster.eval(dmat_test))

from sklearn.metrics import r2_score

print("\nTest  R2 Score : %.2f"%r2_score(Y_test, tweedie_booster.predict(dmat_test)))
print("Train R2 Score : %.2f"%r2_score(Y_train, tweedie_booster.predict(dmat_train)))
Train/Test Sizes :  (455, 13) (51, 13) (455,) (51,)

[0]	train-tweedie-nloglik@1.5:28.32970	test-tweedie-nloglik@1.5:26.66487
[1]	train-tweedie-nloglik@1.5:19.30740	test-tweedie-nloglik@1.5:18.58394
[2]	train-tweedie-nloglik@1.5:18.72894	test-tweedie-nloglik@1.5:18.14010
[3]	train-tweedie-nloglik@1.5:18.71593	test-tweedie-nloglik@1.5:18.13066
[4]	train-tweedie-nloglik@1.5:18.70913	test-tweedie-nloglik@1.5:18.12305
[5]	train-tweedie-nloglik@1.5:18.70438	test-tweedie-nloglik@1.5:18.12354
[6]	train-tweedie-nloglik@1.5:18.70052	test-tweedie-nloglik@1.5:18.11985
[7]	train-tweedie-nloglik@1.5:18.69751	test-tweedie-nloglik@1.5:18.12120
[8]	train-tweedie-nloglik@1.5:18.69563	test-tweedie-nloglik@1.5:18.12000
[9]	train-tweedie-nloglik@1.5:18.69326	test-tweedie-nloglik@1.5:18.12512

Train RMSE :  [0]	eval-tweedie-nloglik@1.5:18.693260
Test  RMSE :  [0]	eval-tweedie-nloglik@1.5:18.125120

Test  R2 Score : 0.90
Train R2 Score : 0.95

Below we have explained how we can inform XGBClassifier to run training on GPU. The same will work for XGBRegressor, XGBRFClassifier, and XGBRFRegressor as well.

In [75]:
X_train, X_test, Y_train, Y_test = train_test_split(breast_cancer.data, breast_cancer.target,
                                                    stratify=breast_cancer.target,
                                                    train_size=0.90, random_state=42)

xgb_classif = xgb.XGBClassifier(tree_method="gpu_hist", gpu_id=0)

xgb_classif.fit(X_train, Y_train)

print("Test  Accuracy Score : %.2f"%xgb_classif.score(X_test, Y_test))
print("Train Accuracy Score : %.2f"%xgb_classif.score(X_train, Y_train))
Test  Accuracy Score : 0.95
Train Accuracy Score : 1.00

GPU & Dask Together For Parallel GPUs

Xgboost lets us run our code in parallel on multi GPUs as well by using dask. We can use dask for distributed training of our dataset and we can tree_method to gpu_hist to instruct each worker of dask to run the training process on GPU. This way we can run a training process on all workers of dask where each worker will run training on GPU of its own.



Sunny Solanki  Sunny Solanki