LightGBM Classifier Example

Gradient Boosting with scikit-learn, XGBoost, LightGBM, and CatBoost

Gradient boosting is popular for structured predictive modeling problems, such as classification and regression on tabular data, and is often the main algorithm, or one of the main algorithms, used in winning solutions to machine learning competitions, like those on Kaggle.

There are many implementations of gradient boosting available, including the standard implementation in scikit-learn and efficient third-party libraries. Each uses a different interface and even different names for the algorithm.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using an arbitrary differentiable loss function and a gradient descent optimization algorithm. This gives the technique its name, gradient boosting, as the loss gradient is minimized as the model is fit, much like a neural network.

Gradient boosting is an effective machine learning algorithm and is often the main, or one of the main, algorithms used to win machine learning competitions (like Kaggle) on tabular and similar structured datasets.

Additional third-party libraries are available that provide computationally efficient alternate implementations of the algorithm that often achieve better results in practice. Examples include the XGBoost library, the LightGBM library, and the CatBoost library.

Note: We are not comparing the performance of the algorithms in this tutorial. Instead, we are providing code examples to demonstrate how to use each different implementation. As such, we are using synthetic test datasets to demonstrate evaluating and making a prediction with each implementation.

The dataset will have 1,000 examples, with 10 input features, five of which will be informative and the remaining five redundant. We will fix the random number seed to ensure we get the same examples each time the code is run.
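
A minimal sketch of how such a dataset can be defined with scikit-learn's make_classification function is shown below; the particular random seed value is an assumption.

# sketch: synthetic binary classification dataset as described above
from sklearn.datasets import make_classification

# 1,000 rows, 10 inputs: 5 informative and 5 redundant; fixed seed for repeatability
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=7)
print(X.shape, y.shape)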

The example below first evaluates a GradientBoostingClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.
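
The listing below is a minimal sketch of that procedure; the dataset settings, cross-validation configuration (10 folds, three repeats), and seeds are assumptions consistent with the description, and the row used for the final prediction is a stand-in for a new sample.

# sketch: evaluate GradientBoostingClassifier, then fit on all data and predict once
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=7)
# evaluate with repeated stratified k-fold cross-validation
model = GradientBoostingClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
# fit on all available data and make a single prediction
model.fit(X, y)
row = X[:1]  # stand-in for a new, unseen sample
print('Prediction:', model.predict(row)[0])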

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The example below first evaluates a GradientBoostingRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.
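
A sketch of that procedure follows; the synthetic regression dataset created with make_regression mirrors the classification setup above, and its exact generator settings are an assumption.

# sketch: evaluate GradientBoostingRegressor with repeated k-fold CV (MAE), then fit and predict once
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, RepeatedKFold

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=7)
model = GradientBoostingRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# scikit-learn reports MAE as a negative score so that larger is better
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
model.fit(X, y)
print('Prediction: %.3f' % model.predict(X[:1])[0])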

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The scikit-learn library also provides histogram-based gradient boosting, an alternate approach to implementing gradient tree boosting inspired by the LightGBM library (described more later). This implementation is provided via the HistGradientBoostingClassifier and HistGradientBoostingRegressor classes.

The example below first evaluates a HistGradientBoostingClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.
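
A minimal sketch, assuming the same synthetic dataset and evaluation harness as above (scikit-learn versions older than 1.0 also require the experimental enable_hist_gradient_boosting import):

# sketch: evaluate HistGradientBoostingClassifier, then fit and predict once
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=7)
model = HistGradientBoostingClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
model.fit(X, y)
print('Prediction:', model.predict(X[:1])[0])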

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The example below first evaluates a HistGradientBoostingRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.
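
A sketch under the same assumptions, swapping in the regressor and MAE scoring:

# sketch: evaluate HistGradientBoostingRegressor, then fit and predict once
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score, RepeatedKFold

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=7)
model = HistGradientBoostingRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
model.fit(X, y)
print('Prediction: %.3f' % model.predict(X[:1])[0])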

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The example below first evaluates an XGBClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.
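
A minimal sketch using the xgboost scikit-learn wrapper, with the same assumed dataset and evaluation settings as above:

# sketch: evaluate XGBClassifier, then fit and predict once
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=7)
model = XGBClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
model.fit(X, y)
print('Prediction:', model.predict(X[:1])[0])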

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The example below first evaluates an XGBRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.
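
A sketch under the same assumptions, this time for regression with MAE scoring:

# sketch: evaluate XGBRegressor, then fit and predict once
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, RepeatedKFold
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=7)
model = XGBRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
model.fit(X, y)
print('Prediction: %.3f' % model.predict(X[:1])[0])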

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The example below first evaluates an LGBMClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.
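
A minimal sketch using the LightGBM scikit-learn wrapper, with the same assumed dataset and evaluation settings:

# sketch: evaluate LGBMClassifier, then fit and predict once
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=7)
model = LGBMClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
model.fit(X, y)
print('Prediction:', model.predict(X[:1])[0])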

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The example below first evaluates an LGBMRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.
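
A sketch under the same assumptions, swapping in the regressor and MAE scoring:

# sketch: evaluate LGBMRegressor, then fit and predict once
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, RepeatedKFold
from lightgbm import LGBMRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=7)
model = LGBMRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
model.fit(X, y)
print('Prediction: %.3f' % model.predict(X[:1])[0])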

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The primary benefit of CatBoost (in addition to computational speed improvements) is its support for categorical input variables. This gives the library its name: CatBoost, short for "Category Gradient Boosting."

The CatBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the CatBoostClassifier and CatBoostRegressor classes.

The example below first evaluates a CatBoostClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.
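
A minimal sketch with the CatBoost wrapper class; verbose=0 is set here only to silence per-iteration logging, and the dataset and CV settings remain the same assumptions as above:

# sketch: evaluate CatBoostClassifier, then fit and predict once
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=7)
model = CatBoostClassifier(verbose=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
model.fit(X, y)
print('Prediction:', model.predict(X[:1])[0])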

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The example below first evaluates a CatBoostRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.
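
A sketch under the same assumptions for the regression case:

# sketch: evaluate CatBoostRegressor, then fit and predict once
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, RepeatedKFold
from catboost import CatBoostRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=7)
model = CatBoostRegressor(verbose=0)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
model.fit(X, y)
print('Prediction: %.3f' % model.predict(X[:1])[0])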

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Hi Jason, all of my work is time series regression with utility metering data, and I always just look at RMSE because it's in units that make sense to me. Basically, when using from sklearn.metrics import mean_squared_error, I just take math.sqrt(mse). I notice that you use mean absolute error in the code above. Is there anything wrong with what I am doing, aiming for the best model by only viewing RMSE?

When you use RepeatedStratifiedKFold, mostly the accuracy is calculated to identify the best-performing model. What if one wants to calculate metrics like recall, precision, sensitivity, and specificity? How do we calculate them for each of the repeated folds, and also their final mean, the same way accuracy is calculated?

So if you set the number of informative features to 5, does it mean that the classifier will detect those 5 attributes with high feature importance scores, while the other 5 redundant features will score low?

Hi Jason, I am confused about how a light gradient boosting model works, since in the API they use num_round = 10; bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data]) to fit the model with the training data.

The question is, I must answer this point: "the robustness of the system is not clear, you have to specify it." But I have no idea how to estimate robustness or what I should read to answer it. Any help, please?

One more blog of yours was published by Johar Ashfaque: https://medium.com/ai-in-plain-english/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost-58e372d0d34b. I did not find any reference to your article. This is the second one I know of. He seems to have omitted Histogram-Based Gradient Boosting here.

Hey Jason, just wondering how you can incorporate early stopping with CatBoost and LightGBM? I'm getting an error asking for a validation set to be generated. I'm wondering if cross_val_score isn't compatible with early stopping. Cheers!

How to Develop a Light Gradient Boosted Machine (LightGBM) Ensemble

LightGBM extends the gradient boosting algorithm by adding a type of automatic feature selection as well as focusing on boosting examples with larger gradients. This can result in a dramatic speedup of training and improved predictive performance.

As such, LightGBM has become a de facto algorithm for machine learning competitions when working with tabular data for regression and classification predictive modeling tasks. It therefore deserves a share of the credit for the increased popularity and wider adoption of gradient boosting methods in general, along with Extreme Gradient Boosting (XGBoost).

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using an arbitrary differentiable loss function and a gradient descent optimization algorithm. This gives the technique its name, gradient boosting, as the loss gradient is minimized as the model is fit, much like a neural network.

Gradient-based One-Side Sampling, or GOSS for short, is a modification to the gradient boosting method that focuses attention on those training examples that result in a larger gradient, in turn speeding up learning and reducing the computational complexity of the method.

With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size.

Exclusive Feature Bundling, or EFB for short, is an approach for bundling sparse (mostly zero) mutually exclusive features, such as categorical variable inputs that have been one-hot encoded. As such, it is a type of automatic feature selection.

Together, these two changes can accelerate the training time of the algorithm by up to 20x. As such, LightGBM may be considered gradient boosting decision trees (GBDT) with the addition of GOSS and EFB.

We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that, LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy

The LightGBM library has its own custom API, although we will use the method via the scikit-learn wrapper classes: LGBMRegressor and LGBMClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

We will evaluate the model using repeated stratified k-fold cross-validation with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

As in the previous section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger (less negative) MAE values are better and a perfect model has an MAE of 0.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The tree depth controls how specialized each tree is to the training dataset: how general or overfit it might be. Trees are preferred that are not too shallow and general (like AdaBoost) and not too deep and specialized (like bootstrap aggregation).

There are two main ways to control tree complexity: the maximum depth of the trees and the maximum number of terminal nodes (leaves) in each tree. In this case, we are exploring tree depth, so we also need to increase the number of leaves to support deeper trees by setting the num_leaves argument, as shown in the sketch below.
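
The sketch varies tree depth and raises num_leaves to match (num_leaves = 2**depth); the specific range of depths explored is an assumption.

# sketch: explore LightGBM tree complexity via max_depth and num_leaves
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=7)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
for depth in range(1, 11):
    # allow enough leaves to realise the requested depth
    model = LGBMClassifier(max_depth=depth, num_leaves=2 ** depth)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('depth=%d: %.3f (%.3f)' % (depth, mean(scores), std(scores)))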

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a larger learning rate results in better performance on this dataset. We would expect that adding more trees to the ensemble for the smaller learning rates would further lift performance.
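
A comparison like the one described might be set up as in the sketch below; the particular grid of learning rates is an assumption.

# sketch: compare LightGBM learning rates with the same evaluation harness
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=7)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
for rate in [0.0001, 0.001, 0.01, 0.1, 1.0]:
    model = LGBMClassifier(learning_rate=rate)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('learning_rate=%.4f: %.3f (%.3f)' % (rate, mean(scores), std(scores)))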

DART is described in the 2015 paper titled DART: Dropouts meet Multiple Additive Regression Trees and, as its name suggests, adds the concept of dropout from deep learning to the Multiple Additive Regression Trees (MART) algorithm, a precursor to gradient boosting decision trees.
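
With the scikit-learn wrapper, DART can be selected via the boosting_type argument, as in this sketch (evaluation setup as before):

# sketch: evaluate LightGBM with the DART boosting type
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=7)
model = LGBMClassifier(boosting_type='dart')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))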

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Hi Jason, thanks for the wonderful paper. One question regarding max_depth vs num_leaves: with MaxDepth = [2,3,4,5,6,7,8,9,10] and NumLeaves = [2 ** i for i in MaxDepth], i.e. NumLeaves = [4, 8, 16, 32, 64, 128, 256, 512, 1024], higher num_leaves values may cause overfitting, right? I am also getting the warning "No further splits with positive gain, best gain: -inf". May I know how to set num_leaves optimally for smaller and larger datasets? Thanks.

Hi, I want to know: with the above-mentioned algorithm we obtained an accuracy score, but how do we get the values of the confusion matrix? I don't know how to do it. I fit the model with model.fit(X, y) (whatever the model is, in this case LightGBM, etc.), and after that ypred = model.predict(X) and confusion_matrix(y, ypred).

That gives me a near-perfect result, as if there were no errors at all. So kindly tell me how I can build a confusion matrix using cross-validation instead of a train/test split, with only the X input variables and the y outcome variable.

Sir, I want to know the difference between LightGBM and histogram-based gradient boosting, because I have read in many places that LightGBM uses a histogram-based approach. So I am confused: if LightGBM automatically uses the concept of binning in its training process, then what is the difference between histogram-based gradient boosting and LightGBM?

Hi, thanks for this article, but there is something I can't understand. You already explained the different boosting types of LGBM, and you said that the default is GBDT, which is the classical gradient boosting algorithm. So is using LGBM with gbdt the same as using the GradientBoosting class of sklearn, or is it interesting to compare their results? (I want to implement both LGBM and GBM, but I'm not sure it's a good idea if I just keep the default parameters for LGBM.) Also, the power of LGBM is related to GOSS and EFB, but since those two are not used in the default parameters, are we not really using LGBM?

Python Examples of lightgbm.LGBMRegressor

The following are 30 code examples showing how to use lightgbm.LGBMRegressor(). These examples are extracted from open source projects.

Python - Multiclass Classification with LightGBM - Stack Overflow

While analyzing the predictions I found that they contain only 2 classes: 0 and 1. Class 2 was the second-largest class in the training set, but it was nowhere to be found in the predictions. On evaluating the result, it gave about 78% accuracy.

The model produces three probabilities as you show, and just from the first output you provided, [7.93856847e-06, 9.99989550e-01, 2.51164967e-06], the second class has the highest probability, so I can't see the problem here.
