New strategies for KBinsDiscretizer #19255

@glevv

New non-parametric strategies could be added to KBinsDiscretizer: geometric, winsorized, and combined (Uniform+Quantile, Geometric+Uniform, Geometric+Quantile).
Winsorized binning uses an interpercentile range (p95 - p05 in my examples) instead of the peak-to-peak range (max - min) used by Uniform. This lets the algorithm ignore outliers and preserve the shape of the distribution where most of the data mass lies.
Geometric binning uses incremental binning, where bin widths change step by step as a geometric progression. This technique can "deskew" a distribution (or skew it if it was symmetric). It is used in some GIS packages and could be beneficial for regression models.
Uniquant is Uniform+Quantile (CatBoost also has this one). It is simply the average of the Uniform and Quantile bins. Geouni and Geoquant are built the same way. I did not plot the Quantile strategy, since its output always follows a uniform distribution. I used n_bins=31 with N=10_000.
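
A minimal NumPy sketch of how the bin edges for these strategies could be computed. This is not the code behind the plots: the function names are hypothetical, the geometric formula (shifting the data so that `np.geomspace` starts at 1) is only one possible formulation, and averaging the edges is an assumed reading of "average between Uniform and Quantile bins".

```python
import numpy as np


def uniform_edges(x, n_bins):
    # Equal-width bins over the peak-to-peak range (max - min).
    return np.linspace(x.min(), x.max(), n_bins + 1)


def quantile_edges(x, n_bins):
    # Equal-frequency bins: edges at evenly spaced percentiles.
    return np.percentile(x, np.linspace(0, 100, n_bins + 1))


def winsorized_edges(x, n_bins, lower=5, upper=95):
    # Equal-width bins over the interpercentile range (p95 - p05 by default).
    # `lower` and `upper` are the extra parameter this strategy would need.
    lo, hi = np.percentile(x, [lower, upper])
    return np.linspace(lo, hi, n_bins + 1)


def geometric_edges(x, n_bins):
    # Assumed formulation: bin widths form a geometric progression.
    # Shifting by +1/-1 keeps geomspace valid for zero or negative minima.
    lo, hi = x.min(), x.max()
    return lo + np.geomspace(1.0, hi - lo + 1.0, n_bins + 1) - 1.0


def combined_edges(edges_a, edges_b):
    # Assumed formulation: element-wise average of two sets of edges.
    return (edges_a + edges_b) / 2.0


rng = np.random.default_rng(0)
x = rng.exponential(size=10_000)
n_bins = 31

uni = uniform_edges(x, n_bins)
quant = quantile_edges(x, n_bins)
wins = winsorized_edges(x, n_bins)
geo = geometric_edges(x, n_bins)
uniquant = combined_edges(uni, quant)
geouni = combined_edges(geo, uni)
geoquant = combined_edges(geo, quant)

# Assign bin indices using interior edges only, so values outside the
# winsorized range fall into the first or last bin instead of being dropped.
codes = np.digitize(x, wins[1:-1])
```

With the interior edges passed to `np.digitize`, the outliers beyond p05 and p95 are absorbed by the outermost bins, which is one way to keep every sample while ignoring the tails when placing the edges.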

[Figures: binned distributions for Bimodal, Exponential, LogNormal and Normal data, N=10000]

More info about winsorized binning can be found here. It would require an additional parameter, and the effect of this binning technique can be achieved by other strategies without additional hyperparameters, so this one is debatable.
More info about geometric binning (and other interesting techniques) can be found here.

Results for LogisticRegression, RandomForestClassifier, Ridge and RandomForestRegressor are shown below. The datasets were generated with make_classification and make_regression with n_samples=1000, and the discretizer used n_bins=10 (default value). "Combined" here means Uniform+Quantile. I did not test the other combinations mentioned above.

[Figures: R2 for regression (forest, linear) and ROC AUC for classification (forest, linear)]
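
A rough sketch of how a comparison like this could be set up for the linear classifier, using only the strategies KBinsDiscretizer already provides. The one-hot encoding, 5-fold cross-validation, `random_state` and `max_iter` values are assumptions for illustration; the exact evaluation protocol behind the figures is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic dataset with the sample size mentioned above.
X, y = make_classification(n_samples=1000, random_state=0)

scores = {}
for strategy in ("uniform", "quantile"):
    # Discretize every feature into 10 bins, then fit a linear classifier.
    model = make_pipeline(
        KBinsDiscretizer(n_bins=10, encode="onehot", strategy=strategy),
        LogisticRegression(max_iter=1000),
    )
    # Mean ROC AUC over 5 cross-validation folds (assumed protocol).
    scores[strategy] = cross_val_score(
        model, X, y, cv=5, scoring="roc_auc"
    ).mean()

for strategy, score in scores.items():
    print(f"{strategy}: ROC AUC = {score:.3f}")
```

Putting the discretizer inside the pipeline means the bin edges are recomputed on each training fold, so the held-out fold does not influence them.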

All of these new algorithms are easy to implement (just another elif branch and two lines of code for each one) and fast to compute, since no fitting is necessary.
