If you face imbalanced data in your machine learning project, this package can pre-process it for you. It is an efficient, ready-to-use implementation of MGS-GRF, an oversampling strategy presented at the ECML-PKDD 2025 conference and designed to handle large-scale imbalanced datasets with mixed features, both continuous and categorical.
First, clone the repository:
git clone git@github.com:artefactory/mgs-grf.git
Then install the required packages into your environment (conda, mamba or pip):
pip install -r requirements.txt
Here is a short example of how to use MGS-GRF:
import lightgbm as lgb
import numpy as np
from sklearn.preprocessing import OneHotEncoder

from mgs_grf import MGSGRFOverSampler

# X_train_imbalanced, y_train_imbalanced and X_test are assumed to be NumPy arrays;
# numeric_features and categorical_features are assumed to be lists of column indices.

# Apply the MGS-GRF procedure to oversample the data
mgs_grf = MGSGRFOverSampler(categorical_features=categorical_features, random_state=0)
X_train_balanced, y_train_balanced = mgs_grf.fit_resample(X_train_imbalanced, y_train_imbalanced)

# Encode the categorical variables (fit on the training data only, then reuse on the test data)
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_train_balanced_enc = np.hstack((X_train_balanced[:, numeric_features],
                                  enc.fit_transform(X_train_balanced[:, categorical_features])))
X_test_enc = np.hstack((X_test[:, numeric_features], enc.transform(X_test[:, categorical_features])))

# Fit the final classifier on the augmented data
clf = lgb.LGBMClassifier(n_estimators=100, verbosity=-1, random_state=0)
clf.fit(X_train_balanced_enc, y_train_balanced)

A more detailed example is available in the accompanying notebook.
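It can be useful to compare class counts before and after resampling, to confirm that the oversampler changed the label distribution as expected. The sketch below is a minimal, standard-library-only check. The helpers `class_counts` and `imbalance_ratio` are hypothetical names introduced here for illustration and are not part of `mgs_grf`, and the label lists are toy stand-ins, not output produced by MGS-GRF.

```python
from collections import Counter


def class_counts(labels):
    """Return a {label: count} dict for any iterable of class labels."""
    return dict(Counter(labels))


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute an imbalance ratio")
    return max(counts.values()) / min(counts.values())


# Toy labels standing in for y_train_imbalanced (95 majority, 5 minority samples).
y_before = [0] * 95 + [1] * 5
# Toy labels standing in for a resampled target (assumed here to be fully balanced).
y_after = [0] * 95 + [1] * 95

print("before:", class_counts(y_before), "ratio:", imbalance_ratio(y_before))
print("after: ", class_counts(y_after), "ratio:", imbalance_ratio(y_after))
```

With NumPy label arrays, passing `y.tolist()` keeps the dictionary keys as plain Python values. A ratio close to 1 after resampling indicates that the classes are roughly balanced.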
This work was done through a partnership between Artefact Research Center and the Laboratoire de Probabilités Statistiques et Modélisation (LPSM) of Sorbonne University.
If you find the code useful, please consider citing us:
@inproceedings{sakho2025harnessing,
title={Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers Scoring},
author={Sakho, Abdoulaye and Malherbe, Emmanuel and Gauthier, Carl-Erik and Scornet, Erwan},
booktitle={Joint European Conference on Machine Learning and Knowledge Discovery in Databases},
pages={247--264},
year={2025},
organization={Springer}
}