Synthetic Tabular Data Generation Library
This Python library generates synthetic tabular data. It provides a range of statistical, Machine Learning (ML) and Deep Learning (DL) methods that capture patterns in real datasets and replicate them in synthetic data. Its applications include pre-processing of tabular datasets, data balancing and resampling.
🔩 Pre-process your data.
🕜 State-of-the-art models.
♻️ Easy to use and customize.
The gentab library can be installed with pip. We recommend using a virtual environment to avoid conflicts with other software on your machine.
pip install gentab

Below is the list of the generators currently available in the library.
| Model | Example | Paper |
|---|---|---|
| SMOTE | link | |
| ADASYN | link | |

| Model | Example | Paper |
|---|---|---|
| Gaussian Copula | link | |

| Model | Example | Paper |
|---|---|---|
| TVAE | link | |

| Model | Example | Paper |
|---|---|---|
| CTGAN | link | |
| CTAB-GAN | link | |
| CTAB-GAN+ | link | |

| Model | Example | Paper |
|---|---|---|
| ForestDiffusion | link | |

| Model | Example | Paper |
|---|---|---|
| GReaT | link | |
| Tabula | link | |

| Model | Example | Papers |
|---|---|---|
| Copula GAN | link link | |
| AutoDiffusion | link | |
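
To give a feel for the simplest family of generators listed above, here is a minimal SMOTE-style sketch in NumPy: each synthetic row is an interpolation between a minority-class row and one of its nearest neighbours. This is an illustration only and not gentab's implementation; the function name `smote_like_oversample` and its parameters are hypothetical, and the sketch assumes purely numeric features.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Hypothetical helper for illustration; not part of gentab.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances; the diagonal is set to infinity so that
    # a row is never selected as its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]  # shape (n, k)

    base = rng.integers(0, n, size=n_new)                       # random base rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per base row
    gap = rng.random((n_new, 1))                                # interpolation weight in [0, 1)
    return X[base] + gap * (X[picked] - X[base])


# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_min, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic row is a convex combination of two real rows, the output stays inside the region spanned by the minority class. Categorical columns need different handling, which this sketch does not attempt.
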
from gentab.generators import AutoDiffusion
from gentab.evaluators import MLP
from gentab.data import Config, Dataset
from gentab.utils import console
config = Config("configs/playnet.json")
dataset = Dataset(config)
# Reduce the size of selected classes using per-class factors.
dataset.reduce_size(
    {
        "left_attack": 0.97,
        "right_attack": 0.97,
        "right_transition": 0.9,
        "left_transition": 0.9,
        "time_out": 0.8,
        "left_penal": 0.5,
        "right_penal": 0.5,
    }
)
# Merge related classes under a single label.
dataset.merge_classes(
    {
        "attack": ["left_attack", "right_attack"],
        "transition": ["left_transition", "right_transition"],
        "penalty": ["left_penal", "right_penal"],
    }
)
dataset.reduce_mem()
console.print(dataset.class_counts(), dataset.row_count())
generator = AutoDiffusion(dataset)
generator.generate()
console.print(dataset.generated_class_counts(), dataset.generated_row_count())
evaluator = MLP(generator)
evaluator.evaluate()
dataset.save_to_disk(generator)

from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset
config = Config("configs/adult.json")
dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()
generator = AutoDiffusion(dataset)
evaluator = LightGBM(generator)
trials = 10
timeout_seconds = 60 * 60 * 8  # 8 hours; avoids shadowing the stdlib `time` module
tuner = AutoDiffusionTuner(evaluator, trials, timeout=timeout_seconds)
tuner.tune()
tuner.save_to_disk()

from gentab.generators import AutoDiffusion
from gentab.evaluators import LightGBM
from gentab.tuners import AutoDiffusionTuner
from gentab.data import Config, Dataset
config = Config("configs/adult.json")
dataset = Dataset(config)
dataset.merge_classes({
    "<=50K": ["<=50K."], ">50K": [">50K."]
})
dataset.reduce_mem()
# Load previously saved dataset...
generator = AutoDiffusion(dataset)
generator.load_from_disk()
# Do work with previously generated but not tuned dataset...
evaluator = LightGBM(generator)
evaluator.evaluate()
evaluator.evaluate_baseline()
# Load previously tuned and saved dataset...
tuner = AutoDiffusionTuner(evaluator, 0)
tuner.load_from_disk()
# Do work with previously tuned dataset...
evaluator.evaluate()
evaluator.evaluate_baseline()

This project has received support from the Spanish Ministry of Science and Innovation (AEI/PID2020-115734RB-C22 and AEI/RYC2018-025385-I), Xunta de Galicia (ED431F 2021/11) and EU-FEDER Galicia (ED431G 2019/01).