2nd place solution for Sberbank Data Science Journey 2018 AutoML competition
Main scripts:
- train.py - training
- predict.py - prediction on test data
In addition to basic preprocessing (extracting datetime features, encoding categorical variables, dropping constant features):
- is_holiday flag for each datetime feature
- mean target encoding for categorical features (see the sketch below)
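
A minimal sketch of how the mean target encoding could look (the real implementation is in train.py; the helper name and the smoothing parameter `alpha` are illustrative, not the repo's actual API):

```python
import pandas as pd


def mean_target_encode(train, test, cat_col, target_col, alpha=10.0):
    """Replace a categorical column with the smoothed mean of the target.

    alpha pulls rare categories toward the global mean, which limits
    overfitting; in practice the train-side encoding should be computed
    out-of-fold to avoid target leakage.
    """
    global_mean = train[target_col].mean()
    stats = train.groupby(cat_col)[target_col].agg(["mean", "count"])
    smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (
        stats["count"] + alpha
    )
    new_col = cat_col + "_target_enc"
    train[new_col] = train[cat_col].map(smoothed).fillna(global_mean)
    test[new_col] = test[cat_col].map(smoothed).fillna(global_mean)
    return train, test
```
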
To deal with memory issues (see the sketch below):
- read a small part of the data to infer column data types, then read the entire dataset with float32 instead of float64
- parse datetime columns while reading
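
A sketch of the two-pass read, assuming datetime columns can be recognized by name in the probe (that detection rule is a hypothetical stand-in):

```python
import pandas as pd


def read_csv_low_memory(path, nrows_probe=1000):
    """Two-pass read: probe dtypes on a sample, then load the full file."""
    probe = pd.read_csv(path, nrows=nrows_probe)
    # Store floats as float32 instead of the default float64.
    dtypes = {c: "float32" for c in probe.columns if probe[c].dtype == "float64"}
    # Parse datetime columns during the read itself (the name-based
    # detection here is an assumption).
    date_cols = [c for c in probe.columns if c.startswith("datetime")]
    return pd.read_csv(path, dtype=dtypes, parse_dates=date_cols)
```
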
Modeling:
- LightGBM
- Hyperopt for hyperparameter tuning (see the sketch below)
- after each step, check that the time limit has not been exceeded before continuing
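
A simplified sketch of the tuning loop: LightGBM wrapped in a Hyperopt objective that records every trained model and respects a time budget. The budget, search space, and validation split are illustrative assumptions, not the exact values used in train.py:

```python
import time

import lightgbm as lgb
import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.model_selection import train_test_split

TIME_LIMIT = 300  # seconds allotted to this step (assumed budget)


def tune_lightgbm(X, y, start_time, max_evals=50):
    """Tune LightGBM with Hyperopt, keeping every trained model."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
    space = {
        "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
        "num_leaves": hp.choice("num_leaves", [15, 31, 63, 127]),
        "min_child_samples": hp.choice("min_child_samples", [5, 20, 50]),
    }
    trained = []  # (validation loss, model) pairs, kept for later blending

    def objective(params):
        # Respect the time budget: skip real work once it is exhausted.
        if time.time() - start_time > TIME_LIMIT:
            return {"status": STATUS_OK, "loss": 1e18}
        model = lgb.LGBMRegressor(n_estimators=200, **params)
        model.fit(X_tr, y_tr)
        loss = float(np.mean((model.predict(X_va) - y_va) ** 2))
        trained.append((loss, model))
        return {"status": STATUS_OK, "loss": loss}

    fmin(objective, space, algo=tpe.suggest, max_evals=max_evals, trials=Trials())
    # Best models first, ready for ensemble selection.
    return sorted(trained, key=lambda pair: pair[0])
```
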
Ensemble (blending) of the best models from Hyperopt:
- during Hyperopt iterations, remember all models that were trained
- choose the 5 best models at the end
- blend them with stepwise blending, following Caruana et al. (2004), Ensemble Selection from Libraries of Models (see the sketch below)
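
A sketch of the stepwise (greedy forward) blending from Caruana et al. (2004): repeatedly add whichever model most reduces the validation error, with replacement, then turn selection counts into weights. The step count is an illustrative choice:

```python
import numpy as np


def stepwise_blend(val_preds, y_val, n_steps=10):
    """Greedy forward ensemble selection (Caruana et al., 2004).

    val_preds: one validation-set prediction array per candidate model.
    Models can be picked repeatedly (selection with replacement); the
    returned weights are the normalized selection counts.
    """
    y_val = np.asarray(y_val, dtype=float)
    counts = np.zeros(len(val_preds))
    running_sum = np.zeros_like(y_val)
    for step in range(1, n_steps + 1):
        # Tentatively add each model; keep the one with the lowest MSE.
        losses = [np.mean(((running_sum + p) / step - y_val) ** 2) for p in val_preds]
        best = int(np.argmin(losses))
        running_sum += val_preds[best]
        counts[best] += 1
    return counts / counts.sum()
```

The returned weights are then applied to the matching test-set predictions to produce the final blend.
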
Very small data:
- don't use target encoding (to prevent overfitting)
- don't tune hyperparameters at all (to prevent overfitting)
- train several models (LightGBM, XGBoost, Random Forest, Extra Trees) with random parameters and average their predictions (see the sketch below)
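
A sketch of the small-data fallback: several model families with randomly drawn parameters, averaged uniformly (the parameter ranges are made up for illustration):

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from xgboost import XGBRegressor


def fit_predict_small_data(X_train, y_train, X_test, seed=0):
    """Average several differently parameterized models, no tuning."""
    rng = np.random.RandomState(seed)
    models = [
        LGBMRegressor(n_estimators=100, num_leaves=int(rng.choice([15, 31, 63]))),
        XGBRegressor(n_estimators=100, max_depth=int(rng.choice([3, 5, 7]))),
        RandomForestRegressor(n_estimators=100, max_depth=int(rng.choice([5, 10]))),
        ExtraTreesRegressor(n_estimators=100, max_depth=int(rng.choice([5, 10]))),
    ]
    preds = [m.fit(X_train, y_train).predict(X_test) for m in models]
    # Uniform average of the individual predictions.
    return np.mean(preds, axis=0)
```
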
Very big data:
- simple feature selection based on LightGBM feature importance (see the sketch below)
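
A sketch of the big-data feature selection: fit a quick LightGBM model and keep only the most important columns. `X` is assumed to be a DataFrame, and the cutoff `keep_top` is an illustrative value:

```python
import lightgbm as lgb
import numpy as np


def select_top_features(X, y, keep_top=100):
    """Drop all but the most important columns before the main training run."""
    probe = lgb.LGBMRegressor(n_estimators=50)
    probe.fit(X, y)
    # Indices of the keep_top highest-importance features.
    top = np.argsort(probe.feature_importances_)[::-1][:keep_top]
    return X.iloc[:, top]
```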