A comprehensive machine learning toolkit that combines AutoML, genetic programming, clustering, and model explainability features for advanced data analysis and model development.
reveng/
├── src/ # Core modules
│ ├── automl.py # H2O AutoML implementation
│ ├── explainautoml.py # Model explainability with LIME
│ ├── geneticlearn.py # Genetic programming using DEAP
│ ├── h2owrapper.py # H2O wrapper utilities
│ ├── preclustering.py # DBSCAN clustering and outlier detection
│ └── preprocessing.py # Data preprocessing pipeline
├── utils/ # Utility functions
├── config/ # Configuration files
├── data/ # Sample datasets
├── __main__.py # Main entry point
└── __init__.py # Package initialization
Configure your dataset in config/config.json (see Configuration section)
python -m reveng
Available commands:
- cl: Compute preclustering and outlier detection
- gp: Run genetic programming
- automl: Execute AutoML pipeline
- codb: Connect to database
- q: Quit the application
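The command list suggests a simple read-and-dispatch loop. The following is a minimal sketch of that idea, not the actual reveng implementation: the handler functions and the `dispatch` helper are hypothetical names introduced only for illustration, and each handler is a placeholder that returns a label instead of doing real work.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical handlers: placeholders standing in for the real reveng modules.
def run_preclustering() -> str:
    return "preclustering"

def run_genetic_programming() -> str:
    return "genetic programming"

def run_automl() -> str:
    return "automl"

def connect_database() -> str:
    return "database"

# Map each CLI command string to its handler.
COMMANDS: Dict[str, Callable[[], str]] = {
    "cl": run_preclustering,
    "gp": run_genetic_programming,
    "automl": run_automl,
    "codb": connect_database,
}

def dispatch(inputs: Iterable[str]) -> List[str]:
    """Run each command in order until 'q' is read; report unknown commands."""
    results = []
    for raw in inputs:
        command = raw.strip().lower()
        if command == "q":
            break
        handler = COMMANDS.get(command)
        results.append(handler() if handler else f"unknown command: {command}")
    return results
```

A dictionary lookup keeps the set of commands in one place, so adding a command means adding one entry. In an interactive session the same function could be fed lines from `sys.stdin`.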
- AutoML Integration: Automated machine learning using H2O framework with support for multiple algorithms (Random Forest, GLM, Deep Learning, GBM)
- Genetic Programming: Feature selection and model generation using DEAP (Distributed Evolutionary Algorithms in Python)
- Clustering & Outlier Detection: DBSCAN-based clustering with automatic outlier identification
- Model Explainability: LIME (Local Interpretable Model-agnostic Explanations) integration for model interpretability
- Data Preprocessing: Comprehensive preprocessing pipeline with encoding, scaling, and feature selection
- Interactive CLI: Command-line interface for easy interaction with all features
- Configurable: JSON-based configuration system for flexible dataset handling
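To illustrate the clustering and outlier detection feature, here is a minimal sketch using scikit-learn's `DBSCAN`, which labels points that belong to no cluster as `-1`. It is an assumption that reveng works this way internally. The `cluster_with_outliers` helper, the scaling step, the `eps` and `min_samples` values, and the synthetic data are all illustrative and not taken from the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def cluster_with_outliers(X, eps=0.5, min_samples=5):
    """Scale features, run DBSCAN, and return (labels, outlier_mask)."""
    X_scaled = StandardScaler().fit_transform(X)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    # DBSCAN marks noise points with the label -1.
    return labels, labels == -1

rng = np.random.default_rng(42)
# Two tight synthetic blobs plus one far-away point (illustrative data only).
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
    [[20.0, -20.0]],
])
labels, outliers = cluster_with_outliers(X, eps=0.5, min_samples=5)
print("clusters found:", len(set(labels.tolist()) - {-1}))
print("outliers found:", int(outliers.sum()))
```

Because DBSCAN is density-based, the results depend strongly on `eps` and `min_samples`. Suitable values vary by dataset and usually need tuning.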
Edit config/config.json to configure your dataset:
{
"your_dataset": {
"path": "/data/your_file.csv",
"target_output": "target_column_name",
"exclude": ["target_column_name", "id_column"],
"categorical_features": [0, 1, 4, 5],
"inputing_features_gp": [3, 13, 14]
},
"automl_param": {
"max_models": 10,
"nfolds": 5,
"seed": 42,
"include_algos": ["DRF", "GLM", "DeepLearning", "GBM"]
}
}
Configuration Parameters:
- path: Path to your CSV dataset
- target_output: Name of the target/output column
- exclude: List of columns to exclude from training (should include target column)
- categorical_features: List of column indices that contain categorical data
- inputing_features_gp: List of column indices to use for genetic programming
- automl_param: H2O AutoML configuration parameters
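As a sketch of how such a configuration might be read and sanity-checked, the snippet below parses the same JSON structure with the standard library and derives the training columns from `exclude`. The helpers `validate_dataset_config` and `training_columns`, the choice of required keys, and the example column names are hypothetical. The actual loading logic in reveng may differ.

```python
import json

# Example configuration mirroring the structure shown above (values are placeholders).
CONFIG_TEXT = """
{
  "your_dataset": {
    "path": "/data/your_file.csv",
    "target_output": "target_column_name",
    "exclude": ["target_column_name", "id_column"],
    "categorical_features": [0, 1, 4, 5],
    "inputing_features_gp": [3, 13, 14]
  },
  "automl_param": {"max_models": 10, "nfolds": 5, "seed": 42,
                   "include_algos": ["DRF", "GLM", "DeepLearning", "GBM"]}
}
"""

def validate_dataset_config(config, name):
    """Return the dataset entry after basic sanity checks (hypothetical helper)."""
    dataset = config[name]
    required = {"path", "target_output", "exclude"}  # assumed minimum set of keys
    missing = required - dataset.keys()
    if missing:
        raise KeyError(f"missing keys for {name!r}: {sorted(missing)}")
    if dataset["target_output"] not in dataset["exclude"]:
        raise ValueError("'exclude' should include the target column")
    return dataset

def training_columns(all_columns, dataset):
    """Columns left for training once the excluded ones are dropped."""
    return [c for c in all_columns if c not in dataset["exclude"]]

config = json.loads(CONFIG_TEXT)
dataset = validate_dataset_config(config, "your_dataset")
# Column names here are made up for the example.
columns = training_columns(["id_column", "age", "income", "target_column_name"], dataset)
```

Checking that the target column appears in `exclude` catches a common mistake early, since leaving the target among the training features would leak the answer into the model.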
- Configure your dataset in config/config.json
- Run the application: python -m reveng
- Execute AutoML: automl
- View model performance and explanations
- Run clustering: cl
- View outlier detection results
- Analyze cluster distributions
- Run genetic programming: gp
- View evolved feature combinations
- Analyze fitness scores
- Python 3.7+
- Java 8+ (required for H2O)
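A quick way to check these prerequisites before installing is sketched below. The `check_prerequisites` helper is hypothetical and only uses the standard library. It confirms the Python version and that a `java` executable is on the `PATH`, but it does not verify that the Java version is 8 or newer.

```python
import shutil
import sys

def check_prerequisites(min_python=(3, 7)):
    """Report whether Python is recent enough and a java executable is found."""
    return {
        "python": sys.version_info >= min_python,
        "java": shutil.which("java") is not None,  # presence only, not version
    }

result = check_prerequisites()
print(result)
```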
Install the required packages:
pip install h2o pandas numpy scikit-learn deap lime matplotlib pyfiglet
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions, please open an issue on the GitHub repository.