A machine learning-based query routing system for PolarDB that optimizes query execution by intelligently routing queries between row-store and column-store formats.
- Overview
- Quick Start
- Installation
- Training Pipeline
- Model Training
- Resource Management
- Benchmarking
- Project Structure
- Advanced Configuration
## Overview
This project implements an intelligent query routing system that uses machine learning models (LightGBM, Random Forest, Decision Tree, MLP) to determine the optimal storage format (row vs. column) for query execution in PolarDB.
- PolarDB Version: PolarDB_802(linthompson_new)
- Routing Model: ensemble_model_32_features
- Supported ML Models: LightGBM, RowMLP (FANN), Decision Tree, Random Forest
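Once a model is trained, the routing decision itself is simple: featurize the query plan, score it, and send the query to the row or column engine. A minimal Python sketch, assuming a LightGBM model exported to a text file and a 32-dimensional plan-feature vector (the real router is the C++ program described below; the file name and feature extraction here are illustrative):

```python
# Illustrative routing sketch; model path and feature extraction are hypothetical.
import numpy as np
import lightgbm as lgb

booster = lgb.Booster(model_file="ensemble_model_32_features.txt")  # assumed export

def route(plan_features: np.ndarray) -> str:
    """Return 'column' if the model predicts the column store is faster."""
    assert plan_features.shape == (32,), "the router uses 32 plan features"
    p_column_faster = booster.predict(plan_features.reshape(1, -1))[0]
    return "column" if p_column_faster >= 0.5 else "row"

# Example: a dummy feature vector standing in for real optimizer statistics.
print(route(np.zeros(32)))
```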
## Quick Start
```bash
# Run PolarDB with constrained resources
bash run_polardb_constrained_resource.sh

# Start mysqld with custom configuration
./mysqld --defaults-file=/home/jitu.wy/mypolardb/db/fann_model.cnf &

# Install the FANN dependency and build the router
yum install fann-devel
make

# Train and run the router with a specific model
./router --model lightgbm
```

## Installation
Prerequisites:
- PolarDB 802
- FANN development libraries
- Python 3.x with required packages
- C++ compiler with C++11 support
Installation steps:
- Install FANN libraries:
  ```bash
  yum install fann-devel
  ```
- Clone the repository
- Build the project:
  ```bash
  make -j$(nproc)
  ```
## Training Pipeline
Create database, schema, load data, create indexes, and convert tables to columnar format.
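As an illustration of this step, a rough sketch using pymysql (the client library, table definition, and file paths are placeholders; the columnar-conversion DDL depends on your PolarDB build and is left as a comment):

```python
# Hypothetical data-preparation sketch for one TPC-H table; adapt names and paths.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="***",
                       local_infile=True)  # LOAD DATA LOCAL needs local_infile
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS tpch_sf100")
cur.execute("USE tpch_sf100")

# Schema (abridged): the real TPC-H schema defines eight tables.
cur.execute("""
    CREATE TABLE IF NOT EXISTS lineitem (
        l_orderkey BIGINT,
        l_quantity DECIMAL(15,2),
        l_shipdate DATE
    )
""")

# Bulk-load the generated .tbl/.csv data.
cur.execute("""
    LOAD DATA LOCAL INFILE '/path/to/lineitem.tbl'
    INTO TABLE lineitem FIELDS TERMINATED BY '|'
""")

# Row-store indexes.
cur.execute("CREATE INDEX idx_l_shipdate ON lineitem (l_shipdate)")

# Columnar conversion: the exact DDL is PolarDB-version specific (e.g. adding a
# columnar/IMCI index on the table); consult your PolarDB documentation.
# cur.execute("ALTER TABLE lineitem ...")

conn.commit()
conn.close()
```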
Generate column statistics:
```bash
cd zsce
# Modify 'dir' in code to point to CSV/TBL data location
python generate_column_stats.py --dataset tpch_sf100
```

Generate string statistics:
```bash
cd zsce
# Modify 'dir' in code to point to CSV/TBL data location
python generate_string_stats.py --dataset tpch_sf100
```

Generate ZSCE queries:
```bash
cd zsce
python generate_zsce_queries.py \
    --workload_dir "/home/wuy/query_costs/workloads" \
    --dataset "tpch_sf100"
```

Generate the TP workload:
```bash
cd zsce/cross_db_benchmark/benchmark_tools/
python generate_TP_workload.py \
    --data_dir "/home/wuy/query_costs" \
    --dataset "tpch_sf100"
```

Collect query costs:
```bash
python preprocessing/collect_query_costs_including_fann_model_and_hybrid.py \
    --dataset tpch_sf100 \
    --db tpch_sf100
```

For all datasets:
```bash
python preprocessing/collect_query_costs_trace_all_datasets.py
```

## Model Training
Build and train the gap-regression LightGBM router:
```bash
cd binary_classification
make -j$(nproc)
./router --mix
```

Train the row/column time regression models:
```bash
cd time_regression
make -j$(nproc)
./router --mix
```

```bash
# Train a specific model
./router --model lightgbm
# Skip training and use existing model
./router --skip_train --data_dirs=tpch_sf100,tpch_sf1,tpdcs_sf1
```

Supported models:
- `lightgbm` - LightGBM gradient boosting
- `rowmlp` - MLP implemented with FANN
- `dtree` - Decision Tree
- `forest` - Random Forest
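The two routers differ mainly in how their training targets are built. A rough sketch of the two labeling schemes, assuming "gap regression" means regressing the row-minus-column runtime difference and "time regression" means predicting each engine's runtime separately (an interpretation of the names above, not taken from the C++ code):

```python
# Hypothetical labeling sketch for the two training modes; data is random filler.
import numpy as np
import lightgbm as lgb

# Collected per-query costs (placeholders): plan features plus the measured
# row-store and column-store execution times in milliseconds.
X = np.random.rand(1000, 32)
row_ms = np.random.rand(1000) * 100
col_ms = np.random.rand(1000) * 100

# Gap regression: one model predicts row_ms - col_ms; positive => column wins.
gap_model = lgb.LGBMRegressor(n_estimators=200).fit(X, row_ms - col_ms)

# Time regression: one model per engine; route to the engine predicted faster.
row_model = lgb.LGBMRegressor(n_estimators=200).fit(X, row_ms)
col_model = lgb.LGBMRegressor(n_estimators=200).fit(X, col_ms)

q = X[:1]
route_gap = "column" if gap_model.predict(q)[0] > 0 else "row"
route_time = "column" if col_model.predict(q)[0] < row_model.predict(q)[0] else "row"
print(route_gap, route_time)
```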
Run feature analysis:
```bash
python feature_analysis.py extract tpch_sf1
python feature_analysis.py train airline --use_idx shap_idx.npy
```

Generate SHAP analysis and heatmaps:
```bash
python shap_analysis.py
```
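For reference, this is roughly what SHAP-based feature selection looks like; a sketch using the `shap` and `lightgbm` Python packages rather than the project's own scripts, with hypothetical input files:

```python
# Illustrative SHAP-based feature selection: rank the 32 plan features by mean
# |SHAP| value and save the top-k indices for use with --use_idx.
import numpy as np
import shap
import lightgbm as lgb

X, y = np.load("features.npy"), np.load("labels.npy")   # hypothetical inputs
model = lgb.LGBMClassifier(n_estimators=200).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# For binary classification, shap_values may be a list [class0, class1].
sv = shap_values[1] if isinstance(shap_values, list) else shap_values

importance = np.abs(sv).mean(axis=0)         # mean |SHAP| per feature
top_k = np.argsort(importance)[::-1][:16]    # keep the 16 most important
np.save("shap_idx.npy", top_k)
```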
## Resource Management
Enable resource control and create separate resource groups for TP and AP workloads:
```sql
SET GLOBAL enable_resource_control = ON;

-- Create resource control for TP (transactional) workloads
CREATE POLAR_RESOURCE_CONTROL rc_tp MAX_CPU 30;

-- Create resource control for AP (analytical) workloads
CREATE POLAR_RESOURCE_CONTROL rc_ap MAX_CPU 60;

-- Create users
CREATE USER tp_user IDENTIFIED BY '***';
CREATE USER ap_user IDENTIFIED BY '***';
-- Assign resource controls to users
SET POLAR_RESOURCE_CONTROL rc_tp FOR USER tp_user;
SET POLAR_RESOURCE_CONTROL rc_ap FOR USER ap_user;
```
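One way these per-user resource controls can be combined with the router (an illustration only; the repository may wire this up differently) is to send row-routed queries through `tp_user` and column-routed queries through `ap_user`:

```python
# Hypothetical: execute a query under the resource group matching its route.
import pymysql

def run_with_resource_control(sql: str, route: str):
    user = "tp_user" if route == "row" else "ap_user"     # rc_tp vs rc_ap
    conn = pymysql.connect(host="127.0.0.1", user=user, password="***",
                           database="tpch_sf100")
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()

# Example: an analytical query routed to the column store runs as ap_user.
print(run_with_resource_control("SELECT COUNT(*) FROM lineitem", route="column"))
```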
To constrain mysqld CPU at the OS level, use cgroups:
```bash
# Create cgroup
cgcreate -g cpu,memory:mysqld_grp
# Limit to 10% of a CPU core (10ms out of every 100ms)
echo 100000 | sudo tee /sys/fs/cgroup/cpu/mysqld_grp/cpu.cfs_period_us
echo 10000 | sudo tee /sys/fs/cgroup/cpu/mysqld_grp/cpu.cfs_quota_us
# Move mysqld process to the cgroup (replace 73474 with actual PID)
cgclassify -g cpu,memory:mysqld_grp 73474
```

## Benchmarking
Run combined benchmarks:
```bash
python combined_benchmark.py \
--mysqld_pid <pid> \
--rounds 1 \
-n 1000 \
--warmup_queries 100
```

## Project Structure
Binary classification router location: `binary_classification/`
Key files:
- `main.cpp` - Entry point
- `lightgbm_model.cpp/h` - LightGBM implementation
- `decision_tree_model.cpp` - Decision tree implementation
- `fannmlp_model.cpp` - FANN MLP implementation
- `common.cpp/hpp` - Common utilities
- `shap_util.hpp` - SHAP value utilities
Time regression router location: `time_regression/`
Contains a similar structure to the binary classification router, but optimized for regression tasks.

Utilities:
- `cpu_meter.cpp/h` - CPU usage monitoring
- `global_stats.cpp/hpp` - Global statistics management
- `thresholds.hpp` - Threshold configurations
- `vib.hpp` - Variational Information Bottleneck utilities
## Advanced Configuration
To train on multiple datasets:
```bash
./router --skip_train --data_dirs=dataset1,dataset2,dataset3
```

Use SHAP-based feature selection:
```bash
python feature_analysis.py train <dataset> --use_idx shap_idx.npy
```

Monitor CPU usage during training and inference using the built-in CPU meter functionality.
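The built-in meter is the C++ `cpu_meter.cpp/h` component listed above. As an external cross-check, a small Python sketch with `psutil` (an assumption, not part of the repository) that periodically samples a process's CPU and memory usage:

```python
# Illustrative external CPU/memory monitor; not the project's cpu_meter.
import time
import psutil

def monitor(pid: int, interval: float = 1.0, duration: float = 10.0) -> None:
    proc = psutil.Process(pid)
    proc.cpu_percent(None)                 # prime the counter
    end = time.time() + duration
    while time.time() < end:
        time.sleep(interval)
        cpu = proc.cpu_percent(None)       # % of one core since the last call
        rss = proc.memory_info().rss / 2**20
        print(f"pid={pid} cpu={cpu:.1f}% rss={rss:.1f} MiB")

if __name__ == "__main__":
    monitor(pid=73474)                     # e.g. the mysqld PID used above
```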