A curated paper list of awesome AI4DB theory, frameworks, resources, tools and other awesomeness, for data engineers.
The repository is under construction. New PRs are welcome; please follow this entry format:
paperName (with PDF link) [MeetingName Year] GitHub link if the code is open-sourced (optional)
Thanks to all authors of the papers and repositories I cite :D
- AI4DB Paper Sets
 
- Leveraging Query Logs and Machine Learning for Parametric Query Optimization [VLDB 22]
 - LEON: A New Framework for ML-Aided Query Optimization [VLDB 23]
 - LOGER: A Learned Optimizer towards Generating Efficient and Robust Query Execution Plans [VLDB 23]
 - Kepler: Robust Learning for Parametric Query Optimization [SIGMOD 23]
 - FOSS: A Self-Learned Doctor for Query Optimizer [ICDE 23]
 - Eraser: Eliminating Performance Regression on Learned Query Optimizer [VLDB 24]
 - AutoSteer: Learned Query Optimization for Any SQL Database [VLDB 24]
 - Modeling Shifting Workloads for Learned Database Systems [SIGMOD 24]
 - Stage: Query Execution Time Prediction in Amazon Redshift [SIGMOD 24]
 - Roq: Robust Query Optimization Based on a Risk-aware Learned Cost Model [arXiv 24]
 - RobOpt: A Tool for Robust Workload Optimization Based on Uncertainty-Aware Machine Learning [SIGMOD Demo 24]
 - Towards Exploratory Query Optimization for Template-based SQL Workloads [ICDE 24]
 - RankPQO: Learning-to-Rank for Parametric Query Optimization [VLDB 25]
 - LEAP: A Low-cost Spark SQL Query Optimizer using Pairwise Comparison [VLDB 25]
 - Athena: An Effective Learning-based Framework for Query Optimizer Performance Improvement [SIGMOD 25]
 - PAR2QO: Parametric Penalty-Aware Robust Query Optimization [VLDB 25]
 
- DSB: a decision support benchmark for workload-driven and traditional database systems [VLDB 21]
 - Expand your training limits! generating training data for ml-based data management [VLDB 21]
 - LearnedSQLGen: Constraint-aware SQL Generation using Reinforcement Learning [SIGMOD 22]
 - Hit the Gym: Accelerating Query Execution to Efficiently Bootstrap Behavior Models for Self-Driving Database Management Systems [VLDB 24] 
 - Artemis: A Customizable Workload Generation Toolkit for Benchmarking Cardinality Estimation [ICDE Demo 25]
 - Towards a Unified Query Plan Representation [ICDE 25]
 - SQLStorm: Taking Database Benchmarking into the LLM Era [VLDB 25]
 - The Accuracy of Cardinality Estimators: Unraveling the Evaluation Result Conundrum [VLDB 25]
 
- Machine Unlearning in Learned Databases: An Experimental Analysis [SIGMOD 24]
 - NeurBench: Benchmarking Learned Database Components with Data and Workload Drift Modeling [arXiv 25]
 - Learned Query Optimizer: What is New and What is Next [SIGMOD 24]
 
- Cardinality Estimation: An Experimental Survey [VLDB 17]
 - Are We Ready For Learned Cardinality Estimation? [VLDB 21]
 - Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation [VLDB 21]
 - Learned cardinality estimation: A design space exploration and a comparative evaluation [VLDB 22]
 - Learned Cardinality Estimation: An In-depth Study [SIGMOD 22] 
 
- Selectivity estimation for range predicates using lightweight models [VLDB 19]
 - Deep learning models for selectivity estimation of multiattribute queries [SIGMOD 20]
 
- Towards a Learning Optimizer for Shared Clouds [VLDB 18]
 - Learned Cardinalities: Estimating Correlated Joins with Deep Learning [CIDR 2019]
 - An End-to-End Learning-based Cost Estimator [VLDB 19] 
 - Flow-Loss: Learning Cardinality Estimates That Matter [VLDB 21]
 - Warper: Efficiently Adapting Learned Cardinality Estimators to Data and Workload Drifts [SIGMOD 22]
 - Selectivity Functions of Range Queries are Learnable [SIGMOD 22]
 - Speeding Up End-to-end Query Execution via Learning-based Progressive Cardinality Estimation [SIGMOD 23] 
 - Robust Query Driven Cardinality Estimation under Changing Workloads [VLDB 23]
 - Enhanced Featurization of Queries with Mixed Combinations of Predicates for ML-based Cardinality Estimation [EDBT 23]
 - AutoCE: An Accurate and Efficient Model Advisor for Learned Cardinality Estimation [ICDE 23]
 - Asm: Harmonizing autoregressive model, sampling, and multi-dimensional statistics merging for cardinality estimation [SIGMOD 24]
 - Adding Domain Knowledge to Query-Driven Learned Databases [arXiv 24]
 - Sample-Efficient Cardinality Estimation Using Geometric Deep Learning [VLDB 24]
 - A Practical Theory of Generalization in Selectivity Learning [VLDB 25]
 
- Self-tuning, gpu-accelerated kernel density models for multidimensional selectivity estimation [SIGMOD 15]
 - Deep Unsupervised Cardinality Estimation [VLDB 19] 
 - Quicksel: Quick selectivity learning with mixture models [SIGMOD 20]  
 - Pre-training Summarization Models of Structured Datasets for Cardinality Estimation [VLDB 22] 
 
- DeepDB: Learn from Data, not from Queries! [VLDB 20] 
 - NeuroCard: One Cardinality Estimator for All Tables [VLDB 21]  
 - FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation [VLDB 21]
 - BayesCard: Revitilizing Bayesian Frameworks for Cardinality Estimation [arXiv 21]
 - Glue: Adaptively Merging Single Table Cardinality to Estimate Join Query Size [arXiv 21]
 - Fauce: fast and accurate deep ensembles with uncertainty for cardinality estimation [VLDB 21]
 - FACE: a normalizing flow based cardinality estimator [VLDB 22]  
 - FactorJoin: A New Cardinality Estimation Framework for Join Queries [SIGMOD 22]
 - Cardinality estimation using normalizing flow [VLDBJ 23]
 - CEDA: Learned Cardinality Estimation with Domain Adaptation [VLDB 23]
 - LPLM: A Neural Language Model for Cardinality Estimation of LIKE-Queries [SIGMOD 24]
 - Cardinality Estimation of LIKE Predicate Queries using Deep Learning [SIGMOD 25]
 - Grid-AR: A Grid–based Booster for Learned Cardinality Estimation and Range Joins [arXiv 25]
 - Updateable Data-Driven Cardinality Estimator with Bounded Q-error [arXiv 25]
 - SPACE: Cardinality Estimation for Path Queries Using Cardinality-Aware Sequence-based Learning [SIGMOD 25]
 - Data-Agnostic Cardinality Learning from Imperfect Workloads [VLDB 25]
 - A Lightweight Learned Cardinality Estimation Model [TKDE 25]
 
- A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation [SIGMOD 21]  
 - ALECE: An Attention-based Learned Cardinality Estimator for SPJ Queries on Dynamic Workloads [VLDB 23] 
 - A Unified Model for Cardinality Estimation by Learning from Data and Queries via Sum-Product Networks [arXiv 25]  
 
- PRICE: A Pretrained Model for Cross-Database Cardinality Estimation [VLDB 25]
 - PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre-trained Language Models [SIGMOD 25]
 - ZeroCard: Cardinality Estimation with Zero Dependence on Target Databases - No Data, No Query, No Retraining [arXiv 25]
 
- Bao: Making Learned Query Optimization Practical [SIGMOD 21]
 - FASTgres: Making Learned Query Optimizer Hinting Effective [VLDB 23]
 - COOOL: A Learning-To-Rank Approach for SQL Hint Recommendations [VLDB 23] 
 
- Efficient Deep Learning Pipelines for Accurate Cost Estimations Over Large Scale Query Workload [SIGMOD 21]
 - A Resource-Aware Deep Cost Model for Big Data Query Processing [ICDE 22]
 - Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction [VLDB 22]
 - Cost-based or Learning-based? A Hybrid Query Optimizer for Query Plan Selection [VLDB 22]  
 - Lero: A Learning-to-Rank Query Optimizer [VLDB 23]
 - Lero: applying learning-to-rank in query optimizer [VLDBJ 24]
 - DACE: A Database-Agnostic Cost Estimator [ICDE 24]
 
- How Good are Learned Cost Models, Really? Insights from Query Optimization Tasks [SIGMOD 25]
 - Learned Cost Models for Query Optimization: From Batch to Streaming Systems [VLDB 25 Tutorial]
 
- PreQR: Pre-training Representation for SQL Understanding [SIGMOD 22]
 - QueryFormer: A Tree Transformer Model for Query Plan Representation [VLDB 22]
 - A Comparative Study and Component Analysis of Query Plan Representation Techniques in ML4DB Studies [VLDB 24]
 
- Learning to Optimize Join queries With Deep Reinforcement Learning [SIGMOD 16]
 - Deep Reinforcement Learning for Join Order Enumeration [arXiv 18]
 - Reinforcement Learning with Tree-LSTM for Join Order Selection [ICDE 20]
 
- A Learned Query Rewrite System using Monte Carlo Tree Search [VLDB 22] 
 - LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency [VLDB 25]
 
- Modeling Shifting Workloads for Learned Database Systems [SIGMOD 24]
 - Db2une: Tuning Under Pressure via Deep Learning [VLDB 24]
 - PilotScope: Steering Databases with Machine Learning Drivers [VLDB 24]
 
- Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation [VLDB 23]
 - The Dawn of Natural Language to SQL: Are We Fully Ready? [VLDB 24]
 - Demonstrating SQLBarber: Leveraging Large Language Models to Generate Customized, Constraint-aware, and Realistic SQL [SIGMOD Demo 25]
 - Cracking SQL Barriers: An LLM-based Dialect Translation System [SIGMOD 25]
 - Natural language to sql: State of the art and open problems [VLDB 25]
 
- A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? [TKDE 25]
 - Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration [arXiv 25]
 
- MemQ: A Graph-Based Query Memory Prediction Framework for Effective Workload Scheduling [ICDE 25]
 - Towards Automatic and Efficient Prediction Query Processing in Analytical Database [ICDE 25]
 - LORE: Learning-based Resource Recommendation for Big Data Queries [ICDE 25]
 - BQSched: A Non-intrusive Scheduler for Batch Concurrent Queries via Reinforcement Learning [ICDE 25]
 
- Cosine: A Cloud-Cost Optimized Self-Designing Key-Value Storage Engine [VLDB 22]
 - TreeLine: An Update-In-Place Key-Value Store for Modern Storage [VLDB 22] 
 - Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads [SIGMOD 23]
 - Limousine: Blending Learned and Classical Indexes to Self-Design Larger-than-Memory Cloud Storage Engines [SIGMOD 24]
 - CAMAL: Optimizing LSM-trees via Active Learning [SIGMOD 25]
 - NEXT: A New Secondary Index Framework for LSM-based Data Storage [SIGMOD 25]
 
- The Case for Learned Index Structures [SIGMOD 18]
 - FITing-Tree: A Data-aware Index Structure [SIGMOD 19]
 - ALEX: An Updatable Adaptive Learned Index [arXiv 20]
 - The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds [VLDB 20]
 - RadixSpline: a single-pass learned index [aiDM 20]
 - Why Are Learned Indexes So Effective? [ICML 20]
 - A Pluggable Learned Index Method via Sampling and Gap Insertion [arXiv 21]
 - Updatable Learned Index with Precise Positions [VLDB 21]
 - The next 50 years in database indexing or: the case for automatically generated index structures [VLDB 21]
 - Tuning Hierarchical Learned Indexes on Disk and Beyond [SIGMOD 22]
 - APEX: A High-Performance Learned Index on Persistent Memory [VLDB 22]
 - FINEdex: A Fine-grained Learned Index Scheme for Scalable and Concurrent Memory Systems [VLDB 22]
 - Are Updatable Learned Indexes Ready? [VLDB 22]
 - CARMI: A Cache-Aware Learned Index with a Cost-based Construction Algorithm [VLDB 22]
 - NFL: Robust Learned Index via Distribution Transformation [VLDB 22]
 - Cutting Learned Index into Pieces: An In-depth Inquiry into Updatable Learned Indexes [ICDE 23]
 
- Learning Multi-dimensional Indexes [SIGMOD 20]
 - LISA: A Learned Index Structure for Spatial Data [SIGMOD 20]
 - Effectively Learning Spatial Indices [VLDB 20]
 - The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries [EDBT 20]
 - Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads [VLDB 21]
 - NEIST: a Neural-Enhanced Index for Spatio-Temporal Queries [TKDE 21]
 - RW-Tree: A Learned Workload-aware Framework for R-tree Construction [ICDE 22]
 - A New Paradigm in Tuning Learned Indexes: A Reinforcement Learning-Enhanced Approach [SIGMOD 25]
 
- The Case for Automatic Database Administration using Deep Reinforcement Learning [arXiv 18]
 - AI Meets AI: Leveraging Query Executions to Improve Index Recommendations [SIGMOD 19]
 - Online Index Selection Using Deep Reinforcement Learning for a Cluster Database [ICDEW 20]
 - SMARTIX: A database indexing agent based on reinforcement learning [Applied Intelligence 20]
 - Magic mirror in my hand, which is the best in the land? An Experimental Evaluation of Index Selection Algorithms [VLDB 20]  
 - An Index Advisor Using Deep Reinforcement Learning [CIKM 20]
 - Automated Database Indexing Using Model-Free Reinforcement Learning [ICAPS 20]
 - DBA bandits: Self-driving index tuning under ad-hoc, analytical workloads with safety guarantees [ICDE 21]
 - Index selection for NoSQL database with deep reinforcement learning [Information Sciences 21]
 - MANTIS: Multiple Type and Attribute Index Selection using Deep Reinforcement Learning [IDEAS 21]
 - AutoIndex: An Incremental Index Management System for Dynamic Workloads [ICDE 22]
 - SWIRL: Selection of Workload-aware Indexes using Reinforcement Learning [EDBT 22]
 - Indexer++: Workload-Aware Online Index Tuning with Transformers and Reinforcement Learning [SAC 22]
 - Budget-aware Index Tuning with Reinforcement Learning [SIGMOD 22]
 - ISUM: Efficiently Compressing Large and Complex Workloads for Scalable Index Tuning [SIGMOD 22]
 - DISTILL: low-overhead data-driven techniques for filtering and costing indexes for scalable index tuning [VLDB 22]
 - HMAB: Self-Driving Hierarchy of Bandits for Integrated Physical Database Design Tuning [VLDB 22]  
 - SmartIndex: An Index Advisor with Learned Cost Estimator [CIKM 22]
 - Learned Index Benefits: Machine Learning Based Index Performance Estimation [VLDB 23] 
 - No DBA? No Regret! Multi-Armed Bandits for Index Tuning of Analytical and HTAP Workloads With Provable Guarantees [TKDE 23]
 - IA2: Leveraging Instance-Aware Index Advisor with Reinforcement Learning for Diverse Workloads [EuroMLSys 24]
 - Leveraging Dynamic and Heterogeneous Workload Knowledge to Boost the Performance of Index Advisors [PVLDB 24]
 - Refactoring Index Tuning Process with Benefit Estimation [PVLDB 24]
 - Breaking It Down: An In-Depth Study of Index Advisors [PVLDB 24]
 - TRAP: Tailored Robustness Assessment for Index Advisors via Adversarial Perturbation [ICDE 24]
 - Automatic Database Index Tuning: A Survey [TKDE 24]
 - Robustness of Updatable Learning-based Index Advisors against Poisoning Attack [SIGMOD 24]
 - Wii: Dynamic Budget Reallocation In Index Tuning [SIGMOD 24]
 - Wred: Workload Reduction for Scalable Index Tuning [SIGMOD 24]
 - ML-Powered Index Tuning: An Overview of Recent Progress and Open Challenges [SIGMOD 24]
 - Esc: An Early-Stopping Checker for Budget-aware Index Tuning [VLDB 25]
 
- Automatic Database Management System Tuning Through Large-scale Machine Learning [SIGMOD 17]
 - Deploying a Steered Query Optimizer in Production at Microsoft [SIGMOD 22]
 - Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data [SIGMOD 23]
 - AutoSteer: Learned Query Optimization for Any SQL Database [SIGMOD 23]
 - Auto-WLM: Machine Learning Enhanced Workload Management in Amazon Redshift [SIGMOD 23]
 - GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization [VLDB 24]
 - LLMTune: Accelerate Database Knob Tuning with Large Language Models [VLDB 24]
 - 𝜆-Tune: Harnessing Large Language Models for Automated Database System Tuning [SIGMOD 25]
 - Rabbit: Retrieval-Augmented Generation Enables Better Automatic Database Knob Tuning [ICDE 25]
 - Autotuning Systems: Techniques, Challenges, and Opportunities [SIGMOD 25 Tutorial]
 - E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model [VLDB 25]
 - A-Tune-Online: Efficient and QoS-aware Online Configuration Tuning for Dynamic Workloads [ICDE 25]
 
- Is Large Language Model Good at Database Knob Tuning? A Comprehensive Experimental Evaluation [arXiv 24]
 
- D-Bot: Database Diagnosis System using Large Language Models [VLDB 24]
 - Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG [arXiv 25]
 - Automatic Database Configuration Debugging using Retrieval-Augmented Language Models [SIGMOD 25]
 - Andromeda: Debugging Database Performance Issues with Retrieval-Augmented Large Language Models [SIGMOD Demo 25]
 - Demonstrating SQLBarber: Leveraging Large Language Models to Generate Customized and Realistic SQL Workloads [SIGMOD Demo 25]
 
- ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models [VLDB 24]
 - A Survey on Large Language Models for Code Generation [arXiv 24]
 - Fuzz4All: Universal Fuzzing with Large Language Models [ICSE 24]
 - LLM-PBE: Assessing Data Privacy in Large Language Models [VLDB 24]
 - Are Large Language Models a Good Replacement of Taxonomies? [VLDB 24]
 - A survey on augmenting knowledge graphs (KGs) with large language models (LLMs): models, evaluation metrics, benchmarks, and challenges [Discover Artificial Intelligence 24]
 - Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models [VLDB 25]
 - Large Language Model-Based Agents for Software Engineering: A Survey [arXiv 25]
 - Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation [arXiv 25]
 - Data+AI: LLM4Data and Data4LLM [SIGMOD 25 Tutorial]
 - A Survey of LLM × DATA [arXiv 25]