LiveSQLBench

News

  • πŸš€ [2025-10-23] Docker update: We added Docker support for the Full DB environment and pushed three Docker images (the Base-Lite and Full DB environments plus the evaluation environment) to Docker Hub to simplify setup. There is no need to download the DB dumps and build the images manually!

  • πŸ”₯πŸ”₯πŸ”₯ [2025-09-04] We are pleased to release LiveSQLBench-Base-Full v1, a new release with 600 NEW tasks over 22 NEW real, complex databases with KB docs. New features: more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values. See the dataset and leaderboard for details.

  • πŸ“’ [2025-07-28] We are pleased to release LiveSQLBench-Base-Lite-SQLite, which extends the benchmark from the PostgreSQL dialect to SQLite to improve accessibility. Please see the BIRD-Mini-Dev v2 repository and dataset for more details.

  • πŸ“’ [2025-05-30] We are pleased to release LiveSQLBench-Base-Lite, featuring 18 end-user level databases and 270 tasks (180 SELECT-only, 90 Management tasks). Each task pairs an unambiguous, straightforward user query grounded in external knowledge with a medium to hard complexity SQL statement.

πŸš€ Overview

LiveSQLBench (BIRD-SQL Pro v0.5) is a contamination-free, continuously evolving benchmark designed to evaluate LLMs on complex, real-world text-to-SQL tasks, covering diverse user queries such as Business Intelligence (BI), CRUD operations, and more. Each release will include 50 new, fully open-source DBs curated by the BIRD team through expert collaboration and continuous improvement. It covers a wide range of database sizes, from end-user level (around 127 columns) to industrial level (1,340+ columns).

Here are the features of the LiveSQLBench benchmark:

  1. πŸ—„οΈ Live Databases: Constructed dynamically from extensive and regularly updated CSV datasets, with both base (user-end level) and large (industrial level) versions (1340+ columns each DB) to test scalability.

  2. πŸ’¬ Live User Queries and SQL: Each task pairs an unambiguous user query with an annotated, gold-standard SQL statement. The queries are grounded in an external knowledge base, and the solution SQL statements range from medium to hard complexity.

  3. 🧠 Contextual Reasoning (HKB): Every DB includes a hierarchical knowledge base (HKB) in which knowledge items may depend on one another, requiring multi-hop reasoning. Two HKB formats are provided: (1) a structured JSON format and (2) an unstructured document format. (An illustrative HKB-JSON snippet is sketched after this list.)

  4. πŸ” The First Full SQL Spectrum: Supports not just SELECT (Business Intelligence) queries, but also CRUD (e.g., UPDATE, CREATE, and other database management operations) queries.

  5. ⚑ Automated Evaluation: Supports fast evaluation via PostgreSQL templates and Docker. Each question includes verifiable test cases for accurate, reproducible scoring. A soft EX metric is used to evaluate SELECT-only tasks, while customized test cases are designed for DBA tasks such as CRUD (CREATE, READ, UPDATE, DELETE).

  6. πŸ”„ Truly Live & Hidden Test: New databases and tasks are added over time. Each release features both open development and hidden test phases. The hidden test set from each release becomes the open development set for the next release, ensuring continuous evolution and fair evaluation.
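
To make the HKB structure in feature 3 concrete, here is a hypothetical HKB-JSON fragment, written as Python dictionaries. The field names (knowledge_id, name, description, depends_on, formula) and the content are illustrative assumptions rather than entries from the released knowledge bases; the second entry depends on the first, which is the kind of chain that forces multi-hop reasoning.

# Hypothetical HKB entries (illustrative only; the real HKB field names may differ).
hkb_entries = [
    {
        "knowledge_id": 0,
        "name": "Record Completeness Ratio",
        "description": "Share of non-null metadata fields in a site record.",
        "depends_on": [],
        "formula": "completeness = non_null_fields / total_fields",
    },
    {
        "knowledge_id": 1,
        "name": "Digital Preservation Quality (DPQ)",
        "description": "Composite index combining completeness and scan resolution.",
        "depends_on": [0],  # knowledge 0 must be resolved before this one
        "formula": "dpq = 0.6 * completeness + 0.4 * normalized_resolution",
    },
]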

🎯 Current Release: LiveSQLBench-Base-Lite and LiveSQLBench-Base-Full

We have released LiveSQLBench-Base-Lite and LiveSQLBench-Base-Full, featuring:

  • New end-user level databases with complex relationships and noisy schemas and data.
  • New tasks:
    • covering both SELECT-only and Management tasks;
    • the Lite version has straightforward queries, while the Full version has more natural, reasoning-intensive queries;
    • both versions are free of ambiguity;
    • all grounded in external knowledge, with medium to hard complexity SQL statements.
  • HKB for multi-hop retrieval and reasoning.

πŸ“¦ Dataset Details

Dataset Description

  • Database: The databases can be downloaded from livesqlbench-base-lite and livesqlbench-base-full.
  • Data Fields (an illustrative record is sketched after this list):
    • instance_id: Unique task identifier
    • selected_database: Associated database name
    • query: User query
    • sol_sql: Ground truth SQL solution
    • external_knowledge: IDs of required external knowledge
    • preprocess_sql: SQL setup queries
    • clean_up_sql: SQL queries to reset database state
    • test_cases: Test cases to validate the predicted SQL
    • category: "Query" (SELECT-only) or "Management" (CRUD)
    • high_level: Boolean for high-level description
    • conditions: Indicates decimal/distinct conditions
    • difficulty_tier: Task difficulty (Simple, Moderate, Challenging)
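
As an illustration of these fields, a single task record might look roughly like the following Python dictionary. All values, and the exact shapes of the list/dict fields, are invented for illustration; the hidden fields (sol_sql, external_knowledge, test_cases) are shown only to indicate where they would appear.

# Hypothetical task record (values and field shapes are illustrative guesses).
example_task = {
    "instance_id": "archeology_site_dpq_1",
    "selected_database": "archeology",
    "query": "Rank our archaeological sites by Digital Preservation Quality ...",
    "sol_sql": ["SELECT site_code, site_name, ROUND(dpq, 2) FROM ... ORDER BY dpq DESC"],
    "external_knowledge": [0, 1],
    "preprocess_sql": [],
    "clean_up_sql": [],
    "test_cases": ["<verifiable test case validating the predicted SQL>"],
    "category": "Query",        # SELECT-only task; "Management" for CRUD tasks
    "high_level": True,
    "conditions": {"decimal": 2, "distinct": False},
    "difficulty_tier": "Moderate",
}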

Data viewer: Explore the data through the data viewer on our website, livesqlbench.ai.

πŸ” To avoid data leakage by auto-crawling, certain fields (e.g., sol_sql, test_cases, external_knowledge) are excluded from the public dataset livesqlbench_data.jsonl. For the full dataset, please email: πŸ“§ [email protected] with subject tag [livesqlbench-base-lite GT&Test Cases] for Lite version or [livesqlbench-base-full-v1 GT&Test Cases] for Full version, which will be sent automatically.

πŸ’¨ Quick Eval (LiveSQLBench-Base-Lite)

Prepare the Dataset

Download the dataset containing each DB's HKB, column meanings, and schema, together with the livesqlbench_data.jsonl file:

cd livesqlbench
git clone https://huggingface.co/datasets/birdsql/livesqlbench-base-lite

Integrate the data file containing the annotated fields (obtained from the email) with the livesqlbench_data.jsonl file by running the following command:

python integrate_gt_data.py --gt_file <path_to_gt_file> 
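
Conceptually, this step merges the ground-truth fields from the emailed file back into the public records by matching instance_id. The sketch below shows that idea only; it is not the actual integrate_gt_data.py, and the file name gt_data.jsonl as well as the assumption that both files are JSONL are illustrative.

import json

# Merge ground-truth fields (e.g., sol_sql, test_cases, external_knowledge)
# into the public records, matching on instance_id.
def merge_gt(public_path="livesqlbench_data.jsonl",
             gt_path="gt_data.jsonl",                  # hypothetical name for the emailed file
             out_path="livesqlbench_data_merged.jsonl"):
    gt = {}
    with open(gt_path) as f:
        for line in f:
            record = json.loads(line)
            gt[record["instance_id"]] = record
    with open(public_path) as f, open(out_path, "w") as out:
        for line in f:
            record = json.loads(line)
            record.update(gt.get(record["instance_id"], {}))
            out.write(json.dumps(record) + "\n")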

Generation Environment Setup

To run the baseline code to generate LLM outputs, you need to install the following dependencies:

conda create -n livesqlbench python=3.10 -y
conda activate livesqlbench
pip install -r requirements.txt

Generation

You also need to set the model name (e.g., gpt-4o-2024-11-20) and the API key in the config.py file. Then you can run the following commands to generate the output:

# Generate the prompt
cd baseline/run
bash generate_prompt.sh

# LLM Inference, need to set the API key in config.py
bash run_baseline.sh

The output will be saved in ./evaluation/outputs/final_output/.
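
For orientation, a config.py for a pipeline like this typically holds the model identifier and API credentials. The variable names below are hypothetical assumptions; check the repository's actual config.py for the real field names.

# Hypothetical config.py layout (names are illustrative, not the repo's actual fields).
MODEL_NAME = "gpt-4o-2024-11-20"            # model used for generation
API_KEY = "sk-..."                          # your provider API key
BASE_URL = "https://api.openai.com/v1"      # endpoint, if your provider needs one
TEMPERATURE = 0.0                           # deterministic decoding for reproducibility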

Evaluation Environment Setup

We use Docker to provide a consistent environment for running the benchmark. To set up the environment, follow these steps:

  1. Start the services with Docker Compose using the pre-built images

    cd evaluation
    docker compose pull
    docker compose up 

    This contains three containers:

    • livesqlbench_postgresql: the PostgreSQL database for the base-lite DB Environment
    • livesqlbench_postgresql_base_full: the PostgreSQL database for the base-full DB Environment
    • livesqlbench_so_eval_env: the environment for the evaluation

    Just comment out the postgresql_base_full service in the docker-compose.yml file if you only want to evaluate the base-lite version.

  2. (Optional) Build the docker images from scratch

    • Download the database dumps
    • Build the environment manually by running docker-compose.build.yml:
      cd evaluation
      docker compose -f docker-compose.build.yml build
      docker compose -f docker-compose.build.yml up -d
  3. (Recommended) Check that the database containers are built and running successfully.

  • Print the container build logs to ensure that the databases are built successfully without errors:

    docker logs livesqlbench_postgresql > build_livesqlbench_postgresql.log 2>&1
    docker logs livesqlbench_postgresql_base_full > build_livesqlbench_postgresql_base_full.log 2>&1

    If errors occur, "Errors occurred during import:" will be printed in the log files.

  • Check that the database containers are in good shape.

    Use our provided Python script to verify database metadata (a rough sketch of this kind of check appears after the expected results below):

    docker compose exec so_eval_env bash
    python check_db_metadata.py --host livesqlbench_postgresql
    python check_db_metadata.py --host livesqlbench_postgresql_base_full

    Expected results:

    • livesqlbench-base-lite:
      • πŸ“ˆ Total Databases: 18
      • πŸ“‹ Total Tables: 175
      • πŸ”’ Total Columns: 2286
      • πŸ“ˆ Avg Rows per Table: 1,038.48
      • πŸ’Ύ Total Size: around 207.15 MB
    • livesqlbench-base-full:
      • πŸ“ˆ Total Databases: 22
      • πŸ“‹ Total Tables: 244
      • πŸ”’ Total Columns: 2011
      • πŸ“ˆ Avg Rows per Table: 1,121.19
      • πŸ’Ύ Total Size: around 272.00 MB
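
If you prefer to sanity-check a container without the provided script, similar counts can be gathered directly from PostgreSQL's catalogs. The snippet below is a rough, independent sketch (it is not check_db_metadata.py); the psycopg2 dependency and the default postgres credentials are assumptions, so adjust them to your setup.

import psycopg2

# Count user databases, tables, and columns on a LiveSQLBench DB container.
def summarize(host="livesqlbench_postgresql", user="postgres", password="postgres"):
    conn = psycopg2.connect(host=host, user=user, password=password, dbname="postgres")
    with conn.cursor() as cur:
        cur.execute("SELECT datname FROM pg_database "
                    "WHERE NOT datistemplate AND datname <> 'postgres'")
        dbs = [row[0] for row in cur.fetchall()]
    conn.close()
    tables = columns = 0
    for db in dbs:
        c = psycopg2.connect(host=host, user=user, password=password, dbname=db)
        with c.cursor() as cur:
            cur.execute("SELECT count(*) FROM information_schema.tables "
                        "WHERE table_schema = 'public'")
            tables += cur.fetchone()[0]
            cur.execute("SELECT count(*) FROM information_schema.columns "
                        "WHERE table_schema = 'public'")
            columns += cur.fetchone()[0]
        c.close()
    print(f"Databases: {len(dbs)}, Tables: {tables}, Columns: {columns}")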

Evaluation Run

Exec into the livesqlbench_so_eval_env container and start the evaluation:

docker compose exec so_eval_env bash
cd run
bash run_eval.sh 

The output will be saved in ./evaluation/outputs/final_output/. If you want a log file for each instance, set --logging to true in the run_eval.sh script.

πŸ’¨ Quick Eval (LiveSQLBench-Base-Full)

Similar to the above, but with the following changes:

  1. Change the DB host name in the evaluation/src/db_utils.py file from livesqlbench_postgresql to livesqlbench_postgresql_base_full (illustrated in the sketch after this list).
  2. Change the database metadata files and livesqlbench_data.jsonl to the Full version from Hugging Face.
  3. Use the newly generated prompts for the Full version to get the LLM outputs.
  4. Use the new evaluation environment with the modified DB host to evaluate the LLM outputs.
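
For step 1, the change is just pointing the evaluation code at the other container's hostname, roughly as below; DB_HOST is a hypothetical variable name, so check the actual assignment in evaluation/src/db_utils.py.

# In evaluation/src/db_utils.py (hypothetical variable name):
DB_HOST = "livesqlbench_postgresql_base_full"   # was "livesqlbench_postgresql"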

πŸ“Š Model Performance on LiveSQLBench

LiveSQLBench-Base-Lite (2025-05-28)

Rank | Model | Success Rate (%) | Avg. Cost (USD) / Task
πŸ₯‡ 1 | o3-mini | 47.78 | 0.0233
πŸ₯ˆ 2 | GPT-4.1 | 44.10 | 0.0336
πŸ₯‰ 3 | Claude Sonnet 4 | 42.59 | 0.0623

More results can be found on the project leaderboard.

πŸ”„ Upcoming Releases

  • πŸ”„ LiveSQLBench-Base-Lite: 18 NEW databases and 270 NEW tasks with straightforward, direct queries.
  • πŸ”„ LiveSQLBench-Base-Full: 22 NEW databases and 600 NEW tasks with more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values.
  • πŸ”„ LiveSQLBench-Large-Lite: Industrial-scale databases with 1340+ columns
  • πŸ”„ LiveSQLBench-Large-Full: Comprehensive large-scale datasets

πŸ“Š Feature Comparison

Feature comparison across LiveSQLBench-Base-Lite, LiveSQLBench-Base-Full, and LiveSQLBench-Large-Full:

User Tasks
  • Base-Lite: 270 tasks; clear, direct queries with explicit DB/HKB connections (Example1*)
  • Base-Full: 600 tasks; natural, colloquial queries with implicit DB/HKB connections (Example2*)
  • Large-Full: 600 tasks; natural, colloquial queries with implicit DB/HKB connections (Example2*), but over large, industrial-scale DBs

Database
  • Base-Lite: 18 base databases; ~127 columns per DB; simple 1:1 relationships; clean data (no nulls, consistent formats)
  • Base-Full: 22 base databases; ~127 columns per DB; complex relationships (1:1, 1:N, N:1, N:N); real-world data quality (e.g., nulls, duplicates, inconsistent formats)
  • Large-Full: 22 large databases; ~1,340 columns per DB; complex relationships (1:1, 1:N, N:1, N:N); real-world data quality (e.g., nulls, duplicates, inconsistent formats)

Hierarchical Knowledge Base (HKB)
  • Base-Lite: structured HKB-JSON format only
  • Base-Full: dual format support: (1) structured HKB-JSON, (2) unstructured HKB-Document (coming soon)
  • Large-Full: dual format support: (1) structured HKB-JSON, (2) unstructured HKB-Document (coming soon)

* Example1 (more formal): "For our archaeological site evaluation, I need to quantify the Digital Preservation Quality metrics across our collection. Please compute a comprehensive DPQ index for each archaeological location. Present the results in descending order of DPQ values, displaying only the site identification code, site designation, and calculated DPQ value (rounded to two decimal places) to facilitate prioritization of our digital preservation resources."

* Example2 (more colloquial): "To better allocate our digital preservation resources, I need to identify which archaeological sites have the most robust digital records. Can you generate a ranked list showing each site's primary identifier, its designation, and a metric quantifying its digital preservation quality? Please sort the list to show the highest-quality sites first, with the quality metric rounded to the second decimal."

Want new dialects? Vote for new SQL dialects πŸ—³οΈ here!

Created By:

BIRD Team & Google Cloud

Citation

If you find this work useful, please cite:

@misc{livesqlbench2025,
  author       = {BIRD Team},
  title        = {LiveSQLBench: A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks},
  year         = {2024},
  howpublished = {https://github.com/bird-bench/livesqlbench},
  note         = {Accessed: 2025-05-22}
}

Changelog

All notable changes to this project will be documented in this file.

2025-09-04

  • Released LiveSQLBench-Base-Full v1 with 600 NEW tasks over 22 NEW real, complex databases with KB docs.
  • New features: more natural, reasoning-intensive user tasks and richer, noisier DB schemas/values.
  • The leaderboard has been updated with new LLM results (e.g., GPT-5, gemini-2.5-pro).

2025-06-06

  • SQL evaluation postprocessing improvements (see the sketch after this list):
    • Removed rounding in the SQL postprocessing step
    • Added rounding of execution results to 2 decimal places
    • Modified comparison logic for ordered conditions:
      • List comparison is used when "order" is true in conditions
      • Set comparison is used when "order" is false in conditions
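
The rule described above can be summarized as: round execution results to two decimal places, then compare them as ordered lists when the task's conditions require order and as sets otherwise. Below is a minimal sketch of that rule, assuming each result row is a tuple of values (the actual evaluator may normalize types differently).

# Compare predicted vs. gold execution results under the rule sketched above.
def round_row(row, places=2):
    return tuple(round(v, places) if isinstance(v, float) else v for v in row)

def results_match(pred_rows, gold_rows, order_matters, places=2):
    pred = [round_row(r, places) for r in pred_rows]
    gold = [round_row(r, places) for r in gold_rows]
    if order_matters:                   # "order": true  -> list comparison
        return pred == gold
    return set(pred) == set(gold)       # "order": false -> set comparison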
