DataLens

Natural-language analytics over relational databases using hybrid schema retrieval, constrained SQL generation, validation, correction retries, and automatic visualizations.

Live demo | Security notes

What It Demonstrates

Database schema profiling and document construction at connection time
Hybrid FAISS and BM25 retrieval with Reciprocal Rank Fusion
Gemini-based SQL generation grounded in retrieved tables and columns
A read-only SQL policy enforced with sqlglot
Database-aware validation with EXPLAIN before execution
Error-guided correction for failed generations, with bounded retries
Result-size limits, heuristic chart selection, and generated summaries
Offline unit tests, GitHub Actions CI, and a reproducible Chinook benchmark

This is a portfolio project and experimental analytics assistant, not a production database administration tool.

Pipeline

Component	Responsibility
`SchemaProfiler`	Inspects tables, columns, and sample rows; generates table descriptions
`SchemaRetriever`	Combines semantic and keyword retrieval over schema documents
`SQLAgent`	Generates one SQL query using the retrieved schema context
`Validator`	Rejects non-read-only or multi-statement SQL, then runs `EXPLAIN`
`Correction loop`	Sends validation feedback back to the SQL agent for up to three attempts
`InsightAgent`	Executes validated SQL, caps results, selects a chart, and creates a summary
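
As an illustration of the fusion step behind `SchemaRetriever`, the sketch below applies Reciprocal Rank Fusion to two ranked lists of table names. It is a minimal standalone example, not the repository's code: the function name, the constant `k=60` (a commonly used default), and the sample rankings are assumptions. Only the technique and `top_k=5` come from this README.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse ranked lists of document ids; earlier ranks contribute more."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by id so ties are deterministic.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_k]


# Hypothetical rankings for one question (not taken from the benchmark).
semantic = ["Track", "Album", "Genre"]      # e.g. order from a vector index
keyword = ["Artist", "Track", "Playlist"]   # e.g. order from BM25

fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # ['Track', 'Artist', 'Album', 'Genre', 'Playlist']
```

A table that appears in both lists, such as `Track` here, accumulates two contributions and rises to the top. This is the property that lets keyword matches compensate for a semantic miss, and the reverse.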

Safety Boundary

DataLens accepts a single read-only query. It rejects DML, DDL, transactions, administrative commands, and multiple statements. Returned result sets are capped at 1,000 rows.
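
The EXPLAIN check and the row cap can be sketched with the standard-library `sqlite3` module. This is an illustrative example under assumptions: `explain_then_run` and the `track` table are hypothetical names, and the sketch presumes the SQL has already passed a read-only policy check, which it does not perform itself.

```python
import sqlite3

MAX_ROWS = 1000  # the cap stated in this README


def explain_then_run(conn, sql, max_rows=MAX_ROWS):
    """Compile the query with EXPLAIN first, then fetch at most max_rows rows."""
    # EXPLAIN returns the statement's bytecode listing instead of running it,
    # so unknown tables or columns fail here, before any real execution.
    conn.execute(f"EXPLAIN {sql}").fetchall()
    cursor = conn.execute(sql)
    return cursor.fetchmany(max_rows)


# Hypothetical in-memory table with more rows than the cap.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO track (name) VALUES (?)", [(f"t{i}",) for i in range(1500)]
)

rows = explain_then_run(conn, "SELECT id, name FROM track ORDER BY id")
print(len(rows))  # 1000

try:
    explain_then_run(conn, "SELECT missing_column FROM track")
except sqlite3.OperationalError as exc:
    print("rejected before execution:", exc)
```

Using `fetchmany` bounds what reaches the application even when the query itself has no `LIMIT` clause.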

Application checks are not a substitute for database permissions. Use a dedicated read-only account with access limited to approved schemas. Schema metadata, sampled rows, and some query-result previews are sent to Google Gemini. Do not use confidential, regulated, personal, or production data in the public demo.

See SECURITY.md for the complete data-handling guidance.
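
For SQLite files there are no database accounts, so one way to add a second layer beneath the application checks is a connection-level authorizer. The sketch below uses `sqlite3.Connection.set_authorizer` from the Python standard library. It is a swapped-in illustration of defense in depth, not something this README says DataLens does, and the `invoice` table and helper names are hypothetical.

```python
import sqlite3

# Actions a plain SELECT needs; every other action code is denied.
ALLOWED_ACTIONS = {
    sqlite3.SQLITE_SELECT,
    sqlite3.SQLITE_READ,
    sqlite3.SQLITE_FUNCTION,
}


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Called by SQLite while it compiles each statement."""
    return sqlite3.SQLITE_OK if action in ALLOWED_ACTIONS else sqlite3.SQLITE_DENY


def is_denied(conn, sql):
    """Return True if the connection refuses to run the statement."""
    try:
        conn.execute(sql)
    except sqlite3.DatabaseError:
        return True
    return False


# Hypothetical setup: create the data first, then lock the connection down.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO invoice (total) VALUES (9.99)")
conn.commit()
conn.set_authorizer(read_only_authorizer)

print(conn.execute("SELECT COUNT(*) FROM invoice").fetchone())  # (1,)
print(is_denied(conn, "DELETE FROM invoice"))  # True
print(is_denied(conn, "DROP TABLE invoice"))   # True
```

For server databases the equivalent control is the read-only account described above. The authorizer applies only to SQLite connections.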

Evaluation

The repository includes 15 natural-language questions over the public Chinook database. Each generated query is executed and compared with a reference query.

Chinook Results

Single run performed in June 2026 using gemini-2.5-flash, gemini-embedding-001, hybrid FAISS/BM25 retrieval, and top_k=5.

Metric	Result
Questions	15
First-attempt valid SQL	14/15 (93.3%)
Final valid SQL	14/15 (93.3%)
Execution success	14/15 (93.3%)
Exact result match	12/15 (80.0%)
Average corrections	0.0
Average generation latency	3.74 seconds

The three non-matching cases are useful failure signals:

Retrieval miss: the question asking for the artist with the most tracks did not retrieve the Artist table, so the model returned NO_ANSWER.
Interpretation mismatch: “playlists contain the most tracks” produced only the playlists tied for the maximum, while the reference expected a complete descending ranking.
Tie-order mismatch: the expensive-tracks query returned valid top-priced tracks, but selected a different set among equal-price rows because it did not apply the reference query's secondary alphabetical ordering.

These figures are from one model run on a small public database, not a claim of general text-to-SQL accuracy. LLM output and latency can vary between runs, and the exact-result evaluator is intentionally strict about tied rows.
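
The strictness about tied rows can be reproduced with a small sketch. It assumes an evaluator that compares result rows as multisets. The `exact_match` function and the track data are hypothetical and are not taken from `scripts/run_benchmark.py`.

```python
from collections import Counter


def exact_match(generated_rows, reference_rows, ordered=False):
    """Strict comparison: same rows with the same multiplicity (and order if requested)."""
    if ordered:
        return list(generated_rows) == list(reference_rows)
    return Counter(generated_rows) == Counter(reference_rows)


# Hypothetical data: three tracks share the top price, and the question asks for two.
tracks = [("Alpha", 1.99), ("Bravo", 1.99), ("Charlie", 1.99), ("Delta", 0.99)]

# Reference query: price descending, then name ascending as the tie-breaker.
reference = sorted(tracks, key=lambda t: (-t[1], t[0]))[:2]

# Generated query: price descending only. The input is reversed to mimic an
# engine returning tied rows in a different, equally valid order.
generated = sorted(reversed(tracks), key=lambda t: -t[1])[:2]

print(reference)                          # [('Alpha', 1.99), ('Bravo', 1.99)]
print(generated)                          # [('Charlie', 1.99), ('Bravo', 1.99)]
print(exact_match(generated, reference))  # False
```

Both result sets contain only top-priced tracks, yet the comparison fails because the tie was broken differently. The tie-order mismatch in the list of failures is this kind of case.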

To reproduce the benchmark from PowerShell:

$env:GOOGLE_API_KEY="your-key"
python -m scripts.run_benchmark

Results are written to benchmark_results/chinook_results.json and are ignored by Git by default.

Run Locally

Requirements:

Python 3.11+
A Gemini API key from Google AI Studio

git clone https://github.com/VishwasPrabhakara/datalens.git
cd datalens
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -r requirements.txt
Copy-Item .env.example .env
streamlit run app.py

Add your Gemini API key to .env before starting the app, then open http://localhost:8501. Keys are created in Google AI Studio; current authorization keys start with `AQ.`, while older standard keys commonly start with `AIza`.

The app supports the bundled Chinook database, uploaded SQLite files, and SQLAlchemy connection URIs. Non-SQLite databases require the appropriate Python database driver to be installed separately.

Tests

The automated tests do not call Gemini. They cover the SQL policy, semantic validation, correction-loop behavior, query execution, and the result-size cap.

python -m pip install -r requirements-dev.txt
pytest

GitHub Actions runs the same test suite on every push and pull request. Scripts under scripts/manual_*.py are optional Gemini-backed integration checks.
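
A bounded correction loop can be exercised offline by replacing the model with a stub, as in the sketch below. All names here (`run_with_corrections`, `LoopResult`, the fake generator and validator) are hypothetical and are not the contents of `loop.py` or `tests/test_loop.py`. The sketch also assumes that "up to three attempts" means three generations in total.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_ATTEMPTS = 3  # assumed reading of the "up to three attempts" bound


@dataclass
class LoopResult:
    sql: Optional[str]          # None if no attempt passed validation
    attempts: int               # number of generations performed
    errors: list = field(default_factory=list)


def run_with_corrections(question, generate, validate, max_attempts=MAX_ATTEMPTS):
    """Call generate(question, feedback) until validate(sql) reports no error."""
    errors = []
    feedback = None
    for attempt in range(1, max_attempts + 1):
        sql = generate(question, feedback)
        error = validate(sql)
        if error is None:
            return LoopResult(sql, attempt, errors)
        errors.append(error)
        feedback = error  # the next generation sees the validator's message
    return LoopResult(None, max_attempts, errors)


# Stubs standing in for the LLM and the database-backed validator.
def fake_generate(question, feedback):
    return "SELECT Name FROM Artist" if feedback else "SELECT Nme FROM Artist"


def fake_validate(sql):
    return None if "Name" in sql else "no such column: Nme"


result = run_with_corrections("List all artists", fake_generate, fake_validate)
print(result.sql, result.attempts)  # SELECT Name FROM Artist 2
```

Because the generator and validator are plain callables, a test can assert both the recovery path shown here and the give-up path after the final attempt without any network access.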

Project Layout

datalens/
|-- .github/workflows/tests.yml
|-- benchmarks/chinook_questions.json
|-- scripts/
|   |-- run_benchmark.py
|   `-- manual_*.py
|-- tests/
|   |-- test_insight.py
|   |-- test_loop.py
|   `-- test_validator.py
|-- app.py
|-- datalens.py
|-- schema_profiler.py
|-- retrieval.py
|-- agents.py
|-- validator.py
|-- loop.py
|-- insight.py
|-- prompts.py
|-- chinook.db
|-- architecture.svg
|-- SECURITY.md
`-- requirements.txt

Stack

Python, Streamlit, Gemini, LangChain, FAISS, BM25, SQLAlchemy, sqlglot, pandas, Plotly, and pytest.

Author

Built by Vishwas Prabhakara, Project Assistant (AIML) at the Indian Institute of Science.

License

MIT
