Natural-language analytics over relational databases using hybrid schema retrieval, constrained SQL generation, validation, correction retries, and automatic visualizations.
- Database schema profiling and document construction at connection time
- Hybrid FAISS and BM25 retrieval with Reciprocal Rank Fusion
- Gemini-based SQL generation grounded in retrieved tables and columns
- A read-only SQL policy enforced with `sqlglot`
- Database-aware validation with `EXPLAIN` before execution
- Error-guided correction for failed generations, with bounded retries
- Result-size limits, heuristic chart selection, and generated summaries
- Offline unit tests, GitHub Actions CI, and a reproducible Chinook benchmark
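As an illustration of the Reciprocal Rank Fusion step named in the list above, here is a minimal sketch in plain Python. It is not the repository's `SchemaRetriever` code: the function name, the constant `k=60` (a common default for RRF), and the sample table rankings are assumptions made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) per list it appears in; the
    scores are summed across lists. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_k]]


# Hypothetical rankings from a semantic index and a keyword index.
semantic = ["Track", "Album", "Artist", "Genre"]
keyword = ["Artist", "Track", "Playlist"]

fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Documents that rank well in both lists rise to the top, even when neither retriever ranks them first, which is the reason to fuse ranks rather than raw scores from two differently scaled retrievers.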
This is a portfolio project and experimental analytics assistant, not a production database administration tool.
| Component | Responsibility |
|---|---|
| `SchemaProfiler` | Inspects tables, columns, and sample rows; generates table descriptions |
| `SchemaRetriever` | Combines semantic and keyword retrieval over schema documents |
| `SQLAgent` | Generates one SQL query using the retrieved schema context |
| `Validator` | Rejects non-read-only or multi-statement SQL, then runs `EXPLAIN` |
| Correction loop | Sends validation feedback back to the SQL agent for up to three attempts |
| `InsightAgent` | Executes validated SQL, caps results, selects a chart, and creates a summary |
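The generate, validate, and retry flow described in the table can be sketched as a small bounded loop. This is a hedged illustration rather than the code in `loop.py`: `generate_with_corrections`, `LoopResult`, and the stub generator and validator are hypothetical names, and the sketch assumes "up to three attempts" means three generations in total.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoopResult:
    sql: Optional[str]          # validated SQL, or None if every attempt failed
    attempts: int               # number of generations that were made
    errors: List[str] = field(default_factory=list)


def generate_with_corrections(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> LoopResult:
    """Call generate() until validate() returns None or attempts run out."""
    errors: List[str] = []
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        sql = generate(question, feedback)
        error = validate(sql)
        if error is None:
            return LoopResult(sql, attempt, errors)
        errors.append(error)
        feedback = error  # the next generation sees the latest validation error
    return LoopResult(None, max_attempts, errors)


# Stub generator: returns a broken query first, a fixed one after feedback.
def fake_generate(question, feedback):
    return "SELECT 1" if feedback else "SELEC 1"


# Stub validator: returns an error message, or None when the SQL is accepted.
def fake_validate(sql):
    return None if sql.startswith("SELECT") else "syntax error near SELEC"


result = generate_with_corrections("How many tracks?", fake_generate, fake_validate)
print(result)
```

Keeping the generator and validator as injected callables is one way to exercise the loop offline, without calling a model.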
DataLens accepts a single read-only query. It rejects DML, DDL, transactions, administrative commands, and multiple statements. Returned result sets are capped at 1,000 rows.
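To make the `EXPLAIN` check and the row cap concrete, the sketch below uses Python's standard-library `sqlite3` module against an in-memory table. It is an assumption-laden illustration, not the project's `Validator` or `InsightAgent`: the helper names and the demo `Track` table are invented here, and the `sqlglot` policy check that rejects non-read-only SQL is assumed to have run already.

```python
import sqlite3

MAX_ROWS = 1000  # mirrors the documented result cap


def explain_error(conn, sql):
    """Ask SQLite to plan the statement without running it.

    Returns None when planning succeeds, otherwise the error message.
    Assumes a separate policy check has already rejected unsafe SQL.
    """
    try:
        conn.execute(f"EXPLAIN {sql}")
    except sqlite3.Error as exc:
        return str(exc)
    return None


def run_capped(conn, sql, max_rows=MAX_ROWS):
    """Execute the query and fetch at most max_rows rows."""
    cursor = conn.execute(sql)
    return cursor.fetchmany(max_rows)


# Demo data: an in-memory table with more rows than the cap allows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Track (Name TEXT)")
conn.executemany(
    "INSERT INTO Track (Name) VALUES (?)",
    [(f"track-{i}",) for i in range(1500)],
)

ok_error = explain_error(conn, "SELECT Name FROM Track")
bad_error = explain_error(conn, "SELECT Nme FROM Track")  # misspelled column
rows = run_capped(conn, "SELECT Name FROM Track")
print(ok_error, bad_error, len(rows))
```

Planning the statement first surfaces unknown tables and columns as error text that a correction step can use, before any rows are read.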
Application checks are not a substitute for database permissions. Use a dedicated read-only account with access limited to approved schemas. Schema metadata, sampled rows, and some query-result previews are sent to Google Gemini. Do not use confidential, regulated, personal, or production data in the public demo.
See SECURITY.md for the complete data-handling guidance.
The repository includes 15 natural-language questions over the public Chinook database. Each generated query is executed and compared with a reference query.
The results below come from a single run performed in June 2026 using `gemini-2.5-flash`, `gemini-embedding-001`, hybrid FAISS/BM25 retrieval, and `top_k=5`.
| Metric | Result |
|---|---|
| Questions | 15 |
| First-attempt valid SQL | 14/15 (93.3%) |
| Final valid SQL | 14/15 (93.3%) |
| Execution success | 14/15 (93.3%) |
| Exact result match | 12/15 (80.0%) |
| Average corrections | 0.0 |
| Average generation latency | 3.74 seconds |
The three non-matching cases are useful failure signals:
- Retrieval miss: the question asking for the artist with the most tracks did not retrieve the `Artist` table, so the model returned `NO_ANSWER`.
- Interpretation mismatch: “playlists contain the most tracks” produced only the playlists tied for the maximum, while the reference expected a complete descending ranking.
- Tie-order mismatch: the expensive-tracks query returned valid top-priced tracks, but selected a different set among equal-price rows because it did not apply the reference query's secondary alphabetical ordering.
These figures are from one model run on a small public database, not a claim of general text-to-SQL accuracy. LLM output and latency can vary between runs, and the exact-result evaluator is intentionally strict about tied rows.
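To show why strictness about tied rows matters, the sketch below contrasts an exact, order-sensitive comparison with an order-insensitive one. It is not the benchmark's evaluator: both function names are hypothetical, and the rows are made-up values chosen only to illustrate ties.

```python
from collections import Counter


def exact_match(generated, reference):
    """Strict comparison: same rows in the same order."""
    return [tuple(r) for r in generated] == [tuple(r) for r in reference]


def same_rows_any_order(generated, reference):
    """Looser comparison: same multiset of rows, order ignored."""
    return Counter(map(tuple, generated)) == Counter(map(tuple, reference))


# Made-up rows: tracks tied on price, returned in a different order.
reference = [("Alpha", 1.99), ("Beta", 1.99)]
reordered = [("Beta", 1.99), ("Alpha", 1.99)]

# Made-up rows: a LIMIT picked a different subset among tied prices.
different_subset = [("Alpha", 1.99), ("Gamma", 1.99)]

print(exact_match(reordered, reference))                  # False
print(same_rows_any_order(reordered, reference))          # True
print(same_rows_any_order(different_subset, reference))   # False
```

An order-insensitive check forgives reordered ties, but a query that selects a different subset of tied rows still fails either comparison unless both queries share a deterministic secondary sort.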
$env:GOOGLE_API_KEY="your-key"
python -m scripts.run_benchmark

Results are written to benchmark_results/chinook_results.json and are ignored by Git by default.
Requirements:
- Python 3.11+
- A Gemini API key from Google AI Studio
git clone https://github.com/VishwasPrabhakara/datalens.git
cd datalens
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -r requirements.txt
Copy-Item .env.example .env
streamlit run app.py

Add your API key to .env, then open http://localhost:8501.
Use a Gemini API key created in Google AI Studio. Current authorization keys
start with AQ.; older standard keys commonly start with AIza.
The app supports the bundled Chinook database, uploaded SQLite files, and SQLAlchemy connection URIs. Non-SQLite databases require the appropriate Python database driver to be installed separately.
The automated tests do not call Gemini. They cover the SQL policy, semantic validation, correction-loop behavior, query execution, and result-size cap.
python -m pip install -r requirements-dev.txt
pytest

GitHub Actions runs the same test suite on every push and pull request.
Scripts under scripts/manual_*.py are optional Gemini-backed integration
checks.
datalens/
|-- .github/workflows/tests.yml
|-- benchmarks/chinook_questions.json
|-- scripts/
| |-- run_benchmark.py
| `-- manual_*.py
|-- tests/
| |-- test_insight.py
| |-- test_loop.py
| `-- test_validator.py
|-- app.py
|-- datalens.py
|-- schema_profiler.py
|-- retrieval.py
|-- agents.py
|-- validator.py
|-- loop.py
|-- insight.py
|-- prompts.py
|-- chinook.db
|-- architecture.svg
|-- SECURITY.md
`-- requirements.txt
Python, Streamlit, Gemini, LangChain, FAISS, BM25, SQLAlchemy, sqlglot, pandas, Plotly, and pytest.
Built by Vishwas Prabhakara, Project Assistant (AIML) at the Indian Institute of Science.