Thanks to visit codestin.com
Credit goes to github.com

Skip to content

VishwasPrabhakara/datalens

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DataLens

Tests Live Demo Python 3.11

Natural-language analytics over relational databases using hybrid schema retrieval, constrained SQL generation, validation, correction retries, and automatic visualizations.

Live demo | Security notes

DataLens architecture

What It Demonstrates

  • Database schema profiling and document construction at connection time
  • Hybrid FAISS and BM25 retrieval with Reciprocal Rank Fusion
  • Gemini-based SQL generation grounded in retrieved tables and columns
  • A read-only SQL policy enforced with sqlglot
  • Database-aware validation with EXPLAIN before execution
  • Error-guided correction for failed generations, with bounded retries
  • Result-size limits, heuristic chart selection, and generated summaries
  • Offline unit tests, GitHub Actions CI, and a reproducible Chinook benchmark

This is a portfolio project and experimental analytics assistant, not a production database administration tool.

Pipeline

Component Responsibility
SchemaProfiler Inspects tables, columns, and sample rows; generates table descriptions
SchemaRetriever Combines semantic and keyword retrieval over schema documents
SQLAgent Generates one SQL query using the retrieved schema context
Validator Rejects non-read-only or multi-statement SQL, then runs EXPLAIN
Correction loop Sends validation feedback back to the SQL agent for up to three attempts
InsightAgent Executes validated SQL, caps results, selects a chart, and creates a summary

Safety Boundary

DataLens accepts a single read-only query. It rejects DML, DDL, transactions, administrative commands, and multiple statements. Returned result sets are capped at 1,000 rows.

Application checks are not a substitute for database permissions. Use a dedicated read-only account with access limited to approved schemas. Schema metadata, sampled rows, and some query-result previews are sent to Google Gemini. Do not use confidential, regulated, personal, or production data in the public demo.

See SECURITY.md for the complete data-handling guidance.

Evaluation

The repository includes 15 natural-language questions over the public Chinook database. Each generated query is executed and compared with a reference query.

Chinook Results

Single run performed in June 2026 using gemini-2.5-flash, gemini-embedding-001, hybrid FAISS/BM25 retrieval, and top_k=5.

Metric Result
Questions 15
First-attempt valid SQL 14/15 (93.3%)
Final valid SQL 14/15 (93.3%)
Execution success 14/15 (93.3%)
Exact result match 12/15 (80.0%)
Average corrections 0.0
Average generation latency 3.74 seconds

The three non-matching cases are useful failure signals:

  • Retrieval miss: the question asking for the artist with the most tracks did not retrieve the Artist table, so the model returned NO_ANSWER.
  • Interpretation mismatch: “playlists contain the most tracks” produced only the playlists tied for the maximum, while the reference expected a complete descending ranking.
  • Tie-order mismatch: the expensive-tracks query returned valid top-priced tracks, but selected a different set among equal-price rows because it did not apply the reference query's secondary alphabetical ordering.

These figures are from one model run on a small public database, not a claim of general text-to-SQL accuracy. LLM output and latency can vary between runs, and the exact-result evaluator is intentionally strict about tied rows.

$env:GOOGLE_API_KEY="your-key"
python -m scripts.run_benchmark

Results are written to benchmark_results/chinook_results.json and are ignored by Git by default.

Run Locally

Requirements:

git clone https://github.com/VishwasPrabhakara/datalens.git
cd datalens
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -r requirements.txt
Copy-Item .env.example .env
streamlit run app.py

Add your API key to .env, then open http://localhost:8501. Use a Gemini API key created in Google AI Studio. Current authorization keys start with AQ.; older standard keys commonly start with AIza.

The app supports the bundled Chinook database, uploaded SQLite files, and SQLAlchemy connection URIs. Non-SQLite databases require the appropriate Python database driver to be installed separately.

Tests

The automated tests do not call Gemini. They cover the SQL policy, semantic validation, correction-loop behavior, query execution, and result-size cap.

python -m pip install -r requirements-dev.txt
pytest

GitHub Actions runs the same test suite on every push and pull request. Scripts under scripts/manual_*.py are optional Gemini-backed integration checks.

Project Layout

datalens/
|-- .github/workflows/tests.yml
|-- benchmarks/chinook_questions.json
|-- scripts/
|   |-- run_benchmark.py
|   `-- manual_*.py
|-- tests/
|   |-- test_insight.py
|   |-- test_loop.py
|   `-- test_validator.py
|-- app.py
|-- datalens.py
|-- schema_profiler.py
|-- retrieval.py
|-- agents.py
|-- validator.py
|-- loop.py
|-- insight.py
|-- prompts.py
|-- chinook.db
|-- architecture.svg
|-- SECURITY.md
`-- requirements.txt

Stack

Python, Streamlit, Gemini, LangChain, FAISS, BM25, SQLAlchemy, sqlglot, pandas, Plotly, and pytest.

Author

Built by Vishwas Prabhakara, Project Assistant (AIML) at the Indian Institute of Science.

License

MIT

Releases

No releases published

Packages

 
 
 

Contributors

Languages