AI-powered platform for searching, downloading, and analyzing public datasets with natural language Q&A and interactive visualizations.
- Multi-source dataset search — queries 7 sources (data.gov, World Bank, Kaggle, HuggingFace, SDOH Place, CMS, Harvard Dataverse) simultaneously
- AI-ranked results — GPT-5-mini ranks and describes datasets by relevance to your question
- Interactive analysis — ask follow-up questions in natural language, get SQL/Python-backed answers with Plotly charts
- Secure REPL — write your own SQL or Python against loaded datasets in a sandboxed environment
- Cross-dataset joins — load multiple datasets into the same DuckDB session and join them with SQL
- Large dataset support — DuckDB handles files larger than RAM via streaming
- Admin email management — add/remove allowed users at runtime without restarting
```
React (Vite + TypeScript + Plotly.js)
        ↓ /api/*
FastAPI
 ├── Auth (JWT + email allowlist)
 ├── Admin (runtime allowlist management)
 ├── Dataset Search (7 sources → GPT-5-mini ranking)
 ├── Analysis (GPT-5.2 → SQL/Plotly chart generation)
 ├── DuckDB (per-session, in-memory)
 └── Sandbox (RestrictedPython REPL)
```
- Python 3.12+
- Node.js 22+
- uv — Python package manager
- Azure OpenAI — two deployments: a mini model (search/ranking) and a full model (analysis)
```bash
git clone https://github.com/DrDavidL/public-data-analysis.git
cd public-data-analysis
cp .env.example .env
```

Edit `.env` with your credentials:

```bash
# Azure OpenAI (required)
AZURE_ENDPOINT="https://your-endpoint.openai.azure.com/"
AZURE_API_KEY="your-key"
AZURE_DEPLOYMENT_MINI="your-mini-deployment"
AZURE_DEPLOYMENT_FULL="your-full-deployment"

# Auth (required)
JWT_SECRET="generate-a-strong-secret"
ALLOWED_EMAILS="[email protected]"
ADMIN_EMAILS="[email protected]"

# Dataset sources (optional, enables more results)
DATAGOV_API_KEY=""      # Free from api.data.gov
KAGGLE_API_TOKEN=""     # From kaggle.com/settings
```

Install dependencies:

```bash
# Backend
cd backend && uv sync && cd ..

# Frontend
cd frontend && npm install && cd ..
```

Both servers at once:

```bash
bash scripts/dev.sh
```

Or separately:

```bash
# Terminal 1 — backend on :8000
cd backend && uv run fastapi dev app/main.py --port 8000

# Terminal 2 — frontend on :5173
cd frontend && npm run dev
```

Open http://localhost:5173, register with an email from your ALLOWED_EMAILS list, and start searching.
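JWT_SECRET can be any high-entropy string; one quick way to generate a suitable value:

```python
import secrets

# Print a 64-character URL-safe token suitable for JWT_SECRET.
print(secrets.token_urlsafe(48))
```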
```bash
# Lint (ruff check + format)
bash scripts/lint.sh

# Tests
cd backend && uv run pytest tests/ -v

# Dependency audit
bash scripts/audit.sh
```

Build and run with Docker:

```bash
docker build -t public-data-analysis .
docker run --env-file .env -p 8000:8000 public-data-analysis
```

Open http://localhost:8000 — the container serves both the API and the built frontend.
The app runs on Azure Container Apps with automatic CI/CD via GitHub Actions.
```bash
RG="pubdata-rg"
ACR="pubdataacr"
ENV="pubdata-env"
APP="pubdata-app"
LOCATION="eastus"

# Resource group
az group create --name $RG --location $LOCATION

# Container registry
az acr create --resource-group $RG --name $ACR --sku Basic --admin-enabled true

# Container Apps environment
az containerapp env create --name $ENV --resource-group $RG --location $LOCATION
```

Build and push the image:

```bash
az acr login --name $ACR
docker build --platform linux/amd64 -t $ACR.azurecr.io/public-data-analysis:latest .
docker push $ACR.azurecr.io/public-data-analysis:latest
```

Create the container app:

```bash
az containerapp create \
  --name $APP \
  --resource-group $RG \
  --environment $ENV \
  --image $ACR.azurecr.io/public-data-analysis:latest \
  --registry-server $ACR.azurecr.io \
  --target-port 8000 \
  --ingress external \
  --min-replicas 0 \
  --max-replicas 3 \
  --cpu 1 \
  --memory 2Gi \
  --env-vars \
    AZURE_ENDPOINT="https://your-endpoint.openai.azure.com/" \
    AZURE_API_KEY="your-key" \
    AZURE_DEPLOYMENT_MINI="your-mini-deployment" \
    AZURE_DEPLOYMENT_FULL="your-full-deployment" \
    JWT_SECRET="your-production-secret" \
    ALLOWED_EMAILS="[email protected]" \
    ADMIN_EMAILS="[email protected]" \
    CORS_ORIGINS="https://your-app.azurecontainerapps.io"
```

The workflow (.github/workflows/ci.yml) runs lint, tests, frontend build, and Docker build on every push/PR. On push to main, it also deploys to Azure Container Apps.
Required GitHub secrets:

| Secret | Value |
|---|---|
| AZURE_CREDENTIALS | Service principal JSON from `az ad sp create-for-rbac --sdk-auth` |
| ACR_USERNAME | ACR admin username |
| ACR_PASSWORD | ACR admin password |

Required GitHub variables:

| Variable | Value |
|---|---|
| ACR_LOGIN_SERVER | e.g. pubdataacr.azurecr.io |
| CONTAINER_APP_NAME | e.g. pubdata-app |
| RESOURCE_GROUP | e.g. pubdata-rg |
```bash
az acr login --name $ACR
docker build --platform linux/amd64 -t $ACR.azurecr.io/public-data-analysis:latest .
docker push $ACR.azurecr.io/public-data-analysis:latest
az containerapp update --name $APP --resource-group $RG \
  --image $ACR.azurecr.io/public-data-analysis:latest
```

Manage the email allowlist at runtime (requires ADMIN_EMAILS membership):
```bash
TOKEN="your-jwt-token"

# List allowed emails
curl -H "Authorization: Bearer $TOKEN" https://your-app/api/admin/allowlist

# Add emails
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"emails": ["[email protected]"]}' \
  https://your-app/api/admin/allowlist

# Remove an email
curl -X DELETE -H "Authorization: Bearer $TOKEN" \
  https://your-app/api/admin/allowlist/[email protected]
```

Changes take effect immediately. The ALLOWED_EMAILS env var seeds the allowlist on startup; runtime changes persist until the app restarts.
User accounts (email + hashed password) are currently stored in memory and lost on restart. The email allowlist persists via the ALLOWED_EMAILS env var, but users must re-register after each deploy or restart. Below is the planned fix.
Azure Table Storage is the best fit — serverless, pennies/month, zero infrastructure, and perfect for simple key-value user records.
- Add dependency: the `azure-data-tables` package
- Add env var: `AZURE_STORAGE_CONNECTION_STRING` (from an Azure Storage Account)
- Create `services/user_store.py`:
  - On startup, connect to a `users` table in Azure Table Storage
  - `register(email, hashed_password)` → insert row (PartitionKey="users", RowKey=email)
  - `get(email)` → retrieve hashed password
  - `exists(email)` → check if registered
  - Falls back to an in-memory dict if no connection string is set (local dev)
- Update `routers/auth.py`: replace `_users: dict` with calls to `user_store`
- Azure setup:

```bash
# Create storage account (one-time)
az storage account create --name pubdatastorage --resource-group pubdata-rg --sku Standard_LRS

# Get connection string
az storage account show-connection-string --name pubdatastorage --resource-group pubdata-rg -o tsv

# Set on container app
az containerapp update --name pubdata-app --resource-group pubdata-rg \
  --set-env-vars "AZURE_STORAGE_CONNECTION_STRING=<connection-string>"
```
| Option | Pros | Cons |
|---|---|---|
| Azure Table Storage | Serverless, ~$0.01/mo, no schema | Limited querying |
| SQLite on Azure Files | Simple, familiar | Requires volume mount, write locking |
| Cosmos DB | Powerful, global replication | Overkill and expensive for user records |
| PostgreSQL Flexible Server | Full relational DB | ~$15+/mo minimum, overkill |
For a user table with a few dozen rows, Table Storage is the clear winner.
```
public-data-analysis/
├── backend/
│   ├── app/
│   │   ├── main.py          # FastAPI app
│   │   ├── config.py        # Settings from .env
│   │   ├── routers/         # auth, datasets, analysis, admin
│   │   ├── services/        # AI, search, analysis, sandbox, datastore, allowlist
│   │   │   └── sources/     # 7 dataset source adapters
│   │   ├── schemas/         # Pydantic models
│   │   └── core/            # JWT security, session manager
│   ├── tests/
│   └── pyproject.toml
├── frontend/
│   ├── src/
│   │   ├── pages/           # Login, Search, Analysis
│   │   ├── components/      # Charts, Chat, REPL, Sidebar
│   │   ├── hooks/           # useAuth
│   │   └── api/             # Axios client
│   ├── package.json
│   └── vite.config.ts
├── scripts/                 # dev.sh, lint.sh, audit.sh
├── .github/workflows/ci.yml
├── Dockerfile
├── .dockerignore
├── .env.example
└── CLAUDE.md
```