This repository contains scripts and data for analysing Aadhaar enrolment, biometric, and demographic data.
- Python 3.8+
- UV (recommended)
- Typst (for report compilation)
Tip: To install UV, follow the instructions in the official documentation. To install Typst, see typst.app.

    git clone https://github.com/arnav-kr/aadhaar-stats.git
    cd aadhaar-stats
    uv sync

The main script runs all analysis stages and compiles the final report:

    uv run main.py
    uv run main.py --skip-analysis # only compile report
    uv run main.py --skip-report # only run analysis scripts
    uv run main.py --assistant # launch AI assistant

The project includes an AI-powered assistant for exploring the analysis data interactively.
- Create a .env file in the project root with your API key:

      AI_API_KEY=your-api-key-here
      AI_MODEL=gemini-3-flash-preview

- Get an API key from Google AI Studio

Then start the assistant:

    uv run main.py --assistant

The assistant can answer questions about:
- Enrolment statistics and trends
- State and district comparisons
- Migration patterns
- Data quality metrics
- Anomaly detection results
- And more...
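
To illustrate the .env setup described above, here is a minimal sketch of how the two settings could be read. It is not the project's actual loading code: the helper names `load_env_file` and `assistant_settings` are hypothetical, and letting real environment variables take precedence over the file is an assumption.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def assistant_settings(path=".env"):
    """Return the API key and model name (hypothetical helper).

    Assumption: variables already set in the environment override the file.
    """
    file_values = load_env_file(path)
    api_key = os.environ.get("AI_API_KEY") or file_values.get("AI_API_KEY")
    model = os.environ.get("AI_MODEL") or file_values.get("AI_MODEL")
    if not api_key:
        raise RuntimeError("AI_API_KEY is not set; add it to .env or the environment")
    return {"api_key": api_key, "model": model}
```

Calling `assistant_settings()` from the project root would return a dictionary with the key and model name, or raise an error if no key is configured.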
Individual analysis scripts can also be run directly:

    uv run scripts/preprocess.py
    uv run scripts/univariate.py
    # etc.

├── main.py # Main pipeline script
├── assistant/ # AI-powered data exploration assistant
│ ├── __init__.py
│ ├── chat.py # Chat interface using Gemini
│ └── data_provider.py # Local data context provider
├── data/
│ ├── raw/ # Raw Aadhaar CSV files
│ │ ├── enrolment/ # New enrolment records
│ │ ├── demographic/ # Demographic update records
│ │ └── biometric/ # Biometric update records
│ ├── processed/ # Cleaned and normalized data
│ ├── intermediate/ # Intermediate processing artifacts
│ └── maps/ # Geographic boundary files (shapefiles, geojson)
├── scripts/
│ ├── preprocess.py # Data cleaning and normalization
│ ├── univariate.py # Single-variable analysis
│ ├── bivariate.py # Two-variable relationship analysis
│ ├── trivariate.py # Three-variable interaction analysis
│ ├── data_quality.py # Data quality assessment
│ ├── advanced.py # Advanced insights and forecasting
│ ├── spatial.py # Geographic visualizations
│ └── utils/ # Shared utilities and constants
├── plots/
│ ├── univariate/ # Single-variable plots
│ ├── bivariate/ # Two-variable plots
│ ├── trivariate/ # Three-variable plots
│ ├── data_quality/ # Data quality visualizations
│ └── advanced/ # Advanced analysis plots
├── analysis/ # JSON outputs from analysis scripts
├── descriptions/ # YAML descriptions for plots and analysis
└── report/
    ├── main.typ # Typst source document
    └── report.pdf # Compiled PDF report (generated)
| Script | Description |
|---|---|
| preprocess.py | Loads raw CSVs, normalizes state/district names, validates pincodes, parses dates |
| univariate.py | State-wise distribution, age groups, temporal trends, activity patterns |
| bivariate.py | Correlation analysis, state-age relationships, migration patterns |
| trivariate.py | State-time-enrolment clustering, age-time dynamics, anomaly detection |
| data_quality.py | Spelling variations, naming inconsistencies, data entry issues |
| advanced.py | Demand forecasting, migration corridors, fraud indicators, resource allocation |
| spatial.py | Geographic map visualizations using shapefiles |
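
The scripts listed above are the stages that the main pipeline runs before compiling the report. The sketch below shows one way such a runner could be organized around the `--skip-analysis` and `--skip-report` flags. It is an illustration only, not the contents of main.py: the stage order is assumed from the table order, the function names are hypothetical, and the report step assumes the standard `typst compile <input> <output>` command.

```python
import argparse
import subprocess
import sys

# Assumed stage order (taken from the script table); the real main.py
# may order or invoke the stages differently.
ANALYSIS_SCRIPTS = [
    "scripts/preprocess.py",
    "scripts/univariate.py",
    "scripts/bivariate.py",
    "scripts/trivariate.py",
    "scripts/data_quality.py",
    "scripts/advanced.py",
    "scripts/spatial.py",
]


def build_parser():
    parser = argparse.ArgumentParser(
        description="Run analysis stages and compile the report"
    )
    parser.add_argument("--skip-analysis", action="store_true",
                        help="only compile report")
    parser.add_argument("--skip-report", action="store_true",
                        help="only run analysis scripts")
    return parser


def plan_commands(args):
    """Return the commands to run, in order, for the given flags."""
    commands = []
    if not args.skip_analysis:
        commands += [[sys.executable, script] for script in ANALYSIS_SCRIPTS]
    if not args.skip_report:
        commands.append(["typst", "compile", "report/main.typ", "report/report.pdf"])
    return commands


def main(argv=None):
    args = build_parser().parse_args(argv)
    for command in plan_commands(args):
        # check=True stops the pipeline at the first failing stage.
        subprocess.run(command, check=True)
```

Separating `plan_commands` from `main` keeps the flag logic testable without executing any stage; calling `main(["--skip-analysis"])` would run only the Typst compilation.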
- 67 plots across 5 analysis categories
- JSON analysis files with computed statistics
- 71-page PDF report with findings and recommendations
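
The JSON analysis files can be loaded for further exploration outside the pipeline. The following is a minimal sketch, assuming the files sit directly under the analysis/ directory; the function name `load_analysis_outputs` is hypothetical and the structure of each file is not specified here.

```python
import json
from pathlib import Path


def load_analysis_outputs(directory="analysis"):
    """Map each JSON file's name (without extension) to its parsed content."""
    outputs = {}
    for json_path in sorted(Path(directory).glob("*.json")):
        with json_path.open(encoding="utf-8") as handle:
            outputs[json_path.stem] = json.load(handle)
    return outputs
```

The returned dictionary is keyed by file name, so its keys show which analysis outputs are available before inspecting any of them.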
This project is licensed under the AGPL-3.0 License. See the LICENSE file for details.