High-performance data ingestion pipeline for Indian index options (NIFTY / BANKNIFTY), designed as a small quant fund data stack:
- Pluggable providers (NSE official EOD + live option chain collector; vendor adapters are the intended path for full intraday history)
- Canonical, query-friendly Parquet datasets (columnar, partitioned, compactable)
- Explicit data-quality checks, quarantine paths, and snapshot logs (for gap detection / “missing ticks”)
Bronze → Silver → Gold
- Bronze (raw): raw provider payloads (JSON/CSV) for audit + reproducibility.
- Silver (canonical): normalized option_quotes fact table in Parquet (fast scans, stable schema).
- Gold (research-ready): surfaces, features, resampled bars, strategy-specific datasets (not built yet).
This repo implements Silver plus a snapshot_log dataset; it also writes quarantined payloads + invalid rows to data/quarantine/.
Grain: one row per (ts, symbol, expiry, strike, right).
Time semantics
- ts: timestamp in UTC (ms precision)
- ts_date: trade date in IST (partition key)
- ingest_ts: when your collector saw the snapshot (UTC)
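The IST trade-date partition key follows directly from the UTC timestamp. A minimal sketch using the stdlib zoneinfo (the repo's own helper may differ; trade_date_ist is a hypothetical name):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

IST = ZoneInfo("Asia/Kolkata")

def trade_date_ist(ts_utc: datetime) -> str:
    """Derive the ts_date partition key (IST trade date) from a UTC timestamp."""
    return ts_utc.astimezone(IST).date().isoformat()

# 18:40 UTC is already 00:10 IST on the next calendar day (UTC+5:30)
ts = datetime(2026, 2, 4, 18, 40, tzinfo=timezone.utc)
print(trade_date_ist(ts))  # → 2026-02-05
```

Deriving the partition key from the exchange-local calendar keeps one trading session in one partition even when snapshots straddle UTC midnight.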
Key columns
- Contract: symbol, expiry, strike, right (C/P)
- Market: bid, ask, bid_qty, ask_qty, last, iv (stored as decimal; 0.15 == 15%)
- Activity: oi, volume
- IDs: instrument_id (stable 64-bit), option_id (debug-friendly string)
- Provenance: source
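As an illustration of how a stable 64-bit id and a debug-friendly string id can be derived from the contract key (hypothetical sketch; the repo's actual derivation may differ, the point is determinism):

```python
import hashlib
from datetime import date

def instrument_id(symbol: str, expiry: date, strike: float, right: str) -> int:
    """Hypothetical: hash the contract key to a deterministic unsigned 64-bit id."""
    key = f"{symbol}|{expiry.isoformat()}|{strike:.2f}|{right}".encode()
    # 8-byte blake2b digest -> unsigned 64-bit integer
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")

def option_id(symbol: str, expiry: date, strike: float, right: str) -> str:
    """Hypothetical: human-readable form of the same key, handy in logs."""
    return f"{symbol}-{expiry:%Y%m%d}-{strike:g}-{right}"

print(option_id("NIFTY", date(2026, 2, 5), 22500.0, "C"))  # → NIFTY-20260205-22500-C
```

A content-derived id stays stable across re-ingestion runs, unlike an auto-increment key.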
Partitioning (Hive)
symbol=.../expiry=YYYY-MM-DD/ts_date=YYYY-MM-DD/part-....parquet
This makes backtests fast because you can prune by symbol/expiry/date at the filesystem level, then rely on Parquet projection + row-group stats for the remaining filters.
Two mechanisms are used:

- Row-level validation (dq_flags bitmask) to catch corruption:
  - missing required values
  - invalid right
  - crossed markets (ask < bid)
  - negative OI/volume
  - NaNs
  - IV out of bounds

  Invalid rows go to data/quarantine/option_quotes_invalid/.
- Snapshot log: every polling attempt writes a row to data/silver/snapshot_log/ with status + latency + record count. Your backtester can detect gaps by scanning snapshot_log and deciding how to handle them (skip, forward-fill, resample, etc.).
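A sketch of what such a bitmask check can look like (flag values and thresholds here are hypothetical, not the repo's actual bit layout):

```python
import math

# Hypothetical flag layout; the repo's actual bit assignments may differ.
DQ_MISSING_REQUIRED  = 1 << 0
DQ_INVALID_RIGHT     = 1 << 1
DQ_CROSSED_MARKET    = 1 << 2
DQ_NEGATIVE_ACTIVITY = 1 << 3
DQ_NAN               = 1 << 4
DQ_IV_OUT_OF_BOUNDS  = 1 << 5

def dq_flags(row: dict) -> int:
    """Return a bitmask of data-quality violations for one quote row."""
    flags = 0
    if row.get("symbol") is None or row.get("expiry") is None:
        flags |= DQ_MISSING_REQUIRED
    if row.get("right") not in ("C", "P"):
        flags |= DQ_INVALID_RIGHT
    bid, ask = row.get("bid"), row.get("ask")
    if bid is not None and ask is not None and ask < bid:
        flags |= DQ_CROSSED_MARKET
    if (row.get("oi") or 0) < 0 or (row.get("volume") or 0) < 0:
        flags |= DQ_NEGATIVE_ACTIVITY
    if any(isinstance(v, float) and math.isnan(v) for v in row.values()):
        flags |= DQ_NAN
    iv = row.get("iv")
    if iv is not None and not (0.0 < iv < 5.0):  # iv stored as decimal, e.g. 0.15
        flags |= DQ_IV_OUT_OF_BOUNDS
    return flags

row = {"symbol": "NIFTY", "expiry": "2026-02-05", "right": "C",
       "bid": 105.0, "ask": 100.0, "oi": 1200, "volume": 300, "iv": 0.18}
print(dq_flags(row) == DQ_CROSSED_MARKET)  # crossed market only → True
```

Keeping the flags as a bitmask lets one integer column record every violation per row, so the quarantine path can preserve the full diagnosis instead of just a pass/fail bit.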
Free / official
- NSE FO bhavcopy (EOD): reliable for historical OI/volume/close by contract, but no bid/ask and no IV.
Collector (build your own history going forward)
- NSE option-chain endpoint: good for intraday snapshots, but can be blocked/unstable; treat as best-effort collection, not “institutional history”.
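For a best-effort collector, retries with exponential backoff plus an always-written snapshot-log row keep failures visible rather than silent. A sketch under those assumptions (fetch and log_snapshot are hypothetical callables, not kaira APIs):

```python
import random
import time

def backoff_s(attempt: int, base_s: float = 1.0, cap_s: float = 60.0,
              jitter: float = 0.1) -> float:
    """Exponential backoff with +/-10% jitter, capped at cap_s seconds."""
    delay = min(cap_s, base_s * (2 ** attempt))
    return delay * (1 + random.uniform(-jitter, jitter))

def poll_once(fetch, log_snapshot, max_retries: int = 3):
    """Best-effort poll: retry transient failures, but log every attempt so
    gaps remain detectable in the snapshot log."""
    for attempt in range(max_retries):
        t0 = time.monotonic()
        try:
            payload = fetch()
        except Exception as exc:
            log_snapshot(status="error", latency_s=time.monotonic() - t0,
                         records=0, detail=str(exc))
            time.sleep(backoff_s(attempt))
            continue
        log_snapshot(status="ok", latency_s=time.monotonic() - t0,
                     records=len(payload))
        return payload
    return None  # exhausted retries; the error rows already document the gap
```

The design choice worth copying is that the log write happens on both paths: a failed poll still produces a snapshot-log row, which is what makes downstream gap detection possible.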
Commercial vendors (recommended for full historical intraday option chain)
You typically need a paid vendor for historical intraday option chain with bid/ask + IV. When evaluating vendors, confirm:
- true tick vs 1s/1m sampling
- full depth vs top-of-book
- corporate action / symbol-change handling
- exchange timestamp vs vendor timestamp
- survivorship bias in instrument master
- replays, corrections, and how they publish late/corrected data
Create an env and install:

py -m venv .venv
.\.venv\Scripts\Activate.ps1
py -m pip install -e .

Collect live option-chain snapshots (creates forward history):

py -m kaira collect nse-live --symbols NIFTY BANKNIFTY --interval-s 2 --duration-s 600

Backfill EOD FO bhavcopy (official, free):

py -m kaira backfill nse-bhavcopy --start 2024-01-01 --end 2024-03-31 --symbols NIFTY BANKNIFTY

Compact partitions (coalesce small files; de-dup by latest ingest_ts):

py -m kaira maint compact-option-quotes --dataset-dir data/silver/option_quotes --min-files 10

Use kaira.query.read_option_quotes_arrow() for fast predicate-pushdown reads:
from datetime import date
from kaira.query import read_option_quotes_arrow
from kaira.query.duckdb_reader import OptionQuoteQuery

t = read_option_quotes_arrow(
    "data/silver/option_quotes",
    query=OptionQuoteQuery(
        symbol="NIFTY",
        expiries=[date(2026, 2, 5)],
        trade_date_start=date(2026, 2, 1),
        trade_date_end=date(2026, 2, 4),
    ),
    columns=["ts", "expiry", "strike", "right", "bid", "ask", "iv", "oi", "volume"],
)