High-performance data ingestion pipeline for Indian index options (NIFTY / BANKNIFTY), designed as a small quant fund data stack:
- Pluggable providers (NSE official EOD + live option chain collector; vendor adapters are the intended path for full intraday history)
- Canonical, query-friendly Parquet datasets (columnar, partitioned, compactable)
- Explicit data-quality checks, quarantine paths, and snapshot logs (for gap detection / “missing ticks”)
Bronze → Silver → Gold
- Bronze (raw): raw provider payloads (JSON/CSV) for audit + reproducibility.
- Silver (canonical): normalized option_quotes fact table in Parquet (fast scans, stable schema).
- Gold (research-ready): surfaces, features, resampled bars, strategy-specific datasets (not built yet).
This repo implements Silver plus a snapshot_log dataset; it also writes quarantined payloads + invalid rows to data/quarantine/.
Grain: one row per (ts, symbol, expiry, strike, right).
Time semantics
- ts: timestamp in UTC (ms precision)
- ts_date: trade date in IST (partition key)
- ingest_ts: when your collector saw the snapshot (UTC)
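The IST trade-date partition key follows directly from the UTC timestamp. A minimal sketch using the stdlib zoneinfo (the repo's own helper may differ; trade_date_ist is a hypothetical name):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

IST = ZoneInfo("Asia/Kolkata")

def trade_date_ist(ts_utc: datetime) -> str:
    """Derive the ts_date partition key (IST trade date) from a UTC timestamp."""
    return ts_utc.astimezone(IST).date().isoformat()

# 18:40 UTC is already 00:10 IST on the next calendar day (UTC+5:30)
ts = datetime(2026, 2, 4, 18, 40, tzinfo=timezone.utc)
print(trade_date_ist(ts))  # → 2026-02-05
```

Deriving the partition key from the exchange-local calendar keeps one trading session in one partition even when snapshots straddle UTC midnight.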
Key columns
- Contract: symbol, expiry, strike, right (C/P)
- Market: bid, ask, bid_qty, ask_qty, last, iv (stored as decimal; 0.15 == 15%)
- Activity: oi, volume
- IDs: instrument_id (stable 64-bit), option_id (debug-friendly string)
- Provenance: source
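As an illustration of how a stable 64-bit id and a debug-friendly string id can be derived from the contract key (hypothetical sketch; the repo's actual derivation may differ, the point is determinism):

```python
import hashlib
from datetime import date

def instrument_id(symbol: str, expiry: date, strike: float, right: str) -> int:
    """Hypothetical: hash the contract key to a deterministic unsigned 64-bit id."""
    key = f"{symbol}|{expiry.isoformat()}|{strike:.2f}|{right}".encode()
    # 8-byte blake2b digest -> unsigned 64-bit integer
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")

def option_id(symbol: str, expiry: date, strike: float, right: str) -> str:
    """Hypothetical: human-readable form of the same key, handy in logs."""
    return f"{symbol}-{expiry:%Y%m%d}-{strike:g}-{right}"

print(option_id("NIFTY", date(2026, 2, 5), 22500.0, "C"))  # → NIFTY-20260205-22500-C
```

A content-derived id stays stable across re-ingestion runs, unlike an auto-increment key.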
Partitioning (Hive)
symbol=.../expiry=YYYY-MM-DD/ts_date=YYYY-MM-DD/part-....parquet
This makes backtests fast because you can prune by symbol/expiry/date at the filesystem level, then rely on Parquet projection + row-group stats for the remaining filters.
Two mechanisms are used:

- Row-level validation (dq_flags bitmask) to catch corruption:
  - missing required values
  - invalid right
  - crossed markets (ask < bid)
  - negative OI/volume
  - NaNs
  - IV out of bounds

  Invalid rows go to data/quarantine/option_quotes_invalid/.
- Snapshot log: every polling attempt writes a row to data/silver/snapshot_log/ with status + latency + record count. Your backtester can detect gaps by scanning snapshot_log and deciding how to handle them (skip, forward-fill, resample, etc.).
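A sketch of what such a bitmask check can look like (flag values and thresholds here are hypothetical, not the repo's actual bit layout):

```python
import math

# Hypothetical flag layout; the repo's actual bit assignments may differ.
DQ_MISSING_REQUIRED  = 1 << 0
DQ_INVALID_RIGHT     = 1 << 1
DQ_CROSSED_MARKET    = 1 << 2
DQ_NEGATIVE_ACTIVITY = 1 << 3
DQ_NAN               = 1 << 4
DQ_IV_OUT_OF_BOUNDS  = 1 << 5

def dq_flags(row: dict) -> int:
    """Return a bitmask of data-quality violations for one quote row."""
    flags = 0
    if row.get("symbol") is None or row.get("expiry") is None:
        flags |= DQ_MISSING_REQUIRED
    if row.get("right") not in ("C", "P"):
        flags |= DQ_INVALID_RIGHT
    bid, ask = row.get("bid"), row.get("ask")
    if bid is not None and ask is not None and ask < bid:
        flags |= DQ_CROSSED_MARKET
    if (row.get("oi") or 0) < 0 or (row.get("volume") or 0) < 0:
        flags |= DQ_NEGATIVE_ACTIVITY
    if any(isinstance(v, float) and math.isnan(v) for v in row.values()):
        flags |= DQ_NAN
    iv = row.get("iv")
    if iv is not None and not (0.0 < iv < 5.0):  # iv stored as decimal, e.g. 0.15
        flags |= DQ_IV_OUT_OF_BOUNDS
    return flags

row = {"symbol": "NIFTY", "expiry": "2026-02-05", "right": "C",
       "bid": 105.0, "ask": 100.0, "oi": 1200, "volume": 300, "iv": 0.18}
print(dq_flags(row) == DQ_CROSSED_MARKET)  # crossed market only → True
```

Keeping the flags as a bitmask lets one integer column record every violation per row, so the quarantine path can preserve the full diagnosis instead of just a pass/fail bit.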
Free / official
- NSE FO bhavcopy (EOD): reliable for historical OI/volume/close by contract, but no bid/ask and no IV.
Collector (build your own history going forward)
- NSE option-chain endpoint: good for intraday snapshots, but can be blocked/unstable; treat as best-effort collection, not “institutional history”.
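For a best-effort collector, retries with exponential backoff plus an always-written snapshot-log row keep failures visible rather than silent. A sketch under those assumptions (fetch and log_snapshot are hypothetical callables, not kaira APIs):

```python
import random
import time

def backoff_s(attempt: int, base_s: float = 1.0, cap_s: float = 60.0,
              jitter: float = 0.1) -> float:
    """Exponential backoff with +/-10% jitter, capped at cap_s seconds."""
    delay = min(cap_s, base_s * (2 ** attempt))
    return delay * (1 + random.uniform(-jitter, jitter))

def poll_once(fetch, log_snapshot, max_retries: int = 3):
    """Best-effort poll: retry transient failures, but log every attempt so
    gaps remain detectable in the snapshot log."""
    for attempt in range(max_retries):
        t0 = time.monotonic()
        try:
            payload = fetch()
        except Exception as exc:
            log_snapshot(status="error", latency_s=time.monotonic() - t0,
                         records=0, detail=str(exc))
            time.sleep(backoff_s(attempt))
            continue
        log_snapshot(status="ok", latency_s=time.monotonic() - t0,
                     records=len(payload))
        return payload
    return None  # exhausted retries; the error rows already document the gap
```

The design choice worth copying is that the log write happens on both paths: a failed poll still produces a snapshot-log row, which is what makes downstream gap detection possible.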
Commercial vendors (recommended for full historical intraday option chain)
You typically need a paid vendor for historical intraday option chain with bid/ask + IV. When evaluating vendors, confirm:
- true tick vs 1s/1m sampling
- full depth vs top-of-book
- corporate action / symbol-change handling
- exchange timestamp vs vendor timestamp
- survivorship bias in instrument master
- replays, corrections, and how they publish late/corrected data
Create an env and install:

py -m venv .venv
.\.venv\Scripts\Activate.ps1
py -m pip install -e .

Collect live option-chain snapshots (creates forward history):

py -m kaira collect nse-live --symbols NIFTY BANKNIFTY --interval-s 2 --duration-s 600

Backfill EOD FO bhavcopy (official, free):

py -m kaira backfill nse-bhavcopy --start 2024-01-01 --end 2024-03-31 --symbols NIFTY BANKNIFTY

Compact partitions (coalesce small files; de-dup by latest ingest_ts):

py -m kaira maint compact-option-quotes --dataset-dir data/silver/option_quotes --min-files 10

Use kaira.query.read_option_quotes_arrow() for fast predicate-pushdown reads:
from datetime import date
from kaira.query import read_option_quotes_arrow
from kaira.query.duckdb_reader import OptionQuoteQuery

t = read_option_quotes_arrow(
    "data/silver/option_quotes",
    query=OptionQuoteQuery(
        symbol="NIFTY",
        expiries=[date(2026, 2, 5)],
        trade_date_start=date(2026, 2, 1),
        trade_date_end=date(2026, 2, 4),
    ),
    columns=["ts", "expiry", "strike", "right", "bid", "ask", "iv", "oi", "volume"],
)