Validate datasets against data contracts to ensure schema compliance, data quality, and distribution health. DataPact supports its own YAML contract format, ODCS v3.1.0, and Pact API contracts, with a provider architecture and a CLI designed for CI/CD pipelines.
- Schema Validation: Check columns, types, and required fields
- Quality Rules: Validate nulls, uniqueness, ranges, regex patterns, and enums
- Rule Severity: Mark rules as WARN or ERROR, with CLI overrides
- Schema Drift: Control extra column handling with WARN/ERROR policies
- Distribution Monitoring: Detect drift in numeric column statistics
- Profiling: Auto-generate rule baselines from data
- SLA Checks: Enforce row count and freshness constraints
- Big Data Support: Chunked validation with optional sampling
- Custom Rule Plugins: Load rule logic from plugin modules
- Policy Packs: Apply reusable rule bundles by name
- Contract Versioning: Track contract evolution with automatic migration
- Multiple Formats: Support CSV, Parquet, and JSON Lines
- Database Sources: Validate Postgres, MySQL, and SQLite tables
- ODCS Support: Validate Open Data Contract Standard v3.1.0 contracts
- API Pact Support: Infer DataPact contracts from Pact API contracts via type inference
- Contract Providers: Load DataPact YAML, ODCS, or Pact JSON contracts via provider dispatch
- Normalization Scaffold: Contract-aware normalization (flatten config; a no-op unless enabled)
- CI/CD Ready: Exit codes for automation pipelines
- Detailed Reporting: JSON reports with machine-readable errors
- Report Sinks: Send reports to files, stdout, or webhooks
See FEATURES.md for a functional feature list with compact examples.
pip install -e .

Note: pact-python is included as a base dependency so DataPact can ingest Pact JSON contracts for schema inference.
Optional database drivers:
pip install -e ".[db]"

Create customer_contract.yaml:
contract:
name: customer_data
version: 2.0.0
dataset:
name: customers
fields:
- name: customer_id
type: integer
required: true
rules:
unique: true
- name: email
type: string
required: true
rules:
regex: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
unique: true
- name: age
type: integer
rules:
min: 0
max: 150
- name: status
type: string
rules:
enum: [active, inactive, suspended]
- name: score
type: float
distribution:
mean: 50.0
std: 15.0
max_drift_pct: 10.0

Validate a CSV file:

datapact validate --contract customer_contract.yaml --data customers.csv

Validate a database table:
datapact validate \
--contract customer_contract.yaml \
--db-type postgres \
--db-host localhost \
--db-port 5432 \
--db-user app \
--db-password secret \
--db-name appdb \
--db-table customers

Validate an ODCS contract:
datapact validate \
--contract my_contract.odcs.yaml \
--contract-format odcs \
--odcs-object customers \
--data customers.csv

Validate a Pact API contract (schema inferred from Pact JSON):
datapact validate \
--contract pact_user_api.json \
--contract-format pact \
--data api_response.json

Type inference happens automatically. Add quality/distribution rules manually to the inferred contract if needed.
Generate a starter contract from data:

datapact init --contract new_contract.yaml --data data.csv

Profile data to generate rule baselines:

datapact profile --contract new_profile.yaml --data data.csv

DataPact uses Pact contracts as an input format for schema inference. It consumes Pact JSON contracts commonly produced by pact-python, then maps the response body examples to DataPact fields.

What DataPact reads from a Pact contract:
- Pact JSON contract format: Reads Pact JSON files as the source of truth
- Consumer/Provider metadata: Uses `consumer.name` and `provider.name` to build a DataPact contract name
- Interactions array: Requires Pact interactions to locate an API response
- Response body examples: Infers field names and types from `interactions[0].response.body`
- Type mapping: Maps JSON primitives to DataPact types (int → integer, float → float, bool → boolean, str → string)

Pact features that DataPact does not use:
- Mock server and stubs: DataPact validates from files, not live servers
- Consumer-driven test execution: DataPact is a validation tool, not a testing framework
- Provider verification: No provider verification against a running service
- Pact Broker integration: Only local Pact JSON files are supported
- Matching rules and generators: Matchers are not evaluated; only example values are used
- Message Pacts: Only REST API response bodies are supported
- CLI tooling for Pact: DataPact does not invoke pact-python CLI helpers
Pact contract snippet:
{
"consumer": {"name": "web-frontend"},
"provider": {"name": "user-api"},
"interactions": [
{
"response": {
"status": 200,
"body": {
"id": 123,
"name": "Alice Smith",
"email": "[email protected]",
"age": 30,
"active": true
}
}
}
]
}

Inferred DataPact fields:
fields:
- name: id
type: integer
required: false
- name: name
type: string
required: false
- name: email
type: string
required: false
- name: age
type: integer
required: false
- name: active
type: boolean
required: false

Pact does not define quality or distribution rules. Add those rules manually in DataPact YAML:
fields:
- name: id
type: integer
rules:
unique: true
- name: email
type: string
rules:
regex: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
- name: age
type: integer
rules:
min: 0
max: 150

CLI usage:

datapact validate --contract <path/to/contract.yaml> --data <path/to/data> [--format auto|csv|parquet|jsonl] [--output-dir ./reports]

Options:
- `--contract`: Path to contract file (required). Supports `.yaml` (DataPact/ODCS) or `.json` (Pact)
- `--contract-format`: Contract format (auto, datapact, odcs, pact). Default: auto
- `--odcs-object`: ODCS schema object name or id (required if the contract has multiple objects)
- `--data`: Path to data file (required)
- `--format`: Data format. Default: auto-detect from file extension
- `--output-dir`: Directory for JSON report. Default: ./reports
- `--db-type`: Database type (postgres, mysql, sqlite)
- `--db-host`: Database host (RDBMS only)
- `--db-port`: Database port
- `--db-user`: Database user (RDBMS only)
- `--db-password`: Database password (RDBMS only)
- `--db-name`: Database name (RDBMS only)
- `--db-table`: Database table to read
- `--db-query`: SQL query to read (overrides table)
- `--db-path`: SQLite database file path
- `--db-connect-timeout`: DB connection timeout in seconds
- `--db-chunksize`: Chunk size for DB streaming validation
- `--report-sink`: Report sink (file, stdout, webhook). Repeatable
- `--report-webhook-url`: Webhook URL for the webhook report sink
- `--report-webhook-header`: Webhook header (Key: Value). Repeatable
- `--report-webhook-timeout`: Webhook timeout in seconds
- `--severity-override`: Override rule severity (format: field.rule=warn)
- `--chunksize`: Stream validation in chunks (CSV/JSONL)
- `--sample-rows`: Sample N rows for validation
- `--sample-frac`: Sample fraction for validation
- `--sample-seed`: Random seed for sampling
- `--plugin`: Plugin module path for custom rules (repeatable)
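For example, a hypothetical invocation that validates a SQLite table, downgrades one rule to a warning at runtime, and also prints the report to stdout (the database path, table, and field/rule pair are illustrative):

```bash
# Validate a SQLite table against the contract, override one rule's
# severity for this run, and emit the JSON report to stdout as well.
datapact validate \
  --contract customer_contract.yaml \
  --db-type sqlite \
  --db-path ./app.db \
  --db-table customers \
  --severity-override email.regex=warn \
  --report-sink stdout
```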
Exit Codes:
- 0: Validation passed
- 1: Validation failed
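Since the exit code encodes the result, a CI step can gate on it directly; a minimal shell sketch (paths are illustrative):

```bash
# Fail the pipeline step when the contract check fails (exit code 1).
if ! datapact validate --contract contracts/customer_contract.yaml --data exports/customers.csv; then
  echo "Data contract validation failed; see ./reports for details"
  exit 1
fi
```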
datapact init --contract <path/to/output.yaml> --data <path/to/data>

Infers a starter contract from a dataset (columns and types only).
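A typical workflow, sketched with illustrative file names: infer a starter contract, hand-edit it to add rules, then validate against it.

```bash
# Infer columns and types into a starter contract ...
datapact init --contract starter_contract.yaml --data customers.csv
# ... edit starter_contract.yaml to add quality/distribution rules, then validate
datapact validate --contract starter_contract.yaml --data customers.csv
```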
datapact profile --contract <path/to/output.yaml> --data <path/to/data>

Generates rule baselines from the data (enums, ranges, null ratios, distributions).

Options:
- `--max-enum-size`: Max enum size for profiling (default: 20)
- `--max-enum-ratio`: Max enum ratio for profiling (default: 0.2)
- `--unique-threshold`: Unique ratio threshold (default: 0.99)
- `--null-ratio-buffer`: Buffer added to observed null ratio (default: 0.01)
- `--range-buffer-pct`: Buffer added to min/max (default: 0.05)
- `--max-drift-pct`: Drift threshold for distributions (default: 10.0)
- `--max-z-score`: Outlier z-score threshold (default: 3.0)
- `--no-distribution`: Disable distribution profiling
- `--no-date-regex`: Disable date regex inference
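For example, a hypothetical profiling run that tolerates larger enums and skips distribution baselines:

```bash
# Profile the data into a contract, allowing enums of up to 50 values
# and skipping mean/std distribution baselines.
datapact profile \
  --contract profiled_contract.yaml \
  --data customers.csv \
  --max-enum-size 50 \
  --no-distribution
```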
Field types supported in contracts:
- `integer`: int32, int64
- `float`: float32, float64
- `string`: text/object columns
- `boolean`: bool

Quality rules:
- `not_null`: Required, no nulls allowed
- `unique`: All values must be unique
- `min`: Minimum numeric value
- `max`: Maximum numeric value
- `regex`: Regex pattern match
- `enum`: Value must be in list
- `max_null_ratio`: Tolerate nulls up to the given ratio (0.0 to 1.0)
- `freshness_max_age_hours`: Max age in hours for timestamp fields
Rules can include severity metadata:
rules:
not_null:
value: true
severity: WARN
max:
value: 100
severity: ERROR

Distribution rules:

- `mean`: Expected mean for numeric column
- `std`: Expected standard deviation
- `max_drift_pct`: Alert if mean/std changes by >X%
- `max_z_score`: Flag outliers with |z-score| > threshold

Schema drift policy:
schema:
extra_columns:
severity: WARN

Normalization (flatten) scaffold:

flatten:
enabled: false
separator: "."

Policy packs:

policies:
- name: pii_basic
overrides:
fields:
phone:
rules:
regex:
value: '^\\+1[0-9]{10}$'
severity: WARN

SLA and freshness checks:

sla:
min_rows: 100
max_rows:
value: 100000
severity: WARN
fields:
- name: event_time
type: string
rules:
freshness_max_age_hours: 24

Big data validation (chunking and sampling):

datapact validate --contract contract.yaml --data data.csv --chunksize 50000
datapact validate --contract contract.yaml --data data.csv --sample-rows 10000

Chunked validation is supported for CSV and JSONL inputs.
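Beyond `--sample-rows`, the `--sample-frac` and `--sample-seed` options (see the CLI reference above) allow reproducible fractional sampling; an illustrative combination with chunked streaming over a JSON Lines file:

```bash
# Stream a large JSONL file in chunks and validate a reproducible 5% sample.
datapact validate \
  --contract contract.yaml \
  --data events.jsonl \
  --format jsonl \
  --chunksize 100000 \
  --sample-frac 0.05 \
  --sample-seed 42
```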
Custom rule plugins:

fields:
- name: score
type: float
rules:
custom:
field_max_value:
value: 100
severity: WARN
custom_rules:
- name: dataset_min_rows
config:
value: 1000
severity: ERROR

datapact validate --contract contract.yaml --data data.csv --plugin mypkg.rules

Custom rules run on full data; in streaming mode they run only when sampling is enabled.
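To illustrate the note above, a sketch that streams a large CSV in chunks while enabling sampling so plugin-provided custom rules still run (module and file names are illustrative):

```bash
# Chunked streaming plus sampling: sampling is what allows custom rules
# to execute in streaming mode, per the note above.
datapact validate \
  --contract contract.yaml \
  --data big_data.csv \
  --chunksize 50000 \
  --sample-rows 10000 \
  --plugin mypkg.rules \
  --plugin otherpkg.extra_rules
```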
JSON reports are saved to ./reports/<timestamp>.json:
{
"passed": false,
"contract": {
"name": "customer_data",
"version": "2.0.0"
},
"dataset": {
"name": "customers"
},
"metadata": {
"timestamp": "2026-02-13T10:30:45.123456",
"tool_version": "2.0.0"
},
"summary": {
"error_count": 2,
"warning_count": 1
},
"errors": [
{
"code": "QUALITY",
"field": "email",
"message": "has 1 values not matching regex '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'",
"severity": "ERROR"
}
]
}

For scenario coverage details, see the Banking & Finance Test Cases section below.
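Reports can also be routed to sinks in addition to the default file output. A hypothetical invocation that prints the report and sends it to a webhook, using the report-sink flags from the CLI reference (URL and header values are placeholders):

```bash
# Emit the JSON report to stdout and send it to a webhook endpoint.
datapact validate \
  --contract contract.yaml \
  --data data.csv \
  --report-sink stdout \
  --report-sink webhook \
  --report-webhook-url https://example.com/hooks/datapact \
  --report-webhook-header "Authorization: Bearer <token>" \
  --report-webhook-timeout 10
```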
# Run tests
pytest
# Enable MySQL-backed DB source tests
export DATAPACT_MYSQL_TESTS=1
export DATAPACT_MYSQL_PASSWORD=<your-mysql-password>
export DATAPACT_MYSQL_HOST=127.0.0.1
export DATAPACT_MYSQL_PORT=3306
export DATAPACT_MYSQL_USER=root
export DATAPACT_MYSQL_DB=datapact_test
export DATAPACT_MYSQL_TABLE=customers
pytest tests/test_db_source.py -v
# With coverage
pytest --cov=src/datapact
# Coverage check with total percent
datapact-coverage --min 80

Dependencies are documented in DEPENDENCIES.md.
# Install with dev dependencies
pip install -e ".[dev]"
# Format code
black src/ tests/
# Lint
ruff check src/ tests/
# Type check
mypy src/

Project structure:

src/datapact/
├── __init__.py # Package exports
├── contracts.py # Contract parsing (YAML → dataclass models)
├── datasource.py # Data loading and schema inference
├── cli.py # CLI entry point
├── reporting.py # Report generation and serialization
├── versioning.py # Version management and migration
└── validators/
├── __init__.py
├── schema_validator.py # Column/type/required checks
├── quality_validator.py # Null/unique/range/regex/enum checks
└── distribution_validator.py # Mean/std drift detection
tests/
├── test_validator.py # Core validator tests
├── test_versioning.py # Version feature tests
├── test_banking_finance.py # Banking/finance scenarios
├── test_concurrency.py # Concurrency validation
├── test_concurrency_mp.py # Multiprocessing concurrency
└── fixtures/ # Sample contracts and data
The validator supports multiple contract versions with automatic migration and compatibility checking:
- Current Version: 2.0.0
- Supported Versions: 1.0.0, 1.1.0, 2.0.0
- Auto-Migration: Old contracts automatically upgrade to the latest version
- Breaking Changes: Tracked and reported in validation output
See docs/VERSIONING.md for detailed version history, migration guide, and breaking changes.
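Because migration is automatic, validating with an older contract needs no extra flags; a sketch with an illustrative file name:

```bash
# The contract file declares version 1.0.0; DataPact upgrades it to the
# current contract version automatically during validation.
datapact validate --contract legacy_v1_contract.yaml --data data.csv
```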
- docs/EXAMPLES.md — Comprehensive examples for all providers and features (YAML, ODCS, API Pact, quality rules, distributions, custom rules, report sinks, etc.)
- docs/ARCHITECTURE.md — System architecture and design patterns
- FEATURES.md — Feature checklist with compact examples
- CONTRIBUTING.md — Developer guide including provider pattern
License: MIT
The test suite covers multi-table data products for commercial banking and institutional finance, with deposits and lending modeled as accounts/loans plus transactions/payments. It also reflects consumer-specific contract needs (strict vs aggregate) to validate schema and quality expectations across different consumption patterns.
- PositiveCases: Valid data rows that should pass all schema and quality checks. These represent typical, correct records for deposits and lending products.
- NegativeCases: Rows intentionally containing errors (e.g., missing required fields, invalid dates, negative balances, out-of-range values, or type mismatches). These ensure the validator catches real-world data quality issues.
- BoundaryCases: Edge-case rows that test the limits of contract rules (e.g., zero balances, maximum allowed values, dates at the edge of valid ranges). These confirm the validator's correct handling of contract boundaries.
- Accounts (strict): Unique, non-null customer_id and account_id; valid product/status enums; balances within allowed range.
- Accounts (aggregate): customer_id may be 1% null with 99% uniqueness, while other fields remain strict.
- Transactions: Valid txn_type/channel enums, valid dates, and amounts within limits (including withdrawals/fees).
- Loans (strict): Non-null loan_id and customer_id, valid product/status enums, balances within limits, rates in [0, 0.25].
- Loans (aggregate): customer_id may be 1% null with 99% uniqueness, other fields remain strict.
- Payments: Valid payment_status enums, non-negative amounts, and valid dates.
Test cases are tagged using @pytest.mark.PositiveCases, @pytest.mark.NegativeCases, and @pytest.mark.BoundaryCases for easy filtering and reporting. See tests/test_banking_finance.py for implementation details and tests/fixtures/ for sample data and contracts.
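Since the cases carry standard pytest marks, marker filtering works as usual; for example:

```bash
# Run only the negative (error-injecting) banking/finance cases
pytest tests/test_banking_finance.py -m NegativeCases -v

# Run boundary cases across the whole suite
pytest -m BoundaryCases
```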