dataprof


A fast, reliable data quality assessment tool built in Rust. Analyze datasets with 20x better memory efficiency than pandas, unlimited file streaming, and comprehensive ISO 8000/25012 compliant quality checks across 5 dimensions: Completeness, Consistency, Uniqueness, Accuracy, and Timeliness. Full Python bindings and production database connectivity included.

Perfect for data scientists, engineers, analysts, and anyone working with data who needs quick, reliable quality insights.

Privacy & Transparency

DataProf processes all data locally on your machine. Zero telemetry, zero external data transmission.

Read exactly what DataProf analyzes →

  • 100% local processing - your data never leaves your machine
  • No telemetry or tracking
  • Open source & fully auditable
  • Read-only database access (when using DB features)

Complete transparency: Every metric, calculation, and data point is documented with source code references for independent verification.

CI/CD Integration

Automate data quality checks in your workflows with our GitHub Action:

- name: DataProf Quality Check
  uses: AndreaBozzo/dataprof-actions@v1
  with:
    file: 'data/dataset.csv'
    quality-threshold: 80
    fail-on-issues: true
    # Batch mode (NEW)
    recursive: true
    output-html: 'quality-report.html'

Get the Action →

  • Zero setup - works out of the box
  • ISO 8000/25012 compliant - industry-standard quality metrics
  • Batch processing - analyze entire directories recursively
  • Flexible - customizable thresholds and output formats
  • Fast - typically completes in under 2 minutes

Perfect for ensuring data quality in pipelines, validating data integrity, or generating automated quality reports. Updated to the latest release.

Quick Start

CLI (Recommended - Full Features)

Installation: Download pre-built binaries from Releases or build from source with cargo install dataprof.

Note: After building with cargo build --release, the binary is located at target/release/dataprof-cli.exe (Windows) or target/release/dataprof (Linux/Mac). Run it from the project root as target/release/dataprof-cli.exe <command> or add it to your PATH.

Basic Analysis

# Comprehensive quality analysis
dataprof analyze data.csv --detailed

# Analyze Parquet files (requires --features parquet)
dataprof analyze data.parquet --detailed

# Windows example (from project root after cargo build --release)
target\release\dataprof-cli.exe analyze data.csv --detailed

HTML Reports

# Generate HTML report with visualizations
dataprof report data.csv -o quality_report.html

# Custom template
dataprof report data.csv --template custom.hbs --detailed

Batch Processing

# Process entire directory with parallel execution
dataprof batch /data/folder --recursive --parallel

# Generate HTML batch dashboard
dataprof batch /data/folder --recursive --html batch_report.html

# JSON export for CI/CD automation
dataprof batch /data/folder --json batch_results.json --recursive

# JSON output to stdout
dataprof batch /data/folder --format json --recursive

# With custom filter and progress
dataprof batch /data/folder --filter "*.csv" --parallel --progress

(Screenshot: DataProf batch report dashboard)

Database Analysis

# PostgreSQL table profiling
dataprof database postgres://user:pass@host/db --table users

# Custom SQL query
dataprof database sqlite://data.db --query "SELECT * FROM users WHERE active=1"

Benchmarking

# Benchmark different engines on your data
dataprof benchmark data.csv

# Show engine information
dataprof benchmark --info

Advanced Options

# Streaming for large files
dataprof analyze large_dataset.csv --streaming --sample 10000

# JSON output for programmatic use
dataprof analyze data.csv --format json --output results.json

# Custom ISO threshold profile
dataprof analyze data.csv --threshold-profile strict

Quick Reference: All commands follow the pattern dataprof <command> [args]. Use dataprof help or dataprof <command> --help for detailed options.

Python Bindings

pip install dataprof

import dataprof

# Comprehensive quality analysis (ISO 8000/25012 compliant)
report = dataprof.analyze_csv_with_quality("data.csv")
print(f"Quality score: {report.quality_score():.1f}%")

# Access individual quality dimensions
metrics = report.data_quality_metrics
print(f"Completeness: {metrics.complete_records_ratio:.1f}%")
print(f"Consistency: {metrics.data_type_consistency:.1f}%")
print(f"Uniqueness: {metrics.key_uniqueness:.1f}%")

# Batch processing
result = dataprof.batch_analyze_directory("/data", recursive=True)
print(f"Processed {result.processed_files} files at {result.files_per_second:.1f} files/sec")

Note: Database profiling is available via CLI only. Python users can export SQL results to CSV and use analyze_csv_with_quality().
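
As a rough sketch of that workaround, assuming a local SQLite database (the query, file paths, and table are illustrative):

import csv
import sqlite3

import dataprof

# Export the query results to CSV, then run the same quality analysis on the export
conn = sqlite3.connect("data.db")
cursor = conn.execute("SELECT * FROM users WHERE active = 1")

with open("users_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row from column names
    writer.writerows(cursor)

conn.close()

report = dataprof.analyze_csv_with_quality("users_export.csv")
print(f"Quality score: {report.quality_score():.1f}%")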

Full Python API Documentation →

Rust Library

cargo add dataprof

use dataprof::*;

// High-performance Arrow processing for large files (>100MB)
// Requires compilation with: cargo build --features arrow
#[cfg(feature = "arrow")]
let profiler = DataProfiler::columnar();
#[cfg(feature = "arrow")]
let report = profiler.analyze_csv_file("large_dataset.csv")?;

// Standard adaptive profiling (recommended for most use cases)
let profiler = DataProfiler::auto();
let report = profiler.analyze_file("dataset.csv")?;

Development

Want to contribute or build from source? Here's what you need:

Prerequisites

  • Rust (latest stable via rustup)
  • Docker (for database testing)

Quick Setup

git clone https://github.com/AndreaBozzo/dataprof.git
cd dataprof
cargo build --release  # Build the project
docker-compose -f .devcontainer/docker-compose.yml up -d  # Start test databases

Feature Flags

dataprof uses optional features to keep compile times fast and binaries lean:

# Minimal build (CSV/JSON only, ~60s compile)
cargo build --release

# With Apache Arrow (columnar processing, ~90s compile)
cargo build --release --features arrow

# With Parquet support (requires arrow, ~95s compile)
cargo build --release --features parquet

# With database connectors
cargo build --release --features postgres,mysql,sqlite

# All features (full functionality, ~130s compile)
cargo build --release --all-features

When to use Arrow?

  • ✅ Files > 100MB with many columns (>20)
  • ✅ Columnar data with uniform types
  • ✅ Need maximum throughput (up to 13x faster)
  • ❌ Small files (<10MB) - standard engine is faster
  • ❌ Mixed/messy data - the streaming engine handles it better

When to use Parquet?

  • ✅ Analytics workloads with columnar data
  • ✅ Data lake architectures
  • ✅ Integration with Spark, Pandas, PyArrow
  • ✅ Efficient storage and compression
  • ✅ Type-safe schema preservation

Common Development Tasks

cargo test          # Run all tests
cargo bench         # Performance benchmarks
cargo fmt           # Format code
cargo clippy        # Code quality checks

Documentation

  • Privacy & Transparency
  • User Guides
  • Developer Guides

License

Licensed under the MIT License. See LICENSE for details.