#data-fusion #pruning #parquet #metadata #arrow

aisle

Metadata-driven Parquet pruning for Rust: Skip irrelevant data before reading

2 unstable releases

Uses new Rust 2024

0.2.0 Dec 30, 2025
0.1.0 Apr 3, 2025

#277 in Data structures

Apache-2.0

195KB
3.5K SLoC

Aisle

Metadata-driven Parquet pruning for Rust: Skip irrelevant data before reading

Aisle evaluates DataFusion predicates against Parquet metadata (row-group statistics, page indexes, bloom filters) to determine which data to skip, dramatically reducing I/O for selective queries without modifying the upstream parquet crate.

📖 Read the full documentation on docs.rs

Why Aisle?

The Problem: Parquet readers typically apply filters after reading data, wasting I/O on irrelevant row groups and pages.

The Solution: Aisle evaluates your predicates against metadata before reading:

  • Row-group pruning using min/max statistics
  • Page-level pruning using column/offset indexes
  • Bloom filter checks for definite absence (high-cardinality columns)

The Result: Typically 70-99% I/O reduction for selective queries, with zero changes to the Parquet format.

Quick Start

use aisle::PruneRequest;
use datafusion_expr::{col, lit};
use parquet::arrow::parquet_to_arrow_schema;
use parquet::file::metadata::ParquetMetaDataReader;
use parquet::arrow::ParquetRecordBatchReaderBuilder;

// 1. Load metadata (without reading data)
let metadata = ParquetMetaDataReader::new()
    .parse_and_finish(&parquet_bytes)?;

// Derive the Arrow schema used for predicate compilation
let schema = parquet_to_arrow_schema(metadata.file_metadata().schema_descr(), None)?;

// 2. Define your filter using DataFusion expressions
let predicate = col("user_id").gt_eq(lit(1000i64))
    .and(col("age").lt(lit(30i64)));

// 3. Prune row groups
let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&predicate)
    .enable_page_index(false)     // Row-group level only
    .enable_bloom_filter(false)   // No bloom filters
    .prune();

println!("Pruned {} of {} row groups ({}% I/O reduction)",
    metadata.num_row_groups() - result.row_groups().len(),
    metadata.num_row_groups(),
    ((metadata.num_row_groups() - result.row_groups().len()) * 100
        / metadata.num_row_groups())
);

// 4. Apply pruning to Parquet reader
let reader = ParquetRecordBatchReaderBuilder::try_new(parquet_bytes)?
    .with_row_groups(result.row_groups().to_vec())  // Skip irrelevant row groups!
    .build()?;

// Read only the relevant data
for batch in reader {
    // Process matching rows...
}

Add to your Cargo.toml:

[dependencies]
aisle = "0.1"
datafusion-expr = "43"
parquet = "57"
arrow-schema = "57"

Key Features

  • Row-group pruning: Skip entire row groups using min/max statistics
  • Page-level pruning: Skip individual pages within row groups
  • Bloom filter support: Definite absence checks for point queries (=, IN)
  • DataFusion expressions: Use familiar col("x").eq(lit(42)) syntax
  • Conservative evaluation: Never skips data that might match (safety first; see the sketch after this list)
  • Async-first API: Optimized for remote storage (S3, GCS, Azure)
  • Non-invasive: Works with upstream parquet crate, no format changes
  • Best-effort compilation: Prunes with the supported parts of a predicate even when other parts fail to compile
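
A minimal sketch of the conservative contract, assuming a column (payload, hypothetical here) whose min/max statistics were not written: any predicate Aisle cannot decide evaluates to Unknown, and Unknown never causes a skip.

use aisle::PruneRequest;
use datafusion_expr::{col, lit};

// Hypothetical: `payload` has no min/max statistics, so Aisle cannot
// prove any row group is free of matches.
let predicate = col("payload").eq(lit("needle"));

let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&predicate)
    .prune();

// Unknown means "might match": every row group is kept rather than
// risking an incorrect skip.
assert_eq!(result.row_groups().len(), metadata.num_row_groups());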

How It Works

┌─────────────────────────────────────────────────────┐
│                  Your Query                         │
│   WHERE user_id >= 1000 AND age < 30                │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│              Aisle Compiler                         │
│   Converts DataFusion Expr -> Pruning IR            │
│   (supports =, !=, <, >, <=, >=, BETWEEN, IN,       │
│    IS NULL, LIKE 'prefix%', AND, OR, NOT, CAST)     │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│          Metadata Evaluation                        │
│  • Row-group statistics (min/max, null_count)       │
│  • Page indexes (page-level min/max)                │
│  • Bloom filters (definite absence checks)          │
│  • Tri-state logic (True/False/Unknown)             │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│              Pruning Result                         │
│   row_groups: [2, 5, 7]  <- Only these needed!      │
│   row_selection: Some(...) <- Page-level selection  │
│   compile_result: Unsupported predicates logged     │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│          Parquet Reader                             │
│   .with_row_groups([2, 5, 7])                       │
│   .with_row_selection(...)                          │
│   I/O reduced by 70%! ⚡                             │
└─────────────────────────────────────────────────────┘
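
The tri-state logic above is the safety core: a predicate evaluated against statistics is provably true, provably false, or undecidable, and only a provable False permits a skip. An illustrative sketch (not Aisle's internal types):

// Illustrative three-valued truth for metadata evaluation.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tri { True, False, Unknown }

impl Tri {
    // AND: False dominates; any remaining Unknown poisons the result.
    fn and(self, other: Tri) -> Tri {
        match (self, other) {
            (Tri::False, _) | (_, Tri::False) => Tri::False,
            (Tri::True, Tri::True) => Tri::True,
            _ => Tri::Unknown,
        }
    }

    // OR: True dominates; any remaining Unknown poisons the result.
    fn or(self, other: Tri) -> Tri {
        match (self, other) {
            (Tri::True, _) | (_, Tri::True) => Tri::True,
            (Tri::False, Tri::False) => Tri::False,
            _ => Tri::Unknown,
        }
    }
}

// A row group is skipped only on False; Unknown is treated as "might match".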

What's Supported

Predicates

Type            Example                                         Row-Group             Page-Level          Bloom Filter
Equality        col("x").eq(lit(42))                            ✓                     ✓                   ✓
Inequality      col("x").not_eq(lit(42))                        ✓                     ✓                   ✗
Comparisons     col("x").lt(lit(100))                           ✓                     ✓                   ✗
Range           col("x").between(lit(10), lit(20))              ✓                     ✓                   ✗
Set membership  col("x").in_list(vec![lit(1), lit(2)], false)   ✓                     ✓                   ✓
Null checks     col("x").is_null()                              ✓                     ✓                   ✗
String prefix   col("name").like(lit("prefix%"))                ✓                     ✓                   ✗
Logical AND     col("x").gt(lit(10)).and(col("y").lt(lit(5)))   ✓                     ✓ (best-effort)
Logical OR      col("x").eq(lit(1)).or(col("x").eq(lit(2)))     ✓                     ✓ (all-or-nothing)
Logical NOT     col("x").gt(lit(50)).not()                      ✓                     ✓ (exact only)
Type casting    cast(col("x"), DataType::Int64).eq(lit(100))    ✓ (no-op casts only)
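
Every comparison row above reduces to an interval check against min/max statistics. An illustrative sketch (not Aisle's code) of the decision for col("x").lt(lit(100)):

// A row group can be skipped only when statistics *prove* no row matches.
fn can_skip_lt(min: Option<i64>, limit: i64) -> bool {
    match min {
        Some(min) => min >= limit, // every value is >= min, so none is < limit
        None => false,             // missing stats: conservatively keep the group
    }
}

assert!(can_skip_lt(Some(150), 100));  // min is 150: nothing can be < 100, skip
assert!(!can_skip_lt(None, 100));      // no statistics: the group must be read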

Data Types

Currently supported leaf types for statistics-based pruning:

  • Integers: Int8/16/32/64, UInt8/16/32/64
  • Floats: Float32/Float64
  • Boolean
  • Strings: Utf8, LargeUtf8, Utf8View
  • Binary: Binary, LargeBinary, BinaryView, FixedSizeBinary

Not yet supported (treated conservatively as "unknown"):

  • Temporal logical types (Date32/Date64, Timestamp)
  • Decimals (Decimal128/Decimal256)
  • Interval/Duration and other complex logical types

Metadata Sources

Source                Row-Group    Page-Level         Point Queries
Statistics (min/max)  ✓ Always     ✓ Via page index   Range queries
Null count            ✓ Always     ✓ Via page index   IS NULL checks
Bloom filters         ✓ Optional   ✗ Not applicable   = and IN
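
The "definite absence" semantics in the bloom filter row can be sketched with the parquet crate's Sbbf type; the helper below is illustrative, not part of Aisle:

use parquet::bloom_filter::Sbbf;

// Sbbf::check returns false only when the value is definitely absent;
// true means "maybe present", so the row group must still be read.
fn bloom_can_skip(filter: &Sbbf, user_id: i64) -> bool {
    !filter.check(&user_id)
}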

Known Limitations

  • Type coverage is partial: Only the leaf types listed above are supported for stats-based pruning; temporal/logical types and decimals are currently conservative.

  • Byte array ordering requires column metadata: For ordering predicates (<, >, <=, >=) on Binary/Utf8 columns:

    • Default (conservative): Requires TYPE_DEFINED_ORDER(UNSIGNED) column order AND exact (non-truncated) min/max statistics
    • Opt-in (aggressive): Use .allow_truncated_byte_array_ordering(true) to allow truncated statistics, but be aware this may cause false negatives if truncation changes ordering semantics
    • Equality predicates (=, !=, IN) always work regardless of truncation

  • No non-trivial column casts: Only no-op column casts are allowed; literal casts happen at compile time.

  • Page-level NOT is conservative: NOT is only inverted when the inner page selection is exact; otherwise it falls back to row-group evaluation.

  • OR requires full support: If any OR branch is unsupported at page level, page pruning is disabled for the whole OR.

  • LIKE support is limited: Only prefix patterns ('prefix%') are pushed down.
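
For example, only the first of these patterns compiles to a prunable prefix check; the second is conservatively treated as Unknown:

let prunable = col("name").like(lit("abc%"));  // literal prefix "abc": pushed down
let kept = col("name").like(lit("%abc"));      // suffix match: kept conservatively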

Usage Examples

Page-level pruning:

let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&predicate)
    .enable_page_index(true)
    .prune();
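
Both outputs are then handed to the reader; the row_selection() accessor below is assumed, mirroring the row_selection field shown in the diagram:

let mut builder = ParquetRecordBatchReaderBuilder::try_new(parquet_bytes)?
    .with_row_groups(result.row_groups().to_vec());
// Page-level pruning yields an optional row selection on top of the
// surviving row groups.
if let Some(selection) = result.row_selection() {
    builder = builder.with_row_selection(selection);
}
let reader = builder.build()?;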

Async with bloom filters:

let metadata = builder.metadata().clone();
let schema = builder.schema().clone();
let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&predicate)
    .enable_bloom_filter(true)
    .prune_async(&mut builder).await;

Custom bloom provider:

use aisle::AsyncBloomFilterProvider; // trait path assumed
use parquet::bloom_filter::Sbbf;

impl AsyncBloomFilterProvider for MyProvider {
    async fn bloom_filter(&mut self, rg: usize, col: usize) -> Option<Sbbf> {
        // Your optimized loading logic (e.g., cached or batched fetches)
        todo!()
    }
}

let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&predicate)
    .prune_async(&mut my_provider).await;

Byte array ordering (advanced):

// Conservative (default): Requires exact min/max for ordering predicates
let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&col("name").gt(lit("prefix")))
    .prune();

// Aggressive: Allow truncated byte array statistics (may have false negatives)
let result = PruneRequest::new(&metadata, &schema)
    .with_predicate(&col("name").gt(lit("prefix")))
    .allow_truncated_byte_array_ordering(true)
    .prune();

Performance

Reasonable expectations (actual results depend on file layout and metadata quality):

Query Type                          Expected I/O Reduction    Notes
Point query (id = 12345)            High (often substantial)  Best with bloom filters + accurate stats
Range query (date BETWEEN ...)      Moderate to high          Depends on row-group size and data distribution
Multi-column filter                 Moderate                  AND helps; OR can reduce page-level pruning
High-cardinality IN (sku IN (...))  Moderate to high          Bloom filters help when present

These are guidance only until benchmarks land.

Performance Factors:

  • Row group size: Smaller row groups → finer-grained statistics and more skip opportunities (at the cost of more metadata)
  • Predicate selectivity: More selective predicates (fewer matching rows) → more pruning opportunities
  • Column cardinality: Bloom filters shine for high-cardinality columns
  • Page index availability: Enables page-level pruning (Parquet 1.12+)

Overhead: Metadata evaluation is typically small relative to I/O, but varies with predicate complexity and metadata availability.

Examples

Run the included examples to see end-to-end usage:

  • basic_usage: Row-group pruning with metadata and predicates
  • bloom_filter: Async API with bloom filter support

# Row-group pruning example
cargo run --example basic_usage

# Async + bloom filters
cargo run --example bloom_filter

When to Use Aisle

Good fit:

  • Selective queries (reading <20% of data)
  • Large Parquet files (>100MB, multiple row groups)
  • Remote storage (S3, GCS) where I/O is expensive
  • High-cardinality point queries (user IDs, transaction IDs)
  • Time-series with range queries

Not needed:

  • Full table scans (no pruning benefit)
  • Small files (<10MB, single row group)
  • Already using a query engine with built-in pruning (DataFusion, DuckDB)

Tips: Combine Aisle with proper Parquet configuration (a writer-properties sketch follows this list):

  • Sort data by frequently-filtered columns
  • Use reasonable row group sizes (64-256MB)
  • Enable bloom filters for high-cardinality columns
  • Write page indexes (Parquet 1.12+)
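
A sketch of those writer settings using the parquet crate's WriterProperties (values are illustrative; tune for your data):

use parquet::file::properties::{EnabledStatistics, WriterProperties};

let props = WriterProperties::builder()
    .set_statistics_enabled(EnabledStatistics::Page) // min/max stats + page indexes
    .set_bloom_filter_enabled(true)                  // bloom filters for point queries
    .set_max_row_group_size(1_048_576)               // rows per row group (illustrative)
    .build();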

Architecture

For architecture details, see the full documentation on docs.rs.

Testing

Aisle has comprehensive test coverage (111 tests):

# Run all tests
cargo test

# Run specific test suites
cargo test --test best_effort_pruning  # NOT pushdown edge cases
cargo test --test null_count_edge_cases  # Null handling
cargo test --test async_bloom  # Bloom filter integration

License

Licensed under the MIT License. See LICENSE for details.


Built with: arrow-rs, DataFusion, Parquet

Dependencies

~41MB
~680K SLoC