Optimize parquet writer output for duckdb-wasm in `buildMultipartParquet`


*Drafted with AI assistance based on a performance issue we noticed in our own usage — sharing as framing rather than a fully-formed proposal.*

We have an Evidence project where initial dashboard load is ~10 seconds, dominated by fetching and scanning large parquet files via duckdb-wasm. Looking at `buildMultipartParquet` in `packages/lib/universal-sql/src/build-parquet.js`, it calls `parquet-wasm`'s `writeParquet` with default writer properties, except ZSTD compression — default row group sizing, no page index.

Two things about Evidence's architecture make this worth revisiting:

**The reader is always duckdb-wasm.** Unlike most parquet pipelines, Evidence isn't writing files that need to be portable across Spark, Trino, Athena, or Pandas. The parquet cache is consumed by duckdb-wasm both at build time (for SSR) and at runtime (in the browser). There's a single known consumer with documented preferences (DuckDB recommends row groups of 100K–1M rows, exploits page indexes, handles ZSTD efficiently), so optimizing for that one engine costs nothing in portability.

**Evidence is write-once, read-many.** Sources are built once per refresh and served to every dashboard load thereafter. That asymmetry means heavier write-time work (better compression, larger sorted row groups, page indexes) directly buys read-time wins. The current defaults are reasonable for a general-purpose writer but don't reflect Evidence's actual workload.

The docs already point users toward `ORDER BY` in source queries for better compression and projection pushdown, which suggests the team understands runtime parquet layout matters for duckdb-wasm specifically — tuning the rest of the writer to match feels like a natural extension.

**Questions I don't know the answer to**

- Has this been considered before and rejected for reasons not visible from outside?
- Is there a preferred config surface (per-source `connection.yaml`, global config, per-query header), or is changing defaults the cleaner path given the fixed reader?
- Are there constraints from duckdb-wasm version compatibility I'm missing?

Mostly wanted to surface it in case it's useful input — happy for it to sit as a "noted, maybe later" if that's the right call.

**Sketch of the change**

The current call to `writeParquet` sets ZSTD compression and otherwise uses defaults. The lever is `WriterPropertiesBuilder` from `parquet-wasm`:

```js
import {
  Table as ParquetTable,
  WriterPropertiesBuilder,
  Compression,
  EnabledStatistics,
  writeParquet
} from 'parquet-wasm/node/arrow1';
import { tableToIPC } from 'apache-arrow';

// Build once, reuse across flushes
const writerProps = new WriterPropertiesBuilder()
  .setCompression(Compression.ZSTD)              // already set by the current code
  .setMaxRowGroupSize(122_880)                   // DuckDB's native default
  .setDictionaryEnabled(true)
  // Page-level min/max statistics. Assumption: parquet-wasm exposes no separate
  // page-index setter and takes an EnabledStatistics value here rather than a
  // boolean; verify against the installed parquet-wasm version.
  .setStatisticsEnabled(EnabledStatistics.Page)
  .build();

// In flush():
const wasmTable = ParquetTable.fromIPCStream(tableToIPC(arrowTable, 'stream'));
const bytes = writeParquet(wasmTable, writerProps);
```

These settings map to things duckdb-wasm can exploit at read time: smaller files to fetch, row groups sized for parallel scanning, page-level min/max statistics for predicate pushdown, and dictionary encoding for low-cardinality strings (a dictionary is written per column chunk, so it is not shared across row groups).

Per-source overrides could be threaded through from `connection.yaml` so users can tune for their workload without forking — but defaults alone would likely be a meaningful win for most projects.
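
To make the override idea concrete, here is a minimal sketch of merging per-source settings over the proposed defaults. Everything in it is hypothetical: `DEFAULT_WRITER_OPTIONS`, `resolveWriterOptions`, and the option key names are illustrative, not part of Evidence or `parquet-wasm`, and it assumes the overrides have already been parsed out of `connection.yaml` into a plain object.

```js
// Hypothetical defaults mirroring the writer sketch above.
// The key names are illustrative only.
const DEFAULT_WRITER_OPTIONS = Object.freeze({
  compression: 'ZSTD',
  maxRowGroupSize: 122_880,
  dictionaryEnabled: true
});

/**
 * Merge per-source overrides over the defaults.
 * Unknown keys and mistyped values throw, so a typo in the
 * source config fails the build instead of being ignored.
 */
function resolveWriterOptions(overrides = {}) {
  const resolved = { ...DEFAULT_WRITER_OPTIONS };

  for (const [key, value] of Object.entries(overrides)) {
    if (!(key in DEFAULT_WRITER_OPTIONS)) {
      throw new Error(`Unknown parquet writer option: ${key}`);
    }
    const expected = typeof DEFAULT_WRITER_OPTIONS[key];
    if (typeof value !== expected) {
      throw new Error(`Invalid type for ${key}: expected ${expected}`);
    }
    resolved[key] = value;
  }

  if (!Number.isInteger(resolved.maxRowGroupSize) || resolved.maxRowGroupSize <= 0) {
    throw new Error('maxRowGroupSize must be a positive integer');
  }
  return resolved;
}

// Example: a source that prefers larger row groups.
const options = resolveWriterOptions({ maxRowGroupSize: 500_000 });
console.log(options);
```

The resolved object would then drive the builder calls (for example `.setMaxRowGroupSize(options.maxRowGroupSize)`), so a source with no overrides still gets the tuned defaults.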
