Drafted with AI assistance based on a performance issue we noticed in our own usage — sharing as framing rather than a fully-formed proposal.
We have an Evidence project where initial dashboard load is ~10 seconds, dominated by fetching and scanning large parquet files via duckdb-wasm. Looking at buildMultipartParquet in packages/lib/universal-sql/src/build-parquet.js, it calls parquet-wasm's writeParquet with default writer properties, except ZSTD compression — default row group sizing, no page index.
Two things about Evidence's architecture make this worth revisiting:
The reader is always duckdb-wasm. Unlike most parquet pipelines, Evidence isn't writing files that need to be portable across Spark, Trino, Athena, or Pandas: the parquet cache is consumed by duckdb-wasm both at build time (for SSR) and at runtime (in the browser). There's a single known consumer with documented preferences (DuckDB recommends row groups of 100K–1M rows, exploits page indexes, handles ZSTD efficiently). Optimizing for that one engine has no downside.
Evidence is write-once, read-many. Sources are built once per refresh and served to every dashboard load thereafter. That asymmetry means heavier write-time work (better compression, larger sorted row groups, page indexes) directly buys read-time wins. The current defaults are reasonable for a general-purpose writer but don't reflect Evidence's actual workload.
The docs already point users toward ORDER BY in source queries for better compression and projection pushdown, which suggests the team understands runtime parquet layout matters for duckdb-wasm specifically — tuning the rest of the writer to match feels like a natural extension.
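To make the layout point concrete, here is a toy model of row-group skipping. It is only a sketch, not DuckDB's actual scan logic: `rowGroupStats` and `groupsToScan` are hypothetical helpers, and the data is synthetic. It assumes a reader that skips any row group whose min/max range cannot contain the filter value, which is the mechanism sorting is meant to help.

```javascript
// Toy model: each row group stores min/max for one column. A reader can skip
// any group whose [min, max] range cannot contain the filter value.
function rowGroupStats(values, rowGroupSize) {
  const stats = [];
  for (let i = 0; i < values.length; i += rowGroupSize) {
    const chunk = values.slice(i, i + rowGroupSize);
    stats.push({ min: Math.min(...chunk), max: Math.max(...chunk) });
  }
  return stats;
}

// Count the row groups that must be scanned for `WHERE col = target`.
function groupsToScan(stats, target) {
  return stats.filter((s) => s.min <= target && target <= s.max).length;
}

// 1,000 rows holding the values 0..999, split into 10 row groups of 100 rows.
const sorted = Array.from({ length: 1000 }, (_, i) => i);
// Deterministic "unsorted" order: i * 7 mod 1000 is a permutation of 0..999
// (7 is coprime with 1000) that spreads values across every group.
const shuffled = sorted.map((i) => (i * 7) % 1000);

const sortedScan = groupsToScan(rowGroupStats(sorted, 100), 420);
const shuffledScan = groupsToScan(rowGroupStats(shuffled, 100), 420);

console.log({ sortedScan, shuffledScan }); // { sortedScan: 1, shuffledScan: 10 }
```

With sorted data only one of the ten groups overlaps the filter value; with the scattered order every group's range covers it, so nothing can be skipped. Writer settings only change how much of that benefit the file layout can deliver.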
Questions I don't know the answer to
- Has this been considered before and rejected for reasons not visible from outside?
- Is there a preferred config surface (per-source connection.yaml, global config, per-query header), or is changing defaults the cleaner path given the fixed reader?
- Are there constraints from duckdb-wasm version compatibility I'm missing?
Mostly wanted to surface it in case it's useful input — happy for it to sit as a "noted, maybe later" if that's the right call.
Sketch of the change
The current call to writeParquet uses defaults apart from compression. The lever is WriterPropertiesBuilder from parquet-wasm:
import {
  Table as ParquetTable,
  WriterPropertiesBuilder,
  Compression,
  EnabledStatistics,
  writeParquet
} from 'parquet-wasm/node/arrow1';
import { tableToIPC } from 'apache-arrow';

// Build once, reuse across flushes
const writerProps = new WriterPropertiesBuilder()
  .setCompression(Compression.ZSTD) // already set by the current code
  .setMaxRowGroupSize(122_880) // DuckDB's native default
  .setDictionaryEnabled(true)
  // Page-level statistics. As far as I can tell parquet-wasm has no separate
  // setWritePageIndex(), and setStatisticsEnabled() takes an EnabledStatistics
  // value rather than a boolean; the underlying Rust writer emits the page
  // index when statistics are at page level. Verify against the installed version.
  .setStatisticsEnabled(EnabledStatistics.Page)
  .build();

// In flush():
const wasmTable = ParquetTable.fromIPCStream(tableToIPC(arrowTable, 'stream'));
const bytes = writeParquet(wasmTable, writerProps);
The properties above map to things duckdb-wasm can exploit at read time: smaller files to fetch, row groups sized for parallel scan, page-level min/max statistics for predicate pushdown, and dictionary encoding for low-cardinality strings (dictionaries are stored per column chunk, so the benefit is per row group rather than shared across groups).
Per-source overrides could be threaded through from connection.yaml so users can tune for their workload without forking — but defaults alone would likely be a meaningful win for most projects.
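As a rough sketch of what per-source overrides could look like once parsed, the snippet below merges a source's settings over project-wide defaults and validates them before they reach the writer. Everything here is an assumption for illustration: the option names (`compression`, `maxRowGroupSize`, `dictionaryEnabled`), the `resolveParquetOptions` helper, and the idea of a `parquet:` block in connection.yaml are hypothetical and not an existing Evidence config schema.

```javascript
// Hypothetical project-wide defaults; key names are illustrative only.
const DEFAULT_PARQUET_OPTIONS = Object.freeze({
  compression: 'ZSTD',
  maxRowGroupSize: 122_880,
  dictionaryEnabled: true
});

const ALLOWED_COMPRESSION = new Set(['UNCOMPRESSED', 'SNAPPY', 'GZIP', 'ZSTD']);

// Merge per-source overrides over the defaults, rejecting unknown keys and
// invalid values early so a typo fails at build time rather than silently.
function resolveParquetOptions(overrides = {}) {
  const merged = { ...DEFAULT_PARQUET_OPTIONS };
  for (const [key, value] of Object.entries(overrides)) {
    if (!(key in DEFAULT_PARQUET_OPTIONS)) {
      throw new Error(`Unknown parquet option: ${key}`);
    }
    merged[key] = value;
  }
  if (!ALLOWED_COMPRESSION.has(merged.compression)) {
    throw new Error(`Unsupported compression: ${merged.compression}`);
  }
  if (!Number.isInteger(merged.maxRowGroupSize) || merged.maxRowGroupSize <= 0) {
    throw new Error('maxRowGroupSize must be a positive integer');
  }
  if (typeof merged.dictionaryEnabled !== 'boolean') {
    throw new Error('dictionaryEnabled must be a boolean');
  }
  return merged;
}

// Example: the object a YAML parser might yield for one source's overrides.
const resolved = resolveParquetOptions({ maxRowGroupSize: 500_000 });
console.log(resolved);
```

The resolved object would then be translated into WriterPropertiesBuilder calls in one place, so the defaults stay the common path and overrides remain optional.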