Mark Raasveldt
Parallel Quacking
Parallel Quacking
▸ When building DuckDB we have mostly
focused on building a functional system
▸ Avoid premature optimization
▸ Avoid adding optimizations that prevent
adding features
Parallel Quacking
▸ Suddenly people are benchmarking our system
▸ Including benchmarks in research papers
▸ Yikes!
▸ We haven’t exactly spent a lot of time
optimizing…
Parallel Quacking
▸ We are now pretty happy with functionality
▸ Window functions, subqueries, collations,
(recursive) CTEs, Parquet/Pandas/CSV
readers, …
▸ Maybe we should start optimizing!
Parallel Quacking
▸ DuckDB is currently single-threaded
▸ Parallelism is an obvious performance boost
▸ More importantly: parallelism requires a
structural change to the code
▸ Optimizations need to account for parallelism
▸ Optimizing a single-threaded HT is pointless if
we have to throw it away once we add
parallelism!
Parallel Quacking
▸ Parallelism is actually our oldest open issue!
▸ Created one month after the initial commit
▸ So it’s about time :)
DBMS Parallelism
▸ Short intro to DBMS parallelism
▸ DBMS have two types of parallelism
▸ Inter-query and intra-query parallelism
▸ Inter-query: multiple different queries
can be executed in parallel
▸ Intra-query: a single query can be
parallelized
DBMS Parallelism
▸ Most systems have inter-query
▸ We already had this
▸ Most useful for OLTP systems
▸ Many concurrent client requests, etc.
DBMS Parallelism
▸ Intra-query is not part of most OLTP
systems
▸ e.g. MySQL/PostgreSQL/SQLite
▸ Not useful for small queries
▸ Only useful for complex queries
▸ Aka OLAP systems
DBMS Parallelism
▸ Exchange operator: the original way of
doing parallelism
▸ Parallelism is encapsulated in the
exchange operator
▸ All other ops are unaware of parallelism
▸ Easy to bolt onto existing systems
[1993] Encapsulation of Parallelism and
Architecture-Independence in Extensible
Database Query Execution
Goetz Graefe et al.
DBMS Parallelism
▸ MonetDB uses a system similar to the exchange
operator
▸ Individual ops are parallelism-unaware
▸ Data is partitioned by mitosis (mergetable?)
▸ Ops execute sequentially on partitions
▸ Result is combined by mat.pack
DBMS Parallelism
▸ Exchange operator works to parallelize queries
▸ It is nice to bolt on to an existing system
▸ Don’t need to change any operators!
▸ But has partitioning/merging overhead…
▸ Works well for certain queries¹, not for many
others
▸ ¹ ungrouped aggregates or aggregates with a
small number of groups
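The partition/operate/merge pattern of the exchange operator can be sketched in a few lines. This is an illustrative sketch only (the function names `partition`, `sequential_sum`, and `exchange_sum` are hypothetical, not MonetDB or DuckDB code): the operator itself stays parallelism-unaware, and all parallelism lives in the partition and merge steps around it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of exchange-style parallelism: partition the data,
# run a parallelism-unaware operator on each partition, merge the results
# (compare MonetDB's mitosis for partitioning and mat.pack for merging).

def partition(data, n):
    """Split data into n roughly equal partitions (the 'exchange' step)."""
    size = (len(data) + n - 1) // n
    return [data[i * size:(i + 1) * size] for i in range(n)]

def sequential_sum(part):
    # The operator is unaware that it runs on a partition.
    return sum(part)

def exchange_sum(data, n_threads=4):
    parts = partition(data, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(sequential_sum, parts))
    return sum(partials)  # merge step

total = exchange_sum(list(range(1000)))
# total == 499500
```

Note that the partition and merge steps are pure overhead compared to a single sequential pass, which is exactly the cost the slide refers to.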
Morsel-Driven Parallelism
▸ Alternative: Morsel-driven parallelism
▸ Parallelism-aware operators
▸ Query is divided into pipelines
▸ Those pipelines are executed in parallel
[2014] Morsel-Driven Parallelism: A
NUMA-Aware Query Evaluation
Framework for the Many-Core Age
Viktor Leis et al.
Morsel-Driven Parallelism
SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

1: HT Build “T”
2: HT Build “S”
3: Probe HTs and output result (depends on 1 and 2)
Morsel-Driven Parallelism
SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);
HT Build “T”
HT Build “S”
▸ HT builds of S and T can be trivially parallelized
▸ No shared data
▸ Limited parallelizability: depends on query complexity…
Morsel-Driven Parallelism
▸ Need to parallelize inside a pipeline
▸ How to do that?
▸ Contention happens at the endpoints
▸ Scan of T
▸ HT build at the join
▸ Use parallelism-aware operators at the endpoints
▸ The rest of the operators (HT probe, projection,
filter, etc.) don’t need to be aware
Morsel-Driven Parallelism
[Figure: speedups on TPC-H SF100, 32 cores (64 hardware threads)]
[2014] Morsel-Driven Parallelism: A
NUMA-Aware Query Evaluation
Framework for the Many-Core Age
Viktor Leis et al.
Morsel-Driven Vegetable Soup
▸ Morsel-driven parallelism seems like the way to go
▸ How can we add it to our vegetable soup?
Parallelism in DuckDB
▸ DuckDB uses a pull-based Volcano execution model
▸ “Vector Volcano”
▸ Every operator implements a GetChunk method
▸ Recursively calls GetChunk on its children
▸ Until we reach a data source (e.g. a table scan)
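The pull-based model can be sketched as follows. This is an illustrative sketch, not DuckDB's actual operator API: the classes `TableScan` and `Filter` and the method `get_chunk` are hypothetical stand-ins for the GetChunk mechanism described above.

```python
# Illustrative sketch of a pull-based ("Vector Volcano") execution model:
# every operator exposes get_chunk and recursively pulls from its child
# until a data source is reached.

class TableScan:
    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.pos = 0

    def get_chunk(self):
        # Data source: emit one chunk at a time until exhausted.
        if self.pos >= len(self.chunks):
            return None
        chunk = self.chunks[self.pos]
        self.pos += 1
        return chunk

class Filter:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def get_chunk(self):
        # Recursively pull from the child until a non-empty chunk survives.
        while True:
            chunk = self.child.get_chunk()
            if chunk is None:
                return None
            filtered = [row for row in chunk if self.predicate(row)]
            if filtered:
                return filtered

# Usage: SELECT * FROM t WHERE x > 2, over two chunks of integers.
plan = Filter(TableScan([[1, 2, 3], [4, 5]]), lambda x: x > 2)
result = []
while (chunk := plan.get_chunk()) is not None:
    result.extend(chunk)
# result == [3, 4, 5]
```

The recursion stops at the table scan, which is why splitting a query into pipelines has to break this pull chain at operators that consume their entire input.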
Parallelism in DuckDB
▸ BuildHashTable: pull everything from RHS (build-side)
▸ ProbeHashTable: pull single chunk from LHS (probe
side)
Parallelism in DuckDB
▸ Have to split up building from probing
▸ Create individual pipelines
▸ Design an interface that allows for parallelism-aware execution
Parallelism in DuckDB
▸ Contention is in the source and sink of a pipeline
▸ Most difficult contention is in the sink
▸ Splitting up a scan is relatively simple
Parallelism in DuckDB
▸ Sink Interface
▸ Sink has two states
▸ Global state: single state per sink
▸ Local state: single state per thread
▸ Actual content depends on the operator
Parallelism in DuckDB
▸ Sink Interface
▸ Sink takes as input the two states + a DataChunk
▸ Called repeatedly until the source data is exhausted
Parallelism in DuckDB
▸ Sink Interface
▸ Combine is called after a single thread’s source is
exhausted
▸ Combine is the final chance to merge any changes
in the local sink state into the global state
Parallelism in DuckDB
▸ Sink Interface
▸ Finalize is called after all tasks related to the sink
are completed
Parallelism in DuckDB
▸ Example: Ungrouped Aggregate
▸ Global state holds the aggregate result, and a lock
Parallelism in DuckDB
▸ Example: Ungrouped Aggregate
▸ Local state holds a thread-local aggregate, and
some intermediates
Parallelism in DuckDB
▸ Example: Ungrouped Aggregate
▸ Sink: Aggregate into thread-local aggregation
Parallelism in DuckDB
▸ Example: Ungrouped Aggregate
▸ Combine: Merge local state into global state
Parallelism in DuckDB
▸ Example: Ungrouped Aggregate
▸ Finalize: Nothing, we are done
▸ (both Combine and Finalize are optional)
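The ungrouped-aggregate example above can be sketched end to end. This is an illustrative Python sketch, not DuckDB's actual C++ code: the class `UngroupedSum` and its method names are hypothetical, but the structure follows the Sink/Combine/Finalize interface described on the previous slides.

```python
import threading

# Hypothetical sketch of a parallel ungrouped aggregate using the
# Sink/Combine/Finalize interface: Sink aggregates into thread-local
# state without locking; Combine merges into the global state under a
# lock; Finalize has nothing left to do.

class UngroupedSum:
    def __init__(self):
        self.lock = threading.Lock()   # global state: result + lock
        self.result = 0

    def make_local_state(self):
        return {"sum": 0}              # thread-local aggregate

    def sink(self, local, chunk):
        # Sink: aggregate into the thread-local state, no locking needed.
        local["sum"] += sum(chunk)

    def combine(self, local):
        # Combine: merge the local state into the global state once.
        with self.lock:
            self.result += local["sum"]

    def finalize(self):
        # Finalize: nothing to do for an ungrouped aggregate.
        return self.result

def run_thread(op, chunks):
    local = op.make_local_state()
    for chunk in chunks:          # called repeatedly until exhausted
        op.sink(local, chunk)
    op.combine(local)             # merge exactly once per thread

op = UngroupedSum()
partitions = [[[1, 2], [3]], [[4, 5]], [[6]]]   # chunks per thread
threads = [threading.Thread(target=run_thread, args=(op, p)) for p in partitions]
for t in threads: t.start()
for t in threads: t.join()
# op.finalize() == 21
```

The key property is that the lock is taken once per thread (in Combine), not once per chunk, so contention stays low regardless of input size.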
Parallelism in DuckDB
▸ Splitting up scans
▸ Splitting up scans is generally not very difficult
▸ But we have multiple types of scans
▸ Base table, parquet, CSV, aggregate HT, etc…
▸ How to split up depends on scan type
Parallelism in DuckDB
▸ Interface for parallel scans:
▸ One task is created for every invoked callback
▸ Implementation is optional
▸ No implementation -> scan will not be parallelized
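The base-table split described on the next slide (one task per 100 vectors of 1,024 tuples each) can be sketched like this. The function `create_scan_tasks` is a hypothetical illustration of the idea, not DuckDB's actual scan interface.

```python
# Hypothetical sketch of splitting a base-table scan into parallel tasks:
# one task per fixed block of vectors.

VECTOR_SIZE = 1024          # tuples per vector
VECTORS_PER_TASK = 100      # one task per 100 vectors (102,400 tuples)

def create_scan_tasks(total_tuples):
    """Return (start, end) tuple ranges, one per scan task."""
    tuples_per_task = VECTOR_SIZE * VECTORS_PER_TASK
    tasks = []
    start = 0
    while start < total_tuples:
        end = min(start + tuples_per_task, total_tuples)
        tasks.append((start, end))
        start = end
    return tasks

tasks = create_scan_tasks(250_000)
# 250,000 tuples -> two full tasks of 102,400 tuples plus one remainder
```

Each range becomes one scheduled task; a scan type that provides no such callback simply runs as a single task.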
Parallelism in DuckDB
▸ Currently only implemented for base table
▸ One task for every 100 vectors (102,400 tuples)
▸ Parquet/Pandas is not very complicated
▸ CSV can also benefit…
▸ Future work!
Parallelism in DuckDB
▸ Creating the pipelines
▸ Created by a single traversal of the query tree
▸ Encounter a pipeline breaker: create a new
pipeline
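The single traversal described above can be sketched as a recursive walk that starts a new pipeline at each pipeline breaker. This is an illustrative sketch, not DuckDB's actual planner code: `Op` and `build_pipelines` are hypothetical, and it assumes (as the next slides note) that the hash join builds on its right child.

```python
# Hypothetical sketch: one traversal of the query tree that creates a new
# pipeline (plus a dependency) at every pipeline breaker, here the build
# side of a hash join.

class Op:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def build_pipelines(op, current, pipelines):
    if op.name == "hash_join":
        probe, build = op.children  # build on the RHS
        # The build side is a pipeline breaker: create a child pipeline
        # and record it as a dependency of the current pipeline.
        child = {"ops": [], "deps": []}
        pipelines.append(child)
        current["deps"].append(child)
        build_pipelines(build, child, pipelines)
        build_pipelines(probe, current, pipelines)
    else:
        for c in op.children:
            build_pipelines(c, current, pipelines)
    current["ops"].append(op.name)

# Plan shape for: ... FROM S JOIN R USING (A) JOIN T USING (B)
plan = Op("hash_join", [
    Op("hash_join", [Op("scan_R"), Op("scan_S")]),
    Op("scan_T"),
])
main = {"ops": [], "deps": []}
pipelines = [main]
build_pipelines(plan, main, pipelines)
# Three pipelines: the main probe pipeline plus one build pipeline per join.
```

The two build pipelines have no dependencies of their own, so they can run in parallel; the main pipeline waits on both, matching the dependency picture from the morsel-driven slides.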
Parallelism in DuckDB
SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

Encounter hash join: create build pipeline in RHS*
and create a dependency in the main pipeline

* This image is taken from HyPer, which builds on the LHS - we build on the RHS.
Is there a standard? Should we switch this? Is it even important?
Parallelism in DuckDB
SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

Another hash join: create another
build pipeline and dependency
Parallelism in DuckDB
TPC-H Q1
[Figure: query profile]
P1 (depends on P2): scans the aggregate HT!
P2
The 0 shown is a bug in our profiler with parallel execution at the moment (TODO)
Parallelism in DuckDB
▸ Notes on parallelism
▸ The final pipeline (i.e. the one that outputs
results) is not parallelized
▸ Doesn’t matter for TPC-H (there is always a Top-N
or ORDER BY…)
▸ But can definitely matter for other queries!
▸ We can push a “materialize” operator that
materializes in parallel
▸ Future work!
Parallelism in DuckDB
▸ Notes on load balancing
▸ Pipelines are split into tasks
▸ Tasks are scheduled in a concurrent queue
▸ Worker threads work on these tasks in scheduled
order
▸ Except the calling thread: this thread works on its
own query
▸ Short queries will not have to wait for long queries
▸ Every query has at least one thread working on it
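The scheduling scheme above can be sketched with a shared concurrent queue. This is an illustrative sketch of the idea, not DuckDB's actual scheduler: the queue, the sentinel-based shutdown, and the task names are all hypothetical.

```python
import queue
import threading

# Hypothetical sketch of the load-balancing scheme: pipeline tasks from
# all queries go into one shared concurrent queue, and worker threads
# drain it in scheduled order.

task_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def make_task(query_id, part):
    def task():
        with results_lock:
            results.append((query_id, part))
    return task

def worker():
    while True:
        task = task_queue.get()
        if task is None:   # sentinel: shut this worker down
            return
        task()

# Schedule tasks from two queries into the shared queue.
for part in range(3):
    task_queue.put(make_task("short_query", part))
    task_queue.put(make_task("long_query", part))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for _ in threads: task_queue.put(None)
for t in threads: t.join()
# All six tasks ran; completion order across workers is nondeterministic.
```

Because tasks from different queries interleave in the queue, a short query's few tasks are not stuck behind a long query's many tasks, which is the point of the slide. The sketch omits the detail that the calling thread only works on its own query.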
Parallelism in DuckDB
▸ NUMA Awareness
▸ TODO :)
Preliminary Results
▸ Results
▸ Before we implemented splitting of scans we were
curious
▸ How much does TPC-H benefit from inter-pipeline
parallelization?
Preliminary Results
▸ Small speedup in some queries
▸ Most queries are dominated by a single pipeline!
* Actually 3 threads, due to an off-by-one :)
Preliminary Results
▸ Preliminary results (including splitting of pipelines)
▸ Notes
▸ We did not implement a good aggregate HT yet!
▸ Currently global HT that is locked on every sink
▸ Join HT/scan also have a (small) amount of contention
▸ Did not have much time to look at it yet
▸ This was all finished last Thursday :)
Preliminary Results
▸ Preliminary results
Preliminary Results
▸ Q1
[Figures: parallel vs. sequential query profiles]
Preliminary Results
▸ Q18 Sequential
Preliminary Results
▸ Q18 Parallel
Future Work
▸ Future Work
▸ Rework aggregate hash table
▸ More profiling of contention (specifically in scans)
▸ Parallel window functions, ORDER BY, Top N…
▸ Parallel Parquet/CSV/Pandas scans
▸ Expand profiler to better display parallelism/
pipelines