Currently our execution model operates in a pull-based, Volcano-like fashion. That means that an operator exposes a GetChunk function that fetches a result chunk from the operator. The operator will, in turn, fetch result chunks from its children using this same interface until it reaches a source node (e.g. a base table scan or a parquet file scan) which can actually emit chunks, after which execution resumes.
A simple example of such an operator is the projection:
    void PhysicalProjection::GetChunkInternal(ExecutionContext &context, DataChunk &chunk, PhysicalOperatorState *state_p) {
        auto state = reinterpret_cast<PhysicalProjectionState *>(state_p);
        // get the next chunk from the child
        children[0]->GetChunk(context, state->child_chunk, state->child_state.get());
        if (state->child_chunk.size() == 0) {
            return;
        }
        state->executor.Execute(state->child_chunk, chunk);
    }
This works semi-elegantly and has generally served us well. However, now that we have introduced pipeline parallelism the model is beginning to show cracks. In the pipeline parallelism model, we no longer want to have the behavior of "pulling from the root node". Instead, we want to execute pipelines separately.
The way this is done right now is a semi-hacky solution on top of this model. If we have a pipeline (e.g. a hash table build), we pull from the child node of that hash table using GetChunk, and then call Sink with the result of this. Partitioning is done by writing partition information to the thread-local ExecutionContext object, and using that in the source node to determine the desired partitioning. For example, here is how this is done in the TableScan:
    // table scan
    auto &task = context.task;
    // check if there is any parallel state to fetch
    state.parallel_state = nullptr;
    auto task_info = task.task_info.find(this);
    if (task_info != task.task_info.end()) {
        // parallel scan init
        state.parallel_state = task_info->second;
        state.operator_data =
            function.parallel_init(context.client, bind_data.get(), state.parallel_state, column_ids, &filters);
    } else {
        // sequential scan init
        state.operator_data = function.init(context.client, bind_data.get(), column_ids, &filters);
    }
Problem 1: Code Duplication
As a result of these solutions, there is a bunch of code in the operators that relates to the general control flow of the system that really should not be part of the operators but handled on a higher level. This code is repeated in many places as a result. For example, the code for pulling a new chunk from the child is repeated in many different operators:
    // get the next chunk from the child
    children[0]->GetChunk(context, state->child_chunk, state->child_state.get());
    if (state->child_chunk.size() == 0) {
        return;
    }
Or, even worse, for operators that reduce cardinality (e.g. filters), the operator needs to handle the case where the chunk becomes empty as a result of the operator and keep on pulling from its child node:
    // filter
    do {
        // fetch a chunk from the child and run the filter
        // we repeat this process until either (1) passing tuples are found, or (2) the child is completely exhausted
        children[0]->GetChunk(context, chunk, state->child_state.get());
        if (chunk.size() == 0) {
            return;
        }
        initial_count = chunk.size();
        result_count = state->executor.SelectExpression(chunk, sel);
    } while (result_count == 0);
This general control flow code does not belong to the operators themselves, and repeating it everywhere is error-prone and messy. This goes double for the parallel scan partitioning. The interface provided there right now is messy, and is the main reason we have not implemented more parallel scans (e.g. parallel scans of aggregate hash tables). Extending this behavior should be easy, and right now it is not.
Problem 2: Parallel UNION operators
Another problem we have encountered is with parallelizing UNION operators. In a pipeline that has a union operator (note: the union operator is not duplicate eliminating, i.e. this is equivalent to UNION ALL), parallelization appears rather trivial. Namely, if you have HASH_BUILD(UNION(A, B)), you can parallelize the scans over the separate tables just as you would parallelize the scans within one table. With the pull-based model, however, this is particularly difficult because we now need to decide which side to pull from on every union node on every iteration we make. This could be solved in a similar way to how we resolve the partitioning of the table scans, but this would require us to add bookkeeping for every union along the pipeline in a similarly hacky way.
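To illustrate why this becomes simple once data is pushed rather than pulled, here is a minimal sketch using toy types. Everything in it (Chunk, Table, CountingSink, PlanUnionAll) is a hypothetical stand-in invented for this example, not existing code. It assumes each union input can be split into independent scan tasks that all push into the same thread-safe sink, so the union node itself needs no bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <vector>

// Toy stand-in for DataChunk: a batch of integer values.
using Chunk = std::vector<int64_t>;

// Toy table (hypothetical): its data is already split into chunks.
struct Table {
    std::vector<Chunk> chunks;
};

// Toy sink standing in for the hash table build. It is guarded by a mutex
// so that tasks may call Sink from any thread.
struct CountingSink {
    void Sink(const Chunk &input) {
        std::lock_guard<std::mutex> guard(lock);
        for (int64_t value : input) {
            total += value;
        }
        row_count += input.size();
    }

    std::mutex lock;
    int64_t total = 0;
    size_t row_count = 0;
};

// UNION ALL over any number of inputs: create one independent task per chunk
// of every input. No task needs to know which side of the union it came from.
// The tables and the sink must outlive the returned tasks.
std::vector<std::function<void()>> PlanUnionAll(const std::vector<const Table *> &inputs, CountingSink &sink) {
    std::vector<std::function<void()>> tasks;
    CountingSink *sink_ptr = &sink;
    for (const Table *table : inputs) {
        assert(table != nullptr);
        for (const Chunk &chunk : table->chunks) {
            const Chunk *chunk_ptr = &chunk;
            tasks.push_back([chunk_ptr, sink_ptr]() { sink_ptr->Sink(*chunk_ptr); });
        }
    }
    return tasks;
}
```

A scheduler could hand these tasks to worker threads in any order. Since the toy sink only sums and counts rows, the result does not depend on which table a chunk came from or when it arrives.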
Problem 3: Individual execution of operators
Another problem with the current approach of inlining operators into a tree is that operators cannot be executed separately from this tree. This makes it more difficult to take an operator and use it elsewhere, as a fake tree has to be constructed (to scan data from). A push-based model resolves this in a much cleaner manner by allowing you to simply execute an operator completely in isolation.
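The difference can be sketched with toy types. All names here (Chunk, PullOperator, FakeScan, PullDouble, PushDouble) are hypothetical and exist only for this illustration. The pull-based variant needs a fake child to pull from, while the push-based variant can be handed a chunk directly.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Toy stand-in for DataChunk: a batch of integer values.
using Chunk = std::vector<int64_t>;

// Pull-based interface: every operator pulls from a child.
struct PullOperator {
    virtual ~PullOperator() = default;
    virtual void GetChunk(Chunk &output) = 0;
};

// Fake source that exists only so the operator under test has a child.
// It returns its data once and an empty chunk afterwards.
struct FakeScan : PullOperator {
    explicit FakeScan(Chunk data_p) : data(std::move(data_p)) {
    }
    void GetChunk(Chunk &output) override {
        output = exhausted ? Chunk() : data;
        exhausted = true;
    }

    Chunk data;
    bool exhausted = false;
};

// Pull-based "projection" that doubles every value; it cannot run without a child.
struct PullDouble : PullOperator {
    explicit PullDouble(PullOperator &child_p) : child(child_p) {
    }
    void GetChunk(Chunk &output) override {
        Chunk input;
        child.GetChunk(input);
        output.clear();
        for (int64_t value : input) {
            output.push_back(value * 2);
        }
    }

    PullOperator &child;
};

// Push-based variant of the same operator: the input is passed in by the caller.
struct PushDouble {
    void Execute(const Chunk &input, Chunk &output) const {
        // input and output must be distinct chunks, since output is cleared first
        assert(&input != &output);
        output.clear();
        for (int64_t value : input) {
            output.push_back(value * 2);
        }
    }
};
```

With PushDouble, using the operator elsewhere takes a single Execute call on whatever chunk is at hand. With PullDouble, the caller first has to build a FakeScan around that chunk.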
Initial Sketch of Approach
My proposed solution to these problems is to modify the interface that operators implement, and to perform the actual data flow around the operators at a higher level. For example, the interface could look like this:
    // scan operator, used when the operator is a data source (i.e. table scans, but also certain pipeline breakers such as aggregate hash tables, orders or window functions)
    virtual void Scan(ExecutionContext &context, PhysicalOperatorState *state, PhysicalPartitionState *partition_state, DataChunk &output) = 0;
    // execute operator, used to propagate information through a standard operator such as a filter or projection
    virtual void Execute(ExecutionContext &context, PhysicalOperatorState *state, const DataChunk &input, DataChunk &result);
The general loop infrastructure might look like this:
    vector<DataChunk> intermediates;
    vector<unique_ptr<PhysicalOperatorState>> states;
    // initialize intermediate structures and states
    intermediates.resize(pipeline_nodes.size());
    states.resize(pipeline_nodes.size());
    for (idx_t i = 0; i < pipeline_nodes.size(); i++) {
        intermediates[i].Initialize(pipeline_nodes[i]->return_type);
        states[i] = pipeline_nodes[i]->GetOperatorState(context);
    }
    while (true) {
        pipeline_nodes[0]->Scan(context, states[0].get(), partition_state, intermediates[0]);
        if (intermediates[0].size() == 0) {
            // empty result from scan: bail
            break;
        }
        bool finished = true;
        // the last pipeline node is the sink, so it is not executed in this loop
        for (idx_t i = 1; i < pipeline_nodes.size() - 1; i++) {
            pipeline_nodes[i]->Execute(context, states[i].get(), intermediates[i - 1], intermediates[i]);
            if (intermediates[i].size() == 0) {
                // everything was filtered out: fetch the next chunk from the scan
                finished = false;
                break;
            }
        }
        if (finished) {
            // finished pipeline: sink the output of the last non-sink node into the sink operator
            auto &sink = pipeline_nodes.back();
            sink->Sink(context, global_state, sink_state, intermediates[pipeline_nodes.size() - 2]);
        }
    }
Operators would generally become simpler. For example, the projection would now look like this:
    void PhysicalProjection::Execute(ExecutionContext &context, PhysicalOperatorState *state_p, const DataChunk &input, DataChunk &output) {
        auto state = reinterpret_cast<PhysicalProjectionState *>(state_p);
        state->executor.Execute(input, output);
    }
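To make the division of labor concrete, here is a small self-contained sketch of the same idea with toy types. None of these names (Chunk, RangeScan, Operator, EvenFilter, AddOne, SumSink, RunPipeline) exist in the code base; they are assumptions made up for this example. It leaves out operator state, partitioning and the execution context entirely. The point is only that the operators contain no control flow, and that the "keep fetching until something passes the filter" loop lives in one place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Toy stand-in for DataChunk: a batch of integer values.
using Chunk = std::vector<int64_t>;

// Source: emits the values [0, count) in chunks of at most batch_size values.
struct RangeScan {
    RangeScan(int64_t count_p, int64_t batch_size_p) : count(count_p), batch_size(batch_size_p), position(0) {
    }
    void Scan(Chunk &output) {
        output.clear();
        for (int64_t i = 0; i < batch_size && position < count; i++) {
            output.push_back(position++);
        }
    }

    int64_t count;
    int64_t batch_size;
    int64_t position;
};

// Intermediate operator: transforms one input chunk into one output chunk
// and knows nothing about where the input came from.
struct Operator {
    virtual ~Operator() = default;
    virtual void Execute(const Chunk &input, Chunk &output) = 0;
};

// Filter that keeps even values; note that it contains no retry loop.
struct EvenFilter : Operator {
    void Execute(const Chunk &input, Chunk &output) override {
        output.clear();
        for (int64_t value : input) {
            if (value % 2 == 0) {
                output.push_back(value);
            }
        }
    }
};

// Projection that adds one to every value.
struct AddOne : Operator {
    void Execute(const Chunk &input, Chunk &output) override {
        output.clear();
        for (int64_t value : input) {
            output.push_back(value + 1);
        }
    }
};

// Sink: sums everything it receives and counts the chunks it saw.
struct SumSink {
    void Sink(const Chunk &input) {
        for (int64_t value : input) {
            total += value;
        }
        chunks_seen++;
    }

    int64_t total = 0;
    size_t chunks_seen = 0;
};

// The driver owns all control flow: scan, push through the operators, sink.
void RunPipeline(RangeScan &source, std::vector<std::unique_ptr<Operator>> &operators, SumSink &sink) {
    assert(source.batch_size > 0);
    // intermediates[0] holds the scan output, intermediates[i + 1] the output of operators[i]
    std::vector<Chunk> intermediates(operators.size() + 1);
    while (true) {
        source.Scan(intermediates[0]);
        if (intermediates[0].empty()) {
            break; // source exhausted
        }
        bool reached_sink = true;
        for (size_t i = 0; i < operators.size(); i++) {
            operators[i]->Execute(intermediates[i], intermediates[i + 1]);
            if (intermediates[i + 1].empty()) {
                reached_sink = false; // everything was filtered out: fetch the next chunk
                break;
            }
        }
        if (reached_sink) {
            sink.Sink(intermediates.back());
        }
    }
}
```

Running RunPipeline over RangeScan(10, 4) with an EvenFilter followed by an AddOne leaves a total of 25 in the sink (1 + 3 + 5 + 7 + 9). The filter never has to loop on its own, because the driver skips empty chunks.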