Conversation

Contributor

@ryzhyk ryzhyk commented Jan 27, 2026

This PR implements two building blocks for various multiway join
algorithms:

  • The Match operator iterates over common keys of multiple indexed streams,
    calling a user-provided closure for each key. The closure can implement
    standard join semantics, but it can also do something different, e.g., it
    can compute the count of values for each key, which can be used as part of
    wcoj-like schemes.

  • The star join operator is a generalization of the 2-way join to compute a
    join of multiple streams on a common join key. It comes in two flavors:

    • an inner-star-join that works in both root and nested circuits
    • a star-join that supports any combination of inner and outer joins and that
      is only defined in the root scope.
    Both flavors support regular join, join_index, and join_flatmap forms.

    The star join operator is built on top of the Match operator.
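The idea of iterating over the common keys of several indexed collections and handing each key to a closure can be sketched as follows. This is only an illustration over plain `BTreeMap`s, not the DBSP `Match` operator or its API; `for_each_common_key` and `join_counts` are hypothetical names introduced here.

```rust
use std::collections::BTreeMap;

/// Calls `f` once for every key that is present in all `inputs`, passing the
/// key and, for each input, the slice of values stored under that key.
/// (Hypothetical helper; stands in for the role the PR assigns to `Match`.)
fn for_each_common_key<K: Ord, V>(
    inputs: &[BTreeMap<K, Vec<V>>],
    mut f: impl FnMut(&K, &[&[V]]),
) {
    let Some((first, rest)) = inputs.split_first() else {
        return;
    };
    for (key, vals) in first {
        let mut groups: Vec<&[V]> = vec![vals.as_slice()];
        for other in rest {
            match other.get(key) {
                Some(v) => groups.push(v.as_slice()),
                // The key is missing from one input, so it is not a common key.
                None => break,
            }
        }
        if groups.len() == inputs.len() {
            f(key, &groups);
        }
    }
}

/// A closure that does not materialize the join: it only counts how many
/// output tuples each common key would produce.
fn join_counts<K: Ord + Clone, V>(inputs: &[BTreeMap<K, Vec<V>>]) -> Vec<(K, usize)> {
    let mut out = Vec::new();
    for_each_common_key(inputs, |key, groups| {
        let count = groups.iter().map(|g| g.len()).product::<usize>();
        out.push((key.clone(), count));
    });
    out
}
```

The same driver could run a closure that emits the cross product of the value groups, which would give ordinary inner-join semantics.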

Star join is a variadic operator that applies to streams with multiple
different value types. It cannot be expressed as a single strongly typed
function. Rather, a separate function is required for every distinct number of
arguments. Instead of creating a fixed set of such functions, we define macros
that a client program can use to instantiate any number of such functions
for every number of arguments required by the program.
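A minimal sketch of why a macro is a natural fit here: each arity needs its own function because every input has its own value type. The macro below uses only `macro_rules!` and joins `BTreeMap`s; the names `define_inner_star_join`, `inner_star_join2`, and `inner_star_join3` are hypothetical and do not correspond to the macros this PR adds.

```rust
use std::collections::BTreeMap;

/// Generates a function `$name` that inner-joins N maps on their common key.
/// Every input may carry a different value type, so each arity is a
/// separate, strongly typed function.
macro_rules! define_inner_star_join {
    ($name:ident, $first:ident: $F:ident $(, $arg:ident: $V:ident)*) => {
        fn $name<K: Ord + Clone, $F: Clone $(, $V: Clone)*>(
            $first: &BTreeMap<K, $F>
            $(, $arg: &BTreeMap<K, $V>)*
        ) -> Vec<(K, ($F, $($V,)*))> {
            let mut out = Vec::new();
            for (key, first_val) in $first {
                // Emit a row only if every other input also contains the key.
                if let ($(Some($arg),)*) = ($($arg.get(key),)*) {
                    out.push((key.clone(), (first_val.clone(), $($arg.clone(),)*)));
                }
            }
            out
        }
    };
}

// A client instantiates exactly the arities it needs.
define_inner_star_join!(inner_star_join2, a: V1, b: V2);
define_inner_star_join!(inner_star_join3, a: V1, b: V2, c: V3);
```

With these two instantiations, `inner_star_join3(&a, &b, &c)` returns one `(key, (v1, v2, v3))` row per key present in all three maps.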

Contributor

gz commented Jan 29, 2026

very exciting

Contributor Author

ryzhyk commented Jan 29, 2026

> very exciting

That "TBD" is really building some suspense, doesn't it?

@ryzhyk ryzhyk marked this pull request as ready for review January 29, 2026 16:50
Copilot AI review requested due to automatic review settings January 29, 2026 16:50
@ryzhyk ryzhyk added the DBSP core (Related to the core DBSP library) and performance labels Jan 29, 2026
Contributor

Copilot AI left a comment

Pull request overview

Implements variadic multiway join support via a new Match operator and star-join operators (inner + mixed inner/outer), with supporting runtime/scheduler plumbing changes.

Changes:

  • Added dynamic Match + StarJoin implementations and public macros to generate N-ary join APIs.
  • Refactored join/saturation infrastructure (runtime saturate flags, simplified saturate factories usage).
  • Introduced flush coordination in scheduling/execution to support multi-step operator flushing.

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| crates/dbsp/src/trace/cursor/saturating_cursor.rs | Switches saturating behavior from const-generic to runtime flag for more flexible outer-join usage. |
| crates/dbsp/src/trace.rs | Loosens trait bounds for batch merge helper used when merging multiple join outputs. |
| crates/dbsp/src/operator/star_join.rs | Adds macro-based API generator for N-ary star-join (typed surface API). |
| crates/dbsp/src/operator/non_incremental.rs | Adjusts executor transaction/flush interface (currently introduces unimplemented!() stubs). |
| crates/dbsp/src/operator/dynamic/saturate.rs | Simplifies dyn_saturate factories interface to use batch factories directly. |
| crates/dbsp/src/operator/dynamic/outer_join.rs | Updates outer-join implementation to match new saturate/join-trace APIs. |
| crates/dbsp/src/operator/dynamic/multijoin/star_join.rs | Implements dynamic star-join operator on top of the new Match operator + adds tests. |
| crates/dbsp/src/operator/dynamic/multijoin/match_keys.rs | Adds Match operator that iterates common keys of multiple streams and drives multiway join logic. |
| crates/dbsp/src/operator/dynamic/multijoin.rs | Exposes dynamic multijoin modules and re-exports factories/types. |
| crates/dbsp/src/operator/dynamic/join.rs | Reworks join trace saturation from const-generic to runtime field + updates call sites/tests. |
| crates/dbsp/src/operator/dynamic.rs | Exposes new multijoin module. |
| crates/dbsp/src/operator/apply_n.rs | Refactors n-ary apply helper into a free function and changes fixedpoint behavior. |
| crates/dbsp/src/operator.rs | Exports apply_n and adds star_join module. |
| crates/dbsp/src/circuit/schedule/dynamic_scheduler.rs | Adds flush state coordination across workers (piggybacked on commit coordination). |
| crates/dbsp/src/circuit/schedule.rs | Extends Scheduler/Executor traits with flush methods and adjusts executor implementations. |
| crates/dbsp/src/circuit/metadata.rs | Adds metadata labels used by Match operator stats. |
| crates/dbsp/src/circuit/circuit_builder.rs | Adds StreamMetadata::consume_token, Circuit::add_custom_node, and exposes internal stream helpers used by Match. |
| crates/dbsp/Cargo.toml | Adds seq-macro dependency for N-ary API generation. |
| Cargo.toml | Adds workspace dependency version for seq-macro. |

Comment on lines +880 to +896
let prefix = self.prefix.take().unwrap();

self.async_stream =
Some(Box::pin(self.inner.clone().async_eval(prefix, snapshots))
as Pin<
Box<dyn AsyncStream<Item = (O, bool, Option<Position>)>>,
>);

Copilot AI Jan 29, 2026

self.prefix.take().unwrap() will panic if a flush is requested before the operator has ever observed a Some(prefix) from preprocess_prefix (e.g., first transaction/commit with no prefix updates). Handle the None case by initializing prefix to an empty snapshot up-front, or by skipping the flush computation until a valid prefix is available.

Suggested change
- let prefix = self.prefix.take().unwrap();
- self.async_stream =
-     Some(Box::pin(self.inner.clone().async_eval(prefix, snapshots))
-         as Pin<
-             Box<dyn AsyncStream<Item = (O, bool, Option<Position>)>>,
-         >);
+ if let Some(prefix) = self.prefix.take() {
+     self.async_stream =
+         Some(Box::pin(self.inner.clone().async_eval(prefix, snapshots))
+             as Pin<
+                 Box<dyn AsyncStream<Item = (O, bool, Option<Position>)>>,
+             >);
+ }

Contributor Author

This would violate a key invariant on how transactions work. If that happens, a panic is the best course of action.


if let TransactionPhase::Committing(unflushed_operators) = &self.transaction_phase {
let commit_complete = self
let statuses = self
Contributor

Is this the fix to the transactions bug for the recursive circuits?

Contributor Author

yes

}

#[track_caller]
pub fn dyn_inner_star_join_index_mono(
Contributor

what are these?

Contributor

if these are inner joins, please call this star_inner_join

Contributor Author

I don't know about that: To me "inner star join" and "inner binary join" are more natural than "star inner join" and "binary inner join".

Contributor

The problem is that "Inner" is used frequently for inner implementation classes

}
}

impl<K, V> Stream<RootCircuit, OrdIndexedZSet<K, V>>
Contributor

I guess this code could not have been written in a typed setting.
So maybe these type erasures are good for something.

Contributor Author

yes!

&match_factories,
self.circuit().global_node_id().child(node_id),
self.circuit().clone(),
saturated.clone(),
Contributor

I don't really understand exactly how these saturated things will work in a multijoin setting.
The same collection could appear in a join tree on the left and on the right of some LEFT JOINs.

Contributor Author

yes, and I think it will work correctly in all these cases.

Contributor Author

(and I tested that it does in this PR)

@@ -0,0 +1,1100 @@
#[macro_export]
Contributor

could you add a list of the macros you depend on (so I know what documentation to look for)
I could identify paste! and seq!

Contributor Author

The problem with macros is you have to export all the ones you depend on, but you only need the ones that are documented.

/// Example generated function signature:
///
/// ```text
/// impl<K, V1> Stream<RootCircuit, OrdIndexedZSet<K, V1>>
Contributor

hey, look, SQL-like indexing from 1!

Contributor

these comments are nice, but you can add similar comments to the macros above

The Executor trait wraps a circuit scheduler and decides how many times to run
the circuit per parent clock tick: once for the top-level circuit, many times
for iterative circuits, and 0 or 1 times for a non-incremental circuit (an odd
beast, but that's beside the point here).

This trait is invoked in different ways for top-level and nested circuits: in
the top-level scope, the executor is invoked to commit the transaction and
track commit completion status.  In the nested scope, the executor is invoked
to flush the child circuit operator and track flush completion.  Instead of
defining different methods for these two situations, we overloaded the
`start_commit_transaction`/`is_commit_complete` methods to do two different
things, making it impossible to explain what each method is actually supposed
to do.

This commit introduces separate pairs of methods for flush and commit
operations.  This is still not perfect, since different implementations need to
implement different subsets of the methods, so we probably want two separate
traits.  We can do that in the future, but this is already much cleaner.
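The split described above could look roughly like the sketch below. It is an assumption-laden illustration, not the actual DBSP trait: `start_commit_transaction`, `is_commit_complete`, and `is_flush_complete` are named in these commit messages, while `start_flush`, `NestedExecutor`, and all signatures are hypothetical.

```rust
/// One pair of methods per operation, so each method has a single meaning.
/// (Signatures are assumed for illustration.)
trait Executor {
    fn start_commit_transaction(&mut self);
    fn is_commit_complete(&self) -> bool;
    fn start_flush(&mut self);
    fn is_flush_complete(&self) -> bool;
}

/// Toy executor for a nested circuit: it is flushed by its parent and never
/// commits a transaction itself.
struct NestedExecutor {
    unflushed_operators: usize,
    flushing: bool,
}

impl NestedExecutor {
    fn new(operators: usize) -> Self {
        Self { unflushed_operators: operators, flushing: false }
    }

    /// Pretend that one more operator finishes flushing on every step.
    fn step(&mut self) {
        if self.flushing && self.unflushed_operators > 0 {
            self.unflushed_operators -= 1;
        }
    }
}

impl Executor for NestedExecutor {
    // This implementation only needs the flush half of the trait, which is
    // the imperfection the commit message points out.
    fn start_commit_transaction(&mut self) {
        unimplemented!("nested circuits are flushed, not committed")
    }
    fn is_commit_complete(&self) -> bool {
        unimplemented!("nested circuits are flushed, not committed")
    }
    fn start_flush(&mut self) {
        self.flushing = true;
    }
    fn is_flush_complete(&self) -> bool {
        self.flushing && self.unflushed_operators == 0
    }
}
```

Splitting the trait in two, as the message suggests for the future, would remove the `unimplemented!` stubs.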

Signed-off-by: Leonid Ryzhyk <[email protected]>

This fixes a bug in the handling of recursive circuits that was introduced a
while ago with transactions. The is_flush_complete implementation for a
recursive subcircuit didn't take into account that even after the subcircuit
may have received the last input for the current transaction in the local
worker, concurrent workers may not have been flushed yet, and may produce
additional inputs for the local circuit via an exchange operator. The correct
behavior implemented here waits for the subcircuit to get flushed in all
concurrent workers. It could be easily implemented by adding a new consensus
object, but to avoid an extra round of communication, we combine it with the
transaction commit consensus.
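One way to picture combining the two decisions in a single round is for each worker to report both of its local statuses at once, with the global result being true only if every worker agrees. The sketch below is hypothetical: `WorkerStatus` and `combine` are illustrative names, and the real consensus mechanism is not shown on this page.

```rust
/// Local status that one worker contributes to a consensus round.
#[derive(Clone, Copy, Debug, PartialEq)]
struct WorkerStatus {
    flush_complete: bool,
    commit_complete: bool,
}

/// Reduces the statuses of all workers in one pass: a phase is globally
/// complete only when it is complete in every worker.
fn combine(statuses: &[WorkerStatus]) -> WorkerStatus {
    WorkerStatus {
        flush_complete: statuses.iter().all(|s| s.flush_complete),
        commit_complete: statuses.iter().all(|s| s.commit_complete),
    }
}
```

A worker whose local subcircuit has flushed still sees `flush_complete == false` in the combined result until all peers have flushed, which is the behavior the fix requires.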

Signed-off-by: Leonid Ryzhyk <[email protected]>

Improve the join test: run it in a multithreaded runtime.

Signed-off-by: Leonid Ryzhyk <[email protected]>

SATURATE was implemented as a compile-time generic argument. This doesn't work
for upcoming multiway join changes, where we want to build operators that
evaluate any combination of inner and outer joins. This commit makes SATURATE a
runtime value. This might be slightly less efficient due to the extra runtime
checks, but it's likely to be in the noise.
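The trade-off between a compile-time and a runtime flag can be illustrated with a toy lookup type. The sketch does not model what saturation does in DBSP; here the flag merely makes a missing key yield a placeholder `0`, and `ConstLookup`, `RuntimeLookup`, and `lookup_all` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Compile-time flag: `ConstLookup<true>` and `ConstLookup<false>` are two
/// distinct types, so they cannot be mixed freely in one collection.
struct ConstLookup<const SATURATE: bool> {
    data: BTreeMap<u32, i64>,
}

impl<const SATURATE: bool> ConstLookup<SATURATE> {
    fn get(&self, key: u32) -> Option<i64> {
        match self.data.get(&key) {
            Some(v) => Some(*v),
            None if SATURATE => Some(0), // placeholder for a missing key
            None => None,
        }
    }
}

/// Runtime flag: one type, with the behavior chosen per instance.
struct RuntimeLookup {
    data: BTreeMap<u32, i64>,
    saturate: bool,
}

impl RuntimeLookup {
    fn get(&self, key: u32) -> Option<i64> {
        match self.data.get(&key) {
            Some(v) => Some(*v),
            None if self.saturate => Some(0), // the extra runtime check
            None => None,
        }
    }
}

/// Because the flag is a field, inputs with different settings fit in one
/// slice, which is what an arbitrary mix of inner and outer inputs needs.
fn lookup_all(inputs: &[RuntimeLookup], key: u32) -> Vec<Option<i64>> {
    inputs.iter().map(|input| input.get(key)).collect()
}
```

With the const-generic version, every combination of flags across N inputs would be a different tuple of types; the runtime field trades that for one branch per lookup.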

Signed-off-by: Leonid Ryzhyk <[email protected]>

This commit implements two building blocks for various multiway join
algorithms:

* The `Match` operator iterates over common keys of multiple indexed streams,
  calling a user-provided closure for each key. The closure can implement
  standard join semantics, but it can also do something different, e.g., it
  can compute the count of values for each key, which can be used as part of
  wcoj-like schemes.

* The star join operator is a generalization of the 2-way join to compute a
  join of multiple streams on a common join key.  It comes in two flavors:
  - an inner-star-join that works in both root and nested circuits
  - a star-join that supports any combination of inner and outer joins and that
    is only defined in the root scope.
  Both flavors support regular join, join_index, and join_flatmap forms.

  The star join operator is built on top of the `Match` operator.

Star join is a variadic operator that applies to streams with multiple
different value types. It cannot be expressed as a single strongly typed
function. Rather, a separate function is required for every distinct number of
arguments. Instead of creating a fixed set of such functions, we define macros
that a client program can use to instantiate any number of such functions
for every number of arguments required by the program.

Signed-off-by: Leonid Ryzhyk <[email protected]>
@ryzhyk ryzhyk disabled auto-merge January 30, 2026 22:10
@ryzhyk ryzhyk enabled auto-merge January 30, 2026 22:10
@ryzhyk ryzhyk added this pull request to the merge queue Jan 30, 2026
Merged via the queue into main with commit db28cfe Jan 30, 2026
1 check passed
@ryzhyk ryzhyk deleted the multijoin branch January 30, 2026 23:22
Labels

DBSP core (Related to the core DBSP library), performance

4 participants