Description
Motivation
Situations often arise where it would be helpful to inject inputs at a point farther downstream in the pipeline than the root node. This is often useful during testing, where the behavior of individual pipeline steps needs to be examined without first running data and inputs all the way through the pipeline.
It is also useful when pipeline steps fail or must be re-run due to misconfiguration or other issues, such as a failure in an externally-configured service. In cases like these, it would be desirable to execute a partial re-run of a pipeline, starting from where the previous run left off. This would avoid duplication of (possibly expensive) work performed by earlier pipeline steps.
Note: The use-case for a partial re-run likely warrants some method of "replaying" pipeline inputs - this could be achieved by caching inputs in the manager's work queues, or something similar.
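One way to picture the "replay" idea is a work queue that remembers what it has already handed out. The sketch below is only an illustration under stated assumptions: `ReplayableQueue` and its methods are hypothetical names, not part of the existing manager, and a real implementation would need durable storage rather than in-memory deques.

```python
from collections import deque


class ReplayableQueue:
    """Hypothetical work queue that caches consumed inputs for later replay."""

    def __init__(self, history_limit=None):
        self._pending = deque()                        # inputs not yet consumed
        self._history = deque(maxlen=history_limit)    # inputs already consumed

    def push(self, item):
        self._pending.append(item)

    def pop(self):
        # Hand out the oldest pending input and remember it for replay.
        item = self._pending.popleft()
        self._history.append(item)
        return item

    def replay(self):
        """Re-enqueue every cached input, oldest first; return how many."""
        count = len(self._history)
        self._pending.extend(self._history)
        self._history.clear()
        return count


queue = ReplayableQueue()
queue.push("a")
queue.push("b")
first, second = queue.pop(), queue.pop()   # a step consumes both inputs, then fails
replayed = queue.replay()                  # the same inputs become available again
```

With `history_limit` set, only the most recent inputs are kept, which trades replay completeness for bounded memory.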
Additional Details
For a more concrete example, consider the following pipeline:
             +-------+      +--------+      +------+
datainput -> | start | ---> | middle | ---> | last | -> end
             +-------+      +--------+      +------+
If an error occurs in middle, we might have reason to send data from the datainput directly to middle, thus bypassing start. This might be implemented in a DataInput spec as follows:
    spec:
      data:
        <data block>
      target: middle  # add target: <node>

This would result in the DataInput's container pushing data to middle's work queue instead of the root node's.
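The intended resolution of the target option (use the named node if given, otherwise fall back to the root) might look like the following. This is a minimal sketch: `DataInputSpec` and `resolve_target` are hypothetical names chosen for illustration, and the real schema and node lookup may differ.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional


@dataclass
class DataInputSpec:
    """Hypothetical in-memory form of a DataInput spec."""
    data: Any
    target: Optional[str] = None   # proposed new field; None means "use the root node"


def resolve_target(spec: DataInputSpec, nodes: Iterable[str], root: str) -> str:
    """Return the node whose work queue should receive the DataInput's output."""
    if spec.target is None:
        return root
    if spec.target not in set(nodes):
        raise ValueError(f"unknown target node: {spec.target!r}")
    return spec.target


pipeline_nodes = ["start", "middle", "last"]
default_node = resolve_target(DataInputSpec(data={}), pipeline_nodes, root="start")
shortcut_node = resolve_target(
    DataInputSpec(data={}, target="middle"), pipeline_nodes, root="start"
)
```

Rejecting unknown node names up front is a design choice made here; the issue itself does not say how an invalid target should be handled.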
There are a few considerations / caveats:
- The DataInput schema will need to be updated to include the target: <node> option, specifying that the output queue of the DataInput should be something other than the root node. It will default to the root node if target is not specified.
- With the current implementation, a given node may have more than one work queue (incoming edge) that it reads inputs from in round-robin order. Shortcut inputs could be evenly distributed across those queues, all placed into one queue, or handled separately; the correct approach is unclear.
- While the DataInput can fairly easily be configured to pass data to a different step in the pipeline, it is less straightforward to get the underlying container to pass inputs that middle would care about (i.e. emulate start's output).
  - This is where an input "replay" will come in handy, but there is still the case where inputs are unavailable, such as during a test of a single pipeline step. This likely requires a new DataInput container to be created specifically for this purpose.
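The three options for placing shortcut inputs on a node with several incoming work queues can be compared with a small sketch. It models each work queue as a `collections.deque`; the function names are hypothetical and nothing here reflects the manager's actual queue implementation.

```python
from collections import deque
from itertools import cycle


def distribute_evenly(items, queues):
    """Option 1: spread shortcut inputs across the existing queues in turn."""
    if not queues:
        raise ValueError("node has no work queues")
    for item, queue in zip(items, cycle(queues)):
        queue.append(item)


def push_to_single(items, queues, index=0):
    """Option 2: place every shortcut input into one existing queue."""
    queues[index].extend(items)


def push_to_dedicated(items, queues):
    """Option 3: handle shortcut inputs separately via an extra queue."""
    shortcut_queue = deque(items)
    queues.append(shortcut_queue)
    return shortcut_queue


even = [deque(), deque()]
distribute_evenly([1, 2, 3], even)        # first queue gets 1 and 3, second gets 2

single = [deque(), deque()]
push_to_single([1, 2, 3], single)         # everything lands in the first queue

separate = [deque(), deque()]
push_to_dedicated([1, 2, 3], separate)    # a third queue joins the round-robin
```

Option 3 leaves the existing queues untouched, but the node's round-robin reader would then have to know about the extra queue, which is part of why the right choice remains open.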