@phil-opp phil-opp commented Oct 15, 2025

How it works

  • Set the DORA_TEST_WITH_INPUTS env variable with the path to your input JSON file
  • (Optional) Set the DORA_TEST_WRITE_OUTPUTS_TO env variable with the path where the outputs should be written. If not set, dora will write an outputs.jsonl file next to the given inputs file
  • Start the node executable/script
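The steps above can be sketched as follows. This is only an illustration of how one might launch a node under the test harness from a parent process; the file names and node command are placeholder examples, not part of dora's API.

```python
import os
import subprocess

# Build the environment for the node process.
# The paths here are examples; point them at your own files.
env = dict(
    os.environ,
    DORA_TEST_WITH_INPUTS="inputs.json",        # required: path to the input JSON file
    DORA_TEST_WRITE_OUTPUTS_TO="outputs.jsonl", # optional: where outputs are written
)

# Start the node executable/script with that environment, e.g.:
# subprocess.run(["python", "my_node.py"], env=env, check=True)

print(env["DORA_TEST_WITH_INPUTS"], env["DORA_TEST_WRITE_OUTPUTS_TO"])
```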

The node will be run as usual, but its event channel will be filled from the given inputs JSON file. No connection to a dora daemon will be made.
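The replay behavior can be sketched in a few lines of Python. This is a simplified stand-in for what dora does internally (the real logic lives inside the node API, not in user code): parse the spec, then deliver each event at its `time_offset_secs` relative to node start.

```python
import json
import time

def replay_events(spec_json, sleep=time.sleep):
    """Yield the events from an input spec in timestamp order,
    sleeping until each event's time offset is reached.

    `sleep` is injectable so tests can run without real delays.
    """
    spec = json.loads(spec_json)
    elapsed = 0.0
    for event in sorted(spec["events"], key=lambda e: e["time_offset_secs"]):
        sleep(event["time_offset_secs"] - elapsed)
        elapsed = event["time_offset_secs"]
        yield event

spec = """{
    "id": "foo",
    "events": [
        {"time_offset_secs": 0.7, "type": "Input", "id": "tick"},
        {"time_offset_secs": 1.2, "type": "Stop"}
    ]
}"""

# Replay instantly by injecting a no-op sleep.
types = [event["type"] for event in replay_events(spec, sleep=lambda _: None)]
print(types)  # ['Input', 'Stop']
```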

Input JSON file format example:

{
    // ID of the node
    "id": "foo",
    // defines the events that the node should receive
    "events": [
        {
            // specifies when the event arrives (seconds since node start)
            "time_offset_secs": 0.7,
            // type of the event (supported types are `Input`, `Stop`, `InputClosed`, `AllInputsClosed`)
            "type": "Input",
            // input ID
            "id": "tick"
            // optional: `data` field with input data
        },
        {
            "time_offset_secs": 0.9,
            "type": "Input",
            "id": "tick"
        },
        {
            "time_offset_secs": 1.2,
            "type": "Stop"
        }
    ]
    // other supported fields: name, description, args, env, outputs, inputs, send_stdout_as (they all behave as in dataflow.yaml)
}

Output JSON file format example:

{"id":"random","data":9267023440904143729,"time_offset_secs":0.700793541}
{"id":"random","data":5753749540645363621,"time_offset_secs":0.900897584}

TODO:

  • Documentation
    • API docs
    • dora-rs.ai docs
  • add some tests for our examples and use them on our CI
  • take a look at the arrow_integration_test JSON format -> it might be better suited than our custom input JSON format
  • add option (via env variable) to write out received inputs as inputs.json files during normal dataflow operation -> to make creating files with complex input data easier
  • add option (via env variable) to omit time offsets in output formats -> to make them diff-able with expected outputs (the time offsets are a bit different on each run)
  • use the ArrowTestUnwrap format for outputs.jsonl (not possible because `ArrowJsonBatch::from_batch` is incomplete, see apache/arrow-rs#8684)
    • instead: Include data type in output JSON file

Add an optional `data_format` field that specifies the format of the `data` field. It defaults to deriving the schema from the given JSON object and converting it to the closest Arrow representation. The `ArrowTest` and `ArrowTestUnwrap` formats expect the `data` field to follow the Arrow integration test data format. The `ArrowTestUnwrap` format unwraps the first column of the deserialized RecordBatch to make other Arrow types representable (i.e. not just StructArrays).
The arrow integration test format crate panics in certain situations, which leads to a closed integration test channel. In that case we want to panic on the sending side too, to avoid endless loops.
Useful for diffing the file against an expected file (as time offsets are not deterministic).
@phil-opp

I opened apache/arrow-rs#8737 to add support for binary decoding to arrow-json.

The `arrow_integration_test` crate is incomplete and apparently only for internal use.
- wrap values if necessary
- avoid double-wrapping array values