Dataflow-rs
A high-performance workflow engine for building data processing pipelines in Rust with zero-overhead JSONLogic evaluation.
Dataflow-rs is a Rust library for building high-performance data processing pipelines in which all JSONLogic is pre-compiled, eliminating runtime compilation overhead. Its modular architecture separates compilation from execution, ensuring predictable low-latency performance, and built-in multi-threading (ThreadedEngine) plus work-stealing parallel processing (RayonEngine) provide excellent vertical scaling. Whether you're building REST APIs, processing Kafka streams, or creating sophisticated data transformation pipelines, Dataflow-rs delivers enterprise-grade performance with minimal complexity.
🚀 Key Features
- Zero Runtime Compilation: All JSONLogic expressions pre-compiled at startup for optimal performance.
- Multi-Threading Support: Built-in ThreadedEngine with configurable thread pools for vertical scaling.
- Parallel Processing: RayonEngine leverages work-stealing for CPU-intensive workloads.
- Modular Architecture: Clear separation between compilation (LogicCompiler) and execution (InternalExecutor).
- Direct DataLogic Instantiation: Each engine has its own DataLogic instance for zero contention.
- Immutable Workflows: Workflows compiled once at initialization for predictable performance.
- Dynamic Workflows: Use JSONLogic to control workflow execution based on your data.
- Extensible: Easily add your own custom processing steps (tasks) to the engine.
- Built-in Functions: Comes with thread-safe implementations of data mapping and validation.
- Resilient: Built-in error handling and retry mechanisms to handle transient failures.
- Auditing: Keep track of all the changes that happen to your data as it moves through the pipeline.
🏁 Getting Started
Here's a quick example to get you up and running.
1. Add to Cargo.toml
```toml
[dependencies]
dataflow-rs = "1.0.8"
serde_json = "1.0"
```
2. Create a Workflow
Workflows are defined in JSON and consist of a series of tasks.
```json
{
  "id": "data_processor",
  "name": "Data Processor",
  "tasks": [
    {
      "id": "transform_data",
      "function": {
        "name": "map",
        "input": {
          "mappings": [
            {
              "path": "data.user_name",
              "logic": { "var": "temp_data.name" }
            },
            {
              "path": "data.user_email",
              "logic": { "var": "temp_data.email" }
            }
          ]
        }
      }
    }
  ]
}
```
3. Run the Engine
```rust
use dataflow_rs::{Engine, Workflow};
use dataflow_rs::engine::message::Message;
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define workflows
    let workflow_json = r#"{ ... }"#; // Your workflow JSON from above
    let workflow = Workflow::from_json(workflow_json)?;

    // Create engine with workflows (immutable after creation)
    let mut engine = Engine::new(
        vec![workflow], // Workflows to compile and cache
        None,           // Custom functions (optional)
        None,           // Retry config (optional)
    );

    // Process a single message
    let mut message = Message::new(&json!({}));
    engine.process_message(&mut message)?;

    println!("✅ Processed result: {}", serde_json::to_string_pretty(&message.data)?);
    Ok(())
}
```
4. Multi-Threading with ThreadedEngine
For high-performance multi-threaded processing, use the built-in ThreadedEngine:
```rust
use dataflow_rs::{ThreadedEngine, Workflow};
use dataflow_rs::engine::message::Message;
use serde_json::json;
use std::sync::Arc;
use std::thread;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define your workflows
    let workflow_json = r#"{ ... }"#; // Your workflow JSON
    let workflow = Workflow::from_json(workflow_json)?;

    // Create ThreadedEngine with 4 worker threads
    let engine = Arc::new(ThreadedEngine::new(
        vec![workflow], // Workflows
        None,           // Custom functions (optional)
        None,           // Retry config (optional)
        4,              // Number of worker threads
    ));

    // Process messages concurrently from multiple client threads
    let mut handles = Vec::new();
    for i in 0..1000 {
        let engine = Arc::clone(&engine);
        let handle = thread::spawn(move || {
            let message = Message::new(&json!({"id": i}));
            engine.process_message_sync(message)
        });
        handles.push(handle);
    }

    // Wait for all messages to complete
    for handle in handles {
        handle.join().unwrap()?;
    }

    println!("✅ Processed 1000 messages with ThreadedEngine!");
    Ok(())
}
```
The ThreadedEngine provides:
- Configurable thread pool for vertical scaling
- Work queue distribution for efficient load balancing
- Graceful shutdown support
- Both sync and async APIs for flexible integration
- Health monitoring and worker restart capabilities
5. High-Performance Parallel Processing with RayonEngine
For CPU-intensive workloads requiring maximum parallelism, use RayonEngine:
```rust
use dataflow_rs::{RayonEngine, Workflow, Message};
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define your workflows
    let workflow_json = r#"{ ... }"#; // Your workflow JSON
    let workflow = Workflow::from_json(workflow_json)?;

    // Create RayonEngine (uses all CPU cores by default)
    let engine = RayonEngine::new(
        vec![workflow], // Workflows
        None,           // Custom functions (optional)
        None,           // Retry config (optional)
    );

    // Process a batch of messages in parallel
    let messages: Vec<Message> = (0..10000)
        .map(|i| Message::new(&json!({"id": i})))
        .collect();
    let results = engine.process_batch(messages);

    // Handle results
    for result in results {
        match result {
            Ok(message) => println!("Processed: {:?}", message.data),
            Err(e) => eprintln!("Error: {:?}", e),
        }
    }

    println!("✅ Processed 10000 messages with RayonEngine!");
    Ok(())
}
```
The RayonEngine provides:
- Work-stealing parallelism for automatic load balancing
- Thread-local engines for zero contention
- Batch processing for maximum throughput
- Stream processing with parallel iterators
- CPU-optimized for compute-intensive workloads
✨ Core Concepts
- Engine: High-performance engine with pre-compiled logic and immutable workflows.
- ThreadedEngine: Multi-threaded variant with configurable thread pool for vertical scaling.
- RayonEngine: Parallel processing engine using Rayon's work-stealing for CPU-bound workloads.
- LogicCompiler: Compiles all JSONLogic expressions at initialization for zero runtime overhead.
- InternalExecutor: Executes built-in functions using pre-compiled logic from the cache.
- Workflow: A sequence of tasks executed in order, with conditions accessing only metadata.
- Task: A single processing step with optional JSONLogic conditions.
- Message: The data structure flowing through workflows with audit trail support.
🔧 Choosing Between Engine Types
Use Engine when:
- Processing messages sequentially in a single thread
- Embedding in async runtimes (tokio, async-std)
- Maximum performance for individual message processing
- Simple integration without thread management
Use ThreadedEngine when:
- Need to process multiple messages concurrently
- Want vertical scaling on multi-core systems
- Have mixed I/O and CPU-bound workloads
- Need a built-in thread pool with work distribution
Use RayonEngine when:
- Processing large batches of messages in parallel
- CPU-intensive workloads with minimal I/O
- Need automatic work-stealing for load balancing
- Want maximum CPU utilization across all cores
🏗️ Architecture
The v3.0 architecture focuses on simplicity and performance through clear separation of concerns:
Compilation Phase (Startup)
- LogicCompiler compiles all JSONLogic expressions from workflows and tasks
- Creates an indexed cache of compiled logic for O(1) runtime access
- Validates all logic expressions early, failing fast on errors
- Stores compiled logic in contiguous memory for cache efficiency
Execution Phase (Runtime)
- Engine orchestrates message processing through immutable workflows
- InternalExecutor evaluates conditions and executes built-in functions
- Uses compiled logic from cache - zero compilation overhead at runtime
- Direct DataLogic instantiation eliminates any locking or contention
Key Design Decisions
- Immutable Workflows: All workflows defined at engine creation, cannot be modified
- Pre-compilation: All expensive parsing/compilation done once at startup
- Direct Instantiation: Each engine owns its DataLogic instance directly
- Modular Design: Clear boundaries between compilation, execution, and orchestration
⚡ Performance
Dataflow-rs achieves optimal performance through architectural improvements:
- Pre-Compilation: All JSONLogic compiled at startup, zero runtime overhead
- Multi-Threading: Built-in ThreadedEngine for vertical scaling with configurable worker threads
- Parallel Processing: RayonEngine with work-stealing for maximum CPU utilization
- Cache-Friendly: Compiled logic stored contiguously in memory
- Direct Instantiation: DataLogic instances created directly without locking
- Predictable Latency: No runtime allocations for logic evaluation
- Modular Design: Clear separation of compilation and execution phases
Run the included benchmarks to test performance on your hardware:
```sh
cargo run --example benchmark          # Comprehensive comparison
cargo run --example threaded_benchmark # ThreadedEngine performance
cargo run --example rayon_benchmark    # RayonEngine performance
```
The benchmarks compare single-threaded Engine vs multi-threaded ThreadedEngine vs parallel RayonEngine with various configurations, providing detailed performance metrics and scaling analysis.
🛠️ Custom Functions
You can extend the engine with your own custom logic by implementing the AsyncFunctionHandler trait:
```rust
use async_trait::async_trait;
use dataflow_rs::engine::{AsyncFunctionHandler, FunctionConfig, error::Result, message::{Change, Message}};
use datalogic_rs::DataLogic;
use serde_json::{json, Value};
use std::collections::HashMap;
use std::sync::Arc;

pub struct MyCustomFunction;

#[async_trait]
impl AsyncFunctionHandler for MyCustomFunction {
    async fn execute(
        &self,
        message: &mut Message,
        config: &FunctionConfig,
        datalogic: Arc<DataLogic>,
    ) -> Result<(usize, Vec<Change>)> {
        // Your custom logic here (can be async or sync)
        println!("Hello from a custom async function!");

        // Modify message data
        message.data["processed"] = json!(true);

        // Return status code and changes for audit trail
        Ok((200, vec![Change {
            path: Arc::from("data.processed"),
            old_value: Arc::new(json!(null)),
            new_value: Arc::new(json!(true)),
        }]))
    }
}

// Register when creating the engine:
let mut custom_functions = HashMap::new();
custom_functions.insert(
    "my_custom_function".to_string(),
    Box::new(MyCustomFunction) as Box<dyn AsyncFunctionHandler + Send + Sync>,
);
let engine = Engine::new(
    workflows,
    Some(custom_functions), // Custom async functions
    None,                   // Retry config (optional)
);
```
🤝 Contributing
We welcome contributions! Feel free to fork the repository, make your changes, and submit a pull request. Please make sure to add tests for any new features.
🏢 About Plasmatic
Dataflow-rs is developed by the team at Plasmatic. We're passionate about building open-source tools for data processing.
📄 License
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.