This project provides an automated solution for migrating ETL pipelines from Apache Spark Dataflow to Apache DataFusion, leveraging the power of Agentic AI. Apache Spark is a widely adopted distributed data processing engine, popular for its scalability and rich ecosystem. However, as data engineering evolves, there is growing interest in modern, high-performance alternatives like Apache DataFusion—a Rust-based query engine designed for fast, in-memory analytics with a focus on safety and efficiency.
The migration process is often complex due to differences in APIs, execution models, and language ecosystems (Scala/Python for Spark vs. Rust for DataFusion). This project addresses these challenges by using an agentic approach, where autonomous AI agents analyze Spark Dataflow code, interpret transformation logic, and generate equivalent DataFusion code. The system supports both Python and Scala Spark pipelines, and outputs Rust code compatible with DataFusion.
- Seamless Migration: Enable smooth transition from Spark Dataflow to DataFusion with minimal manual effort.
- High Fidelity: Ensure the generated DataFusion code accurately reflects the original Spark logic.
- Performance Optimization: Leverage DataFusion's capabilities to enhance the performance of the migrated pipelines.
- User-Friendly: Provide clear documentation and support to assist users in the migration process.
- Extensibility: Design the system to be easily extendable for future enhancements or support for additional Spark constructs.
-
Automated Migration: Converts Spark Dataflow pipelines to DataFusion code with no manual intervention.
-
Language Support: Handles both Python and Scala Spark pipelines, generating Rust code for Data Fusion.
-
Agentic AI: Utilizes AI agents to analyze and understand the semantics of Spark Dataflow, ensuring accurate and efficient migration.
-
Extensible Architecture: Designed to be easily extended for additional features or support for more complex Spark constructs.
-
Input Analysis: The system takes Spark Dataflow code as input, either in
Python or Scala.
-
Semantic Understanding: AI agents analyze the input code to understand the data transformations, aggregations, and other operations performed.
-
Code Generation: Based on the analysis, the system generates equivalent Rust code that
implements the same logic using DataFusion's APIs.
-
Output: The generated Rust code is outputted, ready to be integrated into
DataFusion projects.
- Performance: DataFusion's Rust-based architecture provides significant performance improvements over traditional Spark pipelines due to its in-memory execution model and efficient query planning.
- Safety: Rust's strong type system and memory safety features reduce runtime errors and improve code reliability.
- Scalability: DataFusion is designed for high-performance analytics, making it suitable
for large-scale data processing tasks.
-
Reduced Complexity: Automating the migration process simplifies the transition from Spark to DataFusion and reduces the need for extensive manual rewriting of code.
-
Modernization: Organizations can modernize their data processing pipelines by adopting DataFusion, which is more aligned with current trends in data engineering.
-
Community Support: DataFusion is part of the Apache Software Foundation, ensuring a robust community and ongoing development.
-
Legacy System Modernization: Organizations with existing Spark Dataflow pipelines can migrate to DataFusion to take advantage of its performance and safety features.
-
New Projects: Data engineers can start new projects directly in DataFusion, using the migration tool to convert any existing Spark code they may have.
-
Cross-Platform Compatibility: Teams working with both Spark and DataFusion can use this tool to maintain consistency across their codebases, allowing for easier collaboration and integration.
-
**Data Analytics **: DataFusion's capabilities make it suitable for complex data analytics tasks, and this migration tool enables teams to leverage existing Spark analytics code in a more efficient environment.
-
ETL Pipeline Migration: Organizations looking to migrate their ETL pipelines from Spark to DataFusion can use this tool to automate the process, ensuring a smooth transition with minimal disruption.
-
Clone the Repository: Start by cloning the project repository from GitHub.
git clone [email protected]:eachsaj/morphai.git
-
Install Dependencies: Navigate to the project directory and install the required dependencies.
cd morphaicreate a python virtual environment For macOS/Linux:
python -m venv venv source venv/bin/activate pip install -r requirements.txtFor Windows:
python -m venv venv venv\Scripts\activate pip install -r requirements.txt
Dependency: Install UVicorn
pip install uv
-
Install Rust: Ensure you have Rust installed on your system. You can install it using rustup.
For testing translated code snippets, you can use the following command to install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
-
Set Up Environment Variables:
- Create a
.envfile in the project root and configure the necessary environment variables. You can refer to the.env.examplefile for guidance.
OPENAI_API_KEY=<YOUR_OPENAI_API_KEY> LANGSMITH_ENDPOINT=https://api.smith.langchain.com LANGSMITH_TRACING=true LANGSMITH_API_KEY=<YOUR_LANGSMITH_API_KEY> LANGSMITH_PROJECT=<YOUR_LANGSMITH_PROJECT>
- Create a
To see the application in action, you can use the following command:
If your source Spark application reads files, please provide the absolute path to the file(s). Or make sure the same path is available in the generated DataFusion project.
python src/sdm_agent/application.py --help
Run the migration agent application.
options:
-h, --help show this help message and exit
--source SOURCE Path to the source project.
--target TARGET Path to the target project.
- Spark source application
- Migrate this application using MorphAI.
- Generated DataFusion project and Execution Result
The performance of the migration tool has been benchmarked against TPC-H Spark Dataflow pipelines, demonstrating significant improvements in execution speed and resource utilization when running the migrated DataFusion code.
TPC-H Benchmark Spark Code selected top 10 queries out of 21 queries for the migration benchmark.
The migration results show a significant improvement in execution speed and resource utilization when running the migrated DataFusion code compared to the original Spark code.
- All data flows are migrated with 100% accuracy in mapping input sources, business logic implementations, and output sources.
- Query 1 indicates 7.5% cell accuracy. This is not a side effect of migration. This is due to the computation accuracy issues in Apache Spark when aggregating high-precision floating-point values. More specifically , it is a Java problem (known as the “summation problem”(Lafage, 2020) ).
- Query 1 Row accuracy impacted by the cell accuracy
- All other data pipelines/queries’ cell and row accuracy are 100% which means exact matching results compared with the ground truth dataset.
- Overall accuracy of migrated data fusion pipelines is 99.9%
Precision was measured as the ratio of correctly aligned cells for each query. With the exception of the first query, all achieved 100% precision, resulting in an overall precision of 90%. These results demonstrate that the migrated Rust pipelines are highly reliable and successfully fulfill the system’s intended goals.
This work shows that the migration framework delivers consistent and dependable results. It successfully carries over the intent and functionality of existing dataflows while minimizing errors. The automated approach simplifies the transition process, cuts down manual effort, and provides a safer path toward modern data infrastructure. By leveraging intelligent agents, the system achieves reliable transformations and positions teams to focus on optimization rather than translation.









