eachsaj/morphai

Apache Spark Dataflow Migration to Apache DataFusion Using Agentic AI

Introduction

This project provides an automated solution for migrating ETL pipelines from Apache Spark Dataflow to Apache DataFusion, leveraging the power of Agentic AI. Apache Spark is a widely adopted distributed data processing engine, popular for its scalability and rich ecosystem. However, as data engineering evolves, there is growing interest in modern, high-performance alternatives like Apache DataFusion—a Rust-based query engine designed for fast, in-memory analytics with a focus on safety and efficiency.

The migration process is often complex due to differences in APIs, execution models, and language ecosystems (Scala/Python for Spark vs. Rust for DataFusion). This project addresses these challenges by using an agentic approach, where autonomous AI agents analyze Spark Dataflow code, interpret transformation logic, and generate equivalent DataFusion code. The system supports both Python and Scala Spark pipelines, and outputs Rust code compatible with DataFusion.

Design Objectives

  1. Seamless Migration: Enable smooth transition from Spark Dataflow to DataFusion with minimal manual effort.
  2. High Fidelity: Ensure the generated DataFusion code accurately reflects the original Spark logic.
  3. Performance Optimization: Leverage DataFusion's capabilities to enhance the performance of the migrated pipelines.
  4. User-Friendly: Provide clear documentation and support to assist users in the migration process.
  5. Extensibility: Design the system to be easily extendable for future enhancements or support for additional Spark constructs.

The solution migrates Apache Spark dataflows to Apache DataFusion dataflows.

[Figure: Design Objectives]

Architecture

[Figure: Architecture]

Functions

[Figure: Functions]

Features

  • Automated Migration: Converts Spark Dataflow pipelines to DataFusion code with no manual intervention.

  • Language Support: Handles both Python and Scala Spark pipelines, generating Rust code for DataFusion.

  • Agentic AI: Utilizes AI agents to analyze and understand the semantics of Spark Dataflow, ensuring accurate and efficient migration.

  • Extensible Architecture: Designed to be easily extended for additional features or support for more complex Spark constructs.

How It Works

  1. Input Analysis: The system takes Spark Dataflow code as input, in either Python or Scala.

  2. Semantic Understanding: AI agents analyze the input code to understand the data transformations, aggregations, and other operations performed.

  3. Code Generation: Based on the analysis, the system generates equivalent Rust code that implements the same logic using DataFusion's APIs.

  4. Output: The generated Rust code is written out, ready to be integrated into DataFusion projects.
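
The analysis and code-generation steps above can be pictured with a small sketch. It does not show how MorphAI's agents work internally (the README does not describe that); it only illustrates, with Python's standard `ast` module, how a chained PySpark expression can be unwound into an ordered list of operations and matched against DataFusion DataFrame method names. The lookup table `SPARK_TO_DATAFUSION` and both helper functions are hypothetical.

```python
import ast

# Hypothetical lookup table (an assumption for illustration):
# PySpark DataFrame method -> DataFusion (Rust) DataFrame method.
SPARK_TO_DATAFUSION = {
    "filter": "filter",
    "where": "filter",
    "select": "select",
    "groupBy": "aggregate",
    "orderBy": "sort",
    "limit": "limit",
}


def method_chain(expr_source):
    """Return the method names of a chained call expression, in call order."""
    node = ast.parse(expr_source, mode="eval").body
    chain = []
    # The outermost call is the last method in the chain, so walk inward.
    while isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        chain.append(node.func.attr)
        node = node.func.value
    chain.reverse()
    return chain


def migration_plan(expr_source):
    """Pair each Spark method with a DataFusion method, or None if unmapped."""
    return [(name, SPARK_TO_DATAFUSION.get(name)) for name in method_chain(expr_source)]


# The expression is only parsed, never executed, so no Spark install is needed.
spark_expr = 'df.filter(df.amount > 100).groupBy("region").count().orderBy("region")'
for spark_name, datafusion_name in migration_plan(spark_expr):
    print(spark_name, "->", datafusion_name or "needs agent review")
```

Operations without a table entry (here `count`) are exactly the cases where a fixed mapping falls short and semantic analysis is needed.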

Benefits

  • Performance: DataFusion's Rust-based architecture provides significant performance improvements over traditional Spark pipelines due to its in-memory execution model and efficient query planning.
  • Safety: Rust's strong type system and memory safety features reduce runtime errors and improve code reliability.
  • Scalability: DataFusion is designed for high-performance analytics, making it suitable for large-scale data processing tasks.

  • Reduced Complexity: Automating the migration process simplifies the transition from Spark to DataFusion and reduces the need for extensive manual rewriting of code.

  • Modernization: Organizations can modernize their data processing pipelines by adopting DataFusion, which is more aligned with current trends in data engineering.

  • Community Support: DataFusion is part of the Apache Software Foundation, ensuring a robust community and ongoing development.

Use Cases

  • Legacy System Modernization: Organizations with existing Spark Dataflow pipelines can migrate to DataFusion to take advantage of its performance and safety features.

  • New Projects: Data engineers can start new projects directly in DataFusion, using the migration tool to convert any existing Spark code they may have.

  • Cross-Platform Compatibility: Teams working with both Spark and DataFusion can use this tool to maintain consistency across their codebases, allowing for easier collaboration and integration.

  • Data Analytics: DataFusion's capabilities make it suitable for complex data analytics tasks, and this migration tool enables teams to leverage existing Spark analytics code in a more efficient environment.

  • ETL Pipeline Migration: Organizations looking to migrate their ETL pipelines from Spark to DataFusion can use this tool to automate the process, ensuring a smooth transition with minimal disruption.

How to set up the project

  1. Clone the Repository: Start by cloning the project repository from GitHub.

    git clone git@github.com:eachsaj/morphai.git
  2. Install Dependencies: Navigate to the project directory and install the required dependencies.

    cd morphai

    Create a Python virtual environment. For macOS/Linux:

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    

    For Windows:

    python -m venv venv
    venv\Scripts\activate
    pip install -r requirements.txt
    

    Dependency: Install uv

    pip install uv
  3. Install Rust: Rust is needed to test the translated code snippets. If it is not already installed on your system, install it with rustup:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  4. Set Up Environment Variables: Create a .env file in the project root and configure the necessary environment variables, using the .env.example file as a guide.

    OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
    LANGSMITH_ENDPOINT=https://api.smith.langchain.com
    LANGSMITH_TRACING=true
    LANGSMITH_API_KEY=<YOUR_LANGSMITH_API_KEY>
    LANGSMITH_PROJECT=<YOUR_LANGSMITH_PROJECT>
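
Before running the tool, it can help to confirm that the .env file is complete. The following is a minimal sketch using only the standard library; the helper names `parse_env` and `missing_keys` are hypothetical, and treating all five listed variables as required is an assumption, not something the README states.

```python
# Variable names come from the .env listing above; treating all five as
# required is an assumption made for this sketch.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "LANGSMITH_ENDPOINT",
    "LANGSMITH_TRACING",
    "LANGSMITH_API_KEY",
    "LANGSMITH_PROJECT",
]


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text):
    """Return required keys that are absent, empty, or still a <PLACEHOLDER>."""
    values = parse_env(text)
    return [
        key for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith("<")
    ]


example = """
OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
LANGSMITH_TRACING=true
"""
print(missing_keys(example))
```

To check a real file, pass its contents instead of `example`, for instance `missing_keys(open(".env").read())`.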

How to run the application

Run the application from the command line once the requirement below is met.

Key Requirements

If your source Spark application reads files, use absolute paths to those files, or make sure the same paths are available in the generated DataFusion project.
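
One way to spot paths that may break after migration is to scan a PySpark source file for string literals that look like relative data paths. This is a rough sketch, not part of MorphAI: the function name `relative_data_paths` and the suffix list are assumptions, it handles Python sources only, and it treats paths as POSIX-style.

```python
import ast
from pathlib import PurePosixPath

# File suffixes treated as data files; this list is an assumption.
DATA_SUFFIXES = (".csv", ".parquet", ".json", ".orc")


def relative_data_paths(source):
    """Return string literals in Python source that look like relative data paths."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            text = node.value
            if "://" in text:  # skip URIs such as s3://bucket/key
                continue
            if text.lower().endswith(DATA_SUFFIXES) and not PurePosixPath(text).is_absolute():
                found.add(text)
    return sorted(found)


# The snippet is only parsed, never executed, so `spark` need not exist.
spark_source = '''
orders = spark.read.csv("data/orders.csv", header=True)
lineitem = spark.read.parquet("/warehouse/tpch/lineitem.parquet")
'''
print(relative_data_paths(spark_source))
```

Any path the function reports is a candidate for conversion to an absolute path before migration.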

To run the application:

 python src/sdm_agent/application.py --help
 
     Run the migration agent application.
         options:
             -h, --help       show this help message and exit
             --source SOURCE  Path to the source project.
             --target TARGET  Path to the target project.
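
The help text above implies a small command-line interface. The sketch below rebuilds an equivalent parser with `argparse` to show how the two options are read. It is a reconstruction from the help output only, not the code in `application.py`, and the example paths are placeholders.

```python
import argparse


def build_parser():
    """Build a parser mirroring the --help output shown above."""
    parser = argparse.ArgumentParser(
        description="Run the migration agent application."
    )
    parser.add_argument("--source", help="Path to the source project.")
    parser.add_argument("--target", help="Path to the target project.")
    return parser


# Example paths are placeholders, not paths from the repository.
args = build_parser().parse_args(
    ["--source", "/path/to/spark_project", "--target", "/path/to/datafusion_project"]
)
print(args.source, args.target)
```

Going by the help output, the corresponding invocation would take the form `python src/sdm_agent/application.py --source /path/to/spark_project --target /path/to/datafusion_project`. Whether either option is mandatory is not stated, so the sketch leaves both optional.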

Demo Spark to DataFusion Migration Example

  1. Spark source application

[Figure: Demo Spark Source Application]

  2. Migrate this application using MorphAI.

[Figure: MorphAI Spark to DataFusion Migration]

  3. Generated DataFusion project and execution result

[Figure: Generated DataFusion Project, Code and Execution Result]

Performance Benchmark

The migration tool was benchmarked on TPC-H Spark Dataflow pipelines, comparing the migrated DataFusion code against the original Spark code.

Benchmark Spark Code

The migration benchmark uses the top 10 queries selected from the 21 queries in the TPC-H benchmark Spark code.

[Figure: TPC-H Benchmark Spark Code]

Benchmark Spark to DataFusion Migration Results

The migration results show a significant improvement in execution speed and resource utilization when running the migrated DataFusion code compared to the original Spark code.

Result: Row-level and Column-level Accuracy

[Figure: Row-level and Column-level Accuracy]

Result Columns Explanation

[Figure: Result Columns Explanation]

Overall Row & Cell Accuracy per Query

[Figure: Overall Row and Cell Accuracy per Query]

Summary

  1. All data flows are migrated with 100% accuracy in mapping input sources, business logic implementations, and output sources.
  2. Query 1 shows 7.5% cell accuracy. This is not a side effect of the migration; it stems from accuracy issues in Apache Spark when aggregating high-precision floating-point values. More specifically, it is a Java problem known as the “summation problem” (Lafage, 2020).
  3. Query 1's row accuracy is affected by its cell accuracy.
  4. For all other data pipelines/queries, cell and row accuracy are 100%, meaning the results exactly match the ground truth dataset.
  5. Overall accuracy of the migrated DataFusion pipelines is 99.9%.
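
The floating-point effect mentioned for Query 1 can be illustrated in a few lines. This sketch does not reproduce the Spark result; it only shows, for standard double-precision floats in Python, that the total of a sum depends on how the additions are accumulated and grouped. An engine that aggregates partitions in a different order can therefore produce slightly different totals from the same data.

```python
import math

values = [0.1] * 10

# Naive left-to-right accumulation: a rounding error is added at every step.
naive = 0.0
for v in values:
    naive += v

# math.fsum tracks intermediate partial sums to avoid losing precision.
exact = math.fsum(values)

print(naive == 1.0)  # False
print(exact == 1.0)  # True

# Grouping matters: the same three numbers give two different totals.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
```

Exact equality checks against a ground truth will flag such differences as mismatches even when the migrated logic is correct.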

Precision

Precision was measured as the ratio of correctly aligned cells for each query. With the exception of the first query, all achieved 100% precision, resulting in an overall precision of 90%. These results demonstrate that the migrated Rust pipelines are highly reliable and successfully fulfill the system’s intended goals.

[Figure: Precision]
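
The cell-level and row-level ratios described above can be sketched as follows. The project's actual metric code is not shown in this README, so the functions `cell_accuracy` and `row_accuracy` are hypothetical. They assume both result sets have the same shape, that rows are already aligned by position, and that cells are compared with exact equality.

```python
def cell_accuracy(expected, actual):
    """Fraction of cells in `actual` equal to the cell at the same position in `expected`."""
    total = sum(len(row) for row in expected)
    matches = sum(
        e == a
        for exp_row, act_row in zip(expected, actual)
        for e, a in zip(exp_row, act_row)
    )
    return matches / total


def row_accuracy(expected, actual):
    """Fraction of rows whose cells all match."""
    matches = sum(exp_row == act_row for exp_row, act_row in zip(expected, actual))
    return matches / len(expected)


# Made-up rows for illustration only; not benchmark data.
ground_truth = [("A", 10.0, 3), ("B", 20.0, 5)]
migrated = [("A", 10.0, 3), ("B", 20.5, 5)]

print(cell_accuracy(ground_truth, migrated))  # 5 of 6 cells match
print(row_accuracy(ground_truth, migrated))   # 1 of 2 rows match
```

A single mismatched cell lowers cell accuracy slightly but fails the whole row, which is why row accuracy follows cell accuracy downward.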

Conclusion

This work shows that the migration framework delivers consistent and dependable results. It successfully carries over the intent and functionality of existing dataflows while minimizing errors. The automated approach simplifies the transition process, cuts down manual effort, and provides a safer path toward modern data infrastructure. By leveraging intelligent agents, the system achieves reliable transformations and positions teams to focus on optimization rather than translation.

About

An AI agent for migrating Spark pipelines to DataFusion using lineage.
