
Smart-Shaped/chaM3Leon


chaM3Leon: A Modular Framework for Big Data and ML Applications

A modular and scalable framework based on Java, Python, and Apache Spark, designed to support machine learning applications. chaM3Leon emphasizes transparency, interoperability, and usability. It implements a custom Lambda architecture for real-time and batch data processing, providing a robust platform for Big Data and MLOps.

The chaM3Leon architecture is illustrated in the following Component Diagram, highlighting the connections between layers through provided and required interfaces.

chaM3Leon architecture

Features

  • Modular Architecture: Easily extend and customize layers for your specific needs.
  • Scalable: Built on Apache Spark to handle large-scale data processing.
  • Lambda Architecture: Combines batch and speed layers for efficient data handling.
  • Extensible: Add new layers and components to your application with ease.
  • Multiple Layers: Includes Batch, Speed, ML, and Harvester layers for a full data pipeline.

As of now, we have released four layers (Batch Layer, Speed Layer, Harvester Layer and ML Layer). You can refer to our roadmap to see the planned release dates for other components.

Implementation

The chaM3Leon core framework is based on Java and Maven. It is designed to be modular and scalable, allowing different components and layers to be easily integrated.

The layers can be divided based on their implementation technology:

  • Java Layers (Main Framework):
    • Spark-based:
      • Batch Layer
      • Speed Layer
      • Harvester Layer
    • Spring Boot-based:
      • Serving Layer
  • Python Layer (as Git Submodule):
    • ML Layer: This layer is now implemented as a separate Python library, managed as a Git submodule. It leverages modern MLOps tools including Metaflow, MLflow, and Apache Spark for building and managing machine learning pipelines.

Spark Layers

Spark Layers are based on Apache Spark with Java 11 and are designed to run on a Spark cluster. They are implemented using the Spark Streaming API and the Spark SQL API.

To implement your own version of any Spark Layer you have to:

  • Build the project by running the following command at the level of the chaM3Leon pom.xml:
mvn clean install
  • Create a Maven project and add the chaM3Leon layer you want to implement as a dependency in your pom.xml, as below:
<dependency>
	<groupId>com.smartshaped.chameleon</groupId>
	<artifactId>{layer}</artifactId>
	<version>2.0.0</version>
</dependency>
  • Where {layer} can be:

    • batch
    • speed
    • harvester
  • Add the maven-shade-plugin to generate a shaded jar, so that your layer implementation can be submitted as a Spark application (keep in mind the framework is based on Java 11):

<build>
	<plugins>
		<plugin>
			<groupId>org.apache.maven.plugins</groupId>
			<artifactId>maven-shade-plugin</artifactId>
			<version>3.6.0</version>
			<executions>
				<execution>
					<phase>package</phase>
					<goals>
						<goal>shade</goal>
					</goals>
					<configuration>
						<filters>
							<filter>
								<artifact>*:*</artifact>
								<excludes>
									<exclude>META-INF/*.SF</exclude>
									<exclude>META-INF/*.DSA</exclude>
									<exclude>META-INF/*.RSA</exclude>
								</excludes>
							</filter>
						</filters>
						<transformers>
							<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
								<manifestEntries>
									<Specification-Title>Java Advanced Imaging Image I/O Tools</Specification-Title>
									<Specification-Version>1.1</Specification-Version>
									<Specification-Vendor>Sun Microsystems, Inc.</Specification-Vendor>
									<Implementation-Title>com.sun.media.imageio</Implementation-Title>
									<Implementation-Version>1.1</Implementation-Version>
									<Implementation-Vendor>Sun Microsystems, Inc.</Implementation-Vendor>
									<Multi-Release>true</Multi-Release>
								</manifestEntries>
							</transformer>
							<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
						</transformers>
					</configuration>
				</execution>
			</executions>
		</plugin>
	</plugins>
</build>
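Once the shaded jar is built, it can be submitted to a Spark cluster with spark-submit. A minimal sketch of the command, where the main class, jar name, and master URL are hypothetical placeholders (your actual values depend on your layer implementation and cluster setup):

```shell
# Illustrative only: the class name, jar name and master URL below are
# placeholders, not values prescribed by chaM3Leon.
spark-submit \
  --class com.example.MyBatchLayerApp \
  --master spark://spark-master:7077 \
  target/my-batch-layer-1.0.0.jar
```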

After this, you can choose to extend any of the layers by following their respective documentation.


Spring Boot Layer

The Serving Layer is based on Spring Boot 3.4.2 with Java 21.

To implement your own version of the Serving Layer you can follow the Serving Layer documentation.


Python Layer

The ML Layer is implemented as a Python library, managed as a Git submodule. It leverages Metaflow, MLflow, and Apache Spark.

To implement or extend your machine learning pipelines, you can follow the PyChaM3Leon documentation.
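Because the ML Layer ships as a Git submodule, it has to be fetched explicitly when cloning the repository. A sketch of the standard commands (the submodule path is defined in the repository's .gitmodules):

```shell
# Clone the framework together with its submodules ...
git clone --recurse-submodules https://github.com/Smart-Shaped/chaM3Leon.git

# ... or, in an existing clone, initialize and fetch them afterwards.
git submodule update --init --recursive
```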


Execution Instructions (Spark Layers)

To generate the .jar of your implemented layer (Batch, Speed or Harvester), run the following command from your project directory:

mvn clean install

Then go to our Docker repository and follow the Docker documentation.


Contributing

Contributions are welcome! Please feel free to submit a pull request.

License

This project is licensed under the Apache-2.0 license.

Additional Video Resources

YouTube:

Roadmap

  • API Gateway (To be determined)

  • Workflow Designer (To be determined, probably Q3/Q4 2025)