
mloda: Make data, feature and context engineering shareable


⚠️ Early Version Notice: mloda is in active development. Some features described below are still being implemented. We're actively seeking feedback to shape the future of the framework. Share your thoughts!

🍳 Think of mloda Like Cooking Recipes

Traditional Data Pipelines = Making everything from scratch

  • Want pasta? Make noodles, sauce, cheese from raw ingredients
  • Want pizza? Start over - make dough, sauce, cheese again
  • Want lasagna? Repeat everything once more
  • Can't share recipes easily - they're mixed with your kitchen setup

mloda = Using recipe components

  • Create reusable recipes: "tomato sauce", "pasta dough", "cheese blend"
  • Use same "tomato sauce" for pasta, pizza, lasagna
  • Switch kitchens (home → restaurant → food truck) - same recipes work
  • Share your "tomato sauce" recipe with friends - they don't need your whole kitchen

Result: Instead of rebuilding the same thing 10 times, build once and reuse everywhere!

Installation

pip install mloda
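
To confirm the install, a minimal sanity check (a sketch using only the imports shown later in this README) is to import the package and load its plugins:

# Quick post-install check
import mloda
from mloda.user import PluginLoader

PluginLoader.all()  # discovers the built-in transformation plugins
print("mloda imported and plugins loaded")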

1. The Core API Call - Your Starting Point

Complete Working Example with DataCreator

# Step 1: Create a sample data source using DataCreator
from mloda.provider import FeatureGroup, DataCreator, FeatureSet, BaseInputData
from typing import Any, Optional
import pandas as pd

class SampleData(FeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return DataCreator({"customer_id", "age", "income"})

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        return pd.DataFrame({
            'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005'],
            'age': [25, 30, 35, None, 45],
            'income': [50000, 75000, None, 60000, 85000]
        })

# Step 2: Load mloda plugins and run pipeline
from mloda.user import PluginLoader
import mloda

PluginLoader.all()

result = mloda.run_all(
    features=[
        "customer_id",                    # Original column
        "age",                            # Original column
        "income__standard_scaled"         # Transform: scale income to mean=0, std=1
    ],
    compute_frameworks=["PandasDataFrame"]
)

# Step 3: Get your processed data
data = result[0]
print(data.head())
# Output: DataFrame with customer_id, age, and scaled income

What just happened?

  1. SampleData class - Created a data source using DataCreator (generates data in-memory)
  2. PluginLoader.all() - Loaded all available transformations (scaling, encoding, imputation, etc.)
  3. mloda.run_all() - Executed the feature pipeline:
    • Got data from SampleData
    • Extracted customer_id and age as-is
    • Applied StandardScaler to income → income__standard_scaled
  4. result[0] - Retrieved the processed pandas DataFrame

Key Insight: The syntax income__standard_scaled is mloda's feature chaining. Behind the scenes, mloda creates a chain of feature group objects (SourceFeatureGroup → StandardScalingFeatureGroup), automatically resolving dependencies. See Section 2 for a full explanation of the chaining syntax and Section 3 to learn about the underlying feature group architecture.
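
Because mloda resolves those dependencies itself, you can request just the derived feature. A minimal sketch, assuming the SampleData class and PluginLoader.all() call from Step 1 above:

# Request only the transformed feature; mloda resolves the underlying "income" source
result = mloda.run_all(
    features=["income__standard_scaled"],
    compute_frameworks=["PandasDataFrame"]
)
print(result[0].columns)  # exact set of returned columns depends on the resolver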

2. Understanding Feature Chaining (Transformations)

The Power of Double Underscore __ Syntax

As mentioned in Section 1, feature chaining (like income__standard_scaled) is syntactic sugar that mloda converts into a chain of feature group objects. Each transformation (standard_scaled, mean_imputed, etc.) corresponds to a specific feature group class.

mloda's chaining syntax lets you compose transformations using __ as a separator:

# Pattern examples (these show the syntax):
#   "income__standard_scaled"                     # Scale income column
#   "age__mean_imputed"                           # Fill missing age values with mean
#   "category__onehot_encoded"                    # One-hot encode category column
#
# You can chain transformations!
# Pattern: {source}__{transform1}__{transform2}
#   "income__mean_imputed__standard_scaled"       # First impute, then scale

# Real working example:
_ = ["income__standard_scaled", "age__mean_imputed"]  # Valid feature names

Available Transformations:

Transformation     | Purpose                                       | Example
__standard_scaled  | StandardScaler (mean=0, std=1)                | income__standard_scaled
__minmax_scaled    | MinMaxScaler (range [0, 1])                   | age__minmax_scaled
__robust_scaled    | RobustScaler (median-based, handles outliers) | price__robust_scaled
__mean_imputed     | Fill missing values with mean                 | salary__mean_imputed
__median_imputed   | Fill missing values with median               | age__median_imputed
__mode_imputed     | Fill missing values with mode                 | category__mode_imputed
__onehot_encoded   | One-hot encoding                              | state__onehot_encoded
__label_encoded    | Label encoding                                | priority__label_encoded

Key Insight: Transformations are read left-to-right. income__mean_imputed__standard_scaled means: take income → apply mean imputation → apply standard scaling.
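
A minimal chained sketch, reusing the SampleData class and plugin loading from Section 1 (its income column contains a missing value, so imputing before scaling is meaningful):

# Impute missing income values with the mean, then standard-scale the result
result = mloda.run_all(
    features=["income__mean_imputed__standard_scaled"],
    compute_frameworks=["PandasDataFrame"]
)
print(result[0].head())  # expect a column named after the full chain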

When You Need More Control

Most of the time, simple string syntax is enough:

# Example feature list (simple strings)
example_features = ["customer_id", "income__standard_scaled", "region__onehot_encoded"]

But for advanced configurations, you can explicitly create Feature objects with custom options (covered in Section 3).

3. Advanced: Feature Objects for Complex Configurations

Understanding the Feature Group Architecture

Behind the scenes, chaining like income__standard_scaled creates feature group objects:

# When you write this string:
"income__standard_scaled"

# mloda creates this chain of feature groups:
# StandardScalingFeatureGroup (reads from) → IncomeSourceFeatureGroup

Explicit Feature Objects

For truly custom configurations, you can use Feature objects:

# Example (for custom feature configurations):
# from mloda import Feature, Options
#
# features = [
#     "customer_id",                                   # Simple string
#     Feature(
#         "custom_feature",
#         options=Options({
#             "custom_param": "value",
#             "in_features": "source_column",
#         })
#     ),
# ]
#
# result = mloda.run_all(
#     features=features,
#     compute_frameworks=["PandasDataFrame"]
# )

Deep Dive: Each transformation type (__standard_scaled, __mean_imputed, etc.) maps to a feature group class in mloda_plugins/feature_group/. For example, __standard_scaled uses ScalingFeatureGroup. When you chain transformations, mloda builds a dependency graph of these feature groups and executes them in the correct order. This architecture makes mloda extensible - you can create custom feature groups for your own transformations!
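
As a sketch of that extensibility, the same FeatureGroup/DataCreator pattern from Section 1 can expose your own source columns, which then work with the built-in chains. The class and column names below are hypothetical:

# Hypothetical custom source feature group exposing order data
from typing import Any, Optional
import pandas as pd
from mloda.provider import BaseInputData, DataCreator, FeatureGroup, FeatureSet

class OrdersData(FeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return DataCreator({"order_id", "order_amount"})

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        return pd.DataFrame({
            'order_id': ['O1', 'O2', 'O3'],
            'order_amount': [19.99, None, 42.50]
        })

# Once this class is importable, the built-in chains apply to its columns too, e.g.:
#   "order_amount__mean_imputed__standard_scaled"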

4. Data Access - Where Your Data Comes From

Three Ways to Provide Data

mloda supports multiple data access patterns depending on your use case:

1. DataCreator - For testing and demos (used in our examples)

# Perfect for creating sample/test data in-memory
# See Section 1 for the SampleData class definition using DataCreator:
#
# class SampleData(FeatureGroup):
#     @classmethod
#     def input_data(cls) -> Optional[BaseInputData]:
#         return DataCreator({"customer_id", "age", "income"})
#
#     @classmethod
#     def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
#         return pd.DataFrame({
#             'customer_id': ['C001', 'C002'],
#             'age': [25, 30],
#             'income': [50000, 75000]
#         })

2. DataAccessCollection - For production file/database access

# Example (requires actual files/databases):
# from mloda.user import DataAccessCollection
#
# # Read from files, folders, or databases
# data_access = DataAccessCollection(
#     files={"customers.csv", "orders.parquet"},           # CSV/Parquet/JSON files
#     folders={"data/raw/"},                                # Entire directories
#     credential_dicts={"host": "db.example.com"}           # Database credentials
# )
#
# result = mloda.run_all(
#     features=["customer_id", "income__standard_scaled"],
#     compute_frameworks=["PandasDataFrame"],
#     data_access_collection=data_access
# )

3. ApiData - For runtime data injection (web requests, real-time predictions)

# Example (for API endpoints and real-time predictions):
# from mloda.provider import ApiDataCollection
#
# api_input_data_collection = ApiDataCollection()
# api_data = api_input_data_collection.setup_key_api_data(
#     key_name="PredictionData",
#     api_input_data={"customer_id": ["C001", "C002"], "age": [25, 30]}
# )
#
# result = mloda.run_all(
#     features=["customer_id", "age__standard_scaled"],
#     compute_frameworks=["PandasDataFrame"],
#     api_input_data_collection=api_input_data_collection,
#     api_data=api_data
# )

Key Insight: Use DataCreator for demos, DataAccessCollection for batch processing from files/databases, and ApiData for real-time predictions and web services.

5. Compute Frameworks - Choose Your Processing Engine

Using Different Data Processing Libraries

mloda supports multiple compute frameworks (pandas, polars, pyarrow, etc.). Most users start with pandas:

# Using the SampleData class from Section 1
# Default: Everything processes with pandas
result = mloda.run_all(
    features=["customer_id", "income__standard_scaled"],
    compute_frameworks=["PandasDataFrame"]  # Use pandas for all features
)

data = result[0]  # Returns pandas DataFrame
print(type(data))  # <class 'pandas.core.frame.DataFrame'>
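
Switching engines is just a matter of changing compute_frameworks. Only "PandasDataFrame" is confirmed by the examples in this README, so treat the identifier below as an assumption and check the documentation for the exact registered names:

# Example (sketch - framework identifier is an assumption, verify against the docs):
# result = mloda.run_all(
#     features=["customer_id", "income__standard_scaled"],
#     compute_frameworks=["PolarsDataFrame"]  # hypothetical identifier
# )
# print(type(result[0]))  # expected: a polars DataFrame rather than pandas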

Why Compute Frameworks Matter:

  • Pandas: Best for small-to-medium datasets, rich ecosystem, familiar API
  • Polars: High performance for larger datasets
  • PyArrow: Memory-efficient, great for columnar data
  • Spark: Distributed processing for big data

6. Putting It All Together - Complete ML Pipeline

Real-World Example: Customer Churn Prediction

Let's build a complete machine learning pipeline with mloda:

# Step 1: Extend SampleData with more features for ML
# (Reuse the same class to avoid conflicts)
SampleData._original_calculate = SampleData.calculate_feature

@classmethod
def _extended_calculate(cls, data: Any, features: FeatureSet) -> Any:
    import numpy as np
    np.random.seed(42)
    n = 100
    return pd.DataFrame({
        'customer_id': [f'C{i:03d}' for i in range(n)],
        'age': np.random.randint(18, 70, n),
        'income': np.random.randint(30000, 120000, n),
        'account_balance': np.random.randint(0, 10000, n),
        'subscription_tier': np.random.choice(['Basic', 'Premium', 'Enterprise'], n),
        'region': np.random.choice(['North', 'South', 'East', 'West'], n),
        'customer_segment': np.random.choice(['New', 'Regular', 'VIP'], n),
        'churned': np.random.choice([0, 1], n)
    })

SampleData.calculate_feature = _extended_calculate
SampleData._input_data_original = SampleData.input_data  # keep a reference to the original classmethod

@classmethod
def _extended_input_data(cls) -> Optional[BaseInputData]:
    return DataCreator({"customer_id", "age", "income", "account_balance",
                       "subscription_tier", "region", "customer_segment", "churned"})

SampleData.input_data = _extended_input_data

# Step 2: Run feature engineering pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

result = mloda.run_all(
    features=[
        "customer_id",
        "age__standard_scaled",
        "income__standard_scaled",
        "account_balance__robust_scaled",
        "subscription_tier__label_encoded",
        "region__label_encoded",
        "customer_segment__label_encoded",
        "churned"
    ],
    compute_frameworks=["PandasDataFrame"]
)

# Step 3: Prepare for ML
processed_data = result[0]
if len(processed_data.columns) > 2:  # Check we have features besides customer_id and churned
    X = processed_data.drop(['customer_id', 'churned'], axis=1)
    y = processed_data['churned']

    # Step 4: Train and evaluate
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"🎯 Model Accuracy: {accuracy:.2%}")
else:
    print("⚠️ Skipping ML - extend SampleData first with more features!")

What mloda Did For You:

  1. ✅ Generated sample data with DataCreator
  2. ✅ Scaled numeric features (StandardScaler & RobustScaler)
  3. ✅ Encoded categorical features (Label encoding)
  4. ✅ Returned clean DataFrame ready for sklearn
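
As an optional follow-up (a sketch assuming the model above was trained), you can map the classifier's importances back to mloda's generated column names:

# Inspect which engineered features the model relied on most
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)  # index entries are mloda's column names, e.g. age__standard_scaled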

🎉 You now understand mloda's complete workflow! The same transformations work across pandas, polars, pyarrow, and other frameworks - just change compute_frameworks.

📖 Documentation

🤝 Contributing

We welcome contributions! Whether you're building plugins, adding features, or improving documentation, your input is invaluable.

📄 License

This project is licensed under the Apache License, Version 2.0.