⚠️ Early Version Notice: mloda is in active development. Some features described below are still being implemented. We're actively seeking feedback to shape the future of the framework. Share your thoughts!
Traditional Data Pipelines = Making everything from scratch
- Want pasta? Make noodles, sauce, cheese from raw ingredients
- Want pizza? Start over - make dough, sauce, cheese again
- Want lasagna? Repeat everything once more
- Can't share recipes easily - they're mixed with your kitchen setup
mloda = Using recipe components
- Create reusable recipes: "tomato sauce", "pasta dough", "cheese blend"
- Use same "tomato sauce" for pasta, pizza, lasagna
- Switch kitchens (home → restaurant → food truck) - same recipes work
- Share your "tomato sauce" recipe with friends - they don't need your whole kitchen
Result: Instead of rebuilding the same thing 10 times, build once and reuse everywhere!
```bash
pip install mloda
```

Complete Working Example with DataCreator
```python
# Step 1: Create a sample data source using DataCreator
from typing import Any, Optional

import pandas as pd

from mloda.provider import FeatureGroup, DataCreator, FeatureSet, BaseInputData


class SampleData(FeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return DataCreator({"customer_id", "age", "income"})

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        return pd.DataFrame({
            'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005'],
            'age': [25, 30, 35, None, 45],
            'income': [50000, 75000, None, 60000, 85000]
        })
```
```python
# Step 2: Load mloda plugins and run pipeline
import mloda
from mloda.user import PluginLoader

PluginLoader.all()

result = mloda.run_all(
    features=[
        "customer_id",             # Original column
        "age",                     # Original column
        "income__standard_scaled"  # Transform: scale income to mean=0, std=1
    ],
    compute_frameworks=["PandasDataFrame"]
)

# Step 3: Get your processed data
data = result[0]
print(data.head())
# Output: DataFrame with customer_id, age, and scaled income
```

What just happened?
- `SampleData` class - Created a data source using `DataCreator` (generates data in-memory)
- `PluginLoader.all()` - Loaded all available transformations (scaling, encoding, imputation, etc.)
- `mloda.run_all()` - Executed the feature pipeline:
  - Got data from `SampleData`
  - Extracted `customer_id` and `age` as-is
  - Applied StandardScaler to `income` → `income__standard_scaled`
- `result[0]` - Retrieved the processed pandas DataFrame
Key Insight: The syntax `income__standard_scaled` is mloda's feature chaining. Behind the scenes, mloda creates a chain of feature group objects (`SourceFeatureGroup` → `StandardScalingFeatureGroup`), automatically resolving dependencies. See Section 2 for a full explanation of the chaining syntax and Section 4 to learn about the underlying feature group architecture.
The Power of Double Underscore __ Syntax
As mentioned in Section 1, feature chaining (like income__standard_scaled) is syntactic sugar that mloda converts into a chain of feature group objects. Each transformation (standard_scaled, mean_imputed, etc.) corresponds to a specific feature group class.
mloda's chaining syntax lets you compose transformations using __ as a separator:
```python
# Pattern examples (these show the syntax):
# "income__standard_scaled"    # Scale income column
# "age__mean_imputed"          # Fill missing age values with mean
# "category__onehot_encoded"   # One-hot encode category column
#
# You can chain transformations!
# Pattern: {source}__{transform1}__{transform2}
# "income__mean_imputed__standard_scaled"  # First impute, then scale

# Real working example:
_ = ["income__standard_scaled", "age__mean_imputed"]  # Valid feature names
```

Available Transformations:
| Transformation | Purpose | Example |
|---|---|---|
| `__standard_scaled` | StandardScaler (mean=0, std=1) | `income__standard_scaled` |
| `__minmax_scaled` | MinMaxScaler (range [0,1]) | `age__minmax_scaled` |
| `__robust_scaled` | RobustScaler (median-based, handles outliers) | `price__robust_scaled` |
| `__mean_imputed` | Fill missing values with mean | `salary__mean_imputed` |
| `__median_imputed` | Fill missing values with median | `age__median_imputed` |
| `__mode_imputed` | Fill missing values with mode | `category__mode_imputed` |
| `__onehot_encoded` | One-hot encoding | `state__onehot_encoded` |
| `__label_encoded` | Label encoding | `priority__label_encoded` |
Key Insight: Transformations are read left-to-right. `income__mean_imputed__standard_scaled` means: take `income` → apply mean imputation → apply standard scaling.
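To make that ordering concrete, here is a plain-pandas sketch of what the chained feature computes, step by step. This illustrates the semantics only, not mloda's internal code, and the sample values are ours (note that sklearn's StandardScaler normalizes by the population std, while pandas `.std()` uses the sample std):

```python
import pandas as pd

# Hypothetical sample values, mirroring the income column used above
df = pd.DataFrame({"income": [50000.0, 75000.0, None, 60000.0, 85000.0]})

# Step 1: mean imputation - fill the missing value with the column mean
imputed = df["income"].fillna(df["income"].mean())

# Step 2: standard scaling - shift to mean 0 and scale to unit variance
scaled = (imputed - imputed.mean()) / imputed.std()

df["income__mean_imputed__standard_scaled"] = scaled
print(df)
```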
When You Need More Control
Most of the time, simple string syntax is enough:
```python
# Example feature list (simple strings)
example_features = ["customer_id", "income__standard_scaled", "region__onehot_encoded"]
```

But for advanced configurations, you can explicitly create Feature objects with custom options (covered in Section 3).
Understanding the Feature Group Architecture
Behind the scenes, chaining like income__standard_scaled creates feature group objects:
```python
# When you write this string:
"income__standard_scaled"

# mloda creates this chain of feature groups:
# StandardScalingFeatureGroup (reads from) → IncomeSourceFeatureGroup
```

Explicit Feature Objects
For truly custom configurations, you can use Feature objects:
```python
# Example (for custom feature configurations):
# from mloda import Feature, Options
#
# features = [
#     "customer_id",  # Simple string
#     Feature(
#         "custom_feature",
#         options=Options({
#             "custom_param": "value",
#             "in_features": "source_column",
#         })
#     ),
# ]
#
# result = mloda.run_all(
#     features=features,
#     compute_frameworks=["PandasDataFrame"]
# )
```

Deep Dive: Each transformation type (`standard_scaled`, `mean_imputed`, etc.) maps to a feature group class in `mloda_plugins/feature_group/`. For example, `standard_scaled` uses `ScalingFeatureGroup`. When you chain transformations, mloda builds a dependency graph of these feature groups and executes them in the correct order. This architecture makes mloda extensible - you can create custom feature groups for your own transformations!
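As a loose sketch of that extensibility, a custom feature group might look something like the following. Only `FeatureGroup`, `FeatureSet`, and `calculate_feature` are taken from the examples above; the `input_features` hook and the hardcoded feature name are our assumptions for illustration, so consult the Feature Groups documentation for the real extension contract:

```python
from typing import Any, Optional, Set

from mloda.provider import FeatureGroup, FeatureSet


class DoubledIncomeFeatureGroup(FeatureGroup):
    """Hypothetical feature group providing an "income__doubled" feature."""

    @classmethod
    def input_features(cls, options: Any, feature_name: Any) -> Optional[Set[str]]:
        # ASSUMED hook: declare that "income__doubled" depends on "income",
        # so mloda can wire it into the dependency graph.
        return {"income"}

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        # mloda resolves the declared input first; we derive the new
        # column from it and return the enriched data.
        data["income__doubled"] = data["income"] * 2
        return data
```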
Three Ways to Provide Data
mloda supports multiple data access patterns depending on your use case:
1. DataCreator - For testing and demos (used in our examples)
```python
# Perfect for creating sample/test data in-memory
# See Section 1 for the SampleData class definition using DataCreator:
#
# class SampleData(FeatureGroup):
#     @classmethod
#     def input_data(cls) -> Optional[BaseInputData]:
#         return DataCreator({"customer_id", "age", "income"})
#
#     @classmethod
#     def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
#         return pd.DataFrame({
#             'customer_id': ['C001', 'C002'],
#             'age': [25, 30],
#             'income': [50000, 75000]
#         })
```

2. DataAccessCollection - For production file/database access
```python
# Example (requires actual files/databases):
# from mloda.user import DataAccessCollection
#
# # Read from files, folders, or databases
# data_access = DataAccessCollection(
#     files={"customers.csv", "orders.parquet"},   # CSV/Parquet/JSON files
#     folders={"data/raw/"},                       # Entire directories
#     credential_dicts={"host": "db.example.com"}  # Database credentials
# )
#
# result = mloda.run_all(
#     features=["customer_id", "income__standard_scaled"],
#     compute_frameworks=["PandasDataFrame"],
#     data_access_collection=data_access
# )
```

3. ApiData - For runtime data injection (web requests, real-time predictions)
```python
# Example (for API endpoints and real-time predictions):
# from mloda.provider import ApiDataCollection
#
# api_input_data_collection = ApiDataCollection()
# api_data = api_input_data_collection.setup_key_api_data(
#     key_name="PredictionData",
#     api_input_data={"customer_id": ["C001", "C002"], "age": [25, 30]}
# )
#
# result = mloda.run_all(
#     features=["customer_id", "age__standard_scaled"],
#     compute_frameworks=["PandasDataFrame"],
#     api_input_data_collection=api_input_data_collection,
#     api_data=api_data
# )
```

Key Insight: Use DataCreator for demos, DataAccessCollection for batch processing from files/databases, and ApiData for real-time predictions and web services.
Using Different Data Processing Libraries
mloda supports multiple compute frameworks (pandas, polars, pyarrow, etc.). Most users start with pandas:
```python
# Using the SampleData class from Section 1
# Default: Everything processes with pandas
result = mloda.run_all(
    features=["customer_id", "income__standard_scaled"],
    compute_frameworks=["PandasDataFrame"]  # Use pandas for all features
)

data = result[0]   # Returns pandas DataFrame
print(type(data))  # <class 'pandas.core.frame.DataFrame'>
```

Why Compute Frameworks Matter:
- Pandas: Best for small-to-medium datasets, rich ecosystem, familiar API
- Polars: High performance for larger datasets (see the sketch after this list)
- PyArrow: Memory-efficient, great for columnar data
- Spark: Distributed processing for big data
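For instance, moving the same pipeline to Polars should come down to swapping the `compute_frameworks` argument. A minimal sketch, assuming the framework identifier is "PolarsDataFrame" by analogy with "PandasDataFrame" (check the Compute Frameworks docs for the exact name):

```python
# Same feature list, different engine ("PolarsDataFrame" is an assumed
# identifier by analogy with "PandasDataFrame"):
result = mloda.run_all(
    features=["customer_id", "income__standard_scaled"],
    compute_frameworks=["PolarsDataFrame"]
)
data = result[0]  # expected: a polars DataFrame instead of pandas
```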
Real-World Example: Customer Churn Prediction
Let's build a complete machine learning pipeline with mloda:
```python
# Step 1: Extend SampleData with more features for ML
# (Reuse the same class to avoid conflicts)
import numpy as np

# Keep references to the original methods before patching
SampleData._original_calculate = SampleData.calculate_feature
SampleData._original_input_data = SampleData.input_data


@classmethod
def _extended_calculate(cls, data: Any, features: FeatureSet) -> Any:
    np.random.seed(42)
    n = 100
    return pd.DataFrame({
        'customer_id': [f'C{i:03d}' for i in range(n)],
        'age': np.random.randint(18, 70, n),
        'income': np.random.randint(30000, 120000, n),
        'account_balance': np.random.randint(0, 10000, n),
        'subscription_tier': np.random.choice(['Basic', 'Premium', 'Enterprise'], n),
        'region': np.random.choice(['North', 'South', 'East', 'West'], n),
        'customer_segment': np.random.choice(['New', 'Regular', 'VIP'], n),
        'churned': np.random.choice([0, 1], n)
    })


@classmethod
def _extended_input_data(cls) -> Optional[BaseInputData]:
    return DataCreator({"customer_id", "age", "income", "account_balance",
                        "subscription_tier", "region", "customer_segment", "churned"})


# Patch the extended methods onto the existing class
SampleData.calculate_feature = _extended_calculate
SampleData.input_data = _extended_input_data

# Step 2: Run feature engineering pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

result = mloda.run_all(
    features=[
        "customer_id",
        "age__standard_scaled",
        "income__standard_scaled",
        "account_balance__robust_scaled",
        "subscription_tier__label_encoded",
        "region__label_encoded",
        "customer_segment__label_encoded",
        "churned"
    ],
    compute_frameworks=["PandasDataFrame"]
)

# Step 3: Prepare for ML
processed_data = result[0]
if len(processed_data.columns) > 2:  # Check we have features besides customer_id and churned
    X = processed_data.drop(['customer_id', 'churned'], axis=1)
    y = processed_data['churned']

    # Step 4: Train and evaluate
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"🎯 Model Accuracy: {accuracy:.2%}")
else:
    print("⚠️ Skipping ML - extend SampleData first with more features!")
```

What mloda Did For You:
- ✅ Generated sample data with DataCreator
- ✅ Scaled numeric features (StandardScaler & RobustScaler)
- ✅ Encoded categorical features (Label encoding)
- ✅ Returned clean DataFrame ready for sklearn
🎉 You now understand mloda's complete workflow! The same transformations work across pandas, polars, pyarrow, and other frameworks - just change `compute_frameworks`.
- Getting Started - Installation and first steps
- sklearn Integration - Complete tutorial
- Feature Groups - Core concepts
- Compute Frameworks - Technology integration
- API Reference - Complete API documentation
We welcome contributions! Whether you're building plugins, adding features, or improving documentation, your input is invaluable.
- Development Guide - How to contribute
- GitHub Issues - Report bugs or request features
- Email - Direct contact