mloda: Make data, feature and context engineering shareable
⚠️ Early Version Notice: mloda is in active development. Some features described below are still being implemented. We're actively seeking feedback to shape the future of the framework. Share your thoughts!
🍳 Think of mloda Like Cooking Recipes
Traditional Data Pipelines = Making everything from scratch
- Want pasta? Make noodles, sauce, cheese from raw ingredients
- Want pizza? Start over - make dough, sauce, cheese again
- Want lasagna? Repeat everything once more
- Can't share recipes easily - they're mixed with your kitchen setup
mloda = Using recipe components
- Create reusable recipes: "tomato sauce", "pasta dough", "cheese blend"
- Use same "tomato sauce" for pasta, pizza, lasagna
- Switch kitchens (home → restaurant → food truck) - same recipes work
- Share your "tomato sauce" recipe with friends - they don't need your whole kitchen
Result: Instead of rebuilding the same thing 10 times, build once and reuse everywhere!
Installation
pip install mloda
1. The Core API Call - Your Starting Point
Complete Working Example with DataCreator
# Step 1: Create a sample data source using DataCreator
from mloda.provider import FeatureGroup, DataCreator, FeatureSet, BaseInputData
from typing import Any, Optional
import pandas as pd
class SampleData(FeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return DataCreator({"customer_id", "age", "income"})

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        return pd.DataFrame({
            'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005'],
            'age': [25, 30, 35, None, 45],
            'income': [50000, 75000, None, 60000, 85000]
        })
# Step 2: Load mloda plugins and run pipeline
from mloda.user import PluginLoader
import mloda
from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataFrame
PluginLoader.all()
result = mloda.run_all(
    features=[
        "customer_id",              # Original column
        "age",                      # Original column
        "income__standard_scaled"   # Transform: scale income to mean=0, std=1
    ],
    compute_frameworks={PandasDataFrame}
)
# Step 3: Get your processed data
data = result[0]
print(data.head())
# Output: DataFrame with customer_id, age, and scaled income
What just happened?
- SampleData class - Created a data source using DataCreator (generates data in-memory)
- PluginLoader.all() - Loaded all available transformations (scaling, encoding, imputation, etc.)
- mloda.run_all() - Executed the feature pipeline:
  - Got data from SampleData
  - Extracted customer_id and age as-is
  - Applied StandardScaler to income → income__standard_scaled
- result[0] - Retrieved the processed pandas DataFrame
Key Insight: The syntax income__standard_scaled is mloda's feature chaining. Behind the scenes, mloda creates a chain of feature group objects (SourceFeatureGroup → StandardScalingFeatureGroup), automatically resolving dependencies. See Section 2 for a full explanation of the chaining syntax and Section 3 to learn about the underlying feature group architecture.
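To make "scale income to mean=0, std=1" concrete, here is a minimal standard-library sketch of the arithmetic on the sample income column. It is not mloda's implementation; the function name `standard_scale` is made up for illustration, and it assumes the StandardScaler convention of using the population standard deviation and leaving missing values in place.

```python
from statistics import fmean, pstdev
from typing import Optional


def standard_scale(values: list[Optional[float]]) -> list[Optional[float]]:
    """Scale to mean 0 and std 1; missing values (None) stay missing."""
    present = [v for v in values if v is not None]
    mean = fmean(present)
    std = pstdev(present)  # population std (ddof=0), as StandardScaler uses
    return [None if v is None else (v - mean) / std for v in values]


# Same income values as the SampleData class above
income = [50000, 75000, None, 60000, 85000]
scaled = standard_scale(income)
print(scaled)
```

The scaled values are unitless: each one says how many standard deviations an income lies above or below the column mean.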
2. Understanding Feature Chaining (Transformations)
The Power of Double Underscore __ Syntax
As mentioned in Section 1, feature chaining (like income__standard_scaled) is syntactic sugar that mloda converts into a chain of feature group objects. Each transformation (standard_scaled, mean_imputed, etc.) corresponds to a specific feature group class.
mloda's chaining syntax lets you compose transformations using __ as a separator:
# Pattern examples (these show the syntax):
# "income__standard_scaled" # Scale income column
# "age__mean_imputed" # Fill missing age values with mean
# "category__onehot_encoded" # One-hot encode category column
#
# You can chain transformations!
# Pattern: {source}__{transform1}__{transform2}
# "income__mean_imputed__standard_scaled" # First impute, then scale
# Real working example:
_ = ["income__standard_scaled", "age__mean_imputed"] # Valid feature names
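The naming pattern can be illustrated with a few lines of plain Python. This is only a sketch of how a name of the form {source}__{transform1}__{transform2} splits into its parts; `parse_feature_name` and `ParsedFeature` are hypothetical names, and mloda's own resolution logic may work differently.

```python
from typing import NamedTuple

SEPARATOR = "__"


class ParsedFeature(NamedTuple):
    source: str                   # the original column, e.g. "income"
    transforms: tuple[str, ...]   # transformation suffixes, in order


def parse_feature_name(name: str) -> ParsedFeature:
    """Split a chained feature name on the double underscore."""
    source, *transforms = name.split(SEPARATOR)
    if not source or any(not t for t in transforms):
        raise ValueError(f"malformed feature name: {name!r}")
    return ParsedFeature(source, tuple(transforms))


print(parse_feature_name("income__mean_imputed__standard_scaled"))
print(parse_feature_name("customer_id"))  # single underscores are not separators
```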
Available Transformations:
| Transformation | Purpose | Example |
|---|---|---|
| __standard_scaled | StandardScaler (mean=0, std=1) | income__standard_scaled |
| __minmax_scaled | MinMaxScaler (range [0,1]) | age__minmax_scaled |
| __robust_scaled | RobustScaler (median-based, handles outliers) | price__robust_scaled |
| __mean_imputed | Fill missing values with mean | salary__mean_imputed |
| __median_imputed | Fill missing values with median | age__median_imputed |
| __mode_imputed | Fill missing values with mode | category__mode_imputed |
| __onehot_encoded | One-hot encoding | state__onehot_encoded |
| __label_encoded | Label encoding | priority__label_encoded |
Key Insight: Transformations are read left-to-right. income__mean_imputed__standard_scaled means: take income → apply mean imputation → apply standard scaling.
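The left-to-right reading can be sketched as a loop over the suffixes, applying one function per step. The `TRANSFORMS` registry and `resolve` helper below are assumptions made for illustration only; they are not part of mloda, which resolves chains through feature group classes.

```python
from statistics import fmean, pstdev
from typing import Callable, Optional

Column = list[Optional[float]]


def mean_imputed(col: Column) -> Column:
    """Replace missing values with the mean of the present values."""
    mean = fmean([v for v in col if v is not None])
    return [mean if v is None else v for v in col]


def standard_scaled(col: Column) -> Column:
    """Scale to mean 0 and std 1, keeping missing values missing."""
    present = [v for v in col if v is not None]
    mean, std = fmean(present), pstdev(present)
    return [None if v is None else (v - mean) / std for v in col]


# Hypothetical registry: suffix -> function
TRANSFORMS: dict[str, Callable[[Column], Column]] = {
    "mean_imputed": mean_imputed,
    "standard_scaled": standard_scaled,
}


def resolve(name: str, table: dict[str, Column]) -> Column:
    """Take the source column, then apply each suffix from left to right."""
    source, *steps = name.split("__")
    col = table[source]
    for step in steps:
        col = TRANSFORMS[step](col)
    return col


table = {"income": [50000, 75000, None, 60000, 85000]}
result = resolve("income__mean_imputed__standard_scaled", table)
print(result)
```

Because imputation runs first, the formerly missing entry is filled with the column mean and therefore scales to exactly 0.0.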
When You Need More Control
Most of the time, simple string syntax is enough:
# Example feature list (simple strings)
example_features = ["customer_id", "income__standard_scaled", "region__onehot_encoded"]
But for advanced configurations, you can explicitly create Feature objects with custom options (covered in Section 3).
3. Advanced: Feature Objects for Complex Configurations
Understanding the Feature Group Architecture
Behind the scenes, chaining like income__standard_scaled creates feature group objects:
# When you write this string:
"income__standard_scaled"
# mloda creates this chain of feature groups:
# StandardScalingFeatureGroup (reads from) → IncomeSourceFeatureGroup
Explicit Feature Objects
For truly custom configurations, you can use Feature objects:
# Example (for custom feature configurations):
# from mloda import Feature, Options
#
# features = [
#     "customer_id",  # Simple string
#     Feature(
#         "custom_feature",
#         options=Options({
#             "custom_param": "value",
#             "in_features": "source_column",
#         })
#     ),
# ]
#
# result = mloda.run_all(
#     features=features,
#     compute_frameworks={PandasDataFrame}
# )
Deep Dive: Each transformation type (__standard_scaled, __mean_imputed, etc.) maps to a feature group class in mloda_plugins/feature_group/. For example, __standard_scaled uses ScalingFeatureGroup. When you chain transformations, mloda builds a dependency graph of these feature groups and executes them in the correct order. This architecture makes mloda extensible - you can create custom feature groups for your own transformations!
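The idea of building a dependency graph and executing it in the correct order can be sketched with the standard library's graphlib. This is a simplified illustration under the assumption that every prefix of a chained name is a node depending on the shorter prefix; `build_dependencies` is a hypothetical helper, not mloda's planner.

```python
from graphlib import TopologicalSorter


def build_dependencies(feature_names: list[str]) -> dict[str, set[str]]:
    """Map each (intermediate) feature to the features it reads from."""
    deps: dict[str, set[str]] = {}
    for name in feature_names:
        parts = name.split("__")
        for i in range(len(parts)):
            node = "__".join(parts[: i + 1])
            parent = "__".join(parts[:i])  # empty string for the source column
            deps.setdefault(node, set())
            if parent:
                deps[node].add(parent)
    return deps


requested = [
    "income__mean_imputed__standard_scaled",
    "income__mean_imputed",
    "age__standard_scaled",
]
deps = build_dependencies(requested)

# static_order() yields every node after all of its dependencies
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Shared intermediate steps such as income__mean_imputed appear once in the graph, so a planner built this way would compute them a single time.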
4. Data Access - Where Your Data Comes From
Three Ways to Provide Data
mloda supports multiple data access patterns depending on your use case:
1. DataCreator - For testing and demos (used in our examples)
# Perfect for creating sample/test data in-memory
# See Section 1 for the SampleData class definition using DataCreator:
#
# class SampleData(FeatureGroup):
#     @classmethod
#     def input_data(cls) -> Optional[BaseInputData]:
#         return DataCreator({"customer_id", "age", "income"})
#
#     @classmethod
#     def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
#         return pd.DataFrame({
#             'customer_id': ['C001', 'C002'],
#             'age': [25, 30],
#             'income': [50000, 75000]
#         })
2. DataAccessCollection - For production file/database access
# Example (requires actual files/databases):
# from mloda.user import DataAccessCollection
#
# # Read from files, folders, or databases
# data_access = DataAccessCollection(
#     files={"customers.csv", "orders.parquet"},    # CSV/Parquet/JSON files
#     folders={"data/raw/"},                        # Entire directories
#     credential_dicts={"host": "db.example.com"}   # Database credentials
# )
#
# result = mloda.run_all(
#     features=["customer_id", "income__standard_scaled"],
#     compute_frameworks={PandasDataFrame},
#     data_access_collection=data_access
# )
3. ApiData - For runtime data injection (web requests, real-time predictions)
# Example (for API endpoints and real-time predictions):
# from mloda.provider import ApiDataCollection
#
# api_input_data_collection = ApiDataCollection()
# api_data = api_input_data_collection.setup_key_api_data(
#     key_name="PredictionData",
#     api_input_data={"customer_id": ["C001", "C002"], "age": [25, 30]}
# )
#
# result = mloda.run_all(
#     features=["customer_id", "age__standard_scaled"],
#     compute_frameworks={PandasDataFrame},
#     api_input_data_collection=api_input_data_collection,
#     api_data=api_data
# )
Key Insight: Use DataCreator for demos, DataAccessCollection for batch processing from files/databases, and ApiData for real-time predictions and web services.
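The common thread across the three patterns is that downstream feature logic does not need to know where the data came from. The sketch below illustrates that idea with plain Python only; `DataSource`, `InMemorySource`, `CsvSource` and `RequestPayloadSource` are hypothetical names invented for this example and do not correspond to mloda classes.

```python
import csv
import io
from typing import Protocol, TextIO

Table = dict[str, list[str]]


class DataSource(Protocol):
    def load(self) -> Table: ...


class InMemorySource:
    """Demo/test data created in code (the role DataCreator plays above)."""
    def load(self) -> Table:
        return {"customer_id": ["C001", "C002"], "age": ["25", "30"]}


class CsvSource:
    """Batch data read from a file-like object."""
    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def load(self) -> Table:
        rows = list(csv.DictReader(self._stream))
        columns = list(rows[0].keys()) if rows else []
        return {c: [r[c] for r in rows] for c in columns}


class RequestPayloadSource:
    """Data handed over at call time, e.g. from a web request body."""
    def __init__(self, payload: Table) -> None:
        self._payload = payload

    def load(self) -> Table:
        return self._payload


def customer_ids(source: DataSource) -> list[str]:
    # Downstream logic is identical for every source
    return source.load()["customer_id"]


sources = [
    InMemorySource(),
    CsvSource(io.StringIO("customer_id,age\nC001,25\nC002,30\n")),
    RequestPayloadSource({"customer_id": ["C001", "C002"], "age": ["25", "30"]}),
]
for source in sources:
    print(type(source).__name__, customer_ids(source))
```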
5. Compute Frameworks - Choose Your Processing Engine
Using Different Data Processing Libraries
mloda supports multiple compute frameworks (pandas, polars, pyarrow, etc.). Most users start with pandas:
# Using the SampleData class from Section 1
# Default: Everything processes with pandas
result = mloda.run_all(
    features=["customer_id", "income__standard_scaled"],
    compute_frameworks={PandasDataFrame}  # Use pandas for all features
)
data = result[0] # Returns pandas DataFrame
print(type(data)) # <class 'pandas.core.frame.DataFrame'>
Why Compute Frameworks Matter:
- Pandas: Best for small-to-medium datasets, rich ecosystem, familiar API
- Polars: High performance for larger datasets
- PyArrow: Memory-efficient, great for columnar data
- Spark: Distributed processing for big data
For most use cases: Start with compute_frameworks={PandasDataFrame} and switch to others only if you need specific performance characteristics.
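One way to picture a compute framework is as an interchangeable backend for the same feature definition. The following sketch implements min-max scaling twice, once for row-oriented data and once for column-oriented data, and selects the implementation by key. The `BACKENDS` mapping and both functions are illustrative assumptions and say nothing about how mloda's framework plugins are written.

```python
Rows = list[dict[str, float]]
Columns = dict[str, list[float]]


def minmax_rows(data: Rows, column: str) -> Rows:
    """Row-oriented backend: one dict per record."""
    values = [row[column] for row in data]
    lo, hi = min(values), max(values)
    return [
        {**row, f"{column}__minmax_scaled": (row[column] - lo) / (hi - lo)}
        for row in data
    ]


def minmax_columns(data: Columns, column: str) -> Columns:
    """Column-oriented backend: one list per column."""
    values = data[column]
    lo, hi = min(values), max(values)
    return {**data, f"{column}__minmax_scaled": [(v - lo) / (hi - lo) for v in values]}


# Hypothetical dispatch table: the feature name stays the same, the engine changes
BACKENDS = {"rows": minmax_rows, "columns": minmax_columns}

row_result = BACKENDS["rows"]([{"age": 25}, {"age": 30}, {"age": 45}], "age")
col_result = BACKENDS["columns"]({"age": [25, 30, 45]}, "age")
print(row_result)
print(col_result)
```

Both backends produce the same scaled values; only the container layout differs.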
6. Putting It All Together - Complete ML Pipeline
Real-World Example: Customer Churn Prediction
Let's build a complete machine learning pipeline with mloda:
# Step 1: Extend SampleData with more features for ML
# (Reuse the same class to avoid conflicts)
SampleData._original_calculate = SampleData.calculate_feature

@classmethod
def _extended_calculate(cls, data: Any, features: FeatureSet) -> Any:
    import numpy as np
    np.random.seed(42)
    n = 100
    return pd.DataFrame({
        'customer_id': [f'C{i:03d}' for i in range(n)],
        'age': np.random.randint(18, 70, n),
        'income': np.random.randint(30000, 120000, n),
        'account_balance': np.random.randint(0, 10000, n),
        'subscription_tier': np.random.choice(['Basic', 'Premium', 'Enterprise'], n),
        'region': np.random.choice(['North', 'South', 'East', 'West'], n),
        'customer_segment': np.random.choice(['New', 'Regular', 'VIP'], n),
        'churned': np.random.choice([0, 1], n)
    })

SampleData.calculate_feature = _extended_calculate
SampleData._input_data_original = SampleData.input_data()

@classmethod
def _extended_input_data(cls) -> Optional[BaseInputData]:
    return DataCreator({"customer_id", "age", "income", "account_balance",
                        "subscription_tier", "region", "customer_segment", "churned"})

SampleData.input_data = _extended_input_data
# Step 2: Run feature engineering pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
result = mloda.run_all(
    features=[
        "customer_id",
        "age__standard_scaled",
        "income__standard_scaled",
        "account_balance__robust_scaled",
        "subscription_tier__label_encoded",
        "region__label_encoded",
        "customer_segment__label_encoded",
        "churned"
    ],
    compute_frameworks={PandasDataFrame}
)
# Step 3: Prepare for ML
processed_data = result[0]
if len(processed_data.columns) > 2:  # Check we have features besides customer_id and churned
    X = processed_data.drop(['customer_id', 'churned'], axis=1)
    y = processed_data['churned']
    # Step 4: Train and evaluate
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"🎯 Model Accuracy: {accuracy:.2%}")
else:
    print("⚠️ Skipping ML - extend SampleData first with more features!")
What mloda Did For You:
- ✅ Generated sample data with DataCreator
- ✅ Scaled numeric features (StandardScaler & RobustScaler)
- ✅ Encoded categorical features (Label encoding)
- ✅ Returned clean DataFrame ready for sklearn
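For readers unfamiliar with the two transformations that are new in this pipeline, the sketch below shows the arithmetic behind label encoding and robust scaling using only the standard library. It follows the usual scikit-learn conventions (classes sorted before numbering, centering on the median and dividing by the interquartile range) as an assumption; the function names are made up and this is not mloda's plugin code.

```python
from statistics import median, quantiles


def label_encode(values: list[str]) -> list[int]:
    """Map each category to an integer, numbering classes in sorted order."""
    classes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [classes[v] for v in values]


def robust_scale(values: list[float]) -> list[float]:
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    center = median(values)
    return [(v - center) / (q3 - q1) for v in values]


tiers = ["Basic", "Premium", "Enterprise", "Basic"]
print(label_encode(tiers))

balances = [1, 2, 3, 4, 100]  # 100 is an outlier
print(robust_scale(balances))
```

Because the median and the interquartile range ignore the extreme value, the outlier does not compress the other four points the way it would under min-max or standard scaling.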
🎉 You now understand mloda's complete workflow! The same transformations work across pandas, polars, pyarrow, and other frameworks - just change compute_frameworks.
📖 Documentation
- Getting Started - Installation and first steps
- sklearn Integration - Complete tutorial
- Feature Groups - Core concepts
- Compute Frameworks - Technology integration
- API Reference - Complete API documentation
🤝 Contributing
We welcome contributions! Whether you're building plugins, adding features, or improving documentation, your input is invaluable.
- Development Guide - How to contribute
- GitHub Issues - Report bugs or request features
- Email - Direct contact
📄 License
This project is licensed under the Apache License, Version 2.0.