mloda: Make data, feature and context engineering shareable

⚠️ Early Version Notice: mloda is in active development. Some features described below are still being implemented. We're actively seeking feedback to shape the future of the framework. Share your thoughts!

🍳 Think of mloda Like Cooking Recipes

Traditional Data Pipelines = Making everything from scratch

Want pasta? Make noodles, sauce, cheese from raw ingredients
Want pizza? Start over - make dough, sauce, cheese again
Want lasagna? Repeat everything once more
Can't share recipes easily - they're mixed with your kitchen setup

mloda = Using recipe components

Create reusable recipes: "tomato sauce", "pasta dough", "cheese blend"
Use same "tomato sauce" for pasta, pizza, lasagna
Switch kitchens (home → restaurant → food truck) - same recipes work
Share your "tomato sauce" recipe with friends - they don't need your whole kitchen

Result: Instead of rebuilding the same thing 10 times, build once and reuse everywhere!

Installation

pip install mloda

1. The Core API Call - Your Starting Point

Complete Working Example with DataCreator

# Step 1: Create a sample data source using DataCreator
from mloda.provider import FeatureGroup, DataCreator, FeatureSet, BaseInputData
from typing import Any, Optional
import pandas as pd

class SampleData(FeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return DataCreator({"customer_id", "age", "income"})

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        return pd.DataFrame({
            'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005'],
            'age': [25, 30, 35, None, 45],
            'income': [50000, 75000, None, 60000, 85000]
        })

# Step 2: Load mloda plugins and run pipeline
from mloda.user import PluginLoader
import mloda
from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataFrame

PluginLoader.all()

result = mloda.run_all(
    features=[
        "customer_id",                    # Original column
        "age",                            # Original column
        "income__standard_scaled"         # Transform: scale income to mean=0, std=1
    ],
    compute_frameworks={PandasDataFrame}
)

# Step 3: Get your processed data
data = result[0]
print(data.head())
# Output: DataFrame with customer_id, age, and scaled income

What just happened?

1. SampleData class - Created a data source using DataCreator (generates data in-memory)
2. PluginLoader.all() - Loaded all available transformations (scaling, encoding, imputation, etc.)
3. mloda.run_all() - Executed the feature pipeline:
   - Got data from SampleData
   - Extracted customer_id and age as-is
   - Applied StandardScaler to income → income__standard_scaled
4. result[0] - Retrieved the processed pandas DataFrame
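
To make the scaling step concrete, here is a minimal sketch of the arithmetic behind income__standard_scaled. It assumes the plugin follows scikit-learn's StandardScaler conventions: population standard deviation, with missing values skipped when fitting and left missing in the output. The helper name standard_scale is hypothetical and is not part of mloda.

```python
from statistics import fmean, pstdev
from typing import Optional


def standard_scale(values: list[Optional[float]]) -> list[Optional[float]]:
    """Z-score the present values and keep missing entries as None.

    Hypothetical helper for illustration only, not mloda's implementation.
    """
    present = [v for v in values if v is not None]
    mean = fmean(present)
    std = pstdev(present)  # population std (ddof=0), the StandardScaler convention
    return [None if v is None else (v - mean) / std for v in values]


income = [50000, 75000, None, 60000, 85000]  # same values as SampleData above
scaled = standard_scale(income)
print([None if v is None else round(v, 3) for v in scaled])
```

The four present incomes end up centred on zero with unit spread, and the missing entry stays missing. To fill it first, chain an imputation step in front of the scaler (see Section 2).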

Key Insight: The syntax income__standard_scaled is mloda's feature chaining. Behind the scenes, mloda creates a chain of feature group objects (SourceFeatureGroup → StandardScalingFeatureGroup) and resolves the dependencies between them automatically. See Section 2 for a full explanation of the chaining syntax and Section 3 for the underlying feature group architecture.

2. Understanding Feature Chaining (Transformations)

The Power of Double Underscore __ Syntax

As mentioned in Section 1, feature chaining (like income__standard_scaled) is syntactic sugar that mloda converts into a chain of feature group objects. Each transformation (standard_scaled, mean_imputed, etc.) corresponds to a specific feature group class.

mloda's chaining syntax lets you compose transformations using __ as a separator:

# Pattern examples (these show the syntax):
#   "income__standard_scaled"                     # Scale income column
#   "age__mean_imputed"                           # Fill missing age values with mean
#   "category__onehot_encoded"                    # One-hot encode category column
#
# You can chain transformations!
# Pattern: {source}__{transform1}__{transform2}
#   "income__mean_imputed__standard_scaled"       # First impute, then scale

# Real working example:
_ = ["income__standard_scaled", "age__mean_imputed"]  # Valid feature names

Available Transformations:

Transformation	Purpose	Example
`__standard_scaled`	StandardScaler (mean=0, std=1)	`income__standard_scaled`
`__minmax_scaled`	MinMaxScaler (range [0,1])	`age__minmax_scaled`
`__robust_scaled`	RobustScaler (median-based, handles outliers)	`price__robust_scaled`
`__mean_imputed`	Fill missing values with mean	`salary__mean_imputed`
`__median_imputed`	Fill missing values with median	`age__median_imputed`
`__mode_imputed`	Fill missing values with mode	`category__mode_imputed`
`__onehot_encoded`	One-hot encoding	`state__onehot_encoded`
`__label_encoded`	Label encoding	`priority__label_encoded`
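
The scalers in the table react very differently to outliers. The sketch below shows the arithmetic for min-max and robust scaling on made-up data. It assumes scikit-learn's defaults (robust scaling centres on the median and divides by the 25th to 75th percentile range). The function names are hypothetical and are not mloda APIs.

```python
from statistics import median, quantiles


def minmax_scale(values: list[float]) -> list[float]:
    """Map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values: list[float]) -> list[float]:
    """Centre on the median and divide by the interquartile range."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]


prices = [1.0, 2.0, 3.0, 4.0, 100.0]  # illustrative data with one outlier
print(minmax_scale(prices))  # the outlier squeezes the other values close to 0
print(robust_scale(prices))  # the typical values keep a usable spread
```

With min-max scaling the single outlier defines the upper bound, so the four ordinary values are compressed into a narrow band near zero. Robust scaling ignores the extremes when computing its scale, which is why it suits columns such as prices or balances.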

Key Insight: Transformations are read left-to-right. income__mean_imputed__standard_scaled means: take income → apply mean imputation → apply standard scaling.
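
The left-to-right rule can be sketched with a plain string split. This is an illustrative parser, not mloda's internal one: split_chain and KNOWN_TRANSFORMS are hypothetical names, and the set simply mirrors the suffixes in the table above.

```python
# Suffixes copied from the "Available Transformations" table (illustrative set).
KNOWN_TRANSFORMS = {
    "standard_scaled", "minmax_scaled", "robust_scaled",
    "mean_imputed", "median_imputed", "mode_imputed",
    "onehot_encoded", "label_encoded",
}


def split_chain(feature_name: str) -> tuple[str, list[str]]:
    """Split a chained feature name into (source column, ordered steps)."""
    source, *steps = feature_name.split("__")
    unknown = [s for s in steps if s not in KNOWN_TRANSFORMS]
    if not source or unknown:
        raise ValueError(f"cannot parse {feature_name!r}; unknown steps: {unknown}")
    return source, steps


print(split_chain("income__mean_imputed__standard_scaled"))
# ('income', ['mean_imputed', 'standard_scaled'])
print(split_chain("customer_id"))
# ('customer_id', [])
```

The steps come back in application order, so the first list element is the transformation applied directly to the source column. Note that this simple split would misread a source column whose own name contains a double underscore.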

When You Need More Control

Most of the time, simple string syntax is enough:

# Example feature list (simple strings)
example_features = ["customer_id", "income__standard_scaled", "region__onehot_encoded"]

But for advanced configurations, you can explicitly create Feature objects with custom options (covered in Section 3).

3. Advanced: Feature Objects for Complex Configurations

Understanding the Feature Group Architecture

Behind the scenes, chaining like income__standard_scaled creates feature group objects:

# When you write this string:
"income__standard_scaled"

# mloda creates this chain of feature groups:
# StandardScalingFeatureGroup (reads from) → IncomeSourceFeatureGroup

Explicit Feature Objects

For truly custom configurations, you can use Feature objects:

# Example (for custom feature configurations):
# from mloda import Feature, Options
#
# features = [
#     "customer_id",                                   # Simple string
#     Feature(
#         "custom_feature",
#         options=Options({
#             "custom_param": "value",
#             "in_features": "source_column",
#         })
#     ),
# ]
#
# result = mloda.run_all(
#     features=features,
#     compute_frameworks={PandasDataFrame}
# )

Deep Dive: Each transformation suffix (__standard_scaled, __mean_imputed, etc.) maps to a feature group class in mloda_plugins/feature_group/. For example, __standard_scaled uses ScalingFeatureGroup. When you chain transformations, mloda builds a dependency graph of these feature groups and executes them in the correct order. This architecture makes mloda extensible: you can create custom feature groups for your own transformations!
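
The idea of a dependency graph with an execution order can be illustrated with Python's standard library. The sketch below treats every prefix of a chained name as a node that depends on the next shorter prefix, then sorts the nodes topologically. It is an assumption-laden toy model (build_dependencies is a hypothetical name), not mloda's actual planner.

```python
from graphlib import TopologicalSorter


def build_dependencies(features: list[str]) -> dict[str, set[str]]:
    """Map each feature (and each intermediate step) to what it reads from."""
    graph: dict[str, set[str]] = {}
    for name in features:
        parts = name.split("__")
        graph.setdefault(parts[0], set())  # source columns have no upstream
        for i in range(2, len(parts) + 1):
            node = "__".join(parts[:i])
            upstream = "__".join(parts[: i - 1])
            graph.setdefault(node, set()).add(upstream)
    return graph


requested = [
    "income__mean_imputed__standard_scaled",
    "income__mean_imputed",
    "age__standard_scaled",
]
graph = build_dependencies(requested)
order = list(TopologicalSorter(graph).static_order())
print(order)  # every node appears after the node it reads from
```

Because the two income features share the intermediate node income__mean_imputed, that node appears only once in the graph. Shared intermediate steps therefore need to be computed only once in this model.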

4. Data Access - Where Your Data Comes From

Three Ways to Provide Data

mloda supports multiple data access patterns depending on your use case:

1. DataCreator - For testing and demos (used in our examples)

# Perfect for creating sample/test data in-memory
# See Section 1 for the SampleData class definition using DataCreator:
#
# class SampleData(FeatureGroup):
#     @classmethod
#     def input_data(cls) -> Optional[BaseInputData]:
#         return DataCreator({"customer_id", "age", "income"})
#
#     @classmethod
#     def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
#         return pd.DataFrame({
#             'customer_id': ['C001', 'C002'],
#             'age': [25, 30],
#             'income': [50000, 75000]
#         })

2. DataAccessCollection - For production file/database access

# Example (requires actual files/databases):
# from mloda.user import DataAccessCollection
#
# # Read from files, folders, or databases
# data_access = DataAccessCollection(
#     files={"customers.csv", "orders.parquet"},           # CSV/Parquet/JSON files
#     folders={"data/raw/"},                                # Entire directories
#     credential_dicts={"host": "db.example.com"}           # Database credentials
# )
#
# result = mloda.run_all(
#     features=["customer_id", "income__standard_scaled"],
#     compute_frameworks={PandasDataFrame},
#     data_access_collection=data_access
# )

3. ApiData - For runtime data injection (web requests, real-time predictions)

# Example (for API endpoints and real-time predictions):
# from mloda.provider import ApiDataCollection
#
# api_input_data_collection = ApiDataCollection()
# api_data = api_input_data_collection.setup_key_api_data(
#     key_name="PredictionData",
#     api_input_data={"customer_id": ["C001", "C002"], "age": [25, 30]}
# )
#
# result = mloda.run_all(
#     features=["customer_id", "age__standard_scaled"],
#     compute_frameworks={PandasDataFrame},
#     api_input_data_collection=api_input_data_collection,
#     api_data=api_data
# )

Key Insight: Use DataCreator for demos, DataAccessCollection for batch processing from files/databases, and ApiData for real-time predictions and web services.
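
Runtime payloads like the api_input_data dictionary above are column-oriented, so one malformed request can carry columns of different lengths. A small pre-flight check is sketched below. It is a plain Python helper with a hypothetical name (check_columnar_payload) and makes no assumptions about how mloda itself validates input.

```python
def check_columnar_payload(payload: dict[str, list]) -> int:
    """Return the row count if all columns have equal length, else raise."""
    lengths = {name: len(column) for name, column in payload.items()}
    if not lengths:
        raise ValueError("payload has no columns")
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")
    return next(iter(lengths.values()))


payload = {"customer_id": ["C001", "C002"], "age": [25, 30]}
print(check_columnar_payload(payload))  # 2
```

Running a check like this at the edge of a web service keeps shape errors out of the feature pipeline, where they would be harder to trace back to the request.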

5. Compute Frameworks - Choose Your Processing Engine

Using Different Data Processing Libraries

mloda supports multiple compute frameworks (pandas, polars, pyarrow, etc.). Most users start with pandas:

# Using the SampleData class from Section 1
# Default: Everything processes with pandas
result = mloda.run_all(
    features=["customer_id", "income__standard_scaled"],
    compute_frameworks={PandasDataFrame}  # Use pandas for all features
)

data = result[0]  # Returns pandas DataFrame
print(type(data))  # <class 'pandas.core.frame.DataFrame'>

Why Compute Frameworks Matter:

Pandas: Best for small-to-medium datasets, rich ecosystem, familiar API
Polars: High performance for larger datasets
PyArrow: Memory-efficient, great for columnar data
Spark: Distributed processing for big data

For most use cases: Start with compute_frameworks={PandasDataFrame} and switch to others only if you need specific performance characteristics.

6. Putting It All Together - Complete ML Pipeline

Real-World Example: Customer Churn Prediction

Let's build a complete machine learning pipeline with mloda:

# Step 1: Extend SampleData with more features for ML
# (Reuse the same class to avoid conflicts)
SampleData._original_calculate = SampleData.calculate_feature

@classmethod
def _extended_calculate(cls, data: Any, features: FeatureSet) -> Any:
    import numpy as np
    np.random.seed(42)
    n = 100
    return pd.DataFrame({
        'customer_id': [f'C{i:03d}' for i in range(n)],
        'age': np.random.randint(18, 70, n),
        'income': np.random.randint(30000, 120000, n),
        'account_balance': np.random.randint(0, 10000, n),
        'subscription_tier': np.random.choice(['Basic', 'Premium', 'Enterprise'], n),
        'region': np.random.choice(['North', 'South', 'East', 'West'], n),
        'customer_segment': np.random.choice(['New', 'Regular', 'VIP'], n),
        'churned': np.random.choice([0, 1], n)  # random labels, so expect accuracy near chance level
    })

SampleData.calculate_feature = _extended_calculate
SampleData._input_data_original = SampleData.input_data  # keep the original method itself, not its return value

@classmethod
def _extended_input_data(cls) -> Optional[BaseInputData]:
    return DataCreator({"customer_id", "age", "income", "account_balance",
                       "subscription_tier", "region", "customer_segment", "churned"})

SampleData.input_data = _extended_input_data

# Step 2: Run feature engineering pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

result = mloda.run_all(
    features=[
        "customer_id",
        "age__standard_scaled",
        "income__standard_scaled",
        "account_balance__robust_scaled",
        "subscription_tier__label_encoded",
        "region__label_encoded",
        "customer_segment__label_encoded",
        "churned"
    ],
    compute_frameworks={PandasDataFrame}
)

# Step 3: Prepare for ML
processed_data = result[0]
if len(processed_data.columns) > 2:  # Check we have features besides customer_id and churned
    X = processed_data.drop(['customer_id', 'churned'], axis=1)
    y = processed_data['churned']

    # Step 4: Train and evaluate
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"🎯 Model Accuracy: {accuracy:.2%}")
else:
    print("⚠️ Skipping ML - extend SampleData first with more features!")

What mloda Did For You:

✅ Generated sample data with DataCreator
✅ Scaled numeric features (StandardScaler & RobustScaler)
✅ Encoded categorical features (Label encoding)
✅ Returned clean DataFrame ready for sklearn
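
For readers unfamiliar with label encoding, the sketch below shows what a __label_encoded column contains. It assumes scikit-learn's LabelEncoder convention, where classes are sorted and numbered from zero. label_encode is a hypothetical helper, not an mloda function.

```python
def label_encode(values: list[str]) -> tuple[list[int], list[str]]:
    """Replace each label with the index of its class in sorted order."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes


tiers = ["Basic", "Premium", "Enterprise", "Basic"]
codes, classes = label_encode(tiers)
print(classes)  # ['Basic', 'Enterprise', 'Premium']
print(codes)    # [0, 2, 1, 0]
```

The codes follow alphabetical order rather than any business ranking, so Enterprise lands between Basic and Premium. Tree-based models such as the RandomForestClassifier used above usually cope with this, while models that interpret magnitudes may be better served by __onehot_encoded.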

🎉 You now understand mloda's complete workflow! The same transformations work across pandas, polars, pyarrow, and other frameworks - just change compute_frameworks.

📖 Documentation

Getting Started - Installation and first steps
sklearn Integration - Complete tutorial
Feature Groups - Core concepts
Compute Frameworks - Technology integration
API Reference - Complete API documentation

🤝 Contributing

We welcome contributions! Whether you're building plugins, adding features, or improving documentation, your input is invaluable.

Development Guide - How to contribute
GitHub Issues - Report bugs or request features
Email - Direct contact

📄 License

This project is licensed under the Apache License, Version 2.0.


This version: 0.4.0 (Dec 16, 2025)
