mloda: Make data, feature and context engineering shareable

⚠️ Early Version Notice: mloda is in active development. Some features described below are still being implemented. We're actively seeking feedback to shape the future of the framework. Share your thoughts!

🍳 Think of mloda Like Cooking Recipes

Traditional Data Pipelines = Making everything from scratch

  • Want pasta? Make noodles, sauce, cheese from raw ingredients
  • Want pizza? Start over - make dough, sauce, cheese again
  • Want lasagna? Repeat everything once more
  • Can't share recipes easily - they're mixed with your kitchen setup

mloda = Using recipe components

  • Create reusable recipes: "tomato sauce", "pasta dough", "cheese blend"
  • Use same "tomato sauce" for pasta, pizza, lasagna
  • Switch kitchens (home → restaurant → food truck) - same recipes work
  • Share your "tomato sauce" recipe with friends - they don't need your whole kitchen

Result: Instead of rebuilding the same thing 10 times, build once and reuse everywhere!

Installation

pip install mloda

1. The Core API Call - Your Starting Point

Complete Working Example with DataCreator

# Step 1: Create a sample data source using DataCreator
from mloda.provider import FeatureGroup, DataCreator, FeatureSet, BaseInputData
from typing import Any, Optional
import pandas as pd

class SampleData(FeatureGroup):
    @classmethod
    def input_data(cls) -> Optional[BaseInputData]:
        return DataCreator({"customer_id", "age", "income"})

    @classmethod
    def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
        return pd.DataFrame({
            'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005'],
            'age': [25, 30, 35, None, 45],
            'income': [50000, 75000, None, 60000, 85000]
        })

# Step 2: Load mloda plugins and run pipeline
from mloda.user import PluginLoader
import mloda
from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataFrame

PluginLoader.all()

result = mloda.run_all(
    features=[
        "customer_id",                    # Original column
        "age",                            # Original column
        "income__standard_scaled"         # Transform: scale income to mean=0, std=1
    ],
    compute_frameworks={PandasDataFrame}
)

# Step 3: Get your processed data
data = result[0]
print(data.head())
# Output: DataFrame with customer_id, age, and scaled income

What just happened?

  1. SampleData class - Created a data source using DataCreator (generates data in-memory)
  2. PluginLoader.all() - Loaded all available transformations (scaling, encoding, imputation, etc.)
  3. mloda.run_all() - Executed the feature pipeline:
    • Got data from SampleData
    • Extracted customer_id and age as-is
    • Applied StandardScaler to income → income__standard_scaled
  4. result[0] - Retrieved the processed pandas DataFrame

Key Insight: The syntax income__standard_scaled is mloda's feature chaining. Behind the scenes, mloda creates a chain of feature group objects (SourceFeatureGroup → StandardScalingFeatureGroup) and resolves the dependencies automatically. See Section 2 for a full explanation of the chaining syntax and Section 3 for the underlying feature group architecture.
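To make the scaling step concrete, here is a minimal sketch of the arithmetic in plain Python, applied to the income values from SampleData. It is not mloda's implementation. The helper name standard_scale is hypothetical. The sketch assumes the plugin behaves like scikit-learn's StandardScaler, which uses the population standard deviation and leaves missing values in place.

```python
from statistics import fmean, pstdev
from typing import Optional


def standard_scale(values: list[Optional[float]]) -> list[Optional[float]]:
    """Hypothetical helper: scale to mean 0 and std 1, keeping missing values missing."""
    present = [v for v in values if v is not None]
    mean = fmean(present)
    std = pstdev(present)  # population standard deviation (ddof=0)
    if std == 0:
        # A constant column carries no spread; map every present value to 0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mean) / std for v in values]


# Income column from the SampleData example, including its missing entry.
income = [50000, 75000, None, 60000, 85000]
scaled = standard_scale(income)
print(scaled)
```

The missing entry stays missing, which is one reason to chain an imputation step before scaling (see Section 2).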

2. Understanding Feature Chaining (Transformations)

The Power of Double Underscore __ Syntax

As mentioned in Section 1, feature chaining (like income__standard_scaled) is syntactic sugar that mloda converts into a chain of feature group objects. Each transformation (standard_scaled, mean_imputed, etc.) corresponds to a specific feature group class.

mloda's chaining syntax lets you compose transformations using __ as a separator:

# Pattern examples (these show the syntax):
#   "income__standard_scaled"                     # Scale income column
#   "age__mean_imputed"                           # Fill missing age values with mean
#   "category__onehot_encoded"                    # One-hot encode category column
#
# You can chain transformations!
# Pattern: {source}__{transform1}__{transform2}
#   "income__mean_imputed__standard_scaled"       # First impute, then scale

# Real working example:
_ = ["income__standard_scaled", "age__mean_imputed"]  # Valid feature names

Available Transformations:

Transformation     Purpose                                         Example
__standard_scaled  StandardScaler (mean=0, std=1)                  income__standard_scaled
__minmax_scaled    MinMaxScaler (range [0,1])                      age__minmax_scaled
__robust_scaled    RobustScaler (median-based, handles outliers)   price__robust_scaled
__mean_imputed     Fill missing values with mean                   salary__mean_imputed
__median_imputed   Fill missing values with median                 age__median_imputed
__mode_imputed     Fill missing values with mode                   category__mode_imputed
__onehot_encoded   One-hot encoding                                state__onehot_encoded
__label_encoded    Label encoding                                  priority__label_encoded

Key Insight: Transformations are read left-to-right. income__mean_imputed__standard_scaled means: take income → apply mean imputation → apply standard scaling.
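The left-to-right rule can be sketched as a small parser plus a lookup table of transformations. This illustrates the idea only and is not mloda's parser. The names parse_feature_name, apply_chain and TRANSFORMS are hypothetical, and only two of the transformations from the table are implemented here.

```python
from statistics import fmean, pstdev
from typing import Callable, Optional

Column = list[Optional[float]]


def mean_impute(column: Column) -> Column:
    """Replace missing values with the mean of the present values."""
    fill = fmean([v for v in column if v is not None])
    return [fill if v is None else v for v in column]


def standard_scale(column: Column) -> Column:
    """Scale present values to mean 0 and population std 1."""
    present = [v for v in column if v is not None]
    mean, std = fmean(present), pstdev(present)
    return [None if v is None else ((v - mean) / std if std else 0.0) for v in column]


# Hypothetical registry: transformation suffix -> function.
TRANSFORMS: dict[str, Callable[[Column], Column]] = {
    "mean_imputed": mean_impute,
    "standard_scaled": standard_scale,
}


def parse_feature_name(name: str) -> tuple[str, list[str]]:
    """Split '{source}__{t1}__{t2}' into the source column and its ordered steps."""
    source, *steps = name.split("__")
    return source, steps


def apply_chain(name: str, columns: dict[str, Column]) -> Column:
    """Apply each step in the order it appears in the feature name."""
    source, steps = parse_feature_name(name)
    column = columns[source]
    for step in steps:
        if step not in TRANSFORMS:
            raise ValueError(f"Unknown transformation: {step}")
        column = TRANSFORMS[step](column)
    return column


columns = {"income": [50000, 75000, None, 60000, 85000]}
result = apply_chain("income__mean_imputed__standard_scaled", columns)
print(parse_feature_name("income__mean_imputed__standard_scaled"))
print(result)
```

Because imputation runs first, the missing income is replaced by the column mean and therefore lands at 0 after scaling.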

When You Need More Control

Most of the time, simple string syntax is enough:

# Example feature list (simple strings)
example_features = ["customer_id", "income__standard_scaled", "region__onehot_encoded"]

But for advanced configurations, you can explicitly create Feature objects with custom options (covered in Section 3).

3. Advanced: Feature Objects for Complex Configurations

Understanding the Feature Group Architecture

Behind the scenes, chaining like income__standard_scaled creates feature group objects:

# When you write this string:
"income__standard_scaled"

# mloda creates this chain of feature groups:
# StandardScalingFeatureGroup (reads from) → IncomeSourceFeatureGroup

Explicit Feature Objects

For truly custom configurations, you can use Feature objects:

# Example (for custom feature configurations):
# from mloda import Feature, Options
#
# features = [
#     "customer_id",                                   # Simple string
#     Feature(
#         "custom_feature",
#         options=Options({
#             "custom_param": "value",
#             "in_features": "source_column",
#         })
#     ),
# ]
#
# result = mloda.run_all(
#     features=features,
#     compute_frameworks={PandasDataFrame}
# )

Deep Dive: Each transformation type (__standard_scaled, __mean_imputed, etc.) maps to a feature group class in mloda_plugins/feature_group/. For example, __standard_scaled uses ScalingFeatureGroup. When you chain transformations, mloda builds a dependency graph of these feature groups and executes them in the correct order. This architecture makes mloda extensible: you can create custom feature groups for your own transformations!
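The dependency-graph idea can be sketched with Python's standard graphlib module. This is an illustration under assumptions, not mloda's resolver. The function build_dependency_graph is hypothetical, and it treats every prefix of a chained name as a node that depends on the prefix before it.

```python
from graphlib import TopologicalSorter


def build_dependency_graph(feature_names: list[str]) -> dict[str, set[str]]:
    """Hypothetical helper: map each chained feature to the feature it reads from."""
    graph: dict[str, set[str]] = {}
    for name in feature_names:
        parts = name.split("__")
        for i in range(len(parts)):
            node = "__".join(parts[: i + 1])
            dependencies = graph.setdefault(node, set())
            if i > 0:
                # Each step depends on the result of the previous step.
                dependencies.add("__".join(parts[:i]))
    return graph


requested = [
    "income__mean_imputed__standard_scaled",
    "age__mean_imputed",
    "income__mean_imputed",  # shared intermediate, computed only once
]
graph = build_dependency_graph(requested)

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Shared intermediates such as income__mean_imputed appear once in the graph, so a step that several features depend on only needs to be planned a single time.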

4. Data Access - Where Your Data Comes From

Three Ways to Provide Data

mloda supports multiple data access patterns depending on your use case:

1. DataCreator - For testing and demos (used in our examples)

# Perfect for creating sample/test data in-memory
# See Section 1 for the SampleData class definition using DataCreator:
#
# class SampleData(FeatureGroup):
#     @classmethod
#     def input_data(cls) -> Optional[BaseInputData]:
#         return DataCreator({"customer_id", "age", "income"})
#
#     @classmethod
#     def calculate_feature(cls, data: Any, features: FeatureSet) -> Any:
#         return pd.DataFrame({
#             'customer_id': ['C001', 'C002'],
#             'age': [25, 30],
#             'income': [50000, 75000]
#         })

2. DataAccessCollection - For production file/database access

# Example (requires actual files/databases):
# from mloda.user import DataAccessCollection
#
# # Read from files, folders, or databases
# data_access = DataAccessCollection(
#     files={"customers.csv", "orders.parquet"},           # CSV/Parquet/JSON files
#     folders={"data/raw/"},                                # Entire directories
#     credential_dicts={"host": "db.example.com"}           # Database credentials
# )
#
# result = mloda.run_all(
#     features=["customer_id", "income__standard_scaled"],
#     compute_frameworks={PandasDataFrame},
#     data_access_collection=data_access
# )

3. ApiData - For runtime data injection (web requests, real-time predictions)

# Example (for API endpoints and real-time predictions):
# from mloda.provider import ApiDataCollection
#
# api_input_data_collection = ApiDataCollection()
# api_data = api_input_data_collection.setup_key_api_data(
#     key_name="PredictionData",
#     api_input_data={"customer_id": ["C001", "C002"], "age": [25, 30]}
# )
#
# result = mloda.run_all(
#     features=["customer_id", "age__standard_scaled"],
#     compute_frameworks={PandasDataFrame},
#     api_input_data_collection=api_input_data_collection,
#     api_data=api_data
# )

Key Insight: Use DataCreator for demos, DataAccessCollection for batch processing from files/databases, and ApiData for real-time predictions and web services.
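One way to picture how a requested feature finds its data is a lookup from the root column name to the source that declares it. The sketch below is plain Python and purely illustrative. DataSource and resolve_sources are hypothetical names, not part of mloda's API, and the README does not describe mloda's actual matching logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """Hypothetical description of a data source and the columns it declares."""
    name: str
    kind: str  # "creator", "file" or "api"
    columns: frozenset[str]


def resolve_sources(features: list[str], sources: list[DataSource]) -> dict[str, DataSource]:
    """Match each requested feature to the single source declaring its root column."""
    resolved: dict[str, DataSource] = {}
    for feature in features:
        root = feature.split("__")[0]  # e.g. "income" for "income__standard_scaled"
        matches = [s for s in sources if root in s.columns]
        if len(matches) != 1:
            raise LookupError(f"Expected exactly one source for {root!r}, found {len(matches)}")
        resolved[feature] = matches[0]
    return resolved


sources = [
    DataSource("SampleData", "creator", frozenset({"customer_id", "age", "income"})),
    DataSource("PredictionData", "api", frozenset({"session_length"})),
]
requested = ["customer_id", "income__standard_scaled", "session_length__minmax_scaled"]
resolved = resolve_sources(requested, sources)
for feature, source in resolved.items():
    print(f"{feature} <- {source.name} ({source.kind})")
```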

5. Compute Frameworks - Choose Your Processing Engine

Using Different Data Processing Libraries

mloda supports multiple compute frameworks (pandas, polars, pyarrow, etc.). Most users start with pandas:

# Using the SampleData class from Section 1
# Default: Everything processes with pandas
result = mloda.run_all(
    features=["customer_id", "income__standard_scaled"],
    compute_frameworks={PandasDataFrame}  # Use pandas for all features
)

data = result[0]  # Returns pandas DataFrame
print(type(data))  # <class 'pandas.core.frame.DataFrame'>

Why Compute Frameworks Matter:

  • Pandas: Best for small-to-medium datasets, rich ecosystem, familiar API
  • Polars: High performance for larger datasets
  • PyArrow: Memory-efficient, great for columnar data
  • Spark: Distributed processing for big data

For most use cases: Start with compute_frameworks={PandasDataFrame} and switch to others only if you need specific performance characteristics.
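The idea of defining a transformation once and running it on different engines can be sketched without any third-party library. The two toy backends below stand in for real frameworks. ColumnarBackend, RowBackend and add_minmax_feature are hypothetical names for this illustration and do not reflect how mloda's compute framework classes are implemented.

```python
def minmax_scale(values: list[float]) -> list[float]:
    """Scale values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    span = high - low
    return [0.0 if span == 0 else (v - low) / span for v in values]


class ColumnarBackend:
    """Toy column store: {column name: list of values}."""

    def __init__(self, data: dict[str, list]) -> None:
        self.data = {name: list(values) for name, values in data.items()}

    def get_column(self, name: str) -> list:
        return list(self.data[name])

    def with_column(self, name: str, values: list) -> "ColumnarBackend":
        return ColumnarBackend({**self.data, name: values})


class RowBackend:
    """Toy row store: list of {column name: value} records."""

    def __init__(self, rows: list[dict]) -> None:
        self.rows = [dict(row) for row in rows]

    def get_column(self, name: str) -> list:
        return [row[name] for row in self.rows]

    def with_column(self, name: str, values: list) -> "RowBackend":
        return RowBackend([{**row, name: v} for row, v in zip(self.rows, values)])


def add_minmax_feature(backend, source: str):
    """Written once against get_column/with_column, so it works on either backend."""
    scaled = minmax_scale(backend.get_column(source))
    return backend.with_column(f"{source}__minmax_scaled", scaled)


columnar = add_minmax_feature(ColumnarBackend({"age": [25, 35, 45]}), "age")
rows = add_minmax_feature(RowBackend([{"age": 25}, {"age": 35}, {"age": 45}]), "age")
print(columnar.get_column("age__minmax_scaled"))
print(rows.get_column("age__minmax_scaled"))
```

The feature logic only talks to a small interface, which is what lets the same feature definition move between engines.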

6. Putting It All Together - Complete ML Pipeline

Real-World Example: Customer Churn Prediction

Let's build a complete machine learning pipeline with mloda:

# Step 1: Extend SampleData with more features for ML
# (Reuse the same class to avoid conflicts)
SampleData._original_calculate = SampleData.calculate_feature

@classmethod
def _extended_calculate(cls, data: Any, features: FeatureSet) -> Any:
    import numpy as np
    np.random.seed(42)
    n = 100
    return pd.DataFrame({
        'customer_id': [f'C{i:03d}' for i in range(n)],
        'age': np.random.randint(18, 70, n),
        'income': np.random.randint(30000, 120000, n),
        'account_balance': np.random.randint(0, 10000, n),
        'subscription_tier': np.random.choice(['Basic', 'Premium', 'Enterprise'], n),
        'region': np.random.choice(['North', 'South', 'East', 'West'], n),
        'customer_segment': np.random.choice(['New', 'Regular', 'VIP'], n),
        'churned': np.random.choice([0, 1], n)
    })

SampleData.calculate_feature = _extended_calculate
SampleData._input_data_original = SampleData.input_data  # keep the original method, not its return value

@classmethod
def _extended_input_data(cls) -> Optional[BaseInputData]:
    return DataCreator({"customer_id", "age", "income", "account_balance",
                       "subscription_tier", "region", "customer_segment", "churned"})

SampleData.input_data = _extended_input_data

# Step 2: Run feature engineering pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

result = mloda.run_all(
    features=[
        "customer_id",
        "age__standard_scaled",
        "income__standard_scaled",
        "account_balance__robust_scaled",
        "subscription_tier__label_encoded",
        "region__label_encoded",
        "customer_segment__label_encoded",
        "churned"
    ],
    compute_frameworks={PandasDataFrame}
)

# Step 3: Prepare for ML
processed_data = result[0]
if len(processed_data.columns) > 2:  # Check we have features besides customer_id and churned
    X = processed_data.drop(['customer_id', 'churned'], axis=1)
    y = processed_data['churned']

    # Step 4: Train and evaluate
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"🎯 Model Accuracy: {accuracy:.2%}")
else:
    print("⚠️ Skipping ML - extend SampleData first with more features!")

What mloda Did For You:

  1. ✅ Generated sample data with DataCreator
  2. ✅ Scaled numeric features (StandardScaler & RobustScaler)
  3. ✅ Encoded categorical features (Label encoding)
  4. ✅ Returned clean DataFrame ready for sklearn
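For readers unfamiliar with the two transformations that appear here for the first time, the following plain-Python sketch shows the arithmetic behind robust scaling and label encoding. It is not mloda's code. The helpers robust_scale and label_encode are hypothetical and assume the conventions of scikit-learn's RobustScaler (median and interquartile range) and LabelEncoder (classes sorted, then numbered from 0).

```python
from statistics import median, quantiles


def robust_scale(values: list[float]) -> list[float]:
    """Subtract the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    center = median(values)
    iqr = q3 - q1
    return [(v - center) / iqr if iqr else 0.0 for v in values]


def label_encode(values: list[str]) -> tuple[list[int], list[str]]:
    """Map each distinct label to an integer, numbering the sorted labels from 0."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], classes


# A single extreme balance barely affects how the other values are scaled.
balances = [100, 200, 300, 400, 10000]
print(robust_scale(balances))

tiers = ["Basic", "Premium", "Enterprise", "Basic"]
codes, classes = label_encode(tiers)
print(codes, classes)
```

Label encoding imposes an arbitrary order on the categories. Tree-based models such as the RandomForestClassifier used above are generally tolerant of that, while linear models usually work better with one-hot encoding.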

🎉 You now understand mloda's complete workflow! The same transformations work across pandas, polars, pyarrow, and other frameworks - just change compute_frameworks.

📖 Documentation

🤝 Contributing

We welcome contributions! Whether you're building plugins, adding features, or improving documentation, your input is invaluable.

📄 License

This project is licensed under the Apache License, Version 2.0.
