Rethinking Data and Feature Engineering

mloda: Revolutionary Process-Data Separation for Feature and Data Engineering

โš ๏ธ Early Version Notice: mloda is in active development. Some features described below are still being implemented. We're actively seeking feedback to shape the future of the framework. Share your thoughts!

๐Ÿš€ Transforming Feature Engineering Through Process-Data Separation

mloda revolutionizes feature engineering by separating processes (transformations) from data, enabling unprecedented flexibility, reusability, and scalability in machine learning workflows.

🤖 Built for the AI Era: While others write code, AI writes mloda plugins. Check the inline comments in our experimental plugin code - all AI-written.

🌍 Share Without Secrets: Traditional pipelines lock business logic inside - mloda plugins separate transformations from business context, enabling safe community sharing.

🎯 Try the first example now: sklearn Integration Example - see mloda transform traditional sklearn pipelines!

๐Ÿณ Think of mloda Like Cooking Recipes

Traditional Data Pipelines = Making everything from scratch

  • Want pasta? Make noodles, sauce, cheese from raw ingredients
  • Want pizza? Start over - make dough, sauce, cheese again
  • Want lasagna? Repeat everything once more
  • Can't share recipes easily - they're mixed with your kitchen setup

mloda = Using recipe components

  • Create reusable recipes: "tomato sauce", "pasta dough", "cheese blend"
  • Use same "tomato sauce" for pasta, pizza, lasagna
  • Switch kitchens (home → restaurant → food truck) - same recipes work
  • Share your "tomato sauce" recipe with friends - they don't need your whole kitchen

Real Example: You need to clean customer ages (remove outliers, fill missing values)

  • Traditional: Write age-cleaning code for training, testing, production separately
  • mloda: Create one "clean_age" plugin, use everywhere - development, testing, production, analysis

Result: Instead of rebuilding the same thing 10 times, build once and reuse everywhere!
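To make the "clean_age" idea concrete, here is a plain-function sketch. This is a hypothetical illustration only - the outlier bounds and the median-fill strategy are assumptions, and this is not mloda's plugin API:

```python
import numpy as np

def clean_age(ages, low=0, high=120):
    """Mark implausible ages as missing, then fill all gaps with the median."""
    ages = np.asarray(ages, dtype=float)
    # Treat out-of-range values as missing alongside genuine NaNs.
    valid = np.where((ages >= low) & (ages <= high), ages, np.nan)
    median = np.nanmedian(valid)
    return np.where(np.isnan(valid), median, ages)

# Outlier (200) and the missing value are both replaced by the median (30.0).
print(clean_age([25, 200, np.nan, 35]))  # [25. 30. 30. 35.]
```

Packaged once as a plugin, the same logic would serve development, testing, production, and analysis alike.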

💡 The Value Proposition

What mloda aims to enable:

| Challenge | Traditional Pain Point | mloda's Approach |
| --- | --- | --- |
| ⏰ Repetitive Work | Rebuild the same transformations for each environment | Write once, reuse across all environments |
| 🐛 Consistency Issues | Different implementations create bugs | A single implementation ensures consistency |
| 👥 Knowledge Silos | Senior expertise locked in complex pipelines | Reusable patterns everyone can use |
| 🚀 Deployment Friction | Train/serve skew causes production issues | The same logic is guaranteed everywhere |
| 💡 Innovation Bottleneck | Time spent on solved problems | Focus energy on unique business value |

Vision: Enable data teams to spend more time solving unique business problems and less time rebuilding common patterns, while reducing the risk of inconsistencies across environments.

📊 Why Process-Data Separation Changes Everything

| Aspect | Traditional Approach | mloda Approach |
| --- | --- | --- |
| 🔄 Reusability | Transformations tied to specific datasets | Same feature definitions work across all contexts |
| ⚡ Flexibility | Locked to a single compute framework | Multi-framework support with automatic optimization |
| 📝 Maintainability | Complex nested pipeline objects | Clean, declarative feature names |
| 🏭 Scalability | Framework-specific limitations | Horizontal scaling without architectural changes |

For those who know: Want Iceberg-like metadata capabilities across your entire data and feature lifecycle? That's exactly what mloda aims for.

🚀 Quick Start

Installation

pip install mloda

Your First Feature Pipeline

import numpy as np
from mloda_core.api.request import mlodaAPI
from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataframe
from mloda_core.abstract_plugins.components.input_data.creator.data_creator import DataCreator
from mloda_core.abstract_plugins.abstract_feature_group import AbstractFeatureGroup

np.random.seed(42)
n_samples = 1000

class YourFirstSyntheticDataSet(AbstractFeatureGroup):
    @classmethod
    def input_data(cls):
        return DataCreator({"age", "weight", "state", "gender"})

    @classmethod
    def calculate_feature(cls, data, features):
        return {
            "age": np.random.randint(25, 65, n_samples),
            "weight": np.random.normal(80, 20, n_samples),
            "state": np.random.choice(["WA", "OR"], n_samples),
            "gender": np.random.choice(["M", "F", "Other"], n_samples),
        }

# Define features with automatic dependency resolution
features = [
    "standard_scaled__mean_imputed__age",
    "onehot_encoded__state", 
    "robust_scaled__weight"
]

# Execute with automatic framework selection
result = mlodaAPI.run_all(features, compute_frameworks={PandasDataframe})
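The double-underscore feature names above chain transformations onto a source column. A toy parser - hypothetical, since mloda's actual dependency resolution is richer than a string split - shows how such a name decomposes:

```python
def parse_feature_name(name):
    """Split a chained feature name like 'standard_scaled__mean_imputed__age'
    into its source column and its transformation steps, innermost first."""
    *transforms, source = name.split("__")
    return source, list(reversed(transforms))

source, steps = parse_feature_name("standard_scaled__mean_imputed__age")
# source == "age"; steps == ["mean_imputed", "standard_scaled"]
```

Read right to left: start from the raw `age` column, impute missing values with the mean, then standard-scale the result.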

🔄 Write Once, Run Anywhere: Environments & Frameworks

The Core Promise: One plugin definition works across all environments and technologies.

# Traditional approach: Rebuild for each context
def clean_age_training(data): ...      # Training pipeline
def clean_age_testing(data): ...       # Testing pipeline  
def clean_age_production(data): ...    # Production API
def clean_age_spark(data): ...         # Big data processing
def clean_age_analysis(data): ...      # Analytics

# mloda approach: Write once, use everywhere
class CleanAgePlugin(AbstractFeatureGroup):
    @classmethod
    def calculate_feature(cls, data, features):
        # Single implementation for all contexts
        return process_age_data(data["age"])

# Same plugin, different environments & frameworks
mlodaAPI.run_all(["clean_age"], compute_frameworks={PandasDataframe})  # Dev
mlodaAPI.run_all(["clean_age"], compute_frameworks={SparkDataframe})   # Production
mlodaAPI.run_all(["clean_age"], compute_frameworks={PolarsDataframe})  # High performance
mlodaAPI.run_all(["clean_age"], compute_frameworks={DuckDBFramework})  # Analytics

Result: 5+ implementations → 1 plugin that adapts automatically.

Different Data Scales, Same Processing Logic

graph TB
    subgraph "📊 Data Scenarios"
        CSV["📄 Development<br/>Small CSV files<br/>~1K rows"]
        BATCH["🏋️ Training<br/>Full dataset<br/>~1M+ rows"]
        SINGLE["⚡ Inference<br/>Single row<br/>Real-time"]
        ANALYSIS["📈 Analysis<br/>Historical batch<br/>Post-deployment"]
    end

    subgraph "🎯 Same Features Applied"
        RESULT["standard_scaled__mean_imputed__age<br/>onehot_encoded__state<br/>robust_scaled__weight"]
    end
    
    CSV --> RESULT
    BATCH --> RESULT
    SINGLE --> RESULT
    ANALYSIS --> RESULT
    
    style CSV fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style BATCH fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style SINGLE fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
    style ANALYSIS fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    style RESULT fill:#e8f5e8,stroke:#4caf50,stroke-width:3px

๐ŸŒ Deploy Anywhere Python Runs

Universal Deployment: mloda runs wherever Python runs - no special infrastructure needed.

| Environment | Use Case | Example |
| --- | --- | --- |
| 💻 Local Development | Prototyping & testing | Jupyter notebooks, VS Code |
| ☁️ Any Cloud | Production workloads | AWS, GCP, Azure, DigitalOcean |
| 🏢 On-Premise | Enterprise & compliance | Air-gapped environments |
| 📊 Notebooks | Data science workflows | Jupyter, Colab, Databricks |
| 🌐 Web APIs | Real-time serving | Flask, FastAPI, Django |
| ⚙️ Orchestration | Batch processing | Airflow, Prefect, Dagster |
| 🐳 Containers | Microservices | Docker, Kubernetes |
| ⚡ Serverless | Event-driven | AWS Lambda, Google Cloud Functions |

No vendor lock-in. No special runtime. Just Python.

🎯 Minimal Dependencies, Maximum Compatibility

PyArrow-Only Core: PyArrow is mloda's sole core dependency - no other Python packages are required.

Why PyArrow? It's the universal language of modern data:

  • Interoperability: Native bridge between Pandas, Polars, Spark, DuckDB
  • Performance: Zero-copy data sharing between frameworks
  • Standards: Apache Arrow is the foundation of modern data tools
  • Future-Proof: Industry standard for columnar data processing

This architectural choice enables mloda's seamless framework switching without dependency conflicts.

🔧 Complete Data Processing Capabilities

Beyond Feature Engineering: mloda provides a full set of data processing operations:

| Operation | Purpose | Example Use Case |
| --- | --- | --- |
| 🔗 Joins | Combine datasets | User profiles + transaction history |
| 🔀 Merges | Consolidate data sources | Multiple feature tables into one |
| 🔍 Filters | Data selection & quality | Remove outliers, select time ranges |
| 🏷️ Domains | Data organization & governance | Logical data grouping and access control |

All operations work seamlessly across any compute framework with the same simple API.

👥 Logical Role-Based Data Governance

Clear Role Separation: mloda logically splits data responsibilities into three distinct roles:

| Role | Responsibility | Key Activities |
| --- | --- | --- |
| 🏗️ Data Producer | Create & maintain plugins | Define data access, implement feature groups, ensure quality |
| 👤 Data User | Consume features via API | Request features, configure workflows, build ML models |
| 🛡️ Data Owner | Governance & lifecycle | Control access, manage compliance, oversee data quality |

Organizational Clarity: Each role has defined boundaries, enabling proper data governance while maintaining development flexibility. Learn more about roles

๐ŸŒ Community-Driven Plugin Ecosystem

Share Transformations, Keep Secrets: Unlike traditional pipelines where business logic is embedded, mloda separates transformation patterns from business context.

| Challenge | Traditional Pipelines | mloda Solution |
| --- | --- | --- |
| 🔒 Knowledge Sharing | Business logic embedded - can't share | Transformations separated - safe to share |
| 🔄 Reusability | Rebuild common patterns everywhere | Community library of proven patterns |
| ⚡ Innovation | Everyone reinvents the wheel | Build on collective knowledge |
| 🎯 Focus | Waste time on solved problems | Focus on unique business value |

Result: A thriving ecosystem where data teams contribute transformation patterns while protecting their competitive advantages.

📖 Documentation

🤝 Contributing

We welcome contributions! Whether you're building plugins, adding features, or improving documentation, your input is invaluable.

📄 License

This project is licensed under the Apache License, Version 2.0.

