Rethinking Data and Feature Engineering
mloda: Revolutionary Process-Data Separation for Feature and Data Engineering
Early Version Notice: mloda is in active development. Some features described below are still being implemented. We're actively seeking feedback to shape the future of the framework. Share your thoughts!
Transforming Feature Engineering Through Process-Data Separation
mloda revolutionizes feature engineering by separating processes (transformations) from data, enabling unprecedented flexibility, reusability, and scalability in machine learning workflows.
Built for the AI Era: While others write code, AI writes mloda plugins. Check the inline comments in our experimental plugin code - all AI written.
Share Without Secrets: Traditional pipelines lock business logic inside - mloda plugins separate transformations from business context, enabling safe community sharing.
Try the first example out NOW: sklearn Integration Example - See mloda transform traditional sklearn pipelines!
Table of Contents
- Think of mloda Like Cooking Recipes
- The Value Proposition
- Why Process-Data Separation Changes Everything
- Quick Start
- Write Once, Run Anywhere
- Deploy Anywhere Python Runs
- Minimal Dependencies
- Complete Data Processing
- Role-Based Governance
- Community-Driven Plugin Ecosystem
- Documentation
- Contributing
- License
Think of mloda Like Cooking Recipes
Traditional Data Pipelines = Making everything from scratch
- Want pasta? Make noodles, sauce, cheese from raw ingredients
- Want pizza? Start over - make dough, sauce, cheese again
- Want lasagna? Repeat everything once more
- Can't share recipes easily - they're mixed with your kitchen setup
mloda = Using recipe components
- Create reusable recipes: "tomato sauce", "pasta dough", "cheese blend"
- Use same "tomato sauce" for pasta, pizza, lasagna
- Switch kitchens (home → restaurant → food truck) - same recipes work
- Share your "tomato sauce" recipe with friends - they don't need your whole kitchen
Real Example: You need to clean customer ages (remove outliers, fill missing values)
- Traditional: Write age-cleaning code for training, testing, production separately
- mloda: Create one "clean_age" plugin, use everywhere - development, testing, production, analysis
Result: Instead of rebuilding the same thing 10 times, build once and reuse everywhere!
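To make the "clean_age" idea concrete, here is a minimal sketch of what such a cleaning step could look like as one reusable function. It is written in plain Python so it runs without dependencies; the function name, the valid range of 0 to 120, and the choice of the median as fill value are illustrative assumptions, not part of mloda.

```python
from statistics import median
from typing import List, Optional, Sequence


def clean_age(
    ages: Sequence[Optional[float]],
    lower: float = 0.0,    # assumed lower bound for a plausible age
    upper: float = 120.0,  # assumed upper bound for a plausible age
) -> List[float]:
    """Treat out-of-range ages as missing, then fill gaps with the median."""
    # Values outside [lower, upper] are considered outliers and set to None.
    in_range = [a if a is not None and lower <= a <= upper else None for a in ages]
    valid = [a for a in in_range if a is not None]
    if not valid:
        raise ValueError("no valid ages to compute a fill value from")
    fill = median(valid)
    return [fill if a is None else a for a in in_range]


cleaned = clean_age([34, None, 29, 212, 41])
print(cleaned)
```

Because the logic lives in one function, training, testing, and serving code can all call the same implementation instead of keeping separate copies in sync.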
The Value Proposition
What mloda aims to enable:
| Challenge | Traditional Pain Point | mloda's Approach |
|---|---|---|
| Repetitive Work | Rebuild same transformations for each environment | Write once, reuse across all environments |
| Consistency Issues | Different implementations create bugs | Single implementation ensures consistency |
| Knowledge Silos | Senior expertise locked in complex pipelines | Reusable patterns everyone can use |
| Deployment Friction | Train/serve skew causes production issues | Same logic guaranteed everywhere |
| Innovation Bottleneck | Time spent on solved problems | Focus energy on unique business value |
Vision: Enable data teams to spend more time solving unique business problems and less time rebuilding common patterns, while reducing the risk of inconsistencies across environments.
Why Process-Data Separation Changes Everything
| Aspect | Traditional Approach | mloda Approach |
|---|---|---|
| Reusability | Transformations tied to specific datasets | Same feature definitions work across all contexts |
| Flexibility | Locked to single compute framework | Multi-framework support with automatic optimization |
| Maintainability | Complex nested pipeline objects | Clean, declarative feature names |
| Scalability | Framework-specific limitations | Horizontal scaling without architectural changes |
For those who know: Want Iceberg-like metadata capabilities across your entire data and feature lifecycle? That's exactly what mloda aims for.
Quick Start
Installation
pip install mloda
Your First Feature Pipeline
import numpy as np
from mloda_core.api.request import mlodaAPI
from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataframe
from mloda_core.abstract_plugins.components.input_data.creator.data_creator import DataCreator
from mloda_core.abstract_plugins.abstract_feature_group import AbstractFeatureGroup

np.random.seed(42)  # fixed seed so the synthetic data is reproducible
n_samples = 1000

class YourFirstSyntheticDataSet(AbstractFeatureGroup):
    @classmethod
    def input_data(cls):
        # Declare the raw columns this feature group creates itself
        return DataCreator({"age", "weight", "state", "gender"})

    @classmethod
    def calculate_feature(cls, data, features):
        # Generate n_samples rows of synthetic data
        return {
            "age": np.random.randint(25, 65, n_samples),
            "weight": np.random.normal(80, 20, n_samples),
            "state": np.random.choice(["WA", "OR"], n_samples),
            "gender": np.random.choice(["M", "F", "Other"], n_samples),
        }

# Define features with automatic dependency resolution
features = [
    "standard_scaled__mean_imputed__age",
    "onehot_encoded__state",
    "robust_scaled__weight",
]

# Execute with automatic framework selection
result = mlodaAPI.run_all(features, compute_frameworks={PandasDataframe})
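The feature names above read as chains such as "standard_scaled__mean_imputed__age". The sketch below shows one way such a name could be split into a source column and an ordered list of steps. It assumes a double underscore separates the parts and that the transformation closest to the column runs first; that is inferred from the examples only, and `parse_feature_name` and `FeaturePlan` are hypothetical helpers, not mloda's actual resolver.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeaturePlan:
    source: str       # raw column the chain starts from
    steps: List[str]  # transformations, in the order they would run


def parse_feature_name(name: str, sep: str = "__") -> FeaturePlan:
    """Split 'outer__inner__column' into a source column and ordered steps."""
    parts = name.split(sep)
    if any(not part for part in parts):
        raise ValueError(f"malformed feature name: {name!r}")
    *transforms, source = parts
    # Assumption: the innermost transformation sits next to the column,
    # so it runs first and the outermost one runs last.
    return FeaturePlan(source=source, steps=list(reversed(transforms)))


plan = parse_feature_name("standard_scaled__mean_imputed__age")
print(plan.source, plan.steps)
```

Under these assumptions the example resolves to the column "age" with "mean_imputed" applied before "standard_scaled", which is the kind of dependency order a resolver would have to derive from the name.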
Write Once, Run Anywhere: Environments & Frameworks
The Core Promise: One plugin definition works across all environments and technologies.
# Traditional approach: Rebuild for each context
def clean_age_training(data): ...    # Training pipeline
def clean_age_testing(data): ...     # Testing pipeline
def clean_age_production(data): ...  # Production API
def clean_age_spark(data): ...       # Big data processing
def clean_age_analysis(data): ...    # Analytics

# mloda approach: Write once, use everywhere
class CleanAgePlugin(AbstractFeatureGroup):
    @classmethod
    def calculate_feature(cls, data, features):
        # Single implementation for all contexts
        # (process_age_data stands in for your own cleaning logic)
        return process_age_data(data["age"])

# Same plugin, different environments & frameworks
# (each compute framework class must be imported from its plugin module first)
mlodaAPI.run_all(["clean_age"], compute_frameworks={PandasDataframe})  # Dev
mlodaAPI.run_all(["clean_age"], compute_frameworks={SparkDataframe})   # Production
mlodaAPI.run_all(["clean_age"], compute_frameworks={PolarsDataframe})  # High performance
mlodaAPI.run_all(["clean_age"], compute_frameworks={DuckDBFramework})  # Analytics
Result: 5+ implementations → 1 plugin that adapts automatically.
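The principle behind this promise can be illustrated without mloda. In the sketch below, the transformation (the "process") only sees a column of values, and thin adapters feed it from two different data layouts. All names here (`fill_missing_with_mean`, `run_on_columns`, `run_on_rows`) are hypothetical and stand in for what a compute framework integration would do; this is not mloda's API.

```python
from typing import Callable, Dict, List, Optional

Column = List[Optional[float]]


def fill_missing_with_mean(column: Column) -> List[float]:
    """The 'process': pure column logic with no knowledge of storage layout."""
    present = [v for v in column if v is not None]
    if not present:
        raise ValueError("column has no values to average")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]


def run_on_columns(
    table: Dict[str, list], name: str, fn: Callable[[Column], List[float]]
) -> Dict[str, list]:
    """Adapter for a column-oriented table (dict of lists)."""
    return {**table, name: fn(table[name])}


def run_on_rows(
    rows: List[dict], name: str, fn: Callable[[Column], List[float]]
) -> List[dict]:
    """Adapter for a row-oriented table (list of dicts)."""
    new_values = fn([row[name] for row in rows])
    return [{**row, name: value} for row, value in zip(rows, new_values)]


columnar = {"age": [30.0, None, 50.0]}
row_based = [{"age": 30.0}, {"age": None}, {"age": 50.0}]

columnar_result = run_on_columns(columnar, "age", fill_missing_with_mean)
row_result = run_on_rows(row_based, "age", fill_missing_with_mean)
print(columnar_result)
print(row_result)
```

The transformation is written once; only the adapters differ per layout, which is the same division of labor the plugin and compute framework split aims for.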
Different Data Scales, Same Processing Logic
graph TB
    subgraph "Data Scenarios"
        CSV["Development<br/>Small CSV files<br/>~1K rows"]
        BATCH["Training<br/>Full dataset<br/>~1M+ rows"]
        SINGLE["Inference<br/>Single row<br/>Real-time"]
        ANALYSIS["Analysis<br/>Historical batch<br/>Post-deployment"]
    end
    subgraph "Same Features Applied"
        RESULT["standard_scaled__mean_imputed__age<br/>onehot_encoded__state<br/>robust_scaled__weight"]
    end
    CSV --> RESULT
    BATCH --> RESULT
    SINGLE --> RESULT
    ANALYSIS --> RESULT
    style CSV fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style BATCH fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style SINGLE fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
    style ANALYSIS fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    style RESULT fill:#e8f5e8,stroke:#4caf50,stroke-width:3px
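As a rough illustration of why one implementation helps across these scenarios, the sketch below fits imputation and scaling parameters once on a training batch and then applies them unchanged to a batch and to a single inference row. `AgeScaler` is a hypothetical helper written in plain Python under the assumption that "mean_imputed" and "standard_scaled" mean the usual mean fill and z-score; it does not show how mloda stores or reuses fitted state.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev
from typing import List, Optional


@dataclass(frozen=True)
class AgeScaler:
    mean: float
    std: float

    @classmethod
    def fit(cls, ages: List[Optional[float]]) -> "AgeScaler":
        """Learn mean and standard deviation from the non-missing values."""
        present = [a for a in ages if a is not None]
        std = pstdev(present)
        # Guard against division by zero when all values are identical.
        return cls(mean=fmean(present), std=std if std > 0 else 1.0)

    def transform(self, ages: List[Optional[float]]) -> List[float]:
        """Mean-impute, then standard-scale, using the fitted parameters."""
        imputed = [self.mean if a is None else a for a in ages]
        return [(a - self.mean) / self.std for a in imputed]


scaler = AgeScaler.fit([20.0, 40.0, None, 60.0])  # training batch
batch = scaler.transform([20.0, None, 60.0])      # batch scoring
single = scaler.transform([60.0])                 # single-row inference
print(batch, single)
```

The single row receives exactly the same value it would get inside the batch, which is the property that avoids train/serve skew.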
Deploy Anywhere Python Runs
Universal Deployment: mloda runs wherever Python runs - no special infrastructure needed.
| Environment | Use Case | Example |
|---|---|---|
| Local Development | Prototyping & testing | Jupyter notebooks, VS Code |
| Any Cloud | Production workloads | AWS, GCP, Azure, DigitalOcean |
| On-Premise | Enterprise & compliance | Air-gapped environments |
| Notebooks | Data science workflows | Jupyter, Colab, Databricks |
| Web APIs | Real-time serving | Flask, FastAPI, Django |
| Orchestration | Batch processing | Airflow, Prefect, Dagster |
| Containers | Microservices | Docker, Kubernetes |
| Serverless | Event-driven | AWS Lambda, Google Functions |
No vendor lock-in. No special runtime. Just Python.
Minimal Dependencies, Maximum Compatibility
PyArrow-Only Core: mloda uses only PyArrow as its core dependency - no other Python modules required.
Why PyArrow? It's the universal language of modern data:
- Interoperability: Native bridge between Pandas, Polars, Spark, DuckDB
- Performance: Zero-copy data sharing between frameworks where their memory layouts allow it
- Standards: Apache Arrow is the foundation of modern data tools
- Future-Proof: Industry standard for columnar data processing
This architectural choice enables mloda's seamless framework switching without dependency conflicts.
Complete Data Processing Capabilities
Beyond Feature Engineering: mloda provides full data processing operations:
| Operation | Purpose | Example Use Case |
|---|---|---|
| Joins | Combine datasets | User profiles + transaction history |
| Merges | Consolidate data sources | Multiple feature tables into one |
| Filters | Data selection & quality | Remove outliers, select time ranges |
| Domain | Data organization & governance | Logical data grouping and access control |
All operations work seamlessly across any compute framework with the same simple API.
Logical Role-Based Data Governance
Clear Role Separation: mloda logically splits data responsibilities into three distinct roles:
| Role | Responsibility | Key Activities |
|---|---|---|
| Data Producer | Create & maintain plugins | Define data access, implement feature groups, ensure quality |
| Data User | Consume features via API | Request features, configure workflows, build ML models |
| Data Owner | Governance & lifecycle | Control access, manage compliance, oversee data quality |
Organizational Clarity: Each role has defined boundaries, enabling proper data governance while maintaining development flexibility. Learn more about roles
Community-Driven Plugin Ecosystem
Share Transformations, Keep Secrets: Unlike traditional pipelines where business logic is embedded, mloda separates transformation patterns from business context.
| Challenge | Traditional Pipelines | mloda Solution |
|---|---|---|
| Knowledge Sharing | Business logic embedded - can't share | Transformations separated - safe to share |
| Reusability | Rebuild common patterns everywhere | Community library of proven patterns |
| Innovation | Everyone reinvents the wheel | Build on collective knowledge |
| Focus | Waste time on solved problems | Focus on unique business value |
Result: A thriving ecosystem where data teams contribute transformation patterns while protecting their competitive advantages.
Documentation
- Getting Started - Installation and first steps
- sklearn Integration - Complete tutorial
- Feature Groups - Core concepts
- Compute Frameworks - Technology integration
- API Reference - Complete API documentation
Contributing
We welcome contributions! Whether you're building plugins, adding features, or improving documentation, your input is invaluable.
- Development Guide - How to contribute
- GitHub Issues - Report bugs or request features
- Email - Direct contact
License
This project is licensed under the Apache License, Version 2.0.