Rethinking Data and Feature Engineering
mloda: Make data and feature engineering shareable
⚠️ Early Version Notice: mloda is in active development. Some features described below are still being implemented. We're actively seeking feedback to shape the future of the framework. Share your thoughts!
🍳 Think of mloda Like Cooking Recipes
Traditional Data Pipelines = Making everything from scratch
- Want pasta? Make noodles, sauce, cheese from raw ingredients
- Want pizza? Start over - make dough, sauce, cheese again
- Want lasagna? Repeat everything once more
- Can't share recipes easily - they're mixed with your kitchen setup
mloda = Using recipe components
- Create reusable recipes: "tomato sauce", "pasta dough", "cheese blend"
- Use same "tomato sauce" for pasta, pizza, lasagna
- Switch kitchens (home → restaurant → food truck) - same recipes work
- Share your "tomato sauce" recipe with friends - they don't need your whole kitchen
Result: Instead of rebuilding the same thing 10 times, build once and reuse everywhere!
Installation
pip install mloda
1. The Core API Call - Your Starting Point
The One Command That Does Everything
# This is the heart of mloda. You describe what you want and mloda resolves the dependencies.
from mloda_core.api.request import mlodaAPI
result = mlodaAPI.run_all(
    features=["age", "standard_scaled__weight"]
)
# That's it! You get processed data back
data = result[0]
print(data.head())
What just happened?
- mloda found the data for age and weight automatically
- Applied the requested transformation (standard scaling of weight)
- Returned a clean, ready-to-use DataFrame
Key Insight: As long as the plugins and data accesses exist, mloda can derive any feature automatically.
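The chained name "standard_scaled__weight" carries both a transformation and the column it depends on. The sketch below is an illustration only, not mloda's resolver: it assumes a double underscore separates a transformation prefix from its source feature, and the helper names `split_chained_feature` and `required_sources` are hypothetical.

```python
from typing import List, Tuple

# Assumption: "__" separates a transformation prefix from its source feature.
SEPARATOR = "__"


def split_chained_feature(name: str) -> Tuple[List[str], str]:
    """Split a chained feature name into (transformations, source feature).

    Transformations are returned in application order, innermost first.
    """
    parts = name.split(SEPARATOR)
    source = parts[-1]
    # The prefix closest to the source is applied first, so reverse the prefixes.
    transformations = list(reversed(parts[:-1]))
    return transformations, source


def required_sources(features: List[str]) -> List[str]:
    """Return the distinct source columns a request depends on, in first-seen order."""
    seen: List[str] = []
    for name in features:
        _, source = split_chained_feature(name)
        if source not in seen:
            seen.append(source)
    return seen


if __name__ == "__main__":
    print(split_chained_feature("standard_scaled__weight"))
    print(required_sources(["age", "standard_scaled__weight"]))
```

Read this way, the request for "age" and "standard_scaled__weight" only needs two raw columns, age and weight, which is why a single data source can satisfy it.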
2. Setting Up Your Data
Using DataCreator - The mloda Way
# DataCreator: Perfect for testing, demos, and prototyping
# Use this when you need synthetic data or want to test mloda without external files
from mloda_core.abstract_plugins.components.input_data.creator.data_creator import DataCreator
from mloda_core.abstract_plugins.abstract_feature_group import AbstractFeatureGroup
class SampleDataFeature(AbstractFeatureGroup):
    @classmethod
    def input_data(cls):
        # Define what columns your data will have
        return DataCreator({
            "age", "weight", "state", "income", "target"
        })

    @classmethod
    def calculate_feature(cls, data, features):
        # Generate sample data that matches your DataCreator specification
        # This is where you'd normally load from files, databases, or APIs
        return {
            'age': [25, 30, 35, None, 45, 28, 33],
            'weight': [150, 180, None, 200, 165, 140, 175],
            'state': ['CA', 'NY', 'TX', 'CA', 'FL', 'NY', 'TX'],
            'income': [50000, 75000, 85000, 60000, None, 45000, 70000],
            'target': [1, 0, 1, 0, 1, 0, 1]
        }
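The dictionary returned by calculate_feature is column-oriented, so it is easy to return columns of different lengths or a name that was never declared. The check below is a minimal sketch in plain Python, independent of mloda; `check_columns` and `missing_counts` are hypothetical helpers, not part of the library.

```python
from typing import Dict, Set


def check_columns(declared: Set[str], data: Dict[str, list]) -> int:
    """Return the row count if data has exactly the declared columns, all of equal length."""
    provided = set(data)
    if provided != declared:
        raise ValueError(
            f"missing: {sorted(declared - provided)}, "
            f"unexpected: {sorted(provided - declared)}"
        )
    lengths = {len(column) for column in data.values()}
    if len(lengths) != 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return lengths.pop()


def missing_counts(data: Dict[str, list]) -> Dict[str, int]:
    """Count None entries per column."""
    return {name: sum(v is None for v in column) for name, column in data.items()}


declared = {"age", "weight", "state", "income", "target"}
sample = {
    "age": [25, 30, 35, None, 45, 28, 33],
    "weight": [150, 180, None, 200, 165, 140, 175],
    "state": ["CA", "NY", "TX", "CA", "FL", "NY", "TX"],
    "income": [50000, 75000, 85000, 60000, None, 45000, 70000],
    "target": [1, 0, 1, 0, 1, 0, 1],
}

if __name__ == "__main__":
    print(check_columns(declared, sample))
    print(missing_counts(sample))
```

Running a check like this in a test keeps the declared column set and the generated data from drifting apart.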
When to Use DataCreator vs Other Data Access Methods:
- DataCreator: For testing, demos, synthetic data, or when you want to generate data programmatically within mloda
- File Access (DataAccessCollection with files): When your data lives in CSV, JSON, Parquet, etc.
- Database Access (DataAccessCollection with credentials): When connecting to SQL databases or data warehouses
- API Access: When fetching data from REST APIs or other web services
Key Insight: DataCreator is mloda's built-in data generation tool - perfect for getting started without external dependencies. Once you're ready for production, switch to file or database access methods.
Quick Start with Your Own Data:
# Replace DataCreator with real data access
from mloda_core.abstract_plugins.components.data_access_collection import DataAccessCollection
# For files
data_access = DataAccessCollection(files={"your_data.csv"})
# For databases
data_access = DataAccessCollection(
    credential_dicts=[{"host": "your-db.com", "username": "user"}]
)
3. Understanding What You Get Back
The Result Structure
from mloda_core.api.request import mlodaAPI
from mloda_plugins.compute_framework.base_implementations.pandas.dataframe import PandasDataframe
features = ["age", "standard_scaled__weight"]  # same request as in section 1
result = mlodaAPI.run_all(features, compute_frameworks={PandasDataframe})
# result is always a LIST of result objects
data_list = result
# Each object matches your compute framework type: pd.DataFrame, spark.DataFrame, etc.
# Access your processed data
data = result[0] # Most common case: single result
print(f"Shape: {data.shape}, Columns: {list(data.columns)}")
Key Insight: mloda returns a list of results. Most simple cases return a single DataFrame that you access with result[0].
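Because the return value is a list, indexing with [0] silently ignores any further entries. A small guard makes the single-result assumption explicit. This is a sketch in plain Python; `only_result` is a hypothetical helper and not part of mloda.

```python
from typing import List, TypeVar

T = TypeVar("T")


def only_result(results: List[T]) -> T:
    """Return the single element of results, or fail loudly if there is not exactly one."""
    if len(results) != 1:
        raise ValueError(f"expected exactly one result, got {len(results)}")
    return results[0]


if __name__ == "__main__":
    # Stand-in for the list returned by a run; real entries would be DataFrames.
    fake_results = [{"age": [25, 30]}]
    print(only_result(fake_results))
```

In a script this would read `data = only_result(result)` in place of `data = result[0]`.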
4. The Features Parameter
Feature Object Syntax
from mloda_core.abstract_plugins.components.feature import Feature
from mloda_core.abstract_plugins.components.options import Options
from mloda_core.abstract_plugins.plugin_loader.plugin_loader import PluginLoader
# Load all available plugins (required before using features)
PluginLoader.all()
features = [
    "age",  # Simple string
    Feature(
        "weight_replaced",
        options=Options(
            group={
                "imputation_method": "mean",
                "mloda_source_feature": "weight",
            }
        ),
    ),
    "onehot_encoded__state"  # Chaining syntax
]
Three Ways to Define Features:
- Simple strings: For basic columns like "age"
- Feature objects: For explicit configuration and advanced options
- Chaining syntax: Convenient shorthand for transformations
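The Feature object above asks for "weight_replaced" with imputation_method "mean" and source feature "weight". As a worked example of the arithmetic this implies, the sketch below fills missing weights with the mean of the present values. It is plain Python under that assumption, not mloda's plugin code, and `mean_impute` is a hypothetical name.

```python
from statistics import fmean
from typing import List, Optional


def mean_impute(values: List[Optional[float]]) -> List[float]:
    """Replace None entries with the mean of the non-missing entries."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute a column with no observed values")
    fill = fmean(present)
    return [fill if v is None else float(v) for v in values]


# Same weight column as the DataCreator sample data in section 2.
weights = [150, 180, None, 200, 165, 140, 175]
weight_replaced = mean_impute(weights)

if __name__ == "__main__":
    print(weight_replaced)
```

The six observed weights sum to 1010, so the single missing entry is filled with 1010 / 6, roughly 168.33.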
5. Compute Frameworks
Choose Your Processing Engine
# Different processing engines
features = [
    Feature("age", compute_framework=PandasDataframe.get_class_name()),
    Feature("weight", compute_framework=PolarsDataframe.get_class_name()),
]
# Mixed request: each feature is pinned to its own compute framework
result = mlodaAPI.run_all(features)
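To see how a mixed request partitions, it can help to group the requested features by the framework they name. The sketch below is plain Python and says nothing about how mloda schedules work internally; `group_by_framework` is hypothetical, and the framework strings assume that get_class_name() returns the class name.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def group_by_framework(
    requests: List[Tuple[str, Optional[str]]], default: str
) -> Dict[str, List[str]]:
    """Group feature names by framework name; unpinned features fall back to default."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name, framework in requests:
        grouped[framework or default].append(name)
    return dict(grouped)


# Assumption: these strings mirror what get_class_name() would return.
requests = [
    ("age", "PandasDataframe"),
    ("weight", "PolarsDataframe"),
    ("state", None),  # no framework pinned
]

if __name__ == "__main__":
    print(group_by_framework(requests, default="PandasDataframe"))
```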
6. Data Access
Tell mloda Where Your Data Lives
from mloda_core.abstract_plugins.components.data_access_collection import DataAccessCollection
# Configure data sources
data_access = DataAccessCollection(
    files={"data/customers.csv"},  # Specific files
    folders={"data/archive/"},  # Entire directories
    credential_dicts=[{"host": "db.example.com"}]  # Database credentials
)
result = mlodaAPI.run_all(
    features=["age", "standard_scaled__income"],
    compute_frameworks={PandasDataframe},
    data_access_collection=data_access
)
Key Insight: Configure data access once globally, and all features can use it automatically.
7. Putting It All Together
Real-World Feature Engineering Pipeline
# Complete mlodaAPI call
result = mlodaAPI.run_all(
    # What you want
    features=[
        "standard_scaled__age",
        "onehot_encoded__state",
        "mean_imputed__income",
        "target"  # Label column, needed for the train/test split below
    ],
    # How to process it
    compute_frameworks={PandasDataframe},
    # Where to get it
    data_access_collection=DataAccessCollection(files={"data/customers.csv"})
)
# Get your results
processed_data = result[0]
print(f"✅ Created {len(processed_data.columns)} features from {len(processed_data)} rows")
# Use in your ML pipeline
from sklearn.model_selection import train_test_split
X = processed_data.drop('target', axis=1)
y = processed_data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
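For intuition about what a feature such as "standard_scaled__age" contains, here is a worked sketch of z-score scaling. It assumes the usual definition with the population standard deviation (the convention scikit-learn's StandardScaler uses) and assumes missing values are left as they are; mloda's plugin may handle both differently. `standard_scale` is a hypothetical helper.

```python
from statistics import fmean, pstdev
from typing import List, Optional


def standard_scale(values: List[Optional[float]]) -> List[Optional[float]]:
    """Z-score the non-missing values; missing values (None) stay None."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to scale
    mean = fmean(present)
    std = pstdev(present)  # population standard deviation (ddof=0)
    if std == 0:
        # A constant column has no spread; map every present value to 0.0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mean) / std for v in values]


# Same age column as the DataCreator sample data in section 2.
ages = [25, 30, 35, None, 45, 28, 33]
scaled_ages = standard_scale(ages)

if __name__ == "__main__":
    print(scaled_ages)
```

After scaling, the observed values have mean 0 and standard deviation 1, which puts columns with different units on a comparable scale before model training.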
🎉 You now understand mloda's core workflow!
📖 Documentation
- Getting Started - Installation and first steps
- sklearn Integration - Complete tutorial
- Feature Groups - Core concepts
- Compute Frameworks - Technology integration
- API Reference - Complete API documentation
🤝 Contributing
We welcome contributions! Whether you're building plugins, adding features, or improving documentation, your input is invaluable.
- Development Guide - How to contribute
- GitHub Issues - Report bugs or request features
- Email - Direct contact
📄 License
This project is licensed under the Apache License, Version 2.0.