ML Platform for your local machine using cheap cloud services for scalable resources.
mlforge
A simple feature store SDK for machine learning workflows. Build, version, and serve ML features with point-in-time correctness.
Installation
pip install mlforge-sdk
Or with uv:
uv add mlforge-sdk
Quick Start
Define features with the @feature decorator:
import mlforge as mlf
import polars as pl
from datetime import timedelta
@mlf.feature(
    keys=["user_id"],
    source="data/transactions.parquet",
    timestamp="transaction_date",
    interval=timedelta(days=1),
    metrics=[
        mlf.Rolling(
            windows=["7d", "30d"],
            aggregations={"amount": ["sum", "mean", "count"]}
        )
    ],
    validators={
        "amount": [mlf.not_null(), mlf.greater_than(0)],
    },
    description="User spending patterns over rolling windows"
)
def user_spend(df: pl.DataFrame) -> pl.DataFrame:
    return df.select(["user_id", "transaction_date", "amount"])
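To make the rolling metrics more concrete, here is a minimal pure-Python sketch of what a trailing 7-day sum, mean, and count per user could look like. It is not mlforge's implementation: the helper name `rolling_stats`, the tuple layout, and the window boundary (day - 7 days, day] are all assumptions made for illustration.

```python
from datetime import date, timedelta

def rolling_stats(rows, window_days):
    """Hypothetical helper: trailing-window sum/mean/count of amount per user.

    `rows` is a list of (user_id, day, amount) tuples. For each row the window
    covers (day - window_days, day], which is an assumed boundary convention.
    """
    out = []
    for user_id, day, _ in rows:
        start = day - timedelta(days=window_days)
        # Collect this user's amounts that fall inside the trailing window
        amounts = [
            amount
            for other_user, other_day, amount in rows
            if other_user == user_id and start < other_day <= day
        ]
        total = sum(amounts)
        out.append(
            {
                "user_id": user_id,
                "day": day,
                "sum": total,
                "mean": total / len(amounts),  # never empty: includes the row itself
                "count": len(amounts),
            }
        )
    return out

transactions = [
    ("u1", date(2024, 1, 1), 10.0),
    ("u1", date(2024, 1, 5), 20.0),
    ("u1", date(2024, 1, 9), 30.0),
    ("u2", date(2024, 1, 5), 5.0),
]
stats_7d = rolling_stats(transactions, window_days=7)
```

In this sketch the 2024-01-09 row for u1 aggregates only the 5th and the 9th, because the 1st falls outside the 7-day window.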
Register and build features:
import mlforge as mlf
import my_features
defs = mlf.Definitions(
    name="my-project",
    features=[my_features],
    offline_store=mlf.LocalStore("./feature_store")
)
# Build features with automatic versioning
defs.build()
Retrieve features for training with point-in-time correctness:
import mlforge as mlf
training_df = mlf.get_training_data(
    entity_df=labels_df,
    features=["user_spend"],
    store=mlf.LocalStore("./feature_store"),
    timestamp="label_time"
)
Features
- 🎯 Feature Definition: Define features with the @mlf.feature decorator
- 📊 Rolling Aggregations: Compute time-windowed metrics with mlf.Rolling
- ✅ Data Validation: Built-in validators for data quality (not_null, greater_than, etc.)
- 🔢 Semantic Versioning: Automatic version detection and bumping (MAJOR/MINOR/PATCH)
- 💾 Storage Backends: Local filesystem and Amazon S3 support
- ⏰ Point-in-Time Joins: Retrieve training data with temporal correctness
- 📝 Feature Metadata: Automatic tracking of schemas, versions, and change history
- 🔧 CLI Tools: Build, validate, inspect, and sync features from the command line
- 🤝 Git Collaboration: Share feature definitions via Git, sync data locally
CLI Usage
Build Features
Build all features with automatic versioning:
mlforge build
Build specific features:
mlforge build --features user_spend,merchant_spend
Build features by tag:
mlforge build --tags users
Override automatic versioning:
mlforge build --version 2.0.0
Versioning
List all versions of a feature:
mlforge versions user_spend
Inspect a specific version:
mlforge inspect user_spend --version 1.0.0
Validation
Validate features without building:
mlforge validate
Validate specific features:
mlforge validate --features user_spend
Feature Discovery
List registered features:
mlforge list
List features by tag:
mlforge list --tags users
Inspect feature metadata:
mlforge inspect user_spend
Display feature manifest:
mlforge manifest
Team Collaboration
Sync features after pulling metadata from Git:
mlforge sync
Preview what would be synced:
mlforge sync --dry-run
Sync specific features:
mlforge sync --features user_spend
Force sync even if source data changed:
mlforge sync --force
Automatic Versioning
mlforge automatically versions your features using semantic versioning:
- MAJOR (2.0.0): Breaking changes (columns removed, dtype changed)
- MINOR (1.1.0): Additive changes (columns added, config changed)
- PATCH (1.0.1): Data refresh (same schema and config)
# First build creates v1.0.0
defs.build()
# Rebuild with same schema → v1.0.1 (PATCH)
defs.build(force=True)
# Add a column → v1.1.0 (MINOR)
# Remove a column → v2.0.0 (MAJOR)
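The bump rules above can be sketched as a small decision function. This is an illustration under stated assumptions, not mlforge's code: `next_version` is a hypothetical helper, schemas are modelled as plain dicts of column name to dtype string, and a config change is passed in as a boolean flag.

```python
def next_version(current, old_schema, new_schema, config_changed=False):
    """Hypothetical helper: choose the next semantic version from a schema diff.

    Schemas are dicts mapping column name -> dtype string.
    """
    major, minor, patch = (int(part) for part in current.split("."))
    removed = old_schema.keys() - new_schema.keys()
    added = new_schema.keys() - old_schema.keys()
    retyped = {
        col
        for col in old_schema.keys() & new_schema.keys()
        if old_schema[col] != new_schema[col]
    }
    if removed or retyped:  # breaking change -> MAJOR
        return f"{major + 1}.0.0"
    if added or config_changed:  # additive change -> MINOR
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # data refresh -> PATCH

base = {"user_id": "str", "amount": "f64"}
patch_bump = next_version("1.0.0", base, base)
minor_bump = next_version("1.0.1", base, {**base, "country": "str"})
major_bump = next_version("1.1.0", {**base, "country": "str"}, {"user_id": "str"})
```

Checking for breaking changes first means a build that both adds and removes columns is treated as MAJOR, which seems the safest reading of the rules.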
Features are stored in versioned directories:
feature_store/
├── user_spend/
│ ├── 1.0.0/
│ │ ├── data.parquet
│ │ └── .meta.json
│ ├── 1.0.1/
│ │ └── ...
│ ├── _latest.json
│ └── .gitignore
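Given this layout, the newest version of a feature can be found by comparing directory names numerically. The sketch below is only an illustration: `latest_version_dir` is a hypothetical helper, and it deliberately ignores _latest.json because that file's contents are not described here (mlforge presumably reads it instead).

```python
import tempfile
from pathlib import Path

def latest_version_dir(feature_dir):
    """Hypothetical helper: return the highest MAJOR.MINOR.PATCH subdirectory."""
    versions = []
    for child in Path(feature_dir).iterdir():
        parts = child.name.split(".")
        # Skip files and non-version entries such as _latest.json or .gitignore
        if child.is_dir() and len(parts) == 3 and all(p.isdigit() for p in parts):
            versions.append((tuple(int(p) for p in parts), child))
    if not versions:
        raise FileNotFoundError(f"no versioned directories in {feature_dir}")
    # Compare as integer tuples so that 1.10.0 sorts above 1.2.0
    return max(versions, key=lambda item: item[0])[1]

with tempfile.TemporaryDirectory() as root:
    feature_dir = Path(root) / "user_spend"
    for version in ["1.0.0", "1.0.1", "1.10.0", "1.2.0"]:
        (feature_dir / version).mkdir(parents=True)
    (feature_dir / "_latest.json").write_text("{}")
    latest_name = latest_version_dir(feature_dir).name
```

Comparing integer tuples rather than strings matters here: a plain string sort would rank 1.2.0 above 1.10.0.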
Git Collaboration
mlforge enables teams to share feature definitions via Git:
- Metadata is committed: .meta.json and _latest.json files
- Data is ignored: Auto-generated .gitignore excludes data.parquet
- Teammates sync locally: Run mlforge sync to rebuild data
# Developer 1: Build and commit metadata
mlforge build --features user_spend
git add feature_store/user_spend/
git commit -m "feat: add user_spend feature"
git push
# Developer 2: Pull and sync
git pull
mlforge sync # Rebuilds data.parquet from metadata
Validators
Built-in validators for data quality:
import mlforge as mlf
@mlf.feature(
    keys=["id"],
    source="data.parquet",
    validators={
        "email": [mlf.not_null(), mlf.matches_regex(r"^[\w.-]+@[\w.-]+\.\w+$")],
        "age": [mlf.not_null(), mlf.in_range(0, 120)],
        "status": [mlf.is_in(["active", "inactive"])],
        "score": [mlf.greater_than_or_equal(0), mlf.less_than_or_equal(100)],
    }
)
def validated_feature(df):
    return df.select(["id", "email", "age", "status", "score"])
Available validators:
- not_null() - No null values
- unique() - All values unique
- greater_than(value) - All values > threshold
- less_than(value) - All values < threshold
- greater_than_or_equal(value) - All values ≥ threshold
- less_than_or_equal(value) - All values ≤ threshold
- in_range(min, max) - All values within range
- matches_regex(pattern) - All values match regex
- is_in(values) - All values in allowed set
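One way to picture how such validators compose is as factory functions that each return a check over a column. The sketch below is a stand-in written in plain Python, not mlforge's implementation: the local `not_null` and `in_range` only mimic the names above, the inclusive range bounds are an assumption, and `run_validators` is a hypothetical driver.

```python
def not_null():
    """Hypothetical validator: passes when no value is None."""
    def check(values):
        return all(v is not None for v in values)
    return check

def in_range(low, high):
    """Hypothetical validator: passes when every non-null value is in [low, high]."""
    def check(values):
        return all(low <= v <= high for v in values if v is not None)
    return check

def run_validators(columns, validators):
    """Return the (column, validator_index) pairs whose check failed."""
    failures = []
    for column, checks in validators.items():
        for index, check in enumerate(checks):
            if not check(columns[column]):
                failures.append((column, index))
    return failures

data = {"age": [34, None, 150], "score": [10, 55, 100]}
rules = {"age": [not_null(), in_range(0, 120)], "score": [in_range(0, 100)]}
failures = run_validators(data, rules)
```

Here both checks on age fail (one null, one value above 120) while score passes, so `failures` lists two entries for the age column.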
Storage Backends
Local Storage
import mlforge as mlf
store = mlf.LocalStore("./feature_store")
S3 Storage
import mlforge as mlf
store = mlf.S3Store(
    bucket="my-features",
    prefix="prod/features",
    region="us-west-2"
)
S3 credentials are resolved via the standard AWS credential chain (environment variables, ~/.aws/credentials, or IAM roles).
Entity Keys
Create reusable entity key transformations:
import mlforge as mlf
# Create surrogate key from multiple columns
with_user_id = mlf.entity_key("first_name", "last_name", "dob", alias="user_id")
@mlf.feature(
    keys=["user_id"],
    source="data/transactions.parquet"
)
def user_feature(df):
    return df.pipe(with_user_id).select(["user_id", "amount"])
Generate surrogate keys directly:
import polars as pl
import mlforge as mlf
df = pl.DataFrame({
    "first": ["Alice", "Bob"],
    "last": ["Smith", "Jones"],
})
df = mlf.surrogate_key(df, ["first", "last"], alias="user_id")
# Adds column: user_id = hash("Alice:Smith"), hash("Bob:Jones")
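The idea behind a surrogate key can be sketched in a few lines of standard-library Python. This is a rough illustration rather than what mlforge does internally: the ":" separator follows the comment above, while the use of SHA-256 and the helper name `make_key` are assumptions.

```python
import hashlib

def make_key(row, columns, separator=":"):
    """Hypothetical helper: hash the joined column values into one stable key."""
    joined = separator.join(str(row[col]) for col in columns)
    # SHA-256 is an assumed choice; any stable hash would serve the same purpose
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

rows = [
    {"first": "Alice", "last": "Smith"},
    {"first": "Bob", "last": "Jones"},
]
keys = [make_key(row, ["first", "last"]) for row in rows]
```

Because the hash depends only on the column values, the same person maps to the same key across builds and across machines.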
Point-in-Time Correctness
Retrieve training data with temporal correctness to prevent label leakage:
import mlforge as mlf
import polars as pl
# Labels with timestamps
labels_df = pl.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "label_time": ["2024-01-15", "2024-01-16", "2024-01-17"],
    "label": [1, 0, 1],
})
# Get features as they existed at label_time
training_df = mlf.get_training_data(
    entity_df=labels_df,
    features=["user_spend"],
    store=mlf.LocalStore("./feature_store"),
    timestamp="label_time"
)
This ensures that the features joined to a label at 2024-01-15 use only data available before that date, so no future information leaks into the training data.
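The lookup behind a point-in-time join can be sketched with a binary search over each entity's feature history. This is a simplified illustration, not mlforge's implementation: `as_of` is a hypothetical helper, and it selects the latest feature row strictly before the label time, following the wording above. Whether mlforge treats the boundary as inclusive is not specified here.

```python
from bisect import bisect_left
from datetime import date

def as_of(feature_rows, label_time):
    """Hypothetical helper: latest feature value with timestamp < label_time.

    `feature_rows` is a list of (timestamp, value) sorted by timestamp.
    Returns None when no feature row exists before label_time.
    """
    timestamps = [ts for ts, _ in feature_rows]
    # bisect_left counts the rows whose timestamp is strictly earlier
    position = bisect_left(timestamps, label_time)
    return feature_rows[position - 1][1] if position else None

# Feature history per user: (feature timestamp, 7-day spend)
history = {
    "u1": [
        (date(2024, 1, 10), 50.0),
        (date(2024, 1, 14), 80.0),
        (date(2024, 1, 20), 200.0),
    ],
    "u2": [(date(2024, 1, 18), 15.0)],
}
labels = [("u1", date(2024, 1, 15)), ("u2", date(2024, 1, 16))]
training_rows = [(user, ts, as_of(history[user], ts)) for user, ts in labels]
```

In this sketch u1 gets the value from 2024-01-14 and never the later 2024-01-20 row, while u2 gets no feature value because its only row is dated after the label.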
Documentation
Full documentation is available at https://chonalchendo.github.io/mlforge
Requirements
- Python ≥ 3.13
- Polars ≥ 1.35.2
Contributing
Contributions are welcome! Please see the repository for development setup and guidelines.
License
MIT License - see LICENSE for details.