# Dross

ML pipeline framework: medallion architecture + MLflow + Unity Catalog

> The byproduct of refinement becomes the foundation of insight.
Dross is a reusable ML pipeline framework for Kaggle and data science projects. It provides:

- **Medallion Architecture**: Bronze (ingestion) → Silver (cleaning) → Gold (preparation)
- **MLflow Integration**: experiment tracking and model versioning
- **Unity Catalog**: centralized data governance and catalog management
- **Extensible Models**: a `BaseModel` ABC for custom implementations
- **Feature Extraction**: TF-IDF vectorization utilities
## Quick Start

### Installation

```shell
pip install dross
```
### Basic Usage

```python
from dross.data import MedallionPipeline
from dross.tracking import ExperimentTracker, UCClient
from dross.models import BaseModel, get_model
from dross.utilities import TfidfVectorizer
from kef import cfg

# Set up the medallion pipeline
uc_client = UCClient(server="http://localhost:8080")
pipeline = MedallionPipeline(cfg.unity_catalog, uc_client, storage_base)

# Ingest data to Bronze, clean to Silver, prepare for training in Gold
# (the pipeline methods are coroutines, so call them from an async context)
await pipeline.ingest(raw_csv_path, columns=schema_def)
await pipeline.clean(source_table, target_table, transform_func)
await pipeline.prepare(source_table, target_table)

# Track experiments
tracker = ExperimentTracker(experiment_name="my-experiment")
tracker.start_run(run_name="run-1", tags={"model": "logistic_regression"})
tracker.log_metrics({"accuracy": 0.95})
tracker.end_run()
```
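The `transform_func` passed to `clean` is project-defined, and its exact signature is not documented here. As a hypothetical sketch, assuming it maps rows-as-dicts to cleaned rows, a Silver-step transform might look like:

```python
# Hypothetical Silver-step transform: normalize a "text" column and drop
# empty records. The signature dross actually expects is an assumption.
def transform_func(rows):
    cleaned = []
    for row in rows:
        text = (row.get("text") or "").strip().lower()
        if not text:
            continue  # drop records with no usable text
        cleaned.append({**row, "text": text})
    return cleaned
```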
## Architecture

### Medallion Layers

```text
Raw source
    ↓ [ingest]
Bronze Layer (Raw)
    ↓ [clean]
Silver Layer (Cleaned)
    ↓ [prepare]
Gold Layer (Prepared)
    ↓ [train/analyze]
```
## Configuration

Dross expects configuration via `kef.yaml`:

```yaml
unity_catalog:
  project_name: my_project
  schema:
    bronze: my_project_bronze
    silver: my_project_silver
    gold: my_project_gold
  storage_base: file:///data/kaggle/my_project
mlflow:
  experiment_name: my-experiment
  tracking_uri: http://localhost:5000
tables:
  bronze:
    raw: raw
  silver:
    cleaned: cleaned
  gold:
    dataset: dataset
```
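Once loaded (kef's loader is assumed to produce a nested mapping mirroring the YAML above), the required keys can be checked with a minimal sketch like this — `validate_config` is a hypothetical helper, not part of dross:

```python
# Hypothetical sanity check that a loaded kef.yaml carries the keys the
# sections above describe. The config here is a plain nested dict.
REQUIRED = {
    "unity_catalog": ["project_name", "schema", "storage_base"],
    "mlflow": ["experiment_name", "tracking_uri"],
}

def validate_config(cfg: dict) -> list:
    """Return the missing dotted keys (empty list if the config is complete)."""
    missing = []
    for section, keys in REQUIRED.items():
        block = cfg.get(section, {})
        missing += [f"{section}.{key}" for key in keys if key not in block]
    return missing
```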
## CLI

Dross includes minimal CLI utilities:

```shell
# Validate configuration
dross config validate

# Show schema expectations
dross schema

# Version info
dross --version
```
## Project Integration

### Step 1: Add the Dross Dependency

```toml
# pyproject.toml
dependencies = [
    "dross",
    "kef",
    ...
]
```
### Step 2: Implement Project Models

```python
# src/my_project/models/custom_model.py
from dross.models import BaseModel
from sklearn.ensemble import GradientBoostingClassifier

class GradBoostModel(BaseModel):
    def build(self):
        self.model = GradientBoostingClassifier(
            n_estimators=self.config.get("n_estimators", 100)
        )
```
### Step 3: Use in the Data Pipeline

```python
# src/my_project/data/ingest.py
from dross.data import MedallionPipeline
from dross.tracking import UCClient
from kef import cfg

async def run_ingest():
    uc = UCClient(server=cfg.unity_catalog.get("server"))
    pipeline = MedallionPipeline(cfg.unity_catalog, uc, cfg.paths.storage)
    await pipeline.ingest(source_csv, columns=schema_def)
```
### Step 4: Define Make Targets

```makefile
# Makefile - data pipeline targets
data.ingest:
	uv run python -m my_project.data.ingest

data.clean:
	uv run python -m my_project.data.clean

data.prepare:
	uv run python -m my_project.data.prepare

data.pipeline: data.ingest data.clean data.prepare

train:
	uv run python -m my_project.train

eval:
	uv run python -m my_project.evaluate
```
## Components

### `dross.data.MedallionPipeline`

Orchestrates the medallion layers with DuckDB and Unity Catalog.

Methods:

- `ingest(source_file, catalog, schema_name, table_name, columns)` - Bronze ingestion
- `clean(source_table, target_table, transform_func, ...)` - Silver transformation
- `prepare(source_table, target_table, ...)` - Gold preparation
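The three methods run in sequence, one per layer. A sketch of that orchestration order with stub coroutines (the real methods write via DuckDB and register tables in Unity Catalog; the stub bodies here only record calls):

```python
# Stub pipeline illustrating the Bronze → Silver → Gold call order.
import asyncio

class StubPipeline:
    def __init__(self):
        self.log = []

    async def ingest(self, source_file):
        self.log.append(("bronze", source_file))

    async def clean(self, source_table, target_table):
        self.log.append(("silver", source_table, target_table))

    async def prepare(self, source_table, target_table):
        self.log.append(("gold", source_table, target_table))

async def run_layers(pipeline):
    await pipeline.ingest("raw.csv")
    await pipeline.clean("bronze.raw", "silver.cleaned")
    await pipeline.prepare("silver.cleaned", "gold.dataset")

p = StubPipeline()
asyncio.run(run_layers(p))
```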
### `dross.tracking.ExperimentTracker`

MLflow experiment tracking wrapper.

Methods:

- `start_run(run_name, tags)` - start a tracking run
- `log_params(params)` - log hyperparameters
- `log_metrics(metrics, step)` - log metrics
- `log_model(model, name)` - log a model artifact
- `log_artifact(local_path, artifact_path)` - log a file artifact
- `end_run(status)` - end the run
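To show the wrapper interface without an MLflow server, here is an in-memory stand-in with the same method shapes (hypothetical; the real class delegates each call to MLflow):

```python
# In-memory stand-in for the ExperimentTracker interface listed above.
class InMemoryTracker:
    def __init__(self, experiment_name):
        self.experiment_name = experiment_name
        self.runs = []

    def start_run(self, run_name, tags=None):
        self.runs.append({"name": run_name, "tags": tags or {},
                          "params": {}, "metrics": {}, "status": None})

    def log_params(self, params):
        self.runs[-1]["params"].update(params)

    def log_metrics(self, metrics, step=None):
        self.runs[-1]["metrics"].update(metrics)

    def end_run(self, status="FINISHED"):
        self.runs[-1]["status"] = status
```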
### `dross.tracking.UCClient`

Unity Catalog client wrapper.

Methods:

- `catalog_create(name)` - create a catalog
- `schema_create(catalog, name)` - create a schema
- `table_create(catalog, schema, name, columns, storage_location)` - create a table
### `dross.models.BaseModel`

Abstract base for model implementations.

Methods:

- `build()` - initialize the model instance
- `fit(X, y)` - train the model
- `predict(X)` - get predictions
- `predict_proba(X)` - get probability predictions
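The contract implied by this method list can be sketched as an ABC where subclasses implement `build()` and the base delegates `fit`/`predict` to the wrapped estimator. All signatures here are assumptions from the list above, and the toy estimator stands in for a scikit-learn model:

```python
# Sketch of the BaseModel contract (names and signatures are assumptions).
from abc import ABC, abstractmethod

class SketchBaseModel(ABC):
    def __init__(self, config=None):
        self.config = config or {}
        self.model = None

    @abstractmethod
    def build(self):
        """Create self.model from self.config."""

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)

class _MajorityEstimator:
    """Toy estimator: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label = max(set(y), key=list(y).count)

    def predict(self, X):
        return [self.label] * len(X)

class MajorityModel(SketchBaseModel):
    def build(self):
        self.model = _MajorityEstimator()
```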
### `dross.utilities.TfidfVectorizer`

Scikit-learn TF-IDF wrapper with sensible defaults.

```python
from dross.utilities import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000, lowercase=True)
X = vectorizer.fit_transform(texts)
```
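For intuition, the textbook TF-IDF formula can be computed in a few lines of stdlib Python (scikit-learn's variant adds smoothing and L2 normalization, so its numbers differ):

```python
# Textbook TF-IDF: weight = (term frequency) * log(n_docs / doc frequency).
import math

def tfidf(docs):
    """docs: list of token lists -> list of {token: tf-idf weight} dicts."""
    n = len(docs)
    df = {}  # document frequency per token
    for doc in docs:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    out = []
    for doc in docs:
        weights = {}
        for tok in set(doc):
            tf = doc.count(tok) / len(doc)
            idf = math.log(n / df[tok])
            weights[tok] = tf * idf
        out.append(weights)
    return out
```

A token appearing in every document gets weight 0, which is the point: TF-IDF downweights terms that carry no discriminative signal.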
## Development

```shell
# Setup
make sync

# Format & lint
make fmt
make lint

# Type check
make typecheck

# Run tests
make test

# Full QA
make qa

# Build
make build
```
## License

MIT - see the LICENSE file.