Refiner by Macrodata Labs, a data processing framework for large-scale machine learning datasets
Macrodata Refiner
Refiner is an open-source engine for turning raw, unstructured, and multimodal data into high-quality datasets for large model training.
It replaces the brittle scripts and stitched-together data tooling that teams still use for training data work, while offering much better support for multimodal data, robotics workflows, and model-based processing.
It also plugs into the Macrodata platform, which gives you visibility into what is happening to your data while pipelines run: job and shard lifecycle, logs, metrics, manifests, and pipeline behavior. The same code can run locally for development and then scale out through Macrodata's elastic serverless cloud.
Quickstart
Install:
pip install macrodata-refiner
Create a Macrodata API key:
Log in:
macrodata login
Cloud example
Launch a robotics pipeline on Macrodata Cloud.
This requires a valid API key.
import refiner as mdr

(
    mdr.read_lerobot("hf://datasets/macrodata/aloha_static_battery_ep005_009")
    .map(
        mdr.robotics.motion_trim(
            threshold=0.001,
            pad_frames=5,
        )
    )
    .write_lerobot("hf://buckets/macrodata/test_bucket/aloha_motion")
    .launch_cloud(
        name="motion_trim",
        num_workers=4,
    )
)
Need cloud GPUs? See Launchers for the GPU-specific cloud options.
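The `threshold` and `pad_frames` arguments suggest that the trim drops motionless frames from the start and end of an episode while keeping a small margin around the motion. The sketch below illustrates that general idea in plain Python. It is an assumption about the technique, not Refiner's actual implementation: `trim_static_ends` is a hypothetical helper, and the frame layout (a list of joint-position lists) is invented for the example.

```python
def trim_static_ends(frames, threshold=0.001, pad_frames=5):
    """Drop motionless frames from both ends of an episode, keeping some padding.

    Hypothetical illustration only; not part of the refiner package.
    """
    # Index i counts as "moving" when any joint changed by more than
    # `threshold` between frame i-1 and frame i.
    moving = [
        i
        for i in range(1, len(frames))
        if max(abs(a - b) for a, b in zip(frames[i], frames[i - 1])) > threshold
    ]
    if not moving:
        # Assumption: an episode with no motion is dropped entirely.
        return []
    # Keep `pad_frames` extra frames on each side, clamped to the episode bounds.
    start = max(moving[0] - pad_frames, 0)
    end = min(moving[-1] + pad_frames, len(frames) - 1)
    return frames[start : end + 1]


# 10 static frames, 5 frames of motion, then 10 static frames.
episode = [[0.0]] * 10 + [[k / 10] for k in range(1, 6)] + [[0.5]] * 10
trimmed = trim_static_ends(episode, threshold=0.001, pad_frames=2)
print(len(episode), len(trimmed))  # 25 frames in, 9 frames out
```

A larger `pad_frames` keeps more context around the motion, while a larger `threshold` treats small jitters as stillness.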
Local example
Launch a local pipeline:
import refiner as mdr


def add_preview(row):
    return row.update(
        preview=" ".join(row["text"].split()[:20]),
    )


(
    mdr.read_jsonl("input/*.jsonl")
    .filter(mdr.col("lang") == "en")
    .with_columns(
        text=mdr.col("text").str.strip(),
        text_len=mdr.col("text").str.len(),
    )
    .map(add_preview)
    .write_parquet("s3://my-bucket/english-cleanup/")
    .launch_local(
        name="english-cleanup",
        num_workers=2,
    )
)
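To see what this pipeline does to a single record, the same steps can be traced on a plain Python dict. This is only an illustrative sketch: the dict stands in for a Refiner row, `clean_record` is a hypothetical helper that is not part of the library, and it assumes `text_len` is measured on the stripped text, which Refiner may or may not do.

```python
def clean_record(record, preview_words=20):
    """Mimic the filter / with_columns / map steps on one plain dict.

    Hypothetical stand-in for the pipeline stages; not Refiner's API.
    """
    # filter: keep only English records.
    if record.get("lang") != "en":
        return None
    # with_columns: strip surrounding whitespace and record the length.
    text = record["text"].strip()
    return {
        **record,
        "text": text,
        "text_len": len(text),  # assumption: length of the stripped text
        # map(add_preview): first `preview_words` whitespace-separated tokens.
        "preview": " ".join(text.split()[:preview_words]),
    }


sample = {"lang": "en", "text": "  hello   world  "}
print(clean_record(sample))
```

Note that the preview collapses runs of whitespace, because `split()` with no arguments splits on any whitespace and `" ".join` reassembles the tokens with single spaces.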
pip install gives you:
- the Python package as refiner
- the CLI as macrodata
Batteries included
- training-data-first pipeline primitives instead of generic ETL abstractions
- multimodal processing, with robotics support today
- many built-in readers, transforms, sinks, and lifecycle/runtime machinery, so you do not have to rebuild the same scaffolding in scripts
- access to any storage backend supported by fsspec (S3, GCP, Hugging Face, etc.)
- local execution for development and elastic cloud execution for large runs
- built-in observability through the Macrodata platform, so you can inspect how your data is changing instead of debugging blindly after the fact
Docs
Getting started:
Core concepts:
Modalities and platform:
Community
- join the Macrodata Discord: https://discord.gg/S8kZtmBR2x
File details
Details for the file macrodata_refiner-0.2.2.tar.gz.
File metadata
- Download URL: macrodata_refiner-0.2.2.tar.gz
- Upload date:
- Size: 128.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab338bb7ee1d93b0e03cd912af0577060232f2d68732abeb0466f99214d7a990 |
| MD5 | a25cadbba734708706aa828c6ae781c6 |
| BLAKE2b-256 | 9fb4d2ac42ee4be40c1bf4352590472613312a354d65f121e34f8dbb009b09d3 |
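A downloaded archive can be checked against the SHA256 digest listed above before it is installed. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of` and `verify` are illustrative, and the local file path is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest listed for macrodata_refiner-0.2.2.tar.gz.
EXPECTED_SHA256 = "ab338bb7ee1d93b0e03cd912af0577060232f2d68732abeb0466f99214d7a990"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest matches the expected value."""
    return sha256_of(path) == expected.lower()


# Example (assumes the sdist was saved to the current directory):
# verify("macrodata_refiner-0.2.2.tar.gz")
```

pip can enforce the same check at install time through its hash-checking mode, using `--require-hashes` with a requirements file that pins the digest.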
Provenance
The following attestation bundles were made for macrodata_refiner-0.2.2.tar.gz:
Publisher: pypi-release.yml on macrodata-labs/refiner
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: macrodata_refiner-0.2.2.tar.gz
  - Subject digest: ab338bb7ee1d93b0e03cd912af0577060232f2d68732abeb0466f99214d7a990
  - Sigstore transparency entry: 1244060271
  - Sigstore integration time:
  - Permalink: macrodata-labs/refiner@ae2f8be7b4ed553f2da108dd98a4ccaf9c73d2ea
  - Branch / Tag: refs/heads/main
  - Owner: https://github.com/macrodata-labs
  - Access: public
  - Token Issuer: https://token.actions.githubusercontent.com
  - Runner Environment: github-hosted
  - Publication workflow: pypi-release.yml@ae2f8be7b4ed553f2da108dd98a4ccaf9c73d2ea
  - Trigger Event: push
File details
Details for the file macrodata_refiner-0.2.2-py3-none-any.whl.
File metadata
- Download URL: macrodata_refiner-0.2.2-py3-none-any.whl
- Upload date:
- Size: 163.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 07592ed26b9436e30fe11026bc5cfd1eb9e7455f6ca3299768b0f3302010a443 |
| MD5 | b42d7aaed361e03f004e81173ad71ea0 |
| BLAKE2b-256 | a7c8c2a3512a8988810cf8038b32b67a4acbfc9ab0bbebae46688779e26c34e3 |
Provenance
The following attestation bundles were made for macrodata_refiner-0.2.2-py3-none-any.whl:
Publisher: pypi-release.yml on macrodata-labs/refiner
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: macrodata_refiner-0.2.2-py3-none-any.whl
  - Subject digest: 07592ed26b9436e30fe11026bc5cfd1eb9e7455f6ca3299768b0f3302010a443
  - Sigstore transparency entry: 1244060554
  - Sigstore integration time:
  - Permalink: macrodata-labs/refiner@ae2f8be7b4ed553f2da108dd98a4ccaf9c73d2ea
  - Branch / Tag: refs/heads/main
  - Owner: https://github.com/macrodata-labs
  - Access: public
  - Token Issuer: https://token.actions.githubusercontent.com
  - Runner Environment: github-hosted
  - Publication workflow: pypi-release.yml@ae2f8be7b4ed553f2da108dd98a4ccaf9c73d2ea
  - Trigger Event: push