Refiner by Macrodata Labs, a data processing framework for large-scale machine learning datasets

Project description

Macrodata Refiner

Refiner is an open-source engine for turning raw, unstructured, and multimodal data into high-quality datasets for large model training.

It replaces the brittle scripts and stitched-together data tooling that teams still use for training data work, while offering much better support for multimodal data, robotics workflows, and model-based processing.

It also plugs into the Macrodata platform, which gives you visibility into what is happening to your data while pipelines run: job and shard lifecycle, logs, metrics, manifests, and pipeline behavior. The same code can run locally for development and then scale out through Macrodata's elastic serverless cloud.

Quickstart

Install:

pip install macrodata-refiner

Create a Macrodata API key, then log in:

macrodata login

Cloud example

Launch a robotics pipeline on Macrodata Cloud.

This requires a valid API key.

import refiner as mdr

(
    mdr.read_lerobot("hf://datasets/macrodata/aloha_static_battery_ep005_009")
    .map(
        mdr.robotics.motion_trim(
            threshold=0.001,
            pad_frames=5,
        )
    )
    .write_lerobot("hf://buckets/macrodata/test_bucket/aloha_motion")
    .launch_cloud(
        name="motion_trim",
        num_workers=4,
    )
)
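
The pipeline above does not spell out what motion_trim computes. As a rough intuition only, the sketch below shows one plausible reading of the two parameters: frames at the start and end of an episode are dropped while the largest per-joint change stays at or below threshold, and pad_frames extra frames are kept on each side. The helper name trim_idle_frames and the list-of-joint-positions input are assumptions made for illustration; this is not Refiner's implementation.

```python
from typing import List, Sequence


def trim_idle_frames(
    positions: Sequence[Sequence[float]],
    threshold: float = 0.001,
    pad_frames: int = 5,
) -> List[Sequence[float]]:
    """Hypothetical helper: drop motionless frames at both ends of an episode."""
    # Motion per step: largest absolute joint change between consecutive frames.
    deltas = [
        max(abs(b - a) for a, b in zip(prev, cur))
        for prev, cur in zip(positions, positions[1:])
    ]
    moving = [i for i, d in enumerate(deltas) if d > threshold]
    if not moving:
        # No motion detected: leave the episode untouched.
        return list(positions)
    # deltas[i] describes the step from frame i to frame i + 1.
    start = max(moving[0] - pad_frames, 0)
    end = min(moving[-1] + 1 + pad_frames, len(positions) - 1)
    return list(positions[start : end + 1])
```

Calling trim_idle_frames(episode, threshold=0.001, pad_frames=5) on a list of per-frame joint positions returns the shortened list; an episode with no detected motion comes back unchanged.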

Local example

Launch a local pipeline:

import refiner as mdr

def add_preview(row):
    return row.update(
        preview=" ".join(row["text"].split()[:20]),
    )

(
    mdr.read_jsonl("input/*.jsonl")
    .filter(mdr.col("lang") == "en")
    .with_columns(
        text=mdr.col("text").str.strip(),
        text_len=mdr.col("text").str.len(),
    )
    .map(add_preview)
    .write_parquet("s3://my-bucket/english-cleanup/")
    .launch_local(
        name="english-cleanup",
        num_workers=2,
    )
)
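
To make the per-row logic of the local example concrete, here is a plain-Python sketch of the same steps without Refiner. It assumes rows behave like dicts and that text_len is measured after stripping; neither is confirmed by the example above, and Refiner's row type and expression semantics may differ.

```python
def add_preview(row: dict, max_words: int = 20) -> dict:
    """Return a copy of the row with a preview of the first max_words words."""
    words = row["text"].split()
    return {**row, "preview": " ".join(words[:max_words])}


rows = [
    {"lang": "en", "text": "  Refiner turns raw data into training datasets.  "},
    {"lang": "de", "text": "Hallo Welt"},
]

cleaned = []
for row in rows:
    # Stands in for .filter(mdr.col("lang") == "en").
    if row["lang"] != "en":
        continue
    # Stands in for .with_columns(...); measuring the stripped text is an assumption.
    text = row["text"].strip()
    row = {**row, "text": text, "text_len": len(text)}
    # Stands in for .map(add_preview).
    cleaned.append(add_preview(row))
```

Only the English row survives, with its text stripped and the text_len and preview fields added.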

pip install gives you:

  • the Python package as refiner
  • the CLI as macrodata

Batteries included

  • training-data-first pipeline primitives instead of generic ETL abstractions
  • multimodal processing, with robotics support today
  • many built-in readers, transforms, sinks, and lifecycle/runtime machinery, so you do not have to rebuild the same scaffolding in scripts
  • access to any storage backend supported by fsspec (S3, GCS, Hugging Face, etc.)
  • local execution for development and elastic cloud execution for large runs
  • built-in observability through the Macrodata platform, so you can inspect how your data is changing instead of debugging blindly after the fact

Download files

Source Distribution

macrodata_refiner-0.2.0.tar.gz (111.6 kB)

Uploaded Source

Built Distribution

macrodata_refiner-0.2.0-py3-none-any.whl (148.5 kB)

Uploaded Python 3

File details

Details for the file macrodata_refiner-0.2.0.tar.gz.

File metadata

  • Download URL: macrodata_refiner-0.2.0.tar.gz
  • Size: 111.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for macrodata_refiner-0.2.0.tar.gz
Algorithm Hash digest
SHA256 080aac4e1c4554efee28df6a4c17bbfdbbcb29799e34454f359cf1a56d79cd10
MD5 ace3047ba399cb0a0570322a1823f2c5
BLAKE2b-256 512b3e26b60f387dcb7c3a723246e82a3cfe01b5e222f2c3e7efd31658014b41
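
If you download the source distribution by hand, you can compare it against the SHA256 digest listed above. The following is a minimal sketch using only the standard library; the helper names sha256_of and matches_published_hash are illustrative, not part of Refiner or pip.

```python
import hashlib

# SHA256 digest published above for macrodata_refiner-0.2.0.tar.gz.
EXPECTED_SHA256 = "080aac4e1c4554efee28df6a4c17bbfdbbcb29799e34454f359cf1a56d79cd10"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

Calling matches_published_hash with the path to the downloaded archive should return True for an intact copy of the 0.2.0 source distribution.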

Provenance

The following attestation bundles were made for macrodata_refiner-0.2.0.tar.gz:

Publisher: pypi-release.yml on macrodata-labs/refiner

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file macrodata_refiner-0.2.0-py3-none-any.whl.

File hashes

Hashes for macrodata_refiner-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e308a85d015770d7da29511555967a6530c352f55565049f3d2b4027245b1894
MD5 f613c5119ed7c99d7c2ffc654ef4038b
BLAKE2b-256 cfadcdd900e98f86b1134c223c47c79f1e59b9050d7502aa7ade320179eb3c3d

Provenance

The following attestation bundles were made for macrodata_refiner-0.2.0-py3-none-any.whl:

Publisher: pypi-release.yml on macrodata-labs/refiner

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
