Refiner by Macrodata Labs: a data processing framework for large-scale machine learning datasets

Macrodata Refiner

Refiner is an open-source engine for turning raw, unstructured, and multimodal data into high-quality datasets for large model training.

It replaces the brittle scripts and stitched-together data tooling that teams still use for training data work, while offering much better support for multimodal data, robotics workflows, and model-based processing.

It also plugs into the Macrodata platform, which gives you visibility into what is happening to your data while pipelines run: job and shard lifecycle, logs, metrics, manifests, and pipeline behavior. The same code can run locally for development and then scale out through Macrodata's elastic serverless cloud.

Quickstart

Install:

pip install macrodata-refiner

Create a Macrodata API key, then log in:

macrodata login

Cloud example

Launch a robotics pipeline on Macrodata Cloud.

This requires a valid API key.

import refiner as mdr

(
    mdr.read_lerobot("hf://datasets/macrodata/aloha_static_battery_ep005_009")
    .map(
        mdr.robotics.motion_trim(
            threshold=0.001,
            pad_frames=5,
        )
    )
    .write_lerobot("hf://buckets/macrodata/test_bucket/aloha_motion")
    .launch_cloud(
        name="motion_trim",
        num_workers=4,
    )
)
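
To give a feel for what the threshold and pad_frames parameters could mean, here is a minimal plain-Python sketch of motion trimming. It is not Refiner's implementation: the helper name motion_trim_indices, the per-joint position input, and the choice to keep a fully static episode untouched are all assumptions made for illustration.

```python
from typing import Sequence


def motion_trim_indices(
    positions: Sequence[Sequence[float]],
    threshold: float = 0.001,
    pad_frames: int = 5,
) -> range:
    """Return the frame indices to keep (hypothetical helper, not Refiner's API).

    positions[i] holds the joint values recorded at frame i.
    """
    n = len(positions)
    if n < 2:
        return range(n)

    # Motion between frame i and i + 1: the largest absolute change in any joint.
    deltas = [
        max((abs(b - a) for a, b in zip(positions[i], positions[i + 1])), default=0.0)
        for i in range(n - 1)
    ]
    moving = [i for i, delta in enumerate(deltas) if delta > threshold]
    if not moving:
        # Assumption: an episode with no detected motion is left as-is.
        return range(n)

    first_moving = moving[0]      # last still frame before motion starts
    last_moving = moving[-1] + 1  # frame reached by the final moving step
    start = max(0, first_moving - pad_frames)
    stop = min(n, last_moving + pad_frames + 1)
    return range(start, stop)


# 10 still frames, a 5-frame ramp, then 10 still frames (one joint).
episode = [[0.0]] * 10 + [[0.1], [0.2], [0.3], [0.4], [0.5]] + [[0.5]] * 10
kept = motion_trim_indices(episode, threshold=0.001, pad_frames=5)
```

In this sketch a larger pad_frames keeps more still frames on each side of the motion, and a larger threshold treats more small movements as stillness.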

Need cloud GPUs? See Launchers for the GPU-specific cloud options.

Local example

Launch a local pipeline:

import refiner as mdr

def add_preview(row):
    return row.update(
        preview=" ".join(row["text"].split()[:20]),
    )

(
    mdr.read_jsonl("input/*.jsonl")
    .filter(mdr.col("lang") == "en")
    .with_columns(
        text=mdr.col("text").str.strip(),
        text_len=mdr.col("text").str.len(),
    )
    .map(add_preview)
    .write_parquet("s3://my-bucket/english-cleanup/")
    .launch_local(
        name="english-cleanup",
        num_workers=2,
    )
)
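
For intuition, the row-level logic of this pipeline can be approximated in plain Python without Refiner. This is only a sketch: rows are modeled as plain dicts, clean_rows is a hypothetical name, and text_len is computed here on the stripped text, which is an assumption because the source does not say whether Refiner evaluates it against the original or the stripped column.

```python
from typing import Dict, Iterable, Iterator


def add_preview(row: Dict[str, object]) -> Dict[str, object]:
    # Keep the first 20 whitespace-separated tokens as a short preview.
    text = str(row["text"])
    return {**row, "preview": " ".join(text.split()[:20])}


def clean_rows(rows: Iterable[Dict[str, object]]) -> Iterator[Dict[str, object]]:
    """Plain-Python stand-in for the filter / with_columns / map steps."""
    for row in rows:
        # filter: keep English rows only
        if row.get("lang") != "en":
            continue
        # with_columns: strip the text and record its length (assumed: after stripping)
        text = str(row["text"]).strip()
        cleaned = {**row, "text": text, "text_len": len(text)}
        # map: attach the preview column
        yield add_preview(cleaned)


sample = [
    {"lang": "en", "text": "  hello   world  "},
    {"lang": "fr", "text": "bonjour"},
]
result = list(clean_rows(sample))
```

The Refiner version expresses the same steps declaratively, which is what lets the same pipeline run with launch_local or be scaled out.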

pip install gives you:

  • the Python package as refiner
  • the CLI as macrodata
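
A quick way to confirm both pieces are present after installing is a small standard-library check. This is a sketch; check_install is a hypothetical helper, and only the names refiner and macrodata come from this README.

```python
import importlib.util
import shutil
from typing import Dict


def check_install(module_name: str = "refiner", cli_name: str = "macrodata") -> Dict[str, bool]:
    """Report whether the Python module is importable and the CLI is on PATH."""
    return {
        "module": importlib.util.find_spec(module_name) is not None,
        "cli": shutil.which(cli_name) is not None,
    }


status = check_install()
```

If "cli" is False while "module" is True, the environment's scripts directory is probably missing from PATH.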

Batteries included

  • training-data-first pipeline primitives instead of generic ETL abstractions
  • multimodal processing, with robotics support today
  • many built-in readers, transforms, sinks, and lifecycle/runtime machinery, so you do not have to rebuild the same scaffolding in scripts
  • access to any storage backend supported by fsspec (S3, Google Cloud Storage, Hugging Face, etc.)
  • local execution for development and elastic cloud execution for large runs
  • built-in observability through the Macrodata platform, so you can inspect how your data is changing instead of debugging blindly after the fact

