
Refiner by Macrodata Labs, a data processing framework for large-scale machine learning datasets

Macrodata Refiner

Refiner is an open-source engine for turning raw, unstructured, and multimodal data into high-quality datasets for large model training.

It replaces the brittle scripts and stitched-together data tooling that teams still use for training data work, while offering much better support for multimodal data, robotics workflows, and model-based processing.

It also plugs into the Macrodata platform, which gives you visibility into what is happening to your data while pipelines run: job and shard lifecycle, logs, metrics, manifests, and pipeline behavior. The same code can run locally for development and then scale out through Macrodata's elastic serverless cloud.

Quickstart

Install:

pip install macrodata-refiner

Create a Macrodata API key:

Log in:

macrodata login

Cloud example

Launch a robotics pipeline on Macrodata Cloud.

This requires a valid API key.

import refiner as mdr

(
    mdr.read_lerobot("hf://datasets/macrodata/aloha_static_battery_ep005_009")
    .map(
        mdr.robotics.motion_trim(
            threshold=0.001,
            pad_frames=5,
        )
    )
    .write_lerobot("hf://buckets/macrodata/test_bucket/aloha_motion")
    .launch_cloud(
        name="motion_trim",
        num_workers=4,
    )
)
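The source does not describe how `mdr.robotics.motion_trim` works internally. As a rough illustration of what trimming by motion with a `threshold` and `pad_frames` could mean, here is a plain-Python sketch that does not use Refiner at all. The function name `motion_trim_bounds`, the per-frame state layout, and the exact padding rule are all assumptions made for this example, not the library's behavior.

```python
def motion_trim_bounds(positions, threshold=0.001, pad_frames=5):
    """Return (start, end) slice bounds that keep the moving part of an episode.

    Hypothetical helper for illustration only; not part of Refiner.
    positions: list of per-frame state vectors (lists of floats).
    """
    # Motion per step: largest absolute change in any dimension
    # between one frame and the next.
    deltas = [
        max(abs(b - a) for a, b in zip(prev, cur))
        for prev, cur in zip(positions, positions[1:])
    ]
    # Indices of steps whose motion exceeds the threshold.
    moving = [i for i, d in enumerate(deltas) if d > threshold]
    if not moving:
        # Nothing moves: leave the episode untouched.
        return 0, len(positions)
    # Keep pad_frames before the first moving step and after the last changed frame.
    start = max(moving[0] - pad_frames, 0)
    end = min(moving[-1] + 2 + pad_frames, len(positions))
    return start, end


# Ten still frames, a short ramp, then ten still frames.
episode = [[0.0]] * 10 + [[0.1], [0.2], [0.3]] + [[0.3]] * 10
start, end = motion_trim_bounds(episode, threshold=0.001, pad_frames=2)
trimmed = episode[start:end]
```

Under these assumptions, idle frames at the beginning and end of an episode are dropped while a small margin around the motion is kept.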

Local example

Launch a local pipeline:

import refiner as mdr

def add_preview(row):
    return row.update(
        preview=" ".join(row["text"].split()[:20]),
    )

(
    mdr.read_jsonl("input/*.jsonl")
    .filter(mdr.col("lang") == "en")
    .with_columns(
        text=mdr.col("text").str.strip(),
        text_len=mdr.col("text").str.len(),
    )
    .map(add_preview)
    .write_parquet("s3://my-bucket/english-cleanup/")
    .launch_local(
        name="english-cleanup",
        num_workers=2,
    )
)
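To make the per-row effect of the local example concrete, the sketch below applies the same steps to a single dict in plain Python, without Refiner. The helper `clean_record` is hypothetical. It also assumes that `text_len` is measured on the stripped text, which the source does not state; Refiner may evaluate both `with_columns` expressions against the original column.

```python
def clean_record(row):
    """Mirror the filter, with_columns, and add_preview steps on one dict.

    Hypothetical helper for illustration only; returns None when the
    record would be filtered out.
    """
    # filter: keep English records only.
    if row.get("lang") != "en":
        return None
    # with_columns: strip surrounding whitespace and record the length.
    # Assumption: the length is taken after stripping.
    text = row["text"].strip()
    return {
        **row,
        "text": text,
        "text_len": len(text),
        # add_preview: first 20 whitespace-separated tokens.
        "preview": " ".join(text.split()[:20]),
    }


records = [
    {"lang": "en", "text": "  hello world  "},
    {"lang": "fr", "text": "bonjour"},
]
cleaned = [r for r in map(clean_record, records) if r is not None]
```

A sketch like this can be handy for checking the row logic on a few records before launching a pipeline.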

pip install gives you:

  • the Python package as refiner
  • the CLI as macrodata

Batteries included

  • training-data-first pipeline primitives instead of generic ETL abstractions
  • multimodal processing, with robotics support today
  • many built-in readers, transforms, sinks, and lifecycle/runtime machinery, so you do not have to rebuild the same scaffolding in scripts
  • access to any storage backend supported by fsspec (S3, GCS, Hugging Face, etc.)
  • local execution for development and elastic cloud execution for large runs
  • built-in observability through the Macrodata platform, so you can inspect how your data is changing instead of debugging blindly after the fact

Docs

Getting started:

Core concepts:

Modalities and platform:

Community

Download files

Source Distribution

macrodata_refiner-0.2.1.tar.gz (112.3 kB view details)

Uploaded Source

Built Distribution

macrodata_refiner-0.2.1-py3-none-any.whl (149.4 kB view details)

Uploaded Python 3

File details

Details for the file macrodata_refiner-0.2.1.tar.gz.

File metadata

  • Download URL: macrodata_refiner-0.2.1.tar.gz
  • Upload date:
  • Size: 112.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for macrodata_refiner-0.2.1.tar.gz
Algorithm Hash digest
SHA256 c3d0e592a26fc277e82342bb526780bb8492196337168d857b4e693c093c8657
MD5 faf1de1f2d1bca33f4a26cd3a90c24e5
BLAKE2b-256 f199972eb51fb197a54c8b984aa973f927c4ac845ee08a02bf34eb5f694e6261

Provenance

The following attestation bundles were made for macrodata_refiner-0.2.1.tar.gz:

Publisher: pypi-release.yml on macrodata-labs/refiner

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file macrodata_refiner-0.2.1-py3-none-any.whl.

File hashes

Hashes for macrodata_refiner-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 da154d2cd7efdd5877d8db2e9c9bdcf1592864a1d093a49e600fb67c456f7ff2
MD5 d9f86cf56ee28bb45426ac4a251be333
BLAKE2b-256 09d170a90a423eecffe86af8e895332709e9105eb8485cb00ae0e47742d92a73

Provenance

The following attestation bundles were made for macrodata_refiner-0.2.1-py3-none-any.whl:

Publisher: pypi-release.yml on macrodata-labs/refiner
