Reproducibility and provenance tracker for ML training pipelines

roar

Run Observation & Artifact Registration

roar tracks data artifacts and execution steps in ML pipelines, enabling reproducibility and lineage queries. Tracking is automatic: roar observes your commands as they run and captures the relevant context, so you do not need to define a pipeline explicitly.

Because roar identifies files by their content rather than their names, you can always trace a result back to the exact inputs and code that produced it. The result is reliable reproducibility and a clear history of your artifacts, derived from your normal workflow.

roar captures your work locally. Connecting it to a GLaaS (Global Lineage-as-a-Service) server such as glaas.ai lets you publish your lineage graphs to a shared global registry for visualization and collaboration. Your team can then search for any artifact by its hash to see exactly how it was made and to generate the commands needed to reproduce it on another machine.

Installation

pip install roar-cli
# or with uv
uv pip install roar-cli

Requires Python 3.10+.

Platform Support

Platform       Status
Linux x86_64   ✅ Full support
Linux aarch64  ✅ Full support
macOS          🚧 Experimental (limitations)
Windows        Coming soon

PyPI wheels are published for Linux (x86_64) and macOS (x86_64 and arm64).

Development Installation

# Clone the repository
git clone https://github.com/treqs/roar.git
cd roar

# Install in development mode
uv pip install -e ".[dev]"
# or without uv
pip install -e ".[dev]"

Quick Start

# Initialize roar in your project
cd my-ml-project
roar init

# Run commands with provenance tracking
roar run python preprocess.py --input data.csv --output features.parquet
roar run python train.py --data features.parquet --output model.pt
roar run python evaluate.py --model model.pt --output metrics.json

Tracer Backends

roar run relies on a Rust "tracer" binary to observe file I/O. If you see an error like "No tracer binary found", build one of the backends below.

Backends

Backend  Binary                                        Platforms     Notes
eBPF     roar-tracer-ebpf                              Linux         Fastest, but requires permissions and kernel support.
preload  roar-tracer-preload + libroar_tracer_preload  macOS, Linux  Uses DYLD_INSERT_LIBRARIES (macOS) or LD_PRELOAD (Linux). Not compatible with processes that ignore preload env vars (e.g., SIP/hardened runtime on macOS), or fully-static binaries (common with Go).
ptrace   roar-tracer                                   Linux         Slowest, broadest compatibility on Linux.

Building

cd rust

# eBPF (Linux)
cargo build --release -p roar-tracer-ebpf

# preload (macOS & Linux)
cargo build --release -p roar-tracer-preload

# ptrace (Linux)
cargo build --release -p roar-tracer

Selecting A Backend

By default, roar uses auto mode: prefer eBPF, then preload, then ptrace.

# Show what roar can currently find and whether it looks usable
roar tracer status

# Set a default backend (auto|ebpf|preload|ptrace)
roar tracer set-default preload
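
The auto-mode preference order can be pictured as a simple fallback chain. The following is a minimal Python sketch, not roar's actual implementation: the pick_backend function and the available set are hypothetical, and the behavior when an explicitly chosen backend is unusable (returning None here) is an assumption.

```python
# Preference order described above for auto mode.
AUTO_ORDER = ("ebpf", "preload", "ptrace")


def pick_backend(available, default="auto"):
    """Return the backend to use, or None if nothing usable was found.

    `available` is a set of backend names that look usable on this machine
    (hypothetical input; roar's real detection logic is not shown here).
    """
    if default != "auto":
        # Assumption: an explicit default is used only if it is usable.
        return default if default in available else None
    for name in AUTO_ORDER:
        if name in available:
            return name
    return None


print(pick_backend({"preload", "ptrace"}))   # first usable entry in AUTO_ORDER
print(pick_backend({"ptrace"}))
print(pick_backend({"ebpf", "ptrace"}, default="preload"))
```

Use roar tracer status to see what roar itself detects on your machine.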

macOS Tracing Limitations

On macOS, roar uses the preload backend (DYLD_INSERT_LIBRARIES). macOS System Integrity Protection (SIP) silently blocks library injection for Apple-signed platform binaries — anything under /usr/bin/, /bin/, /sbin/, or /System/. When this happens, roar run will complete successfully but capture no file I/O events.

Affected: /usr/bin/python3, /bin/sh, /usr/bin/ruby, and all other Apple-shipped binaries.

Workaround: Use non-Apple builds of your tools:

# Homebrew
brew install python3
roar run python3 train.py          # Uses /opt/homebrew/bin/python3 — works

# conda / pyenv / nix also work
roar run ~/.pyenv/shims/python train.py

# This will NOT capture file events (SIP blocks it):
roar run /usr/bin/python3 train.py

roar prints a warning when it detects no events were captured from a SIP-protected binary.
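
If you want to check an interpreter before running it under roar, a pre-flight test along these lines may help. This is an illustrative Python sketch only: the helper names are hypothetical, the prefix list is copied from the paragraph above, and it is not how roar itself detects SIP-protected binaries.

```python
import os
import shutil
from pathlib import PurePosixPath

# Directories named above as SIP-protected locations.
SIP_PREFIXES = ("/usr/bin", "/bin", "/sbin", "/System")


def is_sip_protected(path):
    """Return True if an absolute path sits under a SIP-protected directory."""
    p = PurePosixPath(path)
    # Component-wise comparison, so /usr/local/bin does not match /usr/bin.
    return any(p.is_relative_to(prefix) for prefix in SIP_PREFIXES)


def check_command(name):
    """Resolve a command on PATH and report whether tracing is likely blocked."""
    found = shutil.which(name)
    if found is None:
        return None
    resolved = os.path.realpath(found)  # follow symlinks to the real binary
    return resolved, is_sip_protected(resolved)


for candidate in ("/usr/bin/python3", "/opt/homebrew/bin/python3"):
    print(candidate, is_sip_protected(candidate))
```

The check is only meaningful on macOS; on Linux the same paths are not subject to SIP.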

Commands

roar init

Initialize roar in the current directory. Creates a .roar/ directory to store the local database and a config.toml with default settings.

roar init           # Initialize, prompt for gitignore
roar init -y        # Initialize and auto-add to gitignore
roar init -n        # Initialize without modifying gitignore

roar run <command>

Run a command with provenance tracking. Roar captures:

  • Files read and written
  • Git commit and branch
  • Execution time and exit code
  • Command arguments

roar run python train.py --epochs 10 --lr 0.001
roar run ./scripts/preprocess.sh
roar run torchrun --nproc_per_node=4 train.py

# Re-run a previous DAG step
roar run @2                    # Re-run DAG node 2
roar run @2 --epochs=10        # Re-run with parameter override

roar reproduce <hash>

Reproduce an artifact by tracing its lineage.

# Show the reproduction plan (preview)
roar reproduce abc123de

# Run full reproduction
roar reproduce abc123de --run

# Run without prompts
roar reproduce abc123de --run -y

# Include system packages during setup
roar reproduce abc123de --run --package-sync

# Show all required packages (no truncation)
roar reproduce abc123de --list-requirements

Full reproduction clones the git repository, creates a virtual environment, installs recorded packages, and runs the pipeline steps.

roar build <command>

Run a build step with provenance tracking. Build steps run before pipeline steps during reproduction.

# Compile native extensions
roar build maturin develop --release
roar build make -j4

# Install local packages
roar build pip install -e .

Use for setup that should run before the main pipeline (compiling, installing).
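
The ordering rule (build steps before pipeline steps) can be sketched in a few lines of Python. The Step record and its fields are hypothetical and exist only for illustration; they do not reflect roar's internal data model.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str      # "build" or "run" (hypothetical labels)
    index: int     # position within its kind, e.g. 1 for @B1 or @1
    command: str


def reproduction_order(steps):
    """Order steps so every build step precedes every pipeline step."""
    # False sorts before True, so build steps come first;
    # within each kind, steps keep their recorded index order.
    return sorted(steps, key=lambda s: (s.kind != "build", s.index))


recorded = [
    Step("run", 1, "python preprocess.py"),
    Step("build", 1, "pip install -e ."),
    Step("run", 2, "python train.py --epochs 10"),
    Step("build", 2, "maturin develop --release"),
]
for step in reproduction_order(recorded):
    print(step.kind, step.index, step.command)
```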

roar auth

Manage GLaaS authentication.

roar auth register    # Show SSH public key for registration
roar auth test        # Test connection to GLaaS server
roar auth status      # Show current auth status

To register with GLaaS:

  1. Run roar auth register to display your public key
  2. Sign up at https://glaas.ai where you can paste your public key
  3. Run roar auth test to verify

roar config

View or set configuration options.

roar config list
roar config get <key>
roar config set <key> <value>

Run roar config list to see all available options with descriptions. Common options:

Key                           Default               Description
output.track_repo_files       false                 Include repo files in provenance
output.quiet                  false                 Suppress written files report
filters.ignore_system_reads   true                  Ignore /sys, /etc, /sbin reads
filters.ignore_package_reads  true                  Ignore installed package reads
filters.ignore_torch_cache    true                  Ignore torch/triton cache
filters.ignore_tmp_files      true                  Ignore /tmp files
glaas.url                     https://api.glaas.ai  GLaaS server URL
glaas.web_url                 https://glaas.ai      GLaaS web UI URL
registration.omit.enabled     true                  Enable secret filtering
hash.primary                  blake3                Primary hash algorithm
logging.level                 warning               Log level (debug, info, warning, error)
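
To make the filter options concrete, here is a rough Python sketch of how path-prefix filtering of this kind might behave. The keep_path function is hypothetical; only the prefixes (/sys, /etc, /sbin, /tmp) and the option names come from the table above, and roar's real matching rules may differ.

```python
from pathlib import PurePosixPath

# Prefixes taken from the option descriptions above.
SYSTEM_PREFIXES = ("/sys", "/etc", "/sbin")
TMP_PREFIXES = ("/tmp",)


def _under(path, prefixes):
    """True if `path` is inside any directory listed in `prefixes`."""
    p = PurePosixPath(path)
    return any(p.is_relative_to(prefix) for prefix in prefixes)


def keep_path(path, ignore_system_reads=True, ignore_tmp_files=True):
    """Return True if an observed path should stay in the provenance record."""
    if ignore_system_reads and _under(path, SYSTEM_PREFIXES):
        return False
    if ignore_tmp_files and _under(path, TMP_PREFIXES):
        return False
    return True


observed = ["/etc/hosts", "/tmp/scratch.bin", "/home/me/project/data.csv"]
print([p for p in observed if keep_path(p)])
print([p for p in observed if keep_path(p, ignore_tmp_files=False)])
```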

roar dag

Display the pipeline DAG for the current session.

roar dag                  # Compact view with colors
roar dag --expanded       # Show all executions including reruns
roar dag --json           # Machine-readable JSON output
roar dag --show-artifacts # Show intermediate artifacts

roar env

Manage persistent environment variables injected into roar run and roar build.

roar env set FOO bar      # Set FOO=bar
roar env get FOO          # Print value of FOO
roar env list             # List all env vars
roar env unset FOO        # Remove FOO
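
Conceptually, injection means the stored variables are layered over the environment of the launched command. The Python sketch below illustrates that idea only; run_with_env is a hypothetical helper, and the precedence shown (stored values win over inherited ones) is an assumption rather than documented roar behavior.

```python
import os
import subprocess
import sys


def run_with_env(argv, stored_env):
    """Run argv with stored variables layered over the current environment."""
    # Assumption: stored values take precedence over inherited ones.
    env = {**os.environ, **stored_env}
    return subprocess.run(argv, env=env, capture_output=True, text=True)


# The child process prints the value it sees for FOO.
result = run_with_env(
    [sys.executable, "-c", "import os; print(os.environ['FOO'])"],
    {"FOO": "bar"},
)
print(result.stdout.strip())
```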

roar log

Display recent job execution history.

roar log                  # Show recent job history

roar register

Register artifact lineage with GLaaS.

roar register model.pt              # Register model lineage
roar register --dry-run model.pt    # Preview without registering
roar register -y model.pt           # Skip confirmation prompt

roar put

Upload artifacts to cloud storage and register lineage with GLaaS.

roar put model.pt s3://bucket/models/ -m "Final model"
roar put ./checkpoints/ gs://bucket/run-42/ -m "All checkpoints"
roar put @2 s3://bucket/outputs/ -m "Step 2 outputs"

Options:

  • -m, --message — Description of the upload (required)
  • --dry-run — Preview without uploading
  • --no-tag — Skip git tagging

Source formats:

  • File path: model.pt, ./data/output.csv
  • Directory: ./checkpoints/ (uploads all files recursively)
  • Job reference: @2 (uploads outputs from step 2)
  • No source: uploads all outputs from the current session

roar get

Download artifacts from cloud storage.

roar get s3://bucket/models/model.pt ./local/
roar get gs://bucket/data/train.csv
roar get https://example.com/weights.pt --hash abc123...
roar get s3://bucket/checkpoints/ ./local/ # Download all files under prefix

Options:

  • -m, --message — Annotation for this download
  • --hash — Expected BLAKE3 hash (for verification)
  • --tag — Create a git tag for this download
  • --force — Overwrite existing files
  • --dry-run — Preview without downloading

Downloads are registered locally as source nodes in the DAG (outputs only, no inputs). They appear in GLaaS when downstream jobs are registered via roar put or roar register.

roar reset

Start a fresh session. Previous session data is preserved in the database.

roar reset                # Reset with confirmation prompt
roar reset -y             # Reset without confirmation

roar show

Show session, job, or artifact details.

roar show                          # Show active session overview
roar show @1                       # Show details for step 1
roar show @B1                      # Show details for build step 1
roar show a1b2c3d4                 # Show job by UID
roar show ./output/model.pkl       # Show artifact by path

roar status

Show a summary of the active session.

roar status

roar pop

Remove the most recent job from the active session. Useful for undoing a mistaken roar run or correcting the pipeline before registration.

roar pop              # Pop with confirmation prompt
roar pop -y           # Pop without confirmation

What it does:

  • Removes the last job from the session history
  • Deletes output artifacts created by that job (unless they're packages/system files)
  • Does not affect the original input files

Concepts

Artifacts

Data files tracked by their content hash (BLAKE3). The same file content always has the same hash, regardless of filename or location.
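
The following Python sketch illustrates content addressing. BLAKE3 is not in the Python standard library, so the example substitutes hashlib.blake2b as a stand-in; the digests therefore will not match roar's, and the content_hash helper is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, chunk_size=1 << 20):
    """Hash a file's bytes in chunks; the filename never enters the digest."""
    h = hashlib.blake2b(digest_size=32)  # stand-in: BLAKE3 is not in the stdlib
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "model.pt"
    b = Path(tmp) / "renamed" / "copy.bin"
    b.parent.mkdir()
    a.write_bytes(b"weights")
    b.write_bytes(b"weights")
    # Same bytes under different names and locations: same hash.
    same = content_hash(a) == content_hash(b)
    b.write_bytes(b"weights-v2")
    # Different bytes: different hash.
    changed = content_hash(a) != content_hash(b)

print(same, changed)
```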

Jobs

Recorded executions that consume input artifacts and produce output artifacts. Each roar run creates a job record.
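
Because each job links input artifacts to output artifacts, lineage for an artifact can be recovered by walking those links backwards. The Python sketch below shows the idea with a hypothetical Job record and placeholder hashes such as h_model; it is not roar's data model or query code.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    command: str
    inputs: list = field(default_factory=list)    # artifact hashes read
    outputs: list = field(default_factory=list)   # artifact hashes written


def lineage(target, jobs):
    """Return the jobs needed to produce `target`, in execution order."""
    producer = {h: job for job in jobs for h in job.outputs}
    ordered, seen = [], set()

    def visit(artifact):
        job = producer.get(artifact)
        if job is None or id(job) in seen:
            return                      # raw input, or job already visited
        seen.add(id(job))
        for dep in job.inputs:
            visit(dep)                  # upstream jobs first
        ordered.append(job)

    visit(target)
    return ordered


jobs = [
    Job("python preprocess.py", ["h_data"], ["h_features"]),
    Job("python train.py", ["h_features"], ["h_model"]),
    Job("python evaluate.py", ["h_model"], ["h_metrics"]),
]
print([j.command for j in lineage("h_model", jobs)])
```

Here the evaluation job is left out of the result because the model does not depend on it.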

Collections

Named groups of artifacts, used for downloaded datasets or upload bundles.

Workflow Example

# Record your pipeline
roar run python preprocess.py
roar run python train.py --epochs 10
roar run python evaluate.py

# Later, reproduce an artifact
roar reproduce <model-hash> --run

Git Integration

Roar automatically captures git metadata:

  • Current commit hash
  • Branch name
  • Repository path

Data Storage

All data is stored locally in .roar/roar.db (SQLite). The database includes:

  • Artifact hashes and metadata
  • Job records with inputs/outputs
  • Hash cache for performance

Add .roar/ to your .gitignore (roar offers to do this during roar init).
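
As an illustration of why a hash cache helps, the Python sketch below skips rehashing a file whose size and modification time are unchanged. Everything here is an assumption made for the example: the hash_cache table, its columns, the cache key, and the use of hashlib.blake2b in place of BLAKE3. The actual layout of .roar/roar.db is not described in this document.

```python
import hashlib
import os
import sqlite3
import tempfile

# Hypothetical schema; not the real layout of .roar/roar.db.
SCHEMA = """
CREATE TABLE hash_cache (
    path TEXT PRIMARY KEY,
    size INTEGER NOT NULL,
    mtime_ns INTEGER NOT NULL,
    hash TEXT NOT NULL
)
"""


def cached_hash(db, path):
    """Return (hash, was_cached); rehash only when size or mtime changed."""
    st = os.stat(path)
    row = db.execute(
        "SELECT hash FROM hash_cache WHERE path = ? AND size = ? AND mtime_ns = ?",
        (path, st.st_size, st.st_mtime_ns),
    ).fetchone()
    if row:
        return row[0], True
    with open(path, "rb") as f:
        # blake2b is a stdlib stand-in for BLAKE3.
        digest = hashlib.blake2b(f.read(), digest_size=32).hexdigest()
    db.execute(
        "INSERT OR REPLACE INTO hash_cache VALUES (?, ?, ?, ?)",
        (path, st.st_size, st.st_mtime_ns, digest),
    )
    return digest, False


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"features")
first = cached_hash(db, tmp.name)    # computed and stored
second = cached_hash(db, tmp.name)   # served from the cache
os.unlink(tmp.name)
print(first[1], second[1])
```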

GLaaS Server

Roar can register artifacts and jobs with a GLaaS (Global Lineage-as-a-Service) server using the roar register command.

Server Setup

# Install with server dependencies
uv pip install -e ".[server]"
# or without uv
pip install -e ".[server]"

# Run the server
glaas-server

# Or with custom host/port
GLAAS_HOST=0.0.0.0 GLAAS_PORT=8080 glaas-server

The server provides:

  • REST API for artifact and job registration
  • Web UI at / with artifact and job browsers
  • Search and filtering by command, GPU, file type, etc.

Client Configuration

# Set the GLaaS server URL
roar config set glaas.url http://localhost:8000

# Show your SSH key (copy to GLaaS web UI)
roar auth register

# Test authentication
roar auth test

Development

Prerequisites

Setup

# Install dev dependencies
uv pip install -e ".[dev]"

Running Quality Checks

# Linting
ruff check .

# Format check
ruff format --check

# Type checking
mypy roar

# Run all checks at once
ruff check . && ruff format --check && mypy roar

Running Tests

# Run all tests (excluding those requiring a live GLaaS server)
pytest tests/ -v -m "not glaas and not live_glaas"

# Run with coverage
pytest tests/ -v --cov=roar --cov-report=term-missing -m "not glaas and not live_glaas"

# Run tests in parallel
pytest tests/ -v -n auto -m "not glaas and not live_glaas"

# Run only unit tests (fast)
pytest tests/ -v -m "not integration and not e2e and not glaas and not live_glaas"

License

Apache 2.0
