
Reproducibility and provenance tracker for ML training pipelines


roar

Run Observation & Artifact Registration

roar tracks data artifacts and execution steps in ML pipelines, enabling reproducibility and lineage queries. Tracking is automatic: roar observes your commands as they run and captures the essential context, so you never have to define a pipeline explicitly.

Because roar identifies files by their content rather than their names, you can always trace a result back to the exact inputs and code that produced it. The result is reliable reproducibility and a clear history of your artifacts, all derived from your normal workflow.

While roar captures your work locally, connecting it to a GLaaS (Global Lineage-as-a-Service) server like glaas.ai allows you to publish your lineage graphs to a shared global registry for easy visualization and collaboration. Your team can then search for any artifact by its hash to see exactly how it was made and generate the precise commands needed to reproduce it on another machine.

Installation

pip install roar-cli
# or with uv
uv pip install roar-cli

Requires Python 3.10+.

Platform Support

| Platform | Status |
|---|---|
| Linux x86_64 | ✅ Full support |
| Linux aarch64 | ✅ Full support |
| macOS | 🚧 Experimental (limitations) |
| Windows | Coming soon |

PyPI wheels are published for Linux (x86_64) and macOS (x86_64 and arm64).

Development Installation

# Clone the repository
git clone https://github.com/treqs/roar.git
cd roar

# Install in development mode
uv pip install -e ".[dev]"
# or without uv
pip install -e ".[dev]"

Quick Start

# Initialize roar in your project
cd my-ml-project
roar init

# Run commands with provenance tracking
roar run python preprocess.py --input data.csv --output features.parquet
roar run python train.py --data features.parquet --output model.pt
roar run python evaluate.py --model model.pt --output metrics.json

Tracer Backends

roar run relies on a Rust "tracer" binary to observe file I/O. If you see an error like "No tracer binary found", build one of the backends below.

Backends

| Backend | Binary | Platforms | Notes |
|---|---|---|---|
| eBPF | roar-tracer-ebpf | Linux | Fastest, but requires permissions and kernel support. |
| preload | roar-tracer-preload + libroar_tracer_preload | macOS, Linux | Uses DYLD_INSERT_LIBRARIES (macOS) or LD_PRELOAD (Linux). Not compatible with processes that ignore preload env vars (e.g., SIP/hardened runtime on macOS), or fully-static binaries (common with Go). |
| ptrace | roar-tracer | Linux | Slowest, broadest compatibility on Linux. |

Building

cd rust

# eBPF (Linux)
cargo build --release -p roar-tracer-ebpf

# preload (macOS & Linux)
cargo build --release -p roar-tracer-preload

# ptrace (Linux)
cargo build --release -p roar-tracer

Selecting A Backend

By default, roar uses auto mode: prefer eBPF, then preload, then ptrace.

# Show what roar can currently find and whether it looks usable
roar tracer status

# Set a default backend (auto|ebpf|preload|ptrace)
roar tracer set-default preload
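
The auto-mode preference order described above can be sketched in a few lines of Python. This is only an illustration of the fallback idea: `pick_backend`, `AUTO_ORDER`, and the `usable` argument are hypothetical names, and the behavior when an explicitly chosen backend is unusable is an assumption, not roar's documented behavior.

```python
from typing import Iterable

# Preference order described for auto mode: eBPF, then preload, then ptrace.
AUTO_ORDER = ("ebpf", "preload", "ptrace")


def pick_backend(usable: Iterable[str], default: str = "auto") -> str:
    """Choose a tracer backend from those usable on this machine.

    Hypothetical helper: roar's real detection logic (what
    `roar tracer status` reports) is not shown in this document.
    """
    usable_set = set(usable)
    if default != "auto":
        # Assumption: an explicit default fails loudly instead of falling back.
        if default not in usable_set:
            raise RuntimeError(f"requested backend {default!r} is not usable")
        return default
    for name in AUTO_ORDER:
        if name in usable_set:
            return name
    raise RuntimeError("No tracer binary found")


print(pick_backend({"ptrace", "preload"}))                    # preload
print(pick_backend({"ptrace", "preload"}, default="ptrace"))  # ptrace
```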

macOS Tracing Limitations

On macOS, roar uses the preload backend (DYLD_INSERT_LIBRARIES). macOS System Integrity Protection (SIP) silently blocks library injection for Apple-signed platform binaries — anything under /usr/bin/, /bin/, /sbin/, or /System/. When this happens, roar run will complete successfully but capture no file I/O events.

Affected: /usr/bin/python3, /bin/sh, /usr/bin/ruby, and all other Apple-shipped binaries.

Workaround: Use non-Apple builds of your tools:

# Homebrew
brew install python3
roar run python3 train.py          # Uses /opt/homebrew/bin/python3 — works

# conda / pyenv / nix also work
roar run ~/.pyenv/shims/python train.py

# This will NOT capture file events (SIP blocks it):
roar run /usr/bin/python3 train.py

roar prints a warning when it detects no events were captured from a SIP-protected binary.
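
To check ahead of time whether a tool is likely to be affected, a small path test along the following lines may help. It is a heuristic sketch based only on the directory prefixes listed above; `looks_sip_protected` is a hypothetical helper, not part of roar, and it does not query SIP itself.

```python
import shutil
from pathlib import PurePosixPath

# Directory prefixes listed above as locations of Apple-signed platform binaries.
SIP_PREFIXES = ("/usr/bin", "/bin", "/sbin", "/System")


def looks_sip_protected(executable: str) -> bool:
    """Heuristic: True if an absolute path sits under a SIP-protected prefix."""
    path = PurePosixPath(executable)
    return any(path.is_relative_to(prefix) for prefix in SIP_PREFIXES)


print(looks_sip_protected("/usr/bin/python3"))           # True
print(looks_sip_protected("/opt/homebrew/bin/python3"))  # False

# Check whatever `python3` resolves to on this machine (result varies by setup).
resolved = shutil.which("python3")
if resolved is not None:
    print(resolved, looks_sip_protected(resolved))
```

The check compares whole path components, so a path such as /usr/local/bin/python3 is not mistaken for one under /usr/bin.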

Commands

roar init

Initialize roar in the current directory. Creates a .roar/ directory to store the local database and a config.toml with default settings.

roar init           # Initialize, prompt for gitignore
roar init -y        # Initialize and auto-add to gitignore
roar init -n        # Initialize without modifying gitignore

roar run <command>

Run a command with provenance tracking. Roar captures:

  • Files read and written
  • Git commit and branch
  • Execution time and exit code
  • Command arguments

roar run python train.py --epochs 10 --lr 0.001
roar run ./scripts/preprocess.sh
roar run torchrun --nproc_per_node=4 train.py

# Re-run a previous DAG step
roar run @2                    # Re-run DAG node 2
roar run @2 --epochs=10        # Re-run with parameter override

roar reproduce <hash>

Reproduce an artifact by tracing its lineage.

# Show the reproduction plan (preview)
roar reproduce abc123de

# Run full reproduction
roar reproduce abc123de --run

# Run without prompts
roar reproduce abc123de --run -y

# Include system packages during setup
roar reproduce abc123de --run --package-sync

# Show all required packages (no truncation)
roar reproduce abc123de --list-requirements

Full reproduction clones the git repository, creates a virtual environment, installs recorded packages, and runs the pipeline steps.
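
Conceptually, building the reproduction plan means walking from the requested artifact back through the jobs that produced it, then listing those jobs upstream-first. The sketch below illustrates that idea with an in-memory structure; the `Job` record, the placeholder hashes, and `reproduction_plan` are hypothetical and do not reflect roar's actual schema or implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job record: a command plus input and output artifact hashes."""
    command: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)


def reproduction_plan(target: str, jobs: list[Job]) -> list[str]:
    """Return the commands needed to rebuild `target`, upstream steps first."""
    producer = {out: job for job in jobs for out in job.outputs}
    plan: list[str] = []
    seen: set[int] = set()

    def visit(artifact: str) -> None:
        job = producer.get(artifact)
        if job is None or id(job) in seen:
            return  # source artifact (no producer) or step already planned
        seen.add(id(job))
        for upstream in job.inputs:
            visit(upstream)
        plan.append(job.command)

    visit(target)
    return plan


jobs = [
    Job("python preprocess.py", inputs=["h_raw"], outputs=["h_features"]),
    Job("python train.py", inputs=["h_features"], outputs=["h_model"]),
    Job("python evaluate.py", inputs=["h_model"], outputs=["h_metrics"]),
]
print(reproduction_plan("h_model", jobs))
# ['python preprocess.py', 'python train.py']
```

Steps that are not upstream of the target (here, evaluate.py) are left out of the plan.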

roar build <command>

Run a build step with provenance tracking. Build steps run before pipeline steps during reproduction.

# Compile native extensions
roar build maturin develop --release
roar build make -j4

# Install local packages
roar build pip install -e .

Use roar build for setup steps that must run before the main pipeline, such as compiling or installing.

roar auth

Manage GLaaS authentication.

roar auth register    # Show SSH public key for registration
roar auth test        # Test connection to GLaaS server
roar auth status      # Show current auth status

To register with GLaaS:

  1. Run roar auth register to display your public key
  2. Sign up at https://glaas.ai and paste your public key
  3. Run roar auth test to verify

roar config

View or set configuration options.

roar config list
roar config get <key>
roar config set <key> <value>

Run roar config list to see all available options with descriptions. Common options:

| Key | Default | Description |
|---|---|---|
| output.track_repo_files | false | Include repo files in provenance |
| output.quiet | false | Suppress written files report |
| filters.ignore_system_reads | true | Ignore /sys, /etc, /sbin reads |
| filters.ignore_package_reads | true | Ignore installed package reads |
| filters.ignore_torch_cache | true | Ignore torch/triton cache |
| filters.ignore_tmp_files | true | Ignore /tmp files |
| glaas.url | https://api.glaas.ai | GLaaS server URL |
| glaas.web_url | https://glaas.ai | GLaaS web UI URL |
| registration.omit.enabled | true | Enable secret filtering |
| hash.primary | blake3 | Primary hash algorithm |
| logging.level | warning | Log level (debug, info, warning, error) |
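
As an illustration of what the filters.* options mean in practice, the following sketch drops observed paths under the prefixes named in the descriptions above. It is not roar's filtering code: `filter_paths` and its keyword arguments are hypothetical, and only the two prefix-based filters are modeled.

```python
from pathlib import PurePosixPath

# Prefixes taken from the option descriptions above.
SYSTEM_PREFIXES = ("/sys", "/etc", "/sbin")
TMP_PREFIXES = ("/tmp",)


def filter_paths(paths, ignore_system_reads=True, ignore_tmp_files=True):
    """Return the observed paths that survive the enabled filters."""
    ignored = ()
    if ignore_system_reads:
        ignored += SYSTEM_PREFIXES
    if ignore_tmp_files:
        ignored += TMP_PREFIXES
    return [
        p for p in paths
        if not any(PurePosixPath(p).is_relative_to(prefix) for prefix in ignored)
    ]


observed = ["/etc/hosts", "/tmp/scratch.bin", "/home/me/project/data.csv"]
print(filter_paths(observed))
# ['/home/me/project/data.csv']
print(filter_paths(observed, ignore_tmp_files=False))
# ['/tmp/scratch.bin', '/home/me/project/data.csv']
```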

roar dag

Display the pipeline DAG for the current session.

roar dag                  # Compact view with colors
roar dag --expanded       # Show all executions including reruns
roar dag --json           # Machine-readable JSON output
roar dag --show-artifacts # Show intermediate artifacts

roar env

Manage persistent environment variables injected into roar run and roar build.

roar env set FOO bar      # Set FOO=bar
roar env get FOO          # Print value of FOO
roar env list             # List all env vars
roar env unset FOO        # Remove FOO
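
The effect of injecting persistent variables into a tracked command can be pictured as layering a stored mapping over the current environment before launching the process. This is a minimal sketch of that idea; `run_with_env` is a hypothetical helper, and the rule that stored values override ambient ones is an assumption rather than documented roar behavior.

```python
import os
import subprocess
import sys


def run_with_env(command: list[str], stored: dict[str, str]) -> subprocess.CompletedProcess:
    """Run `command` with `stored` variables layered over the current environment."""
    env = {**os.environ, **stored}  # assumption: stored values win on conflict
    return subprocess.run(command, env=env, capture_output=True, text=True, check=True)


result = run_with_env(
    [sys.executable, "-c", "import os; print(os.environ['FOO'])"],
    {"FOO": "bar"},
)
print(result.stdout.strip())  # bar
```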

roar log

Display recent job execution history.

roar log                  # Show recent job history

roar label

Manage local labels for DAGs (sessions), jobs, and artifacts.

# Set labels (patches the current label document)
roar label set dag current owner=alice team=ml
roar label set job @2 phase=train lr=0.001
roar label set artifact ./outputs/model.pt model.name=resnet50 stage=baseline

# Copy labels from one entity to another
roar label cp job @2 artifact ./outputs/model.pt

# Show current labels
roar label show dag current
roar label show job @2
roar label show artifact ./outputs/model.pt

# Show label history (all versions)
roar label history dag current
roar label history artifact <artifact-hash>

Entity targets:

  • dag: current or a session hash prefix
  • job: step ref (@N or @BN) or job UID
  • artifact: file path or artifact hash

Labels are stored locally and included in lineage registration/publish flows to GLaaS when supported by the configured server.
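
The "patch the current document, keep every version" behavior described for roar label set and roar label history can be modeled roughly as follows. `LabelStore` is a hypothetical in-memory stand-in, and treating keys such as model.name as flat strings is an assumption; roar's storage format is not described here.

```python
class LabelStore:
    """Hypothetical in-memory stand-in for one entity's label document."""

    def __init__(self) -> None:
        self.versions: list[dict[str, str]] = []

    def current(self) -> dict[str, str]:
        """Return a copy of the latest label document (empty if none yet)."""
        return dict(self.versions[-1]) if self.versions else {}

    def set(self, *pairs: str) -> dict[str, str]:
        """Patch the current document with key=value pairs and record a new version."""
        doc = self.current()
        for pair in pairs:
            key, sep, value = pair.partition("=")
            if not sep or not key:
                raise ValueError(f"expected key=value, got {pair!r}")
            doc[key] = value
        self.versions.append(doc)
        return doc


labels = LabelStore()
labels.set("owner=alice", "team=ml")
labels.set("team=research")
print(labels.current())      # {'owner': 'alice', 'team': 'research'}
print(len(labels.versions))  # 2
```

Because each call appends a new document instead of mutating the old one, earlier versions remain available for a history view.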

roar register

Register session, job, step, or artifact lineage with GLaaS.

roar register model.pt              # Register model lineage
roar register --dry-run model.pt    # Preview without registering
roar register -y model.pt           # Skip confirmation prompt
roar register @4                    # Register lineage for DAG step 4
roar register deadbeef              # Register lineage for a local job UID
roar register 7f1e...c9a4           # Register lineage for a tracked artifact hash
roar register 8d7a1f2c...           # Register a whole local session
roar register s3://bucket/run/out   # Register a tracked remote S3 artifact

Supported targets:

  • Local artifact path: model.pt, ./outputs/metrics.json
  • Tracked artifact hash: primitive or composite
  • Local job UID: full UID or unique prefix
  • Step reference: @N or @BN
  • Local session hash: full hash or unique prefix
  • Tracked remote path: s3://...

For bare 8-character hex targets, roar register prefers a matching local job UID before falling back to session-hash-prefix resolution.
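
The resolution order for a bare hex target can be sketched as a two-stage lookup: try job UIDs first, then session hash prefixes. The functions and the sample identifiers below are hypothetical and only illustrate the stated precedence; how roar handles ambiguous prefixes is an assumption.

```python
from typing import Iterable, Optional


def unique_match(prefix: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the single candidate starting with `prefix`, or None if there is none."""
    matches = [c for c in candidates if c.startswith(prefix)]
    if len(matches) > 1:
        # Assumption: an ambiguous prefix is an error rather than a silent pick.
        raise LookupError(f"ambiguous prefix {prefix!r}: {matches}")
    return matches[0] if matches else None


def resolve_target(target: str, job_uids: Iterable[str], session_hashes: Iterable[str]):
    """Resolve a bare hex target, preferring job UIDs over session hashes."""
    job = unique_match(target, job_uids)
    if job is not None:
        return ("job", job)
    session = unique_match(target, session_hashes)
    if session is not None:
        return ("session", session)
    raise LookupError(f"no job or session matches {target!r}")


job_uids = ["deadbeef"]
session_hashes = ["deadbeef00112233", "8d7a1f2c99aa0011"]
print(resolve_target("deadbeef", job_uids, session_hashes))  # ('job', 'deadbeef')
print(resolve_target("8d7a1f2c", job_uids, session_hashes))  # ('session', '8d7a1f2c99aa0011')
```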

roar put

Upload artifacts to cloud storage and register lineage with GLaaS.

roar put model.pt s3://bucket/models/ -m "Final model"
roar put ./checkpoints/ gs://bucket/run-42/ -m "All checkpoints"
roar put @2 s3://bucket/outputs/ -m "Step 2 outputs"

Options:

  • -m, --message — Description of the upload (required)
  • --dry-run — Preview without uploading
  • --no-tag — Skip git tagging

Source formats:

  • File path: model.pt, ./data/output.csv
  • Directory: ./checkpoints/ (uploads all files recursively)
  • Job reference: @2 (uploads outputs from step 2)
  • No source: uploads all outputs from the current session

roar get

Download artifacts from cloud storage.

roar get s3://bucket/models/model.pt ./local/
roar get gs://bucket/data/train.csv
roar get https://example.com/weights.pt --hash abc123...
roar get s3://bucket/checkpoints/ ./local/ # Download all files under prefix

Options:

  • -m, --message — Annotation for this download
  • --hash — Expected BLAKE3 hash (for verification)
  • --tag — Create a git tag for this download
  • --force — Overwrite existing files
  • --dry-run — Preview without downloading

Downloads are registered locally as source nodes in the DAG (outputs only, no inputs). They appear in GLaaS when downstream jobs are registered via roar put or roar register.

roar reset

Start a fresh session. Previous session data is preserved in the database.

roar reset                # Reset with confirmation prompt
roar reset -y             # Reset without confirmation

roar show

Show session, job, or artifact details.

roar show                          # Show active session overview
roar show @1                       # Show details for step 1
roar show @B1                      # Show details for build step 1
roar show a1b2c3d4                 # Show job by UID
roar show ./output/model.pkl       # Show artifact by path

roar status

Show a summary of the active session, including the current DAG hash.

roar status

roar pop

Remove the most recent job from the active session. Useful for undoing a mistaken roar run or correcting the pipeline before registration.

roar pop              # Pop with confirmation prompt
roar pop -y           # Pop without confirmation

What it does:

  • Removes the last job from the session history
  • Deletes output artifacts created by that job (unless they're packages/system files)
  • Does not affect the original input files

Concepts

Artifacts

Data files tracked by their content hash (BLAKE3). The same file content always has the same hash, regardless of filename or location.
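
The property that identical content yields an identical hash, whatever the file is called, is easy to demonstrate. The sketch below uses BLAKE2b from Python's standard library purely as a stand-in, because BLAKE3 (which roar uses) is not in the standard library; `content_hash` is a hypothetical helper, and its digests will not match the hashes roar reports.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large files need not fit in memory."""
    digest = hashlib.blake2b(digest_size=32)  # stand-in; roar uses BLAKE3
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "data.csv"
    second = Path(tmp) / "nested" / "renamed.csv"
    second.parent.mkdir()
    first.write_bytes(b"a,b\n1,2\n")
    second.write_bytes(b"a,b\n1,2\n")
    # Same bytes, different names and locations: the hashes agree.
    print(content_hash(first) == content_hash(second))  # True
```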

Jobs

Recorded executions that consume input artifacts and produce output artifacts. Each roar run creates a job record.

Collections

Named groups of artifacts, used for downloaded datasets or upload bundles.

Workflow Example

# Record your pipeline
roar run python preprocess.py
roar run python train.py --epochs 10
roar run python evaluate.py

# Later, reproduce an artifact
roar reproduce <model-hash> --run

Git Integration

Roar automatically captures git metadata:

  • Current commit hash
  • Branch name
  • Repository path

Data Storage

All data is stored locally in .roar/roar.db (SQLite). The database includes:

  • Artifact hashes and metadata
  • Job records with inputs/outputs
  • Hash cache for performance

Add .roar/ to your .gitignore (roar offers to do this during roar init).
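
Because the database is a plain SQLite file, it can be inspected with standard tooling. The sketch below opens a database read-only and lists its table names. The schema itself is not documented here, so the code makes no assumptions about which tables exist; `list_tables` is a hypothetical helper.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str = ".roar/roar.db") -> list[str]:
    """Return the table names in a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Only inspect the database if roar has been initialized in this directory.
if Path(".roar/roar.db").exists():
    print(list_tables())
```

Opening with mode=ro avoids accidental writes to the provenance database while exploring it.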

GLaaS Server

Roar can register sessions, jobs, steps, and artifacts with a GLaaS (Global Lineage-as-a-Service) server using the roar register command.

Server Setup

# Install with server dependencies
uv pip install -e ".[server]"
# or without uv
pip install -e ".[server]"

# Run the server
glaas-server

# Or with custom host/port
GLAAS_HOST=0.0.0.0 GLAAS_PORT=8080 glaas-server

The server provides:

  • REST API for artifact and job registration
  • Web UI at / with artifact and job browsers
  • Search and filtering by command, GPU, file type, etc.

Client Configuration

# Set the GLaaS server URL
roar config set glaas.url http://localhost:8000

# Show your SSH key (copy to GLaaS web UI)
roar auth register

# Test authentication
roar auth test

Tip: roar activity can be registered without authentication. Unauthenticated registrations are attributed to a public "anonymous" user and are not guaranteed to persist. For persistent attribution, we recommend setting up roar auth.

Development

Prerequisites

Setup

# Install dev dependencies
uv pip install -e ".[dev]"

Running Quality Checks

# Linting
ruff check .

# Format check
ruff format --check

# Type checking
mypy roar

# Run all checks at once
ruff check . && ruff format --check && mypy roar

Running Tests

# Run all tests (excluding those requiring a live GLaaS server)
pytest tests/ -v -m "not glaas and not live_glaas"

# Run with coverage
pytest tests/ -v --cov=roar --cov-report=term-missing -m "not glaas and not live_glaas"

# Run tests in parallel
pytest tests/ -v -n auto -m "not glaas and not live_glaas"

# Run only unit tests (fast)
pytest tests/ -v -m "not integration and not e2e and not glaas and not live_glaas"

License

Apache 2.0
