Reproducibility and provenance tracker for ML training pipelines

roar

Run Observation & Artifact Registration

roar tracks data artifacts and execution steps in ML pipelines, enabling reproducibility and lineage queries. Tracking is automatic: roar observes your commands as they run and captures the context it needs, so you never have to define a pipeline explicitly.

By identifying files based on their actual content rather than their names, it ensures you can always trace a result back to the exact inputs and code that produced it. This gives you reliable reproducibility and a clear history of your artifacts, all derived naturally from your workflow.

While roar captures your work locally, connecting it to a GLaaS (Global Lineage-as-a-Service) server like glaas.ai allows you to publish your lineage graphs to a shared global registry for easy visualization and collaboration. Now your team can search for any artifact by its hash to see exactly how it was made and generate the precise commands needed to reproduce it on another machine.

Installation

pip install roar-cli
# or with uv
uv pip install roar-cli

Requires Python 3.10+.

Platform Support

Platform | Status
Linux x86_64 | ✅ Full support
Linux aarch64 | ✅ Full support
macOS | 🚧 Experimental (limitations)
Windows | Coming soon

PyPI wheels are published for Linux and macOS (x86_64 and arm64).

Development Installation

# Clone the repository
git clone https://github.com/treqs/roar.git
cd roar

# Install in development mode
uv pip install -e ".[dev]"
# or without uv
pip install -e ".[dev]"

Quick Start

# Initialize roar in your project
cd my-ml-project
roar init

# Run commands with provenance tracking
roar run python preprocess.py --input data.csv --output features.parquet
roar run python train.py --data features.parquet --output model.pt
roar run python evaluate.py --model model.pt --output metrics.json

Tracer Backends

roar run relies on a Rust "tracer" binary to observe file I/O. If you see an error like "No tracer binary found", build one of the backends below.

Backends

Backend | Binary | Platforms | Notes
eBPF | roar-tracer-ebpf | Linux | Fastest, but requires permissions and kernel support.
preload | roar-tracer-preload + libroar_tracer_preload | macOS, Linux | Uses DYLD_INSERT_LIBRARIES (macOS) or LD_PRELOAD (Linux). Not compatible with processes that ignore preload env vars (e.g., SIP/hardened runtime on macOS), or fully-static binaries (common with Go).
ptrace | roar-tracer | Linux | Slowest, broadest compatibility on Linux.

Building

cd rust

# eBPF (Linux)
cargo build --release -p roar-tracer-ebpf

# preload (macOS & Linux)
cargo build --release -p roar-tracer-preload

# ptrace (Linux)
cargo build --release -p roar-tracer

Selecting A Backend

By default, roar uses auto mode: prefer eBPF, then preload, then ptrace.

# Show what roar can currently find and whether it looks usable
roar tracer status

# Set a default backend (auto|ebpf|preload|ptrace)
roar tracer set-default preload
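
The auto-mode preference order can be sketched in Python. This only illustrates the documented order (eBPF, then preload, then ptrace); the function name pick_backend and the idea of passing in a set of available backends are assumptions made for the example, not roar's actual code.

```python
# Preference order described for "auto" mode.
AUTO_ORDER = ("ebpf", "preload", "ptrace")


def pick_backend(available, default="auto"):
    """Return the backend to use, or None if nothing usable was found.

    `available` is a hypothetical set of backend names whose binaries exist.
    """
    if default != "auto":
        # An explicit default is used only if that backend is available.
        return default if default in available else None
    for name in AUTO_ORDER:
        if name in available:
            return name
    return None


print(pick_backend({"ptrace", "preload"}))  # preload
```

With only preload and ptrace built, the sketch picks preload because it comes first in the preference order.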

macOS Tracing Limitations

On macOS, roar uses the preload backend (DYLD_INSERT_LIBRARIES). macOS System Integrity Protection (SIP) silently blocks library injection for Apple-signed platform binaries — anything under /usr/bin/, /bin/, /sbin/, or /System/. When this happens, roar run will complete successfully but capture no file I/O events.

Affected: /usr/bin/python3, /bin/sh, /usr/bin/ruby, and all other Apple-shipped binaries.

Workaround: Use non-Apple builds of your tools:

# Homebrew
brew install python3
roar run python3 train.py          # Uses /opt/homebrew/bin/python3 — works

# conda / pyenv / nix also work
roar run ~/.pyenv/shims/python train.py

# This will NOT capture file events (SIP blocks it):
roar run /usr/bin/python3 train.py

roar prints a warning when it detects no events were captured from a SIP-protected binary.
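
To check ahead of time whether a command is likely to be affected, a small Python sketch like the one below can help. It assumes the four prefixes listed above are the ones that matter; the helper names is_sip_protected and check_command are hypothetical and not part of roar.

```python
import os
import shutil

# Directories named above as holding Apple-signed platform binaries.
SIP_PREFIXES = ("/usr/bin/", "/bin/", "/sbin/", "/System/")


def is_sip_protected(path):
    """Return True if an absolute path sits under one of the listed prefixes."""
    return path.startswith(SIP_PREFIXES)


def check_command(name):
    """Resolve `name` on PATH and report whether it looks SIP-protected.

    Returns None when the command cannot be found.
    """
    found = shutil.which(name)
    if found is None:
        return None
    # Follow symlinks so a link into /usr/bin is also detected.
    return is_sip_protected(os.path.realpath(found))
```

For example, is_sip_protected("/usr/bin/python3") is true, while is_sip_protected("/opt/homebrew/bin/python3") is false.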

Commands

roar init

Initialize roar in the current directory. Creates a .roar/ directory to store the local database and a config.toml with default settings.

roar init           # Initialize, prompt for gitignore
roar init -y        # Initialize and auto-add to gitignore
roar init -n        # Initialize without modifying gitignore

roar run <command>

Run a command with provenance tracking. Roar captures:

  • Files read and written
  • Git commit and branch
  • Execution time and exit code
  • Command arguments

roar run python train.py --epochs 10 --lr 0.001
roar run ./scripts/preprocess.sh
roar run torchrun --nproc_per_node=4 train.py

# Re-run a previous DAG step
roar run @2                    # Re-run DAG node 2
roar run @2 --epochs=10        # Re-run with parameter override

roar reproduce <hash>

Reproduce an artifact by tracing its lineage.

# Show the reproduction plan (preview)
roar reproduce abc123de

# Run full reproduction
roar reproduce abc123de --run

# Run without prompts
roar reproduce abc123de --run -y

# Include system packages during setup
roar reproduce abc123de --run --package-sync

# Show all required packages (no truncation)
roar reproduce abc123de --list-requirements

Full reproduction clones the git repository, creates a virtual environment, installs recorded packages, and runs the pipeline steps.
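
The lineage-tracing idea behind a reproduction plan can be sketched in Python. This is a simplified model under stated assumptions: the record layout (dicts with "command", "inputs", and "outputs" keys), the placeholder hashes, and the function name reproduction_plan are all hypothetical, and the sketch ignores the git, virtual environment, and package steps.

```python
def reproduction_plan(jobs, target_hash):
    """Return the commands needed to rebuild `target_hash`, in run order.

    `jobs` is a list of job records in the order they were recorded.
    """
    needed = set()
    stack = [(target_hash, len(jobs))]
    while stack:
        artifact, before = stack.pop()
        # Most recent job, recorded before the consumer, that wrote this hash.
        index = next(
            (i for i in range(before - 1, -1, -1)
             if artifact in jobs[i]["outputs"]),
            None,
        )
        if index is None or index in needed:
            continue  # a source artifact, or a job already in the plan
        needed.add(index)
        stack.extend((inp, index) for inp in jobs[index]["inputs"])
    return [jobs[i]["command"] for i in sorted(needed)]


jobs = [
    {"command": "python preprocess.py", "inputs": ["h-data"], "outputs": ["h-features"]},
    {"command": "python train.py", "inputs": ["h-features"], "outputs": ["h-model"]},
    {"command": "python evaluate.py", "inputs": ["h-model"], "outputs": ["h-metrics"]},
]
print(reproduction_plan(jobs, "h-model"))
# ['python preprocess.py', 'python train.py']
```

The evaluation step is left out of the plan because the model does not depend on it.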

roar build <command>

Run a build step with provenance tracking. Build steps run before pipeline steps during reproduction.

# Compile native extensions
roar build maturin develop --release
roar build make -j4

# Install local packages
roar build pip install -e .

Use for setup that should run before the main pipeline (compiling, installing).

roar auth

Manage GLaaS authentication.

roar auth register    # Show SSH public key for registration
roar auth test        # Test connection to GLaaS server
roar auth status      # Show current auth status

To register with GLaaS:

  1. Run roar auth register to display your public key
  2. Sign up at https://glaas.ai and paste your public key there
  3. Run roar auth test to verify

roar config

View or set configuration options.

roar config list
roar config get <key>
roar config set <key> <value>

Run roar config list to see all available options with descriptions. Common options:

Key | Default | Description
output.track_repo_files | false | Include repo files in provenance
output.quiet | false | Suppress written files report
filters.ignore_system_reads | true | Ignore /sys, /etc, /sbin reads
filters.ignore_package_reads | true | Ignore installed package reads
filters.ignore_torch_cache | true | Ignore torch/triton cache
filters.ignore_tmp_files | true | Ignore /tmp files
glaas.url | https://api.glaas.ai | GLaaS server URL
glaas.web_url | https://glaas.ai | GLaaS web UI URL
registration.omit.enabled | true | Enable secret filtering
hash.primary | blake3 | Primary hash algorithm
logging.level | warning | Log level (debug, info, warning, error)
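
The dotted keys read naturally as paths into nested sections. The Python sketch below shows one way such keys could be resolved; it assumes the settings are held as nested dictionaries, which is a guess about the layout and not a description of how roar stores config.toml. The names get_option, set_option, and defaults are hypothetical.

```python
def get_option(config, dotted_key):
    """Look up a key such as "filters.ignore_tmp_files" in a nested dict."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]  # raises KeyError for an unknown option
    return node


def set_option(config, dotted_key, value):
    """Set a dotted key, creating intermediate sections as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


# A few defaults from the table above, arranged as nested sections.
defaults = {
    "hash": {"primary": "blake3"},
    "logging": {"level": "warning"},
    "registration": {"omit": {"enabled": True}},
}

print(get_option(defaults, "hash.primary"))  # blake3
```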

roar dag

Display the pipeline DAG for the current session.

roar dag                  # Compact view with colors
roar dag --expanded       # Show all executions including reruns
roar dag --json           # Machine-readable JSON output
roar dag --show-artifacts # Show intermediate artifacts
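
The JSON output can be consumed from a script. The Python sketch below only wraps the documented roar dag --json command and parses whatever it prints; it makes no assumption about the JSON schema. The function name load_dag and the injectable runner parameter are choices made for this example.

```python
import json
import subprocess


def load_dag(runner=subprocess.run):
    """Run `roar dag --json` and return the parsed result.

    `runner` defaults to subprocess.run; it can be swapped out in tests.
    """
    result = runner(
        ["roar", "dag", "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise if roar exits with a non-zero status
    )
    return json.loads(result.stdout)
```

Calling load_dag() from inside an initialized project returns the parsed structure, which can then be inspected with ordinary dictionary and list operations.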

roar env

Manage persistent environment variables injected into roar run and roar build.

roar env set FOO bar      # Set FOO=bar
roar env get FOO          # Print value of FOO
roar env list             # List all env vars
roar env unset FOO        # Remove FOO
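
Injecting stored variables into a child process can be sketched in Python as below. The sketch assumes stored values take precedence over variables inherited from the shell, which the text above does not state; run_with_env is a hypothetical helper, not roar's implementation.

```python
import os
import subprocess
import sys


def run_with_env(command, stored_env):
    """Run `command` with `stored_env` layered over the current environment."""
    # Assumption: stored values override inherited ones with the same name.
    env = {**os.environ, **stored_env}
    return subprocess.run(command, env=env, capture_output=True, text=True)


result = run_with_env(
    [sys.executable, "-c", "import os; print(os.environ['FOO'])"],
    {"FOO": "bar"},
)
print(result.stdout.strip())  # bar
```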

roar log

Display recent job execution history.

roar log                  # Show recent job history

roar register

Register artifact lineage with GLaaS.

roar register model.pt              # Register model lineage
roar register --dry-run model.pt    # Preview without registering
roar register -y model.pt           # Skip confirmation prompt

roar put

Upload artifacts to cloud storage and register lineage with GLaaS.

roar put model.pt s3://bucket/models/ -m "Final model"
roar put ./checkpoints/ gs://bucket/run-42/ -m "All checkpoints"
roar put @2 s3://bucket/outputs/ -m "Step 2 outputs"

Options:

  • -m, --message — Description of the upload (required)
  • --dry-run — Preview without uploading
  • --no-tag — Skip git tagging

Source formats:

  • File path: model.pt, ./data/output.csv
  • Directory: ./checkpoints/ (uploads all files recursively)
  • Job reference: @2 (uploads outputs from step 2)
  • No source: uploads all outputs from the current session

roar get

Download artifacts from cloud storage.

roar get s3://bucket/models/model.pt ./local/
roar get gs://bucket/data/train.csv
roar get https://example.com/weights.pt --hash abc123...
roar get s3://bucket/checkpoints/ ./local/ # Download all files under prefix

Options:

  • -m, --message — Annotation for this download
  • --hash — Expected BLAKE3 hash (for verification)
  • --tag — Create a git tag for this download
  • --force — Overwrite existing files
  • --dry-run — Preview without downloading

Downloads are registered locally as source nodes in the DAG (outputs only, no inputs). They appear in GLaaS when downstream jobs are registered via roar put or roar register.

roar reset

Start a fresh session. Previous session data is preserved in the database.

roar reset                # Reset with confirmation prompt
roar reset -y             # Reset without confirmation

roar show

Show session, job, or artifact details.

roar show                          # Show active session overview
roar show @1                       # Show details for step 1
roar show @B1                      # Show details for build step 1
roar show a1b2c3d4                 # Show job by UID
roar show ./output/model.pkl       # Show artifact by path
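
The step references used above (@1 for a pipeline step, @B1 for a build step) follow a simple pattern. The Python sketch below shows one way to tell them apart from job UIDs and paths; the function name parse_step_ref and the "run"/"build" labels are made up for illustration, and roar's own parser may accept other forms.

```python
import re

# "@" followed by an optional "B" (build step) and a step number.
_STEP_REF = re.compile(r"@(B?)(\d+)")


def parse_step_ref(text):
    """Return ("build" or "run", step_number), or None if not a step reference."""
    match = _STEP_REF.fullmatch(text)
    if match is None:
        return None
    kind = "build" if match.group(1) else "run"
    return kind, int(match.group(2))


print(parse_step_ref("@2"))        # ('run', 2)
print(parse_step_ref("@B1"))       # ('build', 1)
print(parse_step_ref("a1b2c3d4"))  # None
```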

roar status

Show a summary of the active session.

roar status

roar pop

Remove the most recent job from the active session. Useful for undoing a mistaken roar run or correcting the pipeline before registration.

roar pop              # Pop with confirmation prompt
roar pop -y           # Pop without confirmation

What it does:

  • Removes the last job from the session history
  • Deletes output artifacts created by that job (unless they're packages/system files)
  • Does not affect the original input files

Concepts

Artifacts

Data files tracked by their content hash (BLAKE3). The same file content always has the same hash, regardless of filename or location.
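
Content addressing can be demonstrated with a short Python sketch. BLAKE3 is not in Python's standard library, so this example uses hashlib.blake2b as a stand-in; the digests will not match roar's, but the property shown is the same. The function name content_hash is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, chunk_size=1 << 20):
    """Hash a file's bytes in chunks; the filename never enters the digest."""
    digest = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "model.pt")
    copy = Path(tmp, "renamed-copy.bin")
    changed = Path(tmp, "model-v2.pt")
    first.write_bytes(b"weights")
    copy.write_bytes(b"weights")
    changed.write_bytes(b"weights!")
    same = content_hash(first) == content_hash(copy)
    different = content_hash(first) != content_hash(changed)

print(same, different)  # True True
```

Two files with identical bytes hash the same even under different names, while a one-byte change produces a different hash.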

Jobs

Recorded executions that consume input artifacts and produce output artifacts. Each roar run creates a job record.

Collections

Named groups of artifacts, used for downloaded datasets or upload bundles.

Workflow Example

# Record your pipeline
roar run python preprocess.py
roar run python train.py --epochs 10
roar run python evaluate.py

# Later, reproduce an artifact
roar reproduce <model-hash> --run

Git Integration

Roar automatically captures git metadata:

  • Current commit hash
  • Branch name
  • Repository path

Data Storage

All data is stored locally in .roar/roar.db (SQLite). The database includes:

  • Artifact hashes and metadata
  • Job records with inputs/outputs
  • Hash cache for performance

Add .roar/ to your .gitignore (roar offers to do this during roar init).
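
A hash cache of the kind listed above can be sketched with Python's sqlite3 module. Everything specific here is an assumption: the table name and columns, the choice to invalidate on size or modification time, and the use of hashlib.blake2b as a stand-in for BLAKE3. It does not describe the actual schema of .roar/roar.db.

```python
import hashlib
import os
import sqlite3


def open_cache(db_path=":memory:"):
    """Open (or create) a database holding a hypothetical hash_cache table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hash_cache ("
        " path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, digest TEXT)"
    )
    return conn


def cached_hash(conn, path):
    """Return (digest, was_cached); rehash only when size or mtime changed."""
    key = os.path.abspath(path)
    stat = os.stat(key)
    row = conn.execute(
        "SELECT digest FROM hash_cache"
        " WHERE path = ? AND size = ? AND mtime_ns = ?",
        (key, stat.st_size, stat.st_mtime_ns),
    ).fetchone()
    if row:
        return row[0], True
    with open(key, "rb") as handle:
        digest = hashlib.blake2b(handle.read(), digest_size=32).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO hash_cache VALUES (?, ?, ?, ?)",
        (key, stat.st_size, stat.st_mtime_ns, digest),
    )
    conn.commit()
    return digest, False
```

The first call for a file computes and stores its digest; later calls return the stored value until the file's size or modification time changes.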

GLaaS Server

Roar can register artifacts and jobs with a GLaaS (Global Lineage-as-a-Service) server using the roar register command.

Server Setup

# Install with server dependencies
uv pip install -e ".[server]"
# or without uv
pip install -e ".[server]"

# Run the server
glaas-server

# Or with custom host/port
GLAAS_HOST=0.0.0.0 GLAAS_PORT=8080 glaas-server

The server provides:

  • REST API for artifact and job registration
  • Web UI at / with artifact and job browsers
  • Search and filtering by command, GPU, file type, etc.

Client Configuration

# Set the GLaaS server URL
roar config set glaas.url http://localhost:8000

# Show your SSH key (copy to GLaaS web UI)
roar auth register

# Test authentication
roar auth test

Development

Prerequisites

Setup

# Install dev dependencies
uv pip install -e ".[dev]"

Running Quality Checks

# Linting
ruff check .

# Format check
ruff format --check

# Type checking
mypy roar

# Run all checks at once
ruff check . && ruff format --check && mypy roar

Running Tests

# Run all tests (excluding those requiring a live GLaaS server)
pytest tests/ -v -m "not glaas and not live_glaas"

# Run with coverage
pytest tests/ -v --cov=roar --cov-report=term-missing -m "not glaas and not live_glaas"

# Run tests in parallel
pytest tests/ -v -n auto -m "not glaas and not live_glaas"

# Run only unit tests (fast)
pytest tests/ -v -m "not integration and not e2e and not glaas and not live_glaas"

License

Apache 2.0
