Reproducibility and provenance tracker for ML training pipelines

roar

Run Observation & Artifact Registration

A local front-end to TReqs' Graph Lineage-as-a-Service (GLaaS). Roar tracks data artifacts and execution steps in ML pipelines, enabling reproducibility and lineage queries.

Installation

pip install roar-cli
# or with uv
uv pip install roar-cli

Requires Python 3.10+ and Linux (x86_64 or aarch64) for full functionality.

Platform Support

Platform        roar run        Other commands
Linux x86_64    Full support    Full support
Linux aarch64   Full support    Full support
macOS           Not supported   Full support
Windows         Not supported   Full support

The roar run command uses a native tracer binary that requires Linux. Other commands work on all platforms.

Development Installation

# Clone the repository
git clone https://github.com/treqs/roar.git
cd roar

# Install in development mode (automatically builds tracer if Rust is installed)
uv pip install -e ".[dev]"
# or without uv
pip install -e ".[dev]"

Quick Start

# Initialize roar in your project
cd my-ml-project
roar init

# Run commands with provenance tracking
roar run python preprocess.py --input data.csv --output features.parquet
roar run python train.py --data features.parquet --output model.pt
roar run python evaluate.py --model model.pt --output metrics.json

Commands

roar init

Initialize roar in the current directory. Creates a .roar/ directory to store the local database.

roar init

roar run <command>

Run a command with provenance tracking. Roar captures:

  • Files read and written
  • Git commit and branch
  • Execution time and exit code
  • Command arguments

roar run python train.py --epochs 10 --lr 0.001
roar run ./scripts/preprocess.sh
roar run torchrun --nproc_per_node=4 train.py
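
To make the last two items in the list above concrete, here is a minimal Python sketch of wrapping a command and recording its arguments, exit code, and wall-clock time. It is an illustration only, not roar's implementation: RunRecord and run_and_record are hypothetical names, and the sketch does not observe file reads and writes, which roar does through its native tracer.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical record of one wrapped execution (not roar's schema)."""
    argv: list[str]
    exit_code: int
    duration_s: float


def run_and_record(argv: list[str]) -> RunRecord:
    """Run argv as a child process; record its arguments, exit code, and duration."""
    start = time.monotonic()
    # check=False: a failing command is still recorded rather than raising.
    completed = subprocess.run(argv, check=False)
    return RunRecord(
        argv=list(argv),
        exit_code=completed.returncode,
        duration_s=time.monotonic() - start,
    )


record = run_and_record([sys.executable, "-c", "print('training...')"])
print(record.exit_code, round(record.duration_s, 3))
```

A non-zero exit code is stored instead of raised, so a failed step would still leave a record behind.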

roar reproduce <hash>

Reproduce an artifact by tracing its lineage.

# Show the reproduction plan
roar reproduce abc123de

# Run reproduction immediately
roar reproduce abc123de --run

# Run without prompts
roar reproduce abc123de --run -y
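
One way to picture "tracing lineage" is a backward walk over recorded jobs, from the target artifact to the data it was derived from. The Python sketch below assumes a simplified job record (a command plus input and output hashes). Job and reproduction_plan are hypothetical names, and the plan roar actually builds may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record: a command plus the artifact hashes it read and wrote."""
    command: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def reproduction_plan(target: str, jobs: list[Job]) -> list[Job]:
    """Return the jobs needed to rebuild `target`, producers before consumers."""
    producer = {out: job for job in jobs for out in job.outputs}
    plan: list[Job] = []
    seen: set[Job] = set()

    def visit(artifact: str) -> None:
        job = producer.get(artifact)
        if job is None or job in seen:
            return  # source data with no recorded producer, or already planned
        seen.add(job)
        for dep in job.inputs:
            visit(dep)  # schedule upstream jobs first
        plan.append(job)

    visit(target)
    return plan


jobs = [
    Job("python preprocess.py", ("raw",), ("feat",)),
    Job("python train.py", ("feat",), ("model",)),
    Job("python evaluate.py", ("model",), ("metrics",)),
]
print([job.command for job in reproduction_plan("model", jobs)])
# prints ['python preprocess.py', 'python train.py']
```

The evaluate step is left out of the plan because the target artifact does not depend on it.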

roar build <command>

Run a build step with provenance tracking. Build steps run before pipeline steps during reproduction.

# Compile native extensions
roar build maturin develop --release
roar build make -j4

# Install local packages
roar build pip install -e .

Use for setup that should run before the main pipeline (compiling, installing).
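
The ordering rule can be sketched as a stable sort that puts build steps ahead of pipeline steps while keeping the recorded order inside each group. The Step type and its "build"/"run" labels are assumptions made for this illustration; how roar stores and orders steps internally is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical recorded step."""
    kind: str      # "build" or "run" (assumed labels)
    order: int     # position in which the step was recorded
    command: str


def order_for_reproduction(steps: list[Step]) -> list[Step]:
    """Build steps first, then pipeline steps, each group in recorded order."""
    # False sorts before True, so kind == "build" comes first.
    return sorted(steps, key=lambda s: (s.kind != "build", s.order))


recorded = [
    Step("run", 1, "python preprocess.py"),
    Step("build", 2, "pip install -e ."),
    Step("run", 3, "python train.py"),
]
for step in order_for_reproduction(recorded):
    print(step.kind, step.command)
```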

roar auth

Manage GLaaS (Graph Lineage-as-a-Service) authentication.

roar auth register    # Register SSH key with GLaaS server
roar auth test        # Test authentication
roar auth status      # Show authentication status

roar config

View or set configuration options.

roar config list
roar config get <key>
roar config set <key> <value>

Available configuration options:

Key                            Default  Description
output.track_repo_files        false    Include repo files in provenance
output.quiet                   false    Suppress written files report
filters.ignore_system_reads    true     Ignore /sys, /etc reads
filters.ignore_package_reads   true     Ignore installed package reads
filters.ignore_torch_cache     true     Ignore torch/triton cache
filters.ignore_tmp_files       true     Ignore /tmp files
glaas.url                      (none)   GLaaS server URL
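
As a rough illustration of what the filters.* options do, the Python sketch below drops paths under ignored prefixes before they would be recorded. The prefix table and the is_ignored helper are hypothetical, and only the two filters whose paths are named in the table are modelled. Roar's real matching rules may be broader.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from config key to the path prefixes it suppresses.
IGNORED_PREFIXES = {
    "filters.ignore_system_reads": ("/sys", "/etc"),
    "filters.ignore_tmp_files": ("/tmp",),
}


def is_ignored(path: str, config: dict[str, bool]) -> bool:
    """Return True if `path` falls under a prefix whose filter is enabled."""
    p = PurePosixPath(path)
    for key, prefixes in IGNORED_PREFIXES.items():
        if not config.get(key, True):  # both filters default to true
            continue
        if any(p.is_relative_to(prefix) for prefix in prefixes):
            return True
    return False


config = {"filters.ignore_tmp_files": False}
for path in ["/etc/hosts", "/tmp/cache.bin", "/data/train.csv"]:
    print(path, is_ignored(path, config))
```

Matching on path components rather than raw string prefixes keeps a directory such as /tmpfoo from being treated as part of /tmp.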

Concepts

Artifacts

Data files tracked by their content hash (BLAKE3). The same file content always has the same hash, regardless of filename or location.
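
The property described above can be demonstrated in a few lines of Python. Note the substitution: BLAKE3 is not in the Python standard library, so this sketch uses hashlib.blake2b as a stand-in, and its digests will not match the hashes roar reports. content_hash is a hypothetical helper name.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks; the filename and location play no part."""
    hasher = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3, not the same algorithm
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "features.parquet"
    b = Path(tmp) / "copy_with_other_name.bin"
    a.write_bytes(b"same bytes")
    b.write_bytes(b"same bytes")
    print(content_hash(a) == content_hash(b))  # prints True
```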

Jobs

Recorded executions that consume input artifacts and produce output artifacts. Each roar run creates a job record.

Collections

Named groups of artifacts, used for downloaded datasets or upload bundles.

Workflow Example

# Record your pipeline
roar run python preprocess.py
roar run python train.py --epochs 10
roar run python evaluate.py

# Later, reproduce an artifact
roar reproduce <model-hash> --run

Git Integration

Roar automatically captures git metadata:

  • Current commit hash
  • Branch name
  • Repository path

Data Storage

All data is stored locally in .roar/roar.db (SQLite). The database includes:

  • Artifact hashes and metadata
  • Job records with inputs/outputs
  • Hash cache for performance

Add .roar/ to your .gitignore (roar offers to do this during roar init).

GLaaS Server

Roar can register artifacts and jobs with a GLaaS (Graph Lineage-as-a-Service) server using the roar register command.

Server Setup

# Install with server dependencies
uv pip install -e ".[server]"
# or without uv
pip install -e ".[server]"

# Run the server
glaas-server

# Or with custom host/port
GLAAS_HOST=0.0.0.0 GLAAS_PORT=8080 glaas-server

The server provides:

  • REST API for artifact and job registration
  • Web UI at / with artifact and job browsers
  • Search and filtering by command, GPU, file type, etc.

Client Configuration

# Set the GLaaS server URL
roar config set glaas.url http://localhost:8000

# Register your SSH key
roar auth register

# Test authentication
roar auth test

Development

Prerequisites

  • Python 3.10+
  • Rust toolchain (optional; the tracer is built only if Rust is installed)

Setup

# Install dev dependencies (automatically builds tracer if Rust is installed)
uv pip install -e ".[dev]"

Running Quality Checks

# Linting
ruff check .

# Format check
ruff format --check

# Type checking
mypy roar

# Run all checks at once
ruff check . && ruff format --check && mypy roar

Running Tests

# Run all tests (excluding those requiring a live GLaaS server)
pytest tests/ -v -m "not glaas and not live_glaas"

# Run with coverage
pytest tests/ -v --cov=roar --cov-report=term-missing -m "not glaas and not live_glaas"

# Run tests in parallel
pytest tests/ -v -n auto -m "not glaas and not live_glaas"

# Run only unit tests (fast)
pytest tests/ -v -m "not integration and not e2e and not glaas and not live_glaas"

License

Apache 2.0
