Skip to main content

A command-line tool for distributed parallel execution across multiple GPUs

Project description

๐Ÿ™ OctoRun

Distributed Parallel Execution Made Simple

Run Python scripts across multiple GPUs with intelligent chunk management, failure recovery, and live job monitoring

PyPI version Python CUDA License


Overview

OctoRun dispatches one Python worker process per GPU, divides your dataset into chunks, and coordinates work across multiple machines through shared lock files. Each worker independently claims a chunk, processes it, and writes its output atomically โ€” making it safe to run OctoRun on many nodes simultaneously without any central coordinator.

Key Features

  • Automatic GPU Detection โ€” detects and allocates all available GPUs
  • Chunk-Based Work Distribution โ€” divides keys by stride, not by file, so any total_chunks value works regardless of input sharding
  • Failure Recovery โ€” stale locks (heartbeat timeout 5 min) are automatically reclaimed; configurable retries
  • Multi-Machine Safe โ€” multiple OctoRun instances on different nodes share the same lock directory on a shared filesystem
  • Live Job Monitoring โ€” octorun status shows active sessions, completed chunks, and stale locks at a glance

Installation

# Via uv (recommended)
uv tool install octorun

# In a project venv
uv add octorun

# Via pip
pip install octorun

Optional extras:

# GPU benchmark tooling (requires PyTorch)
pip install "octorun[benchmark]"

Quick Start

# 1. Generate a default config
octorun save_config --script ./your_script.py

# 2. Run across all available GPUs
octorun run --config config.json

# 3. Monitor progress from any machine
octorun status ./logs

Commands

run (r)

Launch workers across all available GPUs.

octorun run --config config.json
octorun run --config config.json --kwargs '{"batch_size": 64}'

CLI --kwargs override any kwargs values from the config file.


status (st)

Show a live summary of a running or completed job.

octorun status <log_dir>
octorun status <log_dir> --alive-threshold 120

log_dir is the same directory you set as log_dir in your config. It contains the *_session_*.log files and the locks/ subdirectory.

Example output:

OctoRun Job Status โ€” ./logs/stage3
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Locks total  : 164
  Completed    : 29
  Active       : 48  (6 sessions)
  Stale locks  : 87  (dead workers, will be auto-reclaimed)
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

Active sessions (6):
  node1               8 chunks  [25, 26, 28, 32, ...]  (heartbeat 43s ago)
  node2               8 chunks  [16, 18, 20, 21, ...]  (heartbeat 47s ago)
  ...

Dead sessions โ€” stale locks (4 node(s)):
  node3               8 chunks  [161, 162, ...]  (last seen 1h31m ago)
  ...
Field Meaning
Locks total Lock files currently on disk (active + stale)
Completed Chunks that finished successfully (persistent .completed markers)
Active Chunks held by workers with a live heartbeat
Stale locks Locks from crashed/killed workers โ€” reclaimed automatically when the next chunk finishes on any active node

--alive-threshold (default 300 s) sets how long a session can be silent before it is considered dead.


save_config (s)

Write a default config.json to the current directory.

octorun save_config --script ./your_script.py

list_gpus (l)

List available GPUs.

octorun list_gpus
octorun list_gpus --detailed

benchmark (b)

Continuously measure GPU TFLOPs and communication bandwidth.

octorun benchmark
octorun benchmark --gpus 0,1,2 --test-duration 10 --interval 30

Requires octorun[benchmark].

Configuration

Generate a starter config with octorun save_config, then edit as needed:

{
    "script_path": "your_script.py",
    "gpus": "auto",
    "total_chunks": 1000,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": true,
    "max_retries": 2,
    "kwargs": {
        "input_dir": "/data/input",
        "output_dir": "/data/output",
        "batch_size": 32
    }
}
Option Description Default
script_path Path to your Python worker script โ€”
gpus "auto" or a list of GPU IDs "auto"
total_chunks Total number of chunks to divide work into 128
log_dir Directory for session and chunk logs "./logs"
chunk_lock_dir Directory for lock files "./logs/locks"
monitor_interval Seconds between process-status checks 60
restart_failed Retry failed chunks false
max_retries Maximum retries per chunk 3
success_codes Worker exit codes treated as success [0]
kwargs Extra arguments forwarded to your script {}

success_codes

By default a chunk is considered complete only when the worker exits with code 0. Add additional codes here when your worker has a known-benign non-zero exit. For example, if the worker calls sys.exit(0) but the CPython interpreter shutdown phase fails (often 120 due to atexit / stdout-flush errors on slow networked filesystems), the chunk's data is already written and you can safely treat 120 as success:

"success_codes": [0, 120]

Writing a Worker Script

--gpu_id is a local rank, not a device index

OctoRun does not set CUDA_VISIBLE_DEVICES or any other GPU environment variable. --gpu_id is simply the index of this worker among the workers launched on the current machine โ€” i.e. a local_rank. If you launch with gpus: [0, 1, 2, 3], four workers start with --gpu_id 0, 1, 2, 3 respectively, but OctoRun leaves device selection entirely to your script.

Common patterns:

# Pattern A โ€” direct device index (works when no other processes use the GPUs)
torch.cuda.set_device(args.gpu_id)

# Pattern B โ€” set CUDA_VISIBLE_DEVICES before any CUDA init so the process
# only sees one device and always addresses it as cuda:0.
# Do this at the very top of the script, before importing torch/transformers.
import os, argparse
_p = argparse.ArgumentParser(add_help=False)
_p.add_argument("--gpu_id", type=int, default=0)
_early, _ = _p.parse_known_args()
os.environ["CUDA_VISIBLE_DEVICES"] = str(_early.gpu_id)
# Now import torch โ€” it will only see one GPU
import torch
model = model.to("cuda:0")

Pattern B is strongly recommended when loading large models (transformers, etc.) because many libraries initialize CUDA at import time and will claim the wrong device if CUDA_VISIBLE_DEVICES is not set beforehand.

Minimal script template

Your script must accept three OctoRun arguments plus any custom ones:

import argparse

def parse_args():
    p = argparse.ArgumentParser()
    # Required by OctoRun โ€” gpu_id is a local_rank, not a device index
    p.add_argument("--gpu_id",       type=int, required=True)
    p.add_argument("--chunk_id",     type=int, required=True)
    p.add_argument("--total_chunks", type=int, required=True)
    # Your own arguments (forwarded via kwargs)
    p.add_argument("--input_dir",    type=str, required=True)
    p.add_argument("--output_dir",   type=str, required=True)
    p.add_argument("--batch_size",   type=int, default=32)
    return p.parse_args()

def main():
    args = parse_args()

    # Shard your dataset by stride
    all_keys = sorted(load_all_keys(args.input_dir))
    my_keys  = all_keys[args.chunk_id::args.total_chunks]

    # Skip if already done (idempotent)
    output_path = get_output_path(args.output_dir, args.chunk_id, args.total_chunks)
    if output_path.exists():
        return

    # Process and write atomically
    results = process(my_keys, gpu=args.gpu_id, batch_size=args.batch_size)
    write_atomic(results, output_path)

if __name__ == "__main__":
    main()

Multi-Machine Usage

Run OctoRun on each machine, pointing at the same shared log_dir and chunk_lock_dir:

# machine A
octorun run --config config.json   # claims chunks 0, 1, 2, ...

# machine B (simultaneously)
octorun run --config config.json   # claims the next available chunks

Lock files on the shared filesystem prevent any chunk from being processed twice. Monitor all machines from anywhere:

octorun status ./logs

Shared filesystem requirements

chunk_lock_dir relies on O_CREAT | O_EXCL having atomic, cross-client semantics. Use a POSIX-compliant filesystem (local disk, NFS, CephFS, etc.).

Do not place chunk_lock_dir on object-storage-backed filesystems such as HDFS-fuse or s3fs โ€” they typically fall back to non-atomic "stat-then-create" sequences and have client-side metadata caching, both of which let two workers acquire the same lock. Output data can still live on HDFS / S3, but locking requires a real filesystem. If you must run on such storage, ensure your worker writes outputs idempotently (tmp + rename) and treat the lock as best-effort rather than a correctness guarantee.

Contributing

Fork the repository, create a feature branch, and open a pull request.

License

MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

octorun-1.0.1.tar.gz (49.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

octorun-1.0.1-py3-none-any.whl (25.4 kB view details)

Uploaded Python 3

File details

Details for the file octorun-1.0.1.tar.gz.

File metadata

  • Download URL: octorun-1.0.1.tar.gz
  • Upload date:
  • Size: 49.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for octorun-1.0.1.tar.gz
Algorithm Hash digest
SHA256 a4fc0d6ff2947ab976b552a485d276ccf6ed09ce330aab3ea76f64b99984879a
MD5 26c2d9966e0d32bd5597200ca0dafbdb
BLAKE2b-256 b78fcb725f3b63d61a223a1cd86f69286a875794b7c93450eed8d5aa5141923f

See more details on using hashes here.

Provenance

The following attestation bundles were made for octorun-1.0.1.tar.gz:

Publisher: publish.yml on HarborYuan/OctoRun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file octorun-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: octorun-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 25.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for octorun-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 cc1546a28c172e3eeb90f5da48370bbf6072d94d0d13b6643fc88482eb909267
MD5 d9d67dab2160ce12670f237a1581b868
BLAKE2b-256 a19a5af262ed440a8bbac014a32fd3e53fdfbee0ccc7f39c137ee51052336194

See more details on using hashes here.

Provenance

The following attestation bundles were made for octorun-1.0.1-py3-none-any.whl:

Publisher: publish.yml on HarborYuan/OctoRun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page