Skip to main content

A command-line tool for distributed parallel execution across multiple GPUs

Project description

๐Ÿ™ OctoRun

Distributed Parallel Execution Made Simple

Run Python scripts across multiple GPUs with intelligent chunk management, failure recovery, and live job monitoring

PyPI version Python CUDA License


Overview

OctoRun dispatches one Python worker process per GPU, divides your dataset into chunks, and coordinates work across multiple machines through shared lock files. Each worker independently claims a chunk, processes it, and writes its output atomically โ€” making it safe to run OctoRun on many nodes simultaneously without any central coordinator.

Key Features

  • Automatic GPU Detection โ€” detects and allocates all available GPUs
  • Chunk-Based Work Distribution โ€” divides keys by stride, not by file, so any total_chunks value works regardless of input sharding
  • Failure Recovery โ€” stale locks (heartbeat timeout 5 min) are automatically reclaimed; configurable retries
  • Multi-Machine Safe โ€” multiple OctoRun instances on different nodes share the same lock directory on a shared filesystem
  • Live Job Monitoring โ€” octorun status shows active sessions, completed chunks, and stale locks at a glance

Installation

# Via uv (recommended)
uv tool install octorun

# In a project venv
uv add octorun

# Via pip
pip install octorun

Optional extras:

# GPU benchmark tooling (requires PyTorch)
pip install "octorun[benchmark]"

Quick Start

# 1. Generate a default config
octorun save_config --script ./your_script.py

# 2. Run across all available GPUs
octorun run --config config.json

# 3. Monitor progress from any machine
octorun status ./logs

Commands

run (r)

Launch workers across all available GPUs.

octorun run --config config.json
octorun run --config config.json --kwargs '{"batch_size": 64}'

CLI --kwargs override any kwargs values from the config file.


status (st)

Show a live summary of a running or completed job.

octorun status <log_dir>
octorun status <log_dir> --alive-threshold 120

log_dir is the same directory you set as log_dir in your config. It contains the *_session_*.log files and the locks/ subdirectory.

Example output:

OctoRun Job Status โ€” ./logs/stage3
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  Locks total  : 164
  Completed    : 29
  Active       : 48  (6 sessions)
  Stale locks  : 87  (dead workers, will be auto-reclaimed)
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

Active sessions (6):
  node1               8 chunks  [25, 26, 28, 32, ...]  (heartbeat 43s ago)
  node2               8 chunks  [16, 18, 20, 21, ...]  (heartbeat 47s ago)
  ...

Dead sessions โ€” stale locks (4 node(s)):
  node3               8 chunks  [161, 162, ...]  (last seen 1h31m ago)
  ...
Field Meaning
Locks total Lock files currently on disk (active + stale)
Completed Chunks that finished successfully (persistent .completed markers)
Active Chunks held by workers with a live heartbeat
Stale locks Locks from crashed/killed workers โ€” reclaimed automatically when the next chunk finishes on any active node

--alive-threshold (default 300 s) sets how long a session can be silent before it is considered dead.


save_config (s)

Write a default config.json to the current directory.

octorun save_config --script ./your_script.py

list_gpus (l)

List available GPUs.

octorun list_gpus
octorun list_gpus --detailed

benchmark (b)

Continuously measure GPU TFLOPs and communication bandwidth.

octorun benchmark
octorun benchmark --gpus 0,1,2 --test-duration 10 --interval 30

Requires octorun[benchmark].

Configuration

Generate a starter config with octorun save_config, then edit as needed:

{
    "script_path": "your_script.py",
    "gpus": "auto",
    "total_chunks": 1000,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": true,
    "max_retries": 2,
    "kwargs": {
        "input_dir": "/data/input",
        "output_dir": "/data/output",
        "batch_size": 32
    }
}
Option Description Default
script_path Path to your Python worker script โ€”
gpus "auto" or a list of GPU IDs "auto"
total_chunks Total number of chunks to divide work into 128
log_dir Directory for session and chunk logs "./logs"
chunk_lock_dir Directory for lock files "./logs/locks"
monitor_interval Seconds between process-status checks 60
restart_failed Retry failed chunks false
max_retries Maximum retries per chunk 3
success_codes Worker exit codes treated as success [0]
kwargs Extra arguments forwarded to your script {}

success_codes

By default a chunk is considered complete only when the worker exits with code 0. Add additional codes here when your worker has a known-benign non-zero exit. For example, if the worker calls sys.exit(0) but the CPython interpreter shutdown phase fails (often 120 due to atexit / stdout-flush errors on slow networked filesystems), the chunk's data is already written and you can safely treat 120 as success:

"success_codes": [0, 120]

Writing a Worker Script

--gpu_id is a local rank, not a device index

OctoRun does not set CUDA_VISIBLE_DEVICES or any other GPU environment variable. --gpu_id is simply the index of this worker among the workers launched on the current machine โ€” i.e. a local_rank. If you launch with gpus: [0, 1, 2, 3], four workers start with --gpu_id 0, 1, 2, 3 respectively, but OctoRun leaves device selection entirely to your script.

Common patterns:

# Pattern A โ€” direct device index (works when no other processes use the GPUs)
torch.cuda.set_device(args.gpu_id)

# Pattern B โ€” set CUDA_VISIBLE_DEVICES before any CUDA init so the process
# only sees one device and always addresses it as cuda:0.
# Do this at the very top of the script, before importing torch/transformers.
import os, argparse
_p = argparse.ArgumentParser(add_help=False)
_p.add_argument("--gpu_id", type=int, default=0)
_early, _ = _p.parse_known_args()
os.environ["CUDA_VISIBLE_DEVICES"] = str(_early.gpu_id)
# Now import torch โ€” it will only see one GPU
import torch
model = model.to("cuda:0")

Pattern B is strongly recommended when loading large models (transformers, etc.) because many libraries initialize CUDA at import time and will claim the wrong device if CUDA_VISIBLE_DEVICES is not set beforehand.

Minimal script template

Your script must accept three OctoRun arguments plus any custom ones:

import argparse

def parse_args():
    p = argparse.ArgumentParser()
    # Required by OctoRun โ€” gpu_id is a local_rank, not a device index
    p.add_argument("--gpu_id",       type=int, required=True)
    p.add_argument("--chunk_id",     type=int, required=True)
    p.add_argument("--total_chunks", type=int, required=True)
    # Your own arguments (forwarded via kwargs)
    p.add_argument("--input_dir",    type=str, required=True)
    p.add_argument("--output_dir",   type=str, required=True)
    p.add_argument("--batch_size",   type=int, default=32)
    return p.parse_args()

def main():
    args = parse_args()

    # Shard your dataset by stride
    all_keys = sorted(load_all_keys(args.input_dir))
    my_keys  = all_keys[args.chunk_id::args.total_chunks]

    # Skip if already done (idempotent)
    output_path = get_output_path(args.output_dir, args.chunk_id, args.total_chunks)
    if output_path.exists():
        return

    # Process and write atomically
    results = process(my_keys, gpu=args.gpu_id, batch_size=args.batch_size)
    write_atomic(results, output_path)

if __name__ == "__main__":
    main()

Multi-Machine Usage

Run OctoRun on each machine, pointing at the same shared log_dir and chunk_lock_dir:

# machine A
octorun run --config config.json   # claims chunks 0, 1, 2, ...

# machine B (simultaneously)
octorun run --config config.json   # claims the next available chunks

Lock files on the shared filesystem prevent any chunk from being processed twice. Monitor all machines from anywhere:

octorun status ./logs

Shared filesystem requirements

chunk_lock_dir relies on O_CREAT | O_EXCL having atomic, cross-client semantics. Use a POSIX-compliant filesystem (local disk, NFS, CephFS, etc.).

Do not place chunk_lock_dir on object-storage-backed filesystems such as HDFS-fuse or s3fs โ€” they typically fall back to non-atomic "stat-then-create" sequences and have client-side metadata caching, both of which let two workers acquire the same lock. Output data can still live on HDFS / S3, but locking requires a real filesystem. If you must run on such storage, ensure your worker writes outputs idempotently (tmp + rename) and treat the lock as best-effort rather than a correctness guarantee.

Contributing

Fork the repository, create a feature branch, and open a pull request.

License

MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

octorun-1.0.2.tar.gz (49.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

octorun-1.0.2-py3-none-any.whl (25.7 kB view details)

Uploaded Python 3

File details

Details for the file octorun-1.0.2.tar.gz.

File metadata

  • Download URL: octorun-1.0.2.tar.gz
  • Upload date:
  • Size: 49.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for octorun-1.0.2.tar.gz
Algorithm Hash digest
SHA256 86ca4813260b30033948ed08adc849be58f174680034aa828121d3bdfa70f8bd
MD5 44f18174479518081cdcb6f653235fa7
BLAKE2b-256 5d69a18f6722cbea3c6cf92c303e4f974cd7175a11d9c48f395db267fef99b0b

See more details on using hashes here.

Provenance

The following attestation bundles were made for octorun-1.0.2.tar.gz:

Publisher: publish.yml on HarborYuan/OctoRun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file octorun-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: octorun-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 25.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for octorun-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 665455c43668e9a5e83abec3e5381509788a5c54919e55960d9ade8c6f4284d4
MD5 0e60ca4c492c7bdbd0d782e5732ff00d
BLAKE2b-256 91788ae45b1124406e221f7588b974ce0824fc7075a69fddf1441120d24baf43

See more details on using hashes here.

Provenance

The following attestation bundles were made for octorun-1.0.2-py3-none-any.whl:

Publisher: publish.yml on HarborYuan/OctoRun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page