
Zakuro - Distributed computing made simple


Zakuro is a context-aware distributed-ML runtime. You decorate a Python function, declare a pool of workers, and the framework routes each call to the worker with the lowest expected time-to-serve — learning from every dispatch, reacting to node failures and performance drift, and never making training wait on things that can be decoupled.

See PRD.md for the product vision and PLAN.md for the measured engineering progress (every number observed, nothing simulated).

Quick start

Run a function on a remote worker — three lines

import zakuro as zk

@zk.fn
def square(x: int) -> int:
    return x * x

# Run on a remote worker (QUIC transport, fast).
result = square.to(zk.Compute(uri="quic://worker:4433"))(7)  # → 49

No worker? Zakuro falls back to running the function in-process so your code keeps working:

# No `uri`, no `host` → standalone fallback; the call runs locally.
result = square.to(zk.Compute())(7)  # → 49

Adapt across a pool — Adam-style allocator

import zakuro as zk

workers = [zk.Worker.spawn(name=f"w{i}") for i in range(3)]
adaptive = zk.AdaptiveCompute(
    workers=[w.compute(verify=False) for w in workers],
    beta1=0.9,
    softmax_temperature=0.02,  # exploration; set 0 for greedy argmin
)
adaptive.warmup(rounds=3)                # auto-calibrate per-worker priors
adaptive.start_health_probes(interval=0.5, max_strikes=2)  # detect drop-outs

@zk.fn
def expensive(x): ...

# The allocator picks the worker with the lowest expected time-to-serve,
# tracks per-worker latency EMA + variance, soft-demotes drifted workers,
# suspends failed ones.
result = expensive.to(adaptive)(42)

Installation

From source (recommended while the 0.3 series is pre-release)

git clone https://github.com/zakuro-ai/zakuro
cd zakuro
uv sync --extra worker         # pulls FastAPI + uvicorn + aioquic for the worker CLI

From PyPI

# Core only (enough to be a client):
pip install zakuro-ai

# Core + worker (needed to run `zakuro-worker`):
pip install 'zakuro-ai[worker]'

Optional extras

| extra | adds |
|---|---|
| [worker] | FastAPI, uvicorn, aioquic, psutil — needed to run a worker |
| [ray] | Ray processor |
| [dask] | Dask processor |
| [spark] | PySpark processor |
| [dev] | pytest, ruff, mypy |

You do not need [worker] just to call into a running worker — import zakuro stays lean on purpose.

zc CLI (optional)

zc is the Rust broker + CLI at zakuro-ai/zc. It's useful for multi-node meshes but not required for a single-process or small-cluster setup.

# macOS / Linux one-liner — installs to /usr/local/bin
curl -sSL https://raw.githubusercontent.com/zakuro-ai/zc/master/scripts/install.sh | bash

Concepts

Compute

A target for remote execution.

| How it's constructed | Where the call runs |
|---|---|
| zk.Compute() | Standalone — in-process fallback. No network, no workers required. |
| zk.Compute(uri="quic://host:4433") | QUIC worker. Probed at construction; raises ConnectionError if unreachable. |
| zk.Compute(uri="zakuro://host:3960") | HTTP worker. Same probe behaviour. |
| zk.Compute(uri="ray://head:10001") | Ray cluster (requires [ray]). |

Resource hints: cpus=, gpus=, memory=. All hints are advisory in standalone mode, with one exception: an explicit memory= refuses to run standalone, because the limit can't be enforced without a worker.
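The standalone policy above can be sketched in a few lines of plain Python. This is an illustrative model, not Zakuro's internals — ResourceHints and check_standalone are hypothetical names, and the real exception type may differ:

```python
# Hypothetical sketch of the standalone hint policy: cpus/gpus pass
# through as advisory, memory= raises because local execution cannot
# enforce it. Names are illustrative, not Zakuro's actual API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResourceHints:
    cpus: Optional[int] = None
    gpus: Optional[int] = None
    memory: Optional[str] = None

def check_standalone(hints: ResourceHints) -> None:
    # Advisory hints are simply ignored when running in-process.
    if hints.memory is not None:
        # Enforcement would need a worker with real resource control.
        raise RuntimeError(
            "memory= cannot be enforced in standalone mode; target a worker"
        )

check_standalone(ResourceHints(cpus=2))  # fine: advisory, runs locally
```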

@zk.fn / @zk.cls

Decorators that turn a function or class into something remotely callable. Under the hood the callable is cloudpickled and shipped to the worker on each call; zk.cls keeps the instance alive on a specific worker for subsequent method calls.
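The serialize-ship-invoke cycle can be sketched conceptually. Zakuro uses cloudpickle, which serializes closures by value; the sketch below uses stdlib pickle only to stay dependency-free, and RemoteFn/dispatch are illustrative names, not the framework's real classes:

```python
# Conceptual sketch of the @zk.fn call path: serialize the callable,
# "ship" the bytes, deserialize on the far side, invoke. The real
# framework cloudpickles and sends over QUIC/HTTP; here we round-trip
# in-process with stdlib pickle.
import pickle

class RemoteFn:
    def __init__(self, fn):
        self._fn = fn

    def to(self, compute):
        payload = pickle.dumps(self._fn)  # the bytes that would be shipped
        def dispatch(*args, **kwargs):
            # A real Compute would transmit `payload` to a worker;
            # here we just deserialize and call locally.
            return pickle.loads(payload)(*args, **kwargs)
        return dispatch

def fn(f):
    return RemoteFn(f)

def _square(x):
    return x * x

square = fn(_square)  # equivalent to decorating _square with @fn
result = square.to(None)(7)
```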

zk.Worker.spawn()

Convenience wrapper around the zakuro-worker CLI. Takes name=, transport="http"|"quic", port= (ephemeral if omitted). Polls /health until ready, registers an atexit hook so stray subprocesses die with the calling Python.
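The spawn-then-poll readiness pattern is generic enough to sketch with the stdlib. Here `check` stands in for an HTTP GET to the worker's /health endpoint; poll_until_ready is an illustrative name, not Zakuro's API:

```python
# Sketch of the readiness loop Worker.spawn performs: keep probing a
# health check until it succeeds or a deadline passes.
import time

def poll_until_ready(check, timeout=10.0, interval=0.05):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():          # in practice: GET http://host:port/health
            return True
        time.sleep(interval)
    return False

# Example: a "worker" whose health endpoint succeeds on the third probe.
probes = iter([False, False, True])
ready = poll_until_ready(lambda: next(probes), timeout=1.0)
```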

Transports

| scheme | default port | transport | when to use |
|---|---|---|---|
| zakuro:// | 3960 | HTTP (FastAPI + httpx) | interop with any load-balancer / reverse proxy |
| quic:// | 4433 | QUIC (aioquic) | fastest; connection-level resilience built in |
| ray:// | 10001 | Ray | existing Ray cluster |
| dask:// / tcp:// | 8786 | Dask | existing Dask scheduler |
| spark:// | 7077 | Spark | existing Spark master |

QUIC is the default for new work. It handles worker bounces natively (retry + 5 s idle timeout instead of aioquic's 30–60 s default — measured 12× faster detection).
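The scheme-to-default-port mapping in the table can be resolved with stdlib urllib.parse. A hypothetical sketch — resolve is not Zakuro's actual resolver:

```python
# Illustrative URI resolution for the transports table: explicit ports
# win, otherwise each scheme falls back to its documented default.
from urllib.parse import urlparse

DEFAULT_PORTS = {
    "zakuro": 3960,   # HTTP
    "quic": 4433,     # QUIC
    "ray": 10001,
    "dask": 8786,
    "tcp": 8786,      # Dask also accepts tcp://
    "spark": 7077,
}

def resolve(uri: str):
    u = urlparse(uri)
    port = u.port or DEFAULT_PORTS[u.scheme]
    return u.scheme, u.hostname, port

resolve("quic://worker")        # falls back to the default QUIC port
resolve("zakuro://host:9000")   # explicit port overrides the default
```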

Adaptive compute

zk.AdaptiveCompute is the core differentiator. It tracks per-worker latency with Adam-style EMAs (fast + slow baseline), variance, queue depth, health-probe outcomes, and failure counts — and picks the worker with the lowest expected time-to-serve for every dispatch.
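The core mechanism — fast/slow EMAs per worker plus softmax routing over expected latency — can be sketched in plain Python. This is a minimal model under assumed decay constants (beta_fast, beta_slow) and a simplified score; Zakuro's real allocator also folds in variance, queue depth, and probe outcomes:

```python
# Minimal sketch of Adam-style latency tracking + softmax routing:
# a fast EMA estimates current latency, a slow EMA serves as the drift
# baseline, and dispatch samples a softmax over negative latency.
import math
import random

class WorkerStats:
    def __init__(self, prior=0.010, beta_fast=0.9, beta_slow=0.99):
        self.fast = prior            # fast EMA: current latency estimate
        self.slow = prior            # slow EMA: long-run baseline
        self.beta_fast = beta_fast
        self.beta_slow = beta_slow

    def observe(self, latency):
        self.fast = self.beta_fast * self.fast + (1 - self.beta_fast) * latency
        self.slow = self.beta_slow * self.slow + (1 - self.beta_slow) * latency

    def drifted(self, factor=2.0):
        # Soft-demotion trigger: fast estimate far above the baseline.
        return self.fast > factor * self.slow

def pick(stats, temperature=0.02):
    if temperature == 0:
        return min(stats, key=lambda n: stats[n].fast)  # greedy argmin
    # Softmax over negative latency keeps slower workers occasionally
    # utilised (exploration), so their estimates stay fresh.
    names = list(stats)
    scores = [-stats[n].fast / temperature for n in names]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return random.choices(names, weights=weights)[0]
```

With temperature=0 the fastest worker always wins; a small positive temperature spreads a minority of traffic across the pool, which is how a slowed worker can later prove it has recovered.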

What it does, observed

All numbers come from real subprocess workers running on the Mac (see scripts/bench_*.py):

| Feature | Measured behaviour |
|---|---|
| Warmup | Auto-derives backpressure_threshold = 1.5 × max(worker p95) ≈ 29 ms on the 3-worker Mac mesh. No manual tuning. |
| Greedy vs softmax routing | Greedy commits entirely to the 3 ms-faster worker (100 %/0 %). Softmax τ=0.02 keeps all three workers utilised (172/152/176). |
| Add / remove workers at runtime | A removed worker drops to 0 picks within the batch; a readmitted worker starts at the mesh-median prior and earns traffic immediately. |
| Health probes | A background thread detects SIGKILL in 18 ms (tight config) / 743 ms (loose). The worker is suspended; traffic reroutes. |
| Drift detection | An injected 250 ms/call slowdown ⇒ 95 % of traffic diverted by t + 0.48 s. Recovery via softmax + health-probe latencies. |
| QUIC retry | An in-flight request during worker SIGKILL surfaces ConnectionError in 5 s (vs aioquic's 30–60 s default). |

Full numbers in PLAN.md.

Notebooks

Each one runs end-to-end on a laptop and prints observed numbers — no faked data, no simulated failures. Verified by jupyter nbconvert --execute.

| notebook | path | what it shows |
|---|---|---|
| Standalone | notebooks/standalone_mode.ipynb | Compute() without a URI → in-process fallback; advisory resource hints; memory-enforcement refusal |
| Two workers | notebooks/two_worker_demo.ipynb | Spawn two workers via zk.Worker.spawn(), chain calls between them |
| Mesh adaptation tour | notebooks/mesh_adaptation_tour.ipynb | Every AdaptiveCompute knob — warmup, soft/greedy routing, add/remove, health, drift |
| QUIC resilience | notebooks/quic_resilience.ipynb | Baseline → SIGKILL → respawn; 5 s detection; post-respawn recovery |

The sakura repo adds bert_demo/hf_async_features.ipynb — every SakuraHFCallback knob on a real BERT fine-tune.

Benchmarks

All under scripts/ and take a handful of seconds to a minute each. Output is JSON-dumpable via --log <path> where supported.

| script | scenario | headline |
|---|---|---|
| bench_mesh_adaptation.py | 2-worker warmup → dispatch → remove → readmit | 29 ms auto-backpressure; rebalance within one batch |
| bench_health_detection.py | SIGKILL a worker mid-run | 18–743 ms detection depending on probe cadence |
| bench_drift_detection.py | Induce sleep(0.25) on one worker | drift detected at t + 0.48 s; 95 % of traffic diverted |
| bench_quic_retry.py | SIGKILL + respawn on the same QUIC port | dead-connection detection cut from 60 s to 5 s |
| bench_all.py | Runs the four above in one go | consolidated summary table |

uv run python scripts/bench_all.py --log /tmp/zakuro-bench.json


Related projects

  • sakura — ML framework integrations (PyTorch Lightning, HuggingFace Trainer, TensorFlow/Keras) that use Zakuro under the hood to hide eval latency behind training
  • zc — Rust broker with QUIC transport, credit-based billing, P2P mesh

Development

# Tests
uv run pytest tests/

# Specific benchmark
uv run python scripts/bench_mesh_adaptation.py --n-workers 3

# Build a wheel
task build:wheel

# Full CI set
task ci:all

Python 3.10+ is required. uv is the recommended package manager (pip install uv).

License

BSD-3-Clause. See LICENSE.
