Skip to main content

Zakuro - Distributed computing made simple

Project description

zakuro Logo


License Python

Quick StartInstallationConceptsAdaptive ComputeNotebooksBenchmarksDocs

Zakuro is a context-aware distributed-ML runtime. You decorate a Python function, declare a pool of workers, and the framework routes each call to the worker with the lowest expected time-to-serve — learning from every dispatch, reacting to node failures and performance drift, and never making training wait on things that can be decoupled.

See PRD.md for the product vision and PLAN.md for the measured engineering progress (every number observed, nothing simulated).

Quick start

Run a function on a remote worker — three lines

import zakuro as zk

@zk.fn
def square(x: int) -> int:
    return x * x

# Run on a remote worker (QUIC transport, fast).
result = square.to(zk.Compute(uri="quic://worker:4433"))(7)  # → 49

No worker? Zakuro falls back to running the function in-process so your code keeps working:

# No `uri`, no `host` → standalone fallback; the call runs locally.
result = square.to(zk.Compute())(7)  # → 49

Adapt across a pool — Adam-style allocator

import zakuro as zk

workers = [zk.Worker.spawn(name=f"w{i}") for i in range(3)]
adaptive = zk.AdaptiveCompute(
    workers=[w.compute(verify=False) for w in workers],
    beta1=0.9,
    softmax_temperature=0.02,  # exploration; set 0 for greedy argmin
)
adaptive.warmup(rounds=3)                # auto-calibrate per-worker priors
adaptive.start_health_probes(interval=0.5, max_strikes=2)  # detect drop-outs

@zk.fn
def expensive(x): ...

# The allocator picks the worker with the lowest expected time-to-serve,
# tracks per-worker latency EMA + variance, soft-demotes drifted workers,
# suspends failed ones.
result = expensive.to(adaptive)(42)

Installation

From source (recommended while the 0.3 series is pre-release)

git clone https://github.com/zakuro-ai/zakuro
cd zakuro
uv sync --extra worker         # pulls FastAPI + uvicorn + aioquic for the worker CLI

From PyPI

# Core only (enough to be a client):
pip install zakuro-ai

# Core + worker (needed to run `zakuro-worker`):
pip install 'zakuro-ai[worker]'

Optional extras

extra adds
[worker] FastAPI, uvicorn, aioquic, psutil — needed to run a worker
[ray] Ray processor
[dask] Dask processor
[spark] PySpark processor
[dev] pytest, ruff, mypy

You do not need [worker] just to call into a running worker — import zakuro stays lean on purpose.

zc CLI (optional)

zc is the Rust broker + CLI at zakuro-ai/zc. It's useful for multi-node meshes but not required for a single-process or small-cluster setup.

# macOS / Linux one-liner — installs to /usr/local/bin
curl -sSL https://raw.githubusercontent.com/zakuro-ai/zc/master/scripts/install.sh | bash

Concepts

Compute

A target for remote execution.

How it's constructed Where the call runs
zk.Compute() Standalone — in-process fallback. No network, no workers required.
zk.Compute(uri="quic://host:4433") QUIC worker. Probed at construction; raises ConnectionError if unreachable.
zk.Compute(uri="zakuro://host:3960") HTTP worker. Same probe behaviour.
zk.Compute(uri="ray://head:10001") Ray cluster (requires [ray]).

Resource hints: cpus=, gpus=, memory=. All advisory on standalone; an explicit memory= refuses to run in standalone because we can't enforce it.

@zk.fn / @zk.cls

Decorators that turn a function or class into something remotely callable. Under the hood the callable is cloudpickled and shipped to the worker on each call; zk.cls keeps the instance alive on a specific worker for subsequent method calls.

zk.Worker.spawn()

Convenience wrapper around the zakuro-worker CLI. Takes name=, transport="http"|"quic", port= (ephemeral if omitted). Polls /health until ready, registers an atexit hook so stray subprocesses die with the calling Python.

Transports

scheme default port transport when to use
zakuro:// 3960 HTTP (FastAPI + httpx) interop with any load-balancer / reverse proxy
quic:// 4433 QUIC (aioquic) fastest; connection-level resilience built in
ray:// 10001 Ray existing Ray cluster
dask:// / tcp:// 8786 Dask existing Dask scheduler
spark:// 7077 Spark existing Spark master

QUIC is the default for new work. It handles worker bounces natively (retry + 5 s idle timeout instead of aioquic's 30–60 s default — measured 12× faster detection).

Adaptive compute

zk.AdaptiveCompute is the core differentiator. It tracks per-worker latency with Adam-style EMAs (fast + slow baseline), variance, queue depth, health-probe outcomes, and failure counts — and picks the worker with the lowest expected time-to-serve for every dispatch.

What it does, observed

All numbers come from real subprocess workers running on the Mac (see scripts/bench_*.py):

Feature Measured behaviour
Warmup Auto-derives backpressure_threshold = 1.5 × max(worker p95)29 ms on the 3-worker Mac mesh. No manual tuning.
Greedy vs softmax routing Greedy commits to the 3-ms-faster worker (100 %/0 %). Softmax τ=0.02 keeps all three workers utilised (172/152/176).
Add / remove workers at runtime Removed worker drops to 0 picks within the batch; readmitted worker starts at mesh-median prior and earns traffic immediately.
Health probes Background thread detects SIGKILL in 18 ms (tight config) / 743 ms (loose). Worker suspended; traffic reroutes.
Drift detection Injected 250 ms/call slowdown ⇒ 95 % traffic diversion at t + 0.48 s. Recovery via softmax + health-probe latencies.
QUIC retry In-flight request during worker SIGKILL surfaces ConnectionError in 5 s (vs aioquic's 30–60 s default).

Full numbers in PLAN.md.

Notebooks

Each one runs end-to-end on a laptop and prints observed numbers — no faked data, no simulated failures. Verified by jupyter nbconvert --execute.

notebook path what it shows
Standalone notebooks/standalone_mode.ipynb Compute() without a URI → in-process fallback; advisory resource hints; memory-enforcement refusal
Two workers notebooks/two_worker_demo.ipynb Spawn two workers via zk.Worker.spawn(), chain calls between them
Mesh adaptation tour notebooks/mesh_adaptation_tour.ipynb Every AdaptiveCompute knob — warmup, soft/greedy, add/remove, health, drift
QUIC resilience notebooks/quic_resilience.ipynb Baseline → SIGKILL → respawn; 5 s detection; post-respawn recovery

The sakura repo adds bert_demo/hf_async_features.ipynb — every SakuraHFCallback knob on a real BERT fine-tune.

Benchmarks

All under scripts/ and take a handful of seconds to a minute each. Output is JSON-dumpable via --log <path> where supported.

script scenario headline
bench_mesh_adaptation.py 2-worker warmup → dispatch → remove → readmit 29 ms auto-bp, rebalance within one batch
bench_health_detection.py SIGKILL worker mid-run 18–743 ms detection depending on probe cadence
bench_drift_detection.py Induce sleep(0.25) on one worker drift detected t + 0.48 s, 95 % traffic diverted
bench_quic_retry.py SIGKILL + respawn on same QUIC port 60 s → 5 s dead-connection detection
bench_all.py Runs the four above in one go consolidated summary table
uv run python scripts/bench_all.py --log /tmp/zakuro-bench.json

Docs

Related projects

  • sakura — ML framework integrations (PyTorch Lightning, HuggingFace Trainer, TensorFlow/Keras) that use Zakuro under the hood to hide eval latency behind training
  • zc — Rust broker with QUIC transport, credit-based billing, P2P mesh

Development

# Tests
uv run pytest tests/

# Specific benchmark
uv run python scripts/bench_mesh_adaptation.py --n-workers 3

# Build a wheel
task build:wheel

# Full CI set
task ci:all

Python 3.10+ is required. uv is the recommended package manager (pip install uv).

Container images

The worker is published in two variants, both built for linux/amd64 and linux/arm64:

Tag suffix Base image Size Use when
(none) — :latest, :cpu python:3.12-slim ~370 MB Default. Has a shell + apt for debugging via docker exec.
-distroless:latest-distroless gcr.io/distroless/python3-debian12:nonroot ~155 MB Production. No shell, no apt, runs as the distroless nonroot user (uid 65532). Minimal CVE surface.

Pull from either Docker Hub or GHCR:

docker pull zakuroai/zakuro-worker:latest          # slim
docker pull zakuroai/zakuro-worker:latest-distroless

docker pull ghcr.io/zakuro-ai/zakuro-worker:latest
docker pull ghcr.io/zakuro-ai/zakuro-worker:latest-distroless

Both variants run the same FastAPI worker on port 3960 with the same in-process /health healthcheck.

Verifying releases

Every published wheel and container image is signed and attested. Detailed flow in docs/security/verifying-releases.md; the one-liners are:

# Wheel — SLSA L3 provenance
slsa-verifier verify-artifact \
    --provenance-path zakuro_ai-X.Y.Z-py3-none-any.whl.intoto.jsonl \
    --source-uri github.com/zakuro-ai/zakuro \
    --source-tag vX.Y.Z \
    zakuro_ai-X.Y.Z-py3-none-any.whl

# Container image — Cosign keyless (signature lives in Rekor)
cosign verify ghcr.io/zakuro-ai/zakuro-worker:X.Y.Z \
    --certificate-identity-regexp '^https://github\.com/zakuro-ai/zakuro/\.github/workflows/publish\.yml@refs/tags/.*$' \
    --certificate-oidc-issuer https://token.actions.githubusercontent.com

If verification fails on what appears to be a real release, do not run the artefact — report to security@zakuro.ai.

License

BSD-3-Clause. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

zakuro_ai-0.2.13.tar.gz (329.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

zakuro_ai-0.2.13-py3-none-any.whl (73.7 kB view details)

Uploaded Python 3

File details

Details for the file zakuro_ai-0.2.13.tar.gz.

File metadata

  • Download URL: zakuro_ai-0.2.13.tar.gz
  • Upload date:
  • Size: 329.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for zakuro_ai-0.2.13.tar.gz
Algorithm Hash digest
SHA256 84a0d953305ec263441b4bcd6c1469e2b5afdfba75e5c4a1a73781210e2aa781
MD5 111215cf0e768ba458682a68a78ba38e
BLAKE2b-256 381eaa29f450724dd293ba1f73465a115fe8bd5069c8e89674df8b71a4821300

See more details on using hashes here.

File details

Details for the file zakuro_ai-0.2.13-py3-none-any.whl.

File metadata

  • Download URL: zakuro_ai-0.2.13-py3-none-any.whl
  • Upload date:
  • Size: 73.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for zakuro_ai-0.2.13-py3-none-any.whl
Algorithm Hash digest
SHA256 573539a7b515b109930247dfa738b5e504f89369c63c71d5b8c0c65287802c5a
MD5 6680e31569f45381658b505063249db6
BLAKE2b-256 be073541c8dd1bedc16083be6c9a83a4491749a41759a4eecf649448823e3936

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page