Skip to main content

Liquid cluster kit to launch and manage long-lived Slurm services jobs.

Project description

cluster-kit

Cluster toolkit for Liquid AI. First package: cluster_kit.slurm_services — a lightweight SDK to launch and manage long-lived Slurm service jobs (e.g. an inference server used as an LLM judge) whose lifecycle is bound to one or more consumer jobs.

No database, no central worker: the shared filesystem is the registry and squeue is the source of truth for liveness. Multiple consumers can share one warm service and it self-reaps once nobody is using it.

The SDK is cluster-agnostic. The bundled examples target a GPU Slurm cluster that schedules by QoS (--gpus-per-node / --cpus-per-gpu), runs ROCm, has uv preinstalled, and exposes a shared $HOME — adjust the resource flags and server entrypoint for your own cluster.

Install

uv add lqck        # or: pip install lqck

The distribution is lqck; the import package is cluster_kit (pip install lqck, then import cluster_kit / python -m cluster_kit.slurm_services). Runtime is dependency-free (stdlib + the Slurm CLI).

Local dev: uv sync --extra dev && uv run pytest (no cluster needed).

Run it on the cluster

Everything goes through Slurm. The consumer/driver is itself a job, and the SDK does a nested sbatch from inside it to launch the GPU-backed service job; the two find each other via the registry and talk over HTTP.

cd cluster-kit && mkdir -p logs

# Smallest end-to-end: bring up LiquidAI/LFM2.5-350M and send one "Hello".
sbatch examples/hello_lfm.slurm
tail -f logs/lfm-hello_*.log        # submit -> RUNNING -> healthy -> response

# Real pattern: a training job that shares a warm judge with other runs.
sbatch examples/train_with_judge.slurm

Service stdout (sglang startup) lands in ~/.slurm_services/<name>/service-<jobid>.out.

Preview the exact sbatch without submitting anything:

uv run python -m cluster_kit.slurm_services ensure \
    --name lfm-hello --entrypoint python \
    --arg=-m --arg=sglang.launch_server \
    --arg=--model-path --arg=LiquidAI/LFM2.5-350M --arg=--host --arg=0.0.0.0 --arg=--port --arg=8000 \
    --gpus-per-node 1 --cpus-per-gpu 16 \
    --container-image lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi30x-20260428 \
    --dry-run
# --port (8000), the cluster's --container-mount defaults, --health-path, and the
# warm-reuse window all have defaults; pass them only to override.

Python API

from cluster_kit.slurm_services import Resources, ServiceSpec, slurm_service

spec = ServiceSpec(
    name="lfm-hello",
    # No server script: the SDK runs this command inside the container, publishes
    # host:port, and health-gates it. --host 0.0.0.0 so consumers reach it across nodes.
    entrypoint="python",
    args=["-m", "sglang.launch_server", "--model-path", "LiquidAI/LFM2.5-350M",
          "--host", "0.0.0.0", "--port", "8000"],
    resources=Resources(gpus_per_node=1, cpus_per_gpu=16, time_limit="00:30:00"),
    container_image="lmsysorg/sglang-rocm:v0.5.10.post1-rocm700-mi30x-20260428",
    # port (8000), the cluster's container_mounts, the /health check, and a warm-
    # reuse window default in — override (e.g. port=9000, container_mounts=[...],
    # idle_timeout_s=600) only when the service needs something different.
)

with slurm_service(spec) as svc:   # blocks until /health passes
    reply = say_hello(svc.url)     # POST {svc.url}/v1/chat/completions
# released on exit; reaped even on SIGKILL / node loss

The CLI (ensure / heartbeat / release) is the same thing for shell-driven *.slurm scripts — see examples/train_with_judge.slurm.

Logging

Verbose by default; set SLURM_SERVICES_LOG to DEBUG / INFO / WARNING. At INFO you always get the exact, copy-pasteable submission and every state change:

INFO [slurm_service] submitting: sbatch --parsable --job-name=llm-judge ... wrapper.sh
INFO [slurm_service] submitted job 90210
INFO [slurm_service] job 90210 state: PENDING -> RUNNING
INFO [slurm_service] service 'llm-judge' healthy at node01:8000 (job 90210)

How it works

ensure_service (and the slurm_service context manager over it) runs inside the consumer job and: checks the registry for a healthy same-fingerprint service to reuse; otherwise takes an atomic lock, renders a wrapper around your entrypoint, and sbatches it; polls squeue to RUNNING then /health to 200; registers a lease and returns a Handle. On exit it drops the lease.

Reaping is belt-and-suspenders (since atexit doesn't fire on SIGKILL/node loss):

  • Consumer side — a background thread renews this consumer's lease file.
  • Service side — a watcher inside the service job scancels itself once no lease is live (fresh heartbeat, or the lease's parent job still in squeue), after an idle_timeout_s grace window.

Key design choices:

  • Lease set, not a single parent — the service stays up while ≥1 consumer holds a live lease, so concurrent runs share it; one consumer is just N=1.
  • Fingerprint-gated reuse — a same-name service running a different model raises FingerprintMismatch rather than handing back the wrong endpoint.
  • --export omitted by default so the service inherits the consumer job's modules + venv (set export_env="NONE" for a clean env).
  • Registry location — per-user ~/.slurm_services by default ($HOME is shared on the cluster, so it's reachable from every node); set $SLURM_SERVICES_ROOT to a shared path for team-wide sharing.

Layout

src/cluster_kit/slurm_services/
  __init__.py   slurm_service(), ensure_service(), release_service(), Handle, exceptions
  __main__.py   CLI: ensure / heartbeat / release (+ --dry-run)
  config.py     Resources, HealthCheck, ServiceSpec (+ fingerprint)
  slurm.py      sbatch/squeue/scancel/sacct wrappers
  registry.py   shared-dir lookup-or-create lock + lease set
  health.py     HTTP /health polling gate
  heartbeat.py  lease renewer (consumer) + self-suicide watcher (service)
  wrapper.py    generated batch wrapper
  logutil.py    logging
examples/
  hello_lfm.slurm        `sbatch` this for the smallest end-to-end run
  hello_lfm.py           the Python driver it runs (inline entrypoint + container_image)
  train_with_judge.slurm shell-driven consumer: shared judge + training

Roadmap

Possible future improvements

  • Hetjob co-scheduling if/when the cluster supports --hetjob (today: dependency
    • client-side health gate, which is authoritative regardless).
  • Service restart / endpoint hot-swap (today: dependent consumers fail fast).
  • Optional fire-and-forget status POST for dashboard visibility (never a dependency).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lqck-0.1.1.tar.gz (28.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lqck-0.1.1-py3-none-any.whl (33.0 kB view details)

Uploaded Python 3

File details

Details for the file lqck-0.1.1.tar.gz.

File metadata

  • Download URL: lqck-0.1.1.tar.gz
  • Upload date:
  • Size: 28.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lqck-0.1.1.tar.gz
Algorithm Hash digest
SHA256 da03e32b1d6f2857b93cb13f2f49d2d2e8277d8a10a8687db2c4615e4f306206
MD5 971739a2048d29948b1cd3f523157447
BLAKE2b-256 99328f5a8cab1b32b16b5c6b830e617595dd701eb3f7a6d545b8387269af37d7

See more details on using hashes here.

Provenance

The following attestation bundles were made for lqck-0.1.1.tar.gz:

Publisher: publish.yaml on Liquid4All/cluster-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lqck-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: lqck-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 33.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lqck-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c9ddf97159de9e1e158215a77fd1604d088034a0ce39e63476c9a296167bbc67
MD5 0022ab4f791a93a64bb3d59e16115c88
BLAKE2b-256 679c0a48a4433b0943d053457ba860b4fc8d2678e10f1fae12b5591270dbde0b

See more details on using hashes here.

Provenance

The following attestation bundles were made for lqck-0.1.1-py3-none-any.whl:

Publisher: publish.yaml on Liquid4All/cluster-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page