Skip to main content

Liquid cluster kit to launch and manage long-lived Slurm services jobs.

Project description

cluster-kit

Cluster toolkit for Liquid AI. First package: cluster_kit.slurm_services — a lightweight SDK to launch and manage long-lived Slurm service jobs (e.g. an inference server used as an LLM judge) whose lifecycle is bound to one or more consumer jobs.

No database, no central worker: the shared filesystem is the registry and squeue is the source of truth for liveness. Multiple consumers can share one warm service and it self-reaps once nobody is using it.

The SDK is cluster-agnostic. The bundled examples target a GPU Slurm cluster that schedules by QoS (--gpus-per-node / --cpus-per-gpu), runs ROCm, has uv preinstalled, and exposes a shared $HOME — adjust the resource flags and server entrypoint for your own cluster.

Install

uv add lqck        # or: pip install lqck

The distribution is lqck; the import package is cluster_kit (pip install lqck, then import cluster_kit / python -m cluster_kit.slurm_services). Runtime is dependency-free (stdlib + the Slurm CLI).

Local dev: uv sync --extra dev && uv run pytest (no cluster needed).

Run it on the cluster

Everything goes through Slurm. The consumer/driver is itself a job, and the SDK does a nested sbatch from inside it to launch the GPU-backed service job; the two find each other via the registry and talk over HTTP.

cd cluster-kit && mkdir -p logs

# Smallest end-to-end: bring up LiquidAI/LFM2.5-350M and send one "Hello".
sbatch examples/hello_lfm.slurm
tail -f logs/lfm-hello_*.log        # submit -> RUNNING -> healthy -> response

# Real pattern: a training job that shares a warm judge with other runs.
sbatch examples/train_with_judge.slurm

Service stdout (sglang startup) lands in ~/.slurm_services/<name>/service-<jobid>.out.

Preview the exact sbatch without submitting anything:

uv run python -m cluster_kit.slurm_services ensure \
    --name lfm-hello --entrypoint "$PWD/examples/start_model_server.sh" \
    --gpus-per-node 1 --cpus-per-gpu 16 \
    --port 8000 --env MODEL=LiquidAI/LFM2.5-350M --env PORT=8000 --dry-run

Python API

from cluster_kit.slurm_services import HealthCheck, Resources, ServiceSpec, slurm_service

spec = ServiceSpec(
    name="lfm-hello",
    entrypoint="examples/start_model_server.sh",   # reused as-is; serves an OpenAI API
    resources=Resources(gpus_per_node=1, cpus_per_gpu=16, time_limit="00:30:00"),
    env={"MODEL": "LiquidAI/LFM2.5-350M", "PORT": "8000"},
    port=8000,
    health_check=HealthCheck(path="/health", timeout_s=600),
    idle_timeout_s=600,            # keep warm for the next run to reuse
    fingerprint_keys=["MODEL"],    # only reuse a service running this model
)

with slurm_service(spec) as svc:   # blocks until /health passes
    reply = say_hello(svc.url)     # POST {svc.url}/v1/chat/completions
# released on exit; reaped even on SIGKILL / node loss

The CLI (ensure / heartbeat / release) is the same thing for shell-driven *.slurm scripts — see examples/train_with_judge.slurm.

Logging

Verbose by default; set SLURM_SERVICES_LOG to DEBUG / INFO / WARNING. At INFO you always get the exact, copy-pasteable submission and every state change:

INFO [slurm_service] submitting: sbatch --parsable --job-name=llm-judge ... wrapper.sh
INFO [slurm_service] submitted job 90210
INFO [slurm_service] job 90210 state: PENDING -> RUNNING
INFO [slurm_service] service 'llm-judge' healthy at node01:8000 (job 90210)

How it works

ensure_service (and the slurm_service context manager over it) runs inside the consumer job and: checks the registry for a healthy same-fingerprint service to reuse; otherwise takes an atomic lock, renders a wrapper around your entrypoint, and sbatches it; polls squeue to RUNNING then /health to 200; registers a lease and returns a Handle. On exit it drops the lease.

Reaping is belt-and-suspenders (since atexit doesn't fire on SIGKILL/node loss):

  • Consumer side — a background thread renews this consumer's lease file.
  • Service side — a watcher inside the service job scancels itself once no lease is live (fresh heartbeat, or the lease's parent job still in squeue), after an idle_timeout_s grace window.

Key design choices:

  • Lease set, not a single parent — the service stays up while ≥1 consumer holds a live lease, so concurrent runs share it; one consumer is just N=1.
  • Fingerprint-gated reuse — a same-name service running a different model raises FingerprintMismatch rather than handing back the wrong endpoint.
  • --export omitted by default so the service inherits the consumer job's modules + venv (set export_env="NONE" for a clean env).
  • Registry location — per-user ~/.slurm_services by default ($HOME is shared on the cluster, so it's reachable from every node); set $SLURM_SERVICES_ROOT to a shared path for team-wide sharing.

Layout

src/cluster_kit/slurm_services/
  __init__.py   slurm_service(), ensure_service(), release_service(), Handle, exceptions
  __main__.py   CLI: ensure / heartbeat / release (+ --dry-run)
  config.py     Resources, HealthCheck, ServiceSpec (+ fingerprint)
  slurm.py      sbatch/squeue/scancel/sacct wrappers
  registry.py   shared-dir lookup-or-create lock + lease set
  health.py     HTTP /health polling gate
  heartbeat.py  lease renewer (consumer) + self-suicide watcher (service)
  wrapper.py    generated batch wrapper
  logutil.py    logging
examples/
  hello_lfm.slurm        `sbatch` this for the smallest end-to-end run
  hello_lfm.py           the Python driver it runs
  train_with_judge.slurm shell-driven consumer: shared judge + training
  start_model_server.sh  server entrypoint: sglang-ROCm container (OpenAI API + /health)

Status & roadmap

The SDK is implemented and unit-tested. Remaining work and possible follow-ups:

To do

  • Tag the first release (v0.1.0) so the git-install pin in Install resolves.
  • Validate one real end-to-end run on the AMD cluster — confirm the sglang-ROCm image serves the chosen LFM2 model and that a CPU-only consumer job schedules — then retire the old 2-node judge setup.

Possible future improvements

  • Hetjob co-scheduling if/when the cluster supports --hetjob (today: dependency
    • client-side health gate, which is authoritative regardless).
  • Service restart / endpoint hot-swap (today: dependent consumers fail fast).
  • A private package index, only if git-install friction shows up.
  • Optional fire-and-forget status POST for dashboard visibility (never a dependency).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lqck-0.1.0a2.tar.gz (27.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lqck-0.1.0a2-py3-none-any.whl (31.5 kB view details)

Uploaded Python 3

File details

Details for the file lqck-0.1.0a2.tar.gz.

File metadata

  • Download URL: lqck-0.1.0a2.tar.gz
  • Upload date:
  • Size: 27.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lqck-0.1.0a2.tar.gz
Algorithm Hash digest
SHA256 5e3ebedffd2d56908b71cd92d61358913120778059c584f969b0b9771d9a1c98
MD5 5c1c58fabfc89454484674a590e4ba22
BLAKE2b-256 3c7dd1d004d6e205cebeee58009cf769d465ae8b3cdf3d32db22027a23b02820

See more details on using hashes here.

Provenance

The following attestation bundles were made for lqck-0.1.0a2.tar.gz:

Publisher: publish.yaml on Liquid4All/cluster-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lqck-0.1.0a2-py3-none-any.whl.

File metadata

  • Download URL: lqck-0.1.0a2-py3-none-any.whl
  • Upload date:
  • Size: 31.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lqck-0.1.0a2-py3-none-any.whl
Algorithm Hash digest
SHA256 9bca009af01e3b0f6b2057e7dae0ad0f8987a5d2ae48f2262213da87e7758120
MD5 e95299bb755218a10e51ac7e401dd975
BLAKE2b-256 37eb54f5b3029a3122bae3fa61393891e58ccbd881ff6ac4f50a34af01d24c07

See more details on using hashes here.

Provenance

The following attestation bundles were made for lqck-0.1.0a2-py3-none-any.whl:

Publisher: publish.yaml on Liquid4All/cluster-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page