Skip to main content

Liquid cluster kit to launch and manage long-lived Slurm services jobs.

Project description

cluster-kit

Cluster toolkit for Liquid AI. First package: cluster_kit.slurm_services — a lightweight SDK to launch and manage long-lived Slurm service jobs (e.g. an inference server used as an LLM judge) whose lifecycle is bound to one or more consumer jobs.

No database, no central worker: the shared filesystem is the registry and squeue is the source of truth for liveness. Multiple consumers can share one warm service and it self-reaps once nobody is using it.

The SDK is cluster-agnostic. The bundled examples target a GPU Slurm cluster that schedules by QoS (--gpus-per-node / --cpus-per-gpu), runs ROCm, has uv preinstalled, and exposes a shared $HOME — adjust the resource flags and server entrypoint for your own cluster.

Install

uv add lqck        # or: pip install lqck

The distribution is lqck; the import package is cluster_kit (pip install lqck, then import cluster_kit / python -m cluster_kit.slurm_services). Runtime is dependency-free (stdlib + the Slurm CLI).

Local dev: uv sync --extra dev && uv run pytest (no cluster needed).

Run it on the cluster

Everything goes through Slurm. The consumer/driver is itself a job, and the SDK does a nested sbatch from inside it to launch the GPU-backed service job; the two find each other via the registry and talk over HTTP.

cd cluster-kit && mkdir -p logs

# Smallest end-to-end: bring up LiquidAI/LFM2.5-350M and send one "Hello".
sbatch examples/hello_lfm.slurm
tail -f logs/lfm-hello_*.log        # submit -> RUNNING -> healthy -> response

# Real pattern: a training job that shares a warm judge with other runs.
sbatch examples/train_with_judge.slurm

Service stdout (sglang startup) lands in ~/.slurm_services/<name>/service-<jobid>.out.

Preview the exact sbatch without submitting anything:

uv run python -m cluster_kit.slurm_services ensure \
    --name lfm-hello --entrypoint "$PWD/examples/start_model_server.sh" \
    --gpus-per-node 1 --cpus-per-gpu 16 \
    --port 8000 --env MODEL=LiquidAI/LFM2.5-350M --env PORT=8000 --dry-run

Python API

from cluster_kit.slurm_services import HealthCheck, Resources, ServiceSpec, slurm_service

spec = ServiceSpec(
    name="lfm-hello",
    entrypoint="examples/start_model_server.sh",   # reused as-is; serves an OpenAI API
    resources=Resources(gpus_per_node=1, cpus_per_gpu=16, time_limit="00:30:00"),
    env={"MODEL": "LiquidAI/LFM2.5-350M", "PORT": "8000"},
    port=8000,
    health_check=HealthCheck(path="/health", timeout_s=600),
    idle_timeout_s=600,            # keep warm for the next run to reuse
    fingerprint_keys=["MODEL"],    # only reuse a service running this model
)

with slurm_service(spec) as svc:   # blocks until /health passes
    reply = say_hello(svc.url)     # POST {svc.url}/v1/chat/completions
# released on exit; reaped even on SIGKILL / node loss

The CLI (ensure / heartbeat / release) is the same thing for shell-driven *.slurm scripts — see examples/train_with_judge.slurm.

Logging

Verbose by default; set SLURM_SERVICES_LOG to DEBUG / INFO / WARNING. At INFO you always get the exact, copy-pasteable submission and every state change:

INFO [slurm_service] submitting: sbatch --parsable --job-name=llm-judge ... wrapper.sh
INFO [slurm_service] submitted job 90210
INFO [slurm_service] job 90210 state: PENDING -> RUNNING
INFO [slurm_service] service 'llm-judge' healthy at node01:8000 (job 90210)

How it works

ensure_service (and the slurm_service context manager over it) runs inside the consumer job and: checks the registry for a healthy same-fingerprint service to reuse; otherwise takes an atomic lock, renders a wrapper around your entrypoint, and sbatches it; polls squeue to RUNNING then /health to 200; registers a lease and returns a Handle. On exit it drops the lease.

Reaping is belt-and-suspenders (since atexit doesn't fire on SIGKILL/node loss):

  • Consumer side — a background thread renews this consumer's lease file.
  • Service side — a watcher inside the service job scancels itself once no lease is live (fresh heartbeat, or the lease's parent job still in squeue), after an idle_timeout_s grace window.

Key design choices:

  • Lease set, not a single parent — the service stays up while ≥1 consumer holds a live lease, so concurrent runs share it; one consumer is just N=1.
  • Fingerprint-gated reuse — a same-name service running a different model raises FingerprintMismatch rather than handing back the wrong endpoint.
  • --export omitted by default so the service inherits the consumer job's modules + venv (set export_env="NONE" for a clean env).
  • Registry location — per-user ~/.slurm_services by default ($HOME is shared on the cluster, so it's reachable from every node); set $SLURM_SERVICES_ROOT to a shared path for team-wide sharing.

Layout

src/cluster_kit/slurm_services/
  __init__.py   slurm_service(), ensure_service(), release_service(), Handle, exceptions
  __main__.py   CLI: ensure / heartbeat / release (+ --dry-run)
  config.py     Resources, HealthCheck, ServiceSpec (+ fingerprint)
  slurm.py      sbatch/squeue/scancel/sacct wrappers
  registry.py   shared-dir lookup-or-create lock + lease set
  health.py     HTTP /health polling gate
  heartbeat.py  lease renewer (consumer) + self-suicide watcher (service)
  wrapper.py    generated batch wrapper
  logutil.py    logging
examples/
  hello_lfm.slurm        `sbatch` this for the smallest end-to-end run
  hello_lfm.py           the Python driver it runs
  train_with_judge.slurm shell-driven consumer: shared judge + training
  start_model_server.sh  server entrypoint: sglang-ROCm container (OpenAI API + /health)

Status & roadmap

The SDK is implemented and unit-tested. Remaining work and possible follow-ups:

To do

  • Tag the first release (v0.1.0) so the git-install pin in Install resolves.
  • Validate one real end-to-end run on the AMD cluster — confirm the sglang-ROCm image serves the chosen LFM2 model and that a CPU-only consumer job schedules — then retire the old 2-node judge setup.

Possible future improvements

  • Hetjob co-scheduling if/when the cluster supports --hetjob (today: dependency
    • client-side health gate, which is authoritative regardless).
  • Service restart / endpoint hot-swap (today: dependent consumers fail fast).
  • A private package index, only if git-install friction shows up.
  • Optional fire-and-forget status POST for dashboard visibility (never a dependency).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lqck-0.1.0a1.tar.gz (25.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lqck-0.1.0a1-py3-none-any.whl (29.7 kB view details)

Uploaded Python 3

File details

Details for the file lqck-0.1.0a1.tar.gz.

File metadata

  • Download URL: lqck-0.1.0a1.tar.gz
  • Upload date:
  • Size: 25.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lqck-0.1.0a1.tar.gz
Algorithm Hash digest
SHA256 dbca85d06e00ff52c3c454dbf14280ee38be4290b50289f14091197cfb8d5a29
MD5 b51dc4a5bd231a453df2be2cd3bc1434
BLAKE2b-256 c8afd52667179fd2af073ce24adc67a12fc44549370dad199d14227c1285f0d8

See more details on using hashes here.

Provenance

The following attestation bundles were made for lqck-0.1.0a1.tar.gz:

Publisher: publish.yaml on Liquid4All/cluster-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file lqck-0.1.0a1-py3-none-any.whl.

File metadata

  • Download URL: lqck-0.1.0a1-py3-none-any.whl
  • Upload date:
  • Size: 29.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lqck-0.1.0a1-py3-none-any.whl
Algorithm Hash digest
SHA256 e5817e8a4d8c93415cc661c0d1f7f03d346d73fc43e5329189ed15f2c2af1e8b
MD5 0878589a66a4625eaab10d95b5a77f87
BLAKE2b-256 e82db9ab3f6873108784e4f35337fa4a033ea44e3939a302ec21f18dd2cb52ad

See more details on using hashes here.

Provenance

The following attestation bundles were made for lqck-0.1.0a1-py3-none-any.whl:

Publisher: publish.yaml on Liquid4All/cluster-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page