Liquid cluster kit to launch and manage long-lived Slurm services jobs.
Project description
cluster-kit
Cluster toolkit for Liquid AI. First package: cluster_kit.slurm_services — a
lightweight SDK to launch and manage long-lived Slurm service jobs (e.g. an
inference server used as an LLM judge) whose lifecycle is bound to one or more
consumer jobs.
No database, no central worker: the shared filesystem is the registry and
squeue is the source of truth for liveness. Multiple consumers can share
one warm service and it self-reaps once nobody is using it.
The SDK is cluster-agnostic. The bundled examples target a GPU Slurm cluster
that schedules by QoS (--gpus-per-node / --cpus-per-gpu), runs ROCm, has
uv preinstalled, and exposes a shared $HOME — adjust the resource flags and
server entrypoint for your own cluster.
Install
uv add lqck # or: pip install lqck
The distribution is lqck; the import package is cluster_kit
(pip install lqck, then import cluster_kit / python -m cluster_kit.slurm_services).
Runtime is dependency-free (stdlib + the Slurm CLI).
Local dev: uv sync --extra dev && uv run pytest (no cluster needed).
Run it on the cluster
Everything goes through Slurm. The consumer/driver is itself a job, and the
SDK does a nested sbatch from inside it to launch the GPU-backed service
job; the two find each other via the registry and talk over HTTP.
cd cluster-kit && mkdir -p logs
# Smallest end-to-end: bring up LiquidAI/LFM2.5-350M and send one "Hello".
sbatch examples/hello_lfm.slurm
tail -f logs/lfm-hello_*.log # submit -> RUNNING -> healthy -> response
# Real pattern: a training job that shares a warm judge with other runs.
sbatch examples/train_with_judge.slurm
Service stdout (sglang startup) lands in ~/.slurm_services/<name>/service-<jobid>.out.
Preview the exact sbatch without submitting anything:
uv run python -m cluster_kit.slurm_services ensure \
--name lfm-hello --entrypoint "$PWD/examples/start_model_server.sh" \
--gpus-per-node 1 --cpus-per-gpu 16 \
--port 8000 --env MODEL=LiquidAI/LFM2.5-350M --env PORT=8000 --dry-run
Python API
from cluster_kit.slurm_services import HealthCheck, Resources, ServiceSpec, slurm_service
spec = ServiceSpec(
name="lfm-hello",
entrypoint="examples/start_model_server.sh", # reused as-is; serves an OpenAI API
resources=Resources(gpus_per_node=1, cpus_per_gpu=16, time_limit="00:30:00"),
env={"MODEL": "LiquidAI/LFM2.5-350M", "PORT": "8000"},
port=8000,
health_check=HealthCheck(path="/health", timeout_s=600),
idle_timeout_s=600, # keep warm for the next run to reuse
fingerprint_keys=["MODEL"], # only reuse a service running this model
)
with slurm_service(spec) as svc: # blocks until /health passes
reply = say_hello(svc.url) # POST {svc.url}/v1/chat/completions
# released on exit; reaped even on SIGKILL / node loss
The CLI (ensure / heartbeat / release) is the same thing for shell-driven
*.slurm scripts — see examples/train_with_judge.slurm.
Logging
Verbose by default; set SLURM_SERVICES_LOG to DEBUG / INFO / WARNING. At
INFO you always get the exact, copy-pasteable submission and every state change:
INFO [slurm_service] submitting: sbatch --parsable --job-name=llm-judge ... wrapper.sh
INFO [slurm_service] submitted job 90210
INFO [slurm_service] job 90210 state: PENDING -> RUNNING
INFO [slurm_service] service 'llm-judge' healthy at node01:8000 (job 90210)
How it works
ensure_service (and the slurm_service context manager over it) runs inside
the consumer job and: checks the registry for a healthy same-fingerprint service
to reuse; otherwise takes an atomic lock, renders a wrapper around your
entrypoint, and sbatches it; polls squeue to RUNNING then /health to 200;
registers a lease and returns a Handle. On exit it drops the lease.
Reaping is belt-and-suspenders (since atexit doesn't fire on SIGKILL/node loss):
- Consumer side — a background thread renews this consumer's lease file.
- Service side — a watcher inside the service job
scancels itself once no lease is live (fresh heartbeat, or the lease's parent job still insqueue), after anidle_timeout_sgrace window.
Key design choices:
- Lease set, not a single parent — the service stays up while ≥1 consumer holds a live lease, so concurrent runs share it; one consumer is just N=1.
- Fingerprint-gated reuse — a same-name service running a different model
raises
FingerprintMismatchrather than handing back the wrong endpoint. --exportomitted by default so the service inherits the consumer job's modules + venv (setexport_env="NONE"for a clean env).- Registry location — per-user
~/.slurm_servicesby default ($HOMEis shared on the cluster, so it's reachable from every node); set$SLURM_SERVICES_ROOTto a shared path for team-wide sharing.
Layout
src/cluster_kit/slurm_services/
__init__.py slurm_service(), ensure_service(), release_service(), Handle, exceptions
__main__.py CLI: ensure / heartbeat / release (+ --dry-run)
config.py Resources, HealthCheck, ServiceSpec (+ fingerprint)
slurm.py sbatch/squeue/scancel/sacct wrappers
registry.py shared-dir lookup-or-create lock + lease set
health.py HTTP /health polling gate
heartbeat.py lease renewer (consumer) + self-suicide watcher (service)
wrapper.py generated batch wrapper
logutil.py logging
examples/
hello_lfm.slurm `sbatch` this for the smallest end-to-end run
hello_lfm.py the Python driver it runs
train_with_judge.slurm shell-driven consumer: shared judge + training
start_model_server.sh server entrypoint: sglang-ROCm container (OpenAI API + /health)
Status & roadmap
The SDK is implemented and unit-tested. Remaining work and possible follow-ups:
To do
- Tag the first release (
v0.1.0) so the git-install pin in Install resolves. - Validate one real end-to-end run on the AMD cluster — confirm the sglang-ROCm image serves the chosen LFM2 model and that a CPU-only consumer job schedules — then retire the old 2-node judge setup.
Possible future improvements
- Hetjob co-scheduling if/when the cluster supports
--hetjob(today: dependency- client-side health gate, which is authoritative regardless).
- Service restart / endpoint hot-swap (today: dependent consumers fail fast).
- A private package index, only if git-install friction shows up.
- Optional fire-and-forget status POST for dashboard visibility (never a dependency).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file lqck-0.1.0a2.tar.gz.
File metadata
- Download URL: lqck-0.1.0a2.tar.gz
- Upload date:
- Size: 27.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5e3ebedffd2d56908b71cd92d61358913120778059c584f969b0b9771d9a1c98
|
|
| MD5 |
5c1c58fabfc89454484674a590e4ba22
|
|
| BLAKE2b-256 |
3c7dd1d004d6e205cebeee58009cf769d465ae8b3cdf3d32db22027a23b02820
|
Provenance
The following attestation bundles were made for lqck-0.1.0a2.tar.gz:
Publisher:
publish.yaml on Liquid4All/cluster-kit
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
lqck-0.1.0a2.tar.gz -
Subject digest:
5e3ebedffd2d56908b71cd92d61358913120778059c584f969b0b9771d9a1c98 - Sigstore transparency entry: 1706834514
- Sigstore integration time:
-
Permalink:
Liquid4All/cluster-kit@08345e09ce2c72746c43a29fde92388d4eff4750 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/Liquid4All
-
Access:
private
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yaml@08345e09ce2c72746c43a29fde92388d4eff4750 -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file lqck-0.1.0a2-py3-none-any.whl.
File metadata
- Download URL: lqck-0.1.0a2-py3-none-any.whl
- Upload date:
- Size: 31.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9bca009af01e3b0f6b2057e7dae0ad0f8987a5d2ae48f2262213da87e7758120
|
|
| MD5 |
e95299bb755218a10e51ac7e401dd975
|
|
| BLAKE2b-256 |
37eb54f5b3029a3122bae3fa61393891e58ccbd881ff6ac4f50a34af01d24c07
|
Provenance
The following attestation bundles were made for lqck-0.1.0a2-py3-none-any.whl:
Publisher:
publish.yaml on Liquid4All/cluster-kit
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
lqck-0.1.0a2-py3-none-any.whl -
Subject digest:
9bca009af01e3b0f6b2057e7dae0ad0f8987a5d2ae48f2262213da87e7758120 - Sigstore transparency entry: 1706834548
- Sigstore integration time:
-
Permalink:
Liquid4All/cluster-kit@08345e09ce2c72746c43a29fde92388d4eff4750 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/Liquid4All
-
Access:
private
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yaml@08345e09ce2c72746c43a29fde92388d4eff4750 -
Trigger Event:
workflow_dispatch
-
Statement type: