Liquid cluster kit to launch and manage long-lived Slurm services jobs.

These details have been verified by PyPI

Project links

Owner

Liquid AI

GitHub Statistics

These details have not been verified by PyPI

Project description

cluster-kit

Cluster toolkit for Liquid AI. First package: cluster_kit.slurm_services — a lightweight SDK to launch and manage long-lived Slurm service jobs (e.g. an inference server used as an LLM judge) whose lifecycle is bound to one or more consumer jobs.

No database, no central worker: the shared filesystem is the registry and squeue is the source of truth for liveness. Multiple consumers can share one warm service and it self-reaps once nobody is using it.

The SDK is cluster-agnostic. The bundled examples target a GPU Slurm cluster that schedules by QoS (--gpus-per-node / --cpus-per-gpu), runs ROCm, has uv preinstalled, and exposes a shared $HOME — adjust the resource flags and server entrypoint for your own cluster.

Install

uv add lqck        # or: pip install lqck

The distribution is lqck; the import package is cluster_kit (pip install lqck, then import cluster_kit / python -m cluster_kit.slurm_services). Runtime is dependency-free (stdlib + the Slurm CLI).

Local dev: uv sync --extra dev && uv run pytest (no cluster needed).

Run it on the cluster

Everything goes through Slurm. The consumer/driver is itself a job, and the SDK does a nested sbatch from inside it to launch the GPU-backed service job; the two find each other via the registry and talk over HTTP.

cd cluster-kit && mkdir -p logs

# Smallest end-to-end: bring up LiquidAI/LFM2.5-350M and send one "Hello".
sbatch examples/hello_lfm.slurm
tail -f logs/lfm-hello_*.log        # submit -> RUNNING -> healthy -> response

# Real pattern: a training job that shares a warm judge with other runs.
sbatch examples/train_with_judge.slurm

Service stdout (sglang startup) lands in ~/.slurm_services/<name>/service-<jobid>.out.

Preview the exact sbatch without submitting anything:

uv run python -m cluster_kit.slurm_services ensure \
    --name lfm-hello --entrypoint "$PWD/examples/start_model_server.sh" \
    --gpus-per-node 1 --cpus-per-gpu 16 \
    --port 8000 --env MODEL=LiquidAI/LFM2.5-350M --env PORT=8000 --dry-run

Python API

from cluster_kit.slurm_services import HealthCheck, Resources, ServiceSpec, slurm_service

spec = ServiceSpec(
    name="lfm-hello",
    entrypoint="examples/start_model_server.sh",   # reused as-is; serves an OpenAI API
    resources=Resources(gpus_per_node=1, cpus_per_gpu=16, time_limit="00:30:00"),
    env={"MODEL": "LiquidAI/LFM2.5-350M", "PORT": "8000"},
    port=8000,
    health_check=HealthCheck(path="/health", timeout_s=600),
    idle_timeout_s=600,            # keep warm for the next run to reuse
    fingerprint_keys=["MODEL"],    # only reuse a service running this model
)

with slurm_service(spec) as svc:   # blocks until /health passes
    reply = say_hello(svc.url)     # POST {svc.url}/v1/chat/completions
# released on exit; reaped even on SIGKILL / node loss

The CLI (ensure / heartbeat / release) is the same thing for shell-driven *.slurm scripts — see examples/train_with_judge.slurm.

Logging

Verbose by default; set SLURM_SERVICES_LOG to DEBUG / INFO / WARNING. At INFO you always get the exact, copy-pasteable submission and every state change:

INFO [slurm_service] submitting: sbatch --parsable --job-name=llm-judge ... wrapper.sh
INFO [slurm_service] submitted job 90210
INFO [slurm_service] job 90210 state: PENDING -> RUNNING
INFO [slurm_service] service 'llm-judge' healthy at node01:8000 (job 90210)

How it works

ensure_service (and the slurm_service context manager over it) runs inside the consumer job and: checks the registry for a healthy same-fingerprint service to reuse; otherwise takes an atomic lock, renders a wrapper around your entrypoint, and sbatches it; polls squeue to RUNNING then /health to 200; registers a lease and returns a Handle. On exit it drops the lease.

Reaping is belt-and-suspenders (since atexit doesn't fire on SIGKILL/node loss):

Consumer side — a background thread renews this consumer's lease file.
Service side — a watcher inside the service job scancels itself once no lease is live (fresh heartbeat, or the lease's parent job still in squeue), after an idle_timeout_s grace window.

Key design choices:

Lease set, not a single parent — the service stays up while ≥1 consumer holds a live lease, so concurrent runs share it; one consumer is just N=1.
Fingerprint-gated reuse — a same-name service running a different model raises FingerprintMismatch rather than handing back the wrong endpoint.
--export omitted by default so the service inherits the consumer job's modules + venv (set export_env="NONE" for a clean env).
Registry location — per-user ~/.slurm_services by default ($HOME is shared on the cluster, so it's reachable from every node); set $SLURM_SERVICES_ROOT to a shared path for team-wide sharing.

Layout

src/cluster_kit/slurm_services/
  __init__.py   slurm_service(), ensure_service(), release_service(), Handle, exceptions
  __main__.py   CLI: ensure / heartbeat / release (+ --dry-run)
  config.py     Resources, HealthCheck, ServiceSpec (+ fingerprint)
  slurm.py      sbatch/squeue/scancel/sacct wrappers
  registry.py   shared-dir lookup-or-create lock + lease set
  health.py     HTTP /health polling gate
  heartbeat.py  lease renewer (consumer) + self-suicide watcher (service)
  wrapper.py    generated batch wrapper
  logutil.py    logging
examples/
  hello_lfm.slurm        `sbatch` this for the smallest end-to-end run
  hello_lfm.py           the Python driver it runs
  train_with_judge.slurm shell-driven consumer: shared judge + training
  start_model_server.sh  server entrypoint: sglang-ROCm container (OpenAI API + /health)

Status & roadmap

The SDK is implemented and unit-tested. Remaining work and possible follow-ups:

To do

Tag the first release (v0.1.0) so the git-install pin in Install resolves.
Validate one real end-to-end run on the AMD cluster — confirm the sglang-ROCm image serves the chosen LFM2 model and that a CPU-only consumer job schedules — then retire the old 2-node judge setup.

Possible future improvements

Hetjob co-scheduling if/when the cluster supports --hetjob (today: dependency
- client-side health gate, which is authoritative regardless).
Service restart / endpoint hot-swap (today: dependent consumers fail fast).
A private package index, only if git-install friction shows up.
Optional fire-and-forget status POST for dashboard visibility (never a dependency).

Project details

These details have been verified by PyPI

Project links

Owner

Liquid AI

GitHub Statistics

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.1.1

Jun 3, 2026

This version

0.1.0a2 pre-release

Jun 3, 2026

0.1.0a1 pre-release

Jun 2, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lqck-0.1.0a2.tar.gz (27.3 kB view details)

Uploaded Jun 3, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

lqck-0.1.0a2-py3-none-any.whl (31.5 kB view details)

Uploaded Jun 3, 2026 Python 3

File details

Details for the file lqck-0.1.0a2.tar.gz.

File metadata

Download URL: lqck-0.1.0a2.tar.gz
Upload date: Jun 3, 2026
Size: 27.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lqck-0.1.0a2.tar.gz
Algorithm	Hash digest
SHA256	`5e3ebedffd2d56908b71cd92d61358913120778059c584f969b0b9771d9a1c98`
MD5	`5c1c58fabfc89454484674a590e4ba22`
BLAKE2b-256	`3c7dd1d004d6e205cebeee58009cf769d465ae8b3cdf3d32db22027a23b02820`

See more details on using hashes here.

Provenance

The following attestation bundles were made for lqck-0.1.0a2.tar.gz:

Publisher: publish.yaml on Liquid4All/cluster-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lqck-0.1.0a2.tar.gz
- Subject digest: 5e3ebedffd2d56908b71cd92d61358913120778059c584f969b0b9771d9a1c98
- Sigstore transparency entry: 1706834514
- Sigstore integration time: Jun 3, 2026
Source repository:
- Permalink: Liquid4All/cluster-kit@08345e09ce2c72746c43a29fde92388d4eff4750
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Liquid4All
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@08345e09ce2c72746c43a29fde92388d4eff4750
- Trigger Event: workflow_dispatch

File details

Details for the file lqck-0.1.0a2-py3-none-any.whl.

File metadata

Download URL: lqck-0.1.0a2-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 31.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for lqck-0.1.0a2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9bca009af01e3b0f6b2057e7dae0ad0f8987a5d2ae48f2262213da87e7758120`
MD5	`e95299bb755218a10e51ac7e401dd975`
BLAKE2b-256	`37eb54f5b3029a3122bae3fa61393891e58ccbd881ff6ac4f50a34af01d24c07`

See more details on using hashes here.

Provenance

The following attestation bundles were made for lqck-0.1.0a2-py3-none-any.whl:

Publisher: publish.yaml on Liquid4All/cluster-kit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: lqck-0.1.0a2-py3-none-any.whl
- Subject digest: 9bca009af01e3b0f6b2057e7dae0ad0f8987a5d2ae48f2262213da87e7758120
- Sigstore transparency entry: 1706834548
- Sigstore integration time: Jun 3, 2026
Source repository:
- Permalink: Liquid4All/cluster-kit@08345e09ce2c72746c43a29fde92388d4eff4750
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Liquid4All
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@08345e09ce2c72746c43a29fde92388d4eff4750
- Trigger Event: workflow_dispatch

lqck 0.1.0a2

Navigation

Verified details

Project links

Owner

GitHub Statistics

Unverified details

Meta

Classifiers

Project description

cluster-kit

Install

Run it on the cluster

Python API

Logging

How it works

Layout

Status & roadmap

Project details

Verified details

Project links

Owner

GitHub Statistics

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance