# slullama

Shared Ollama gateway over Slurm — wake a GPU node on demand.
A slumbering llama that wakes on demand. One package, two roles: a server daemon on the Slurm head node and a client library on team members' laptops. The first API request submits a Slurm job, opens an SSH tunnel to the compute node, and proxies standard Ollama API traffic. When nobody has made a request for a configurable idle period, the job is torn down automatically. The next request wakes it all back up.

Multi-user by design — the whole team shares one GPU node and one Ollama process. Authentication is via a shared bearer token.
## Architecture

```
laptop (any team member)            head node                        compute node (GPU)
        │                              │                                  │
        │──SSH tunnel (port 11434)────▶│                                  │
        │                              │──SSH tunnel (port 19434)────────▶│
        │                              │  aiohttp reverse proxy           │  ollama serve
        │                              │  (listens 0.0.0.0:11435)         │  (0.0.0.0:11434)
        │                              │  + SlurmManager                  │
        │                              │  + idle watchdog                 │
        │                              │                                  │
```
### Request flow

1. Client code (litellm, `SlulamaClient`, or raw curl) hits `localhost:11434` on the laptop.
2. The SSH tunnel forwards the request to the head node proxy on port 11435.
3. The proxy checks the bearer token (`Authorization: Bearer <token>`).
4. If no Slurm job is running, the proxy:
   a. Renders the sbatch template and submits it (`sbatch`).
   b. Polls `scontrol show job` until the job reaches `RUNNING`.
   c. Opens an SSH tunnel from the head node to the compute node's Ollama port.
   d. Polls Ollama's `/api/tags` health endpoint through the tunnel until healthy.
5. The proxy forwards the request to Ollama (streaming supported).
6. On every proxied request the Slurm job's time limit is extended (or the last-activity timestamp is recorded, depending on the keep-alive strategy).
7. A background watchdog checks idle time every 30 s. If idle ≥ `idle_timeout` minutes, it cancels the job and closes the tunnels (see the sketch below).
8. The next request goes back to step 4.
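The idle teardown in step 7 is just a periodic check against a last-activity timestamp. A minimal sketch of such a watchdog loop (the names `state`, `last_activity`, and `teardown` are illustrative, not slullama's actual internals):

```python
import asyncio
import time

async def idle_watchdog(state: dict, idle_timeout_min: int, teardown) -> None:
    """Every 30 s, tear the job down once idle time reaches the limit."""
    while True:
        await asyncio.sleep(30)
        idle_min = (time.monotonic() - state["last_activity"]) / 60
        if state["job_running"] and idle_min >= idle_timeout_min:
            await teardown()              # scancel the job, close the head-to-compute tunnel
            state["job_running"] = False
```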
### Two SSH hops

Both hops are managed automatically:

- **Laptop → head node**: managed by `ClientTunnel` (client side). Sync or async. Opened lazily on first request, closed on process exit / context manager exit.
- **Head node → compute node**: managed by `SshTunnel` (server side). Opened after the Slurm job reaches `RUNNING`, closed on idle teardown.

Compute nodes are typically not reachable from outside the cluster, which is why both hops are necessary.
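Each hop is, in effect, an `ssh -N -L` port-forward. For illustration, a rough stand-in for the laptop-side hop using only the standard library (host and ports taken from the config examples below; `ClientTunnel` automates this and adds lifecycle handling):

```python
import subprocess

# Forward local port 11434 to the head node proxy on 11435.
# "youruser@headnode" is a placeholder for your own SSH destination.
tunnel = subprocess.Popen(
    ["ssh", "-N", "-L", "11434:localhost:11435", "youruser@headnode"]
)
try:
    pass  # anything that speaks Ollama can now use http://localhost:11434
finally:
    tunnel.terminate()
```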
## Quick start

### 1. Head node (server)

```bash
pip install slullama          # or: uv pip install slullama
mkdir -p ~/.config/slullama
```
Create `~/.config/slullama/config.toml`:

```toml
[server]
port = 11435                  # proxy listens here
token = "your-shared-secret"  # bearer token for auth
log_dir = "/tmp/slullama"     # sbatch scripts + job logs

[slurm]
partition = "gpu"
gres = "gpu:1"
mem = ""                      # optional, e.g. "64G"
time = "4:00:00"              # initial job time limit
idle_timeout = 30             # minutes of inactivity before teardown
keep_alive = "extend"         # "extend" or "cancel" (see below)
extra_args = []               # extra #SBATCH lines, e.g. ["--exclusive"]

[ollama]
port = 11434                  # port ollama listens on inside the compute node
binary = "ollama"             # path to ollama binary (must be on compute node)
models_dir = ""               # OLLAMA_MODELS dir; empty = ollama default
pre_pull = []                 # models to pull on cold start, e.g. ["qwen3.5:9b"]

# If the binary is NOT on a shared filesystem, copy it per job:
# copy_binary = true
# copy_source = "/shared/bin/ollama"
# cleanup_binary = true       # rm the copy when the job ends
```
Start the daemon (foreground; use tmux/screen/systemd for persistence):

```bash
slullama serve                                     # uses ~/.config/slullama/config.toml
slullama serve --config /etc/slullama.toml         # explicit path
slullama serve --port 9999 --partition gpu-large   # CLI overrides
slullama serve -v                                  # debug logging
```
### 2. Laptop (client)

```bash
pip install slullama              # base client
pip install "slullama[litellm]"   # with litellm integration
```

Create `~/.config/slullama/config.toml` (the `[client]` section alone is fine):

```toml
[client]
host = "youruser@headnode"    # SSH destination
server_port = 11435           # must match server's port
token = "your-shared-secret"  # must match server's token
local_port = 11434            # local Ollama-compatible endpoint
```
#### Option A — litellm (recommended)

```python
import slullama   # auto-registers the "slullama/" provider on import
import litellm

# Synchronous
resp = litellm.completion(
    model="slullama/qwen3.5:9b",
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)

# Streaming
for chunk in litellm.completion(
    model="slullama/qwen3.5:9b",
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
):
    print(chunk.choices[0].delta.content, end="")
```

Under the hood this opens the SSH tunnel on the first call (lazily), proxies through the head node, and returns standard litellm `ModelResponse` objects.
#### Option B — Python client (async)

```python
from slullama import SlulamaClient

async with SlulamaClient(host="user@headnode") as client:
    # Ollama-compatible chat
    resp = await client.chat("qwen3.5:9b", messages=[
        {"role": "user", "content": "What is electrochemistry?"},
    ])
    print(resp["message"]["content"])

    # List models
    tags = await client.tags()
    print(tags)

    # Server status (job state, idle time, etc.)
    status = await client.status()
    print(status)

    # Raw Ollama URL for any tool that speaks Ollama
    print(client.ollama_url)   # http://localhost:11434
```
#### Option C — Python client (singleton, no context manager)

```python
from slullama import SlulamaClient

client = SlulamaClient.get_default()   # lazy singleton
client.connect_sync()                  # opens SSH tunnel (blocking)
print(client.ollama_url)               # http://localhost:11434
# Tunnel stays open until the process exits (atexit hook).
```
#### Option D — CLI tunnel (foreground)

```bash
slullama connect user@headnode
# Tunnel open: localhost:11434 → headnode:11435
# Press Ctrl+C to close.

# Now any Ollama-speaking tool works:
curl http://localhost:11434/api/tags
```
### 3. Check status

```bash
# From the head node (no tunnel needed):
slullama status

# From a laptop (uses config or args):
slullama status headnode --port 11435 --token your-shared-secret
```

Returns JSON with job state, node, idle time, request count, uptime, etc.
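With a tunnel open (any of the options above), the status endpoint can also be queried directly. A small sketch with the standard library; whether `/slullama/status` itself requires the bearer token is not spelled out above, so the header is included defensively, and the field names printed at the end are illustrative:

```python
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/slullama/status",
    headers={"Authorization": "Bearer your-shared-secret"},
)
with urllib.request.urlopen(req) as resp:
    status = json.load(resp)

# Field names below are illustrative; inspect the JSON for what the server actually returns.
print(status.get("job_state"), status.get("idle_seconds"))
```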
## Full configuration reference

Config file: `~/.config/slullama/config.toml`
Override path: `--config <path>` or `SLULLAMA_CONFIG=<path>`
### [server] — head node daemon

| Key | Type | Default | Description |
|---|---|---|---|
| `port` | int | `11435` | Port the proxy listens on |
| `token` | str | `""` | Bearer token for auth (empty = no auth) |
| `log_dir` | str | `"/tmp/slullama"` | Directory for sbatch scripts and job logs |
| `job_template` | str | `""` | Path to custom sbatch template (empty = built-in) |
### [slurm] — Slurm job parameters

| Key | Type | Default | Description |
|---|---|---|---|
| `partition` | str | `"gpu"` | Slurm partition |
| `gres` | str | `"gpu:1"` | Generic resources |
| `mem` | str | `""` | Memory (e.g. "64G"); empty = cluster default |
| `time` | str | `"4:00:00"` | Initial job time limit |
| `idle_timeout` | int | `30` | Minutes of inactivity before teardown |
| `keep_alive` | str | `"extend"` | "extend" or "cancel" (see below) |
| `extra_args` | list | `[]` | Extra #SBATCH directives (e.g. ["--exclusive"]) |
### [ollama] — Ollama on the compute node

| Key | Type | Default | Description |
|---|---|---|---|
| `port` | int | `11434` | Port ollama listens on |
| `binary` | str | `"ollama"` | Path to ollama binary |
| `models_dir` | str | `""` | OLLAMA_MODELS dir; empty = ollama default |
| `copy_binary` | bool | `false` | Copy binary to compute node per job |
| `copy_source` | str | `""` | Source path for copy |
| `cleanup_binary` | bool | `false` | Delete the copy when the job ends |
| `pre_pull` | list | `[]` | Models to pull after ollama starts |
### [client] — laptop / workstation

| Key | Type | Default | Description |
|---|---|---|---|
| `host` | str | `""` | SSH destination (user@headnode) |
| `server_port` | int | `11435` | Proxy port on head node |
| `token` | str | `""` | Bearer token |
| `local_port` | int | `11434` | Local port (appears as Ollama to tools) |
### Environment variable overrides

Environment variables override the config file. Useful for CI, containers, or quick one-offs.

| Variable | Overrides |
|---|---|
| `SLULLAMA_CONFIG` | Config file path |
| `SLULLAMA_TOKEN` | server.token and client.token |
| `SLULLAMA_HOST` | client.host |
| `SLULLAMA_SERVER_PORT` | server.port and client.server_port |
| `SLULLAMA_LOCAL_PORT` | client.local_port |
| `SLULLAMA_PARTITION` | slurm.partition |
| `SLULLAMA_GRES` | slurm.gres |
| `SLULLAMA_IDLE_TIMEOUT` | slurm.idle_timeout |
| `SLULLAMA_OLLAMA_BINARY` | ollama.binary |
| `SLULLAMA_OLLAMA_PORT` | ollama.port |
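For example, a CI job can repoint the client for a single process without editing the TOML file. A sketch, assuming `Config.load()` can be called with no arguments and that the loaded object exposes a `client` section, as the module map below suggests:

```python
import os
from slullama import Config  # public API per the module map

# Process-local overrides; everything else still comes from config.toml if present.
os.environ["SLULLAMA_HOST"] = "ci-user@headnode"
os.environ["SLULLAMA_TOKEN"] = "your-shared-secret"

cfg = Config.load()      # reads TOML, then applies env overrides
print(cfg.client.host)   # ci-user@headnode
```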
## Keep-alive strategies

| Strategy | How it works | When to use |
|---|---|---|
| `extend` (default) | On each request, runs `scontrol update JobId=X TimeLimit=+Nmin` to push the deadline forward. | Clusters that allow job time extensions. |
| `cancel` | Submits the job with the full time limit. On idle timeout, runs `scancel`. The next request resubmits. | Clusters that restrict `scontrol update`. |

Set via `keep_alive` in the `[slurm]` config, or try both on your cluster — the server logs a warning if `scontrol update` fails.
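Each strategy boils down to one Slurm command per event. A sketch of roughly what the server runs (the real implementation lives in `SlurmManager` as async subprocess calls; the function names here are illustrative):

```python
import subprocess

def extend_job(job_id: int, minutes: int) -> bool:
    """keep_alive = "extend": push the running job's time limit forward."""
    result = subprocess.run(
        ["scontrol", "update", f"JobId={job_id}", f"TimeLimit=+{minutes}"],
        capture_output=True, text=True,
    )
    return result.returncode == 0  # False on clusters that forbid extensions

def cancel_job(job_id: int) -> None:
    """keep_alive = "cancel": tear the job down on idle timeout; the next request resubmits."""
    subprocess.run(["scancel", str(job_id)], check=False)
```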
## Custom job template

The default sbatch template is in `slullama/template.py`. To use your own:

```toml
[server]
job_template = "/path/to/my_template.sh"
```

Templates use Python `string.Template` syntax (`${variable_name}`). Available variables:

| Variable | Value |
|---|---|
| `${partition}` | Slurm partition |
| `${gres}` | GRES string |
| `${time}` | Time limit |
| `${log_dir}` | Log directory |
| `${extra_sbatch}` | Extra #SBATCH lines (rendered from mem + extra_args) |
| `${copy_commands}` | Binary copy commands (or a comment if disabled) |
| `${ollama_port}` | Ollama listen port |
| `${models_env}` | `export OLLAMA_MODELS=...` (or a comment if not set) |
| `${cleanup_commands}` | Binary cleanup commands (or a comment if disabled) |
| `${ollama_binary}` | Path to ollama binary (may be the copied path) |
| `${pull_commands}` | `ollama pull` commands (or a comment if empty) |

Use `$$` for a literal `$` in bash (e.g. `$$SLURM_JOB_ID`, `$$!`).
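The substitution and the `$$` escape behave exactly like the standard library's `string.Template`. A tiny self-contained illustration (the two-line template here is made up for the example; it is not the built-in one):

```python
from string import Template

tmpl = Template(
    "#SBATCH --partition=${partition}\n"
    'echo "job $$SLURM_JOB_ID serving on port ${ollama_port}"\n'
)
print(tmpl.substitute(partition="gpu", ollama_port=11434))
# $$SLURM_JOB_ID renders as $SLURM_JOB_ID, left for bash to expand at run time.
```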
## Module map

```
src/slullama/
├── __init__.py           # Public API: SlulamaClient, Config; auto-registers litellm
├── config.py             # Dataclasses: Config, ServerConfig, ClientConfig, SlurmConfig, OllamaConfig
│                         # Config.load() reads TOML + env overrides
├── template.py           # DEFAULT_TEMPLATE + render_template(ServerConfig) → str
├── cli.py                # CLI: slullama serve | connect | status
├── litellm_provider.py   # litellm CustomLLM subclass, registers "slullama/" prefix
├── server/
│   ├── slurm.py          # SlurmManager: submit(), query(), extend_time(), cancel(), wait_for_running()
│   ├── tunnel.py         # SshTunnel: head node → compute node SSH port-forward
│   └── proxy.py          # OllamaProxy: aiohttp reverse proxy, auth, idle watchdog, /slullama/status
└── client/
    ├── tunnel.py         # ClientTunnel: laptop → head node SSH port-forward (sync + async)
    └── client.py         # SlulamaClient: high-level client, context manager, singleton
```
## Key classes

- `SlurmManager` (server/slurm.py): Wraps `sbatch`, `scontrol`, and `scancel` as async subprocess calls. Tracks job ID and node. Parses `scontrol show job` output for state, node, and time left.
- `SshTunnel` (server/tunnel.py): Async SSH port-forward via `ssh -N -L`. Used on the head node to reach the compute node.
- `OllamaProxy` (server/proxy.py): aiohttp web app that:
  - Serves `/slullama/status` (GET) — JSON status endpoint.
  - Proxies everything else (`/{path:.*}`) to Ollama on the compute node.
  - Manages the full lifecycle: job submission → tunnel → health poll → proxy → idle watchdog → teardown.
  - `_boot_lock` ensures only one concurrent boot sequence.
  - Idle watchdog runs every 30 s, tears down if idle ≥ timeout.
- `ClientTunnel` (client/tunnel.py): SSH port-forward from laptop to head node. Has both sync (`open_sync`/`close_sync` for litellm/atexit) and async (`open`/`close`) APIs.
- `SlulamaClient` (client/client.py): High-level client.
  - Context manager: `async with SlulamaClient(...) as c:`
  - Lazy singleton: `SlulamaClient.get_default()` for litellm provider use.
  - Methods: `chat()`, `tags()`, `status()`.
  - Property: `ollama_url` — the local URL that speaks Ollama.
- `SlulamaLLM` (litellm_provider.py): litellm `CustomLLM` subclass with `completion()` and `streaming()`. Registered into `litellm.custom_provider_map` on `import slullama`.
## Integration with acatome-lambic

The existing `LlmClient` in acatome-lambic supports `provider="ollama"` with a configurable `ollama_url`. Two integration paths:

- **Via litellm**: Set `provider="slullama"` and `model="qwen3.5:9b"` in `LlmConfig`. Since acatome-lambic already uses litellm as a backend, this works if slullama is installed with the `[litellm]` extra.
- **Direct**: Run `slullama connect` in a terminal to create a local tunnel, then use `provider="ollama"` with `ollama_url="http://localhost:11434"`. No code changes needed in acatome-lambic (see the sketch below).
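For the direct path, the change on the acatome-lambic side is configuration only. A sketch, assuming `LlmConfig` accepts these fields as constructor keywords and is importable as shown (both assumptions about acatome-lambic, not documented here):

```python
# Prerequisite: `slullama connect youruser@headnode` running in another terminal.
# Import path and field names below are assumed, not taken from acatome-lambic's docs.
from acatome_lambic import LlmClient, LlmConfig

config = LlmConfig(
    provider="ollama",
    model="qwen3.5:9b",
    ollama_url="http://localhost:11434",  # the slullama tunnel endpoint
)
client = LlmClient(config)
```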
## Known limitations / future work

- **Cold start latency**: The first request after idle waits for Slurm job allocation + Ollama startup (30–120 s depending on the cluster). The proxy blocks the request until ready; it does not return 503.
- **Single GPU node**: Currently manages one Slurm job. Multiple concurrent jobs / load balancing is out of scope for v1.
- **No model routing**: All requests go to the same Ollama instance. Model loading/unloading is handled by Ollama natively.
- **`scontrol update` permissions**: Some clusters don't allow users to extend job time. Use `keep_alive = "cancel"` as a workaround.
- **No HTTPS**: Tunnel traffic is encrypted by SSH; the HTTP proxy itself is plaintext. Fine for a cluster environment.
- **Ollama binary distribution**: If there is no shared filesystem, the binary must be copied per job (`copy_binary = true`). This adds cold start time.
## CLI reference

```text
slullama serve   [--port N] [--token T] [--partition P] [--gres G]
                 [--idle-timeout M] [--config PATH] [-v]
```
Start the proxy daemon on the head node.

```text
slullama connect [user@headnode] [--port N] [--token T]
                 [--local-port N] [--config PATH] [-v]
```
Open an SSH tunnel to the head node (foreground, Ctrl+C to close).

```text
slullama status  [headnode] [--port N] [--token T] [--config PATH]
```
Query the server's `/slullama/status` endpoint and print JSON.
## License

MIT