Generate Shenron docker-compose deployments from model config files
Shenron
Shenron is a config-driven toolkit for deploying production LLM inference stacks. It supports two deployment modes:
- Helm chart (recommended) — deploy on any Kubernetes cluster (k3s, microk8s, GKE, EKS, …)
- Docker Compose (legacy) — single-node docker-compose deployments
Helm Deployment (Recommended)
Architecture
All external traffic enters through a single Caddy reverse proxy (the only LoadBalancer service). Caddy routes requests by path prefix to internal ClusterIP services:
Internet → Caddy (LoadBalancer :80/:443)
├── /llm/* → onwards (OpenAI-compatible API gateway)
├── /replica/* → replica-manager (scaling API)
└── /metrics/* → prometheus (metrics)
Caddy uses handle_path directives, which both match and strip the prefix. This means:
- GET /llm/v1/models → forwards as GET /v1/models to onwards
- GET /metrics → forwards as GET / to prometheus (which serves its UI at /)
- POST /replica/v1/models/Qwen%2FQwen3-0.6B/replicas → forwards to replica-manager
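As an illustrative sketch only (service names and ports are assumptions based on the chart defaults listed later, not the actual generated Caddyfile), the routing corresponds to handle_path blocks like:

```caddyfile
# Hypothetical sketch of the routing above; handle_path both matches
# the prefix and strips it before proxying upstream.
:80 {
    handle_path /llm/* {
        reverse_proxy shenron-onwards:3000
    }
    handle_path /replica/* {
        reverse_proxy shenron-replica-manager:8081
    }
    handle_path /metrics/* {
        reverse_proxy shenron-prometheus:9090
    }
}
```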
Behind onwards, per-model components are deployed:
onwards → router (per model, cache-aware load balancing)
└── model pod(s) (SGLang/vLLM, GPU workloads)
Prerequisites
Kubernetes cluster with:
- GPU nodes with NVIDIA drivers installed
- NVIDIA device plugin (nvidia.com/gpu resource available)
- A RuntimeClass named nvidia
- A default StorageClass (for Caddy certificate persistence)
Node Setup (microk8s)
# 1. Disable the built-in ingress addon (it binds to host ports 80/443 and
# intercepts all traffic before Caddy can serve ACME challenges)
sudo microk8s disable ingress
# 2. Enable metallb with the node's PUBLIC IP
# Replace <PUBLIC_IP> with your node's actual public IP address.
# Using a private IP (e.g. 10.x.x.x) will prevent Let's Encrypt from
# reaching Caddy for HTTP-01 challenges.
sudo microk8s enable metallb:<PUBLIC_IP>-<PUBLIC_IP>
# 3. Enable hostpath-storage (provides the default StorageClass for Caddy PVC)
sudo microk8s enable hostpath-storage
# 4. Enable GPU support
sudo microk8s enable gpu
# 5. Create the RuntimeClass
sudo microk8s kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia
EOF
# 6. Verify GPUs are visible
sudo microk8s kubectl describe node | grep -A5 nvidia.com/gpu
k3s-specific setup
# 1. Disable Traefik (frees ports 80/443 for Caddy)
# Add to /etc/rancher/k3s/config.yaml:
# disable:
# - traefik
# Then: systemctl restart k3s
# 2. Configure containerd for NVIDIA runtime
cat > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl << 'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
EOF
systemctl restart k3s
# 3. Create the RuntimeClass
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia
EOF
# 4. Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
Note: k3s does not need caddy.hostNetwork: true — klipper-lb handles ACME challenges correctly without it. The default StorageClass local-path is available out of the box.
Quick Start
# 1. Install shenron CLI
uv pip install shenron
# 2. Create a DNS record for your node (optional — skip for HTTP-only)
shenron endpoint setup \
--subdomain my-node \
--public-ip <NODE_PUBLIC_IP> \
--cloudflare-api-token $CF_TOKEN \
--cloudflare-zone-id $CF_ZONE_ID
# 3. Download the Helm chart
shenron get --helm
# 4. Create required secrets
sudo microk8s kubectl create namespace shenron
sudo microk8s kubectl create secret generic system-api-key \
-n shenron \
--from-literal=SYSTEM_API_KEY='your-api-key'
sudo microk8s kubectl create secret generic replica-manager-auth \
-n shenron \
--from-literal=token='your-replica-manager-token'
# 5. Edit `shenron-helm/node-piccolo.yaml` or `shenron-helm/node-chiaotzu.yaml`, plus `shenron-helm/replicas.yaml`:
# - image.tag: latest-sglang-cu130 (match your CUDA version)
# - caddy.fqdn: "my-node.nodes.doubleword.ai"
# - caddy.hostNetwork: true
# - cluster.total_gpus: <number of GPUs on this node>
# - Define model specs in the node-specific file
# - Enable models with replicas > 0 in replicas.yaml
# 6. Deploy
cd shenron-helm
helmfile -f helmfile-piccolo apply
Choosing the Right Image Tag
The image tag in the node-specific values file must match your GPU's CUDA compute capability:
| GPU Family | Compute Capability | Image Tag Suffix |
|---|---|---|
| Ampere (A100, A10G) | sm_80, sm_86 | cu126 |
| Hopper (H100, H200) | sm_90 | cu126 or cu128 |
| Blackwell (B200, RTX PRO 6000) | sm_120 | cu130 |
Format: latest-{engine}-{cuda} — e.g. latest-sglang-cu130, latest-vllm-cu126.
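The table can be read as a small selection helper. The sketch below is illustrative only — cuda_suffix and image_tag are hypothetical names, not part of the shenron CLI:

```python
# Illustrative tag selection based on the compatibility table above.
# These helpers are made up for the example; they are not shenron code.
def cuda_suffix(compute_capability: float) -> str:
    if compute_capability >= 12.0:  # Blackwell (sm_120)
        return "cu130"
    if compute_capability >= 9.0:   # Hopper (sm_90): cu126 or cu128 both work
        return "cu128"
    return "cu126"                  # Ampere (sm_80 / sm_86)

def image_tag(engine: str, compute_capability: float) -> str:
    return f"latest-{engine}-{cuda_suffix(compute_capability)}"

image_tag("sglang", 12.0)  # "latest-sglang-cu130"
```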
Caddy Routing & TLS
HTTP-only mode (default): When caddy.fqdn is empty, Caddy serves on :80 with no TLS.
HTTPS mode: Set caddy.fqdn to a domain pointing at your node's public IP. Caddy automatically obtains a Let's Encrypt certificate.
caddy:
fqdn: my-node.nodes.doubleword.ai
Use shenron endpoint setup to create a Cloudflare DNS record under *.nodes.doubleword.ai and get the FQDN value.
Certificate persistence: Enabled by default. Caddy stores certificates in a PVC so they survive pod restarts (important — Let's Encrypt rate-limits to 5 duplicate certs per week). The PVC uses whatever default StorageClass the cluster provides.
hostNetwork (microk8s): On microk8s, caddy.hostNetwork: true is required for ACME HTTP-01 challenges to work. metallb uses kube-proxy DNAT rules that interfere with challenge handling — hostNetwork makes Caddy bind directly to the node's ports 80/443, bypassing kube-proxy. k3s does not need this (klipper-lb handles it automatically).
Extra Caddy Routes
Add custom backend routes via caddy.extraRoutes:
caddy:
extraRoutes:
- path: "/custom/*"
service: my-backend-service
port: 8080
Values Reference
Top-Level
| Key | Default | Description |
|---|---|---|
| image.repository | doublewordai/shenron | Model container image |
| image.tag | latest-vllm-cu130 | Image tag (engine + CUDA version) |
| port | 3000 | Container port for all model pods |
| gpu.runtimeClassName | nvidia | Kubernetes RuntimeClass for GPU pods |
| cluster.total_gpus | 0 | Total GPU budget for replica-manager scaling |
Caddy
| Key | Default | Description |
|---|---|---|
| caddy.enabled | true | Deploy Caddy reverse proxy |
| caddy.fqdn | "" | FQDN for automatic TLS (empty = HTTP-only on :80) |
| caddy.service.type | LoadBalancer | Only public-facing service |
| caddy.hostNetwork | false | Bind directly to host ports 80/443 (required for microk8s) |
| caddy.persistence.enabled | true | PVC for Let's Encrypt cert storage |
| caddy.persistence.storageClass | "" | Empty = cluster default StorageClass |
| caddy.extraRoutes | [] | Additional [{path, service, port}] routes |
Onwards (API Gateway)
| Key | Default | Description |
|---|---|---|
| onwards.enabled | true | Deploy onwards |
| onwards.port | 3000 | Onwards listen port |
| onwards.service.type | ClusterIP | ClusterIP when behind Caddy |
| onwards.systemApiKeySecret.name | system-api-key | Secret with API key |
Replica Manager
| Key | Default | Description |
|---|---|---|
| replicaManager.enabled | true | Deploy replica-manager |
| replicaManager.port | 8081 | Listen port |
| replicaManager.service.type | ClusterIP | ClusterIP when behind Caddy |
| replicaManager.auth.tokenSecret.name | replica-manager-auth | Auth token secret |
| replicaManager.helm.chartPath | /opt/shenron/helm | Chart path used by replica-manager Helm upgrades |
| replicaManager.helm.chartMount.enabled | false | Mount a node hostPath chart dir into replica-manager |
| replicaManager.helm.chartMount.hostPath | "" | Node path to mount when chartMount is enabled |
| replicaManager.helm.chartMount.hostPathType | Directory | Kubernetes hostPath type for the chart mount |
Prometheus
| Key | Default | Description |
|---|---|---|
| prometheus.enabled | false | Deploy Prometheus |
| prometheus.port | 9090 | Prometheus port |
| prometheus.service.type | ClusterIP | ClusterIP when behind Caddy |
Scouter Reporter
| Key | Default | Description |
|---|---|---|
| scouterReporter.enabled | false | Deploy scouter reporters |
| scouterReporter.collector.instanceSecret.name | scouter-reporter | Collector instance secret |
| scouterReporter.collector.apiKeySecret.name | scouter-reporter | Ingest API key secret |
When enabled, create the secret:
sudo microk8s kubectl create secret generic scouter-reporter \
-n shenron \
--from-literal=collector-instance='your-collector-host' \
--from-literal=ingest-api-key='your-ingest-key'
Models
Models are defined in models: as a map keyed by model name:
models:
"Qwen/Qwen3-0.6B":
replicas: 1 # 0 = disabled
num_gpus: 1 # GPUs per replica
command:
- "python"
- "-m"
- "sglang.launch_server"
- "--model-path"
- "Qwen/Qwen3-0.6B"
- "--host"
- "0.0.0.0"
- "--enable-metrics"
shm:
enabled: true
sizeLimit: 4Gi
resources:
requests:
cpu: "8"
memory: "8Gi"
limits:
cpu: "8"
memory: "8Gi"
Note: Do not include --port in command — the chart injects it from the top-level port value.
Each model with replicas > 0 gets:
- A Deployment with GPU resources and the HuggingFace cache volume
- A headless Service
- An SGLang router Deployment + Service (when router.enabled)
- A scouter reporter Deployment (when scouterReporter.enabled)
Replica Manager API
The replica manager provides a REST API for dynamic scaling:
# Health check
curl http://<host>/replica/healthz
# List models and current replicas
curl -H "Authorization: Bearer <token>" http://<host>/replica/v1/models
# Scale a model
curl -X POST -H "Authorization: Bearer <token>" \
-d '{"replicas": 2}' \
http://<host>/replica/v1/models/Qwen%2FQwen3-0.6B/replicas
Scaling is GPU-budget-aware: it validates num_gpus × replicas across all models against cluster.total_gpus.
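The budget check can be sketched as follows — this is an assumed reading of the semantics above, not the replica-manager's actual code:

```python
# Sketch of GPU-budget-aware scaling: a requested change is allowed only if
# num_gpus * replicas, summed over all models with the change applied,
# stays within cluster.total_gpus.
def can_scale(models: dict, target: str, new_replicas: int, total_gpus: int) -> bool:
    requested = sum(
        spec["num_gpus"] * (new_replicas if name == target else spec["replicas"])
        for name, spec in models.items()
    )
    return requested <= total_gpus

models = {
    "Qwen/Qwen3-0.6B": {"num_gpus": 1, "replicas": 1},
    "Qwen/Qwen3-30B-A3B": {"num_gpus": 2, "replicas": 1},
}
can_scale(models, "Qwen/Qwen3-0.6B", 2, 4)  # True: 1*2 + 2*1 = 4 <= 4
can_scale(models, "Qwen/Qwen3-0.6B", 3, 4)  # False: 1*3 + 2*1 = 5 > 4
```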
Upgrading
# After editing a node-specific values file or replicas.yaml:
cd shenron-helm
helmfile -f helmfile-piccolo apply
# Quick override without editing files:
sudo microk8s helm3 upgrade shenron ./shenron-helm -n shenron \
--values ./shenron-helm/node-piccolo.yaml \
--values ./shenron-helm/replicas.yaml \
--set image.tag=latest-sglang-cu130
Debugging
# Check pod status
sudo microk8s kubectl get pods -n shenron
# Caddy and onwards logs
sudo microk8s kubectl logs -n shenron deploy/shenron-caddy
sudo microk8s kubectl logs -n shenron deploy/shenron-onwards
# Follow logs for a model pod
sudo microk8s kubectl logs -f -n shenron -l shenron.ai/model-id=<model-id>
# Verify Caddyfile
sudo microk8s kubectl get configmap shenron-caddy-config -n shenron \
-o jsonpath='{.data.Caddyfile}'
# Verify Onwards config
sudo microk8s kubectl get configmap shenron-onwards-config -n shenron \
-o jsonpath='{.data.onwards_config\.json}' | python3 -m json.tool
# Test from inside the cluster
sudo microk8s kubectl run -n shenron curl --rm -it --image=curlimages/curl -- \
curl -s http://shenron-onwards:3000/v1/models
# GPU visibility
sudo microk8s kubectl describe node | grep -A5 nvidia.com/gpu
# Check Caddy has a valid TLS certificate
curl -sv https://my-node.nodes.doubleword.ai/llm/v1/models 2>&1 | grep 'subject:'
Docker Compose (Legacy)
The docker-compose path is maintained for backward compatibility but is not recommended for new deployments. Use the Helm chart instead.
shenron reads a model config YAML and generates:
- docker-compose.yml
- .generated/Caddyfile
- .generated/prometheus.yml
- .generated/scouter_reporter.env
- .generated/engine_start.sh
- .generated/engine_start_N.sh + .generated/sglangmux_start.sh when models: has 2+ entries
Quick Start
uv pip install shenron
shenron get
docker compose up -d
shenron get reads a per-release config index asset, shows available configs with arrow-key selection, downloads the chosen config, and generates deployment artifacts in the current directory. Using --release latest also rewrites shenron_version in the downloaded config to latest. You can also override config values on download with:
- --api-key (writes api_key)
- --scouter-api-key (writes scouter_ingest_api_key)
- --scouter-collector-instance (writes scouter_collector_instance)
shenron . expects exactly one config YAML (*.yml or *.yaml) in the current directory, unless you pass a config file path directly.
Engine Configuration
- engine: vllm or sglang (default: vllm)
- engine_args: engine CLI args appended after core settings.
- engine_env: top-level default engine environment variables as alternating KEY, VALUE entries.
- models[*].engine_envs: per-model engine environment variables as alternating KEY, VALUE entries.
- engine_port, engine_host: engine bind settings used for generated scripts and targets.
- engine_use_cuda_ipc_transport: when true, exports SGLANG_USE_CUDA_IPC_TRANSPORT=1 before launching SGLang.
- models: optional per-model engine config. With 1 entry, Shenron generates a single engine_start.sh. With 2+ entries, Shenron starts sglangmux (requires engine: sglang).
- sglangmux_listen_port, sglangmux_host, sglangmux_upstream_timeout_secs, sglangmux_model_ready_timeout_secs, sglangmux_model_switch_timeout_secs, sglangmux_log_dir: optional sglangmux settings.
engine_args, engine_env, and models[*].engine_envs values accept YAML scalars (string/number/bool). If you need to pass a structured value (like --override-generation-config), provide a YAML mapping and it will be JSON-encoded.
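For example, the JSON-encoding step can be pictured like this (an illustrative sketch; encode_engine_arg is a made-up name and shenron's actual argument handling may differ):

```python
import json

# Scalars pass through as strings; mappings are JSON-encoded into a
# single CLI value, e.g. for flags like --override-generation-config.
def encode_engine_arg(value):
    if isinstance(value, dict):
        return json.dumps(value)
    return str(value)

encode_engine_arg(8000)                  # "8000"
encode_engine_arg({"temperature": 0.6})  # '{"temperature": 0.6}'
```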
Legacy keys (vllm_args, sglang_args, vllm_port, vllm_host, sglang_env, sglang_use_cuda_ipc_transport) are still accepted as aliases.
Multi-Model (sglangmux) Example
engine: sglang
sglangmux_listen_port: 8100
models:
- model_name: Qwen/Qwen3-0.6B
engine_port: 8001
engine_args: [--tp, 1]
- model_name: Qwen/Qwen3-30B-A3B
engine_port: 8002
engine_args: [--tp, 2]
Rules:
- 2+ models requires engine: sglang
- Each models[*].model_name and engine_port must be unique
- sglangmux_listen_port must differ from all model ports
Endpoint Setup (Cloudflare DNS)
The shenron endpoint setup command creates a DNS record under *.nodes.doubleword.ai via the Cloudflare API:
shenron endpoint setup \
--subdomain my-node \
--public-ip 1.2.3.4 \
--cloudflare-api-token $CF_TOKEN \
--cloudflare-zone-id $CF_ZONE_ID
This writes .generated/node_endpoint.json with the FQDN and Cloudflare record metadata. The FQDN can then be used in caddy.fqdn (Helm) or is automatically picked up by shenron generate (docker-compose).
Security: All DNS operations are hard-restricted to *.nodes.doubleword.ai. This constraint is compiled into the binary and cannot be overridden at runtime.
Configs
Starter configs for docker-compose mode are in configs/:
- configs/Qwen06B-cu126-TP1.yml (plus cu129 and cu130 variants)
- configs/Qwen30B-A3B-cu126-TP1.yml (plus cu129-TP1, cu129-TP2, cu130-TP2)
- configs/Qwen235-A22B-cu129-TP2.yml (plus cu129-TP4, cu130-TP2)
- configs/GPT-OSS-20B-cu126-TP1.yml (plus cu129-TP1)
- configs/Qwen35-397B-A17B-cu130-TP8-sglang.yml
Development
# Run tests (Rust + CLI + compose checks)
./scripts/ci.sh
# Install local package for manual testing
python3 -m pip install -e .
# Generate from repo config (docker-compose mode)
shenron configs/Qwen06B-cu126-TP1.yml --output-dir /tmp/shenron-test
# Lint the Helm chart
helm lint helm/ --values helm/node-piccolo.yaml --values helm/replicas.yaml
Release Automation
- release-assets.yaml publishes stamped config files (*.yml) as release assets.
- release-assets.yaml also publishes configs-index.txt, which powers shenron get.
- release-assets.yaml packages Helm chart assets as shenron-<version>.tgz + index.yaml (Helm repository format).
- release-assets.yaml mirrors *.yml, configs-index.txt, shenron-*.tgz, and index.yaml into ${OWNER}/shenron-configs under the same tag as the main shenron release.
- Set CONFIGS_REPO_TOKEN (or reuse RELEASE_PLEASE_TOKEN) with write access to the configs repo release assets.
- python-release.yaml builds/publishes the shenron package to PyPI on release tags.
- Docker image build/push via Depot remains in ci.yaml.
License
MIT, see LICENSE.