
Shenron

Shenron is a config-driven toolkit for deploying production LLM inference stacks. It supports two deployment modes:

  1. Helm chart (recommended) — deploy on any Kubernetes cluster (k3s, microk8s, GKE, EKS, …)
  2. Docker Compose (legacy) — single-node docker-compose deployments

Helm Deployment (Recommended)

Architecture

All external traffic enters through a single Caddy reverse proxy (the only LoadBalancer service). Caddy routes requests by path prefix to internal ClusterIP services:

Internet → Caddy (LoadBalancer :80/:443)
             ├── /llm/*       → onwards       (OpenAI-compatible API gateway)
             ├── /replica/*   → replica-manager (scaling API)
             └── /metrics/*   → prometheus     (metrics)

Caddy uses handle_path directives, which both match and strip the prefix. This means:

  • GET /llm/v1/models → forwards as GET /v1/models to onwards
  • GET /metrics → forwards as GET / to prometheus (which serves its UI at /)
  • POST /replica/v1/models/Qwen%2FQwen3-0.6B/replicas → forwards to replica-manager
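The prefix-stripping behavior can be sketched in shell (illustrative only; strip_prefix is a hypothetical helper, not Caddy code — handle_path does this inside Caddy):

```shell
# Sketch of handle_path semantics: match a path prefix, strip it, forward the rest.
strip_prefix() {
  prefix="$1"; path="$2"
  rest="${path#"$prefix"}"    # drop the matched prefix
  [ -n "$rest" ] || rest="/"  # a bare prefix forwards as /
  printf '%s\n' "$rest"
}

strip_prefix /llm /llm/v1/models   # /v1/models
strip_prefix /metrics /metrics     # /
```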

Behind onwards, per-model components are deployed:

onwards → router (per model, cache-aware load balancing)
            └── model pod(s) (SGLang/vLLM, GPU workloads)

Prerequisites

Kubernetes cluster with:

  • GPU nodes with NVIDIA drivers installed
  • NVIDIA device plugin (nvidia.com/gpu resource available)
  • A RuntimeClass named nvidia (for k3s/microk8s with the NVIDIA runtime)
  • A default StorageClass (for Caddy certificate persistence):
    • k3s: local-path (available out of the box)
    • microk8s: microk8s-hostpath (run microk8s enable hostpath-storage)
    • GKE/EKS: available out of the box

k3s-specific setup:

# 1. Disable Traefik (frees ports 80/443 for Caddy)
#    Add to /etc/rancher/k3s/config.yaml:
#      disable:
#        - traefik
#    Then: systemctl restart k3s

# 2. Configure containerd for NVIDIA runtime
cat > /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl << 'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
EOF

systemctl restart k3s

# 3. Create the RuntimeClass
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# 4. Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

Quick Start

# 1. Install shenron CLI
uv pip install shenron

# 2. Download the Helm chart
shenron get --helm

# 3. Create required secrets
kubectl create namespace shenron

kubectl create secret generic system-api-key \
  -n shenron \
  --from-literal=SYSTEM_API_KEY='your-api-key'

kubectl create secret generic replica-manager-auth \
  -n shenron \
  --from-literal=token='your-replica-manager-token'

# 4. Edit models.yaml — set image.tag for your CUDA version and enable models
#    e.g. image.tag: latest-sglang-cu130  (for Blackwell GPUs)
#         image.tag: latest-sglang-cu126  (for Hopper/Ampere GPUs)

# 5. Deploy
helm install shenron ./shenron-helm -n shenron --values ./shenron-helm/models.yaml

Choosing the Right Image Tag

The image tag in models.yaml must match your GPU's CUDA compute capability:

| GPU Family | Compute Capability | Image Tag Suffix |
|---|---|---|
| Ampere (A100, A10G) | sm_80, sm_86 | cu126 |
| Hopper (H100, H200) | sm_90 | cu126 or cu128 |
| Blackwell (B200, RTX PRO 6000) | sm_120 | cu130 |

Format: latest-{engine}-{cuda} — e.g. latest-sglang-cu130, latest-vllm-cu126.
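As a sketch, the tag selection above can be expressed as a small lookup (the cuda_suffix helper and lowercase family names are illustrative, not part of the shenron CLI):

```shell
# Map a GPU family to its image-tag CUDA suffix (mapping from the table above).
cuda_suffix() {
  case "$1" in
    ampere)    echo cu126 ;;
    hopper)    echo cu126 ;;   # cu128 also works for Hopper
    blackwell) echo cu130 ;;
    *)         echo "unknown GPU family: $1" >&2; return 1 ;;
  esac
}

# Compose the full tag: latest-{engine}-{cuda}
engine=sglang
echo "latest-${engine}-$(cuda_suffix blackwell)"   # latest-sglang-cu130
```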

Caddy Routing & TLS

HTTP-only mode (default): When caddy.fqdn is empty, Caddy serves on :80 with no TLS.

HTTPS mode: Set caddy.fqdn to a domain pointing at your node's public IP. Caddy automatically obtains a Let's Encrypt certificate.

caddy:
  fqdn: my-node.nodes.doubleword.ai

Use shenron endpoint setup to create a Cloudflare DNS record under *.nodes.doubleword.ai and get the FQDN value.

Certificate persistence: Enabled by default. Caddy stores certificates in a PVC so they survive pod restarts (important — Let's Encrypt rate-limits to 5 duplicate certs per week). The PVC uses whatever default StorageClass the cluster provides.
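For example, to pin the certificate PVC to a named StorageClass instead of the cluster default, a values-file sketch using the caddy keys from the values reference (local-path is the k3s class; substitute your own):

```yaml
caddy:
  fqdn: my-node.nodes.doubleword.ai
  persistence:
    enabled: true
    storageClass: local-path   # "" = cluster default StorageClass
```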

Extra Caddy Routes

Add custom backend routes via caddy.extraRoutes:

caddy:
  extraRoutes:
    - path: "/custom/*"
      service: my-backend-service
      port: 8080

Values Reference

Top-Level

| Key | Default | Description |
|---|---|---|
| image.repository | doublewordai/shenron | Model container image |
| image.tag | latest-vllm-cu130 | Image tag (engine + CUDA version) |
| port | 3000 | Container port for all model pods |
| gpu.runtimeClassName | nvidia | Kubernetes RuntimeClass for GPU pods |
| cluster.total_gpus | 0 | Total GPU budget for replica-manager scaling |

Caddy

| Key | Default | Description |
|---|---|---|
| caddy.enabled | true | Deploy Caddy reverse proxy |
| caddy.fqdn | "" | FQDN for automatic TLS (empty = HTTP-only on :80) |
| caddy.service.type | LoadBalancer | Only public-facing service |
| caddy.persistence.enabled | true | PVC for Let's Encrypt cert storage |
| caddy.persistence.storageClass | "" | Empty = cluster default StorageClass |
| caddy.extraRoutes | [] | Additional [{path, service, port}] routes |

Onwards (API Gateway)

| Key | Default | Description |
|---|---|---|
| onwards.enabled | true | Deploy onwards |
| onwards.port | 3000 | Onwards listen port |
| onwards.service.type | ClusterIP | ClusterIP when behind Caddy |
| onwards.systemApiKeySecret.name | system-api-key | Secret with API key |

Replica Manager

| Key | Default | Description |
|---|---|---|
| replicaManager.enabled | true | Deploy replica-manager |
| replicaManager.port | 8081 | Listen port |
| replicaManager.service.type | ClusterIP | ClusterIP when behind Caddy |
| replicaManager.auth.tokenSecret.name | replica-manager-auth | Auth token secret |

Prometheus

| Key | Default | Description |
|---|---|---|
| prometheus.enabled | false | Deploy Prometheus |
| prometheus.port | 9090 | Prometheus port |
| prometheus.service.type | ClusterIP | ClusterIP when behind Caddy |

Scouter Reporter

| Key | Default | Description |
|---|---|---|
| scouterReporter.enabled | false | Deploy scouter reporters |
| scouterReporter.collector.instanceSecret.name | scouter-reporter | Collector instance secret |
| scouterReporter.collector.apiKeySecret.name | scouter-reporter | Ingest API key secret |

When enabled, create the secret:

kubectl create secret generic scouter-reporter \
  -n shenron \
  --from-literal=collector-instance='your-collector-host' \
  --from-literal=ingest-api-key='your-ingest-key'

Models

Models are defined in models: as a map keyed by model name:

models:
  "Qwen/Qwen3-0.6B":
    replicas: 1           # 0 = disabled
    num_gpus: 1           # GPUs per replica
    command:
      - "python"
      - "-m"
      - "sglang.launch_server"
      - "--model-path"
      - "Qwen/Qwen3-0.6B"
      - "--host"
      - "0.0.0.0"
      - "--enable-metrics"
    shm:
      enabled: true
      sizeLimit: 4Gi
    resources:
      requests:
        cpu: "8"
        memory: "8Gi"
      limits:
        cpu: "8"
        memory: "8Gi"

Note: Do not include --port in command — the chart injects it from the top-level port value.

Each model with replicas > 0 gets:

  • A Deployment with GPU resources and the HuggingFace cache volume
  • A headless Service
  • An SGLang router Deployment + Service (when router.enabled)
  • A scouter reporter Deployment (when scouterReporter.enabled)

Replica Manager API

The replica manager provides a REST API for dynamic scaling:

# Health check
curl http://<host>/replica/healthz

# List models and current replicas
curl -H "Authorization: Bearer <token>" http://<host>/replica/v1/models

# Scale a model
curl -X POST -H "Authorization: Bearer <token>" \
  -d '{"replicas": 2}' \
  http://<host>/replica/v1/models/Qwen%2FQwen3-0.6B/replicas

Scaling is GPU-budget-aware: it validates num_gpus × replicas across all models against cluster.total_gpus.
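A minimal sketch of that budget check, assuming the semantics described above (gpu_check is a hypothetical helper, not the replica-manager's actual code):

```shell
# Sum num_gpus * replicas across all models and compare against cluster.total_gpus.
gpu_check() {
  total="$1"; shift
  used=0
  for spec in "$@"; do        # each spec is "num_gpus replicas" for one model
    set -- $spec
    used=$((used + $1 * $2))
  done
  if [ "$used" -le "$total" ]; then
    echo "ok: $used/$total GPUs"
  else
    echo "rejected: need $used, budget $total"
  fi
}

gpu_check 8 "1 2" "2 2"   # ok: 6/8 GPUs
```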

Upgrading

# After editing values/models.yaml:
helm upgrade shenron ./shenron-helm -n shenron --values ./shenron-helm/models.yaml

# Quick override without editing files:
helm upgrade shenron ./shenron-helm -n shenron --values ./shenron-helm/models.yaml \
  --set image.tag=latest-sglang-cu130

Debugging

# Check pod status
kubectl get pods -n shenron

# Model pod logs
kubectl logs -n shenron deploy/shenron-qwen-qwen3-0-6b-<hash>

# Caddy logs (routing issues)
kubectl logs -n shenron deploy/shenron-caddy

# Verify Caddyfile
kubectl get configmap shenron-caddy-config -n shenron -o jsonpath='{.data.Caddyfile}'

# Test from inside the cluster
kubectl run -n shenron curl --rm -it --image=curlimages/curl -- \
  curl -s http://shenron-onwards:3000/v1/models

# GPU visibility
kubectl describe node | grep -A5 nvidia.com/gpu

Docker Compose (Legacy)

The docker-compose path is maintained for backward compatibility but is not recommended for new deployments. Use the Helm chart instead.

shenron reads a model config YAML and generates:

  • docker-compose.yml
  • .generated/Caddyfile
  • .generated/prometheus.yml
  • .generated/scouter_reporter.env
  • .generated/engine_start.sh
  • .generated/engine_start_N.sh + .generated/sglangmux_start.sh when models: has 2+ entries

Quick Start

uv pip install shenron
shenron get
docker compose up -d

shenron get reads a per-release config index asset, shows available configs with arrow-key selection, downloads the chosen config, and generates deployment artifacts in the current directory. Using --release latest also rewrites shenron_version in the downloaded config to latest. You can also override config values on download with:

  • --api-key (writes api_key)
  • --scouter-api-key (writes scouter_ingest_api_key)
  • --scouter-collector-instance (writes scouter_collector_instance)

shenron . expects exactly one config YAML (*.yml or *.yaml) in the current directory, unless you pass a config file path directly.

Engine Configuration

  • engine: vllm or sglang (default: vllm)
  • engine_args: engine CLI args appended after core settings.
  • engine_env: top-level default engine environment variables as alternating KEY, VALUE entries.
  • models[*].engine_envs: per-model engine environment variables as alternating KEY, VALUE entries.
  • engine_port, engine_host: engine bind settings used for generated scripts and targets.
  • engine_use_cuda_ipc_transport: when true, exports SGLANG_USE_CUDA_IPC_TRANSPORT=1 before launching SGLang.
  • models: optional per-model engine config. With 1 entry, Shenron generates a single engine_start.sh. With 2+ entries, Shenron starts sglangmux (requires engine: sglang).
  • sglangmux_listen_port, sglangmux_host, sglangmux_upstream_timeout_secs, sglangmux_model_ready_timeout_secs, sglangmux_model_switch_timeout_secs, sglangmux_log_dir: optional sglangmux settings.

engine_args, engine_env, and models[*].engine_envs values accept YAML scalars (string/number/bool). If you need to pass a structured value (like --override-generation-config), provide a YAML mapping and it will be JSON-encoded.
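For example (a sketch; the model settings shown are illustrative values, and the exact JSON formatting of the encoded string may differ):

```yaml
engine_args:
  - --override-generation-config
  # A mapping value is JSON-encoded, e.g. '{"temperature": 0.6, "top_p": 0.95}'
  - temperature: 0.6
    top_p: 0.95
```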

Legacy keys (vllm_args, sglang_args, vllm_port, vllm_host, sglang_env, sglang_use_cuda_ipc_transport) are still accepted as aliases.

Multi-Model (sglangmux) Example

engine: sglang
sglangmux_listen_port: 8100
models:
- model_name: Qwen/Qwen3-0.6B
  engine_port: 8001
  engine_args: [--tp, 1]
- model_name: Qwen/Qwen3-30B-A3B
  engine_port: 8002
  engine_args: [--tp, 2]

Rules:

  • 2+ models requires engine: sglang
  • Each models[*].model_name and engine_port must be unique
  • sglangmux_listen_port must differ from all model ports
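The rules above can be sketched as a small validation (check_ports is a hypothetical helper; the ports come from the example config):

```shell
# Validate sglangmux port rules: model ports unique, mux port distinct from all of them.
check_ports() {
  mux="$1"; shift
  dups=$(printf '%s\n' "$@" | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "duplicate model port: $dups"; return 1
  fi
  for p in "$@"; do
    if [ "$p" = "$mux" ]; then
      echo "mux port $mux clashes"; return 1
    fi
  done
  echo "ports ok"
}

check_ports 8100 8001 8002   # ports ok
```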

Endpoint Setup (Cloudflare DNS)

The shenron endpoint setup command creates a DNS record under *.nodes.doubleword.ai via the Cloudflare API:

shenron endpoint setup \
  --subdomain my-node \
  --public-ip 1.2.3.4 \
  --cloudflare-api-token $CF_TOKEN \
  --cloudflare-zone-id $CF_ZONE_ID

This writes .generated/node_endpoint.json with the FQDN and Cloudflare record metadata. The FQDN can then be used in caddy.fqdn (Helm) or is automatically picked up by shenron generate (docker-compose).

Security: All DNS operations are hard-restricted to *.nodes.doubleword.ai. This constraint is compiled into the binary and cannot be overridden at runtime.


Configs

Starter configs for docker-compose mode are in configs/:

  • configs/Qwen06B-cu126-TP1.yml / cu129 / cu130
  • configs/Qwen30B-A3B-cu126-TP1.yml / cu129-TP1 / cu129-TP2 / cu130-TP2
  • configs/Qwen235-A22B-cu129-TP2.yml / cu129-TP4 / cu130-TP2
  • configs/GPT-OSS-20B-cu126-TP1.yml / cu129-TP1
  • configs/Qwen35-397B-A17B-cu130-TP8-sglang.yml

Development

# Run tests (Rust + CLI + compose checks)
./scripts/ci.sh

# Install local package for manual testing
python3 -m pip install -e .

# Generate from repo config (docker-compose mode)
shenron configs/Qwen06B-cu126-TP1.yml --output-dir /tmp/shenron-test

# Lint the Helm chart
helm lint helm/ --values helm/models.yaml

Release Automation

  • release-assets.yaml publishes stamped config files (*.yml) as release assets, together with configs-index.txt, which powers shenron get.
  • release-assets.yaml also packages the Helm chart as shenron-<version>.tgz + index.yaml (Helm repository format) and mirrors *.yml, configs-index.txt, shenron-*.tgz, and index.yaml into ${OWNER}/shenron-configs under the same tag as the main shenron release.
  • Set CONFIGS_REPO_TOKEN (or reuse RELEASE_PLEASE_TOKEN) with write access to the configs repo's release assets.
  • python-release.yaml builds and publishes the shenron package to PyPI on release tags.
  • Docker image build/push via Depot remains in ci.yaml.

License

MIT, see LICENSE.
