Lightweight Prefill-Decode proxy for disaggregated LLM serving
Project description
xPyD Proxy
A lightweight Prefill-Decode (PD) proxy for disaggregated LLM serving.
Architecture
xPyD Proxy supports two operating modes:
Prefill-Decode (P/D) Disaggregated Mode
Requests are routed through two phases with KV cache transfer:
- Prefill — KV cache preparation on prefill nodes (
max_tokens=1) - Decode — autoregressive token generation on decode nodes (receives KV cache from prefill)
Dual-Role Mode
A single instance handles both prefill and decode in one pass — no KV transfer needed. This simplifies deployment when disaggregation is not required or for smaller-scale setups.
The proxy handles scheduling (load-balanced, round-robin, consistent hash, power-of-two, cache-aware), health monitoring, circuit breaking, and dynamic instance management. Multi-model routing allows serving multiple models through a single proxy with per-model scheduler configuration.
See docs/architecture.md for details.
Installation
pip install .
# Verify
xpyd --version
Quick Start
# Generate a config template
xpyd proxy --init-config
# Edit xpyd.yaml with your model and node addresses, then:
xpyd proxy -c xpyd.yaml
Configuration
All configuration is done via YAML. Three config formats are supported.
Format 1: Legacy (Single Model)
Simple prefill/decode address lists for a single model:
model: /path/to/model
prefill:
- "10.0.0.3:8100"
decode:
- "10.0.0.1:8200"
- "10.0.0.2:8200"
port: 8000
scheduling: loadbalanced
Topology-style config is also supported in Format 1:
model: /path/to/model
port: 8868
prefill:
nodes:
- "10.0.0.1:8100"
tp_size: 8
dp_size: 1
world_size_per_node: 8
decode:
nodes:
- "10.0.0.2:8200"
- "10.0.0.3:8200"
tp_size: 1
dp_size: 16
world_size_per_node: 8
Format 2: Instances (Multi-Model, Per-Instance Role)
Explicit per-instance configuration with role and model assignment. Supports dual role:
instances:
- address: "10.0.0.1:8000"
role: prefill
model: llama-3
- address: "10.0.0.2:8000"
role: decode
model: llama-3
- address: "10.0.0.3:8000"
role: dual
model: qwen-2
port: 8000
scheduling: loadbalanced
Format 3: Models Shorthand (Multi-Model, Per-Model Scheduler)
Compact format with per-model scheduler override and dual shorthand:
models:
- name: llama-3
prefill:
- "10.0.0.1:8000"
decode:
- "10.0.0.2:8000"
scheduler: round_robin
- name: qwen-2
dual:
- "10.0.0.3:8000"
- "10.0.0.4:8000"
scheduler: loadbalanced
port: 8000
Note:
instancesandmodelscannot be combined. Legacyprefill/decodelists cannot be used withinstancesormodels.
See examples/proxy.yaml for a fully-commented example.
CLI Reference
xpyd proxy [OPTIONS]
Options:
--config, -c PATH Path to YAML config (default: ./xpyd.yaml or XPYD_CONFIG env)
--init-config [PATH] Generate a config template and exit
--validate-config PATH Validate a config file and exit
--port PORT Override port from config
--log-level LEVEL Override log level: debug|info|warning|error
--version, -V Show version and exit
xpyd fix-config CONFIG_PATH [OPTIONS]
Auto-fix common config mistakes (typos, missing ports, whitespace).
Arguments:
CONFIG_PATH Path to YAML config file to fix
Options:
--write Write fixes back to file (creates timestamped .bak backup).
Note: does not preserve YAML comments or formatting.
--interactive Prompt for confirmation on ambiguous suggestions
Config resolution order
--config/-cCLI argumentXPYD_CONFIGenvironment variable./xpyd.yamlin the current directory
YAML Config
# Required
model: /path/to/model
decode:
- "10.0.0.1:8200"
- "10.0.0.2:8200"
# Optional
prefill:
- "10.0.0.3:8100"
port: 8000
log_level: warning
scheduling: loadbalanced # roundrobin | loadbalanced | consistent_hash | power_of_two | cache_aware
generator_on_p_node: false
See examples/proxy.yaml for a fully-commented example.
YAML Fields Reference
| Field | Type | Default | Description |
|---|---|---|---|
model |
string | — | Model name / path (required in Format 1) |
port |
int | 8000 | Proxy listen port |
log_level |
string | warning | Log level: debug, info, warning, error |
prefill |
list or topology | [] | Prefill node config (Format 1) |
decode |
list or topology | — | Decode node config (Format 1, required) |
instances |
list | — | Per-instance config (Format 2): {address, role, model} |
models |
list | — | Per-model shorthand (Format 3): {name, prefill, decode, dual, scheduler} |
scheduling |
string | loadbalanced | Global scheduling policy |
scheduling_config |
dict | {} | Policy-specific options |
generator_on_p_node |
bool | false | Generate first token on prefill node |
admin_api_key |
string | — | Admin API key (env ADMIN_API_KEY overrides) |
openai_api_key |
string | — | OpenAI API key (env OPENAI_API_KEY overrides) |
startup.wait_timeout_seconds |
int | 600 | Max wait for nodes at startup |
startup.probe_interval_seconds |
int | 10 | Health probe interval |
Valid role values: prefill, decode, dual
Valid scheduling values: loadbalanced, roundrobin (alias: round_robin), load_balanced, consistent_hash, power_of_two, cache_aware
API
The proxy exposes an OpenAI-compatible API:
POST /v1/chat/completions— Chat completions (streaming and non-streaming)POST /v1/completions— Text completions (streaming and non-streaming)GET /v1/models— List all registered models in OpenAI-compatible format
Startup Node Discovery
The proxy returns 503 on business endpoints until the minimum instance requirement is met: at least 1 prefill + 1 decode node, or 1 dual node per model must respond healthy. Health/status/metrics endpoints are always available.
Docker
# Full local topology (prefill + decode + proxy)
docker compose up --build
# Proxy only, connecting to existing GPU nodes
docker build -t xpyd .
docker run -p 8868:8868 -v ./config.yaml:/app/xpyd.yaml xpyd
See docs/deployment.md for production deployment.
Benchmark
python -m vllm bench serve \
--base-url http://localhost:8868 \
--model DeepSeek-R1 \
--dataset-name sonnet \
--sonnet-input-len 1024 \
--sonnet-output-len 128 \
--num-prompts 100 \
--request-rate 10
Development
# Install in dev mode
pip install -e ".[dev]"
# Run tests
python -m pytest tests/unit/ tests/integration/ -v
# Lint
pre-commit run --all-files
Environment Variables
| Variable | Description |
|---|---|
XPYD_CONFIG |
Default config file path |
ADMIN_API_KEY |
Admin API key (overrides YAML) |
OPENAI_API_KEY |
Bearer token for backend nodes (overrides YAML) |
PREFILL_DELAY_PER_TOKEN |
Simulated prefill latency for dummy nodes (default: 0.001s) |
DECODE_DELAY_PER_TOKEN |
Simulated decode latency for dummy nodes (default: 0.01s) |
Documentation
| Document | Description |
|---|---|
| Architecture | System architecture overview |
| API Reference | HTTP API endpoints |
| Configuration | YAML config reference |
| CLI | xpyd command-line tool |
| Scheduling | Load balancing strategies |
| Resilience | Health checks, circuit breakers, retry |
| Metrics | Prometheus metrics endpoint |
| Deployment | Deployment and Docker guide |
| Contributing | Contribution guidelines |
License
Apache-2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file xpyd_proxy-1.2.0.tar.gz.
File metadata
- Download URL: xpyd_proxy-1.2.0.tar.gz
- Upload date:
- Size: 52.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e33a2895823147c34ec815d997f1854fce5d9cad44aa04ef09c3a6a171b3283d
|
|
| MD5 |
9b1f9435869fe88875198a3dfac5ce06
|
|
| BLAKE2b-256 |
54defc972d029f20615702d20b5711ef640ce08625dac03ea941b557cc08c884
|
Provenance
The following attestation bundles were made for xpyd_proxy-1.2.0.tar.gz:
Publisher:
release.yml on xPyD-hub/xPyD-proxy
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
xpyd_proxy-1.2.0.tar.gz -
Subject digest:
e33a2895823147c34ec815d997f1854fce5d9cad44aa04ef09c3a6a171b3283d - Sigstore transparency entry: 1239207665
- Sigstore integration time:
-
Permalink:
xPyD-hub/xPyD-proxy@0e12dc36a206cd5f47f515e8502b0042db09fed1 -
Branch / Tag:
refs/tags/v1.2.0 - Owner: https://github.com/xPyD-hub
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@0e12dc36a206cd5f47f515e8502b0042db09fed1 -
Trigger Event:
push
-
Statement type:
File details
Details for the file xpyd_proxy-1.2.0-py3-none-any.whl.
File metadata
- Download URL: xpyd_proxy-1.2.0-py3-none-any.whl
- Upload date:
- Size: 60.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7976de764a86f7a7b3eecfde99ea1cd0cbcc4a0b4ea234de228b4b2d57a2e703
|
|
| MD5 |
8bd44ffdc8d3b2761bb4e43e6a9cb0a7
|
|
| BLAKE2b-256 |
03f7cdfa23f71250c7cd0a533992c2a08722cc9544b2f0c0aa88297240c43af8
|
Provenance
The following attestation bundles were made for xpyd_proxy-1.2.0-py3-none-any.whl:
Publisher:
release.yml on xPyD-hub/xPyD-proxy
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
xpyd_proxy-1.2.0-py3-none-any.whl -
Subject digest:
7976de764a86f7a7b3eecfde99ea1cd0cbcc4a0b4ea234de228b4b2d57a2e703 - Sigstore transparency entry: 1239207687
- Sigstore integration time:
-
Permalink:
xPyD-hub/xPyD-proxy@0e12dc36a206cd5f47f515e8502b0042db09fed1 -
Branch / Tag:
refs/tags/v1.2.0 - Owner: https://github.com/xPyD-hub
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@0e12dc36a206cd5f47f515e8502b0042db09fed1 -
Trigger Event:
push
-
Statement type: