Lightweight Prefill-Decode proxy for disaggregated LLM serving

Project description

xPyD Proxy

A lightweight Prefill-Decode (PD) proxy for disaggregated LLM serving.

Architecture

xPyD Proxy supports two operating modes:

Prefill-Decode (P/D) Disaggregated Mode

Requests are routed through two phases with KV cache transfer:

Prefill — KV cache preparation on prefill nodes (max_tokens=1)
Decode — autoregressive token generation on decode nodes (receives KV cache from prefill)

Dual-Role Mode

A single instance handles both prefill and decode in one pass — no KV transfer needed. This simplifies deployment when disaggregation is not required or for smaller-scale setups.

The proxy handles scheduling (load-balanced, round-robin, consistent hash, power-of-two, cache-aware), health monitoring, circuit breaking, and dynamic instance management. Multi-model routing allows serving multiple models through a single proxy with per-model scheduler configuration.

See docs/architecture.md for details.

Installation

pip install .

# Verify
xpyd --version

Quick Start

# Generate a config template
xpyd proxy --init-config

# Edit xpyd.yaml with your model and node addresses, then:
xpyd proxy -c xpyd.yaml

Configuration

All configuration is done via YAML. Three config formats are supported.

Format 1: Legacy (Single Model)

Simple prefill/decode address lists for a single model:

model: /path/to/model
prefill:
  - "10.0.0.3:8100"
decode:
  - "10.0.0.1:8200"
  - "10.0.0.2:8200"
port: 8000
scheduling: loadbalanced

Topology-style config is also supported in Format 1:

model: /path/to/model
port: 8868

prefill:
  nodes:
    - "10.0.0.1:8100"
  tp_size: 8
  dp_size: 1
  world_size_per_node: 8

decode:
  nodes:
    - "10.0.0.2:8200"
    - "10.0.0.3:8200"
  tp_size: 1
  dp_size: 16
  world_size_per_node: 8

Format 2: Instances (Multi-Model, Per-Instance Role)

Explicit per-instance configuration with role and model assignment. Supports dual role:

instances:
  - address: "10.0.0.1:8000"
    role: prefill
    model: llama-3
  - address: "10.0.0.2:8000"
    role: decode
    model: llama-3
  - address: "10.0.0.3:8000"
    role: dual
    model: qwen-2
port: 8000
scheduling: loadbalanced

Format 3: Models Shorthand (Multi-Model, Per-Model Scheduler)

Compact format with per-model scheduler override and dual shorthand:

models:
  - name: llama-3
    prefill:
      - "10.0.0.1:8000"
    decode:
      - "10.0.0.2:8000"
    scheduler: round_robin
  - name: qwen-2
    dual:
      - "10.0.0.3:8000"
      - "10.0.0.4:8000"
    scheduler: loadbalanced
port: 8000

Note: instances and models cannot be combined. Legacy prefill/decode lists cannot be used with instances or models.

See examples/proxy.yaml for a fully-commented example.

CLI Reference

xpyd proxy [OPTIONS]

Options:
  --config, -c PATH         Path to YAML config (default: ./xpyd.yaml or XPYD_CONFIG env)
  --init-config [PATH]      Generate a config template and exit
  --validate-config PATH    Validate a config file and exit
  --port PORT               Override port from config
  --log-level LEVEL         Override log level: debug|info|warning|error
  --version, -V             Show version and exit

xpyd fix-config CONFIG_PATH [OPTIONS]

Auto-fix common config mistakes (typos, missing ports, whitespace).

Arguments:
  CONFIG_PATH               Path to YAML config file to fix

Options:
  --write                   Write fixes back to file (creates timestamped .bak backup).
                            Note: does not preserve YAML comments or formatting.
  --interactive             Prompt for confirmation on ambiguous suggestions

Config resolution order

--config / -c CLI argument
XPYD_CONFIG environment variable
./xpyd.yaml in the current directory

YAML Config

# Required
model: /path/to/model
decode:
  - "10.0.0.1:8200"
  - "10.0.0.2:8200"

# Optional
prefill:
  - "10.0.0.3:8100"
port: 8000
log_level: warning
scheduling: loadbalanced   # roundrobin | loadbalanced | consistent_hash | power_of_two | cache_aware
generator_on_p_node: false

See examples/proxy.yaml for a fully-commented example.

YAML Fields Reference

Field	Type	Default	Description
`model`	string	—	Model name / path (required in Format 1)
`port`	int	8000	Proxy listen port
`log_level`	string	warning	Log level: debug, info, warning, error
`prefill`	list or topology	[]	Prefill node config (Format 1)
`decode`	list or topology	—	Decode node config (Format 1, required)
`instances`	list	—	Per-instance config (Format 2): `{address, role, model}`
`models`	list	—	Per-model shorthand (Format 3): `{name, prefill, decode, dual, scheduler}`
`scheduling`	string	loadbalanced	Global scheduling policy
`scheduling_config`	dict	{}	Policy-specific options
`generator_on_p_node`	bool	false	Generate first token on prefill node
`admin_api_key`	string	—	Admin API key (env `ADMIN_API_KEY` overrides)
`openai_api_key`	string	—	OpenAI API key (env `OPENAI_API_KEY` overrides)
`startup.wait_timeout_seconds`	int	600	Max wait for nodes at startup
`startup.probe_interval_seconds`	int	10	Health probe interval

Valid role values: prefill, decode, dual

Valid scheduling values: loadbalanced, roundrobin (alias: round_robin), load_balanced, consistent_hash, power_of_two, cache_aware

API

The proxy exposes an OpenAI-compatible API:

POST /v1/chat/completions — Chat completions (streaming and non-streaming)
POST /v1/completions — Text completions (streaming and non-streaming)
GET /v1/models — List all registered models in OpenAI-compatible format

Startup Node Discovery

The proxy returns 503 on business endpoints until the minimum instance requirement is met: at least 1 prefill + 1 decode node, or 1 dual node per model must respond healthy. Health/status/metrics endpoints are always available.

Docker

# Full local topology (prefill + decode + proxy)
docker compose up --build

# Proxy only, connecting to existing GPU nodes
docker build -t xpyd .
docker run -p 8868:8868 -v ./config.yaml:/app/xpyd.yaml xpyd

See docs/deployment.md for production deployment.

Benchmark

python -m vllm bench serve \
  --base-url http://localhost:8868 \
  --model DeepSeek-R1 \
  --dataset-name sonnet \
  --sonnet-input-len 1024 \
  --sonnet-output-len 128 \
  --num-prompts 100 \
  --request-rate 10

Development

# Install in dev mode
pip install -e ".[dev]"

# Run tests
python -m pytest tests/unit/ tests/integration/ -v

# Lint
pre-commit run --all-files

Environment Variables

Variable	Description
`XPYD_CONFIG`	Default config file path
`ADMIN_API_KEY`	Admin API key (overrides YAML)
`OPENAI_API_KEY`	Bearer token for backend nodes (overrides YAML)
`PREFILL_DELAY_PER_TOKEN`	Simulated prefill latency for dummy nodes (default: 0.001s)
`DECODE_DELAY_PER_TOKEN`	Simulated decode latency for dummy nodes (default: 0.01s)

Documentation

Document	Description
Architecture	System architecture overview
API Reference	HTTP API endpoints
Configuration	YAML config reference
CLI	xpyd command-line tool
Scheduling	Load balancing strategies
Resilience	Health checks, circuit breakers, retry
Metrics	Prometheus metrics endpoint
Deployment	Deployment and Docker guide
Contributing	Contribution guidelines

License

Apache-2.0

Project details

Release history Release notifications | RSS feed

1.4.0

Apr 6, 2026

1.3.0

Apr 6, 2026

1.2.0

Apr 5, 2026

This version

1.1.0

Apr 3, 2026

1.0.0

Apr 1, 2026

0.0.1

Mar 31, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xpyd-1.1.0.tar.gz (52.9 kB view details)

Uploaded Apr 3, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

xpyd-1.1.0-py3-none-any.whl (62.5 kB view details)

Uploaded Apr 3, 2026 Python 3

File details

Details for the file xpyd-1.1.0.tar.gz.

File metadata

Download URL: xpyd-1.1.0.tar.gz
Upload date: Apr 3, 2026
Size: 52.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for xpyd-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4effae4124e07e08d72afa0f6e50e6b497fc274acc71bbada6a5a3dfe5c363ec`
MD5	`87b615d4493c80a2500207fed4c367b4`
BLAKE2b-256	`638015a414d00648ecece517e94f2b60f64b10637fca20cd091042aaf8b1fadb`

See more details on using hashes here.

Provenance

The following attestation bundles were made for xpyd-1.1.0.tar.gz:

Publisher: release.yml on xPyD-hub/xPyD-proxy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: xpyd-1.1.0.tar.gz
- Subject digest: 4effae4124e07e08d72afa0f6e50e6b497fc274acc71bbada6a5a3dfe5c363ec
- Sigstore transparency entry: 1227699759
- Sigstore integration time: Apr 3, 2026
Source repository:
- Permalink: xPyD-hub/xPyD-proxy@0db5804cf3e9613a660a0f03a62f4d41c8137c9b
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/xPyD-hub
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0db5804cf3e9613a660a0f03a62f4d41c8137c9b
- Trigger Event: push

File details

Details for the file xpyd-1.1.0-py3-none-any.whl.

File metadata

Download URL: xpyd-1.1.0-py3-none-any.whl
Upload date: Apr 3, 2026
Size: 62.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for xpyd-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`25eb6d2d3a2faa098a41c84faedc24260fa203e121203125b74b6fa9842bc157`
MD5	`1528b54abd68190f8d08686c84f4f257`
BLAKE2b-256	`26145e3313eec1dc84e715c09c7963d3f2b315e94cb6a1698588d2a01a48ad37`

See more details on using hashes here.

Provenance

The following attestation bundles were made for xpyd-1.1.0-py3-none-any.whl:

Publisher: release.yml on xPyD-hub/xPyD-proxy

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: xpyd-1.1.0-py3-none-any.whl
- Subject digest: 25eb6d2d3a2faa098a41c84faedc24260fa203e121203125b74b6fa9842bc157
- Sigstore transparency entry: 1227699764
- Sigstore integration time: Apr 3, 2026
Source repository:
- Permalink: xPyD-hub/xPyD-proxy@0db5804cf3e9613a660a0f03a62f4d41c8137c9b
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/xPyD-hub
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0db5804cf3e9613a660a0f03a62f4d41c8137c9b
- Trigger Event: push

xpyd 1.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

xPyD Proxy

Architecture

Prefill-Decode (P/D) Disaggregated Mode

Dual-Role Mode

Installation

Quick Start

Configuration

Format 1: Legacy (Single Model)

Format 2: Instances (Multi-Model, Per-Instance Role)

Format 3: Models Shorthand (Multi-Model, Per-Model Scheduler)

CLI Reference

Config resolution order

YAML Config

YAML Fields Reference

API

Startup Node Discovery

Docker

Benchmark

Development

Environment Variables

Documentation

License

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance