Lightweight Prefill-Decode proxy for disaggregated LLM serving

xPyD Proxy

A lightweight Prefill-Decode (P/D) proxy for disaggregated LLM serving.

Architecture

xPyD Proxy supports two operating modes:

Prefill-Decode (P/D) Disaggregated Mode

Requests are routed through two phases with KV cache transfer:

  1. Prefill — KV cache preparation on prefill nodes (max_tokens=1)
  2. Decode — autoregressive token generation on decode nodes (receives KV cache from prefill)
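
As a rough illustration of the two phases above, the sketch below derives a prefill payload (capped at max_tokens=1) and a decode payload from one OpenAI-style request body. The helper names build_prefill_request and plan_phases are hypothetical, and disabling streaming for the prefill call is an assumption. The KV cache transfer itself happens between the backend nodes and is not modelled here.

```python
import copy


def build_prefill_request(request: dict) -> dict:
    """Return a copy of the client request limited to a single token."""
    prefill = copy.deepcopy(request)
    prefill["max_tokens"] = 1   # prefill only has to populate the KV cache
    prefill["stream"] = False   # assumption: the one-token reply is not streamed
    return prefill


def plan_phases(request: dict) -> list[tuple[str, dict]]:
    """Return the (phase, payload) pairs in the order they would be sent."""
    return [
        ("prefill", build_prefill_request(request)),
        ("decode", copy.deepcopy(request)),  # decode keeps the client's limits
    ]


if __name__ == "__main__":
    client_request = {"model": "llama-3", "prompt": "Hello", "max_tokens": 64}
    for phase, payload in plan_phases(client_request):
        print(phase, payload)
```

The client's request is copied rather than mutated, so the decode phase still sees the original max_tokens value.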

Dual-Role Mode

A single instance handles both prefill and decode in one pass — no KV transfer needed. This simplifies deployment when disaggregation is not required or for smaller-scale setups.

The proxy handles scheduling (load-balanced, round-robin, consistent hash, power-of-two, cache-aware), health monitoring, circuit breaking, and dynamic instance management. Multi-model routing allows serving multiple models through a single proxy with per-model scheduler configuration.
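
To make three of the scheduling policy names concrete, here is a minimal sketch of round-robin, power-of-two-choices, and a hash-based picker. It is not xPyD's scheduler code: the class and function names are hypothetical, and rendezvous hashing is used as one simple way to get consistent-hash behaviour, which may differ from what the proxy does internally.

```python
import hashlib
import itertools
import random


class RoundRobin:
    """Cycle through the nodes in a fixed order."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(list(nodes))

    def pick(self) -> str:
        return next(self._cycle)


def power_of_two(active: dict[str, int], rng: random.Random) -> str:
    """Sample two distinct nodes and return the one with fewer active requests."""
    names = sorted(active)
    if len(names) == 1:
        return names[0]
    first, second = rng.sample(names, 2)
    return first if active[first] <= active[second] else second


def rendezvous_pick(key: str, nodes: list[str]) -> str:
    """Pick the node with the highest hash score for this key.

    Removing a node only remaps the keys that were assigned to it.
    """
    def score(node: str) -> bytes:
        return hashlib.sha256(f"{node}|{key}".encode()).digest()

    return max(nodes, key=score)


if __name__ == "__main__":
    decode_nodes = ["10.0.0.1:8200", "10.0.0.2:8200"]
    print(RoundRobin(decode_nodes).pick())
    print(power_of_two({"10.0.0.1:8200": 5, "10.0.0.2:8200": 1}, random.Random()))
    print(rendezvous_pick("session-42", decode_nodes))
```

Power-of-two-choices avoids scanning every node on each request while still steering traffic away from the busier of the two sampled nodes.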

See docs/architecture.md for details.

Installation

pip install .

# Verify
xpyd --version

Quick Start

# Generate a config template
xpyd proxy --init-config

# Edit xpyd.yaml with your model and node addresses, then:
xpyd proxy -c xpyd.yaml

Configuration

All configuration is done via YAML. Three config formats are supported.

Format 1: Legacy (Single Model)

Simple prefill/decode address lists for a single model:

model: /path/to/model
prefill:
  - "10.0.0.3:8100"
decode:
  - "10.0.0.1:8200"
  - "10.0.0.2:8200"
port: 8000
scheduling: loadbalanced

Topology-style config is also supported in Format 1:

model: /path/to/model
port: 8868

prefill:
  nodes:
    - "10.0.0.1:8100"
  tp_size: 8
  dp_size: 1
  world_size_per_node: 8

decode:
  nodes:
    - "10.0.0.2:8200"
    - "10.0.0.3:8200"
  tp_size: 1
  dp_size: 16
  world_size_per_node: 8
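
The topology fields are not defined in this README. One plausible reading, assumed here and not confirmed by the source, is that tp_size and dp_size are tensor-parallel and data-parallel degrees, and that their product should match the number of nodes times world_size_per_node. The sketch below only checks that arithmetic for the example above; both function names are hypothetical.

```python
def ranks_required(tp_size: int, dp_size: int) -> int:
    """Total ranks implied by the parallelism settings (assumed meaning)."""
    return tp_size * dp_size


def ranks_available(nodes: list[str], world_size_per_node: int) -> int:
    """Total ranks provided by the listed nodes (assumed meaning)."""
    return len(nodes) * world_size_per_node


if __name__ == "__main__":
    prefill_nodes = ["10.0.0.1:8100"]
    decode_nodes = ["10.0.0.2:8200", "10.0.0.3:8200"]
    print(ranks_required(8, 1), ranks_available(prefill_nodes, 8))
    print(ranks_required(1, 16), ranks_available(decode_nodes, 8))
```

Under this reading, both the prefill and decode sections of the example are internally consistent.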

Format 2: Instances (Multi-Model, Per-Instance Role)

Explicit per-instance configuration with role and model assignment. Supports dual role:

instances:
  - address: "10.0.0.1:8000"
    role: prefill
    model: llama-3
  - address: "10.0.0.2:8000"
    role: decode
    model: llama-3
  - address: "10.0.0.3:8000"
    role: dual
    model: qwen-2
port: 8000
scheduling: loadbalanced

Format 3: Models Shorthand (Multi-Model, Per-Model Scheduler)

Compact format with per-model scheduler override and dual shorthand:

models:
  - name: llama-3
    prefill:
      - "10.0.0.1:8000"
    decode:
      - "10.0.0.2:8000"
    scheduler: round_robin
  - name: qwen-2
    dual:
      - "10.0.0.3:8000"
      - "10.0.0.4:8000"
    scheduler: loadbalanced
port: 8000

Note: instances and models cannot be combined. Legacy prefill/decode lists cannot be used with instances or models.
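
The exclusivity rules in the note above can be expressed as a small check on the parsed YAML mapping. This is a sketch under the assumption that the top-level keys alone determine the format; detect_format is a hypothetical helper, not part of the xpyd package.

```python
def detect_format(cfg: dict) -> str:
    """Classify a parsed config as 'legacy', 'instances', or 'models'."""
    has_instances = "instances" in cfg
    has_models = "models" in cfg
    has_legacy = "prefill" in cfg or "decode" in cfg

    if has_instances and has_models:
        raise ValueError("instances and models cannot be combined")
    if has_legacy and (has_instances or has_models):
        raise ValueError(
            "legacy prefill/decode lists cannot be used with instances or models"
        )
    if has_instances:
        return "instances"
    if has_models:
        return "models"
    if has_legacy:
        return "legacy"
    raise ValueError("no prefill/decode, instances, or models section found")


if __name__ == "__main__":
    print(detect_format({"model": "/path/to/model", "decode": ["10.0.0.1:8200"]}))
```

In practice, xpyd proxy --validate-config is the supported way to check a file; the sketch only illustrates the rule.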

See examples/proxy.yaml for a fully-commented example.

CLI Reference

xpyd proxy [OPTIONS]

Options:
  --config, -c PATH         Path to YAML config (default: XPYD_CONFIG env, then ./xpyd.yaml)
  --init-config [PATH]      Generate a config template and exit
  --validate-config PATH    Validate a config file and exit
  --port PORT               Override port from config
  --log-level LEVEL         Override log level: debug|info|warning|error
  --version, -V             Show version and exit

xpyd fix-config CONFIG_PATH [OPTIONS]

Auto-fix common config mistakes (typos, missing ports, whitespace).

Arguments:
  CONFIG_PATH               Path to YAML config file to fix

Options:
  --write                   Write fixes back to file (creates timestamped .bak backup).
                            Note: does not preserve YAML comments or formatting.
  --interactive             Prompt for confirmation on ambiguous suggestions

Config resolution order

  1. --config / -c CLI argument
  2. XPYD_CONFIG environment variable
  3. ./xpyd.yaml in the current directory
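
The lookup order above amounts to a three-step fallback. The sketch below mirrors it. resolve_config_path is a hypothetical name, and the function does not check whether the file exists, which the real CLI presumably does.

```python
import os
from pathlib import Path


def resolve_config_path(cli_path=None, environ=None, cwd=None) -> Path:
    """Return the config path using the CLI > env > current-directory order."""
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else Path(cwd)

    if cli_path:                          # 1. --config / -c
        return Path(cli_path)
    env_path = environ.get("XPYD_CONFIG")
    if env_path:                          # 2. XPYD_CONFIG
        return Path(env_path)
    return cwd / "xpyd.yaml"              # 3. ./xpyd.yaml


if __name__ == "__main__":
    print(resolve_config_path())
```

Passing environ and cwd explicitly keeps the function easy to test without touching the real environment.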

YAML Config

# Required
model: /path/to/model
decode:
  - "10.0.0.1:8200"
  - "10.0.0.2:8200"

# Optional
prefill:
  - "10.0.0.3:8100"
port: 8000
log_level: warning
scheduling: loadbalanced   # roundrobin | loadbalanced | consistent_hash | power_of_two | cache_aware
generator_on_p_node: false

YAML Fields Reference

Field                           Type              Default       Description
model                           string            -             Model name / path (required in Format 1)
port                            int               8000          Proxy listen port
log_level                       string            warning       Log level: debug, info, warning, error
prefill                         list or topology  []            Prefill node config (Format 1)
decode                          list or topology  -             Decode node config (Format 1, required)
instances                       list              -             Per-instance config (Format 2): {address, role, model}
models                          list              -             Per-model shorthand (Format 3): {name, prefill, decode, dual, scheduler}
scheduling                      string            loadbalanced  Global scheduling policy
scheduling_config               dict              {}            Policy-specific options
generator_on_p_node             bool              false         Generate first token on prefill node
admin_api_key                   string            -             Admin API key (env ADMIN_API_KEY overrides)
openai_api_key                  string            -             OpenAI API key (env OPENAI_API_KEY overrides)
startup.wait_timeout_seconds    int               600           Max wait for nodes at startup
startup.probe_interval_seconds  int               10            Health probe interval

Valid role values: prefill, decode, dual

Valid scheduling values: loadbalanced, roundrobin (alias: round_robin), load_balanced, consistent_hash, power_of_two, cache_aware
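
Since the list above mixes canonical names and alternate spellings, a config loader might normalise them before use. The sketch below treats round_robin as an alias of roundrobin, as stated above, and assumes that load_balanced is likewise an alternate spelling of loadbalanced. The function name and the trimming and lower-casing are illustrative choices, not documented behaviour.

```python
# round_robin -> roundrobin is stated in the README; load_balanced -> loadbalanced
# is an assumption made for this sketch.
ALIASES = {"round_robin": "roundrobin", "load_balanced": "loadbalanced"}
CANONICAL = {
    "loadbalanced",
    "roundrobin",
    "consistent_hash",
    "power_of_two",
    "cache_aware",
}


def normalize_scheduling(name: str) -> str:
    """Map a scheduling value to its canonical spelling or raise ValueError."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    if key not in CANONICAL:
        raise ValueError(f"unknown scheduling policy: {name!r}")
    return key


if __name__ == "__main__":
    print(normalize_scheduling("round_robin"))
```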

API

The proxy exposes an OpenAI-compatible API:

  • POST /v1/chat/completions — Chat completions (streaming and non-streaming)
  • POST /v1/completions — Text completions (streaming and non-streaming)
  • GET /v1/models — List all registered models in OpenAI-compatible format

Startup Node Discovery

The proxy returns 503 on business endpoints until the minimum instance requirement is met: for each model, either at least one prefill node and one decode node, or at least one dual node, must report healthy. Health, status, and metrics endpoints are always available.
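
The readiness rule can be written as a predicate over the roles of the healthy instances for each model. This sketch assumes that every configured model must satisfy the rule before business endpoints open, which is one reading of "per model"; the function names are hypothetical.

```python
def model_ready(healthy_roles) -> bool:
    """True if a model has a healthy dual node, or a healthy prefill and decode node."""
    roles = set(healthy_roles)
    return "dual" in roles or {"prefill", "decode"} <= roles


def proxy_ready(healthy_by_model: dict[str, list[str]]) -> bool:
    """True once every configured model meets the minimum requirement (assumption)."""
    return bool(healthy_by_model) and all(
        model_ready(roles) for roles in healthy_by_model.values()
    )


if __name__ == "__main__":
    state = {"llama-3": ["prefill", "decode"], "qwen-2": ["dual"]}
    print(proxy_ready(state))
```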

Docker

# Full local topology (prefill + decode + proxy)
docker compose up --build

# Proxy only, connecting to existing GPU nodes
docker build -t xpyd .
docker run -p 8868:8868 -v "$(pwd)/config.yaml:/app/xpyd.yaml" xpyd

See docs/deployment.md for production deployment.

Benchmark

python -m vllm bench serve \
  --base-url http://localhost:8868 \
  --model DeepSeek-R1 \
  --dataset-name sonnet \
  --sonnet-input-len 1024 \
  --sonnet-output-len 128 \
  --num-prompts 100 \
  --request-rate 10

Development

# Install in dev mode
pip install -e ".[dev]"

# Run tests
python -m pytest tests/unit/ tests/integration/ -v

# Lint
pre-commit run --all-files

Environment Variables

Variable                 Description
XPYD_CONFIG              Default config file path
ADMIN_API_KEY            Admin API key (overrides YAML)
OPENAI_API_KEY           Bearer token for backend nodes (overrides YAML)
PREFILL_DELAY_PER_TOKEN  Simulated prefill latency for dummy nodes (default: 0.001s)
DECODE_DELAY_PER_TOKEN   Simulated decode latency for dummy nodes (default: 0.01s)
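
The "overrides YAML" behaviour for the two API keys can be pictured as a simple precedence rule. The sketch below is illustrative only: resolve_secret is a hypothetical helper, and treating an empty environment variable as unset is an assumption.

```python
import os


def resolve_secret(name: str, yaml_value=None, environ=None):
    """Prefer the environment variable over the YAML value when it is set."""
    environ = os.environ if environ is None else environ
    env_value = environ.get(name)
    # Assumption: an empty string counts as "not set" and falls back to YAML.
    return env_value if env_value else yaml_value


if __name__ == "__main__":
    print(resolve_secret("ADMIN_API_KEY", yaml_value="from-yaml") is not None)
```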

Documentation

Document       Description
Architecture   System architecture overview
API Reference  HTTP API endpoints
Configuration  YAML config reference
CLI            xpyd command-line tool
Scheduling     Load balancing strategies
Resilience     Health checks, circuit breakers, retry
Metrics        Prometheus metrics endpoint
Deployment     Deployment and Docker guide
Contributing   Contribution guidelines

License

Apache-2.0
