Python client for ModelExpress P2P GPU transfer service

ModelExpress Python Client

Python client for ModelExpress -- high-performance GPU-to-GPU model weight transfers using NVIDIA NIXL over RDMA/InfiniBand.

Instead of each vLLM instance loading model weights from storage, one "source" instance loads the model and transfers weights directly to "target" instances via GPUDirect RDMA, bypassing the CPU entirely.

Installation

# From PyPI
pip install modelexpress

# Editable install from source
pip install -e .

# With dev dependencies (pytest, grpcio-tools)
pip install -e ".[dev]"

Requirements

  • Python >= 3.10
  • NVIDIA GPUs with RDMA/InfiniBand support
  • NIXL (NVIDIA Interconnect eXchange Library)
  • A running ModelExpress server (Rust gRPC service backed by Redis)

Quick Start with vLLM

ModelExpress integrates with vLLM via custom model loaders. vLLM can discover the package through its vllm.general_plugins entrypoint; set VLLM_PLUGINS=modelexpress if your vLLM deployment requires explicit plugin selection. For manual registration, call register_modelexpress_loaders() in your code.

export MX_SERVER_ADDRESS="modelexpress-server:8001"

vllm serve deepseek-ai/DeepSeek-V3 \
    --load-format modelexpress \
    --tensor-parallel-size 8

On the source worker, starting the vLLM engine with the modelexpress load format loads the weights from disk, then registers and publishes the NIXL and tensor metadata with the MX server. On the target worker, the same load format retrieves that metadata from the MX server and streams the weights GPU-to-GPU over RDMA. The mx load format is kept as a backward-compatible alias.

Programmatic Usage

MxClient

MxClient is a lightweight gRPC client for communicating with the ModelExpress server:

from modelexpress import MxClient

client = MxClient(server_url="modelexpress-server:8001")

# Query for a source model
response = client.get_metadata("deepseek-ai/DeepSeek-V3")
if response.found:
    for worker in response.workers:
        print(f"Worker rank {worker.worker_rank}: {len(worker.tensors)} tensors")

# Wait for source readiness (blocks until ready or timeout)
success, session_id, metadata_hash = client.wait_for_ready(
    model_name="deepseek-ai/DeepSeek-V3",
    worker_id=0,
    timeout_seconds=7200,
)

client.close()
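
The calls above can be wrapped so that the client is always closed, even when the wait fails. The following is a minimal sketch and not part of the package: `wait_for_source` is a hypothetical helper, and it assumes only what the example above shows, namely that `wait_for_ready` returns a `(success, session_id, metadata_hash)` tuple and that the client has a `close()` method.

```python
from contextlib import closing
from typing import Any


def wait_for_source(
    client: Any,
    model_name: str,
    worker_id: int = 0,
    timeout_seconds: int = 7200,
) -> tuple[Any, Any]:
    """Block until the source is ready, then return (session_id, metadata_hash).

    Hypothetical helper: `client` is any object exposing the `wait_for_ready`
    and `close` methods used in the MxClient example.
    """
    # closing() calls client.close() on exit, including when an exception is raised.
    with closing(client):
        success, session_id, metadata_hash = client.wait_for_ready(
            model_name=model_name,
            worker_id=worker_id,
            timeout_seconds=timeout_seconds,
        )
    if not success:
        raise TimeoutError(
            f"source for {model_name!r} (worker {worker_id}) not ready "
            f"after {timeout_seconds}s"
        )
    return session_id, metadata_hash
```

With a running server, this would be called as `wait_for_source(MxClient(server_url="modelexpress-server:8001"), "deepseek-ai/DeepSeek-V3")`.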

Registering Loaders Manually

from modelexpress import register_modelexpress_loaders

register_modelexpress_loaders()
# Now vLLM recognizes --load-format modelexpress and mx

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| MX_SERVER_ADDRESS | localhost:8001 | ModelExpress gRPC server address (recommended) |
| MODEL_EXPRESS_URL | localhost:8001 | Deprecated, pending removal in a future release. Still read by all client paths and takes precedence when both are set; keep setting it during the transition. |
| MX_EXPECTED_WORKERS | Auto-detected from TP size | Number of GPU workers to coordinate |
| MX_SYNC_PUBLISH | 0 | Source: wait for all workers before publishing metadata |
| MX_SYNC_START | 1 | Target: wait for all source workers before transferring |
| MX_POOL_REG | 0 | Allocation-level NIXL registration (registers cudaMalloc blocks instead of individual tensors) |
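
The precedence between the two address variables is easy to get wrong, so here is a small sketch of the rule as the table describes it. It is an illustration, not the package's own resolution code: the function names are hypothetical, and treating an empty value as unset and only the literal "1" as enabled are assumptions.

```python
import os
from collections.abc import Mapping

DEFAULT_ADDRESS = "localhost:8001"


def resolve_server_address(env: Mapping[str, str] = os.environ) -> str:
    """Pick the server address using the precedence described in the table."""
    # The deprecated MODEL_EXPRESS_URL wins when both variables are set.
    for name in ("MODEL_EXPRESS_URL", "MX_SERVER_ADDRESS"):
        value = env.get(name, "").strip()
        if value:  # assumption: an empty value counts as unset
            return value
    return DEFAULT_ADDRESS


def flag(env: Mapping[str, str], name: str, default: str) -> bool:
    """Read a 0/1 switch such as MX_SYNC_PUBLISH or MX_SYNC_START."""
    # Assumption: only the literal "1" enables the switch.
    return env.get(name, default).strip() == "1"
```

For example, `resolve_server_address({"MX_SERVER_ADDRESS": "a:1", "MODEL_EXPRESS_URL": "b:2"})` returns `"b:2"`, which is why a stale MODEL_EXPRESS_URL can silently override the recommended variable.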

UCX/NIXL Tuning

| Variable | Recommended | Description |
| --- | --- | --- |
| UCX_RNDV_SCHEME | get_zcopy | Zero-copy RDMA reads |
| UCX_RNDV_THRESH | 0 | Force rendezvous for all transfers |
| NIXL_LOG_LEVEL | INFO | NIXL logging level |
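
When vLLM is launched from a Python wrapper, the recommended values can be applied as defaults without overriding what an operator has already exported. This is a minimal sketch; `tuned_env` is a hypothetical helper name.

```python
import os

# Recommended values from the table above.
RECOMMENDED = {
    "UCX_RNDV_SCHEME": "get_zcopy",
    "UCX_RNDV_THRESH": "0",
    "NIXL_LOG_LEVEL": "INFO",
}


def tuned_env(base=None):
    """Return a copy of `base` (default: os.environ) with the recommended values filled in."""
    env = dict(os.environ if base is None else base)
    for name, value in RECOMMENDED.items():
        # setdefault keeps any value that is already present.
        env.setdefault(name, value)
    return env
```

The result can be passed to a launcher, for example `subprocess.run(["vllm", "serve", "deepseek-ai/DeepSeek-V3", "--load-format", "modelexpress", "--tensor-parallel-size", "8"], env=tuned_env())`.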

Package Structure

| Module | Description |
| --- | --- |
| modelexpress.client | MxClient -- gRPC client for the ModelExpress server |
| modelexpress.metadata | Metadata clients, source identity, heartbeat, and worker manifest serving |
| modelexpress.engines.vllm.loader | MxModelLoader -- vLLM integration |
| modelexpress.vllm_loader | Compatibility shim for the vLLM loader |
| modelexpress.nixl_transfer | NixlTransferManager -- NIXL agent lifecycle and RDMA transfers |
| modelexpress.types | TensorDescriptor, WorkerMetadata -- core data types |
| modelexpress.vllm_worker | Compatibility worker extension for older manual-registration workflows |

How It Works

  1. Source loads weights from disk, registers raw tensors with NIXL before FP8 processing, and publishes metadata to the ModelExpress server.
  2. Target creates dummy weights, waits for the source ready flag, then pulls raw tensors via RDMA read.
  3. Both source and target run process_weights_after_loading() independently, producing identical FP8-transformed weights.

This pre-processing transfer strategy is critical for FP8 models (e.g., DeepSeek-V3) where weight_scale_inv tensors are renamed and transformed during processing.
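
The three steps can be illustrated with a toy model in which plain Python lists stand in for GPU tensors and a dictionary copy stands in for the RDMA read. The rename and the reciprocal transform inside `process_weights_after_loading` are invented for illustration; they are not what vLLM actually does to `weight_scale_inv`.

```python
def process_weights_after_loading(weights: dict[str, list[float]]) -> dict[str, list[float]]:
    """Toy stand-in for the real post-load step: renames and transforms scales."""
    out = {}
    for name, values in weights.items():
        if name.endswith("weight_scale_inv"):
            # Hypothetical rename and transform, for illustration only.
            out[name.removesuffix("_inv")] = [1.0 / v for v in values]
        else:
            out[name] = list(values)
    return out


# 1. Source loads raw weights and "registers" them before any processing.
source_raw = {"layer0.weight": [1.0, 2.0], "layer0.weight_scale_inv": [0.5, 0.25]}
registered = {name: list(vals) for name, vals in source_raw.items()}

# 2. Target allocates dummy weights with the same raw names and sizes, then
#    overwrites them in place from the registered tensors (the "RDMA read").
target_raw = {name: [0.0] * len(vals) for name, vals in source_raw.items()}
for name in target_raw:
    target_raw[name][:] = registered[name]

# 3. Each side runs the same processing step independently.
source_final = process_weights_after_loading(source_raw)
target_final = process_weights_after_loading(target_raw)
assert source_final == target_final

# Transferring after processing would not line up: the processed source no
# longer carries the raw names that the target's dummy weights use.
missing = set(target_raw) - set(source_final)
print(sorted(missing))
```

The last lines show the failure mode the strategy avoids: once the source has been processed, the target's raw tensor names have no counterpart to read from.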

License

Apache-2.0

Download files

Source Distribution

modelexpress-0.4.0.tar.gz (136.1 kB)

Uploaded Source

Built Distributions

modelexpress-0.4.0-py3-none-any.whl (116.2 kB)

Uploaded Python 3

modelexpress-0.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (199.4 kB)

Uploaded CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

modelexpress-0.4.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl (199.8 kB)

Uploaded CPython 3.12, manylinux: glibc 2.24+ ARM64, manylinux: glibc 2.28+ ARM64

File details

Details for the file modelexpress-0.4.0.tar.gz.

File metadata

  • Download URL: modelexpress-0.4.0.tar.gz
  • Upload date:
  • Size: 136.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for modelexpress-0.4.0.tar.gz
Algorithm Hash digest
SHA256 4fb9436bf184c3e0a35d7eff0e2997772d10049f43b9cf08ce0392cfad53cd4a
MD5 eb670ff8170cd6786a177663aa62e3de
BLAKE2b-256 f95b64b8c621ecb7176d0478ab85cfea198361aa56aca13f46e1a752944689a6

See more details on using hashes here.

File details

Details for the file modelexpress-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: modelexpress-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 116.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for modelexpress-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6edf8c9437f55fa0227a170a0927cedbb21f1f37fc513733c1b86fdfb069285a
MD5 bf5043da4978a22dd99f5ae40dea6c6a
BLAKE2b-256 7fae11bcb8084809d580e89c28be1a91219ecac76bff55b87742edc94af3e950

File details

Details for the file modelexpress-0.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for modelexpress-0.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 0189ccad66ee625eb73cecb541bbcbd57ce057491cb065e48e2f17c89b65380f
MD5 17a80a3a4da12af1149df739d2ba0c05
BLAKE2b-256 e136ff15fc6c8026c9a49760105ee126e26b4dbffba83907f2b2913b9b8744ea

File details

Details for the file modelexpress-0.4.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl.

File hashes

Hashes for modelexpress-0.4.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 22a9092fdff85581e62b5a2f595a4385fb0f1305b4d1212a11f4dc9b2781d2bb
MD5 afe60880b4a720d61054c75bc5908189
BLAKE2b-256 67d3e2343f18e696d0aa3c8eab7261519cbefb7beaaa9b718bd71732c284dcc9
