Python client for ModelExpress P2P GPU transfer service

ModelExpress Python Client

Python client for ModelExpress -- high-performance GPU-to-GPU model weight transfers using NVIDIA NIXL over RDMA/InfiniBand.

Instead of each vLLM instance loading model weights from storage, one "source" instance loads the model and transfers weights directly to "target" instances via GPUDirect RDMA, bypassing the CPU entirely.

Installation

# From PyPI
pip install modelexpress

# Editable install from source
pip install -e .

# With dev dependencies (pytest, grpcio-tools)
pip install -e ".[dev]"

Requirements

  • Python >= 3.10
  • NVIDIA GPUs with RDMA/InfiniBand support
  • NIXL (NVIDIA Interconnect eXchange Library)
  • A running ModelExpress server (Rust gRPC service backed by Redis)

Quick Start with vLLM

ModelExpress integrates with vLLM via custom model loaders. The loaders are registered automatically when MX_REGISTER_LOADERS=1 (the default); alternatively, call register_modelexpress_loaders() in your code.

export MODEL_EXPRESS_URL="modelexpress-server:8001"

vllm serve deepseek-ai/DeepSeek-V3 \
    --load-format mx \
    --tensor-parallel-size 8 \
    --worker-cls modelexpress.vllm_worker.ModelExpressWorker

When the vLLM engine starts with the mx loader, a source worker loads the weights from disk and publishes its NIXL and tensor metadata to the ModelExpress (MX) server. A target worker retrieves that metadata from the MX server and streams the weights GPU-to-GPU over RDMA.

Programmatic Usage

MxClient

MxClient is a lightweight gRPC client for communicating with the ModelExpress server:

from modelexpress import MxClient

client = MxClient(server_url="modelexpress-server:8001")

# Query for a source model
response = client.get_metadata("deepseek-ai/DeepSeek-V3")
if response.found:
    for worker in response.workers:
        print(f"Worker rank {worker.worker_rank}: {len(worker.tensors)} tensors")

# Wait for source readiness (blocks until ready or timeout)
success, session_id, metadata_hash = client.wait_for_ready(
    model_name="deepseek-ai/DeepSeek-V3",
    worker_id=0,
    timeout_seconds=7200,
)

client.close()
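
The response fields used above (found, workers, worker_rank, tensors) can be folded into a small helper. The following is a minimal sketch, not part of the modelexpress package: summarize_workers and query_tensor_counts are hypothetical names, and the code assumes only the attributes and methods shown in the example above. contextlib.closing ensures client.close() runs even if the query raises.

```python
from contextlib import closing


def summarize_workers(response):
    """Map worker rank -> tensor count for a get_metadata() response.

    Hypothetical helper, not part of modelexpress. It relies only on the
    attributes used in the example above: found, workers, worker_rank, tensors.
    """
    if not response.found:
        return {}
    return {w.worker_rank: len(w.tensors) for w in response.workers}


def query_tensor_counts(client, model_name):
    """Query one model and always close the client afterwards."""
    # closing() calls client.close() on exit, including when an exception is raised.
    with closing(client) as c:
        return summarize_workers(c.get_metadata(model_name))
```

With a running server, this would be invoked as query_tensor_counts(MxClient(server_url="modelexpress-server:8001"), "deepseek-ai/DeepSeek-V3").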

Registering Loaders Manually

from modelexpress import register_modelexpress_loaders

register_modelexpress_loaders()
# Now vLLM recognizes --load-format mx

Environment Variables

Variable             Default                     Description
MODEL_EXPRESS_URL    localhost:8001              ModelExpress gRPC server address
MX_SERVER_ADDRESS    localhost:8001              Backward-compatible alias for MODEL_EXPRESS_URL
MX_REGISTER_LOADERS  1                           Auto-register mx loader with vLLM
MX_EXPECTED_WORKERS  Auto-detected from TP size  Number of GPU workers to coordinate
MX_SYNC_PUBLISH      0                           Source: wait for all workers before publishing metadata
MX_SYNC_START        1                           Target: wait for all source workers before transferring
MX_CONTIGUOUS_REG    0                           Enable contiguous region registration (experimental)
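
To illustrate how these settings could be read from Python, here is a minimal sketch. It is not the package's own code: resolve_server_url and flag_enabled are hypothetical helpers, the precedence of MODEL_EXPRESS_URL over its alias is an assumption (the table documents the alias but not the order), and treating "1" as the only enabled value is likewise an assumption.

```python
import os


def resolve_server_url(env=None):
    """Return the ModelExpress gRPC address from environment variables.

    Assumption: MODEL_EXPRESS_URL wins over its alias MX_SERVER_ADDRESS.
    """
    env = os.environ if env is None else env
    return (
        env.get("MODEL_EXPRESS_URL")
        or env.get("MX_SERVER_ADDRESS")
        or "localhost:8001"  # documented default
    )


def flag_enabled(name, default, env=None):
    """Read a 0/1 style flag such as MX_SYNC_START.

    Assumption: "1" means enabled and any other value means disabled.
    """
    env = os.environ if env is None else env
    return env.get(name, default) == "1"
```

For example, flag_enabled("MX_SYNC_START", "1") reflects the documented default of 1 when the variable is unset.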

UCX/NIXL Tuning

Variable         Recommended                Description
UCX_TLS          rc_x,rc,dc_x,dc,cuda_copy  Transport layers for InfiniBand
UCX_RNDV_SCHEME  get_zcopy                  Zero-copy RDMA reads
UCX_RNDV_THRESH  0                          Force rendezvous for all transfers
NIXL_LOG_LEVEL   INFO                       NIXL logging level
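
When launching from a Python wrapper, the recommended values can be applied as defaults without overriding anything the operator has already set. This is a minimal sketch: apply_recommended_env is a hypothetical helper, and it assumes the variables are set before the process that initializes UCX and NIXL starts, since such settings are typically read at initialization.

```python
import os

# Recommended values copied from the table above.
RECOMMENDED_UCX_ENV = {
    "UCX_TLS": "rc_x,rc,dc_x,dc,cuda_copy",
    "UCX_RNDV_SCHEME": "get_zcopy",
    "UCX_RNDV_THRESH": "0",
    "NIXL_LOG_LEVEL": "INFO",
}


def apply_recommended_env(env=None):
    """Fill in recommended values, keeping any value that is already set."""
    env = os.environ if env is None else env
    for key, value in RECOMMENDED_UCX_ENV.items():
        env.setdefault(key, value)
    return env
```

Using setdefault rather than plain assignment keeps site-specific tuning (for example a different UCX_TLS list) intact.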

Package Structure

Module                      Description
modelexpress.client         MxClient -- gRPC client for the ModelExpress server
modelexpress.vllm_loader    MxModelLoader -- vLLM integration
modelexpress.nixl_transfer  NixlTransferManager -- NIXL agent lifecycle and RDMA transfers
modelexpress.types          TensorDescriptor, WorkerMetadata -- core data types
modelexpress.vllm_worker    vLLM worker extensions

How It Works

  1. Source loads weights from disk, registers raw tensors with NIXL before FP8 processing, and publishes metadata to the ModelExpress server.
  2. Target creates dummy weights, waits for the source ready flag, then pulls raw tensors via RDMA read.
  3. Both source and target run process_weights_after_loading() independently, producing identical FP8-transformed weights.

This pre-processing transfer strategy is critical for FP8 models (e.g., DeepSeek-V3) where weight_scale_inv tensors are renamed and transformed during processing.
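
The three steps can be mimicked with a toy model in plain Python. This is only an illustration of the ordering, not the actual implementation: the process_weights_after_loading body below is a hypothetical stand-in (it renames weight_scale_inv entries and inverts their values), lists stand in for GPU tensors, and a plain copy stands in for the RDMA read.

```python
def process_weights_after_loading(raw):
    """Toy stand-in for the real post-load step (hypothetical logic).

    Renames "weight_scale_inv" entries to "weight_scale" and inverts the
    values, purely to mimic a rename plus transform.
    """
    processed = {}
    for name, values in raw.items():
        if name.endswith("weight_scale_inv"):
            new_name = name[: -len("_inv")]
            processed[new_name] = [1.0 / v for v in values]
        else:
            processed[name] = list(values)
    return processed


# 1. Source loads raw weights and exposes them before processing.
source_raw = {"layer0.weight": [1.0, 2.0], "layer0.weight_scale_inv": [0.5, 4.0]}

# 2. Target starts from dummy weights with the same names and sizes,
#    then overwrites them with the source's raw values (an RDMA read in practice).
target_raw = {name: [0.0] * len(v) for name, v in source_raw.items()}
for name, values in source_raw.items():
    target_raw[name][:] = values

# 3. Both sides run the same processing step independently.
source_final = process_weights_after_loading(source_raw)
target_final = process_weights_after_loading(target_raw)
assert source_final == target_final
```

Because the copy in step 2 happens while both sides still hold raw tensors under the same names, the target never has to match renamed or transformed tensors against the source.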

License

Apache-2.0

Built Distribution

modelexpress-0.3.0-py3-none-any.whl (44.3 kB view details)

Uploaded Python 3

File details

Details for the file modelexpress-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: modelexpress-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 44.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for modelexpress-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6f93d8de74903f6c11ebf9b4621fa02e3c493a5e05207c38fb2c98395a774618
MD5 177ea8969d3d769f10acc212f5bf4131
BLAKE2b-256 7cb512f940a41940fd83b9e50ce46124695b68a80feab02612779f8fb584db09
