Python client for ModelExpress P2P GPU transfer service

ModelExpress Python Client

Python client for ModelExpress -- high-performance GPU-to-GPU model weight transfers using NVIDIA NIXL over RDMA/InfiniBand.

Instead of each vLLM instance loading model weights from storage, one "source" instance loads the model and transfers weights directly to "target" instances via GPUDirect RDMA, bypassing the CPU entirely.

Installation

# From PyPI
pip install modelexpress

# Editable install from source
pip install -e .

# With dev dependencies (pytest, grpcio-tools)
pip install -e ".[dev]"

Requirements

Python >= 3.10
NVIDIA GPUs with RDMA/InfiniBand support
NIXL (NVIDIA Interconnect eXchange Library)
A running ModelExpress server (Rust gRPC service backed by Redis)

Quick Start with vLLM

ModelExpress integrates with vLLM via custom model loaders. Set MX_REGISTER_LOADERS=1 to auto-register them, or call register_modelexpress_loaders() in your code.

export MODEL_EXPRESS_URL="modelexpress-server:8001"

vllm serve deepseek-ai/DeepSeek-V3 \
    --load-format mx \
    --tensor-parallel-size 8 \
    --worker-cls modelexpress.vllm_worker.ModelExpressWorker

When the vLLM engine starts with the mx loader, the source worker loads the weights from disk and publishes its NIXL and tensor metadata to the MX server. The target worker retrieves this metadata from the MX server and streams the weights from GPU to GPU over RDMA.

Programmatic Usage

MxClient

MxClient is a lightweight gRPC client for communicating with the ModelExpress server:

from modelexpress import MxClient

client = MxClient(server_url="modelexpress-server:8001")

# Query for a source model
response = client.get_metadata("deepseek-ai/DeepSeek-V3")
if response.found:
    for worker in response.workers:
        print(f"Worker rank {worker.worker_rank}: {len(worker.tensors)} tensors")

# Wait for source readiness (blocks until ready or timeout)
success, session_id, metadata_hash = client.wait_for_ready(
    model_name="deepseek-ai/DeepSeek-V3",
    worker_id=0,
    timeout_seconds=7200,
)

client.close()
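
For logging or health checks it can help to reduce a metadata response to a few numbers. The sketch below is a hypothetical helper, `summarize_metadata`, which is not part of the package. It relies only on the fields used in the example above (`found`, `workers`, `worker_rank`, `tensors`) and is duck-typed, so it is exercised here with stand-in objects rather than a live server.

```python
from types import SimpleNamespace


def summarize_metadata(response):
    """Return {worker_rank: tensor_count} for a metadata response.

    Hypothetical helper: it assumes only the attributes shown in the
    MxClient example (found, workers, worker_rank, tensors) and returns
    an empty dict when no source has been published.
    """
    if not response.found:
        return {}
    return {w.worker_rank: len(w.tensors) for w in response.workers}


# Stand-in response for illustration only; a real one would come from
# client.get_metadata(...).
fake_response = SimpleNamespace(
    found=True,
    workers=[
        SimpleNamespace(worker_rank=0, tensors=["a", "b"]),
        SimpleNamespace(worker_rank=1, tensors=["c"]),
    ],
)
summary = summarize_metadata(fake_response)
print(summary)
```

With a real client, the same function could be called on the result of `client.get_metadata(...)` before deciding whether to wait for readiness.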

Registering Loaders Manually

from modelexpress import register_modelexpress_loaders

register_modelexpress_loaders()
# Now vLLM recognizes --load-format mx

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODEL_EXPRESS_URL` | `localhost:8001` | ModelExpress gRPC server address |
| `MX_SERVER_ADDRESS` | `localhost:8001` | Backward-compatible alias for `MODEL_EXPRESS_URL` |
| `MX_REGISTER_LOADERS` | `1` | Auto-register `mx` loader with vLLM |
| `MX_EXPECTED_WORKERS` | Auto-detected from TP size | Number of GPU workers to coordinate |
| `MX_SYNC_PUBLISH` | `0` | Source: wait for all workers before publishing metadata |
| `MX_SYNC_START` | `1` | Target: wait for all source workers before transferring |
| `MX_CONTIGUOUS_REG` | `0` | Enable contiguous region registration (experimental) |
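
When constructing `MxClient` yourself, you may want to resolve the server address from the same variables. The following is a minimal sketch, not the package's own lookup logic: the helper name `resolve_server_url` is hypothetical, and the precedence (`MODEL_EXPRESS_URL` before the `MX_SERVER_ADDRESS` alias) is an assumption, since the table does not state which one wins when both are set.

```python
import os

DEFAULT_URL = "localhost:8001"  # default listed in the table


def resolve_server_url(env=None):
    """Pick the server address from an environment mapping.

    Assumption: MODEL_EXPRESS_URL takes precedence over the
    backward-compatible MX_SERVER_ADDRESS alias. Empty values are
    treated as unset.
    """
    env = os.environ if env is None else env
    return (
        env.get("MODEL_EXPRESS_URL")
        or env.get("MX_SERVER_ADDRESS")
        or DEFAULT_URL
    )


# Demonstration with explicit mappings instead of the real environment.
print(resolve_server_url({"MX_SERVER_ADDRESS": "mx-legacy:8001"}))
print(resolve_server_url({}))
```

The result can then be passed as `server_url` to `MxClient`.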

UCX/NIXL Tuning

| Variable | Recommended | Description |
|---|---|---|
| `UCX_TLS` | `rc_x,rc,dc_x,dc,cuda_copy` | Transport layers for InfiniBand |
| `UCX_RNDV_SCHEME` | `get_zcopy` | Zero-copy RDMA reads |
| `UCX_RNDV_THRESH` | `0` | Force rendezvous for all transfers |
| `NIXL_LOG_LEVEL` | `INFO` | NIXL logging level |
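
If you launch workers from Python rather than a shell, the recommended values can be applied without overriding anything the operator has already set. This is a sketch under assumptions: `apply_recommended_env` is a hypothetical helper, and because UCX reads its configuration from the environment when it initializes, the call most likely needs to happen before any library that starts UCX or NIXL is loaded.

```python
import os

# Values copied from the table above.
RECOMMENDED_ENV = {
    "UCX_TLS": "rc_x,rc,dc_x,dc,cuda_copy",
    "UCX_RNDV_SCHEME": "get_zcopy",
    "UCX_RNDV_THRESH": "0",
    "NIXL_LOG_LEVEL": "INFO",
}


def apply_recommended_env(env=None):
    """Set recommended values only where the variable is not already set.

    Returns the subset that was actually applied, which is handy for logging.
    """
    env = os.environ if env is None else env
    applied = {}
    for key, value in RECOMMENDED_ENV.items():
        if key not in env:
            env[key] = value
            applied[key] = value
    return applied


# Demonstration on a plain dict: an existing UCX_TLS choice is preserved.
demo_env = {"UCX_TLS": "tcp"}
applied = apply_recommended_env(demo_env)
print(sorted(applied))
```

Calling `apply_recommended_env()` with no argument modifies `os.environ` for the current process and any child processes it spawns.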

Package Structure

| Module | Description |
|---|---|
| `modelexpress.client` | `MxClient` -- gRPC client for the ModelExpress server |
| `modelexpress.vllm_loader` | `MxModelLoader` -- vLLM integration |
| `modelexpress.nixl_transfer` | `NixlTransferManager` -- NIXL agent lifecycle and RDMA transfers |
| `modelexpress.types` | `TensorDescriptor`, `WorkerMetadata` -- core data types |
| `modelexpress.vllm_worker` | vLLM worker extensions |

How It Works

1. Source loads weights from disk, registers raw tensors with NIXL before FP8 processing, and publishes metadata to the ModelExpress server.
2. Target creates dummy weights, waits for the source ready flag, then pulls raw tensors via RDMA read.
3. Both source and target run process_weights_after_loading() independently, producing identical FP8-transformed weights.

This pre-processing transfer strategy is critical for FP8 models (e.g., DeepSeek-V3) where weight_scale_inv tensors are renamed and transformed during processing.
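
The ordering described above can be illustrated with a toy model in plain Python. Nothing here is the real implementation: tensors are lists, the "RDMA read" is a copy by name, and the processing step (renaming `weight_scale_inv` to `weight_scale` and inverting the values) is an invented stand-in for whatever vLLM actually does. The point is only that transferring raw tensors and then processing on both sides yields identical results, while the processed layout no longer matches the raw one.

```python
def process_weights_after_loading(weights):
    """Toy stand-in for post-load processing (assumed behaviour):
    rename *weight_scale_inv -> *weight_scale and invert the values."""
    out = {}
    for name, values in weights.items():
        if name.endswith("weight_scale_inv"):
            out[name[: -len("_inv")]] = [1.0 / v for v in values]
        else:
            out[name] = list(values)
    return out


def toy_rdma_read(source_registered, target_buffers):
    """Toy transfer: fill each preallocated target buffer by tensor name."""
    for name in target_buffers:
        target_buffers[name] = list(source_registered[name])


# Source: load raw weights and register them BEFORE processing.
raw = {
    "layer0.weight": [1.0, 2.0],
    "layer0.weight_scale_inv": [0.5, 0.25],
}
registered = {name: list(values) for name, values in raw.items()}
source_final = process_weights_after_loading(raw)

# Target: dummy weights with the raw layout, then pull and process.
target = {name: [0.0] * len(values) for name, values in raw.items()}
toy_rdma_read(registered, target)
target_final = process_weights_after_loading(target)

print(source_final == target_final)
# After processing, tensor names differ from the raw layout, so a target
# holding raw-layout buffers could not match them by name.
print(sorted(set(raw) - set(source_final)))
```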

License

Apache-2.0

Version 0.3.0, released Apr 17, 2026


modelexpress-0.3.0-py3-none-any.whl (44.3 kB view details)

Uploaded Apr 17, 2026 Python 3

File details

Details for the file modelexpress-0.3.0-py3-none-any.whl.

File metadata

Download URL: modelexpress-0.3.0-py3-none-any.whl
Upload date: Apr 17, 2026
Size: 44.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for modelexpress-0.3.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | `6f93d8de74903f6c11ebf9b4621fa02e3c493a5e05207c38fb2c98395a774618` |
| MD5 | `177ea8969d3d769f10acc212f5bf4131` |
| BLAKE2b-256 | `7cb512f940a41940fd83b9e50ce46124695b68a80feab02612779f8fb584db09` |
