Modular RDMA Interface — GPU communication library for P2P, RDMA/IBGDA, and SDMA
MORI
News
- [2026/02] 🔥 MORI powers AMD's WideEP and PD disaggregation in SemiAnalysis InferenceX v2 benchmark (PR, InferenceX, blog).
- [2026/01] 🔥 MORI-EP and MORI-IO integrated into SGLang and vLLM for MoE Expert Parallelism and PD Disaggregation on AMD GPUs (sglang & MORI-EP, sglang & MORI-IO, vllm & MORI-EP, vllm & MORI-IO).
- [2025/12] MORI adds support for AMD's AINIC (Pollara) with SOTA performance (AINIC & MORI-EP, AINIC & MORI-IO).
- [2025/09] MORI-EP now seamlessly scales to 64 GPUs with SOTA performance (multiple optimizations, multi-QP support, low-latency kernel).
- [2025/09] MORI adds Broadcom BNXT (Thor2) IBGDA support (PR).
Introduction
MORI (Modular RDMA Interface) is a bottom-up, modular, and composable framework for building high-performance communication applications, with a strong focus on RDMA + GPU integration. Inspired by the role of MLIR in compiler infrastructure, MORI provides reusable and extensible building blocks that make it easier for developers to adopt advanced techniques such as IBGDA (InfiniBand GPUDirect Async) and GDS (GPUDirect Storage).
To help developers get started quickly, MORI also includes a suite of optimized libraries—MORI-EP (MoE dispatch & combine kernels), MORI-IO (p2p communication for KVCache transfer), and MORI-CCL (collective communication)—that deliver out-of-the-box performance, with support for AMD Pensando DSC, Broadcom Thor2, and NVIDIA Mellanox ConnectX-7 NICs.
Features summary
- Applications
- MORI-EP: intra- and inter-node dispatch/combine kernels with SOTA performance.
- MORI-IO: point-to-point communication library with ultra-low overhead.
- MORI-CCL: lightweight and flexible collective communication library designed for highly customized use cases such as latency-sensitive or resource-constrained environments.
- Framework
- High-performance building blocks for IBGDA, P2P, and more
- Modular & composable components for developing communication applications, such as transport management and topology detection
- OpenSHMEM-style APIs
- C++ and Python level APIs
Documentation
| Topic | Description | Guide |
|---|---|---|
| MORI-EP | Dispatch/combine API, kernel types, configuration, usage examples | EP Guide |
| MORI-SHMEM | Symmetric memory APIs, initialization, memory management | Shmem Guide |
| MORI-IR | Device bitcode integration for Triton and other GPU kernel frameworks | IR Guide |
| MORI-IO | P2P communication concepts, engine/backend/session design | IO Guide |
| MORI-VIZ | Warp-level kernel profiler with Perfetto integration | Profiler |
Benchmarks
MORI-EP
Benchmark on DeepSeek V3 model configurations:
Bandwidth (4096 tokens, 7168 hidden, top-8 experts, FP8 dispatch + BF16 combine)
| Hardware | Kernels | Dispatch XGMI | Dispatch RDMA | Combine XGMI | Combine RDMA |
|---|---|---|---|---|---|
| MI300X + CX7 | EP8 | 307 GB/s | x | 330 GB/s | x |
| MI300X + CX7 | EP16-V1 | 171 GB/s | 52 GB/s | 219 GB/s | 67 GB/s |
| MI300X + CX7 | EP32-V1 | 103 GB/s* | 57 GB/s* | 91 GB/s* | 50 GB/s* |
| MI355X + AINIC | EP8 | 345 GB/s | x | 420 GB/s | x |
| MI355X + AINIC | EP16-V1 | 179 GB/s | 54 GB/s | 234 GB/s | 71 GB/s |
| MI355X + AINIC | EP32-V1 | 85 GB/s | 46 GB/s | 110 GB/s | 61 GB/s |
Latency (128 tokens, 7168 hidden, top-8 experts, FP8 dispatch + BF16 combine)
| Hardware | Kernels | Dispatch Latency | Dispatch BW | Combine Latency | Combine BW |
|---|---|---|---|---|---|
| MI300X + CX7 | EP8 | 35 us | 134 GB/s | 47 us | 204 GB/s |
| MI300X + CX7 | EP16-V1-LL | 76 us | 96 GB/s | 122 us | 121 GB/s |
| MI300X + CX7 | EP32-V1-LL | 157 us* | 48 GB/s* | 280 us* | 55 GB/s* |
| MI355X + AINIC | EP8 | 31 us | 142 GB/s | 36 us | 276 GB/s |
| MI355X + AINIC | EP16-V1-LL | 84 us | 87 GB/s | 108 us | 139 GB/s |
| MI355X + AINIC | EP32-V1-LL | 152 us | 45 GB/s | 187 us | 76 GB/s |
* Stale data from previous kernel version; updated numbers pending re-benchmarking.
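As a back-of-the-envelope check on the figures above, the per-rank payload implied by the 4096-token configuration can be computed directly. This is an illustrative assumption, not taken from the MORI docs: each token's hidden vector travels once per selected expert, with FP8 at 1 byte per element on dispatch and BF16 at 2 bytes per element on combine.

```python
# Illustrative payload arithmetic for the DeepSeek V3-style configuration above.
# Assumption (not from the MORI docs): every token's hidden vector is sent to
# each of its top-8 experts; FP8 = 1 B/elem on dispatch, BF16 = 2 B/elem on combine.
TOKENS, HIDDEN, TOPK = 4096, 7168, 8
FP8_BYTES, BF16_BYTES = 1, 2

dispatch_bytes = TOKENS * HIDDEN * TOPK * FP8_BYTES
combine_bytes = TOKENS * HIDDEN * TOPK * BF16_BYTES

print(f"dispatch payload: {dispatch_bytes / 1e6:.0f} MB")  # 235 MB
print(f"combine payload:  {combine_bytes / 1e6:.0f} MB")   # 470 MB
```

Under this assumption, combine moves roughly twice the bytes of dispatch (BF16 vs. FP8), which is consistent with the combine bandwidths in the tables running higher than dispatch at comparable latencies.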
MORI-IO
NOTE: These are preview MORI-IO benchmark results.
GPU Direct RDMA READ, pairwise, 128 consecutive transfers, 1 GPU, MI300X + Thor2:
+--------------------------------------------------------------------------------------------------------+
| Initiator Rank 0 |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
| MsgSize (B) | BatchSize | TotalSize (MB) | Max BW (GB/s) | Avg Bw (GB/s) | Min Lat (us) | Avg Lat (us) |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
| 8 | 128 | 0.00 | 0.03 | 0.03 | 33.38 | 36.33 |
| 16 | 128 | 0.00 | 0.06 | 0.06 | 34.09 | 36.35 |
| 32 | 128 | 0.00 | 0.12 | 0.11 | 34.57 | 36.33 |
| 64 | 128 | 0.01 | 0.24 | 0.23 | 33.62 | 36.33 |
| 128 | 128 | 0.02 | 0.49 | 0.45 | 33.62 | 36.49 |
| 256 | 128 | 0.03 | 0.94 | 0.89 | 34.81 | 36.99 |
| 512 | 128 | 0.07 | 1.86 | 1.77 | 35.29 | 37.01 |
| 1024 | 128 | 0.13 | 3.84 | 3.53 | 34.09 | 37.09 |
| 2048 | 128 | 0.26 | 7.33 | 6.96 | 35.76 | 37.65 |
| 4096 | 128 | 0.52 | 12.94 | 12.46 | 40.53 | 42.09 |
| 8192 | 128 | 1.05 | 20.75 | 20.12 | 50.54 | 52.11 |
| 16384 | 128 | 2.10 | 29.03 | 28.33 | 72.24 | 74.02 |
| 32768 | 128 | 4.19 | 36.50 | 35.91 | 114.92 | 116.81 |
| 65536 | 128 | 8.39 | 41.74 | 41.39 | 200.99 | 202.70 |
| 131072 | 128 | 16.78 | 45.14 | 44.85 | 371.69 | 374.10 |
| 262144 | 128 | 33.55 | 46.93 | 46.76 | 715.02 | 717.56 |
| 524288 | 128 | 67.11 | 47.94 | 47.81 | 1399.99 | 1403.64 |
| 1048576 | 128 | 134.22 | 48.44 | 48.32 | 2770.90 | 2777.76 |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
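The table's columns are mutually consistent: bandwidth is simply total bytes moved divided by elapsed time. A minimal sketch of the conversion, using values copied from the 1048576-byte row above:

```python
# Recompute effective bandwidth from the benchmark table's own columns:
# BW (GB/s) = MsgSize * BatchSize / latency, with 1 GB = 1e9 bytes.
def effective_bw_gbps(msg_size_bytes: int, batch_size: int, lat_us: float) -> float:
    """Total bytes transferred divided by elapsed time, in GB/s."""
    total_bytes = msg_size_bytes * batch_size
    return total_bytes / (lat_us * 1e-6) / 1e9

# Values from the 1048576-byte row: Avg Lat 2777.76 us, Min Lat 2770.90 us.
print(f"{effective_bw_gbps(1048576, 128, 2777.76):.2f} GB/s")  # 48.32 (Avg BW)
print(f"{effective_bw_gbps(1048576, 128, 2770.90):.2f} GB/s")  # 48.44 (Max BW)
```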
Hardware Support Matrix
GPU
| GPU | MORI-EP | MORI-IO | MORI-SHMEM |
|---|---|---|---|
| MI308X | ✅ | ✅ | ✅ |
| MI300X | ✅ | ✅ | ✅ |
| MI325X | ✅ | ✅ | ✅ |
| MI355X | ✅ | ✅ | ✅ |
| MI450X | 🚧 | 🚧 | 🚧 |
NIC
| NIC | MORI-EP | MORI-IO | MORI-SHMEM |
|---|---|---|---|
| Pollara | ✅ | ✅ | ✅ |
| CX7 | ✅ | ✅ | ✅ |
| Thor2 | ✅ | ✅ | ✅ |
| Volcano | 🚧 | 🚧 | 🚧 |
✅ Supported 🚧 Under Development
Installation
Prerequisites
- ROCm >= 6.4 (hipcc needed at runtime for JIT kernel compilation, not at install time)
- System packages:
  - libpci-dev (see docker/Dockerfile.dev)
  - Optional: libopenmpi-dev, openmpi-bin (only needed when building C++ examples with BUILD_EXAMPLES=ON or enabling MPI bootstrap with MORI_WITH_MPI=ON)
Or build docker image with:
cd mori && docker build -t rocm/mori:dev -f docker/Dockerfile.dev .
IBGDA NIC support (optional, for GPU-direct RDMA — auto-detected, no manual configuration needed):
| NIC | User library | Headers |
|---|---|---|
| AMD Pollara (AINIC) | libionic.so | — |
| Mellanox ConnectX | libmlx5.so (typically pre-installed) | — |
| Broadcom Thor2 | libbnxt_re.so | bnxt_re_dv.h, bnxt_re_hsi.h |
Note: IBGDA requires vendor-specific DV (Direct Verbs) libraries. Mellanox libmlx5 is typically pre-installed with the kernel OFED stack. For Thor2 and Pollara, install the corresponding userspace library and headers from your NIC vendor.
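As a quick sanity check before enabling IBGDA, you can ask the dynamic linker which of these vendor libraries it can resolve. The helper below is illustrative, not part of MORI; the library names come from the table above.

```python
# Probe which vendor DV userspace libraries are visible to the dynamic linker.
# Illustrative helper (not a MORI API); library names come from the table above.
from ctypes.util import find_library

def probe_ibgda_libs():
    """Return {NIC name: resolved library name/path, or None if not found}."""
    candidates = {
        "AMD Pollara (AINIC)": "ionic",   # libionic.so
        "Mellanox ConnectX": "mlx5",      # libmlx5.so
        "Broadcom Thor2": "bnxt_re",      # libbnxt_re.so
    }
    return {nic: find_library(lib) for nic, lib in candidates.items()}

if __name__ == "__main__":
    for nic, lib in probe_ibgda_libs().items():
        print(f"{nic}: {lib or 'not found'}")
```

A `None` result only means the linker cannot resolve the library on this host; it does not by itself indicate a misconfigured NIC.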
Install
# NOTE: for venv build, add --no-build-isolation at the end
cd mori && pip install .
That's it. No hipcc needed at install time — host code compiles with a standard
C++ compiler. GPU kernels are JIT-compiled on first use and cached to
~/.mori/jit/. If a GPU is detected during install, kernel precompilation
starts automatically in the background.
To manually precompile all kernels (e.g. in a Docker image build):
MORI_PRECOMPILE=1 python -c "import mori"
Verify installation
python -c "import mori; print('OK')"
Testing
Test MORI-EP (dispatch / combine)
cd /path/to/mori
export PYTHONPATH=/path/to/mori:$PYTHONPATH
# Test correctness (8 GPUs)
pytest tests/python/ops/test_dispatch_combine.py -q
# Benchmark performance
python tests/python/ops/bench_dispatch_combine.py
Test MORI-IO
cd /path/to/mori
export PYTHONPATH=/path/to/mori:$PYTHONPATH
# Correctness tests
pytest tests/python/io/
# Benchmark performance (two nodes)
export GLOO_SOCKET_IFNAME=ens14np0
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 --master_addr="10.194.129.65" --master_port=1234 \
tests/python/io/benchmark.py --host="10.194.129.65" --enable-batch-transfer --enable-sess --buffer-size 32768 --transfer-batch-size 128
Test MORI-IR (Triton + shmem integration, guide)
# Basic shmem put (2 GPUs)
torchrun --nproc_per_node=2 examples/shmem/ir/test_triton_shmem.py
# Allreduce (8 GPUs)
torchrun --nproc_per_node=8 examples/shmem/ir/test_triton_allreduce.py
Contribution Guide
Welcome to MORI! We appreciate your interest in contributing. Whether you're fixing bugs, adding features, improving documentation, or sharing feedback, your contributions help make MORI better for everyone.
Code Quality
MORI uses pre-commit hooks to maintain code quality. After cloning the repository:
pip install pre-commit
cd /path/to/mori
pre-commit install
# Run on all files (first time)
pre-commit run --all-files
Pre-commit automatically runs formatting, linting, license-header, and other quality checks on each commit. To skip checks when necessary: git commit --no-verify
File details
Details for the file amd_mori-1.0.0-cp312-cp312-manylinux2014_x86_64.whl.
File metadata
- Download URL: amd_mori-1.0.0-cp312-cp312-manylinux2014_x86_64.whl
- Upload date:
- Size: 2.7 MB
- Tags: CPython 3.12
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 52a4b3e923282d8842ccb5ea4514d493dcefdcef723f9532e522d9e7b3a2ec1c |
| MD5 | 5bc78710f4590105920ada55f9f34fc0 |
| BLAKE2b-256 | b1a51024577d0e66effbb2cba602a0ebaf83c1f92a10bc37ad786d5f01e8e491 |