Modular RDMA Interface — GPU communication library for P2P, RDMA/IBGDA, and SDMA

MORI

News

[2026/05] 🔥 MORI becomes the primary EP communication library for AMD platforms in Alibaba RTP-LLM (MORI-EP PR).
[2026/05] MORI's SDMA-based AllGather collective is integrated into DeepSpeed for ZeRO-3 optimization on AMD GPUs, delivering up to 10% end-to-end training speedup by offloading AllGather traffic to dedicated SDMA copy engines (example, post).
[2026/04] 🔥 Tencent OpenUCL adopts the MORI ecosystem, using MORI's EP-style dispatch/combine pattern in AMD GPU deployments and leveraging MORI-SHMEM for GPU-initiated communication.
[2026/03] 🔥 MORI-SHMEM powers ByteDance Triton-distributed EP dispatch/combine kernels as the backend, delivering seamless integration and high performance on AMD GPUs (EP Kernels, MORI-SHMEM Integration).
[2026/02] 🔥 MORI powers AMD's WideEP and PD disaggregation in SemiAnalysis InferenceX v2 benchmark (PR, InferenceX, blog).
[2026/01] 🔥 MORI-EP and MORI-IO integrated into SGLang and vLLM for MoE Expert Parallelism and PD Disaggregation on AMD GPUs (sglang & MORI-EP, sglang & MORI-IO, vllm & MORI-EP, vllm & MORI-IO).
[2025/12] MORI adds support for AMD's AINIC (Pollara) with SOTA performance (AINIC & MORI-EP, AINIC & MORI-IO).
[2025/09] MORI-EP now seamlessly scales to 64 GPUs with SOTA performance (multiple optimizations, multi-QP support, low-latency kernel).
[2025/09] MORI adds Broadcom BNXT (Thor2) IBGDA support (PR).

Introduction

MORI (Modular RDMA Interface) is a bottom-up, modular, and composable framework for building high-performance communication applications with a strong focus on RDMA + GPU integration. Inspired by the role of MLIR in compiler infrastructure, MORI provides reusable and extensible building blocks that make it easier for developers to adopt advanced techniques such as IBGDA (InfiniBand GPUDirect Async) and GDS (GPUDirect Storage).

To help developers get started quickly, MORI also includes a suite of optimized libraries: MORI-EP (MoE dispatch & combine kernels), MORI-IO (P2P communication for KVCache transfer), and MORI-CCL (collective communication). These libraries deliver out-of-the-box performance, with support for AMD Pensando DSC, Broadcom Thor2, and NVIDIA Mellanox ConnectX-7 NICs.

Features summary

Applications
- MORI-EP: intra-node and inter-node dispatch/combine kernels with SOTA performance.
- MORI-IO: point-to-point communication library with ultra-low overhead.
- MORI-CCL: lightweight and flexible collective communication library designed for highly customized use cases, such as latency-sensitive or resource-constrained environments.
- MORI-UMBP: unified memory & bandwidth pool with tiered storage and distributed key-value access for scalable memory management.
Framework
- High-performance building blocks for IBGDA, P2P, and more.
- Modular and composable components for developing communication applications, such as transport management and topology detection.
- OpenSHMEM-style APIs.
- C++ and Python APIs.

Documentation

Topic	Description	Guide
MORI-EP	Dispatch/combine API, kernel types, configuration, usage examples	EP Guide
MORI-SHMEM	Symmetric memory APIs, initialization, memory management	Shmem Guide
MORI-IR	Device bitcode integration for Triton and other GPU kernel frameworks	IR Guide
MORI-IO	P2P communication concepts, engine/backend/session design	IO Guide
MORI-VIZ	Warp-level kernel profiler with Perfetto integration	Profiler

Benchmarks

MORI-EP

Benchmark on DeepSeek V3 model configurations:

Bandwidth (4096 tokens, 7168 hidden, top-8 experts, FP8 dispatch + BF16 combine)

Hardware	Kernels	Dispatch XGMI	Dispatch RDMA	Combine XGMI	Combine RDMA
MI300X + CX7	EP8	307 GB/s	x	330 GB/s	x
	EP16-V1	171 GB/s	52 GB/s	219 GB/s	67 GB/s
	EP32-V1	103 GB/s*	57 GB/s*	91 GB/s*	50 GB/s*
MI355X + AINIC	EP8	345 GB/s	x	420 GB/s	x
	EP16-V1	179 GB/s	54 GB/s	234 GB/s	71 GB/s
	EP32-V1	85 GB/s	46 GB/s	110 GB/s	61 GB/s

Latency (128 tokens, 7168 hidden, top-8 experts, FP8 dispatch + BF16 combine)

Hardware	Kernels	Dispatch Latency	Dispatch BW	Combine Latency	Combine BW
MI300X + CX7	EP8	35 us	134 GB/s	47 us	204 GB/s
	EP16-V1-LL	76 us	96 GB/s	122 us	121 GB/s
	EP32-V1-LL	157 us*	48 GB/s*	280 us*	55 GB/s*
MI355X + AINIC	EP8	31 us	142 GB/s	36 us	276 GB/s
	EP16-V1-LL	84 us	87 GB/s	108 us	139 GB/s
	EP32-V1-LL	152 us	45 GB/s	187 us	76 GB/s

* Stale data from previous kernel version; updated numbers pending re-benchmarking.

MORI-IO

NOTE: These are preview MORI-IO benchmark results.

GPU Direct RDMA READ, pairwise, 128 consecutive transfers, 1 GPU, MI300X + Thor2:

+--------------------------------------------------------------------------------------------------------+
|                                            Initiator Rank 0                                            |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
| MsgSize (B) | BatchSize | TotalSize (MB) | Max BW (GB/s) | Avg Bw (GB/s) | Min Lat (us) | Avg Lat (us) |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
|      8      |    128    |      0.00      |      0.03     |      0.03     |    33.38     |    36.33     |
|      16     |    128    |      0.00      |      0.06     |      0.06     |    34.09     |    36.35     |
|      32     |    128    |      0.00      |      0.12     |      0.11     |    34.57     |    36.33     |
|      64     |    128    |      0.01      |      0.24     |      0.23     |    33.62     |    36.33     |
|     128     |    128    |      0.02      |      0.49     |      0.45     |    33.62     |    36.49     |
|     256     |    128    |      0.03      |      0.94     |      0.89     |    34.81     |    36.99     |
|     512     |    128    |      0.07      |      1.86     |      1.77     |    35.29     |    37.01     |
|     1024    |    128    |      0.13      |      3.84     |      3.53     |    34.09     |    37.09     |
|     2048    |    128    |      0.26      |      7.33     |      6.96     |    35.76     |    37.65     |
|     4096    |    128    |      0.52      |     12.94     |     12.46     |    40.53     |    42.09     |
|     8192    |    128    |      1.05      |     20.75     |     20.12     |    50.54     |    52.11     |
|    16384    |    128    |      2.10      |     29.03     |     28.33     |    72.24     |    74.02     |
|    32768    |    128    |      4.19      |     36.50     |     35.91     |    114.92    |    116.81    |
|    65536    |    128    |      8.39      |     41.74     |     41.39     |    200.99    |    202.70    |
|    131072   |    128    |     16.78      |     45.14     |     44.85     |    371.69    |    374.10    |
|    262144   |    128    |     33.55      |     46.93     |     46.76     |    715.02    |    717.56    |
|    524288   |    128    |     67.11      |     47.94     |     47.81     |   1399.99    |   1403.64    |
|   1048576   |    128    |     134.22     |     48.44     |     48.32     |   2770.90    |   2777.76    |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
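
The bandwidth columns in the table above appear to follow directly from message size, batch size, and latency. The sketch below reproduces a few rows under the assumption that MB means 10^6 bytes and GB means 10^9 bytes (this matches the printed figures, but it is not confirmed by the benchmark source). The function names are hypothetical and are not part of MORI.

```python
def total_bytes(msg_size_b: int, batch_size: int) -> int:
    """Bytes moved by one batch of equally sized transfers."""
    return msg_size_b * batch_size


def bandwidth_gb_per_s(msg_size_b: int, batch_size: int, latency_us: float) -> float:
    """Bandwidth in GB/s (decimal) for one batch completed in latency_us."""
    # bytes per microsecond equals MB/s * 1e0; dividing by 1e3 gives GB/s.
    return total_bytes(msg_size_b, batch_size) / latency_us / 1e3


if __name__ == "__main__":
    # Last row of the table: 1 MiB messages, batch of 128.
    msg, batch = 1048576, 128
    print(f"TotalSize = {total_bytes(msg, batch) / 1e6:.2f} MB")
    print(f"Avg BW    = {bandwidth_gb_per_s(msg, batch, 2777.76):.2f} GB/s")  # uses Avg Lat
    print(f"Max BW    = {bandwidth_gb_per_s(msg, batch, 2770.90):.2f} GB/s")  # uses Min Lat
```

Read this way, Max BW pairs with Min Lat and Avg BW pairs with Avg Lat, which also shows why small messages are latency-bound: at 8 B the batch completes in about 36 us regardless of size.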

Hardware Support Matrix

GPU

	MORI-EP	MORI-IO	MORI-SHMEM
MI308X	✅	✅	✅
MI300X	✅	✅	✅
MI325X	✅	✅	✅
MI355X	✅	✅	✅
MI450X	🚧	🚧	🚧

NIC

	MORI-EP	MORI-IO	MORI-SHMEM
Pollara	✅	✅	✅
CX7	✅	✅	✅
Thor2	✅	✅	✅
Volcano	🚧	🚧	🚧

✅ Supported 🚧 Under Development

Installation

Prerequisites

ROCm >= 6.4 (hipcc needed at runtime for JIT kernel compilation, not at install time)
System packages: libpci-dev (see Dockerfile.dev)
Optional: libopenmpi-dev, openmpi-bin — only needed when building C++ examples (BUILD_EXAMPLES=ON) or enabling MPI bootstrap (MORI_WITH_MPI=ON)

Or build a Docker image with:

cd mori && docker build -t rocm/mori:dev -f docker/Dockerfile.dev .

IBGDA NIC support (optional, for GPU-direct RDMA — auto-detected, no manual configuration needed):

NIC	User library
AMD Pollara (AINIC)	`libionic.so`
Mellanox ConnectX	`libmlx5.so` (typically pre-installed)
Broadcom Thor2	`libbnxt_re.so`

Note: IBGDA requires vendor-specific DV (Direct Verbs) libraries. Mellanox libmlx5 is typically pre-installed with the kernel OFED stack. For Thor2 and Pollara, install the corresponding userspace library from your NIC vendor.
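
MORI detects the NIC on its own, so the following is only a diagnostic sketch for checking which of the vendor libraries listed above the system linker can resolve. It relies on `ctypes.util.find_library` from the Python standard library; the helper name `find_ibgda_libs` is hypothetical. Note that `find_library` searches standard linker locations only, so a library installed in a non-standard path may be reported as missing even though MORI could still load it.

```python
from ctypes.util import find_library

# Library stems taken from the table above
# (libionic.so, libmlx5.so, libbnxt_re.so).
IBGDA_LIBS = {
    "AMD Pollara (AINIC)": "ionic",
    "Mellanox ConnectX": "mlx5",
    "Broadcom Thor2": "bnxt_re",
}


def find_ibgda_libs(finder=find_library):
    """Map each NIC to the resolved library name, or None if not found."""
    return {nic: finder(stem) for nic, stem in IBGDA_LIBS.items()}


if __name__ == "__main__":
    for nic, lib in find_ibgda_libs().items():
        print(f"{nic:22s} {lib or 'not found'}")
```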

Install

MORI can be installed in three ways: from PyPI (stable), nightly pre-built wheels (latest dev), or from source.

From PyPI (stable release)

pip install amd_mori

Nightly (pre-built, tested daily)

# From PyPI
pip install --pre amd-mori-nightly

# Or from GitHub Pages
pip install --no-index --force-reinstall --find-links https://rocm.github.io/mori/nightly/latest/ amd_mori

Browse all nightly builds: https://rocm.github.io/mori/nightly/

Note: amd-mori and amd-mori-nightly both provide the mori Python module. Do not install both at the same time — uninstall one before installing the other.
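
A quick way to check for the conflict described in the note is to ask the standard library's `importlib.metadata` which of the two distributions are installed. This is a minimal sketch; the helper name `installed_versions` is hypothetical, and it assumes the installed distribution names normalize to `amd-mori` and `amd-mori-nightly`.

```python
from importlib import metadata

# Both distributions provide the same `mori` Python module.
CONFLICTING = ("amd-mori", "amd-mori-nightly")


def installed_versions(names=CONFLICTING, version_of=metadata.version):
    """Return {distribution name: version} for each name that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = version_of(name)
        except metadata.PackageNotFoundError:
            pass  # not installed; skip it
    return found


if __name__ == "__main__":
    found = installed_versions()
    for name, ver in found.items():
        print(f"{name} {ver}")
    if len(found) > 1:
        print("Both packages are installed; uninstall one of them.")
```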

From source

# NOTE: for venv build, add --no-build-isolation at the end
cd mori && pip install .

No hipcc needed at install time — host code compiles with a standard C++ compiler. GPU kernels are JIT-compiled on first use and cached to ~/.mori/jit/. If a GPU is detected during install, kernel precompilation starts automatically in the background.

To manually precompile all kernels (e.g. in a Docker image build):

MORI_PRECOMPILE=1 python -c "import mori"
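
Because kernels are JIT-compiled at runtime, it can help to confirm that `hipcc` is reachable and to see whether the cache directory has been created yet. The sketch below is a hypothetical preflight helper, not part of MORI. It assumes `hipcc` is discoverable through `PATH` (MORI may locate the compiler differently, for example under the ROCm install prefix) and uses the cache location `~/.mori/jit/` stated above.

```python
import shutil
from pathlib import Path


def jit_preflight(which=shutil.which, cache_dir=Path.home() / ".mori" / "jit"):
    """Report whether hipcc is on PATH and whether the JIT cache directory exists."""
    return {
        "hipcc": which("hipcc"),            # full path, or None if not on PATH
        "cache_dir": str(cache_dir),
        "cache_exists": Path(cache_dir).is_dir(),
    }


if __name__ == "__main__":
    report = jit_preflight()
    print("hipcc:    ", report["hipcc"] or "not found on PATH")
    print("JIT cache:", report["cache_dir"],
          "(present)" if report["cache_exists"] else "(not created yet)")
```

A missing cache directory is not an error on its own; it may simply mean that no kernel has been compiled yet.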

Verify installation

python -c "import mori; print(mori.__version__)"

Testing

Test MORI-EP (dispatch / combine)

cd /path/to/mori
export PYTHONPATH=/path/to/mori:$PYTHONPATH
python -c "import mori; print(mori.__file__)"

# Test correctness (8 GPUs)
pytest tests/python/ops/test_dispatch_combine_intranode.py -q
pytest tests/python/ops/test_dispatch_combine_async_ll.py -q
pytest tests/python/ops/test_dispatch_combine_internode_v1.py -q

# Benchmark performance
python tests/python/ops/bench_dispatch_combine.py

Test MORI-IO

cd /path/to/mori
export PYTHONPATH=/path/to/mori:$PYTHONPATH

# Correctness tests
pytest tests/python/io/

# Benchmark performance (two nodes)
export GLOO_SOCKET_IFNAME=ens14np0
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 --master_addr="10.194.129.65" --master_port=1234 \
  tests/python/io/benchmark.py --host="10.194.129.65" --enable-batch-transfer --enable-sess --buffer-size 32768 --transfer-batch-size 128

Test MORI-IR (Triton + shmem integration, guide)

# Basic shmem put (2 GPUs)
torchrun --nproc_per_node=2 examples/shmem/ir/test_triton_shmem.py

# Allreduce (8 GPUs)
torchrun --nproc_per_node=8 examples/shmem/ir/test_triton_allreduce.py

Contribution Guide

Welcome to MORI! We appreciate your interest in contributing. Whether you're fixing bugs, adding features, improving documentation, or sharing feedback, your contributions help make MORI better for everyone.

Code Quality

MORI uses pre-commit hooks to maintain code quality. After cloning the repository:

pip install pre-commit
cd /path/to/mori
pre-commit install

# Run on all files (first time)
pre-commit run --all-files

On each commit, pre-commit automatically runs code formatting, linting, license-header, and other quality checks. To skip the checks when necessary, use: git commit --no-verify

Release history

1.2.3.dev20260731 pre-release

Jul 31, 2026

1.2.3.dev20260730 pre-release

Jul 30, 2026

1.2.3.dev20260729 pre-release

Jul 30, 2026

1.2.3.dev20260727 pre-release

Jul 28, 2026

1.2.3.dev20260726 pre-release

Jul 26, 2026

1.2.3.dev20260725 pre-release

Jul 25, 2026

1.2.3.dev20260724 pre-release

Jul 24, 2026

1.2.3.dev20260723 pre-release

Jul 23, 2026

1.2.3.dev20260722 pre-release

Jul 22, 2026

1.2.3.dev20260721 pre-release

Jul 21, 2026

1.2.3.dev20260719 pre-release

Jul 19, 2026

1.2.3.dev20260718 pre-release

Jul 18, 2026

1.2.3.dev20260717 pre-release

Jul 17, 2026

1.2.3.dev20260716 pre-release

Jul 17, 2026

1.2.2.dev20260715 pre-release

Jul 16, 2026

1.2.2.dev20260714 pre-release

Jul 14, 2026

1.2.2.dev20260713 pre-release

Jul 13, 2026

1.2.2.dev20260712 pre-release

Jul 13, 2026

1.2.2.dev20260711 pre-release

Jul 11, 2026

1.2.2.dev20260710 pre-release

Jul 10, 2026

1.2.2.dev20260709 pre-release

Jul 10, 2026

1.2.2.dev20260707 pre-release

Jul 7, 2026

1.2.2.dev20260706 pre-release

Jul 6, 2026

1.2.2.dev20260705 pre-release

Jul 6, 2026

1.2.2.dev20260703 pre-release

Jul 3, 2026

1.2.2.dev20260701 pre-release

Jul 1, 2026

1.2.2.dev20260630 pre-release

Jun 30, 2026

1.2.2.dev20260629 pre-release

Jun 29, 2026

1.2.2.dev20260628 pre-release

Jun 29, 2026

1.2.2.dev20260626 pre-release

Jun 26, 2026

1.2.1.dev20260625 pre-release

Jun 25, 2026

1.2.1.dev20260624 pre-release

Jun 25, 2026

1.2.1.dev20260620 pre-release

Jun 22, 2026

1.2.1.dev20260619 pre-release

Jun 19, 2026

1.2.1.dev20260617 pre-release

Jun 17, 2026

1.2.1.dev20260616 pre-release

Jun 16, 2026

1.2.1.dev20260615 pre-release

Jun 16, 2026

1.2.1.dev20260614 pre-release

Jun 14, 2026

1.2.1.dev20260613 pre-release

Jun 13, 2026

1.2.1.dev20260610 pre-release

Jun 10, 2026

1.2.1.dev20260609 pre-release

Jun 9, 2026

1.2.1.dev20260608 pre-release

Jun 8, 2026

1.1.2.dev20260608 pre-release

Jun 8, 2026

1.1.2.dev20260607 pre-release

Jun 8, 2026

1.1.2.dev20260605 pre-release

Jun 5, 2026

1.1.2.dev20260604 pre-release

Jun 4, 2026

1.1.2.dev20260603 pre-release

Jun 4, 2026

1.1.2.dev20260602 pre-release

Jun 2, 2026

1.1.2.dev20260531 pre-release

May 31, 2026

1.1.2.dev20260530 pre-release

Jun 1, 2026

1.1.2.dev20260529 pre-release

May 29, 2026

1.1.2.dev20260528 pre-release

May 28, 2026

1.1.2.dev20260524 pre-release

May 25, 2026

This version

1.1.2.dev20260521 pre-release

May 21, 2026

1.1.1.dev20260521 pre-release

May 21, 2026

1.1.1.dev20260520 pre-release

May 20, 2026

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

amd_mori_nightly-1.1.2.dev20260521-cp312-cp312-manylinux_2_39_x86_64.whl (3.5 MB)

Uploaded May 21, 2026; CPython 3.12; manylinux: glibc 2.39+ x86-64

amd_mori_nightly-1.1.2.dev20260521-cp310-cp310-manylinux_2_35_x86_64.whl (3.5 MB)

Uploaded May 21, 2026; CPython 3.10; manylinux: glibc 2.35+ x86-64

File details

Details for the file amd_mori_nightly-1.1.2.dev20260521-cp312-cp312-manylinux_2_39_x86_64.whl.

File metadata

Download URL: amd_mori_nightly-1.1.2.dev20260521-cp312-cp312-manylinux_2_39_x86_64.whl
Upload date: May 21, 2026
Size: 3.5 MB
Tags: CPython 3.12, manylinux: glibc 2.39+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for amd_mori_nightly-1.1.2.dev20260521-cp312-cp312-manylinux_2_39_x86_64.whl
Algorithm	Hash digest
SHA256	`67c2d875df2c837328d8ca1daaa8d5e0593ff80d3bc9426062df9232027f298d`
MD5	`e0f5474fd3ae39eb7601f721c88a53a6`
BLAKE2b-256	`a1383321d34e3cd9f47bba86ccda80b8f53f461ed62e77a7fcda3a32ee49d810`

File details

Details for the file amd_mori_nightly-1.1.2.dev20260521-cp310-cp310-manylinux_2_35_x86_64.whl.

File metadata

Download URL: amd_mori_nightly-1.1.2.dev20260521-cp310-cp310-manylinux_2_35_x86_64.whl
Upload date: May 21, 2026
Size: 3.5 MB
Tags: CPython 3.10, manylinux: glibc 2.35+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for amd_mori_nightly-1.1.2.dev20260521-cp310-cp310-manylinux_2_35_x86_64.whl
Algorithm	Hash digest
SHA256	`b5b09cedd72529acb1da9de460a426bb0258b30aa2b4bcaac7b081a29c3500de`
MD5	`716b17da8f088c85b05563e90fcf298e`
BLAKE2b-256	`4491441966782447feb9d63e515438be9d7415bef4efe17b2fd8b62409f62b75`

amd-mori-nightly 1.1.2.dev20260521
