
Modular RDMA Interface — GPU communication library for P2P, RDMA/IBGDA, and SDMA


MORI

News

Introduction

MORI (Modular RDMA Interface) is a bottom-up, modular, and composable framework for building high-performance communication applications with a strong focus on RDMA + GPU integration. Inspired by the role of MLIR in compiler infrastructure, MORI provides reusable and extensible building blocks that make it easier for developers to adopt advanced techniques such as IBGDA (InfiniBand GPUDirect Async) and GDS (GPUDirect Storage).

To help developers get started quickly, MORI also includes a suite of optimized libraries—MORI-EP (MoE dispatch & combine kernels), MORI-IO (p2p communication for KVCache transfer), and MORI-CCL (collective communication)—that deliver out-of-the-box performance, with support for AMD Pensando DSC, Broadcom Thor2, and NVIDIA Mellanox ConnectX-7 NICs.

Features summary

  • Applications
    • MORI-EP: intra- and inter-node dispatch/combine kernels with state-of-the-art performance.
    • MORI-IO: point-to-point communication library with ultra-low overhead.
    • MORI-CCL: lightweight and flexible collective communication library designed for highly customized use cases, such as latency-sensitive or resource-constrained environments.
    • MORI-UMBP: unified memory & bandwidth pool with tiered storage and distributed key-value access for scalable memory management.
  • Framework
    • High-performance building blocks for IBGDA, P2P, and more.
    • Modular and composable components for developing communication applications, such as transport management and topology detection.
    • OpenSHMEM-style APIs.
    • C++- and Python-level APIs.

Documentation

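+------------+-----------------------------------------------------------------------+-------------+
| Topic      | Description                                                           | Guide       |
+------------+-----------------------------------------------------------------------+-------------+
| MORI-EP    | Dispatch/combine API, kernel types, configuration, usage examples     | EP Guide    |
| MORI-SHMEM | Symmetric memory APIs, initialization, memory management              | Shmem Guide |
| MORI-IR    | Device bitcode integration for Triton and other GPU kernel frameworks | IR Guide    |
| MORI-IO    | P2P communication concepts, engine/backend/session design             | IO Guide    |
| MORI-VIZ   | Warp-level kernel profiler with Perfetto integration                  | Profiler    |
+------------+-----------------------------------------------------------------------+-------------+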

Benchmarks

MORI-EP

Benchmark on DeepSeek V3 model configurations:

Bandwidth (4096 tokens, 7168 hidden, top-8 experts, FP8 dispatch + BF16 combine)

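+----------------+---------+---------------+---------------+--------------+--------------+
| Hardware       | Kernels | Dispatch XGMI | Dispatch RDMA | Combine XGMI | Combine RDMA |
+----------------+---------+---------------+---------------+--------------+--------------+
| MI300X + CX7   | EP8     | 307 GB/s      | x             | 330 GB/s     | x            |
|                | EP16-V1 | 171 GB/s      | 52 GB/s       | 219 GB/s     | 67 GB/s      |
|                | EP32-V1 | 103 GB/s*     | 57 GB/s*      | 91 GB/s*     | 50 GB/s*     |
| MI355X + AINIC | EP8     | 345 GB/s      | x             | 420 GB/s     | x            |
|                | EP16-V1 | 179 GB/s      | 54 GB/s       | 234 GB/s     | 71 GB/s      |
|                | EP32-V1 | 85 GB/s       | 46 GB/s       | 110 GB/s     | 61 GB/s      |
+----------------+---------+---------------+---------------+--------------+--------------+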

Latency (128 tokens, 7168 hidden, top-8 experts, FP8 dispatch + BF16 combine)

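+----------------+------------+------------------+-------------+-----------------+------------+
| Hardware       | Kernels    | Dispatch Latency | Dispatch BW | Combine Latency | Combine BW |
+----------------+------------+------------------+-------------+-----------------+------------+
| MI300X + CX7   | EP8        | 35 us            | 134 GB/s    | 47 us           | 204 GB/s   |
|                | EP16-V1-LL | 76 us            | 96 GB/s     | 122 us          | 121 GB/s   |
|                | EP32-V1-LL | 157 us*          | 48 GB/s*    | 280 us*         | 55 GB/s*   |
| MI355X + AINIC | EP8        | 31 us            | 142 GB/s    | 36 us           | 276 GB/s   |
|                | EP16-V1-LL | 84 us            | 87 GB/s     | 108 us          | 139 GB/s   |
|                | EP32-V1-LL | 152 us           | 45 GB/s     | 187 us          | 76 GB/s    |
+----------------+------------+------------------+-------------+-----------------+------------+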

* Stale data from previous kernel version; updated numbers pending re-benchmarking.

MORI-IO

NOTE: This is the preview version of MORI-IO benchmark performance.

GPU Direct RDMA READ, pairwise, 128 consecutive transfers, 1 GPU, MI300X + Thor2:

+--------------------------------------------------------------------------------------------------------+
|                                            Initiator Rank 0                                            |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
| MsgSize (B) | BatchSize | TotalSize (MB) | Max BW (GB/s) | Avg Bw (GB/s) | Min Lat (us) | Avg Lat (us) |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
|      8      |    128    |      0.00      |      0.03     |      0.03     |    33.38     |    36.33     |
|      16     |    128    |      0.00      |      0.06     |      0.06     |    34.09     |    36.35     |
|      32     |    128    |      0.00      |      0.12     |      0.11     |    34.57     |    36.33     |
|      64     |    128    |      0.01      |      0.24     |      0.23     |    33.62     |    36.33     |
|     128     |    128    |      0.02      |      0.49     |      0.45     |    33.62     |    36.49     |
|     256     |    128    |      0.03      |      0.94     |      0.89     |    34.81     |    36.99     |
|     512     |    128    |      0.07      |      1.86     |      1.77     |    35.29     |    37.01     |
|     1024    |    128    |      0.13      |      3.84     |      3.53     |    34.09     |    37.09     |
|     2048    |    128    |      0.26      |      7.33     |      6.96     |    35.76     |    37.65     |
|     4096    |    128    |      0.52      |     12.94     |     12.46     |    40.53     |    42.09     |
|     8192    |    128    |      1.05      |     20.75     |     20.12     |    50.54     |    52.11     |
|    16384    |    128    |      2.10      |     29.03     |     28.33     |    72.24     |    74.02     |
|    32768    |    128    |      4.19      |     36.50     |     35.91     |    114.92    |    116.81    |
|    65536    |    128    |      8.39      |     41.74     |     41.39     |    200.99    |    202.70    |
|    131072   |    128    |     16.78      |     45.14     |     44.85     |    371.69    |    374.10    |
|    262144   |    128    |     33.55      |     46.93     |     46.76     |    715.02    |    717.56    |
|    524288   |    128    |     67.11      |     47.94     |     47.81     |   1399.99    |   1403.64    |
|   1048576   |    128    |     134.22     |     48.44     |     48.32     |   2770.90    |   2777.76    |
+-------------+-----------+----------------+---------------+---------------+--------------+--------------+
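The columns of the table above appear to be related by simple arithmetic: TotalSize is MsgSize times BatchSize, and each bandwidth figure is that total divided by the matching latency. This relationship is inferred from the printed numbers, not taken from the benchmark source, and it assumes decimal units (1 MB = 1e6 bytes, 1 GB = 1e9 bytes). The helper name `transfer_stats` is hypothetical. A minimal sketch:

```python
def transfer_stats(msg_size_b, batch_size, lat_us):
    """Return (total_size_mb, bandwidth_gb_per_s) for one batched transfer.

    Assumes decimal units, which is what the table values are consistent with.
    """
    total_bytes = msg_size_b * batch_size
    total_mb = total_bytes / 1e6
    # bytes per microsecond equals MB/s; divide by 1e3 to get GB/s
    bw_gb_per_s = total_bytes / lat_us / 1e3
    return total_mb, bw_gb_per_s


# Last row of the table: 1048576-byte messages, batch of 128.
total_mb, avg_bw = transfer_stats(1048576, 128, lat_us=2777.76)  # Avg Lat
_, max_bw = transfer_stats(1048576, 128, lat_us=2770.90)         # Min Lat
print(f"{total_mb:.2f} MB, avg {avg_bw:.2f} GB/s, max {max_bw:.2f} GB/s")
# prints: 134.22 MB, avg 48.32 GB/s, max 48.44 GB/s
```

Read this way, Max BW pairs with Min Lat and Avg BW pairs with Avg Lat, which is why bandwidth keeps rising with message size while per-batch latency grows.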

Hardware Support Matrix

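GPU

+--------+---------+---------+------------+
|        | MORI-EP | MORI-IO | MORI-SHMEM |
+--------+---------+---------+------------+
| MI308X |         |         |            |
| MI300X |         |         |            |
| MI325X |         |         |            |
| MI355X |         |         |            |
| MI450X | 🚧      | 🚧      | 🚧         |
+--------+---------+---------+------------+

NIC

+---------+---------+---------+------------+
|         | MORI-EP | MORI-IO | MORI-SHMEM |
+---------+---------+---------+------------+
| Pollara |         |         |            |
| CX7     |         |         |            |
| Thor2   |         |         |            |
| Volcano | 🚧      | 🚧      | 🚧         |
+---------+---------+---------+------------+

✅ Supported   🚧 Under Development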

Installation

Prerequisites

  • ROCm >= 6.4 (hipcc needed at runtime for JIT kernel compilation, not at install time)
  • System packages: libpci-dev (see Dockerfile.dev)
  • Optional: libopenmpi-dev, openmpi-bin — only needed when building C++ examples (BUILD_EXAMPLES=ON) or enabling MPI bootstrap (MORI_WITH_MPI=ON)

Alternatively, build the Docker image with:

cd mori && docker build -t rocm/mori:dev -f docker/Dockerfile.dev .
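Before the first run, it can help to confirm that the tools named in the prerequisites are on PATH. The sketch below only looks up executables. It does not check the ROCm version, and the helper name `find_tools` is hypothetical. `mpirun` is included on the assumption that it is the launcher installed by openmpi-bin, and it matters only for the optional MPI paths.

```python
import shutil


def find_tools(names):
    """Map each tool name to its resolved path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # hipcc is needed at runtime for JIT compilation; mpirun is optional.
    for tool, path in find_tools(["hipcc", "mpirun"]).items():
        print(f"{tool}: {path or 'not found'}")
```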

IBGDA NIC support (optional, for GPU-direct RDMA — auto-detected, no manual configuration needed):

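+---------------------+--------------------------------------+
| NIC                 | User library                         |
+---------------------+--------------------------------------+
| AMD Pollara (AINIC) | libionic.so                          |
| Mellanox ConnectX   | libmlx5.so (typically pre-installed) |
| Broadcom Thor2      | libbnxt_re.so                        |
+---------------------+--------------------------------------+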

Note: IBGDA requires vendor-specific DV (Direct Verbs) libraries. Mellanox libmlx5 is typically pre-installed with the kernel OFED stack. For Thor2 and Pollara, install the corresponding userspace library from your NIC vendor.
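To see which of the vendor libraries listed above the system linker can locate, a quick probe with the Python standard library is one option. This is a sketch only: MORI's own auto-detection may work differently, and a `None` result means only that `find_library` did not find the library in the standard search locations. The names `IBGDA_LIBS` and `probe_ibgda_libs` are hypothetical.

```python
from ctypes.util import find_library

# Library names from the NIC list above, without the "lib" prefix and ".so" suffix.
IBGDA_LIBS = {
    "AMD Pollara (AINIC)": "ionic",
    "Mellanox ConnectX": "mlx5",
    "Broadcom Thor2": "bnxt_re",
}


def probe_ibgda_libs(libs=IBGDA_LIBS):
    """Return {nic_name: resolved library name or None} for each entry."""
    return {nic: find_library(name) for nic, name in libs.items()}


if __name__ == "__main__":
    for nic, found in probe_ibgda_libs().items():
        print(f"{nic}: {found or 'not found by the linker'}")
```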

Install

MORI can be installed in three ways: from PyPI (stable), from nightly pre-built wheels (latest dev), or from source.

From PyPI (stable release)

pip install amd_mori

Nightly (pre-built, tested daily)

pip install --no-index --force-reinstall --find-links https://rocm.github.io/mori/nightly/latest/ amd_mori

Browse all nightly builds: https://rocm.github.io/mori/nightly/

From source

# NOTE: for venv build, add --no-build-isolation at the end
cd mori && pip install .

No hipcc needed at install time — host code compiles with a standard C++ compiler. GPU kernels are JIT-compiled on first use and cached to ~/.mori/jit/. If a GPU is detected during install, kernel precompilation starts automatically in the background.

To manually precompile all kernels (e.g. in a Docker image build):

MORI_PRECOMPILE=1 python -c "import mori"
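To check whether precompilation has populated the cache, one option is to count the files under ~/.mori/jit/. The sketch below assumes nothing about the cache layout beyond the directory path stated above, and the helper name `cache_summary` is hypothetical.

```python
from pathlib import Path


def cache_summary(cache_dir):
    """Return (file_count, total_bytes) for all regular files under cache_dir."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return 0, 0
    sizes = [p.stat().st_size for p in root.rglob("*") if p.is_file()]
    return len(sizes), sum(sizes)


if __name__ == "__main__":
    count, total = cache_summary("~/.mori/jit")
    print(f"{count} cached files, {total / 1e6:.1f} MB")
```

A count of zero right after installation may simply mean that background precompilation has not finished yet.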

Verify installation

python -c "import mori; print(mori.__version__)"
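If the import fails, it can be useful to ask the packaging metadata whether the distribution is installed at all. The sketch below uses the distribution name from the pip command above (`amd_mori`). A nightly wheel may register under a different distribution name, which is an assumption to verify with `pip list`. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of dist_name, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version("amd_mori") or "amd_mori is not installed")
```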

Testing

Test MORI-EP (dispatch / combine)

cd /path/to/mori
export PYTHONPATH=/path/to/mori:$PYTHONPATH
python -c "import mori; print(mori.__file__)"

# Test correctness (8 GPUs)
pytest tests/python/ops/test_dispatch_combine_intranode.py -q
pytest tests/python/ops/test_dispatch_combine_async_ll.py -q
pytest tests/python/ops/test_dispatch_combine_internode_v1.py -q

# Benchmark performance
python tests/python/ops/bench_dispatch_combine.py

Test MORI-IO

cd /path/to/mori
export PYTHONPATH=/path/to/mori:$PYTHONPATH

# Correctness tests
pytest tests/python/io/

# Benchmark performance (two nodes)
export GLOO_SOCKET_IFNAME=ens14np0
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 --master_addr="10.194.129.65" --master_port=1234 \
  tests/python/io/benchmark.py --host="10.194.129.65" --enable-batch-transfer --enable-sess --buffer-size 32768 --transfer-batch-size 128
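The command above is written for node rank 0. A small helper can generate the matching command line for each node so that only the rank changes. This is a sketch: the function name `benchmark_command` is hypothetical, all flags are copied from the command above, and the value that `--host` should take on the second node is not stated here, so it is left as an explicit parameter.

```python
import shlex


def benchmark_command(node_rank, master_addr, host, master_port=1234, nnodes=2):
    """Build the torchrun command line for one node of the MORI-IO benchmark."""
    torchrun_args = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        "--nproc_per_node=1",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
    ]
    # Benchmark arguments copied verbatim from the command above.
    script_args = [
        "tests/python/io/benchmark.py",
        f"--host={host}",
        "--enable-batch-transfer",
        "--enable-sess",
        "--buffer-size", "32768",
        "--transfer-batch-size", "128",
    ]
    return shlex.join(torchrun_args + script_args)


if __name__ == "__main__":
    # Prints the rank-0 command shown above (without the line continuation).
    print(benchmark_command(0, "10.194.129.65", host="10.194.129.65"))
```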

Test MORI-IR (Triton + shmem integration; see the IR Guide)

# Basic shmem put (2 GPUs)
torchrun --nproc_per_node=2 examples/shmem/ir/test_triton_shmem.py

# Allreduce (8 GPUs)
torchrun --nproc_per_node=8 examples/shmem/ir/test_triton_allreduce.py

Contribution Guide

Welcome to MORI! We appreciate your interest in contributing. Whether you're fixing bugs, adding features, improving documentation, or sharing feedback, your contributions help make MORI better for everyone.

Code Quality

MORI uses pre-commit hooks to maintain code quality. After cloning the repository:

pip install pre-commit
cd /path/to/mori
pre-commit install

# Run on all files (first time)
pre-commit run --all-files

Pre-commit automatically checks formatting, linting, license headers, and other quality rules on each commit. To skip the checks when necessary, run: git commit --no-verify


