DLSlime Transfer Engine

Development Status
- 3 - Alpha
Intended Audience
License
- OSI Approved :: BSD License
Operating System
- POSIX :: Linux
- Unix
Programming Language
- C++
- Python :: 3
Topic
- System :: Networking
- System :: Systems Administration

Project description

Flexible & Efficient Heterogeneous Transfer Toolkit

Getting Started

DLSlime offers a set of peer-to-peer communication interfaces. For instance, consider the task of batched slice assignment from a remote tensor to a local tensor. You can accomplish this using the following APIs.

Figure: assignment operation.
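To make the idea concrete, here is a minimal, library-independent sketch of what batched slice assignment means semantically. It does not use DLSlime's API: plain byte buffers stand in for the remote and local tensors, and the `SliceAssignment` type and `apply_batch` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SliceAssignment:
    """Hypothetical descriptor for one copy:
    local[target_offset : target_offset + length] =
        remote[source_offset : source_offset + length]
    """
    target_offset: int
    source_offset: int
    length: int


def apply_batch(local: bytearray, remote: bytes,
                batch: Iterable[SliceAssignment]) -> None:
    """Validate every slice first, then copy all of them into `local`."""
    batch = list(batch)
    for a in batch:
        if min(a.target_offset, a.source_offset, a.length) < 0:
            raise ValueError(f"negative offset or length in {a}")
        if a.source_offset + a.length > len(remote):
            raise ValueError(f"source slice out of range in {a}")
        if a.target_offset + a.length > len(local):
            raise ValueError(f"target slice out of range in {a}")
    for a in batch:
        src = remote[a.source_offset:a.source_offset + a.length]
        local[a.target_offset:a.target_offset + a.length] = src


# Stand-ins for a remote tensor (16 bytes) and a local tensor (8 bytes).
remote = bytes(range(16))
local = bytearray(8)

# Two slices copied in one batch.
apply_batch(local, remote, [
    SliceAssignment(target_offset=0, source_offset=4, length=2),
    SliceAssignment(target_offset=6, source_offset=14, length=2),
])
print(list(local))  # [4, 5, 0, 0, 0, 0, 14, 15]
```

In a real transfer the copy would be carried out by the transport (for example an RDMA read) rather than by a Python slice, but the offsets and lengths describe the same operation.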

The following examples demonstrate the DLSlime interfaces.

P2P Communication

RDMA RC Mode

RDMA RC Read (Sync / Async mode)

python example/python/p2p_rdma_rc_read.py

RDMA RC Read (Coroutine mode)

python example/python/p2p_rdma_rc_read_coroutine.py

RDMA RC Write (Sync / Async mode)

python example/python/p2p_rdma_rc_write.py

RDMA RC Write with immediate data (Sync / Async mode)

python example/python/p2p_rdma_rc_write_with_imm_data.py

RDMA RC Send/Recv

python example/python/p2p_rdma_rc_send_recv.py

python example/python/p2p_rdma_rc_send_recv_gdr.py

DLSlime torch backend

torchrun --nproc_per_node=2 example/python/p2p_rdma_rc_send_recv_torch.py

NVLink Mode

torchrun --nproc_per_node=2 p2p_nvlink.py

Huawei Ascend Direct Mode

See: Huawei README

[!Caution] The DLSlime NVSHMEM transfer engine and Huawei Ascend Direct mode are in the experimental stage.

Collective Ops

Intra Node

AllGather

torchrun --nnodes 1 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode intra

Inter Node

AllGather

# Node 0
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode inter
# Node 1
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 1 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode inter

AllGather Gemm Overlapping

# Node 0
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_gemm_overlap.py
# Node 1
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 1 --nproc-per-node 8 --master-port 6007 example/python/all_gather_gemm_overlap.py

[!Note] The intra-node and inter-node examples above enable CUDA Graph by default. Pass --eager-mode to fall back to eager mode.
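The multi-node commands above differ only in `--node-rank`. As a convenience, a small helper can generate the command line for each node. This is a sketch, not part of DLSlime: `torchrun_command` is a hypothetical name, and the default address, port, and process count are simply the values used in the commands above.

```python
import shlex


def torchrun_command(node_rank, script, script_args=(), nnodes=2,
                     master_addr="10.130.8.143", master_port=6007,
                     nproc_per_node=8):
    """Return the torchrun command line for one node as a shell string."""
    argv = [
        "torchrun",
        "--nnodes", str(nnodes),
        "--master-addr", master_addr,
        "--node-rank", str(node_rank),
        "--nproc-per-node", str(nproc_per_node),
        "--master-port", str(master_port),
        script,
        *script_args,
    ]
    # shlex.join quotes arguments only where the shell requires it.
    return shlex.join(argv)


# Print the inter-node AllGather command for each of the two nodes.
for rank in range(2):
    print(f"# Node {rank}")
    print(torchrun_command(rank, "example/python/all_gather_ll.py",
                           ["--mode", "inter"]))
```

Every node must use the same `--master-addr` and `--master-port`; only the rank changes.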

Install

pip install

pip install dlslime==0.0.1.post10

[!Note] The DLSlime pip version is built with default FLAGS (see Build from source for details).

Build from source

Python

git clone https://github.com/deeplink-org/DLSlime.git
cd DLSlime
FLAG=<ON|OFF> pip install -v --no-build-isolation -e .

CPP

git clone https://github.com/deeplink-org/DLSlime.git
mkdir -p DLSlime/build && cd DLSlime/build
cmake -DFLAG=<ON|OFF> ..

Build flags

FLAG can be any of the following:

Flag	Description	Platform	Default
`BUILD_RDMA`	Build RDMA Transfer Engine	Hetero	ON
`BUILD_PYTHON`	Build Python wrapper	Hetero	ON
`BUILD_NVLINK`	Build NVLINK Transfer Engine	GPGPU	OFF
`BUILD_ASCEND_DIRECT`	Build Ascend direct transport	ASCEND	OFF
`BUILD_TORCH_PLUGIN`	Build DLSlime as a torch backend	Hetero	OFF
`BUILD_INTRA_OPS`	Use INTRA Collective OPS	GPGPU	OFF
`BUILD_INTER_OPS`	Use INTER Collective OPS (NVSHMEM)	NVIDIA	OFF

[!Note] Enable USE_MACA when using DLSlime as a torch backend on the Metax platform.
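As an illustration of how the `FLAG=<ON|OFF>` placeholder might be filled in, the sketch below turns a set of flag overrides into `cmake` arguments or an environment prefix for `pip`. The helper names are hypothetical, the defaults are copied from the table above, and passing several flags at once is an assumption rather than something the build instructions state. `USE_MACA` is not listed in the table, so this helper rejects it.

```python
# Flag names and defaults copied from the build-flag table.
DEFAULT_FLAGS = {
    "BUILD_RDMA": "ON",
    "BUILD_PYTHON": "ON",
    "BUILD_NVLINK": "OFF",
    "BUILD_ASCEND_DIRECT": "OFF",
    "BUILD_TORCH_PLUGIN": "OFF",
    "BUILD_INTRA_OPS": "OFF",
    "BUILD_INTER_OPS": "OFF",
}


def changed_flags(overrides):
    """Validate overrides and keep only those that differ from the defaults."""
    unknown = sorted(set(overrides) - set(DEFAULT_FLAGS))
    if unknown:
        raise ValueError(f"unknown build flag(s): {', '.join(unknown)}")
    bad = sorted(k for k, v in overrides.items() if v not in ("ON", "OFF"))
    if bad:
        raise ValueError(f"value must be ON or OFF for: {', '.join(bad)}")
    return {k: overrides[k] for k in DEFAULT_FLAGS
            if k in overrides and overrides[k] != DEFAULT_FLAGS[k]}


def cmake_args(overrides):
    """Render overrides as -DNAME=VALUE arguments for cmake."""
    return [f"-D{k}={v}" for k, v in changed_flags(overrides).items()]


def pip_env_prefix(overrides):
    """Render overrides as NAME=VALUE pairs to put in front of pip."""
    return " ".join(f"{k}={v}" for k, v in changed_flags(overrides).items())


wanted = {"BUILD_NVLINK": "ON", "BUILD_TORCH_PLUGIN": "ON", "BUILD_RDMA": "ON"}
print("cmake", *cmake_args(wanted), "..")
print(pip_env_prefix(wanted), "pip install -v --no-build-isolation -e .")
```

Flags that already match their default (here `BUILD_RDMA=ON`) are dropped, which keeps the generated command short.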

Benchmark

GDRDMA P2P Read/Write

Platform: NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe 5.0 x16 with x16 PCIe extension option; RoCE v2.

#BS=1, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1

Transfer Engine	#Channels	Message Size (bytes)	Batch Size	Num Concurrency	Avg Latency(ms)	Bandwidth(MB/s)
dlslime	1	2,048	1	1	0.039	52
dlslime	1	4,096	1	1	0.037	111
dlslime	1	8,192	1	1	0.038	216
dlslime	1	16,384	1	1	0.037	442
dlslime	1	32,768	1	1	0.039	836
dlslime	1	65,536	1	1	0.039	1689
dlslime	1	131,072	1	1	0.041	3195
dlslime	1	262,144	1	1	0.043	6059
dlslime	1	524,288	1	1	0.049	10689
dlslime	1	1,048,576	1	1	0.062	17012
dlslime	1	2,097,152	1	1	0.083	25154
dlslime	1	4,194,304	1	1	0.127	33112
dlslime	1	8,388,608	1	1	0.211	39797
dlslime	1	16,777,216	1	1	0.382	43893
dlslime	1	33,554,432	1	1	0.726	46244
dlslime	1	67,108,864	1	1	1.412	47518
dlslime	1	134,217,728	1	1	2.783	48235

#BS=64, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1

Transfer Engine	#Channels	Message Size (bytes)	Batch Size	Num Concurrency	Avg Latency(ms)	Bandwidth(MB/s)
dlslime	1	2,048	64	1	0.084	1562
dlslime	1	4,096	64	1	0.082	3213
dlslime	1	8,192	64	1	0.086	6095
dlslime	1	16,384	64	1	0.093	11249
dlslime	1	32,768	64	1	0.115	18193
dlslime	1	65,536	64	1	0.158	26542
dlslime	1	131,072	64	1	0.243	34498
dlslime	1	262,144	64	1	0.414	40549
dlslime	1	524,288	64	1	0.758	44248
dlslime	1	1,048,576	64	1	1.443	46510
dlslime	1	2,097,152	64	1	2.809	47782
dlslime	1	4,194,304	64	1	5.555	48327
dlslime	1	8,388,608	64	1	11.041	48624
dlslime	1	16,777,216	64	1	22.003	48798
dlslime	1	33,554,432	64	1	43.941	48872
dlslime	1	67,108,864	64	1	87.809	48912
dlslime	1	134,217,728	64	1	175.512	48942

#BS=64, #Concurrency=8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8

Transfer Engine	#Channels	Message Size (bytes)	Batch Size	Num Concurrency	Avg Latency(ms)	Bandwidth(MB/s)
dlslime	1	2,048	64	8	0.037	3519
dlslime	1	4,096	64	8	0.038	6948
dlslime	1	8,192	64	8	0.038	13758
dlslime	1	16,384	64	8	0.04	26416
dlslime	1	32,768	64	8	0.057	36997
dlslime	1	65,536	64	8	0.098	42618
dlslime	1	131,072	64	8	0.184	45602
dlslime	1	262,144	64	8	0.356	47148
dlslime	1	524,288	64	8	0.699	47975
dlslime	1	1,048,576	64	8	1.384	48478
dlslime	1	2,097,152	64	8	2.755	48709
dlslime	1	4,194,304	64	8	5.498	48823
dlslime	1	8,388,608	64	8	10.982	48884
dlslime	1	16,777,216	64	8	21.954	48908
dlslime	1	33,554,432	64	8	43.895	48923
dlslime	1	67,108,864	64	8	87.766	48936
dlslime	1	134,217,728	64	8	175.517	48940
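The bandwidth column in these tables appears to follow from the other columns as batch size times message size divided by average latency, with 1 MB taken as 10^6 bytes. This relationship is inferred from the reported values, not from the benchmark script, so treat the sketch below as a consistency check only. The function name is hypothetical.

```python
def bandwidth_mb_per_s(message_size_bytes, batch_size, avg_latency_ms):
    """Bytes moved per batch divided by the batch latency, in MB/s (1 MB = 1e6 B)."""
    total_bytes = message_size_bytes * batch_size
    seconds = avg_latency_ms / 1000.0
    return total_bytes / seconds / 1e6


# The message sizes in the tables are powers of two from 2 KiB to 128 MiB.
message_sizes = [2048 << i for i in range(17)]
print(message_sizes[0], message_sizes[-1])

# Largest row of the BS=64, concurrency=1 table: 175.512 ms, reported 48942 MB/s.
print(round(bandwidth_mb_per_s(134_217_728, 64, 175.512)))

# Largest row of the BS=1, concurrency=1 table: 2.783 ms, reported 48235 MB/s.
print(round(bandwidth_mb_per_s(134_217_728, 1, 2.783)))
```

For small messages the recomputed value can differ from the table by a few percent, because the latency column is rounded to three decimals.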

GDRDMA Aggregated Bandwidth

#BS=1, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1

Transfer Engine	#Channels	Message Size (bytes)	Batch Size	Num Concurrency	Avg Latency(ms)	Bandwidth(MB/s)
dlslime	8	2,048	1	1	0.051	157
dlslime	8	4,096	1	1	0.042	768
dlslime	8	8,192	1	1	0.04	1576
dlslime	8	16,384	1	1	0.054	2929
dlslime	8	32,768	1	1	0.051	5713
dlslime	8	65,536	1	1	0.052	11547
dlslime	8	131,072	1	1	0.055	22039
dlslime	8	262,144	1	1	0.058	42313
dlslime	8	524,288	1	1	0.064	74753
dlslime	8	1,048,576	1	1	0.072	127489
dlslime	8	2,097,152	1	1	0.101	184823
dlslime	8	4,194,304	1	1	0.149	246861
dlslime	8	8,388,608	1	1	0.237	299510
dlslime	8	16,777,216	1	1	0.403	340252
dlslime	8	33,554,432	1	1	0.743	364918
dlslime	8	67,108,864	1	1	1.423	378620
dlslime	8	134,217,728	1	1	2.79	384630

#BS=64, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1

Transfer Engine	#Channels	Message Size (bytes)	Batch Size	Num Concurrency	Avg Latency(ms)	Bandwidth(MB/s)
dlslime	8	2,048	64	1	0.091	11690
dlslime	8	4,096	64	1	0.081	24403
dlslime	8	8,192	64	1	0.091	45926
dlslime	8	16,384	64	1	0.098	84092
dlslime	8	32,768	64	1	0.117	138696
dlslime	8	65,536	64	1	0.16	206866
dlslime	8	131,072	64	1	0.241	273976
dlslime	8	262,144	64	1	0.415	320008
dlslime	8	524,288	64	1	0.757	353714
dlslime	8	1,048,576	64	1	1.439	372217
dlslime	8	2,097,152	64	1	2.819	381397
dlslime	8	4,194,304	64	1	5.555	386489
dlslime	8	8,388,608	64	1	11.044	388927
dlslime	8	16,777,216	64	1	22.009	390278
dlslime	8	33,554,432	64	1	43.951	390978
dlslime	8	67,108,864	64	1	87.804	391370
dlslime	8	134,217,728	64	1	175.508	391588

#BS=64, #Concurrency=8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8

Transfer Engine	#Channels	Message Size (bytes)	Batch Size	Num Concurrency	Avg Latency(ms)	Bandwidth(MB/s)
dlslime	8	2,048	64	8	0.036	28494
dlslime	8	4,096	64	8	0.038	50860
dlslime	8	8,192	64	8	0.048	104545
dlslime	8	16,384	64	8	0.041	207051
dlslime	8	32,768	64	8	0.056	297354
dlslime	8	65,536	64	8	0.099	337571
dlslime	8	131,072	64	8	0.185	363003
dlslime	8	262,144	64	8	0.356	376743
dlslime	8	524,288	64	8	0.701	383701
dlslime	8	1,048,576	64	8	1.386	387629
dlslime	8	2,097,152	64	8	2.757	389493
dlslime	8	4,194,304	64	8	5.5	390523
dlslime	8	8,388,608	64	8	10.984	391043
dlslime	8	16,777,216	64	8	21.955	391291
dlslime	8	33,554,432	64	8	43.891	391407
dlslime	8	67,108,864	64	8	87.771	391480
dlslime	8	134,217,728	64	8	175.518	391530
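One way to read the aggregated tables is to compare them with eight times the single-channel figures from the P2P Read/Write section. The sketch below computes that ratio for two rows; `scaling_efficiency` is a hypothetical helper, and the inputs are copied from the tables above.

```python
def scaling_efficiency(aggregated_mb_s, single_channel_mb_s, channels=8):
    """Aggregated bandwidth relative to `channels` independent single channels."""
    return aggregated_mb_s / (channels * single_channel_mb_s)


# (label, 8-channel bandwidth, 1-channel bandwidth), all in MB/s.
rows = [
    ("128 MiB, BS=64, concurrency=1", 391_588, 48_942),
    ("2 KiB, BS=1, concurrency=1", 157, 52),
]

for label, aggregated, single in rows:
    print(f"{label}: {scaling_efficiency(aggregated, single):.3f}")
```

A ratio near 1 means the eight channels scale almost linearly. The small-message row falls well below that, which suggests per-transfer overhead rather than link bandwidth dominates at that size.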

GDRDMA P2P Send/Recv

SLIME_QP_NUM=2 python bench/python/dlslime_torch_dist_sendrecv_bench.py --mode send --use-gpu --iterations 100

SLIME_QP_NUM=2 python bench/python/dlslime_torch_dist_sendrecv_bench.py --mode recv --use-gpu --iterations 100

Message Size (bytes)	Avg Latency	Bandwidth	Device
1,024	0.027 ms	37.65 MB/s	GPU
2,048	0.028 ms	72.17 MB/s	GPU
4,096	0.028 ms	144.81 MB/s	GPU
8,192	0.028 ms	295.98 MB/s	GPU
16,384	0.029 ms	564.15 MB/s	GPU
32,768	0.031 ms	1069.90 MB/s	GPU
65,536	0.031 ms	2083.20 MB/s	GPU
131,072	0.032 ms	4038.17 MB/s	GPU
262,144	0.036 ms	7299.42 MB/s	GPU
524,288	0.042 ms	12495.87 MB/s	GPU
1,048,576	0.053 ms	19961.18 MB/s	GPU
2,097,152	0.075 ms	27924.99 MB/s	GPU
4,194,304	0.117 ms	35716.55 MB/s	GPU
8,388,608	0.212 ms	39637.66 MB/s	GPU
16,777,216	0.387 ms	43386.08 MB/s	GPU
33,554,432	0.871 ms	38532.98 MB/s	GPU
67,108,864	1.665 ms	40298.91 MB/s	GPU
134,217,728	3.159 ms	42487.69 MB/s	GPU
268,435,456	5.643 ms	47572.53 MB/s	GPU
536,870,912	11.137 ms	48204.20 MB/s	GPU

Heterogeneous Interconnection

Hardware configuration

Device	NIC Model	Bandwidth	PCIe Version	PCIe Lanes
A	Mellanox ConnectX-7 Lx (MT4129)	400 Gbps	PCIe 5.0	x16
B	Mellanox ConnectX-7 Lx (MT4129)	400 Gbps	PCIe 5.0	x8
C	Mellanox ConnectX-7 Lx (MT4129)	200 Gbps	PCIe 5.0	x16
D	Mellanox ConnectX-7 Lx (MT4129)	400 Gbps	PCIe 5.0	x16

Experiment configuration
- Message size: 128 MB
- RDMA RC Read (single NIC)
- Under the affinity scenario
- RDMA with GPU Direct

Interconnect bandwidth matrix (MB/s). The measurements demonstrate attainment of the theoretical bound.

Throughput (MB/s)	A	B	C	D
A	48967.45	28686.29	24524.29	27676.57
B	28915.72	28275.85	23472.29	27234.60
C	24496.14	24496.51	24513.57	24493.89
D	29317.66	28683.25	24515.30	27491.33
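To relate the matrix to the NIC bandwidths listed in the hardware table, the throughput can be converted to Gbps and divided by the slower endpoint's nominal rate. This is a rough sketch under stated assumptions: 1 MB is taken as 10^6 bytes, only the nominal NIC bandwidth is considered, and PCIe lane count (device B is x8) and other limits are ignored. The helper names are hypothetical.

```python
# Nominal NIC bandwidth per device, in Gbps, from the hardware table.
NIC_GBPS = {"A": 400, "B": 400, "C": 200, "D": 400}


def to_gbps(throughput_mb_s):
    """Convert MB/s (1 MB = 1e6 bytes) to Gbit/s."""
    return throughput_mb_s * 8 / 1000


def nic_utilization(dev_x, dev_y, throughput_mb_s):
    """Throughput as a fraction of the slower endpoint's nominal NIC rate."""
    limit_gbps = min(NIC_GBPS[dev_x], NIC_GBPS[dev_y])
    return to_gbps(throughput_mb_s) / limit_gbps


# A few cells copied from the matrix above.
cells = [("A", "A", 48967.45), ("A", "C", 24524.29), ("C", "C", 24513.57)]
for row, col, mb_s in cells:
    print(f"{row}-{col}: {to_gbps(mb_s):.1f} Gbps, "
          f"{nic_utilization(row, col, mb_s):.1%} of nominal")
```

For these cells the result is roughly 98% of the nominal rate. Pairs involving device B or D come out lower under this simple model, so other limits likely apply there.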

Detailed results: see `bench`.

Release history

Version	Release date	Notes
0.0.3rc2	May 6, 2026	pre-release
0.0.3rc1	Mar 31, 2026	pre-release; this version
0.0.2.post1	Jan 6, 2026	
0.0.2	Jan 5, 2026	
0.0.2rc1	Jan 5, 2026	pre-release
0.0.1.post10	Sep 22, 2025	
0.0.1.post7	May 6, 2025	
0.0.1.post6	Apr 29, 2025	
0.0.1.post5	Apr 29, 2025	
0.0.1.post4	Apr 26, 2025	
0.0.1.post3	Apr 24, 2025	
0.0.1.post2	Apr 8, 2025	
0.0.1.post1	Apr 3, 2025	
0.0.1	Apr 2, 2025	

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distributions

File	Size	Uploaded	Interpreter
dlslime-0.0.3rc1-cp313-cp313-manylinux2014_x86_64.whl	629.1 kB	Mar 31, 2026	CPython 3.13
dlslime-0.0.3rc1-cp312-cp312-manylinux2014_x86_64.whl	628.5 kB	Mar 31, 2026	CPython 3.12
dlslime-0.0.3rc1-cp311-cp311-manylinux2014_x86_64.whl	630.8 kB	Mar 31, 2026	CPython 3.11
dlslime-0.0.3rc1-cp310-cp310-manylinux2014_x86_64.whl	629.5 kB	Mar 31, 2026	CPython 3.10

File details

Details for the file dlslime-0.0.3rc1-cp313-cp313-manylinux2014_x86_64.whl.

File metadata

Download URL: dlslime-0.0.3rc1-cp313-cp313-manylinux2014_x86_64.whl
Upload date: Mar 31, 2026
Size: 629.1 kB
Tags: CPython 3.13
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for dlslime-0.0.3rc1-cp313-cp313-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`d01bfaec784645c3062b0f25177464cd6ab6ce198b7e82289cf2abfb6cb7b7bf`
MD5	`afb1bef64800b8feb3e0321314e69672`
BLAKE2b-256	`792f0d0e1a33b63d5fcf7ce035dc6e964af19d8c581fdd62766839f64eb40c44`

See more details on using hashes here.
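One way to check a downloaded wheel against the SHA256 digest listed above is to hash the file with Python's standard `hashlib` module. The sketch below assumes the wheel has been saved in the current directory under its original file name; `sha256_of` and `verify_wheel` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path, expected_sha256):
    """True if the file's SHA256 digest matches the expected hex string."""
    return sha256_of(path) == expected_sha256.strip().lower()


# Digest copied from the hash table for the cp313 wheel.
EXPECTED = "d01bfaec784645c3062b0f25177464cd6ab6ce198b7e82289cf2abfb6cb7b7bf"
wheel = Path("dlslime-0.0.3rc1-cp313-cp313-manylinux2014_x86_64.whl")

if wheel.exists():
    print("match" if verify_wheel(wheel, EXPECTED) else "MISMATCH")
else:
    print(f"{wheel} not found; download it first")
```

The same check applies to the other wheels by swapping in the file name and digest from the corresponding table.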

File details

Details for the file dlslime-0.0.3rc1-cp312-cp312-manylinux2014_x86_64.whl.

File metadata

Download URL: dlslime-0.0.3rc1-cp312-cp312-manylinux2014_x86_64.whl
Upload date: Mar 31, 2026
Size: 628.5 kB
Tags: CPython 3.12
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for dlslime-0.0.3rc1-cp312-cp312-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`c952b6a5e355e4ceade31f513edfd934c45853f9ab05a5dbf25d98dbd9d664b6`
MD5	`582d6b53fa5fc56884c2b20a56a45b44`
BLAKE2b-256	`45f3b11d3e1ee33b6976fef0ef27c4c9559aba7b2ffa62d6287c12cc04a2a0a0`

File details

Details for the file dlslime-0.0.3rc1-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

Download URL: dlslime-0.0.3rc1-cp311-cp311-manylinux2014_x86_64.whl
Upload date: Mar 31, 2026
Size: 630.8 kB
Tags: CPython 3.11
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for dlslime-0.0.3rc1-cp311-cp311-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`c68fd3318011e9a076ad81c32422499976874b929b1a1fb0a3036a4d1fdd4610`
MD5	`338b1ae12eb414110a58280bf9fab07d`
BLAKE2b-256	`4153120db4ceaca59e6b0a102524c4aad7f57fbcf4893993ed7f6996190b90cc`

File details

Details for the file dlslime-0.0.3rc1-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

Download URL: dlslime-0.0.3rc1-cp310-cp310-manylinux2014_x86_64.whl
Upload date: Mar 31, 2026
Size: 629.5 kB
Tags: CPython 3.10
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for dlslime-0.0.3rc1-cp310-cp310-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`c20d4377483f4d28cfbbbd7e2967c854e638bc2df24cd252a2af7a0ad0edcebd`
MD5	`4b00bc4acfe3b374d3f5c6a8de470f2d`
BLAKE2b-256	`cc73f062165786d1663253c88947d9023c29f8c6205dbd8b1b6ed775e667445e`
