
DLSlime Transfer Engine


Roadmap | Slack | WeChat Group | Zhihu

Flexible & Efficient Heterogeneous Transfer Toolkit

Getting Started

DLSlime offers a set of peer-to-peer communication interfaces. For instance, consider the task of batched slice assignment from a remote tensor to a local tensor. You can accomplish this using the following APIs.

Assignment Operation.

Here are some examples of the DLSlime interface.
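The semantics of a batched slice assignment can be sketched in plain Python (a semantic illustration only — the actual copy is performed by DLSlime over RDMA or NVLink, and the `(local_offset, remote_offset, length)` batch layout here is an illustrative assumption, not the library's API):

```python
# Semantic sketch: each batch item copies one contiguous slice of the
# remote buffer into the local buffer.
def batched_slice_assign(local, remote, batch):
    """Apply local[lo:lo+n] = remote[ro:ro+n] for every (lo, ro, n) in batch."""
    for local_off, remote_off, length in batch:
        local[local_off:local_off + length] = remote[remote_off:remote_off + length]
    return local

local = [0] * 8
remote = list(range(100, 108))
batched_slice_assign(local, remote, [(0, 4, 2), (6, 0, 2)])
print(local)  # → [104, 105, 0, 0, 0, 0, 100, 101]
```

DLSlime performs the same assignment without staging through host memory, issuing the whole batch as one RDMA operation.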

P2P Communication

RDMA RC Mode

  • RDMA RC Read (Sync / Async mode)
python example/python/p2p_rdma_rc_read.py
  • RDMA RC Read (Coroutine mode)
python example/python/p2p_rdma_rc_read_coroutine.py
  • RDMA RC Write (Sync / Async mode)
python example/python/p2p_rdma_rc_write.py
  • RDMA RC Write with immediate data (Sync / Async mode)
python example/python/p2p_rdma_rc_write_with_imm_data.py
  • RDMA RC Send/Recv
python example/python/p2p_rdma_rc_send_recv.py
python example/python/p2p_rdma_rc_send_recv_gdr.py
  • DLSlime torch backend
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 0
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 1

NVLink Mode

torchrun --nproc_per_node=2 example/python/p2p_nvlink.py

NVShmem Mode

# send
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 0 --world-size 2
# recv
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 1 --world-size 2

Huawei Ascend Direct Mode

See: Huawei README

[!Caution] The DLSlime NVShmem transfer engine and Huawei Ascend Direct mode are in the experimental stage.

Collective Ops

Intra Node

AllGather
torchrun --nnodes 1 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode intra

Inter Node

AllGather
# Node 0
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode inter
# Node 1
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 1 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode inter
AllGather Gemm Overlapping
# Node 0
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_gemm_overlap.py
# Node 1
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 1 --nproc-per-node 8 --master-port 6007 example/python/all_gather_gemm_overlap.py

[!Note] The intra- and inter-node examples above enable CUDA Graph by default; pass --eager-mode to fall back to eager execution.
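For reference, the collective semantics of AllGather as used above: every rank ends up with the concatenation of all ranks' input shards, in rank order. A pure-Python sketch of the semantics (not the NVLink/NVSHMEM implementation):

```python
def all_gather(shards):
    """Simulate AllGather over len(shards) ranks: each rank's output is the
    concatenation of every rank's input shard, in rank order."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]  # one identical copy per rank

# Four ranks, each contributing one shard.
outs = all_gather([[0, 1], [2, 3], [4, 5], [6, 7]])
print(outs[0])  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

The overlap example goes one step further and hides the gather latency behind a concurrent GEMM on the already-received shards.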

Install

pip install

pip install dlslime==0.0.1.post10

[!Note] The DLSlime pip package is built with the default build flags (see Build from source for details).

Build from source

Python

git clone https://github.com/deeplink-org/DLSlime.git
FLAG=<ON|OFF> pip install -v --no-build-isolation -e .

CPP

git clone https://github.com/deeplink-org/DLSlime.git
mkdir -p DLSlime/build && cd DLSlime/build
cmake -DFLAG=<ON|OFF> ..
make -j

Build flags

The available flags are:

| Flag | Description | Platform | Default |
|---|---|---|---|
| BUILD_RDMA | Build the RDMA transfer engine | Hetero | ON |
| BUILD_PYTHON | Build the Python wrapper | Hetero | ON |
| BUILD_NVLINK | Build the NVLink transfer engine | GPGPU | OFF |
| BUILD_NVSHMEM | Build the NVShmem transfer engine | NVIDIA | OFF |
| BUILD_ASCEND_DIRECT | Build the Ascend Direct transport | ASCEND | OFF |
| BUILD_TORCH_PLUGIN | Build DLSlime as a torch backend | Hetero | OFF |
| USE_GLOO_BACKEND | Use the GLOO RDMA Send/Recv torch backend | Hetero | OFF |
| BUILD_INTRA_OPS | Build intra-node collective ops | GPGPU | OFF |
| BUILD_INTER_OPS | Build inter-node collective ops (NVSHMEM) | NVIDIA | OFF |

[!Note] Please enable USE_MECA when using DLSlime as a torch backend on the Metax platform.
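As an illustration, a source build that enables the NVLink engine and the torch plugin might look like this (the flag combination is chosen for illustration; pick the flags for your platform from the table above):

```shell
git clone https://github.com/deeplink-org/DLSlime.git
cd DLSlime
# Flags are passed as environment variables to the pip build.
BUILD_NVLINK=ON BUILD_TORCH_PLUGIN=ON pip install -v --no-build-isolation -e .
```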

Benchmark

GDRDMA P2P Read/Write

  • Platform: NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe 5.0 x16 with x16 PCIe extension option; RoCE v2.

#BS=1, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 1 | 2,048 | 1 | 1 | 0.039 | 52 |
| dlslime | 1 | 4,096 | 1 | 1 | 0.037 | 111 |
| dlslime | 1 | 8,192 | 1 | 1 | 0.038 | 216 |
| dlslime | 1 | 16,384 | 1 | 1 | 0.037 | 442 |
| dlslime | 1 | 32,768 | 1 | 1 | 0.039 | 836 |
| dlslime | 1 | 65,536 | 1 | 1 | 0.039 | 1689 |
| dlslime | 1 | 131,072 | 1 | 1 | 0.041 | 3195 |
| dlslime | 1 | 262,144 | 1 | 1 | 0.043 | 6059 |
| dlslime | 1 | 524,288 | 1 | 1 | 0.049 | 10689 |
| dlslime | 1 | 1,048,576 | 1 | 1 | 0.062 | 17012 |
| dlslime | 1 | 2,097,152 | 1 | 1 | 0.083 | 25154 |
| dlslime | 1 | 4,194,304 | 1 | 1 | 0.127 | 33112 |
| dlslime | 1 | 8,388,608 | 1 | 1 | 0.211 | 39797 |
| dlslime | 1 | 16,777,216 | 1 | 1 | 0.382 | 43893 |
| dlslime | 1 | 33,554,432 | 1 | 1 | 0.726 | 46244 |
| dlslime | 1 | 67,108,864 | 1 | 1 | 1.412 | 47518 |
| dlslime | 1 | 134,217,728 | 1 | 1 | 2.783 | 48235 |
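The bandwidth column follows from message size, batch size, and latency as `size_bytes × batch / latency`, with MB = 10^6 bytes (this relation is inferred from the table data, not stated by the benchmark itself):

```python
def bandwidth_mbps(msg_size_bytes, batch_size, latency_ms):
    """Aggregate bandwidth in MB/s (MB = 1e6 bytes) from per-iteration latency."""
    return msg_size_bytes * batch_size / (latency_ms / 1e3) / 1e6

# Last row of the table above: 128 MiB, batch 1, 2.783 ms.
bw = bandwidth_mbps(134_217_728, 1, 2.783)
print(f"{bw:.0f} MB/s")  # ≈ 48228 MB/s, consistent with the table's 48235
```

The same formula reproduces the batched rows, e.g. 128 MiB × 64 at 175.512 ms gives roughly 48,943 MB/s.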

#BS=64, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 1 | 2,048 | 64 | 1 | 0.084 | 1562 |
| dlslime | 1 | 4,096 | 64 | 1 | 0.082 | 3213 |
| dlslime | 1 | 8,192 | 64 | 1 | 0.086 | 6095 |
| dlslime | 1 | 16,384 | 64 | 1 | 0.093 | 11249 |
| dlslime | 1 | 32,768 | 64 | 1 | 0.115 | 18193 |
| dlslime | 1 | 65,536 | 64 | 1 | 0.158 | 26542 |
| dlslime | 1 | 131,072 | 64 | 1 | 0.243 | 34498 |
| dlslime | 1 | 262,144 | 64 | 1 | 0.414 | 40549 |
| dlslime | 1 | 524,288 | 64 | 1 | 0.758 | 44248 |
| dlslime | 1 | 1,048,576 | 64 | 1 | 1.443 | 46510 |
| dlslime | 1 | 2,097,152 | 64 | 1 | 2.809 | 47782 |
| dlslime | 1 | 4,194,304 | 64 | 1 | 5.555 | 48327 |
| dlslime | 1 | 8,388,608 | 64 | 1 | 11.041 | 48624 |
| dlslime | 1 | 16,777,216 | 64 | 1 | 22.003 | 48798 |
| dlslime | 1 | 33,554,432 | 64 | 1 | 43.941 | 48872 |
| dlslime | 1 | 67,108,864 | 64 | 1 | 87.809 | 48912 |
| dlslime | 1 | 134,217,728 | 64 | 1 | 175.512 | 48942 |

#BS=64, #Concurrency=8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 1 | 2,048 | 64 | 8 | 0.037 | 3519 |
| dlslime | 1 | 4,096 | 64 | 8 | 0.038 | 6948 |
| dlslime | 1 | 8,192 | 64 | 8 | 0.038 | 13758 |
| dlslime | 1 | 16,384 | 64 | 8 | 0.04 | 26416 |
| dlslime | 1 | 32,768 | 64 | 8 | 0.057 | 36997 |
| dlslime | 1 | 65,536 | 64 | 8 | 0.098 | 42618 |
| dlslime | 1 | 131,072 | 64 | 8 | 0.184 | 45602 |
| dlslime | 1 | 262,144 | 64 | 8 | 0.356 | 47148 |
| dlslime | 1 | 524,288 | 64 | 8 | 0.699 | 47975 |
| dlslime | 1 | 1,048,576 | 64 | 8 | 1.384 | 48478 |
| dlslime | 1 | 2,097,152 | 64 | 8 | 2.755 | 48709 |
| dlslime | 1 | 4,194,304 | 64 | 8 | 5.498 | 48823 |
| dlslime | 1 | 8,388,608 | 64 | 8 | 10.982 | 48884 |
| dlslime | 1 | 16,777,216 | 64 | 8 | 21.954 | 48908 |
| dlslime | 1 | 33,554,432 | 64 | 8 | 43.895 | 48923 |
| dlslime | 1 | 67,108,864 | 64 | 8 | 87.766 | 48936 |
| dlslime | 1 | 134,217,728 | 64 | 8 | 175.517 | 48940 |

GDRDMA Aggregated Bandwidth

#BS=1, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 8 | 2,048 | 1 | 1 | 0.051 | 157 |
| dlslime | 8 | 4,096 | 1 | 1 | 0.042 | 768 |
| dlslime | 8 | 8,192 | 1 | 1 | 0.04 | 1576 |
| dlslime | 8 | 16,384 | 1 | 1 | 0.054 | 2929 |
| dlslime | 8 | 32,768 | 1 | 1 | 0.051 | 5713 |
| dlslime | 8 | 65,536 | 1 | 1 | 0.052 | 11547 |
| dlslime | 8 | 131,072 | 1 | 1 | 0.055 | 22039 |
| dlslime | 8 | 262,144 | 1 | 1 | 0.058 | 42313 |
| dlslime | 8 | 524,288 | 1 | 1 | 0.064 | 74753 |
| dlslime | 8 | 1,048,576 | 1 | 1 | 0.072 | 127489 |
| dlslime | 8 | 2,097,152 | 1 | 1 | 0.101 | 184823 |
| dlslime | 8 | 4,194,304 | 1 | 1 | 0.149 | 246861 |
| dlslime | 8 | 8,388,608 | 1 | 1 | 0.237 | 299510 |
| dlslime | 8 | 16,777,216 | 1 | 1 | 0.403 | 340252 |
| dlslime | 8 | 33,554,432 | 1 | 1 | 0.743 | 364918 |
| dlslime | 8 | 67,108,864 | 1 | 1 | 1.423 | 378620 |
| dlslime | 8 | 134,217,728 | 1 | 1 | 2.79 | 384630 |

#BS=64, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 8 | 2,048 | 64 | 1 | 0.091 | 11690 |
| dlslime | 8 | 4,096 | 64 | 1 | 0.081 | 24403 |
| dlslime | 8 | 8,192 | 64 | 1 | 0.091 | 45926 |
| dlslime | 8 | 16,384 | 64 | 1 | 0.098 | 84092 |
| dlslime | 8 | 32,768 | 64 | 1 | 0.117 | 138696 |
| dlslime | 8 | 65,536 | 64 | 1 | 0.16 | 206866 |
| dlslime | 8 | 131,072 | 64 | 1 | 0.241 | 273976 |
| dlslime | 8 | 262,144 | 64 | 1 | 0.415 | 320008 |
| dlslime | 8 | 524,288 | 64 | 1 | 0.757 | 353714 |
| dlslime | 8 | 1,048,576 | 64 | 1 | 1.439 | 372217 |
| dlslime | 8 | 2,097,152 | 64 | 1 | 2.819 | 381397 |
| dlslime | 8 | 4,194,304 | 64 | 1 | 5.555 | 386489 |
| dlslime | 8 | 8,388,608 | 64 | 1 | 11.044 | 388927 |
| dlslime | 8 | 16,777,216 | 64 | 1 | 22.009 | 390278 |
| dlslime | 8 | 33,554,432 | 64 | 1 | 43.951 | 390978 |
| dlslime | 8 | 67,108,864 | 64 | 1 | 87.804 | 391370 |
| dlslime | 8 | 134,217,728 | 64 | 1 | 175.508 | 391588 |

#BS=64, #Concurrency=8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 8 | 2,048 | 64 | 8 | 0.036 | 28494 |
| dlslime | 8 | 4,096 | 64 | 8 | 0.038 | 50860 |
| dlslime | 8 | 8,192 | 64 | 8 | 0.048 | 104545 |
| dlslime | 8 | 16,384 | 64 | 8 | 0.041 | 207051 |
| dlslime | 8 | 32,768 | 64 | 8 | 0.056 | 297354 |
| dlslime | 8 | 65,536 | 64 | 8 | 0.099 | 337571 |
| dlslime | 8 | 131,072 | 64 | 8 | 0.185 | 363003 |
| dlslime | 8 | 262,144 | 64 | 8 | 0.356 | 376743 |
| dlslime | 8 | 524,288 | 64 | 8 | 0.701 | 383701 |
| dlslime | 8 | 1,048,576 | 64 | 8 | 1.386 | 387629 |
| dlslime | 8 | 2,097,152 | 64 | 8 | 2.757 | 389493 |
| dlslime | 8 | 4,194,304 | 64 | 8 | 5.5 | 390523 |
| dlslime | 8 | 8,388,608 | 64 | 8 | 10.984 | 391043 |
| dlslime | 8 | 16,777,216 | 64 | 8 | 21.955 | 391291 |
| dlslime | 8 | 33,554,432 | 64 | 8 | 43.891 | 391407 |
| dlslime | 8 | 67,108,864 | 64 | 8 | 87.771 | 391480 |
| dlslime | 8 | 134,217,728 | 64 | 8 | 175.518 | 391530 |

GDRDMA P2P Send/Recv

SLIME_QP_NUM=2 python bench/python/dlslime_torch_dist_sendrecv_bench.py --mode send --use-gpu --iterations 100
SLIME_QP_NUM=2 python bench/python/dlslime_torch_dist_sendrecv_bench.py --mode recv --use-gpu --iterations 100
| Message Size (bytes) | Avg Latency | Bandwidth | Device |
|---|---|---|---|
| 1,024 | 0.027 ms | 37.65 MB/s | GPU |
| 2,048 | 0.028 ms | 72.17 MB/s | GPU |
| 4,096 | 0.028 ms | 144.81 MB/s | GPU |
| 8,192 | 0.028 ms | 295.98 MB/s | GPU |
| 16,384 | 0.029 ms | 564.15 MB/s | GPU |
| 32,768 | 0.031 ms | 1069.90 MB/s | GPU |
| 65,536 | 0.031 ms | 2083.20 MB/s | GPU |
| 131,072 | 0.032 ms | 4038.17 MB/s | GPU |
| 262,144 | 0.036 ms | 7299.42 MB/s | GPU |
| 524,288 | 0.042 ms | 12495.87 MB/s | GPU |
| 1,048,576 | 0.053 ms | 19961.18 MB/s | GPU |
| 2,097,152 | 0.075 ms | 27924.99 MB/s | GPU |
| 4,194,304 | 0.117 ms | 35716.55 MB/s | GPU |
| 8,388,608 | 0.212 ms | 39637.66 MB/s | GPU |
| 16,777,216 | 0.387 ms | 43386.08 MB/s | GPU |
| 33,554,432 | 0.871 ms | 38532.98 MB/s | GPU |
| 67,108,864 | 1.665 ms | 40298.91 MB/s | GPU |
| 134,217,728 | 3.159 ms | 42487.69 MB/s | GPU |
| 268,435,456 | 5.643 ms | 47572.53 MB/s | GPU |
| 536,870,912 | 11.137 ms | 48204.20 MB/s | GPU |

Heterogeneous Interconnection

  • Hardware configuration

| Device | NIC Model | Bandwidth | PCIe Version | PCIe Lanes |
|---|---|---|---|---|
| A | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x16 |
| B | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x8 |
| C | Mellanox ConnectX-7 Lx (MT4129) | 200 Gbps | PCIe 5.0 | x16 |
| D | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x16 |
  • Experiment configuration

    • Message Size = 128 MB
    • RDMA RC Read (single NIC)
    • Under an affinity scenario
    • RDMA with GPU Direct
  • Interconnect bandwidth matrix (MB/s; the measured throughput approaches the theoretical bound):

| Throughput (MB/s) | A | B | C | D |
|---|---|---|---|---|
| A | 48967.45 | 28686.29 | 24524.29 | 27676.57 |
| B | 28915.72 | 28275.85 | 23472.29 | 27234.60 |
| C | 24496.14 | 24496.51 | 24513.57 | 24493.89 |
| D | 29317.66 | 28683.25 | 24515.30 | 27491.33 |

detailed results: bench
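The matrix is consistent with simple per-link bounds: a 400 Gbps NIC tops out at 50,000 MB/s, a 200 Gbps NIC at 25,000 MB/s, and PCIe 5.0 carries ~32 Gb/s per lane raw, so the per-direction bound for a path is the minimum of the two (back-of-the-envelope figures that ignore protocol overhead; exact PCIe efficiency varies):

```python
def path_bound_mbps(nic_gbps, pcie_lanes, pcie_gbps_per_lane=32):
    """Rough per-direction bandwidth bound in MB/s: the smaller of the NIC
    line rate and the PCIe 5.0 lane budget, ignoring protocol overhead."""
    nic = nic_gbps * 1e9 / 8 / 1e6                          # Gbps -> MB/s
    pcie = pcie_lanes * pcie_gbps_per_lane * 1e9 / 8 / 1e6  # lanes -> MB/s
    return min(nic, pcie)

print(path_bound_mbps(400, 16))  # → 50000.0 (A/D: NIC-limited; A measured 48967)
print(path_bound_mbps(400, 8))   # → 32000.0 (B: PCIe x8-limited; measured ~28900)
print(path_bound_mbps(200, 16))  # → 25000.0 (C: NIC-limited; measured ~24500)
```

This explains why every row and column involving C saturates near 24,500 MB/s, and why B's rows sit below A's and D's despite an identical NIC.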
