
DLSlime Transfer Engine


Roadmap | Slack | WeChat Group | Zhihu

Flexible & Efficient Heterogeneous Transfer Toolkit

Getting Started

DLSlime offers a set of peer-to-peer communication interfaces. For instance, consider the task of batched slice assignment from a remote tensor to a local tensor. You can accomplish this using the following APIs.

Assignment Operation.
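To make the semantics of a batched slice assignment concrete, here is a minimal local sketch in plain Python. It does not use the DLSlime API: the `Assignment` type and the `batched_read` function are hypothetical names introduced for illustration only, and the "remote" buffer is simply another in-process buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    """One slice copy (hypothetical type, not part of DLSlime)."""
    target_offset: int  # byte offset in the local buffer
    source_offset: int  # byte offset in the remote buffer
    length: int         # number of bytes to copy


def batched_read(local, remote, batch):
    """Apply local[t:t+n] = remote[s:s+n] for every assignment in the batch."""
    for a in batch:
        if min(a.target_offset, a.source_offset, a.length) < 0:
            raise ValueError("offsets and length must be non-negative")
        if a.source_offset + a.length > len(remote):
            raise ValueError("slice exceeds the remote buffer")
        if a.target_offset + a.length > len(local):
            raise ValueError("slice exceeds the local buffer")
        local[a.target_offset:a.target_offset + a.length] = (
            remote[a.source_offset:a.source_offset + a.length]
        )


if __name__ == "__main__":
    remote = bytes(range(16))
    local = bytearray(8)
    batched_read(local, remote, [Assignment(0, 8, 4), Assignment(4, 0, 4)])
    print(list(local))
```

In a real transfer the copy would be performed by the NIC or interconnect rather than by a Python slice, but the offset and length bookkeeping is the same idea.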

Here are some examples of the DLSlime interfaces.

P2P Communication

RDMA RC Mode

  • RDMA RC Read (Sync / Async mode)
python example/python/p2p_rdma_rc_read.py
  • RDMA RC Read (Coroutine mode)
python example/python/p2p_rdma_rc_read_coroutine.py
  • RDMA RC Write (Sync / Async mode)
python example/python/p2p_rdma_rc_write.py
  • RDMA RC Write with immediate data (Sync / Async mode)
python example/python/p2p_rdma_rc_write_with_imm_data.py
  • RDMA RC Send/Recv
python example/python/p2p_rdma_rc_send_recv.py
python example/python/p2p_rdma_rc_send_recv_gdr.py
  • DLSlime torch backend
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 0
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 1

NVLink Mode

torchrun --nproc_per_node=2 p2p_nvlink.py

NVShmem Mode

# send
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 0 --world-size 2
# recv
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 1 --world-size 2

Huawei Ascend Direct Mode

See: Huawei README

[!Caution] The DLSlime NVShmem transfer engine and the Huawei Ascend Direct mode are in the experimental stage.

Collective Ops

Intra Node

AllGather
torchrun --nnodes 1 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode intra

Inter Node

AllGather
# Node 0
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode inter
# Node 1
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 1 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode inter
AllGather Gemm Overlapping
# Node 0
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_gemm_overlap.py
# Node 1
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 1 --nproc-per-node 8 --master-port 6007 example/python/all_gather_gemm_overlap.py

[!Note] The intra-node and inter-node examples above enable CUDA Graph by default. Pass --eager-mode to fall back to eager mode.
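The multi-node commands above differ only in `--node-rank`. As a convenience, the following sketch generates one `torchrun` command line per node. The helper name `torchrun_commands` and its parameters are hypothetical; only the `torchrun` options already shown above are used.

```python
import shlex


def torchrun_commands(script, nnodes, master_addr, master_port=6007,
                      nproc_per_node=8, script_args=()):
    """Return one torchrun command line per node rank (hypothetical helper)."""
    commands = []
    for node_rank in range(nnodes):
        argv = [
            "torchrun",
            "--nnodes", str(nnodes),
            "--master-addr", master_addr,
            "--node-rank", str(node_rank),
            "--nproc-per-node", str(nproc_per_node),
            "--master-port", str(master_port),
            script,
            *script_args,
        ]
        # shlex.join quotes arguments only where the shell requires it.
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for line in torchrun_commands("example/python/all_gather_ll.py", 2,
                                  "10.130.8.143",
                                  script_args=["--mode", "inter"]):
        print(line)
```

Each printed line is meant to be run on the node whose rank it names.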

Install

pip install

pip install dlslime==0.0.1.post10

[!Note] The DLSlime pip package is built with the default flags (see Build from source for details).
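After installing, you may want to confirm which version is present. This sketch uses only the standard library (`importlib.metadata`); the function name `installed_version` is a hypothetical helper, and the distribution name `dlslime` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("dlslime")
    print("dlslime is not installed" if found is None else f"dlslime {found}")
```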

Build from source

Python

git clone https://github.com/deeplink-org/DLSlime.git
cd DLSlime
FLAG=<ON|OFF> pip install -v --no-build-isolation -e .

CPP

git clone https://github.com/deeplink-org/DLSlime.git
mkdir -p DLSlime/build && cd DLSlime/build
cmake -DFLAG=<ON|OFF> ..

Build flags

The FLAG can be:

| Flag | Description | Platform | Default |
|---|---|---|---|
| BUILD_RDMA | Build RDMA Transfer Engine | Hetero | ON |
| BUILD_PYTHON | Build Python wrapper | Hetero | ON |
| BUILD_NVLINK | Build NVLINK Transfer Engine | GPGPU | OFF |
| BUILD_NVSHMEM | Build NVShmem Transfer Engine | NVIDIA | OFF |
| BUILD_ASCEND_DIRECT | Build Ascend direct transport | ASCEND | OFF |
| BUILD_TORCH_PLUGIN | Build DLSlime as a torch backend | Hetero | OFF |
| USE_GLOO_BACKEND | Use GLOO RDMA Send/Recv torch backend | Hetero | OFF |
| BUILD_INTRA_OPS | Use INTRA Collective OPS | GPGPU | OFF |
| BUILD_INTER_OPS | Use INTER Collective OPS (NVSHMEM) | NVIDIA | OFF |

[!Note] Please enable USE_MECA when using DLSlime as a torch backend on the Metax platform.
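When several flags are needed at once, the same `FLAG=<ON|OFF>` and `-DFLAG=<ON|OFF>` patterns shown above can be repeated per flag. The sketch below composes both command lines from a dictionary of overrides. The helper `build_commands` is hypothetical, and it assumes that multiple flags can be combined simply by listing them, which the text above implies but does not state.

```python
# Flag names copied from the Build flags table.
KNOWN_FLAGS = (
    "BUILD_RDMA", "BUILD_PYTHON", "BUILD_NVLINK", "BUILD_NVSHMEM",
    "BUILD_ASCEND_DIRECT", "BUILD_TORCH_PLUGIN", "USE_GLOO_BACKEND",
    "BUILD_INTRA_OPS", "BUILD_INTER_OPS",
)


def build_commands(overrides):
    """Return (pip_command, cmake_command) for a {flag_name: bool} mapping."""
    unknown = sorted(set(overrides) - set(KNOWN_FLAGS))
    if unknown:
        raise ValueError(f"unknown build flag(s): {', '.join(unknown)}")
    pairs = [(name, "ON" if overrides[name] else "OFF")
             for name in KNOWN_FLAGS if name in overrides]
    env_prefix = " ".join(f"{name}={value}" for name, value in pairs)
    defines = " ".join(f"-D{name}={value}" for name, value in pairs)
    pip_cmd = " ".join(
        part for part in (env_prefix, "pip install -v --no-build-isolation -e .") if part
    )
    cmake_cmd = " ".join(part for part in ("cmake", defines, "..") if part)
    return pip_cmd, cmake_cmd


if __name__ == "__main__":
    for command in build_commands({"BUILD_NVLINK": True, "BUILD_TORCH_PLUGIN": True}):
        print(command)
```

Rejecting unknown names early avoids a silent typo such as `BUILD_NVLNK=ON`, which a build system would typically ignore.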

Benchmark

GDRDMA P2P Read/Write

  • Platform: NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe 5.0 x16 with x16 PCIe extension option; RoCE v2.

#BS=1, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 1 | 2,048 | 1 | 1 | 0.039 | 52 |
| dlslime | 1 | 4,096 | 1 | 1 | 0.037 | 111 |
| dlslime | 1 | 8,192 | 1 | 1 | 0.038 | 216 |
| dlslime | 1 | 16,384 | 1 | 1 | 0.037 | 442 |
| dlslime | 1 | 32,768 | 1 | 1 | 0.039 | 836 |
| dlslime | 1 | 65,536 | 1 | 1 | 0.039 | 1689 |
| dlslime | 1 | 131,072 | 1 | 1 | 0.041 | 3195 |
| dlslime | 1 | 262,144 | 1 | 1 | 0.043 | 6059 |
| dlslime | 1 | 524,288 | 1 | 1 | 0.049 | 10689 |
| dlslime | 1 | 1,048,576 | 1 | 1 | 0.062 | 17012 |
| dlslime | 1 | 2,097,152 | 1 | 1 | 0.083 | 25154 |
| dlslime | 1 | 4,194,304 | 1 | 1 | 0.127 | 33112 |
| dlslime | 1 | 8,388,608 | 1 | 1 | 0.211 | 39797 |
| dlslime | 1 | 16,777,216 | 1 | 1 | 0.382 | 43893 |
| dlslime | 1 | 33,554,432 | 1 | 1 | 0.726 | 46244 |
| dlslime | 1 | 67,108,864 | 1 | 1 | 1.412 | 47518 |
| dlslime | 1 | 134,217,728 | 1 | 1 | 2.783 | 48235 |

#BS=64, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 1 | 2,048 | 64 | 1 | 0.084 | 1562 |
| dlslime | 1 | 4,096 | 64 | 1 | 0.082 | 3213 |
| dlslime | 1 | 8,192 | 64 | 1 | 0.086 | 6095 |
| dlslime | 1 | 16,384 | 64 | 1 | 0.093 | 11249 |
| dlslime | 1 | 32,768 | 64 | 1 | 0.115 | 18193 |
| dlslime | 1 | 65,536 | 64 | 1 | 0.158 | 26542 |
| dlslime | 1 | 131,072 | 64 | 1 | 0.243 | 34498 |
| dlslime | 1 | 262,144 | 64 | 1 | 0.414 | 40549 |
| dlslime | 1 | 524,288 | 64 | 1 | 0.758 | 44248 |
| dlslime | 1 | 1,048,576 | 64 | 1 | 1.443 | 46510 |
| dlslime | 1 | 2,097,152 | 64 | 1 | 2.809 | 47782 |
| dlslime | 1 | 4,194,304 | 64 | 1 | 5.555 | 48327 |
| dlslime | 1 | 8,388,608 | 64 | 1 | 11.041 | 48624 |
| dlslime | 1 | 16,777,216 | 64 | 1 | 22.003 | 48798 |
| dlslime | 1 | 33,554,432 | 64 | 1 | 43.941 | 48872 |
| dlslime | 1 | 67,108,864 | 64 | 1 | 87.809 | 48912 |
| dlslime | 1 | 134,217,728 | 64 | 1 | 175.512 | 48942 |

#BS=64, #Concurrency=8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 1 | 2,048 | 64 | 8 | 0.037 | 3519 |
| dlslime | 1 | 4,096 | 64 | 8 | 0.038 | 6948 |
| dlslime | 1 | 8,192 | 64 | 8 | 0.038 | 13758 |
| dlslime | 1 | 16,384 | 64 | 8 | 0.04 | 26416 |
| dlslime | 1 | 32,768 | 64 | 8 | 0.057 | 36997 |
| dlslime | 1 | 65,536 | 64 | 8 | 0.098 | 42618 |
| dlslime | 1 | 131,072 | 64 | 8 | 0.184 | 45602 |
| dlslime | 1 | 262,144 | 64 | 8 | 0.356 | 47148 |
| dlslime | 1 | 524,288 | 64 | 8 | 0.699 | 47975 |
| dlslime | 1 | 1,048,576 | 64 | 8 | 1.384 | 48478 |
| dlslime | 1 | 2,097,152 | 64 | 8 | 2.755 | 48709 |
| dlslime | 1 | 4,194,304 | 64 | 8 | 5.498 | 48823 |
| dlslime | 1 | 8,388,608 | 64 | 8 | 10.982 | 48884 |
| dlslime | 1 | 16,777,216 | 64 | 8 | 21.954 | 48908 |
| dlslime | 1 | 33,554,432 | 64 | 8 | 43.895 | 48923 |
| dlslime | 1 | 67,108,864 | 64 | 8 | 87.766 | 48936 |
| dlslime | 1 | 134,217,728 | 64 | 8 | 175.517 | 48940 |

GDRDMA Aggregated Bandwidth

#BS=1, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 8 | 2,048 | 1 | 1 | 0.051 | 157 |
| dlslime | 8 | 4,096 | 1 | 1 | 0.042 | 768 |
| dlslime | 8 | 8,192 | 1 | 1 | 0.04 | 1576 |
| dlslime | 8 | 16,384 | 1 | 1 | 0.054 | 2929 |
| dlslime | 8 | 32,768 | 1 | 1 | 0.051 | 5713 |
| dlslime | 8 | 65,536 | 1 | 1 | 0.052 | 11547 |
| dlslime | 8 | 131,072 | 1 | 1 | 0.055 | 22039 |
| dlslime | 8 | 262,144 | 1 | 1 | 0.058 | 42313 |
| dlslime | 8 | 524,288 | 1 | 1 | 0.064 | 74753 |
| dlslime | 8 | 1,048,576 | 1 | 1 | 0.072 | 127489 |
| dlslime | 8 | 2,097,152 | 1 | 1 | 0.101 | 184823 |
| dlslime | 8 | 4,194,304 | 1 | 1 | 0.149 | 246861 |
| dlslime | 8 | 8,388,608 | 1 | 1 | 0.237 | 299510 |
| dlslime | 8 | 16,777,216 | 1 | 1 | 0.403 | 340252 |
| dlslime | 8 | 33,554,432 | 1 | 1 | 0.743 | 364918 |
| dlslime | 8 | 67,108,864 | 1 | 1 | 1.423 | 378620 |
| dlslime | 8 | 134,217,728 | 1 | 1 | 2.79 | 384630 |

#BS=64, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 8 | 2,048 | 64 | 1 | 0.091 | 11690 |
| dlslime | 8 | 4,096 | 64 | 1 | 0.081 | 24403 |
| dlslime | 8 | 8,192 | 64 | 1 | 0.091 | 45926 |
| dlslime | 8 | 16,384 | 64 | 1 | 0.098 | 84092 |
| dlslime | 8 | 32,768 | 64 | 1 | 0.117 | 138696 |
| dlslime | 8 | 65,536 | 64 | 1 | 0.16 | 206866 |
| dlslime | 8 | 131,072 | 64 | 1 | 0.241 | 273976 |
| dlslime | 8 | 262,144 | 64 | 1 | 0.415 | 320008 |
| dlslime | 8 | 524,288 | 64 | 1 | 0.757 | 353714 |
| dlslime | 8 | 1,048,576 | 64 | 1 | 1.439 | 372217 |
| dlslime | 8 | 2,097,152 | 64 | 1 | 2.819 | 381397 |
| dlslime | 8 | 4,194,304 | 64 | 1 | 5.555 | 386489 |
| dlslime | 8 | 8,388,608 | 64 | 1 | 11.044 | 388927 |
| dlslime | 8 | 16,777,216 | 64 | 1 | 22.009 | 390278 |
| dlslime | 8 | 33,554,432 | 64 | 1 | 43.951 | 390978 |
| dlslime | 8 | 67,108,864 | 64 | 1 | 87.804 | 391370 |
| dlslime | 8 | 134,217,728 | 64 | 1 | 175.508 | 391588 |

#BS=64, #Concurrency=8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 8 | 2,048 | 64 | 8 | 0.036 | 28494 |
| dlslime | 8 | 4,096 | 64 | 8 | 0.038 | 50860 |
| dlslime | 8 | 8,192 | 64 | 8 | 0.048 | 104545 |
| dlslime | 8 | 16,384 | 64 | 8 | 0.041 | 207051 |
| dlslime | 8 | 32,768 | 64 | 8 | 0.056 | 297354 |
| dlslime | 8 | 65,536 | 64 | 8 | 0.099 | 337571 |
| dlslime | 8 | 131,072 | 64 | 8 | 0.185 | 363003 |
| dlslime | 8 | 262,144 | 64 | 8 | 0.356 | 376743 |
| dlslime | 8 | 524,288 | 64 | 8 | 0.701 | 383701 |
| dlslime | 8 | 1,048,576 | 64 | 8 | 1.386 | 387629 |
| dlslime | 8 | 2,097,152 | 64 | 8 | 2.757 | 389493 |
| dlslime | 8 | 4,194,304 | 64 | 8 | 5.5 | 390523 |
| dlslime | 8 | 8,388,608 | 64 | 8 | 10.984 | 391043 |
| dlslime | 8 | 16,777,216 | 64 | 8 | 21.955 | 391291 |
| dlslime | 8 | 33,554,432 | 64 | 8 | 43.891 | 391407 |
| dlslime | 8 | 67,108,864 | 64 | 8 | 87.771 | 391480 |
| dlslime | 8 | 134,217,728 | 64 | 8 | 175.518 | 391530 |
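For the large-message rows, the reported bandwidth appears consistent with message size × batch size × channels ÷ average latency, taking 1 MB as 10^6 bytes. This relation is inferred from the tables, not taken from the benchmark script, so treat the sketch below as an assumption; the function name `bandwidth_mb_per_s` is hypothetical. Small-message rows can deviate more because the printed latency is rounded to three decimals.

```python
def bandwidth_mb_per_s(message_bytes, batch_size, avg_latency_ms, channels=1):
    """Estimate bandwidth in MB/s (1 MB = 1e6 bytes) from one table row.

    Assumes every channel moves batch_size messages of message_bytes each
    within avg_latency_ms. This is inferred from the reported numbers.
    """
    total_bytes = message_bytes * batch_size * channels
    seconds = avg_latency_ms / 1000.0
    return total_bytes / seconds / 1e6


if __name__ == "__main__":
    # (message bytes, batch size, avg latency ms, channels, reported MB/s)
    rows = [
        (134_217_728, 1, 2.783, 1, 48235),
        (134_217_728, 64, 175.512, 1, 48942),
        (134_217_728, 64, 175.518, 8, 391530),
    ]
    for size, batch, latency, channels, reported in rows:
        estimate = bandwidth_mb_per_s(size, batch, latency, channels)
        print(f"estimated {estimate:.0f} MB/s, reported {reported} MB/s")
```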

GDRDMA P2P Send/Recv

SLIME_QP_NUM=2 python bench/python/dlslime_torch_dist_sendrecv_bench.py --mode send --use-gpu --iterations 100
SLIME_QP_NUM=2 python bench/python/dlslime_torch_dist_sendrecv_bench.py --mode recv --use-gpu --iterations 100
| Message Size (bytes) | Avg Latency | Bandwidth | Device |
|---|---|---|---|
| 1,024 | 0.027 ms | 37.65 MB/s | GPU |
| 2,048 | 0.028 ms | 72.17 MB/s | GPU |
| 4,096 | 0.028 ms | 144.81 MB/s | GPU |
| 8,192 | 0.028 ms | 295.98 MB/s | GPU |
| 16,384 | 0.029 ms | 564.15 MB/s | GPU |
| 32,768 | 0.031 ms | 1069.90 MB/s | GPU |
| 65,536 | 0.031 ms | 2083.20 MB/s | GPU |
| 131,072 | 0.032 ms | 4038.17 MB/s | GPU |
| 262,144 | 0.036 ms | 7299.42 MB/s | GPU |
| 524,288 | 0.042 ms | 12495.87 MB/s | GPU |
| 1,048,576 | 0.053 ms | 19961.18 MB/s | GPU |
| 2,097,152 | 0.075 ms | 27924.99 MB/s | GPU |
| 4,194,304 | 0.117 ms | 35716.55 MB/s | GPU |
| 8,388,608 | 0.212 ms | 39637.66 MB/s | GPU |
| 16,777,216 | 0.387 ms | 43386.08 MB/s | GPU |
| 33,554,432 | 0.871 ms | 38532.98 MB/s | GPU |
| 67,108,864 | 1.665 ms | 40298.91 MB/s | GPU |
| 134,217,728 | 3.159 ms | 42487.69 MB/s | GPU |
| 268,435,456 | 5.643 ms | 47572.53 MB/s | GPU |
| 536,870,912 | 11.137 ms | 48204.20 MB/s | GPU |

Heterogeneous Interconnection

  • Hardware configs

| Device | NIC Model | Bandwidth | PCIe Version | PCIe Lanes |
|---|---|---|---|---|
| A | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x16 |
| B | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x8 |
| C | Mellanox ConnectX-7 Lx (MT4129) | 200 Gbps | PCIe 5.0 | x16 |
| D | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x16 |

  • Experiment configs

    • Message size = 128 MB
    • RDMA RC Read (single NIC)
    • Under the affinity scenario
    • RDMA with GPU Direct
  • Interconnect bandwidth matrix (MB/s), demonstrating attainment of the theoretical bound:

| Throughput (MB/s) | A | B | C | D |
|---|---|---|---|---|
| A | 48967.45 | 28686.29 | 24524.29 | 27676.57 |
| B | 28915.72 | 28275.85 | 23472.29 | 27234.60 |
| C | 24496.14 | 24496.51 | 24513.57 | 24493.89 |
| D | 29317.66 | 28683.25 | 24515.30 | 27491.33 |
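One simple way to read the matrix is to compare each entry with a per-pair ceiling: the smaller of the NIC line rate and the PCIe link bandwidth on each side. The sketch below encodes that model. It is an assumption for illustration, not the authors' method: the PCIe figure uses the nominal 32 GT/s per lane with 128b/130b encoding and ignores protocol overhead, and the function names are hypothetical. Under this model, pairs among A, B, and C land close to their ceiling, while entries involving device D fall well below it, so factors not captured in the hardware table presumably apply there.

```python
# Assumption: nominal PCIe 5.0 payload rate per lane in MB/s
# (32 GT/s, 128b/130b encoding, protocol overhead ignored).
PCIE5_MB_PER_S_PER_LANE = 32e9 * (128 / 130) / 8 / 1e6

# (NIC line rate in Gbps, PCIe lanes), copied from the hardware table.
DEVICES = {"A": (400, 16), "B": (400, 8), "C": (200, 16), "D": (400, 16)}

# A few measured entries (MB/s), copied from the bandwidth matrix.
MEASURED = {("A", "A"): 48967.45, ("A", "B"): 28686.29, ("C", "C"): 24513.57}


def device_bound(name):
    """Ceiling for one device: min(NIC line rate, PCIe link bandwidth)."""
    gbps, lanes = DEVICES[name]
    nic_mb_per_s = gbps * 1e9 / 8 / 1e6
    pcie_mb_per_s = lanes * PCIE5_MB_PER_S_PER_LANE
    return min(nic_mb_per_s, pcie_mb_per_s)


def pair_bound(a, b):
    """Ceiling for a transfer between two devices: the slower endpoint wins."""
    return min(device_bound(a), device_bound(b))


if __name__ == "__main__":
    for (a, b), measured in MEASURED.items():
        bound = pair_bound(a, b)
        print(f"{a}->{b}: measured {measured:.0f} MB/s, "
              f"modelled bound {bound:.0f} MB/s, ratio {measured / bound:.2f}")
```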

Detailed results: bench
