
DLSlime Transfer Engine


Roadmap | Slack | WeChat Group | Zhihu

Flexible & Efficient Heterogeneous Transfer Toolkit

Getting Started

DLSlime offers a set of peer-to-peer communication interfaces. For instance, consider the task of batched slice assignment from a remote tensor to a local tensor. You can accomplish this using the following APIs.

Assignment Operation.
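To make the batched slice assignment concrete, here is a plain-Python sketch of its semantics (no RDMA involved; the `Assignment` class and field names here are illustrative stand-ins, not the DLSlime API): each entry in the batch copies a slice of the remote buffer into the local buffer.

```python
# Illustrative semantics of a batched slice assignment: each entry copies
# `length` bytes starting at `source_offset` in the remote buffer into
# `target_offset` in the local buffer. Names are hypothetical, not DLSlime's.
from dataclasses import dataclass


@dataclass
class Assignment:
    source_offset: int
    target_offset: int
    length: int


def batched_slice_assign(local: bytearray, remote: bytes,
                         batch: list[Assignment]) -> None:
    for a in batch:
        local[a.target_offset:a.target_offset + a.length] = \
            remote[a.source_offset:a.source_offset + a.length]


remote = bytes(range(16))   # stand-in for the remote tensor's buffer
local = bytearray(16)       # stand-in for the local tensor's buffer
batched_slice_assign(local, remote,
                     [Assignment(0, 8, 4), Assignment(12, 0, 4)])
print(local)
```

In the real engine the copy is performed by the RDMA hardware (e.g. an RC read), but the bookkeeping per batch entry is the same.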

Here are some examples of the DLSlime interfaces.

P2P Communication

RDMA RC Mode

  • RDMA RC Read (Sync / Async mode)
python example/python/p2p_rdma_rc_read.py
  • RDMA RC Read (Coroutine mode)
python example/python/p2p_rdma_rc_read_coroutine.py
  • RDMA RC Write (Sync / Async mode)
python example/python/p2p_rdma_rc_write.py
  • RDMA RC Write with immediate data (Sync / Async mode)
python example/python/p2p_rdma_rc_write_with_imm_data.py
  • RDMA RC Send/Recv
python example/python/p2p_rdma_rc_send_recv.py
python example/python/p2p_rdma_rc_send_recv_gdr.py
  • DLSlime torch backend
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 0
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 1

NVLink Mode

torchrun --nproc_per_node=2 p2p_nvlink.py

NVShmem Mode

# send
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 0 --world-size 2
# recv
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 1 --world-size 2

Huawei Ascend Direct Mode

See: Huawei README

[!Caution] The DLSlime NVShmem transfer engine and Huawei Ascend Direct mode are experimental.

Collective Ops

Intra Node

AllGather
torchrun --nnodes 1 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode intra

Inter Node

AllGather
# Node 0
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode inter
# Node 1
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 1 --nproc-per-node 8 --master-port 6007 example/python/all_gather_ll.py --mode inter
AllGather Gemm Overlapping
# Node 0
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 0 --nproc-per-node 8 --master-port 6007 example/python/all_gather_gemm_overlap.py
# Node 1
torchrun --nnodes 2 --master-addr 10.130.8.143 --node-rank 1 --nproc-per-node 8 --master-port 6007 example/python/all_gather_gemm_overlap.py

[!Note] The intra- and inter-node examples above enable CUDA Graph by default; pass --eager-mode to fall back to eager execution.

Install

pip install

pip install dlslime==0.0.1.post10

[!Note] The pip release of DLSlime is built with the default flags (see Build from source for details).

Build from source

Python

git clone https://github.com/deeplink-org/DLSlime.git
FLAG=<ON|OFF> pip install -v --no-build-isolation -e .

CPP

git clone https://github.com/deeplink-org/DLSlime.git
mkdir -p DLSlime/build && cd DLSlime/build
cmake -DFLAG=<ON|OFF> ..
cmake --build .

Build flags

The FLAG can be one of the following:

| Flag | Description | Platform | Default |
| --- | --- | --- | --- |
| BUILD_RDMA | Build RDMA Transfer Engine | Hetero | ON |
| BUILD_PYTHON | Build Python wrapper | Hetero | ON |
| BUILD_NVLINK | Build NVLink Transfer Engine | GPGPU | OFF |
| BUILD_NVSHMEM | Build NVShmem Transfer Engine | NVIDIA | OFF |
| BUILD_ASCEND_DIRECT | Build Ascend Direct transport | ASCEND | OFF |
| BUILD_TORCH_PLUGIN | Build DLSlime as a torch backend | Hetero | OFF |
| USE_GLOO_BACKEND | Use GLOO RDMA Send/Recv torch backend | Hetero | OFF |
| BUILD_INTRA_OPS | Build intra-node collective ops | GPGPU | OFF |
| BUILD_INTER_OPS | Build inter-node collective ops (NVSHMEM) | NVIDIA | OFF |
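Flags are passed as environment variables to the Python build, so several can be combined in one invocation. A sketch of a from-source build that additionally enables the NVLink engine and the torch plugin (this particular flag combination is illustrative; choose flags per the table above for your platform):

```shell
git clone https://github.com/deeplink-org/DLSlime.git
cd DLSlime
# Enable extra engines on top of the defaults (BUILD_RDMA=ON, BUILD_PYTHON=ON)
BUILD_NVLINK=ON BUILD_TORCH_PLUGIN=ON pip install -v --no-build-isolation -e .
```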

[!Note] Please enable USE_MECA when building DLSlime as a torch backend on the Metax platform.

Benchmark

GDRDMA P2P Read/Write

  • Platform: NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe 5.0 x16 with x16 PCIe extension option; RoCE v2.

#BS=1, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1

| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
| --- | --- | --- | --- | --- | --- | --- |
| dlslime | 1 | 2,048 | 1 | 1 | 0.039 | 52 |
| dlslime | 1 | 4,096 | 1 | 1 | 0.037 | 111 |
| dlslime | 1 | 8,192 | 1 | 1 | 0.038 | 216 |
| dlslime | 1 | 16,384 | 1 | 1 | 0.037 | 442 |
| dlslime | 1 | 32,768 | 1 | 1 | 0.039 | 836 |
| dlslime | 1 | 65,536 | 1 | 1 | 0.039 | 1689 |
| dlslime | 1 | 131,072 | 1 | 1 | 0.041 | 3195 |
| dlslime | 1 | 262,144 | 1 | 1 | 0.043 | 6059 |
| dlslime | 1 | 524,288 | 1 | 1 | 0.049 | 10689 |
| dlslime | 1 | 1,048,576 | 1 | 1 | 0.062 | 17012 |
| dlslime | 1 | 2,097,152 | 1 | 1 | 0.083 | 25154 |
| dlslime | 1 | 4,194,304 | 1 | 1 | 0.127 | 33112 |
| dlslime | 1 | 8,388,608 | 1 | 1 | 0.211 | 39797 |
| dlslime | 1 | 16,777,216 | 1 | 1 | 0.382 | 43893 |
| dlslime | 1 | 33,554,432 | 1 | 1 | 0.726 | 46244 |
| dlslime | 1 | 67,108,864 | 1 | 1 | 1.412 | 47518 |
| dlslime | 1 | 134,217,728 | 1 | 1 | 2.783 | 48235 |

#BS=64, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1

| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
| --- | --- | --- | --- | --- | --- | --- |
| dlslime | 1 | 2,048 | 64 | 1 | 0.084 | 1562 |
| dlslime | 1 | 4,096 | 64 | 1 | 0.082 | 3213 |
| dlslime | 1 | 8,192 | 64 | 1 | 0.086 | 6095 |
| dlslime | 1 | 16,384 | 64 | 1 | 0.093 | 11249 |
| dlslime | 1 | 32,768 | 64 | 1 | 0.115 | 18193 |
| dlslime | 1 | 65,536 | 64 | 1 | 0.158 | 26542 |
| dlslime | 1 | 131,072 | 64 | 1 | 0.243 | 34498 |
| dlslime | 1 | 262,144 | 64 | 1 | 0.414 | 40549 |
| dlslime | 1 | 524,288 | 64 | 1 | 0.758 | 44248 |
| dlslime | 1 | 1,048,576 | 64 | 1 | 1.443 | 46510 |
| dlslime | 1 | 2,097,152 | 64 | 1 | 2.809 | 47782 |
| dlslime | 1 | 4,194,304 | 64 | 1 | 5.555 | 48327 |
| dlslime | 1 | 8,388,608 | 64 | 1 | 11.041 | 48624 |
| dlslime | 1 | 16,777,216 | 64 | 1 | 22.003 | 48798 |
| dlslime | 1 | 33,554,432 | 64 | 1 | 43.941 | 48872 |
| dlslime | 1 | 67,108,864 | 64 | 1 | 87.809 | 48912 |
| dlslime | 1 | 134,217,728 | 64 | 1 | 175.512 | 48942 |
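As a sanity check, the bandwidth column is consistent with batch_size × message_size ÷ latency, with 1 MB taken as 10^6 bytes (this formula is an inference from the numbers, not stated by the benchmark). For the largest message in the table above:

```python
# Reproduce the reported bandwidth for the last row above:
# batch_size=64, message_size=134,217,728 B, avg latency=175.512 ms.
batch_size = 64
message_size = 134_217_728   # bytes
latency_s = 175.512e-3       # seconds

bandwidth_mb_s = batch_size * message_size / latency_s / 1e6
print(round(bandwidth_mb_s))  # 48942, matching the table
```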

#BS=64, #Concurrency=8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8

| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
| --- | --- | --- | --- | --- | --- | --- |
| dlslime | 1 | 2,048 | 64 | 8 | 0.037 | 3519 |
| dlslime | 1 | 4,096 | 64 | 8 | 0.038 | 6948 |
| dlslime | 1 | 8,192 | 64 | 8 | 0.038 | 13758 |
| dlslime | 1 | 16,384 | 64 | 8 | 0.04 | 26416 |
| dlslime | 1 | 32,768 | 64 | 8 | 0.057 | 36997 |
| dlslime | 1 | 65,536 | 64 | 8 | 0.098 | 42618 |
| dlslime | 1 | 131,072 | 64 | 8 | 0.184 | 45602 |
| dlslime | 1 | 262,144 | 64 | 8 | 0.356 | 47148 |
| dlslime | 1 | 524,288 | 64 | 8 | 0.699 | 47975 |
| dlslime | 1 | 1,048,576 | 64 | 8 | 1.384 | 48478 |
| dlslime | 1 | 2,097,152 | 64 | 8 | 2.755 | 48709 |
| dlslime | 1 | 4,194,304 | 64 | 8 | 5.498 | 48823 |
| dlslime | 1 | 8,388,608 | 64 | 8 | 10.982 | 48884 |
| dlslime | 1 | 16,777,216 | 64 | 8 | 21.954 | 48908 |
| dlslime | 1 | 33,554,432 | 64 | 8 | 43.895 | 48923 |
| dlslime | 1 | 67,108,864 | 64 | 8 | 87.766 | 48936 |
| dlslime | 1 | 134,217,728 | 64 | 8 | 175.517 | 48940 |

GDRDMA Aggregated Bandwidth

#BS=1, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1

| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
| --- | --- | --- | --- | --- | --- | --- |
| dlslime | 8 | 2,048 | 1 | 1 | 0.051 | 157 |
| dlslime | 8 | 4,096 | 1 | 1 | 0.042 | 768 |
| dlslime | 8 | 8,192 | 1 | 1 | 0.04 | 1576 |
| dlslime | 8 | 16,384 | 1 | 1 | 0.054 | 2929 |
| dlslime | 8 | 32,768 | 1 | 1 | 0.051 | 5713 |
| dlslime | 8 | 65,536 | 1 | 1 | 0.052 | 11547 |
| dlslime | 8 | 131,072 | 1 | 1 | 0.055 | 22039 |
| dlslime | 8 | 262,144 | 1 | 1 | 0.058 | 42313 |
| dlslime | 8 | 524,288 | 1 | 1 | 0.064 | 74753 |
| dlslime | 8 | 1,048,576 | 1 | 1 | 0.072 | 127489 |
| dlslime | 8 | 2,097,152 | 1 | 1 | 0.101 | 184823 |
| dlslime | 8 | 4,194,304 | 1 | 1 | 0.149 | 246861 |
| dlslime | 8 | 8,388,608 | 1 | 1 | 0.237 | 299510 |
| dlslime | 8 | 16,777,216 | 1 | 1 | 0.403 | 340252 |
| dlslime | 8 | 33,554,432 | 1 | 1 | 0.743 | 364918 |
| dlslime | 8 | 67,108,864 | 1 | 1 | 1.423 | 378620 |
| dlslime | 8 | 134,217,728 | 1 | 1 | 2.79 | 384630 |

#BS=64, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1

| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
| --- | --- | --- | --- | --- | --- | --- |
| dlslime | 8 | 2,048 | 64 | 1 | 0.091 | 11690 |
| dlslime | 8 | 4,096 | 64 | 1 | 0.081 | 24403 |
| dlslime | 8 | 8,192 | 64 | 1 | 0.091 | 45926 |
| dlslime | 8 | 16,384 | 64 | 1 | 0.098 | 84092 |
| dlslime | 8 | 32,768 | 64 | 1 | 0.117 | 138696 |
| dlslime | 8 | 65,536 | 64 | 1 | 0.16 | 206866 |
| dlslime | 8 | 131,072 | 64 | 1 | 0.241 | 273976 |
| dlslime | 8 | 262,144 | 64 | 1 | 0.415 | 320008 |
| dlslime | 8 | 524,288 | 64 | 1 | 0.757 | 353714 |
| dlslime | 8 | 1,048,576 | 64 | 1 | 1.439 | 372217 |
| dlslime | 8 | 2,097,152 | 64 | 1 | 2.819 | 381397 |
| dlslime | 8 | 4,194,304 | 64 | 1 | 5.555 | 386489 |
| dlslime | 8 | 8,388,608 | 64 | 1 | 11.044 | 388927 |
| dlslime | 8 | 16,777,216 | 64 | 1 | 22.009 | 390278 |
| dlslime | 8 | 33,554,432 | 64 | 1 | 43.951 | 390978 |
| dlslime | 8 | 67,108,864 | 64 | 1 | 87.804 | 391370 |
| dlslime | 8 | 134,217,728 | 64 | 1 | 175.508 | 391588 |

#BS=64, #Concurrency=8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8

| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency (ms) | Bandwidth (MB/s) |
| --- | --- | --- | --- | --- | --- | --- |
| dlslime | 8 | 2,048 | 64 | 8 | 0.036 | 28494 |
| dlslime | 8 | 4,096 | 64 | 8 | 0.038 | 50860 |
| dlslime | 8 | 8,192 | 64 | 8 | 0.048 | 104545 |
| dlslime | 8 | 16,384 | 64 | 8 | 0.041 | 207051 |
| dlslime | 8 | 32,768 | 64 | 8 | 0.056 | 297354 |
| dlslime | 8 | 65,536 | 64 | 8 | 0.099 | 337571 |
| dlslime | 8 | 131,072 | 64 | 8 | 0.185 | 363003 |
| dlslime | 8 | 262,144 | 64 | 8 | 0.356 | 376743 |
| dlslime | 8 | 524,288 | 64 | 8 | 0.701 | 383701 |
| dlslime | 8 | 1,048,576 | 64 | 8 | 1.386 | 387629 |
| dlslime | 8 | 2,097,152 | 64 | 8 | 2.757 | 389493 |
| dlslime | 8 | 4,194,304 | 64 | 8 | 5.5 | 390523 |
| dlslime | 8 | 8,388,608 | 64 | 8 | 10.984 | 391043 |
| dlslime | 8 | 16,777,216 | 64 | 8 | 21.955 | 391291 |
| dlslime | 8 | 33,554,432 | 64 | 8 | 43.891 | 391407 |
| dlslime | 8 | 67,108,864 | 64 | 8 | 87.771 | 391480 |
| dlslime | 8 | 134,217,728 | 64 | 8 | 175.518 | 391530 |

GDRDMA P2P Send/Recv

SLIME_QP_NUM=2 python bench/python/dlslime_torch_dist_sendrecv_bench.py --mode send --use-gpu --iterations 100
SLIME_QP_NUM=2 python bench/python/dlslime_torch_dist_sendrecv_bench.py --mode recv --use-gpu --iterations 100

| Message Size (bytes) | Avg Latency (ms) | Bandwidth (MB/s) | Device |
| --- | --- | --- | --- |
| 1,024 | 0.027 | 37.65 | GPU |
| 2,048 | 0.028 | 72.17 | GPU |
| 4,096 | 0.028 | 144.81 | GPU |
| 8,192 | 0.028 | 295.98 | GPU |
| 16,384 | 0.029 | 564.15 | GPU |
| 32,768 | 0.031 | 1069.90 | GPU |
| 65,536 | 0.031 | 2083.20 | GPU |
| 131,072 | 0.032 | 4038.17 | GPU |
| 262,144 | 0.036 | 7299.42 | GPU |
| 524,288 | 0.042 | 12495.87 | GPU |
| 1,048,576 | 0.053 | 19961.18 | GPU |
| 2,097,152 | 0.075 | 27924.99 | GPU |
| 4,194,304 | 0.117 | 35716.55 | GPU |
| 8,388,608 | 0.212 | 39637.66 | GPU |
| 16,777,216 | 0.387 | 43386.08 | GPU |
| 33,554,432 | 0.871 | 38532.98 | GPU |
| 67,108,864 | 1.665 | 40298.91 | GPU |
| 134,217,728 | 3.159 | 42487.69 | GPU |
| 268,435,456 | 5.643 | 47572.53 | GPU |
| 536,870,912 | 11.137 | 48204.20 | GPU |

Heterogeneous Interconnection

  • Hardware configuration

| Device | NIC Model | Bandwidth | PCIe Version | PCIe Lanes |
| --- | --- | --- | --- | --- |
| A | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x16 |
| B | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x8 |
| C | Mellanox ConnectX-7 Lx (MT4129) | 200 Gbps | PCIe 5.0 | x16 |
| D | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x16 |
  • Experiment configuration

    • Message size: 128 MB
    • RDMA RC Read (single NIC)
    • Under an affinity scenario
    • RDMA with GPU Direct
  • Interconnect bandwidth matrix (MB/s); the measurements approach the theoretical bound:

| Throughput (MB/s) | A | B | C | D |
| --- | --- | --- | --- | --- |
| A | 48967.45 | 28686.29 | 24524.29 | 27676.57 |
| B | 28915.72 | 28275.85 | 23472.29 | 27234.60 |
| C | 24496.14 | 24496.51 | 24513.57 | 24493.89 |
| D | 29317.66 | 28683.25 | 24515.30 | 27491.33 |
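The claim about reaching the theoretical bound can be checked directly: a 400 Gbps link tops out at 50,000 MB/s and a 200 Gbps link at 25,000 MB/s (with 1 MB = 10^6 bytes), so the diagonal peaks sit at roughly 98% of line rate. A quick computation over two entries of the matrix above:

```python
# Fraction of theoretical NIC line rate attained, from the matrix above.
def line_rate_mb_s(gbps: float) -> float:
    # Gbps -> MB/s, with 1 MB = 1e6 bytes
    return gbps * 1e9 / 8 / 1e6

a_to_a = 48967.45 / line_rate_mb_s(400)   # device A: 400 Gbps NIC
c_to_c = 24513.57 / line_rate_mb_s(200)   # device C: 200 Gbps NIC
print(f"A->A: {a_to_a:.1%}, C->C: {c_to_c:.1%}")
```

Device B's rows plateau near 28,000-29,000 MB/s despite its 400 Gbps NIC, which is consistent with its PCIe 5.0 x8 slot being the bottleneck (roughly 32,000 MB/s raw, less after protocol overhead).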

Detailed results: bench
