Skip to main content

DLSlime Transfer Engine

Project description

Slack | WeChat Group

Flexible & Efficient Heterogeneous Transfer Toolkit

Getting Started

DLSlime offers a set of peer-to-peer communication interfaces. For instance, consider the task of batched slice assignment from a remote tensor to a local tensor. You can accomplish this using the following APIs.

Assignment Operation.

Here are some examples of DLSlime interface.

RDMA RC Mode

  • RDMA RC Read (Sync / Async mode)
python example/python/p2p_rdma_rc_read.py
  • RDMA RC Read (Coroutine mode)
python example/python/p2p_rdma_rc_read_coroutine.py
  • RDMA RC Write (Sync / Async mode)
python example/python/p2p_rdma_rc_write.py
  • RDMA RC Write with immediate data (Sync / Async mode)
python example/python/p2p_rdma_rc_write_with_imm_data.py
  • RDMA RC Send/Recv
python example/python/p2p_rdma_rc_send_recv.py
python example/python/p2p_rdma_rc_send_recv_gdr.py
  • DLSlime torch backend
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 0
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 1

NVLink Mode

# initiator
python example/python/p2p_nvlink.py --initiator-url "127.0.0.1:6006" --target-url "127.0.0.1:6007" --role initiator
# target
python example/python/p2p_nvlink.py --initiator-url "127.0.0.1:6006" --target-url "127.0.0.1:6007" --role target

NVShmem Mode

# send
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 0 --world-size 2
# recv
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 1 --world-size 2

[!Caution] DLSlime NVShmem transfer engine is in the experimental stage.

Install

pip install

pip install dlslime==0.0.1.post8

[!Note] The DLSlime pip version is built with default FLAGS (see Build from source for details).

Build from source

Python

git clone https://github.com/deeplink-org/DLSlime.git
FLAG=<ON|OFF> pip install -v --no-build-isolation -e .

CPP

git clone https://github.com/deeplink-org/DLSlime.git
mkdir -p DLSlime/build && cmake -DFLAG=<ON|OFF> ..

Build flags

The FLAG can be

Flag Description Platform default
BUILD_RDMA Build RDMA Transfer Engine Hetero ON
BUILD_PYTHON Build Python wrapper Hetero ON
BUILD_NVLINK Build NVLINK Transfer Engine GPGPU OFF
BUILD_NVSHMEM Build NVShmem Transfer Engine NVIDIA OFF
BUILD_TORCH_PLUGIN Build DLSlime as a torch backend Hetero OFF
USE_GLOO_BACKEND Use GLOO RDMA Send/Recv torch backend Hetero OFF

[!Note] Please enable USE_MECA when using DLSlime as a torch backend in Metax platform.

Benchmark

GDRDMA P2P Read/Write

  • Platform: NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe 5.0 x16 with x16 PCIe extension option; RoCE v2.

#BS=1, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
Transfer Engine #Channels Message Size (bytes) Batch Size Num Concurrency Avg Latency(ms) Bandwidth(MB/s)
dlslime 1 2,048 1 1 0.039 52
dlslime 1 4,096 1 1 0.037 111
dlslime 1 8,192 1 1 0.038 216
dlslime 1 16,384 1 1 0.037 442
dlslime 1 32,768 1 1 0.039 836
dlslime 1 65,536 1 1 0.039 1689
dlslime 1 131,072 1 1 0.041 3195
dlslime 1 262,144 1 1 0.043 6059
dlslime 1 524,288 1 1 0.049 10689
dlslime 1 1,048,576 1 1 0.062 17012
dlslime 1 2,097,152 1 1 0.083 25154
dlslime 1 4,194,304 1 1 0.127 33112
dlslime 1 8,388,608 1 1 0.211 39797
dlslime 1 16,777,216 1 1 0.382 43893
dlslime 1 33,554,432 1 1 0.726 46244
dlslime 1 67,108,864 1 1 1.412 47518
dlslime 1 134,217,728 1 1 2.783 48235

#BS=64, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
Transfer Engine #Channels Message Size (bytes) Batch Size Num Concurrency Avg Latency(ms) Bandwidth(MB/s)
dlslime 1 2,048 64 1 0.084 1562
dlslime 1 4,096 64 1 0.082 3213
dlslime 1 8,192 64 1 0.086 6095
dlslime 1 16,384 64 1 0.093 11249
dlslime 1 32,768 64 1 0.115 18193
dlslime 1 65,536 64 1 0.158 26542
dlslime 1 131,072 64 1 0.243 34498
dlslime 1 262,144 64 1 0.414 40549
dlslime 1 524,288 64 1 0.758 44248
dlslime 1 1,048,576 64 1 1.443 46510
dlslime 1 2,097,152 64 1 2.809 47782
dlslime 1 4,194,304 64 1 5.555 48327
dlslime 1 8,388,608 64 1 11.041 48624
dlslime 1 16,777,216 64 1 22.003 48798
dlslime 1 33,554,432 64 1 43.941 48872
dlslime 1 67,108,864 64 1 87.809 48912
dlslime 1 134,217,728 64 1 175.512 48942

#BS=64, #Concurrency=8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
Transfer Engine #Channels Message Size (bytes) Batch Size Num Concurrency Avg Latency(ms) Bandwidth(MB/s)
dlslime 1 2,048 64 8 0.037 3519
dlslime 1 4,096 64 8 0.038 6948
dlslime 1 8,192 64 8 0.038 13758
dlslime 1 16,384 64 8 0.04 26416
dlslime 1 32,768 64 8 0.057 36997
dlslime 1 65,536 64 8 0.098 42618
dlslime 1 131,072 64 8 0.184 45602
dlslime 1 262,144 64 8 0.356 47148
dlslime 1 524,288 64 8 0.699 47975
dlslime 1 1,048,576 64 8 1.384 48478
dlslime 1 2,097,152 64 8 2.755 48709
dlslime 1 4,194,304 64 8 5.498 48823
dlslime 1 8,388,608 64 8 10.982 48884
dlslime 1 16,777,216 64 8 21.954 48908
dlslime 1 33,554,432 64 8 43.895 48923
dlslime 1 67,108,864 64 8 87.766 48936
dlslime 1 134,217,728 64 8 175.517 48940

GDRDMA Aggregated Bandwidth

#BS=1, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
Transfer Engine #Channels Message Size (bytes) Batch Size Num Concurrency Avg Latency(ms) Bandwidth(MB/s)
dlslime 8 2,048 1 1 0.051 157
dlslime 8 4,096 1 1 0.042 768
dlslime 8 8,192 1 1 0.04 1576
dlslime 8 16,384 1 1 0.054 2929
dlslime 8 32,768 1 1 0.051 5713
dlslime 8 65,536 1 1 0.052 11547
dlslime 8 131,072 1 1 0.055 22039
dlslime 8 262,144 1 1 0.058 42313
dlslime 8 524,288 1 1 0.064 74753
dlslime 8 1,048,576 1 1 0.072 127489
dlslime 8 2,097,152 1 1 0.101 184823
dlslime 8 4,194,304 1 1 0.149 246861
dlslime 8 8,388,608 1 1 0.237 299510
dlslime 8 16,777,216 1 1 0.403 340252
dlslime 8 33,554,432 1 1 0.743 364918
dlslime 8 67,108,864 1 1 1.423 378620
dlslime 8 134,217,728 1 1 2.79 384630

#BS=64, #Concurrency=1

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
Transfer Engine #Channels Message Size (bytes) Batch Size Num Concurrency Avg Latency(ms) Bandwidth(MB/s)
dlslime 8 2,048 64 1 0.091 11690
dlslime 8 4,096 64 1 0.081 24403
dlslime 8 8,192 64 1 0.091 45926
dlslime 8 16,384 64 1 0.098 84092
dlslime 8 32,768 64 1 0.117 138696
dlslime 8 65,536 64 1 0.16 206866
dlslime 8 131,072 64 1 0.241 273976
dlslime 8 262,144 64 1 0.415 320008
dlslime 8 524,288 64 1 0.757 353714
dlslime 8 1,048,576 64 1 1.439 372217
dlslime 8 2,097,152 64 1 2.819 381397
dlslime 8 4,194,304 64 1 5.555 386489
dlslime 8 8,388,608 64 1 11.044 388927
dlslime 8 16,777,216 64 1 22.009 390278
dlslime 8 33,554,432 64 1 43.951 390978
dlslime 8 67,108,864 64 1 87.804 391370
dlslime 8 134,217,728 64 1 175.508 391588

#BS=64, #Concurrency=8

torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
Transfer Engine #Channels Message Size (bytes) Batch Size Num Concurrency Avg Latency(ms) Bandwidth(MB/s)
dlslime 8 2,048 64 8 0.036 28494
dlslime 8 4,096 64 8 0.038 50860
dlslime 8 8,192 64 8 0.048 104545
dlslime 8 16,384 64 8 0.041 207051
dlslime 8 32,768 64 8 0.056 297354
dlslime 8 65,536 64 8 0.099 337571
dlslime 8 131,072 64 8 0.185 363003
dlslime 8 262,144 64 8 0.356 376743
dlslime 8 524,288 64 8 0.701 383701
dlslime 8 1,048,576 64 8 1.386 387629
dlslime 8 2,097,152 64 8 2.757 389493
dlslime 8 4,194,304 64 8 5.5 390523
dlslime 8 8,388,608 64 8 10.984 391043
dlslime 8 16,777,216 64 8 21.955 391291
dlslime 8 33,554,432 64 8 43.891 391407
dlslime 8 67,108,864 64 8 87.771 391480
dlslime 8 134,217,728 64 8 175.518 391530

Heterogeneous Interconnection​

  • hardware configs
Device NIC Model Bandwidth PCIe Version PCIe Lanes
A Mellanox ConnectX-7 Lx (MT4129) 400 Gbps PCIe 5.0 x16
B Mellanox ConnectX-7 Lx (MT4129) 400 Gbps PCIe 5.0 x8
C Mellanox ConnectX-7 Lx (MT4129) 200 Gbps PCIe 5.0 x16
D Mellanox ConnectX-7 Lx (MT4129) 400 Gbps PCIe 5.0 x16
  • experiments configs

    • Message Size = 128 MB
    • RDMA RC Read(single NIC)
    • Under affinity scenario
    • RDMA with GPU Direct
  • Interconnect bandwidth matrix:(MB/s, demonstrates attainment of the theoretical bound).

Throughput (MB/s) A B C D
A 48967.45 28686.29 24524.29 27676.57
B 28915.72 28275.85 23472.29 27234.60
C 24496.14 24496.51 24513.57 24493.89
D 29317.66 28683.25 24515.30 27491.33

detailed results: bench

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

dlslime-0.0.1.post10-cp313-cp313-manylinux2014_x86_64.whl (569.5 kB view details)

Uploaded CPython 3.13

dlslime-0.0.1.post10-cp312-cp312-manylinux2014_x86_64.whl (343.9 kB view details)

Uploaded CPython 3.12

dlslime-0.0.1.post10-cp311-cp311-manylinux2014_x86_64.whl (344.8 kB view details)

Uploaded CPython 3.11

dlslime-0.0.1.post10-cp310-cp310-manylinux2014_x86_64.whl (342.9 kB view details)

Uploaded CPython 3.10

dlslime-0.0.1.post10-cp39-cp39-manylinux2014_x86_64.whl (344.1 kB view details)

Uploaded CPython 3.9

dlslime-0.0.1.post10-cp38-cp38-manylinux2014_x86_64.whl (343.4 kB view details)

Uploaded CPython 3.8

File details

Details for the file dlslime-0.0.1.post10-cp313-cp313-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dlslime-0.0.1.post10-cp313-cp313-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 02946255a493b3444aface908afa97427d0d1579dcbe3461f8d8cf47effdc3e6
MD5 74e6d39037ef46948859524f1f938dcf
BLAKE2b-256 90d095edf1ffde98417ce1c4bff1785826265a4494395df9361ec01f6973533b

See more details on using hashes here.

File details

Details for the file dlslime-0.0.1.post10-cp312-cp312-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dlslime-0.0.1.post10-cp312-cp312-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1e950cb0286e1d73e5d6309f714cbb98d1aa8e8a24f391d4564473bf8dc2e297
MD5 aa7dcd9703b26b27543517078458f098
BLAKE2b-256 ec865721e096eb80ded54cdd3b6a107dcf232206fa17273e11f7eb748e0e611e

See more details on using hashes here.

File details

Details for the file dlslime-0.0.1.post10-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dlslime-0.0.1.post10-cp311-cp311-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 5516e6b75b509a7d767403b9fa7c59e7f21a3f1aa04e20dbd100205104b33634
MD5 ed8982c8e24752eda5e008e256d8d649
BLAKE2b-256 8cf85a06ef7346ae4004f8bb99b57df5af0e2039f42c5c914819de9f8c168292

See more details on using hashes here.

File details

Details for the file dlslime-0.0.1.post10-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dlslime-0.0.1.post10-cp310-cp310-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 ec669cdbc29cbf21a9302f1def46fe91be6bd99b1650c1931a17ba8d5701b342
MD5 a9fd1056ac8cdc42bd5a2bab8d106751
BLAKE2b-256 0be6d3a1a2892ea80515f99f81a8ab200a8b5226c58d7b1f6ce8bdf879ce756d

See more details on using hashes here.

File details

Details for the file dlslime-0.0.1.post10-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dlslime-0.0.1.post10-cp39-cp39-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 79c71adee8cb78baac7dd2055db4ce8df32c629010637ba43010c566abdb0aac
MD5 97a07759df3daf39fa4ed7dbbf232eea
BLAKE2b-256 9ec521ff194cb48824c29f35a59d3bd32d9a5f46086e70ae076368e7d97737bc

See more details on using hashes here.

File details

Details for the file dlslime-0.0.1.post10-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for dlslime-0.0.1.post10-cp38-cp38-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 4069fe24ba2a384d4ce3e578fce90c7f0237af5518857ab4435973f4222c5409
MD5 d5f1b527f36b469ad171460713fb7d8f
BLAKE2b-256 88454b96b15422d986b8034021b45d9920e9be7948fd964a67e56171196137e8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page