DLSlime Transfer Engine
Flexible & Efficient Heterogeneous Transfer Toolkit
Getting Started
DLSlime offers a set of peer-to-peer communication interfaces. For instance, consider the task of batched slice assignment from a remote tensor to a local tensor. You can accomplish this using the following APIs.
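To make the task concrete, here is a minimal local sketch of what "batched slice assignment" means. It is not the DLSlime API: it makes no network or DLSlime calls, and the `Assignment` tuple layout and the `batched_slice_assign` helper are hypothetical names chosen only for illustration. A plain `bytes` object stands in for the remote tensor's memory and a `bytearray` for the local tensor's memory.

```python
# Local emulation only: no network transfer and no DLSlime calls are involved.
from typing import List, Tuple

# One assignment = (target_offset, source_offset, length), all in bytes.
# This tuple layout is an assumption made for the example.
Assignment = Tuple[int, int, int]


def batched_slice_assign(local: bytearray, remote: bytes, batch: List[Assignment]) -> None:
    """Copy remote[source:source+length] into local[target:target+length] for each entry."""
    for target_offset, source_offset, length in batch:
        if min(target_offset, source_offset, length) < 0:
            raise ValueError("offsets and length must be non-negative")
        if source_offset + length > len(remote) or target_offset + length > len(local):
            raise ValueError("slice out of range")
        local[target_offset:target_offset + length] = remote[source_offset:source_offset + length]


remote_buffer = bytes(range(16))  # stands in for the remote tensor's memory
local_buffer = bytearray(16)      # stands in for the local tensor's memory

# Two slices in one batch: remote[8:12] -> local[0:4], remote[0:4] -> local[8:12].
batched_slice_assign(local_buffer, remote_buffer, [(0, 8, 4), (8, 0, 4)])
print(list(local_buffer))
```

A transfer engine performs the same per-slice copies, except that the source buffer lives on another device or host and the whole batch is submitted in one call.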
Here are some examples of the DLSlime interfaces.
RDMA RC Mode
- RDMA RC Read (Sync / Async mode)
python example/python/p2p_rdma_rc_read.py
- RDMA RC Read (Coroutine mode)
python example/python/p2p_rdma_rc_read_coroutine.py
- RDMA RC Write (Sync / Async mode)
python example/python/p2p_rdma_rc_write.py
- RDMA RC Write with immediate data (Sync / Async mode)
python example/python/p2p_rdma_rc_write_with_imm_data.py
- RDMA RC Send/Recv
python example/python/p2p_rdma_rc_send_recv.py
python example/python/p2p_rdma_rc_send_recv_gdr.py
- DLSlime torch backend
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 0
python example/python/p2p_rdma_rc_send_recv_torch.py --rank 1
NVLink Mode
# initiator
python example/python/p2p_nvlink.py --initiator-url "127.0.0.1:6006" --target-url "127.0.0.1:6007" --role initiator
# target
python example/python/p2p_nvlink.py --initiator-url "127.0.0.1:6006" --target-url "127.0.0.1:6007" --role target
NVShmem Mode
# send
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 0 --world-size 2
# recv
python example/python/p2p_nvshmem_ibgda_sendrecv.py --rank 1 --world-size 2
[!Caution] DLSlime NVShmem transfer engine is in the experimental stage.
Install
pip install
pip install dlslime==0.0.1.post8
[!Note] The DLSlime pip version is built with default FLAGS (see Build from source for details).
Build from source
Python
git clone https://github.com/deeplink-org/DLSlime.git
cd DLSlime
FLAG=<ON|OFF> pip install -v --no-build-isolation -e .
CPP
git clone https://github.com/deeplink-org/DLSlime.git
mkdir -p DLSlime/build && cd DLSlime/build && cmake -DFLAG=<ON|OFF> ..
Build flags
FLAG can be any of the following:
| Flag | Description | Platform | Default |
|---|---|---|---|
| BUILD_RDMA | Build RDMA Transfer Engine | Hetero | ON |
| BUILD_PYTHON | Build Python wrapper | Hetero | ON |
| BUILD_NVLINK | Build NVLINK Transfer Engine | GPGPU | OFF |
| BUILD_NVSHMEM | Build NVShmem Transfer Engine | NVIDIA | OFF |
| BUILD_TORCH_PLUGIN | Build DLSlime as a torch backend | Hetero | OFF |
| USE_GLOO_BACKEND | Use GLOO RDMA Send/Recv torch backend | Hetero | OFF |
[!Note] Please enable USE_MECA when using DLSlime as a torch backend on the Metax platform.
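As an illustration of how the `FLAG=<ON|OFF>` placeholder expands, the sketch below renders a concrete command line from the defaults listed in the build-flag table. The helper names `cmake_args` and `pip_env_prefix` are hypothetical and not part of DLSlime; USE_MECA is left out because the table gives no default for it.

```python
# Defaults copied from the build-flag table; helper names are illustrative only.
DEFAULT_FLAGS = {
    "BUILD_RDMA": "ON",
    "BUILD_PYTHON": "ON",
    "BUILD_NVLINK": "OFF",
    "BUILD_NVSHMEM": "OFF",
    "BUILD_TORCH_PLUGIN": "OFF",
    "USE_GLOO_BACKEND": "OFF",
}


def _validate(overrides):
    """Reject flags that are not in the table and values other than ON/OFF."""
    unknown = set(overrides) - set(DEFAULT_FLAGS)
    if unknown:
        raise KeyError(f"unknown flags: {sorted(unknown)}")
    bad = {k: v for k, v in overrides.items() if v not in ("ON", "OFF")}
    if bad:
        raise ValueError(f"flag values must be ON or OFF: {bad}")


def cmake_args(overrides):
    """Return -DNAME=VALUE arguments: table defaults with overrides applied."""
    _validate(overrides)
    merged = {**DEFAULT_FLAGS, **overrides}
    return [f"-D{name}={value}" for name, value in merged.items()]


def pip_env_prefix(overrides):
    """Return the NAME=VALUE prefix for the pip-based build (overrides only)."""
    _validate(overrides)
    return " ".join(f"{name}={value}" for name, value in overrides.items())


wanted = {"BUILD_NVLINK": "ON", "BUILD_TORCH_PLUGIN": "ON"}
print("cmake " + " ".join(cmake_args(wanted)) + " ..")
print(pip_env_prefix(wanted) + " pip install -v --no-build-isolation -e .")
```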
Benchmark
GDRDMA P2P Read/Write
- Platform: NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe 5.0 x16 with x16 PCIe extension option; RoCE v2.
#BS=1, #Concurrency=1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency(ms) | Bandwidth(MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 1 | 2,048 | 1 | 1 | 0.039 | 52 |
| dlslime | 1 | 4,096 | 1 | 1 | 0.037 | 111 |
| dlslime | 1 | 8,192 | 1 | 1 | 0.038 | 216 |
| dlslime | 1 | 16,384 | 1 | 1 | 0.037 | 442 |
| dlslime | 1 | 32,768 | 1 | 1 | 0.039 | 836 |
| dlslime | 1 | 65,536 | 1 | 1 | 0.039 | 1689 |
| dlslime | 1 | 131,072 | 1 | 1 | 0.041 | 3195 |
| dlslime | 1 | 262,144 | 1 | 1 | 0.043 | 6059 |
| dlslime | 1 | 524,288 | 1 | 1 | 0.049 | 10689 |
| dlslime | 1 | 1,048,576 | 1 | 1 | 0.062 | 17012 |
| dlslime | 1 | 2,097,152 | 1 | 1 | 0.083 | 25154 |
| dlslime | 1 | 4,194,304 | 1 | 1 | 0.127 | 33112 |
| dlslime | 1 | 8,388,608 | 1 | 1 | 0.211 | 39797 |
| dlslime | 1 | 16,777,216 | 1 | 1 | 0.382 | 43893 |
| dlslime | 1 | 33,554,432 | 1 | 1 | 0.726 | 46244 |
| dlslime | 1 | 67,108,864 | 1 | 1 | 1.412 | 47518 |
| dlslime | 1 | 134,217,728 | 1 | 1 | 2.783 | 48235 |
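The bandwidth column appears to follow from the other columns as message size × batch size ÷ average latency, with 1 MB taken as 10^6 bytes. This relation is inferred from the reported numbers rather than from the benchmark script, so treat the sketch below as a consistency check under that assumption. The function name is hypothetical.

```python
def bandwidth_mb_per_s(message_bytes, batch_size, latency_ms):
    """Bytes moved per batch divided by average latency, in MB/s (1 MB = 1e6 bytes)."""
    total_bytes = message_bytes * batch_size
    return total_bytes / (latency_ms * 1e-3) / 1e6


# (message_bytes, batch_size, latency_ms, reported_mb_per_s) from the BS=1 table.
rows = [
    (2_048, 1, 0.039, 52),
    (1_048_576, 1, 0.062, 17012),
    (134_217_728, 1, 2.783, 48235),
]

for message_bytes, batch_size, latency_ms, reported in rows:
    derived = bandwidth_mb_per_s(message_bytes, batch_size, latency_ms)
    print(f"{message_bytes:>11} B  derived={derived:9.1f}  reported={reported}")
```

The derived values differ slightly from the reported ones because the latency column is rounded to three decimals.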
#BS=64, #Concurrency=1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency(ms) | Bandwidth(MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 1 | 2,048 | 64 | 1 | 0.084 | 1562 |
| dlslime | 1 | 4,096 | 64 | 1 | 0.082 | 3213 |
| dlslime | 1 | 8,192 | 64 | 1 | 0.086 | 6095 |
| dlslime | 1 | 16,384 | 64 | 1 | 0.093 | 11249 |
| dlslime | 1 | 32,768 | 64 | 1 | 0.115 | 18193 |
| dlslime | 1 | 65,536 | 64 | 1 | 0.158 | 26542 |
| dlslime | 1 | 131,072 | 64 | 1 | 0.243 | 34498 |
| dlslime | 1 | 262,144 | 64 | 1 | 0.414 | 40549 |
| dlslime | 1 | 524,288 | 64 | 1 | 0.758 | 44248 |
| dlslime | 1 | 1,048,576 | 64 | 1 | 1.443 | 46510 |
| dlslime | 1 | 2,097,152 | 64 | 1 | 2.809 | 47782 |
| dlslime | 1 | 4,194,304 | 64 | 1 | 5.555 | 48327 |
| dlslime | 1 | 8,388,608 | 64 | 1 | 11.041 | 48624 |
| dlslime | 1 | 16,777,216 | 64 | 1 | 22.003 | 48798 |
| dlslime | 1 | 33,554,432 | 64 | 1 | 43.941 | 48872 |
| dlslime | 1 | 67,108,864 | 64 | 1 | 87.809 | 48912 |
| dlslime | 1 | 134,217,728 | 64 | 1 | 175.512 | 48942 |
#BS=64, #Concurrency=8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 1 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency(ms) | Bandwidth(MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 1 | 2,048 | 64 | 8 | 0.037 | 3519 |
| dlslime | 1 | 4,096 | 64 | 8 | 0.038 | 6948 |
| dlslime | 1 | 8,192 | 64 | 8 | 0.038 | 13758 |
| dlslime | 1 | 16,384 | 64 | 8 | 0.04 | 26416 |
| dlslime | 1 | 32,768 | 64 | 8 | 0.057 | 36997 |
| dlslime | 1 | 65,536 | 64 | 8 | 0.098 | 42618 |
| dlslime | 1 | 131,072 | 64 | 8 | 0.184 | 45602 |
| dlslime | 1 | 262,144 | 64 | 8 | 0.356 | 47148 |
| dlslime | 1 | 524,288 | 64 | 8 | 0.699 | 47975 |
| dlslime | 1 | 1,048,576 | 64 | 8 | 1.384 | 48478 |
| dlslime | 1 | 2,097,152 | 64 | 8 | 2.755 | 48709 |
| dlslime | 1 | 4,194,304 | 64 | 8 | 5.498 | 48823 |
| dlslime | 1 | 8,388,608 | 64 | 8 | 10.982 | 48884 |
| dlslime | 1 | 16,777,216 | 64 | 8 | 21.954 | 48908 |
| dlslime | 1 | 33,554,432 | 64 | 8 | 43.895 | 48923 |
| dlslime | 1 | 67,108,864 | 64 | 8 | 87.766 | 48936 |
| dlslime | 1 | 134,217,728 | 64 | 8 | 175.517 | 48940 |
GDRDMA Aggregated Bandwidth
#BS=1, #Concurrency=1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 1 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency(ms) | Bandwidth(MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 8 | 2,048 | 1 | 1 | 0.051 | 157 |
| dlslime | 8 | 4,096 | 1 | 1 | 0.042 | 768 |
| dlslime | 8 | 8,192 | 1 | 1 | 0.04 | 1576 |
| dlslime | 8 | 16,384 | 1 | 1 | 0.054 | 2929 |
| dlslime | 8 | 32,768 | 1 | 1 | 0.051 | 5713 |
| dlslime | 8 | 65,536 | 1 | 1 | 0.052 | 11547 |
| dlslime | 8 | 131,072 | 1 | 1 | 0.055 | 22039 |
| dlslime | 8 | 262,144 | 1 | 1 | 0.058 | 42313 |
| dlslime | 8 | 524,288 | 1 | 1 | 0.064 | 74753 |
| dlslime | 8 | 1,048,576 | 1 | 1 | 0.072 | 127489 |
| dlslime | 8 | 2,097,152 | 1 | 1 | 0.101 | 184823 |
| dlslime | 8 | 4,194,304 | 1 | 1 | 0.149 | 246861 |
| dlslime | 8 | 8,388,608 | 1 | 1 | 0.237 | 299510 |
| dlslime | 8 | 16,777,216 | 1 | 1 | 0.403 | 340252 |
| dlslime | 8 | 33,554,432 | 1 | 1 | 0.743 | 364918 |
| dlslime | 8 | 67,108,864 | 1 | 1 | 1.423 | 378620 |
| dlslime | 8 | 134,217,728 | 1 | 1 | 2.79 | 384630 |
#BS=64, #Concurrency=1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 1
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency(ms) | Bandwidth(MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 8 | 2,048 | 64 | 1 | 0.091 | 11690 |
| dlslime | 8 | 4,096 | 64 | 1 | 0.081 | 24403 |
| dlslime | 8 | 8,192 | 64 | 1 | 0.091 | 45926 |
| dlslime | 8 | 16,384 | 64 | 1 | 0.098 | 84092 |
| dlslime | 8 | 32,768 | 64 | 1 | 0.117 | 138696 |
| dlslime | 8 | 65,536 | 64 | 1 | 0.16 | 206866 |
| dlslime | 8 | 131,072 | 64 | 1 | 0.241 | 273976 |
| dlslime | 8 | 262,144 | 64 | 1 | 0.415 | 320008 |
| dlslime | 8 | 524,288 | 64 | 1 | 0.757 | 353714 |
| dlslime | 8 | 1,048,576 | 64 | 1 | 1.439 | 372217 |
| dlslime | 8 | 2,097,152 | 64 | 1 | 2.819 | 381397 |
| dlslime | 8 | 4,194,304 | 64 | 1 | 5.555 | 386489 |
| dlslime | 8 | 8,388,608 | 64 | 1 | 11.044 | 388927 |
| dlslime | 8 | 16,777,216 | 64 | 1 | 22.009 | 390278 |
| dlslime | 8 | 33,554,432 | 64 | 1 | 43.951 | 390978 |
| dlslime | 8 | 67,108,864 | 64 | 1 | 87.804 | 391370 |
| dlslime | 8 | 134,217,728 | 64 | 1 | 175.508 | 391588 |
#BS=64, #Concurrency=8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 1 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
torchrun --master-addr 10.130.8.145 --master-port 6006 --nnodes 2 --nproc-per-node 8 --node-rank 0 bench/python/agg_transfer_bench_spmd.py --qp-num 8 --transfer-engine dlslime --batch-size 64 --num-iteration 100 --num-concurrency 8
| Transfer Engine | #Channels | Message Size (bytes) | Batch Size | Num Concurrency | Avg Latency(ms) | Bandwidth(MB/s) |
|---|---|---|---|---|---|---|
| dlslime | 8 | 2,048 | 64 | 8 | 0.036 | 28494 |
| dlslime | 8 | 4,096 | 64 | 8 | 0.038 | 50860 |
| dlslime | 8 | 8,192 | 64 | 8 | 0.048 | 104545 |
| dlslime | 8 | 16,384 | 64 | 8 | 0.041 | 207051 |
| dlslime | 8 | 32,768 | 64 | 8 | 0.056 | 297354 |
| dlslime | 8 | 65,536 | 64 | 8 | 0.099 | 337571 |
| dlslime | 8 | 131,072 | 64 | 8 | 0.185 | 363003 |
| dlslime | 8 | 262,144 | 64 | 8 | 0.356 | 376743 |
| dlslime | 8 | 524,288 | 64 | 8 | 0.701 | 383701 |
| dlslime | 8 | 1,048,576 | 64 | 8 | 1.386 | 387629 |
| dlslime | 8 | 2,097,152 | 64 | 8 | 2.757 | 389493 |
| dlslime | 8 | 4,194,304 | 64 | 8 | 5.5 | 390523 |
| dlslime | 8 | 8,388,608 | 64 | 8 | 10.984 | 391043 |
| dlslime | 8 | 16,777,216 | 64 | 8 | 21.955 | 391291 |
| dlslime | 8 | 33,554,432 | 64 | 8 | 43.891 | 391407 |
| dlslime | 8 | 67,108,864 | 64 | 8 | 87.771 | 391480 |
| dlslime | 8 | 134,217,728 | 64 | 8 | 175.518 | 391530 |
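To read the aggregated numbers against the single-channel ones, the sketch below computes how close the 8-channel bandwidth comes to eight times the 1-channel bandwidth, using rows copied from the BS=64, Concurrency=8 tables. The `scaling_efficiency` helper is a hypothetical name introduced for this example.

```python
CHANNELS = 8

# (message_bytes, 1-channel MB/s, 8-channel MB/s), BS=64 and Concurrency=8.
reported = [
    (2_048, 3519, 28494),
    (65_536, 42618, 337571),
    (134_217_728, 48940, 391530),
]


def scaling_efficiency(single_mb_per_s, aggregated_mb_per_s, channels=CHANNELS):
    """Aggregated bandwidth as a fraction of perfect linear scaling."""
    return aggregated_mb_per_s / (single_mb_per_s * channels)


for message_bytes, single, aggregated in reported:
    ratio = scaling_efficiency(single, aggregated)
    print(f"{message_bytes:>11} B  efficiency={ratio:.3f}")
```

Ratios marginally above 1.0 are presumably run-to-run variation between the two benchmark sessions rather than super-linear scaling.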
Heterogeneous Interconnection
- hardware configs
| Device | NIC Model | Bandwidth | PCIe Version | PCIe Lanes |
|---|---|---|---|---|
| A | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x16 |
| B | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x8 |
| C | Mellanox ConnectX-7 Lx (MT4129) | 200 Gbps | PCIe 5.0 | x16 |
| D | Mellanox ConnectX-7 Lx (MT4129) | 400 Gbps | PCIe 5.0 | x16 |
- experiment configs
  - Message Size = 128 MB
  - RDMA RC Read (single NIC)
  - Under affinity scenario
  - RDMA with GPU Direct
- Interconnect bandwidth matrix (MB/s; demonstrates attainment of the theoretical bound):
| Throughput (MB/s) | A | B | C | D |
|---|---|---|---|---|
| A | 48967.45 | 28686.29 | 24524.29 | 27676.57 |
| B | 28915.72 | 28275.85 | 23472.29 | 27234.60 |
| C | 24496.14 | 24496.51 | 24513.57 | 24493.89 |
| D | 29317.66 | 28683.25 | 24515.30 | 27491.33 |
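One way to estimate a theoretical bound for a device pair is to take the minimum of each side's NIC line rate and raw PCIe link rate. The sketch below does this from the hardware table, assuming PCIe 5.0 at 32 GT/s per lane with 128b/130b encoding and ignoring protocol overhead. The function names are hypothetical, and this is an estimate, not the method used to produce the matrix.

```python
PCIE5_TRANSFERS_PER_LANE = 32e9   # PCIe 5.0: 32 GT/s per lane
PCIE5_ENCODING = 128 / 130        # 128b/130b line encoding

# device -> (NIC bandwidth in Gbps, PCIe lanes), copied from the hardware table.
devices = {"A": (400, 16), "B": (400, 8), "C": (200, 16), "D": (400, 16)}


def device_ceiling(name):
    """Lower of NIC line rate and raw PCIe rate, in MB/s (1 MB = 1e6 bytes)."""
    nic_gbps, lanes = devices[name]
    nic = nic_gbps * 1e9 / 8 / 1e6
    pcie = lanes * PCIE5_TRANSFERS_PER_LANE * PCIE5_ENCODING / 8 / 1e6
    return min(nic, pcie)


def pair_ceiling(a, b):
    """A transfer between two devices is limited by the slower side."""
    return min(device_ceiling(a), device_ceiling(b))


# A few measured cells from the matrix (MB/s).
measured = {("A", "A"): 48967.45, ("A", "C"): 24524.29, ("B", "B"): 28275.85}

for (a, b), value in measured.items():
    bound = pair_ceiling(a, b)
    print(f"{a}-{b}: measured={value:.2f}  estimated bound={bound:.2f}  ratio={value / bound:.3f}")
```

Under these assumptions the cells involving only A, B, and C land within roughly 10% of the estimate. The cells involving D fall well below what its listed NIC and PCIe configuration would predict, so some limit not captured in the hardware table presumably applies there.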
detailed results: bench