
DLSlime Transfer Engine

A peer-to-peer RDMA transfer engine.

Usage

RDMA READ

import asyncio

import torch

from dlslime import RDMAEndpoint, available_nic

devices = available_nic()
assert devices, "No RDMA devices."

# Initialize the initiator's RDMA endpoint
initiator = RDMAEndpoint(device_name=devices[0], ib_port=1, link_type="RoCE")
# Register local GPU memory with the RDMA subsystem
local_tensor = torch.zeros([16], device="cuda", dtype=torch.uint8)
initiator.register_memory_region("buffer", local_tensor.data_ptr(), 16)

# Initialize the target endpoint on a different NIC
target = RDMAEndpoint(device_name=devices[-1], ib_port=1, link_type="RoCE")
# Register the target's GPU memory
remote_tensor = torch.ones([16], device="cuda", dtype=torch.uint8)
target.register_memory_region("buffer", remote_tensor.data_ptr(), 16)

# Establish a bidirectional RDMA connection:
# 1. Target connects using the initiator's endpoint information
# 2. Initiator connects using the target's endpoint information
# Note: real-world deployments exchange this information out of band (e.g., via TCP)
target.connect(initiator.local_endpoint_info)
initiator.connect(target.local_endpoint_info)

# Execute an asynchronous batched read operation
asyncio.run(initiator.async_read_batch("buffer", [0], [8], 8))
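The note above mentions that real deployments exchange endpoint information out of band, e.g. over TCP. Below is a minimal sketch of such an exchange using a TCP socket and JSON; the endpoint-info dicts are placeholders, not dlslime's actual wire format, and in practice you would serialize each side's `local_endpoint_info`.

```python
import json
import socket
import threading

def serve(server_sock, my_info, out):
    """Accept one peer, receive its endpoint info, reply with ours."""
    conn, _ = server_sock.accept()
    with conn:
        out.append(json.loads(conn.recv(4096).decode()))
        conn.sendall(json.dumps(my_info).encode())

# Placeholder endpoint descriptions (stand-ins for local_endpoint_info)
target_info = {"gid": "fe80::1", "qp_num": 42}
initiator_info = {"gid": "fe80::2", "qp_num": 43}

# Target side listens; initiator side connects
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
received = []
t = threading.Thread(target=serve, args=(server, target_info, received))
t.start()

client = socket.create_connection(server.getsockname())
client.sendall(json.dumps(initiator_info).encode())
got_target = json.loads(client.recv(4096).decode())
client.close()
t.join()
server.close()

# Each side now holds the other's info and could call endpoint.connect(...)
print(got_target["qp_num"], received[0]["qp_num"])
```

After this exchange, the `target.connect(...)` / `initiator.connect(...)` calls above would take the received dicts instead of reading the peer's attribute directly.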

SendRecv

Sender

# RDMA initialization and connection setup are the same as in the RDMA READ example
...

# RDMA Send
ones = torch.ones([16], dtype=torch.uint8)
endpoint.register_memory_region("buffer", ones.data_ptr(), 16)
asyncio.run(endpoint.send_async("buffer", 0, 8))

Receiver

# RDMA initialization and connection setup are the same as in the RDMA READ example
...

# RDMA Recv
zeros = torch.zeros([16], dtype=torch.uint8)
endpoint.register_memory_region("buffer", zeros.data_ptr(), 16)
asyncio.run(endpoint.recv_async("buffer", 8, 8))
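Assuming the second and third arguments of `send_async`/`recv_async` are a byte offset and a length (mirroring the offsets and sizes in the READ example), the net effect of the sender/receiver pair above can be pictured with plain buffer indexing:

```python
# 16-byte stand-ins for the sender's and receiver's registered buffers
ones = bytearray([1] * 16)   # sender: torch.ones([16], dtype=torch.uint8)
zeros = bytearray(16)        # receiver: torch.zeros([16], dtype=torch.uint8)

# send_async("buffer", 0, 8) paired with recv_async("buffer", 8, 8) would move
# the first 8 bytes of the sender into bytes 8..15 of the receiver:
zeros[8:16] = ones[0:8]
print(list(zeros))
```

That is, after the transfer the receiver's buffer holds zeros in its first half and the sent ones in its second half.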

Build

# on CentOS
sudo yum install cppzmq-devel gflags-devel cmake

# on Ubuntu
sudo apt install libzmq3-dev libgflags-dev cmake

# build from source
mkdir build && cd build
cmake -DBUILD_BENCH=ON -DBUILD_PYTHON=ON .. && make

Benchmark

# Target
./bench/transfer_bench                \
  --remote-endpoint=10.130.8.138:8000 \
  --local-endpoint=10.130.8.139:8000  \
  --device-name="mlx5_bond_0"         \
  --mode target                       \
  --block-size=2048000                \
  --batch-size=160

# Initiator
./bench/transfer_bench                \
  --remote-endpoint=10.130.8.139:8000 \
  --local-endpoint=10.130.8.138:8000  \
  --device-name="mlx5_bond_0"         \
  --mode initiator                    \
  --block-size=16384                  \
  --batch-size=16                     \
  --duration 10

Cross-node performance

Single Device

  • NVIDIA ConnectX-7 HHHL adapter card; 200GbE (default mode) / NDR200 IB; dual-port QSFP112; PCIe 5.0 x16 with x16 PCIe extension option
  • RoCE v2
Block Size  Batch Size  Total Trips  Total Transferred (MiB)  Duration (s)  Avg Latency (ms/trip)  Throughput (MiB/s)
8192        160         249920       312400                   10.0006       0.0400154              31238
16384       160         153300       383250                   10.0012       0.0652392              38320.5
32768       160         85280        426400                   10.0013       0.117276               42634.4
65536       160         44680        446800                   10.0033       0.223887               44665.3
128000      160         23340        455859                   10.0023       0.428546               45575.6
128000      160         23340        455859                   10.0028       0.428571               45573
256000      160         11820        461718                   10.0135       0.847166               46109.6
512000      160         5940         464062                   10.002        1.68384                46396.8
1024000     160         2980         465625                   10.0049       3.35735                46539.6
2048000     160         1500         468750                   10.0555       6.70364                46616.5
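The derived columns in the table above follow directly from block size, batch size, trip count, and duration. A quick check against the last row (small rounding differences are expected, since the printed duration is itself rounded):

```python
# Values from the last table row
block_size, batch_size, trips, duration = 2048000, 160, 1500, 10.0555

total_mib = block_size * batch_size * trips / 2**20  # Total Transferred (MiB)
latency_ms = duration / trips * 1000                 # Avg Latency (ms/trip)
throughput = total_mib / duration                    # Throughput (MiB/s)

print(total_mib, latency_ms, throughput)
```

Each trip moves block_size × batch_size bytes, so total transfer and per-trip latency reproduce the table to within rounding of the reported duration.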

Aggregated Transfer

  • NVIDIA ConnectX-7 HHHL adapter card × 8
  • RoCE v2
