An ultra-fast, distributed Safetensors loader

Project description

InstantTensor

InstantTensor is an ultra-fast, distributed Safetensors loader designed to maximize I/O throughput when moving model weights from Safetensors files to GPU memory.

Model loading benchmark on inference engines:

Model          GPU      Backend        Load Time (s)  Throughput (GB/s)  Speedup
Qwen3-30B-A3B  1*H200   Safetensors    57.4           1.1                1x
Qwen3-30B-A3B  1*H200   InstantTensor  1.77           35                 32.4x
DeepSeek-R1    8*H200   Safetensors    160            4.3                1x
DeepSeek-R1    8*H200   InstantTensor  15.3           45                 10.5x

See Benchmark for full benchmarks.

Quickstart

from instanttensor import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

NOTE: tensor points to an internal buffer of InstantTensor and must be copied immediately (e.g., with clone() or copy_()) to avoid the data being overwritten when the buffer is reused.
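The hazard behind this note can be illustrated in plain Python (this is a conceptual sketch only, not InstantTensor's API): when an iterator reuses one underlying buffer, results captured from earlier iterations are silently overwritten unless the caller copies them.

```python
# Conceptual sketch only: a generator that reuses one buffer, standing in
# for InstantTensor's internal staging buffer. Earlier results are
# clobbered unless the caller copies them out.
def tensors_from_reused_buffer(rows):
    buf = [0, 0, 0]          # one buffer reused for every yielded "tensor"
    for row in rows:
        buf[:] = row         # overwrite the buffer in place
        yield buf            # caller receives the same object every time

rows = [[1, 1, 1], [2, 2, 2]]
no_copy = list(tensors_from_reused_buffer(rows))
copied = [list(t) for t in tensors_from_reused_buffer(rows)]
print(no_copy[0])  # [2, 2, 2] -- first result was overwritten
print(copied[0])   # [1, 1, 1] -- copying preserved it
```

In the real API, `tensor.clone()` or `tensor.copy_()` plays the role of `list(t)` here.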

See Usage for more details (multi-file and distributed usage).

Why InstantTensor?

  • Fast weight loading:
    • Direct I/O: Avoid the slow page cache allocation on cold start. Friendly for large models and tight memory budgets.
    • Tuned I/O size and concurrency: Maximize hardware throughput.
    • Pipelining and prefetching: Parallelize and overlap the various stages of transmission.
  • Distributed loading: Use torch.distributed (NCCL) to speed up loading under any parallelism policy (TP/PP/EP/CP/DP).
  • Minimal device buffer: ≤ ~3× largest-tensor size; far below single-file size.
  • Multiple I/O backends:
    • GPUDirect Storage
    • Legacy Storage
    • Memory-based Storage
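The pipelining/prefetching idea above can be sketched as a reader thread feeding a bounded queue, so the next chunk is being fetched while the current one is processed. This is a conceptual illustration, not InstantTensor's actual implementation; the stage functions here are stand-ins for disk reads and host-to-device copies.

```python
# Conceptual sketch of pipelining with prefetching: a reader thread fills a
# bounded queue so the next chunk is fetched while the current one is being
# consumed. The queue depth caps how much staging memory is in flight,
# mirroring the "minimal device buffer" idea.
import queue
import threading

def read_stage(chunks, q):
    for c in chunks:
        q.put(c)             # stands in for reading a chunk from disk
    q.put(None)              # sentinel: no more chunks

def load_pipelined(chunks, depth=2):
    q = queue.Queue(maxsize=depth)   # bounded queue caps buffer usage
    reader = threading.Thread(target=read_stage, args=(chunks, q))
    reader.start()
    out = []
    while (c := q.get()) is not None:
        out.append(c.upper())        # stands in for the copy/transform stage
    reader.join()
    return out

print(load_pipelined(["a", "b", "c"]))  # ['A', 'B', 'C']
```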

Installation

InstantTensor requires a Linux environment with the CUDA driver installed. The typical installation steps are as follows:

Method 1: Install from pip

pip install instanttensor

Method 2: Build from source

cd ./instanttensor
pip install .
# For a debug build: DEBUG=1 pip install .

Usage

Multi-file mode (recommended)

Passing a list of files lets the backend plan reads across all of them, which yields higher throughput than loading each file with a separate call:

from instanttensor import safe_open

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

Distributed loading

InstantTensor can use a torch.distributed NCCL process group to coordinate loading and achieve higher throughput compared to running safe_open independently on each GPU.

import torch
import torch.distributed as dist
from instanttensor import safe_open

dist.init_process_group(backend="nccl")
process_group = dist.GroupMember.WORLD

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=torch.cuda.current_device(), process_group=process_group) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

NOTE: You can also load weights using a subgroup created via dist.new_group, which allows multiple subgroups to load weights independently. For example, if you have TP=8 and PP=2 (i.e., two TP groups), you can create two subgroups and load weights independently on each TP group. In cross-node (multi-machine) scenarios, loading using per-node subgroups can sometimes be faster than loading on the world group. However, for most cases, the world group is a good default choice.
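The rank bookkeeping for such subgroups might look like the sketch below. The helper name and the TP=8/PP=2 layout are illustrative, and the commented dist.new_group calls assume an initialized NCCL process group (note that every rank must call dist.new_group for every group, not only its own):

```python
# Hypothetical helper (not part of InstantTensor) for the TP=8, PP=2 layout
# described above: split 16 world ranks into two TP groups of 8 ranks each.
def tp_group_ranks(world_size, tp_size):
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]

print(tp_group_ranks(16, 8))
# [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]

# With torch.distributed initialized, each rank would then do roughly:
#   groups = [dist.new_group(ranks) for ranks in tp_group_ranks(16, 8)]
#   my_group = groups[dist.get_rank() // 8]
#   with safe_open(files, framework="pt",
#                  device=torch.cuda.current_device(),
#                  process_group=my_group) as f:
#       ...
```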

See tests/test.py for a full benchmark harness (TP/PP grouping, checksums, etc.).

API reference

See the API reference for details.

Thanks

Thanks to the AI Systems and Optimization team at ScitiX AI and the Wenfei Wu Lab at Peking University.

Download files

Download the file for your platform.

Source Distribution

instanttensor-0.1.5.tar.gz (9.1 MB)

Uploaded: Source

Built Distributions


instanttensor-0.1.5-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.9 MB)

Uploaded: CPython 3.13, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

instanttensor-0.1.5-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.9 MB)

Uploaded: CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

instanttensor-0.1.5-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.9 MB)

Uploaded: CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

instanttensor-0.1.5-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.9 MB)

Uploaded: CPython 3.10, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

File details

Details for the file instanttensor-0.1.5.tar.gz.

File metadata

  • Download URL: instanttensor-0.1.5.tar.gz
  • Upload date:
  • Size: 9.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for instanttensor-0.1.5.tar.gz:

  • SHA256: cc1ea70bb4734d35fdbda118364f795cee7df87b1736efb08fce41169852c322
  • MD5: cee14e0e1a71c8d048a16b78b5b505bd
  • BLAKE2b-256: 27727f9f01558c2159164dc6b3954ed288869ec6d051c9c632a71cce18a1fd8f

See more details on using hashes here.

File details

Details for the file instanttensor-0.1.5-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for instanttensor-0.1.5-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

  • SHA256: 39b9b3a0f70c5bf306a3782032105c0707f8aca596ecdbd2ebc7fc6fe011f0c3
  • MD5: a2618db054adb278d792e68007b8ca5e
  • BLAKE2b-256: 858b21cd6316979d6b3916b1caf85fa16af351de9b43bfe1e86f563b33901783


File details

Details for the file instanttensor-0.1.5-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for instanttensor-0.1.5-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

  • SHA256: d910b72d40f2dbab4e571a8fafd70edf2a38d8925954c01793bf332916921424
  • MD5: 20d2705835ab741c43a6e85c8eb90bac
  • BLAKE2b-256: 1a56499a0d39301cce8a64824f20a0330cb7e443044e7b28b403de026e9c0371


File details

Details for the file instanttensor-0.1.5-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for instanttensor-0.1.5-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

  • SHA256: 72442b0ac1b512a6e23be7456461cab51c68612c0ba13065dff44b29ec357b7a
  • MD5: ec80151722a032d2d6999368c25254fa
  • BLAKE2b-256: 85d8f715c385b62e47de1f521fb6d6f8723faac65418cdb48fc5fafb858484c5


File details

Details for the file instanttensor-0.1.5-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for instanttensor-0.1.5-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl:

  • SHA256: 1e9cad7f2a97513815100a9c33f2cc8d150095176a71fa16ea469e9d7e354266
  • MD5: c7490fa246ee107b48cde1ad25138022
  • BLAKE2b-256: b4a7b98ede1bd77af7f89f7056dc12aa49716f444ae1e81c50d67d85b8061245

