InstantTensor

An ultra-fast, distributed Safetensors loader

InstantTensor is an ultra-fast, distributed Safetensors loader designed to maximize I/O throughput when moving model weights from Safetensors files to GPU memory.

Model-loading benchmark on inference engines

Model          GPU     Backend        Load Time (s)   Throughput (GB/s)   Speedup
Qwen3-30B-A3B  1*H200  Safetensors    57.4            1.1                 1x
Qwen3-30B-A3B  1*H200  InstantTensor  1.77            35                  32.4x
DeepSeek-R1    8*H200  Safetensors    160             4.3                 1x
DeepSeek-R1    8*H200  InstantTensor  15.3            45                  10.5x

See Benchmark for full benchmarks.

Quickstart

from instanttensor import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

NOTE: tensor points to an internal buffer of InstantTensor and must be copied immediately (e.g., via clone() or copy_()); otherwise the data may be overwritten when the buffer is reused.
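The buffer-reuse hazard described in the note can be illustrated with a plain Python buffer (no InstantTensor involved): a view that aliases a reused buffer silently changes, while a copy does not.

```python
# Simulate a loader that reuses one internal buffer for successive reads.
buffer = bytearray(b"weights-A")

view = memoryview(buffer)   # aliases the internal buffer (like keeping `tensor`)
snapshot = bytes(buffer)    # independent copy (analogous to clone())

# The loader reuses the buffer for the next tensor...
buffer[:] = b"weights-B"

print(bytes(view))   # b'weights-B' -- the aliased view was overwritten
print(snapshot)      # b'weights-A' -- the copy is safe
```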

See Usage for more details (multi-file and distributed usage).

Why InstantTensor?

  • Fast weight loading
    • Direct I/O: Avoid the slow page cache allocation on cold start. Friendly for large models and tight memory budgets.
    • Tuned I/O size and concurrency: Maximize hardware throughput.
    • Pipelining and prefetching: Parallelize and overlap the various stages of transmission.
  • Distributed loading
    • Use torch.distributed (NCCL) to speed up loading under any parallelism policy (TP/PP/EP/CP/DP).
  • Multiple I/O backends
    • Supports multiple backends: GPUDirect Storage, Legacy Storage, and Memory-based Storage.
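The pipelining and prefetching point above can be sketched with a toy two-stage pipeline (the names read_chunks/upload_chunks are illustrative, not InstantTensor API): one thread produces the next chunk while another consumes the previous one, so the stages overlap instead of running back to back, and a bounded queue caps prefetch depth.

```python
import queue
import threading

def read_chunks(chunk_ids, q):
    """Stage 1: produce chunks (stands in for disk reads)."""
    for cid in chunk_ids:
        q.put(f"chunk-{cid}")
    q.put(None)  # sentinel: no more chunks

def upload_chunks(q, uploaded):
    """Stage 2: consume chunks (stands in for host-to-GPU copies)."""
    while (chunk := q.get()) is not None:
        uploaded.append(chunk)

q = queue.Queue(maxsize=2)  # small queue bounds how far stage 1 runs ahead
uploaded = []
reader = threading.Thread(target=read_chunks, args=(range(4), q))
uploader = threading.Thread(target=upload_chunks, args=(q, uploaded))
reader.start(); uploader.start()
reader.join(); uploader.join()
print(uploaded)  # ['chunk-0', 'chunk-1', 'chunk-2', 'chunk-3']
```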

When to Use InstantTensor

InstantTensor is recommended if any of the following conditions are met:

  • High storage bandwidth (>= 5 GB/s).
  • Unable to keep the model cached in host memory, for example:
    • Limited free memory for model caching (for example, when most memory is used for KV cache offloading in LLM serving).
    • Infrequent model loading, where Linux page cache is less effective.
    • Model switching, where multiple models cannot be cached in memory simultaneously.
  • The model is heavily sharded (for example, TP=8), resulting in small, non-contiguous I/O per GPU.
  • Loading from tmpfs.

Installation

Requirements

  • GPU platforms: CUDA, ROCm
  • Framework: PyTorch

Method 1: Install from pip

pip install instanttensor

Method 2: Build from source

git clone https://github.com/scitix/InstantTensor.git
cd InstantTensor
./checkout_submodules.sh
pip install .
# For a debug build, run: DEBUG=1 pip install .

Usage

Multi-file loading

Passing a list of files lets the backend plan reads across all of them, which yields higher throughput than loading each file with a separate call:

from instanttensor import safe_open

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

Distributed loading

InstantTensor can use a torch.distributed NCCL process group to coordinate loading and achieve higher throughput compared to running safe_open independently on each GPU.

import torch
import torch.distributed as dist
from instanttensor import safe_open

dist.init_process_group(backend="nccl")
process_group = dist.GroupMember.WORLD

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=torch.cuda.current_device(), process_group=process_group) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

NOTE: You can also load weights using a subgroup created via dist.new_group, which allows multiple subgroups to load weights independently. For example, with TP=8 and PP=2 (i.e., two TP groups), you can create two subgroups and load weights independently in each TP group. In cross-node (multi-machine) scenarios, loading with per-node subgroups can sometimes be faster than loading on the world group. In most cases, however, the world group is a good default.
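As a sketch of the per-TP-group pattern described above, the rank lists passed to dist.new_group can be computed like this (build_tp_groups is a hypothetical helper, not part of InstantTensor):

```python
def build_tp_groups(world_size, tp_size):
    """Partition ranks into contiguous TP groups, one per PP stage."""
    assert world_size % tp_size == 0
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]

# TP=8, PP=2 -> two groups of eight ranks each
print(build_tp_groups(16, 8))
# [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]

# Each rank would then join its subgroup and pass it to safe_open, e.g.:
# group = dist.new_group(ranks=build_tp_groups(16, 8)[rank // 8])
```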

See tests/test.py for a full benchmark harness (TP/PP grouping, checksums, etc.).

API reference

See Build API reference

Thanks

Thanks to the AI Systems and Optimization team at ScitiX AI and the Wenfei Wu Lab at Peking University.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

instanttensor-0.1.8.tar.gz (9.1 MB)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

instanttensor-0.1.8-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (5.0 MB)

Uploaded: CPython 3.13, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

instanttensor-0.1.8-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (5.0 MB)

Uploaded: CPython 3.12, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

instanttensor-0.1.8-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (5.0 MB)

Uploaded: CPython 3.11, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

instanttensor-0.1.8-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (5.0 MB)

Uploaded: CPython 3.10, manylinux: glibc 2.24+ x86-64, manylinux: glibc 2.28+ x86-64

File details

Details for the file instanttensor-0.1.8.tar.gz.

File metadata

  • Download URL: instanttensor-0.1.8.tar.gz
  • Upload date:
  • Size: 9.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for instanttensor-0.1.8.tar.gz
Algorithm Hash digest
SHA256 9127341f3c331c6f520f8663da620ff5a6bade2725917f4e76f7f146bdd36b87
MD5 6df32805aa595a23645aa18ab778b5a9
BLAKE2b-256 dde121dabb75dba00f5bf53959fe40576f9ec768eb8c8aa926f847178d797a65

See more details on using hashes here.
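For example, a downloaded archive can be checked against the SHA256 digest listed above using only the standard library (the file path in the commented assertion assumes the archive sits in the current directory):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large archives need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

expected = "9127341f3c331c6f520f8663da620ff5a6bade2725917f4e76f7f146bdd36b87"
# assert sha256_of("instanttensor-0.1.8.tar.gz") == expected
```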

File details

Details for the file instanttensor-0.1.8-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for instanttensor-0.1.8-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 67b74411a9b354aff78571f09f2ded15817ed5f88f7085cb2b4362e271670235
MD5 d3257ada569f9395c5f72c7a5a21b913
BLAKE2b-256 2bbcdeefcf7b7d8e73a38a5da8d2b56f8c4720b117f06e177954847fff2455ca


File details

Details for the file instanttensor-0.1.8-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for instanttensor-0.1.8-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 7cb580ae0e016610b3b8bf8fcf54450968dbd82b1e31c12be9753d9bd68716f3
MD5 cf017fed23f2c3dfcd26c40b102b9ec3
BLAKE2b-256 88ddf484c127eb55c9dc188db1d87be70e14867fcb985e461fcd92f91fc402a6


File details

Details for the file instanttensor-0.1.8-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for instanttensor-0.1.8-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 921bae6611dce2fff0c07e6de9a8e5071e52ba87fc35bf179a4e570735655832
MD5 fd887d14b596adf6b3971f5050d75576
BLAKE2b-256 122ed81b53ae30fd4589223a6cd921acb79f095dc0eb0da9cccedc54d10c462d


File details

Details for the file instanttensor-0.1.8-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.

File hashes

Hashes for instanttensor-0.1.8-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 52196dc21ea698a02754d19773b2eeac06fbdab2bc6e181a59419e68532c8430
MD5 cdd49f16f138b4b6e1a791b95712d0b9
BLAKE2b-256 d1246ec7b34a4a1751ddf0709ffb5b492b0db379e23ccb2d1e6596ec6b6e137d

