InstantTensor

InstantTensor is an ultra-fast, distributed Safetensors loader designed to maximize I/O throughput when moving model weights from Safetensors files to GPU memory.

Model loading benchmark on inference engines

Model          GPU     Backend        Load Time (s)  Throughput (GB/s)  Speedup
Qwen3-30B-A3B  1*H200  Safetensors    57.4           1.1                1x
Qwen3-30B-A3B  1*H200  InstantTensor  1.77           35                 32.4x
DeepSeek-R1    8*H200  Safetensors    160            4.3                1x
DeepSeek-R1    8*H200  InstantTensor  15.3           45                 10.5x

See Benchmark for full benchmarks.

Quickstart

from instanttensor import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

NOTE: tensor points to an internal InstantTensor buffer and must be copied immediately (e.g., via clone() or copy_()); otherwise the data may be overwritten when the buffer is reused.
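The buffer-reuse hazard can be illustrated with a plain-Python analogy (no InstantTensor APIs involved): keeping a view into a reused buffer, instead of copying, means later writes clobber your data.

```python
# Plain-Python analogy for the buffer-reuse hazard; not InstantTensor code.

buffer = bytearray(b"weights-A")

# A view aliases the buffer's memory, like a tensor yielded by f.tensors().
view = memoryview(buffer)

# A copy materializes the data, like tensor.clone().
copy = bytes(view)

# The loader reuses the buffer for the next tensor...
buffer[:] = b"weights-B"

print(bytes(view))  # b'weights-B' -- the view was silently overwritten
print(copy)         # b'weights-A' -- the copy is safe
```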

See Usage for more details (multi-file and distributed usage).

Why InstantTensor?

  • Fast weight loading
    • Direct I/O: Avoid the slow page cache allocation on cold start. Friendly for large models and tight memory budgets.
    • Tuned I/O size and concurrency: Maximize hardware throughput.
    • Pipelining and prefetching: Parallelize and overlap the various stages of transmission.
  • Distributed loading
    • Use torch.distributed (NCCL) to speed up loading under any parallelism policy (TP/PP/EP/CP/DP).
  • Multiple I/O backends
    • Supports multiple backends: GPUDirect Storage, Legacy Storage, and Memory-based Storage.

When to Use InstantTensor

InstantTensor is recommended if any of the following conditions are met:

  • High storage bandwidth (>= 5 GB/s).
  • Unable to keep the model cached in host memory, for example:
    • Limited free memory for model caching (for example, when most memory is used for KV cache offloading in LLM serving).
    • Infrequent model loading, where Linux page cache is less effective.
    • Model switching, where multiple models cannot be cached in memory simultaneously.
  • The model is heavily sharded (for example, TP=8), resulting in small, non-contiguous I/O per GPU.
  • Loading from tmpfs.

Installation

Requirements

  • GPU platforms: CUDA, ROCm
  • Framework: PyTorch

Method 1: Install from pip

pip install instanttensor

Method 2: Build from source

git clone https://github.com/scitix/InstantTensor.git
cd InstantTensor
./checkout_submodules.sh
pip install .
# For a debug build: DEBUG=1 pip install .

Usage

Multi-file loading

Passing a list of files lets the backend plan reads across all of them, which yields higher throughput than calling safe_open once per file:

from instanttensor import safe_open

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

Distributed loading

InstantTensor can use a torch.distributed NCCL process group to coordinate loading and achieve higher throughput compared to running safe_open independently on each GPU.

import torch
import torch.distributed as dist
from instanttensor import safe_open

dist.init_process_group(backend="nccl")
process_group = dist.GroupMember.WORLD

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=torch.cuda.current_device(), process_group=process_group) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

NOTE: You can also load weights using a subgroup created via dist.new_group, which allows multiple subgroups to load weights independently. For example, if you have TP=8 and PP=2 (i.e., two TP groups), you can create two subgroups and load weights independently on each TP group. In cross-node (multi-machine) scenarios, loading using per-node subgroups can sometimes be faster than loading on the world group. However, for most cases, the world group is a good default choice.
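The TP-subgroup layout described above can be sketched as follows. The grouping arithmetic is pure Python; the dist.new_group call follows the standard torch.distributed API (note it is collective, so every rank must create every group). The TP-major rank layout is an assumption; adapt it to your engine's convention.

```python
def tp_group_ranks(world_size, tp_size):
    """Partition ranks [0, world_size) into contiguous TP groups.

    Assumes a TP-major layout: ranks 0..tp_size-1 form the first TP
    group, tp_size..2*tp_size-1 the second, and so on.
    """
    assert world_size % tp_size == 0
    return [list(range(g * tp_size, (g + 1) * tp_size))
            for g in range(world_size // tp_size)]

# With TP=8 and PP=2 (world_size=16) this yields [[0..7], [8..15]].
# Each rank then creates all subgroups and picks its own:
#
#   import torch.distributed as dist
#   groups = [dist.new_group(ranks) for ranks in tp_group_ranks(16, 8)]
#   my_group = groups[dist.get_rank() // 8]
#
# and passes my_group as process_group= to safe_open.
```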

See tests/test.py for a full benchmark harness (TP/PP grouping, checksums, etc.).

API reference

See Build API reference in the repository documentation.

Thanks

Thanks to the AI Systems and Optimization team at ScitiX AI and the Wenfei Wu Lab at Peking University.
