InstantTensor

InstantTensor is an ultra-fast, distributed Safetensors loader designed to maximize I/O throughput when moving model weights from Safetensors files to GPU memory.

Model loading benchmark on inference engines

Model          GPU     Backend        Load Time (s)  Throughput (GB/s)  Speedup
Qwen3-30B-A3B  1*H200  Safetensors    57.4           1.1                1x
Qwen3-30B-A3B  1*H200  InstantTensor  1.77           35                 32.4x
DeepSeek-R1    8*H200  Safetensors    160            4.3                1x
DeepSeek-R1    8*H200  InstantTensor  15.3           45                 10.5x

See Benchmark for full benchmarks.

Quickstart

from instanttensor import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

NOTE: tensor points to an internal InstantTensor buffer and must be copied immediately (e.g., via clone() or copy_()); otherwise the data may be overwritten when the buffer is reused.
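The buffer-reuse hazard can be illustrated with a plain-Python analogy (no InstantTensor APIs involved): keeping a view into a reused buffer, instead of copying, means later writes clobber your data.

```python
# Plain-Python analogy for the buffer-reuse hazard; not InstantTensor code.

buffer = bytearray(b"weights-A")

# A view aliases the buffer's memory, like a tensor yielded by f.tensors().
view = memoryview(buffer)

# A copy materializes the data, like tensor.clone().
copy = bytes(view)

# The loader reuses the buffer for the next tensor...
buffer[:] = b"weights-B"

print(bytes(view))  # b'weights-B' -- the view was silently overwritten
print(copy)         # b'weights-A' -- the copy is safe
```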

See Usage for more details (multi-file and distributed usage).

Why InstantTensor?

  • Fast weight loading
    • Direct I/O: Avoid the slow page cache allocation on cold start. Friendly for large models and tight memory budgets.
    • Tuned I/O size and concurrency: Maximize hardware throughput.
    • Pipelining and prefetching: Parallelize and overlap the various stages of transmission.
  • Distributed loading
    • Use torch.distributed (NCCL) to speed up loading under any parallelism policy (TP/PP/EP/CP/DP).
  • Multiple I/O backends
    • Supports multiple backends: GPUDirect Storage, Legacy Storage, and Memory-based Storage.

When to Use InstantTensor

InstantTensor is recommended if any of the following conditions are met:

  • High storage bandwidth (>= 5 GB/s).
  • Unable to keep the model cached in host memory, for example:
    • Limited free memory for model caching (for example, when most memory is used for KV cache offloading in LLM serving).
    • Infrequent model loading, where Linux page cache is less effective.
    • Model switching, where multiple models cannot be cached in memory simultaneously.
  • The model is heavily sharded (for example, TP=8), resulting in small, non-contiguous I/O per GPU.
  • Loading from tmpfs.

Installation

Requirements

  • GPU platforms: CUDA, ROCm
  • Framework: PyTorch

Method 1: Install from pip

pip install instanttensor

Method 2: Build from source

git clone https://github.com/scitix/InstantTensor.git
cd InstantTensor
./checkout_submodules.sh
pip install .
# For a debug build: DEBUG=1 pip install .

Usage

Multi-file loading

Passing a list of files lets the backend plan reads across all of them, which yields higher throughput than calling safe_open once per file:

from instanttensor import safe_open

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

Distributed loading

InstantTensor can use a torch.distributed NCCL process group to coordinate loading and achieve higher throughput compared to running safe_open independently on each GPU.

import torch
import torch.distributed as dist
from instanttensor import safe_open

dist.init_process_group(backend="nccl")
process_group = dist.GroupMember.WORLD

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=torch.cuda.current_device(), process_group=process_group) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()

NOTE: You can also load weights using a subgroup created via dist.new_group, which allows multiple subgroups to load weights independently. For example, if you have TP=8 and PP=2 (i.e., two TP groups), you can create two subgroups and load weights independently on each TP group. In cross-node (multi-machine) scenarios, loading using per-node subgroups can sometimes be faster than loading on the world group. However, for most cases, the world group is a good default choice.
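The TP-subgroup layout described above can be sketched as follows. The grouping arithmetic is pure Python; the dist.new_group call follows the standard torch.distributed API (note it is collective, so every rank must create every group). The TP-major rank layout is an assumption; adapt it to your engine's convention.

```python
def tp_group_ranks(world_size, tp_size):
    """Partition ranks [0, world_size) into contiguous TP groups.

    Assumes a TP-major layout: ranks 0..tp_size-1 form the first TP
    group, tp_size..2*tp_size-1 the second, and so on.
    """
    assert world_size % tp_size == 0
    return [list(range(g * tp_size, (g + 1) * tp_size))
            for g in range(world_size // tp_size)]

# With TP=8 and PP=2 (world_size=16) this yields [[0..7], [8..15]].
# Each rank then creates all subgroups and picks its own:
#
#   import torch.distributed as dist
#   groups = [dist.new_group(ranks) for ranks in tp_group_ranks(16, 8)]
#   my_group = groups[dist.get_rank() // 8]
#
# and passes my_group as process_group= to safe_open.
```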

See tests/test.py for a full benchmark harness (TP/PP grouping, checksums, etc.).

API reference

See Build API reference in the repository documentation.

Thanks

Thanks to the AI Systems and Optimization team at ScitiX AI and the Wenfei Wu Lab at Peking University.
