An ultra-fast, distributed Safetensors loader
InstantTensor
InstantTensor is an ultra-fast, distributed Safetensors loader designed to maximize I/O throughput when moving model weights from Safetensors files to GPU memory.
Model loading benchmark on inference engines
| Model | GPU | Backend | Load Time (s) | Throughput (GB/s) | Speedup |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1*H200 | Safetensors | 57.4 | 1.1 | 1x |
| Qwen3-30B-A3B | 1*H200 | InstantTensor | 1.77 | 35 | 32.4x |
| DeepSeek-R1 | 8*H200 | Safetensors | 160 | 4.3 | 1x |
| DeepSeek-R1 | 8*H200 | InstantTensor | 15.3 | 45 | 10.5x |
See Benchmark for full benchmarks.
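The Speedup column is the ratio of the two load times, and load time multiplied by throughput gives a rough estimate of how many gigabytes were moved. The short check below reproduces that arithmetic from the table's own figures; the dictionary layout and function names are illustrative only.

```python
# Figures copied from the benchmark table: (load time in seconds, throughput in GB/s).
rows = {
    ("Qwen3-30B-A3B", "Safetensors"): (57.4, 1.1),
    ("Qwen3-30B-A3B", "InstantTensor"): (1.77, 35),
    ("DeepSeek-R1", "Safetensors"): (160, 4.3),
    ("DeepSeek-R1", "InstantTensor"): (15.3, 45),
}


def speedup(model):
    """Baseline load time divided by InstantTensor load time."""
    baseline_s, _ = rows[(model, "Safetensors")]
    fast_s, _ = rows[(model, "InstantTensor")]
    return baseline_s / fast_s


def approx_size_gb(model, backend):
    """Load time multiplied by throughput: a rough estimate of data moved."""
    seconds, gb_per_s = rows[(model, backend)]
    return seconds * gb_per_s


for model in ("Qwen3-30B-A3B", "DeepSeek-R1"):
    print(model, f"speedup {speedup(model):.1f}x",
          f"approx. {approx_size_gb(model, 'InstantTensor'):.1f} GB moved")
```

For each model, both backends give nearly the same size estimate (roughly 62 to 63 GB for Qwen3-30B-A3B and about 688 GB for DeepSeek-R1), which suggests the throughput column is total weight size divided by load time.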
Quickstart
from instanttensor import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor
Yielded tensors own their memory by default (copy=True). For zero-copy
streaming into preallocated storage, see Zero-copy mode.
See Usage for multi-file and distributed usage.
Why InstantTensor?
- Fast weight loading
  - Direct I/O: Avoid the slow page cache allocation on cold start. Friendly for large models and tight memory budgets.
  - Tuned I/O size and concurrency: Maximize hardware throughput.
  - Pipelining and prefetching: Parallelize and overlap the various stages of transmission.
- Distributed loading
  - Use torch.distributed (NCCL) to speed up loading under any parallelism policy (TP/PP/EP/CP/DP).
- Multiple I/O backends
  - Supports multiple backends: GPUDirect Storage, Legacy Storage, and Memory-based Storage.
When to Use InstantTensor
InstantTensor is recommended if any of the following conditions are met:
- High storage bandwidth (>= 5 GB/s).
- Unable to keep the model cached in host memory, for example:
  - Limited free memory for model caching (for example, when most memory is used for KV cache offloading in LLM serving).
  - Infrequent model loading, where Linux page cache is less effective.
  - Model switching, where multiple models cannot be cached in memory simultaneously.
- The model is heavily sharded (for example, TP=8), resulting in small, non-contiguous I/O per GPU.
- Loading from tmpfs.
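To check whether a model path actually sits on tmpfs or ramfs, one option on Linux is to look up its mount point in /proc/mounts. The sketch below is not part of InstantTensor: fs_type_for and is_in_memory_path are hypothetical helper names, and the parser ignores escaped characters in mount paths.

```python
import os

IN_MEMORY_FS = {"tmpfs", "ramfs"}


def fs_type_for(path, mounts_text):
    """Return the file-system type of the longest mount point that contains path.

    mounts_text uses the /proc/mounts layout: device, mount point, type, options...
    """
    path = os.path.abspath(path) + "/"
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same mount point shadow an earlier one.
        if path.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_in_memory_path(path):
    """Linux only: True if path lives on tmpfs or ramfs according to /proc/mounts."""
    with open("/proc/mounts") as fh:
        return fs_type_for(path, fh.read()) in IN_MEMORY_FS
```

On a typical Linux host, is_in_memory_path("/dev/shm/model.safetensors") would be expected to return True, since /dev/shm is usually a tmpfs mount.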
Installation
Requirements
- GPU platforms: CUDA, ROCm
- Framework: PyTorch
Method 1: Install from pip
pip install instanttensor
Method 2: Build from source
git clone https://github.com/scitix/InstantTensor.git
cd InstantTensor
./checkout_submodules.sh
pip install .
# For a debug build, set "DEBUG=1" before "pip"
Usage
Multi-file loading
Passing a list of files allows the backend to plan reads and provides higher throughput than making multiple calls to load single files:
from instanttensor import safe_open

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor
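For checkpoints with many shards, the file list can be built from the model directory instead of being typed by hand. This is a minimal sketch using only the standard library; find_shards is a hypothetical helper, not an InstantTensor function, and it assumes all shards live directly in one directory.

```python
from pathlib import Path


def find_shards(model_dir):
    """Return every *.safetensors file in model_dir as a sorted list of path strings."""
    return sorted(str(p) for p in Path(model_dir).glob("*.safetensors"))
```

The returned list can be passed as the first argument to safe_open in the same way as the hand-written files list above. Sorting keeps the order deterministic across processes, which is convenient when every rank has to build the same list.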
Distributed loading
InstantTensor can use a torch.distributed NCCL process group to coordinate loading and achieve higher throughput compared to running safe_open independently on each GPU.
import torch
import torch.distributed as dist
from instanttensor import safe_open

dist.init_process_group(backend="nccl")
process_group = dist.GroupMember.WORLD
files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=torch.cuda.current_device(), process_group=process_group) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor
NOTE: You can also load weights using a subgroup created via dist.new_group, which allows multiple subgroups to load weights independently. For example, if you have TP=8 and PP=2 (i.e., two TP groups), you can create two subgroups and load weights independently on each TP group. In cross-node (multi-machine) scenarios, loading using per-node subgroups can sometimes be faster than loading on the world group. However, for most cases, the world group is a good default choice.
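The TP=8, PP=2 case from the note can be sketched as follows. The rank layout is an assumption: the sketch treats each TP group as a block of consecutive global ranks, which may not match your launcher or framework. tp_rank_groups and new_tp_group are hypothetical helper names.

```python
def tp_rank_groups(world_size, tp_size):
    """Split global ranks into consecutive blocks of tp_size (assumed layout)."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be a multiple of tp_size")
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]


def new_tp_group(world_size, tp_size):
    """Create one subgroup per TP block and return the one containing this rank.

    Requires dist.init_process_group to have been called already.
    """
    import torch.distributed as dist  # imported lazily so the helper above needs no torch

    rank = dist.get_rank()
    my_group = None
    for ranks in tp_rank_groups(world_size, tp_size):
        # new_group is collective: every rank calls it for every subgroup,
        # including subgroups it does not belong to.
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group = group
    return my_group
```

With 16 ranks, tp_rank_groups(16, 8) gives ranks 0 to 7 and 8 to 15. The group returned by new_tp_group(16, 8) can then be passed as process_group to safe_open in place of the world group.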
Buffered I/O
For regular disk files, InstantTensor defaults to Direct I/O to prioritize cold load performance. This is usually the right choice when a model is loaded once for a long-running workload.
If the same model is loaded repeatedly within a short period, Buffered I/O can
be faster because later reads may benefit from the page cache. Enable it with
the backend=BackendPolicy.BUFFERED argument to safe_open, or, when
backend=None, by setting the INSTANTTENSOR_BACKEND=BUFFERED environment
variable.
Zero-copy mode
Pass copy=False to skip the per-tensor clone and yield views into the
internal ring buffer:
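from instanttensor import safe_open

# files: list of shard paths; model_param: preallocated tensors keyed by name
with safe_open(files, framework="pt", device=0, copy=False) as f:
    for name, tensor in f.tensors():
        model_param[name].copy_(tensor)  # consume the view before the next one is yielded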
Two rules:
- Consume each tensor before the next is yielded: list(f.tensors()) and similar patterns silently corrupt data when buffer_size < total_tensor_size.
- Do not keep references past the with block: the buffer is freed on exit.
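The first rule is easier to see with a toy model. The sketch below does not use InstantTensor: stream_views is a hypothetical stand-in that yields memoryview slices of one reused buffer, mimicking views into a ring buffer that is smaller than the data streamed through it.

```python
def stream_views(chunks, buffer_size):
    """Toy zero-copy stream: yield views into a single reused buffer."""
    buf = memoryview(bytearray(buffer_size))
    offset = 0
    for chunk in chunks:
        if offset + len(chunk) > buffer_size:
            offset = 0  # wrap around and overwrite the oldest data
        buf[offset:offset + len(chunk)] = chunk
        yield buf[offset:offset + len(chunk)]
        offset += len(chunk)


chunks = [b"aaaa", b"bbbb", b"cccc"]  # 12 bytes streamed through an 8-byte buffer

# Safe: copy each view out before requesting the next one.
consumed = [bytes(view) for view in stream_views(chunks, buffer_size=8)]

# Unsafe: collect every view first, then read them.
held = [bytes(view) for view in list(stream_views(chunks, buffer_size=8))]

print(consumed)  # [b'aaaa', b'bbbb', b'cccc']
print(held)      # [b'cccc', b'bbbb', b'cccc']
```

In the unsafe case the first view has been overwritten by the third chunk, and nothing raises an error. That is the silent corruption the rule warns about.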
A UserWarning fires when copy=False and buffer_size < total_tensor_size.
Both buffer_size and total_tensor_size are public attributes of the safe_open object.
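To estimate ahead of time how many tensor bytes a file holds, the Safetensors header can be read directly: the file starts with an 8-byte little-endian header length, followed by a JSON header that records data_offsets for each tensor. The helper below is a sketch with a hypothetical name, and whether InstantTensor computes total_tensor_size the same way (for example under distributed loading) is an assumption.

```python
import json
import struct


def tensor_bytes_in_file(path):
    """Sum the byte ranges listed in a Safetensors header, without reading tensor data."""
    with open(path, "rb") as fh:
        (header_len,) = struct.unpack("<Q", fh.read(8))  # 8-byte little-endian length
        header = json.loads(fh.read(header_len))
    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry, not a tensor
            continue
        begin, end = info["data_offsets"]
        total += end - begin
    return total
```

Summing this over all shards gives a figure to compare against buffer_size before choosing copy=False.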
See tests/test.py for a full benchmark harness (TP/PP grouping, checksums, etc.).
Backend selection
InstantTensor selects an I/O backend automatically by default. You can provide
one or more backend candidates with the backend argument to safe_open, or
with the INSTANTTENSOR_BACKEND environment variable when backend=None.
InstantTensor tries the candidates in order and uses the first backend that is
supported by the file system and available on the current system.
Supported backend values are Backend.AIO, Backend.AIO_BUFFERED,
Backend.URING, Backend.URING_BUFFERED, Backend.CUFILE, and
Backend.MMAP. The backend argument accepts a single Backend or a list of
Backend/BackendPolicy values:
from instanttensor import Backend, BackendPolicy, safe_open
safe_open("model.safetensors", framework="pt", device=0, backend=Backend.URING)
safe_open("model.safetensors", framework="pt", device=0, backend=[Backend.URING, Backend.AIO])
safe_open("model.safetensors", framework="pt", device=0, backend=BackendPolicy.BUFFERED)
BackendPolicy.BUFFERED expands to [Backend.URING_BUFFERED, Backend.AIO_BUFFERED, Backend.MMAP]. This is a good choice when you want Buffered I/O.
INSTANTTENSOR_BACKEND accepts comma-separated backend or policy names:
INSTANTTENSOR_BACKEND=URING,AIO
INSTANTTENSOR_BACKEND=BUFFERED
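How a comma-separated value turns into an ordered candidate list can be modelled in a few lines. This is an illustration of the behaviour described above, not InstantTensor's parser: parse_backend_spec is a hypothetical name, backends are plain strings rather than the library's enums, and the whitespace stripping and duplicate removal are assumptions.

```python
# Backend names and the BUFFERED expansion are taken from the text above.
BACKENDS = {"AIO", "AIO_BUFFERED", "URING", "URING_BUFFERED", "CUFILE", "MMAP"}
POLICIES = {"BUFFERED": ["URING_BUFFERED", "AIO_BUFFERED", "MMAP"]}


def parse_backend_spec(spec):
    """Turn 'URING,AIO' or 'BUFFERED' into an ordered list of backend names."""
    candidates = []
    for token in spec.split(","):
        name = token.strip()
        if not name:
            continue
        if name in POLICIES:
            expanded = POLICIES[name]
        elif name in BACKENDS:
            expanded = [name]
        else:
            raise ValueError(f"unknown backend or policy: {name!r}")
        for backend in expanded:
            if backend not in candidates:  # keep first occurrence, preserve order
                candidates.append(backend)
    return candidates


print(parse_backend_spec("URING,AIO"))  # ['URING', 'AIO']
print(parse_backend_spec("BUFFERED"))   # ['URING_BUFFERED', 'AIO_BUFFERED', 'MMAP']
```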
Backends are used in different file-system and I/O scenarios:
- In-memory file systems (available backends: MMAP, URING_BUFFERED, AIO_BUFFERED): when model files are stored on tmpfs or ramfs, MMAP provides the best compatibility and performance for this case. The other backends are usually slower for in-memory files.
- Regular file systems: InstantTensor can use either Direct I/O or Buffered I/O.
  - Direct I/O (available backends: AIO, URING, CUFILE) is best when a model is loaded once for a long-running workload. It avoids page-cache cold-start effects and reduces page-cache pollution. When choosing manually, URING may be faster on newer platforms. AIO has the broadest platform compatibility. CUFILE requires GPUDirect Storage support, and its higher throughput can be offset by cuFile initialization overhead.
  - Buffered I/O (available backends: AIO_BUFFERED, URING_BUFFERED, MMAP) is best when the same model is loaded repeatedly within a short period. Later reads can benefit from the page cache, though the first read is usually slower than Direct I/O. URING_BUFFERED is preferred on platforms with io_uring support; AIO_BUFFERED provides a more compatible option, while MMAP is available but usually not preferred.
If no backend is specified, InstantTensor tries [URING, AIO] for regular disk
files. For tmpfs/ramfs files, it uses MMAP. If none of the requested
candidates can be used, InstantTensor raises an error listing why each candidate
was rejected.
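The fallback behaviour described above (try candidates in order, use the first usable one, otherwise fail with the reasons) can be sketched as below. This models the documented behaviour and is not InstantTensor's implementation: the function names, the is_usable probe, and the use of RuntimeError are all assumptions.

```python
def default_candidates(fs_type):
    """Defaults from the text: MMAP for tmpfs/ramfs, otherwise URING then AIO."""
    return ["MMAP"] if fs_type in ("tmpfs", "ramfs") else ["URING", "AIO"]


def select_backend(candidates, is_usable):
    """Return the first usable candidate.

    is_usable(name) must return (ok, reason); reason explains a rejection.
    Raises RuntimeError listing every rejection if no candidate is usable.
    """
    rejections = []
    for name in candidates:
        ok, reason = is_usable(name)
        if ok:
            return name
        rejections.append(f"{name}: {reason}")
    raise RuntimeError("no usable backend (" + "; ".join(rejections) + ")")


# Example probe for a host without io_uring (hypothetical result).
def probe(name):
    if name == "URING":
        return False, "io_uring is not available on this kernel"
    return True, ""


print(select_backend(default_candidates("ext4"), probe))  # AIO
```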
Thanks
Thanks to the AI Systems and Optimization team at ScitiX AI and the Wenfei Wu Lab at Peking University.