An ultra-fast, distributed Safetensors loader

InstantTensor

InstantTensor is an ultra-fast, distributed Safetensors loader designed to maximize I/O throughput when moving model weights from Safetensors files to GPU memory.

Model loading benchmark on inference engines

Model          GPU     Backend        Load Time (s)  Throughput (GB/s)  Speedup
Qwen3-30B-A3B  1*H200  Safetensors    57.4           1.1                1x
Qwen3-30B-A3B  1*H200  InstantTensor  1.77           35                 32.4x
DeepSeek-R1    8*H200  Safetensors    160            4.3                1x
DeepSeek-R1    8*H200  InstantTensor  15.3           45                 10.5x

See Benchmark for full benchmarks.
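As a quick sanity check on the table above, the Speedup column appears to be the ratio of the baseline load time to the InstantTensor load time, and load time multiplied by throughput gives a rough implied checkpoint size. The sketch below only redoes that arithmetic on the published numbers; the helper names are illustrative, and the implied sizes are approximate because the table values are rounded.

```python
# Rows copied from the benchmark table: (model, backend, load_time_s, throughput_gb_s)
rows = [
    ("Qwen3-30B-A3B", "Safetensors", 57.4, 1.1),
    ("Qwen3-30B-A3B", "InstantTensor", 1.77, 35),
    ("DeepSeek-R1", "Safetensors", 160, 4.3),
    ("DeepSeek-R1", "InstantTensor", 15.3, 45),
]


def speedup(baseline_s, time_s):
    """Ratio of the baseline load time to the measured load time."""
    return baseline_s / time_s


def implied_size_gb(load_time_s, throughput_gb_s):
    """Approximate bytes moved, derived from time and throughput."""
    return load_time_s * throughput_gb_s


print(round(speedup(57.4, 1.77), 1))  # 32.4
print(round(speedup(160, 15.3), 1))   # 10.5

for model, backend, load_time_s, throughput in rows:
    size = implied_size_gb(load_time_s, throughput)
    print(f"{model:14s} {backend:14s} ~{size:.0f} GB")
```

Both backends imply nearly the same size for each model, which suggests the time and throughput columns are consistent with each other.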

Quickstart

from instanttensor import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor

Yielded tensors own their memory by default (copy=True). For zero-copy streaming into preallocated storage, see Zero-copy mode.

See Usage for multi-file and distributed usage.

Why InstantTensor?

  • Fast weight loading
    • Direct I/O: Avoid the slow page cache allocation on cold start. Friendly for large models and tight memory budgets.
    • Tuned I/O size and concurrency: Maximize hardware throughput.
    • Pipelining and prefetching: Parallelize and overlap the various stages of transmission.
  • Distributed loading
    • Use torch.distributed (NCCL) to speed up loading under any parallelism policy (TP/PP/EP/CP/DP).
  • Multiple I/O backends
    • Supports multiple backends: GPUDirect Storage, Legacy Storage, and Memory-based Storage.
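To make the pipelining and prefetching idea concrete, here is a toy sketch in plain Python. It is not InstantTensor's implementation: the `pipelined` helper and its `read` and `transfer` callbacks are hypothetical stand-ins for a storage read stage and a host-to-device copy stage. A bounded queue lets the reader run up to `depth` chunks ahead of the consumer, so the two stages overlap instead of running back to back.

```python
import queue
import threading


def pipelined(chunks, read, transfer, depth=2):
    """Overlap read(chunk) with transfer(data) through a bounded prefetch queue."""
    q = queue.Queue(maxsize=depth)  # at most `depth` chunks are read ahead
    done = object()                 # sentinel marking the end of the stream
    errors = []

    def producer():
        try:
            for chunk in chunks:
                q.put(read(chunk))  # blocks while the queue is full
        except Exception as exc:    # remember the failure for the caller
            errors.append(exc)
        finally:
            q.put(done)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        item = q.get()
        if item is done:
            break
        results.append(transfer(item))  # runs while the producer keeps reading
    worker.join()
    if errors:
        raise errors[0]
    return results


# Stand-in stages: "read" doubles the chunk index, "transfer" adds one.
print(pipelined(range(4), lambda i: i * 2, lambda x: x + 1))  # [1, 3, 5, 7]
```

A larger `depth` hides more read latency at the cost of more staging memory, which is the same trade-off a real loader has to tune.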

When to Use InstantTensor

InstantTensor is recommended if any of the following conditions are met:

  • High storage bandwidth (>= 5 GB/s).
  • Unable to keep the model cached in host memory, for example:
    • Limited free memory for model caching (for example, when most memory is used for KV cache offloading in LLM serving).
    • Infrequent model loading, where Linux page cache is less effective.
    • Model switching, where multiple models cannot be cached in memory simultaneously.
  • The model is heavily sharded (for example, TP=8), resulting in small, non-contiguous I/O per GPU.
  • Loading from tmpfs.

Installation

Requirements

  • GPU platforms: CUDA, ROCm
  • Framework: PyTorch

Method 1: Install from pip

pip install instanttensor

Method 2: Build from source

git clone https://github.com/scitix/InstantTensor.git
cd InstantTensor
./checkout_submodules.sh
pip install .
# For a debug build, set DEBUG=1 in the environment before running pip, e.g. "DEBUG=1 pip install ."

Usage

Multi-file loading

Passing a list of files allows the backend to plan reads and provides higher throughput than making multiple calls to load single files:

from instanttensor import safe_open

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor
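When a checkpoint has many shards, the file list can be built from the directory rather than typed by hand. The helper below is a minimal sketch: `list_shards` is a hypothetical name, and it assumes all shards sit in one directory and end in `.safetensors`.

```python
from pathlib import Path


def list_shards(model_dir):
    """Return every *.safetensors file in model_dir, sorted by name."""
    return sorted(str(path) for path in Path(model_dir).glob("*.safetensors"))
```

The resulting list can be passed to safe_open in place of `files` above. Sorting keeps the order deterministic across ranks and runs, and with zero-padded shard names it also matches the shard numbering.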

Distributed loading

InstantTensor can use a torch.distributed NCCL process group to coordinate loading across ranks, which achieves higher throughput than running safe_open independently on each GPU.

import torch
import torch.distributed as dist
from instanttensor import safe_open

dist.init_process_group(backend="nccl")
process_group = dist.GroupMember.WORLD

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=torch.cuda.current_device(), process_group=process_group) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor

NOTE: You can also load weights using a subgroup created via dist.new_group, which allows multiple subgroups to load weights independently. For example, if you have TP=8 and PP=2 (i.e., two TP groups), you can create two subgroups and load weights independently on each TP group. In cross-node (multi-machine) scenarios, loading using per-node subgroups can sometimes be faster than loading on the world group. However, for most cases, the world group is a good default choice.
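The subgroup setup described in the note can be sketched as follows. This is an illustration, not code from the InstantTensor repository: the helper names are hypothetical, and the sketch assumes that the ranks of each TP group are contiguous (ranks 0-7 and 8-15 for TP=8, PP=2), which depends on how your inference engine lays out ranks. Note that torch.distributed requires every rank to call `dist.new_group` for every subgroup, in the same order, including subgroups it does not belong to.

```python
def tp_group_ranks(world_size, tp_size):
    """Split ranks 0..world_size-1 into contiguous groups of tp_size ranks."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be a multiple of tp_size")
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]


def make_loading_group(rank, world_size, tp_size):
    """Create all TP subgroups and return the one containing `rank`."""
    import torch.distributed as dist  # imported lazily; needs an initialized process group

    my_group = None
    for ranks in tp_group_ranks(world_size, tp_size):
        group = dist.new_group(ranks=ranks)  # collective: every rank must make this call
        if rank in ranks:
            my_group = group
    return my_group


print(tp_group_ranks(16, 8))
# [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]
```

The returned group is then passed as `process_group` to safe_open, in place of `dist.GroupMember.WORLD` in the example above.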

Buffered I/O

For regular disk files, InstantTensor defaults to Direct I/O to prioritize cold load performance. This is usually the right choice when a model is loaded once for a long-running workload.

If the same model is loaded repeatedly within a short period, Buffered I/O can be faster because later reads may benefit from the page cache. Enable it with the backend=BackendPolicy.BUFFERED argument to safe_open, or, when backend=None, by setting the INSTANTTENSOR_BACKEND=BUFFERED environment variable.

Zero-copy mode

Pass copy=False to skip the per-tensor clone and yield views into the internal ring buffer:

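from instanttensor import safe_open

# files: list of Safetensors paths; model_param: dict mapping names to preallocated tensors
with safe_open(files, framework="pt", device=0, copy=False) as f:
    for name, tensor in f.tensors():
        model_param[name].copy_(tensor)  # copy out of the ring buffer before the next tensor is yielded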

Two rules:

  1. Consume each tensor before the next is yielded. Patterns that collect the views first, such as list(f.tensors()), silently corrupt data when buffer_size < total_tensor_size, because the ring buffer is reused.
  2. Do not keep references past the with block, because the buffer is freed on exit.

A UserWarning is raised when copy=False and buffer_size < total_tensor_size. Both buffer_size and total_tensor_size are public attributes of the safe_open object.
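The first rule is easier to see with a toy model. The sketch below does not use InstantTensor at all: `stream_views` is a hypothetical stand-in that yields memoryview slices of a single reused bytearray, the way a ring buffer hands out views. When the buffer is smaller than the total payload, views that are kept around end up pointing at overwritten bytes.

```python
def stream_views(payloads, buffer_size):
    """Toy ring buffer: yields views into one bytearray that is reused on wrap-around."""
    buf = bytearray(buffer_size)
    view = memoryview(buf)
    offset = 0
    for data in payloads:
        n = len(data)
        if n > buffer_size:
            raise ValueError("payload larger than buffer")
        if offset + n > buffer_size:
            offset = 0  # wrap around and overwrite the oldest bytes
        view[offset:offset + n] = data
        yield view[offset:offset + n]
        offset += n


payloads = [b"aaaa", b"bbbb", b"cccc"]  # 12 bytes in total

# Safe: copy each view out before asking for the next one.
safe = [bytes(v) for v in stream_views(payloads, buffer_size=8)]
print(safe)   # [b'aaaa', b'bbbb', b'cccc']

# Unsafe: collect all views first, then read them.
stale = [bytes(v) for v in list(stream_views(payloads, buffer_size=8))]
print(stale)  # [b'cccc', b'bbbb', b'cccc']
```

With `buffer_size=12` the unsafe pattern happens to return the right bytes, which mirrors the condition buffer_size >= total_tensor_size under which no warning is raised.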

See tests/test.py for a full benchmark harness (TP/PP grouping, checksums, etc.).

Backend selection

InstantTensor selects an I/O backend automatically by default. You can provide one or more backend candidates with the backend argument to safe_open, or with the INSTANTTENSOR_BACKEND environment variable when backend=None. InstantTensor tries the candidates in order and uses the first backend that is supported by the file system and available on the current system.

Supported backend values are Backend.AIO, Backend.AIO_BUFFERED, Backend.URING, Backend.URING_BUFFERED, Backend.CUFILE, and Backend.MMAP. The backend argument accepts a single Backend or a list of Backend/BackendPolicy values:

from instanttensor import Backend, BackendPolicy, safe_open

safe_open("model.safetensors", framework="pt", device=0, backend=Backend.URING)
safe_open("model.safetensors", framework="pt", device=0, backend=[Backend.URING, Backend.AIO])
safe_open("model.safetensors", framework="pt", device=0, backend=BackendPolicy.BUFFERED)

BackendPolicy.BUFFERED expands to [Backend.URING_BUFFERED, Backend.AIO_BUFFERED, Backend.MMAP]. This is a good choice when you want Buffered I/O.

INSTANTTENSOR_BACKEND accepts comma-separated backend or policy names:

INSTANTTENSOR_BACKEND=URING,AIO
INSTANTTENSOR_BACKEND=BUFFERED

Backends are used in different file-system and I/O scenarios:

  • In-memory file systems (available backends: MMAP, URING_BUFFERED, AIO_BUFFERED): when model files are stored on tmpfs or ramfs, MMAP provides the best compatibility and performance. The other backends are usually slower for in-memory files.
  • Regular file systems: InstantTensor can use either Direct I/O or Buffered I/O.
    • Direct I/O (available backends: AIO, URING, CUFILE) is best when a model is loaded once for a long-running workload. It avoids page-cache cold-start effects and reduces page-cache pollution. When choosing manually, URING may be faster on newer platforms. AIO has the broadest platform compatibility. CUFILE requires GPUDirect Storage support, and its higher throughput can be offset by cuFile initialization overhead.
    • Buffered I/O (available backends: AIO_BUFFERED, URING_BUFFERED, MMAP) is best when the same model is loaded repeatedly within a short period. Later reads can benefit from the page cache, though the first read is usually slower than Direct I/O. URING_BUFFERED is preferred on platforms with io_uring support; AIO_BUFFERED provides a more compatible option, while MMAP is available but usually not preferred.

If no backend is specified, InstantTensor tries [URING, AIO] for regular disk files. For tmpfs/ramfs files, it uses MMAP. If none of the requested candidates can be used, InstantTensor raises an error listing why each candidate was rejected.
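The selection rules in this section can be modelled in a few lines of plain Python. This is a sketch of the documented behaviour, not the library's code: `expand_candidates`, `pick_backend`, and the `check` callback are hypothetical names, backends are represented as strings rather than Backend enum members, and stripping whitespace around names is an assumption.

```python
BACKENDS = {"AIO", "AIO_BUFFERED", "URING", "URING_BUFFERED", "CUFILE", "MMAP"}
POLICIES = {"BUFFERED": ["URING_BUFFERED", "AIO_BUFFERED", "MMAP"]}


def expand_candidates(spec):
    """Turn a comma-separated spec such as "URING,AIO" into an ordered backend list."""
    candidates = []
    for name in spec.split(","):
        name = name.strip()  # assumption: surrounding whitespace is ignored
        if name in POLICIES:
            candidates.extend(POLICIES[name])  # a policy expands to several backends
        elif name in BACKENDS:
            candidates.append(name)
        else:
            raise ValueError(f"unknown backend or policy: {name!r}")
    return candidates


def pick_backend(candidates, check):
    """Return the first usable candidate; check(name) returns None or a rejection reason."""
    rejected = {}
    for name in candidates:
        reason = check(name)
        if reason is None:
            return name
        rejected[name] = reason
    details = "; ".join(f"{name}: {why}" for name, why in rejected.items())
    raise RuntimeError(f"no usable backend ({details})")


print(expand_candidates("BUFFERED"))  # ['URING_BUFFERED', 'AIO_BUFFERED', 'MMAP']

# Pretend io_uring is unavailable on this machine.
no_uring = lambda name: "io_uring unavailable" if name.startswith("URING") else None
print(pick_backend(expand_candidates("URING,AIO"), no_uring))  # AIO
```

Listing a fallback after the preferred backend, as in `URING,AIO`, keeps the same configuration usable on older kernels.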

API reference

See Build API reference

Thanks

Thanks to the AI Systems and Optimization team at ScitiX AI and the Wenfei Wu Lab at Peking University.
