High-performance safetensors model loader

Project description

fastsafetensors is an efficient safetensors model loader. This library is tested with Python 3.10-3.13 and PyTorch 2.1-2.7.

Disclaimer: This repository contains a research prototype. It should be used with caution.

Features

We introduced three major features to optimize model loading performance:

Batched, lazy tensor instantiation.
GPU offloading for sharding, type conversions, and device pointer alignment.
GPU Direct Storage enablement for file loading from storage to GPU memory.

A major design difference from the original safetensors file loader is that fastsafetensors does NOT use mmap. The original loader loads tensors on demand from memory-mapped files, but unfortunately, it cannot fully utilize high-throughput I/O such as NVMe SSDs. Therefore, we asynchronously transfer files in parallel to saturate storage throughput. The loader then lazily instantiates tensors in GPU device memory with DLPack.
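The idea of reading whole files in parallel and only later carving tensors out of the buffers can be sketched with the standard library alone. This is a toy illustration, not the fastsafetensors implementation: it uses host memory and threads instead of GPU memory and DLPack, and the helper names (write_toy_safetensors, load_parallel, and so on) are hypothetical. It relies only on the public safetensors file layout: an 8-byte little-endian header length, a JSON header, then the raw tensor bytes.

```python
import json
import os
import struct
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_toy_safetensors(path, tensors):
    """Write {name: bytes} as a minimal safetensors-style file (dtype U8 only)."""
    header, offset = {}, 0
    for name, data in tensors.items():
        header[name] = {
            "dtype": "U8",
            "shape": [len(data)],
            "data_offsets": [offset, offset + len(data)],
        }
        offset += len(data)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)
        for data in tensors.values():
            f.write(data)


def read_whole_file(path):
    """Plain read() of the entire file into one buffer (no mmap)."""
    with open(path, "rb") as f:
        return f.read()


def parse_header(buf):
    """Return (header dict, offset at which the tensor data region starts)."""
    (header_len,) = struct.unpack_from("<Q", buf, 0)
    header = json.loads(bytes(buf[8:8 + header_len]).decode("utf-8"))
    return header, 8 + header_len


def load_parallel(paths, max_workers=4):
    """Read all files concurrently, then index zero-copy views per tensor."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        buffers = list(pool.map(read_whole_file, paths))
    index = {}
    for buf in buffers:
        header, base = parse_header(buf)
        view = memoryview(buf)
        for name, meta in header.items():
            if name == "__metadata__":  # optional free-form metadata entry
                continue
            begin, end = meta["data_offsets"]
            # Slicing a memoryview copies nothing; bytes are materialized
            # only when the caller converts the view.
            index[name] = view[base + begin:base + end]
    return index


with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, f"{i}.safetensors") for i in range(2)]
    write_toy_safetensors(paths[0], {"a0": b"\x01\x02", "a1": b"\x03"})
    write_toy_safetensors(paths[1], {"b0": b"\x04\x05\x06"})
    tensors = load_parallel(paths)
    loaded = {name: bytes(view) for name, view in tensors.items()}
```

The file reads are issued up front and in bulk, while per-tensor work is deferred to the moment a tensor is actually requested.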

Another design change is to offload sharding and other tensor manipulations to GPUs. The original loader provides slicing so that user programs can shard tensors before copying them to device memory. However, this slicing incurs high CPU usage for host memory accesses. Therefore, we introduce special APIs to run sharding with torch.distributed collective operations such as broadcast and scatter. The offloading is also applied to other tensor manipulations such as type conversions.
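To make "sharding" concrete, the sketch below shows the index arithmetic for splitting one dimension of a weight across ranks. It is only an assumption about a typical even-block layout, written with plain Python lists on the host. It does not show how fastsafetensors shards on the GPU, and the function names are hypothetical.

```python
def shard_bounds(dim_size, rank, world_size):
    """Half-open [start, stop) range owned by `rank` along one dimension."""
    block = (dim_size + world_size - 1) // world_size  # ceiling division
    start = min(rank * block, dim_size)
    stop = min(start + block, dim_size)
    return start, stop


def shard_rows(matrix, rank, world_size):
    """Slice along dimension 0 (rows)."""
    start, stop = shard_bounds(len(matrix), rank, world_size)
    return matrix[start:stop]


def shard_cols(matrix, rank, world_size):
    """Slice along dimension 1 (columns)."""
    start, stop = shard_bounds(len(matrix[0]), rank, world_size)
    return [row[start:stop] for row in matrix]


# A 4x4 toy weight split across two ranks.
weight = [[r * 4 + c for c in range(4)] for r in range(4)]
row_shards = [shard_rows(weight, rank, 2) for rank in range(2)]
col_shards = [shard_cols(weight, rank, 2) for rank in range(2)]
```

Doing this slicing on the host for every weight is the CPU-side cost the paragraph above describes. The library instead moves the equivalent work to the device through collective operations.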

The above two designs can be naturally extended to utilize device-to-device data transfers with GPU Direct Storage. The technology helps minimize copy overheads from NVMe SSDs to GPU memory by bypassing host CPU and memory.

Basic API usage

SafeTensorsFileLoader is a low-level entrypoint. To use it, pass either SingleGroup() for simple inference or ProcessGroup() (from torch.distributed) for tensor-parallel inference. The loader supports both CPU and CUDA devices, with optional GPU Direct Storage (GDS) support. You can specify the device and GDS settings using the device and nogds arguments, respectively. Note that if GDS is not available, the loader will fail to open files when nogds=False. For more information on enabling GDS, please refer to the NVIDIA documentation.

After creating a SafeTensorsFileLoader instance, first map target files to ranks using the .add_filenames() method. Then, call .copy_files_to_device() to trigger the actual file copies into aggregated GPU memory fragments and directly instantiate a group of tensors. Once the files are loaded, you can retrieve a tensor using the .get_tensor() method. Additionally, you can obtain sharded tensors with .get_sharded(), which internally runs collective operations in torch.distributed.
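One way to prepare the file-to-rank mapping is a simple round-robin assignment, sketched below. The helper name assign_files_to_ranks is hypothetical, and the sketch assumes that .add_filenames() accepts a dictionary from rank to a list of file names. Check the signature in the version you have installed before relying on it.

```python
def assign_files_to_ranks(filenames, world_size):
    """Round-robin file names over ranks; every rank gets a (possibly empty) list."""
    mapping = {rank: [] for rank in range(world_size)}
    for i, name in enumerate(sorted(filenames)):
        mapping[i % world_size].append(name)
    return mapping


# Example file names are placeholders for illustration only.
rank_file_map = assign_files_to_ranks(
    [
        "model-00002-of-00003.safetensors",
        "model-00001-of-00003.safetensors",
        "model-00003-of-00003.safetensors",
    ],
    world_size=2,
)
```

Spreading files across ranks lets each process read a different subset, and the sharding step then distributes tensor data to the ranks that need it.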

Important: To release the GPU memory allocated for tensors, you must explicitly call the .close() method. This is because fastsafetensors allows multiple tensors to share a limited number of GPU memory fragments. As a result, it is the user's responsibility to ensure that all tensors are properly released before calling .close(), which will then safely release the underlying GPU memory.

fastsafe_open is an easier entrypoint. Passing nogds=True forces GDS off and runs the loader in fallback mode. However, users must still be aware of the memory management model described above, which should be fixed in future releases.

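from fastsafetensors import fastsafe_open

filename = "model.safetensors"  # placeholder path; point this at your own file
with fastsafe_open(filenames=[filename], nogds=True, device="cpu", debug_log=True) as f:
    for key in f.get_keys():
        t = f.get_tensor(key).clone().detach()  # clone if t is used outside the with block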
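The reason for cloning can be illustrated with a toy model of a shared memory fragment. The sketch uses a host bytearray and memoryview in place of GPU memory and tensors. ToyFragment is a hypothetical stand-in, not a fastsafetensors class, and zeroing the buffer merely simulates the memory being released and reused.

```python
class ToyFragment:
    """Stand-in for one shared memory fragment that backs several tensors."""

    def __init__(self, payload):
        self._buf = bytearray(payload)
        self.closed = False

    def view(self, begin, end):
        # Zero-copy window into the shared buffer, like a tensor backed by it.
        return memoryview(self._buf)[begin:end]

    def close(self):
        # Simulate release and reuse of the memory by overwriting its contents.
        for i in range(len(self._buf)):
            self._buf[i] = 0
        self.closed = True


frag = ToyFragment(b"\x01\x02\x03\x04")
borrowed = frag.view(0, 2)      # still points into the fragment
owned = bytes(frag.view(2, 4))  # independent copy, analogous to .clone()
frag.close()

borrowed_after_close = bytes(borrowed)
```

After close(), the borrowed view no longer holds the original data, while the copy taken beforehand is unaffected. The same reasoning applies to tensors that must outlive the loader.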

Development

Pre-commit Hooks

Our CI workflow checks code formatting and linting with Python 3.13. Therefore, we recommend testing your code with Python 3.13 and running the following pre-commit hooks before contributing your code.

To set up:

Install development dependencies:

pip install -e ".[dev]"

Install pre-commit hooks:

pre-commit install

Now, every time you commit, the following checks will run automatically:

black: Code formatting
isort: Import sorting
flake8: Basic linting (syntax errors, undefined names)
mypy: Type checking
trailing-whitespace: Remove trailing whitespace
end-of-file-fixer: Ensure files end with a newline
check-yaml: Validate YAML files
check-toml: Validate TOML files
check-merge-conflict: Detect merge conflict markers
debug-statements: Detect debug statements

To manually run pre-commit on all files:

pre-commit run --all-files

To skip pre-commit hooks (not recommended):

git commit --no-verify

Code of Conduct

Please refer to the Foundation Model Stack Community Code of Conduct.

Publication

Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman (2025). "Speeding up Model Loading with fastsafetensors." arXiv:2505.23072 and IEEE CLOUD 2025.

For NVIDIA

Install from PyPI

See https://pypi.org/project/fastsafetensors/

pip install fastsafetensors

Install from source

pip install .

For ROCm

ROCm has no GDS-equivalent support, so on ROCm fastsafetensors supports only the nogds=True mode. An example of the performance gain can be found in amd-perf.md.

Install from GitHub Source

ROCM_PATH=/opt/rocm pip install git+https://github.com/foundation-model-stack/fastsafetensors.git

Install from source

ROCM_PATH=/opt/rocm pip install .

Release history

0.3.1

May 6, 2026

This version

0.3

Apr 22, 2026

0.2.2

Feb 10, 2026

0.2.0

Dec 17, 2025

0.1.15

Jul 18, 2025

0.1.14

Jun 23, 2025

0.1.13

May 19, 2025

0.1.12

Apr 7, 2025

0.1.11 yanked

Apr 4, 2025

Reason this release was yanked:

regressions on vllm

0.1.10

Dec 16, 2024

0.1.9

Nov 1, 2024

0.1.8

Jul 26, 2024

0.1.7

Jun 24, 2024

0.1.5

Jun 11, 2024

0.1.4

Jun 7, 2024

0.1.3

Jun 7, 2024

0.1.2

Jun 5, 2024

0.1.1

May 27, 2024

0.1.0

May 23, 2024

Download files

Source Distribution

fastsafetensors-0.3.tar.gz (57.5 kB)

Uploaded Apr 22, 2026; Source

Built Distributions

fastsafetensors-0.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.9 MB)

Uploaded Apr 22, 2026; CPython 3.13, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

fastsafetensors-0.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.9 MB)

Uploaded Apr 22, 2026; CPython 3.12, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

fastsafetensors-0.3-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.8 MB)

Uploaded Apr 22, 2026; CPython 3.11, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

fastsafetensors-0.3-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.8 MB)

Uploaded Apr 22, 2026; CPython 3.10, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

File details

Details for the file fastsafetensors-0.3.tar.gz.

File metadata

Download URL: fastsafetensors-0.3.tar.gz
Upload date: Apr 22, 2026
Size: 57.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for fastsafetensors-0.3.tar.gz
Algorithm	Hash digest
SHA256	`89f392569d2281d1a966d3b64f99a6386149116e37eef4f4890168c87a8c4f19`
MD5	`42e285e3d149d7699621dba3dcb8ac7b`
BLAKE2b-256	`3998053c622e61bb766d31327a88215082320a4ba8bd6a62c4c5435221844103`
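
A downloaded file can be checked against a published SHA256 digest with Python's hashlib. The sketch below is a generic illustration with hypothetical helper names. It exercises itself on a throwaway file rather than on a real download, so compare against the digest listed for the file you actually fetched.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's digest with a hex digest copied from the listing."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Self-check on a temporary file with known contents.
payload = b"fastsafetensors" * 1000
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as f:
        f.write(payload)
    sample_digest = sha256_of_file(sample)
    sample_ok = matches_published_hash(sample, hashlib.sha256(payload).hexdigest())
```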

See more details on using hashes here.

File details

Details for the file fastsafetensors-0.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

Download URL: fastsafetensors-0.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Upload date: Apr 22, 2026
Size: 1.9 MB
Tags: CPython 3.13, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for fastsafetensors-0.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`881b8dd5ebc5a73691ef9695a2d224f05bc9c5a60a95e1329f13df784502ae24`
MD5	`70ccab71cf61dcda957692633e909ffc`
BLAKE2b-256	`7045459a11e31aec2e9b803ea19cd796b3b678435086d688c91c29d3f880c996`

File details

Details for the file fastsafetensors-0.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

Download URL: fastsafetensors-0.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Upload date: Apr 22, 2026
Size: 1.9 MB
Tags: CPython 3.12, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for fastsafetensors-0.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`3ce38241c5afedf94ef37531b8b8703016b2ea39350cfd33e819e65d4d5305e0`
MD5	`72ba326e3c0acd6be02d62ec60a44297`
BLAKE2b-256	`0a06bca80663bf8136f273643d149953dd29ca2c52aa4faac4b67506b871a5ec`

File details

Details for the file fastsafetensors-0.3-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

Download URL: fastsafetensors-0.3-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Upload date: Apr 22, 2026
Size: 1.8 MB
Tags: CPython 3.11, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for fastsafetensors-0.3-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`7e875afdc7e76bc0ddf46fd4b32db9f232543a8dea383dc7eb9de8f1dcd9e090`
MD5	`ba5ed156b659d55653a1923be8c56f3f`
BLAKE2b-256	`7cfc78ca177fe45fa5ea0020b5a570cbe5a59cb9b3b4ff49e011261c75711634`

File details

Details for the file fastsafetensors-0.3-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

Download URL: fastsafetensors-0.3-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Upload date: Apr 22, 2026
Size: 1.8 MB
Tags: CPython 3.10, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for fastsafetensors-0.3-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`36ce31f6d623e7b32dc509d23b430020afe4fc9696323ac5e57bab87024bea59`
MD5	`40d0e50dd038a682accc1a85d561157a`
BLAKE2b-256	`42440577a37b7c26a7b9fc6352f6d45e83ea933e3c8bd65db5ce7c421eaf5f5b`
