High-performance safetensors model loader

fastsafetensors is an efficient safetensors model loader. This library is tested with Python 3.10-3.13 and PyTorch 2.1-2.7.

Disclaimer: This repository contains a research prototype. It should be used with caution.

Features

We introduced three major features to optimize model loading performance:

  1. Batched, lazy tensor instantiation.
  2. GPU offloading for sharding, type conversions, and device pointer alignment.
  3. GPU Direct Storage enablement for file loading from storage to GPU memory.

A major design difference from the original safetensors file loader is that fastsafetensors does NOT use mmap. The original loader loads tensors on demand from memory-mapped files, but this access pattern cannot fully utilize high-throughput storage such as NVMe SSDs. Instead, fastsafetensors transfers files asynchronously and in parallel to saturate storage throughput. The loader then lazily instantiates tensors in GPU device memory with DLPack.
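The idea of replacing on-demand mmap page faults with explicit parallel reads can be pictured with a small standard-library sketch. This is only an illustration under stated assumptions: `read_chunk` and `parallel_read` are hypothetical names, the destination is host memory rather than GPU memory, and a thread pool stands in for whatever asynchronous I/O mechanism fastsafetensors actually uses.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path: str, offset: int, length: int, dest: bytearray) -> int:
    """Read `length` bytes at `offset` into the matching slice of `dest`."""
    with open(path, "rb", buffering=0) as fh:
        fh.seek(offset)
        view = memoryview(dest)[offset:offset + length]
        done = 0
        while done < length:
            n = fh.readinto(view[done:])
            if not n:
                break  # unexpected end of file
            done += n
        return done


def parallel_read(path: str, chunk_size: int = 1 << 20, workers: int = 4) -> bytearray:
    """Fill one preallocated buffer using several concurrent chunk reads."""
    size = os.path.getsize(path)
    dest = bytearray(size)  # single destination buffer, allocated up front
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(read_chunk, path, off, min(chunk_size, size - off), dest)
            for off in range(0, size, chunk_size)
        ]
        total = sum(f.result() for f in futures)
    if total != size:
        raise IOError(f"short read: {total} of {size} bytes")
    return dest
```

Each worker writes into a disjoint slice of the same buffer, so no extra copy is needed after the reads finish. Keeping many requests in flight at once is what lets a fast device stay busy.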

Another design change is to offload sharding and other tensor manipulations to GPUs. The original loader provides slicing so that user programs can shard tensors before copying them to device memory. However, this slicing incurs high CPU usage for host memory accesses. Therefore, we introduce special APIs that run sharding with torch.distributed collective operations such as broadcast and scatter. The same offloading applies to other tensor manipulations such as type conversions.
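To make the sharding step concrete, the following plain-Python sketch shows only the index arithmetic of splitting one dimension across ranks, which is the partition a scatter would deliver. It is a sketch under assumptions: `shard_bounds` and `simulate_scatter` are hypothetical helpers, the ceiling-division split is one common convention rather than a documented fastsafetensors rule, and the real library performs the split on GPUs through torch.distributed instead of on Python lists.

```python
from typing import List, Sequence, Tuple


def shard_bounds(dim_size: int, rank: int, world_size: int) -> Tuple[int, int]:
    """Return the [start, stop) range that `rank` owns along one dimension.

    Ceiling division gives the first ranks full blocks; the last ranks may
    receive a shorter or empty block.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    block = (dim_size + world_size - 1) // world_size
    start = min(rank * block, dim_size)
    stop = min(start + block, dim_size)
    return start, stop


def simulate_scatter(rows: Sequence, world_size: int) -> List[list]:
    """Split `rows` into one shard per rank, in rank order."""
    shards = []
    for rank in range(world_size):
        start, stop = shard_bounds(len(rows), rank, world_size)
        shards.append(list(rows[start:stop]))
    return shards
```

For example, `simulate_scatter(list(range(10)), 4)` yields three shards of three rows and a final shard of one row.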

The above two designs can be naturally extended to utilize device-to-device data transfers with GPU Direct Storage. The technology helps minimize copy overheads from NVMe SSDs to GPU memory by bypassing host CPU and memory.

Basic API usage

SafeTensorsFileLoader is a low-level entrypoint. To use it, pass either SingleGroup() for simple inference or ProcessGroup() (from torch.distributed) for tensor-parallel inference. The loader supports both CPU and CUDA devices, with optional GPU Direct Storage (GDS) support. You can specify the device and GDS settings using the device and nogds arguments, respectively. Note that if GDS is not available, the loader will fail to open files when nogds=False. For more information on enabling GDS, please refer to the NVIDIA documentation.

After creating a SafeTensorsFileLoader instance, first map target files to a rank using the .add_filenames() method. Then, call .copy_file_to_device() to trigger the actual file copies into aggregated GPU memory fragments and directly instantiate a group of tensors. Once the files are loaded, you can retrieve a tensor using the .get_tensor() method. You can also obtain sharded tensors with .get_sharded(), which internally runs collective operations in torch.distributed.
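The copy-then-look-up flow can be pictured with a toy model: copy a whole file image into one buffer, then resolve tensor names to byte ranges through the safetensors header (an 8-byte little-endian length, a JSON header, then the raw data). The sketch below is not the fastsafetensors API. `build_file`, `LazyBuffer`, and `get_view` are hypothetical names, and everything stays in host memory.

```python
import json
import struct
from typing import Dict, List, Tuple


def build_file(tensors: Dict[str, Tuple[str, List[int], bytes]]) -> bytes:
    """Serialize {name: (dtype, shape, raw_bytes)} in the safetensors layout."""
    header, body, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],  # relative to the data section
        }
        body += raw
        offset += len(raw)
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + body


class LazyBuffer:
    """Holds one copied file image and resolves names to byte views on demand."""

    def __init__(self, image: bytes) -> None:
        (header_len,) = struct.unpack_from("<Q", image, 0)
        self._header = json.loads(image[8:8 + header_len])
        self._header.pop("__metadata__", None)  # optional free-form metadata entry
        self._data = memoryview(image)[8 + header_len:]

    def get_keys(self) -> List[str]:
        return list(self._header)

    def get_view(self, name: str) -> memoryview:
        start, stop = self._header[name]["data_offsets"]
        return self._data[start:stop]  # zero-copy slice of the shared buffer
```

Nothing is decoded until a name is requested, and every returned view aliases the same underlying buffer. That aliasing is the reason the release order discussed next matters.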

Important: To release the GPU memory allocated for tensors, you must explicitly call the .close() method. This is because fastsafetensors allows multiple tensors to share a limited number of GPU memory fragments. As a result, it is the user's responsibility to ensure that all tensors are properly released before calling .close(), which will then safely release the underlying GPU memory.
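The ordering rule (release every tensor first, then call .close()) can be mimicked with Python's own buffer protocol. The sketch below is an analogy only: `FragmentPool` is a hypothetical class, and it relies on the fact that a bytearray refuses to be resized while a memoryview still references it. fastsafetensors is not stated to detect live tensors this way, so treat the error here as a teaching aid rather than as library behavior.

```python
class FragmentPool:
    """Toy stand-in for a memory fragment shared by several tensor views."""

    def __init__(self, size: int) -> None:
        self._buf = bytearray(size)
        self._closed = False

    def view(self, start: int, stop: int) -> memoryview:
        """Return a view that aliases the pool's storage (no copy)."""
        if self._closed:
            raise RuntimeError("pool is closed")
        return memoryview(self._buf)[start:stop]

    def close(self) -> None:
        """Free the backing storage; raises BufferError while a view is alive."""
        self._buf.clear()  # resizing fails if any memoryview still exports the buffer
        self._closed = True


pool = FragmentPool(16)
v = pool.view(0, 8)
kept = bytes(v)   # copy out anything needed later, like .clone().detach()
v.release()       # drop the alias before closing
pool.close()      # now succeeds
```

Copying data out before releasing the view plays the same role as cloning a tensor that must outlive the loader.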

fastsafe_open is an easier entrypoint. Passing nogds=True forces GDS off and runs the loader in fallback mode. However, users must still be aware of the memory management model described above, which should be fixed in future releases.

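from fastsafetensors import fastsafe_open

filename = "model.safetensors"  # placeholder path; replace with your own file
with fastsafe_open(filenames=[filename], nogds=True, device="cpu", debug_log=True) as f:
    for key in f.get_keys():
        t = f.get_tensor(key).clone().detach()  # clone if t is used outside the with block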

Development

Pre-commit Hooks

Our CI workflow checks code formatting and linting with Python 3.13. Therefore, we recommend testing your code with Python 3.13 and running the following pre-commit hooks before contributing your code.

To set up:

  1. Install development dependencies:
     pip install -e ".[dev]"
  2. Install pre-commit hooks:
     pre-commit install

Now, every time you commit, the following checks will run automatically:

  • black: Code formatting
  • isort: Import sorting
  • flake8: Basic linting (syntax errors, undefined names)
  • mypy: Type checking
  • trailing-whitespace: Remove trailing whitespace
  • end-of-file-fixer: Ensure files end with a newline
  • check-yaml: Validate YAML files
  • check-toml: Validate TOML files
  • check-merge-conflict: Detect merge conflict markers
  • debug-statements: Detect debug statements

To manually run pre-commit on all files:

pre-commit run --all-files

To skip pre-commit hooks (not recommended):

git commit --no-verify

Code of Conduct

Please refer to Foundation Model Stack Community Code of Conduct.

Publication

Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman (2025). "Speeding up Model Loading with fastsafetensors." arXiv:2505.23072 and IEEE CLOUD 2025.

For NVIDIA

Install from PyPI

See https://pypi.org/project/fastsafetensors/

pip install fastsafetensors

Install from source

pip install .

For ROCm

ROCm has no GDS-equivalent support, so on ROCm fastsafetensors only supports nogds=True mode. An example of the performance gain can be found in amd-perf.md.

Install from GitHub source

ROCM_PATH=/opt/rocm pip install git+https://github.com/foundation-model-stack/fastsafetensors.git

Install from source

ROCM_PATH=/opt/rocm pip install .
