High-performance safetensors model loader

fastsafetensors is an efficient safetensors model loader. This library is tested with Python 3.10-3.13 and PyTorch 2.1-2.7.

Disclaimer: This repository contains a research prototype. It should be used with caution.

Features

fastsafetensors introduces three major features to optimize model loading performance:

  1. Batched, lazy tensor instantiation.
  2. GPU offloading for sharding, type conversions, and device pointer alignment.
  3. GPU Direct Storage enablement for file loading from storage to GPU memory.

A major design difference from the original safetensors file loader is that fastsafetensors does NOT use mmap. The original loader loads tensors on demand from memory-mapped files, an approach that cannot fully utilize high-throughput storage such as NVMe SSDs. Instead, fastsafetensors transfers files asynchronously and in parallel to saturate storage throughput. The loader then lazily instantiates tensors in GPU device memory with DLPack.
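The contrast with on-demand, memory-mapped reads can be illustrated with a minimal sketch that fills one preallocated host buffer from several threads. This is not the library's implementation: the function names (read_chunk, parallel_read), the chunk size, and the use of a thread pool are all assumptions made for illustration, and the sketch stays on the CPU with no GPU or GDS involvement.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, buf, offset, length):
    """Read `length` bytes at `offset` into the matching slice of `buf`."""
    # Hypothetical helper: each worker opens its own unbuffered handle.
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        view = memoryview(buf)[offset:offset + length]
        done = 0
        while done < length:
            n = f.readinto(view[done:])  # may return fewer bytes than asked
            if not n:
                raise EOFError(f"unexpected end of file at offset {offset + done}")
            done += n
    return length


def parallel_read(path, chunk_size=1 << 20, workers=4):
    """Fill one preallocated buffer from several threads, without mmap."""
    size = os.path.getsize(path)
    buf = bytearray(size)  # single destination buffer shared by all workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(read_chunk, path, buf, off, min(chunk_size, size - off))
            for off in range(0, size, chunk_size)
        ]
        total = sum(f.result() for f in futures)  # re-raises worker errors
    assert total == size
    return buf
```

Because every chunk is requested up front, the storage device sees many outstanding reads at once instead of one page fault at a time, which is the general idea behind saturating an NVMe SSD.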

Another design change is to offload sharding and other tensor manipulations to GPUs. The original loader lets user programs slice tensors for sharding before copying them to device memory. However, that slicing incurs high CPU usage for host memory accesses. Therefore, we introduce special APIs to run sharding with torch.distributed collective operations such as broadcast and scatter. The offloading is also applied to other tensor manipulations such as type conversions.
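For readers new to tensor-parallel sharding, the arithmetic behind "which slice does each rank own" can be sketched in plain Python. The helper name shard_bounds and the ceiling-division rounding rule are assumptions for illustration only; the source does not state how fastsafetensors splits a dimension that is not evenly divisible, and the real work runs on the GPU through collective operations rather than list slicing.

```python
def shard_bounds(dim_size, rank, world_size):
    """Return the [start, stop) slice of one dimension owned by `rank`."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    block = -(-dim_size // world_size)  # ceiling division (assumed convention)
    start = min(rank * block, dim_size)
    stop = min(start + block, dim_size)
    return start, stop


# A weight with 10 rows split over 4 ranks.
rows = list(range(10))
shards = [rows[slice(*shard_bounds(len(rows), r, 4))] for r in range(4)]
# shards == [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```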

The above two designs can be naturally extended to utilize device-to-device data transfers with GPU Direct Storage. The technology helps minimize copy overheads from NVMe SSDs to GPU memory by bypassing host CPU and memory.

Basic API usage

SafeTensorsFileLoader is a low-level entrypoint. To use it, pass either SingleGroup() for simple inference or ProcessGroup() (from torch.distributed) for tensor-parallel inference. The loader supports both CPU and CUDA devices, with optional GPU Direct Storage (GDS) support. You can specify the device and GDS settings using the device and nogds arguments, respectively. Note that if GDS is not available, the loader will fail to open files when nogds=False. For more information on enabling GDS, please refer to the NVIDIA documentation.

After creating a SafeTensorsFileLoader instance, first map target files and a rank using the .add_filenames() method. Then, call .copy_file_to_device() to trigger the actual file copies on aggregated GPU memory fragments and directly instantiate a group of tensors. Once the files are loaded, you can retrieve a tensor using the .get_tensor() method. Additionally, you can obtain sharded tensors by .get_sharded(), which internally runs collective operations in torch.distributed.

Important: To release the GPU memory allocated for tensors, you must explicitly call the .close() method. This is because fastsafetensors allows multiple tensors to share a limited number of GPU memory fragments. As a result, it is the user's responsibility to ensure that all tensors are properly released before calling .close(), which will then safely release the underlying GPU memory.
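The ownership rule above can be illustrated with a CPU-only analogy built on Python's memoryview. The Fragment class is hypothetical and is not part of fastsafetensors; it only mimics the idea that many tensors are views into one shared allocation, so a view stops being usable once the allocation is released, while a copy (the counterpart of .clone()) survives.

```python
class Fragment:
    """Hypothetical stand-in for one memory fragment shared by many tensors."""

    def __init__(self, data: bytes):
        self._buf = bytearray(data)
        self._views = []

    def view(self, start, stop):
        """Return a zero-copy view; it is only valid until close()."""
        v = memoryview(self._buf)[start:stop]
        self._views.append(v)
        return v

    def close(self):
        """Release every outstanding view, then drop the backing buffer."""
        for v in self._views:
            v.release()
        self._views.clear()
        self._buf = bytearray()


frag = Fragment(b"aaaabbbb")
a = frag.view(0, 4)              # shares storage with the fragment
b_copy = bytes(frag.view(4, 8))  # independent copy, analogous to .clone()
frag.close()

# b_copy is still b"bbbb"; touching `a` now raises ValueError.
```

In this analogy Python raises a clean exception on a stale view. The source gives no such guarantee for GPU tensors, which is why it asks users to release or clone tensors before calling .close().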

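fastsafe_open is an easier entrypoint. Passing nogds=True forces GDS off and runs the loader in fallback mode. However, users must still be aware of the tricky memory management model described above, which should be fixed in future releases: tensors obtained inside the with block must be cloned if they are used after the block exits.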

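from fastsafetensors import fastsafe_open

# filename is the path to a .safetensors file, defined by the caller
with fastsafe_open(filenames=[filename], nogds=True, device="cpu", debug_log=True) as f:
    for key in f.get_keys():
        t = f.get_tensor(key).clone().detach()  # clone if t is used outside the with block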

Development

Pre-commit Hooks

Our CI workflow checks code formatting and linting with Python 3.13. Therefore, we recommend testing your code with Python 3.13 and running the following pre-commit hooks before contributing your code.

To set up:

  1. Install development dependencies:
pip install -e ".[dev]"
  2. Install pre-commit hooks:
pre-commit install

Now, every time you commit, the following checks will run automatically:

  • black: Code formatting
  • isort: Import sorting
  • flake8: Basic linting (syntax errors, undefined names)
  • mypy: Type checking
  • trailing-whitespace: Remove trailing whitespace
  • end-of-file-fixer: Ensure files end with a newline
  • check-yaml: Validate YAML files
  • check-toml: Validate TOML files
  • check-merge-conflict: Detect merge conflict markers
  • debug-statements: Detect debug statements

To manually run pre-commit on all files:

pre-commit run --all-files

To skip pre-commit hooks (not recommended):

git commit --no-verify

Code of Conduct

Please refer to Foundation Model Stack Community Code of Conduct.

Publication

Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman. (2025) "Speeding up Model Loading with fastsafetensors." arXiv:2505.23072 and IEEE CLOUD 2025.

For NVIDIA

Install from PyPI

See https://pypi.org/project/fastsafetensors/

pip install fastsafetensors

Install from source

pip install .

For ROCm

ROCm has no GDS-equivalent support, so on ROCm fastsafetensors only supports nogds=True mode. An example of the performance gain can be found in amd-perf.md.

Install from GitHub Source

ROCM_PATH=/opt/rocm pip install git+https://github.com/foundation-model-stack/fastsafetensors.git

Install from source

ROCM_PATH=/opt/rocm pip install .
