High-performance safetensors model loader

Project description

fastsafetensors is an efficient safetensors model loader. This library is tested with Python 3.9-3.13 and PyTorch 2.1-2.7.

Disclaimer: This repository contains a research prototype. It should be used with caution.

Features

We introduced three major features to optimize model loading performance:

  1. Batched, lazy tensor instantiations
  2. GPU offloading for sharding, type conversions, and device pointer alignment
  3. GPU Direct Storage enablement for file loading from storage to GPU memory

A major design difference from the original safetensors file loader is that fastsafetensors does NOT use mmap. The original loader instantiates tensors on demand from mmap'ed files, but unfortunately, mmap cannot fully utilize high-throughput I/O such as NVMe SSDs. Instead, fastsafetensors transfers files asynchronously and in parallel to saturate storage throughput, then lazily instantiates tensors in GPU device memory with DLPack.
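The parallel-transfer idea can be illustrated with plain Python (a conceptual sketch only, not fastsafetensors' actual I/O path, which uses asynchronous transfers and optionally GDS):

```python
import concurrent.futures
import os
import tempfile

def read_file(path):
    # One sequential read per worker; many reads in flight keep an
    # NVMe device's queue busy instead of faulting pages in on demand.
    with open(path, "rb") as f:
        return f.read()

def read_parallel(paths, max_workers=8):
    # Issue all reads concurrently to saturate storage throughput.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_file, paths))

# Demo with temporary files standing in for safetensors shards.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(4):
    p = os.path.join(tmpdir, f"shard{i}.bin")
    with open(p, "wb") as f:
        f.write(bytes([i]) * 1024)
    paths.append(p)

blobs = read_parallel(paths)
```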

Another design change is to offload sharding and other tensor manipulations to GPUs. The original loader performs slicing for sharding in user programs before copying to device memory, which incurs high CPU usage for host memory accesses. Instead, we introduce special APIs that run sharding with torch.distributed collective operations such as broadcast and scatter. The offloading also applies to other tensor manipulations such as type conversions.
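To make the sharding semantics concrete, here is a plain-Python sketch of what a row-wise scatter computes (conceptual only; fastsafetensors performs the equivalent with torch.distributed collectives on the GPU, and `shard_rows` is a hypothetical helper, not part of the API):

```python
def shard_rows(matrix, world_size, rank):
    """Return the contiguous block of rows owned by `rank` after a scatter."""
    per_rank = (len(matrix) + world_size - 1) // world_size  # ceiling division
    return matrix[rank * per_rank:(rank + 1) * per_rank]

# An 8x2 "weight" split across 4 ranks: each rank keeps 2 rows.
full = [[i, i] for i in range(8)]
shards = [shard_rows(full, world_size=4, rank=r) for r in range(4)]
```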

These two designs extend naturally to device-to-device data transfers with GPU Direct Storage (GDS). GDS minimizes copy overheads from NVMe SSDs to GPU memory by bypassing the host CPU and memory.

Basic API usages

SafeTensorsFileLoader is the low-level entrypoint. To use it, pass either SingleGroup() for simple inference or a ProcessGroup() (from torch.distributed) for tensor-parallel inference. The loader supports both CPU and CUDA devices, with optional GPU Direct Storage (GDS) support; specify these using the device and nogds arguments, respectively. Note that if GDS is not available, the loader will fail to open files when nogds=False. For more information on enabling GDS, please refer to the NVIDIA documentation.

After creating a SafeTensorsFileLoader instance, first map target files to ranks using the .add_filenames() method. Then, call .copy_file_to_device() to trigger the actual file copies into aggregated GPU memory fragments and directly instantiate a group of tensors. Once the files are loaded, you can retrieve a tensor using the .get_tensor() method. Additionally, you can obtain sharded tensors with .get_sharded(), which internally runs collective operations in torch.distributed.

Important: to release the GPU memory allocated for tensors, you must explicitly call the .close() method. This is because fastsafetensors allows multiple tensors to share a limited number of GPU memory fragments. As a result, it is the user's responsibility to ensure that all tensors are no longer in use before calling .close(), which then safely releases the underlying GPU memory.

fastsafe_open is an easier entrypoint. Passing nogds=True forces GDS off and runs the loader in fallback mode. However, users must still be aware of the tricky memory management model described above, which should be fixed in future releases.

with fastsafe_open(filenames=[filename], nogds=True, device="cpu", debug_log=True) as f:
    for key in f.get_keys():
        t = f.get_tensor(key).clone().detach() # clone if t is used outside

Development

Pre-commit Hooks

This repository uses pre-commit hooks for automatic code formatting and linting. To set up:

  1. Install development dependencies:
pip install -e ".[dev]"
  2. Install pre-commit hooks:
pre-commit install

Now, every time you commit, the following checks will run automatically:

  • black: Code formatting
  • isort: Import sorting
  • flake8: Basic linting (syntax errors, undefined names)
  • mypy: Type checking
  • trailing-whitespace: Remove trailing whitespace
  • end-of-file-fixer: Ensure files end with a newline
  • check-yaml: Validate YAML files
  • check-toml: Validate TOML files
  • check-merge-conflict: Detect merge conflict markers
  • debug-statements: Detect debug statements
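An illustrative .pre-commit-config.yaml wiring up hooks like these might look as follows (the repository URLs are the standard upstream hook repos, but the revision pins are hypothetical; see the repository's actual config file for the authoritative version):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0          # hypothetical pin
    hooks:
      - id: black
  - repo: https://github.com/pycqa/isort
    rev: 5.13.2          # hypothetical pin
    hooks:
      - id: isort
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0           # hypothetical pin
    hooks:
      - id: flake8
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.9.0          # hypothetical pin
    hooks:
      - id: mypy
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0          # hypothetical pin
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-toml
      - id: check-merge-conflict
      - id: debug-statements
```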

To manually run pre-commit on all files:

pre-commit run --all-files

To skip pre-commit hooks (not recommended):

git commit --no-verify

Code of Conduct

Please refer to Foundation Model Stack Community Code of Conduct.

Publication

Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, and Swaminathan Sundararaman. (2025). Speeding up Model Loading with fastsafetensors. arXiv:2505.23072; also in IEEE CLOUD 2025.

For NVIDIA

Install from PyPI

See https://pypi.org/project/fastsafetensors/

pip install fastsafetensors

Install from source

pip install .

For ROCm

On ROCm, there is no GDS equivalent, so fastsafetensors only supports the nogds=True mode. A performance gain example can be found in amd-perf.md.

Install from GitHub source

ROCM_PATH=/opt/rocm pip install git+https://github.com/foundation-model-stack/fastsafetensors.git

Install from source

ROCM_PATH=/opt/rocm pip install .
