
unifiedefficientloader

A unified interface for memory-efficient, per-tensor loading of safetensors files (reading each tensor's raw bytes directly from its file offset), with CPU/GPU pinned-memory transfers and tensor/dict conversion utilities.

Documentation

Full API reference and guides in docs/.

Installation

You can install this package via pip. It relies heavily on torch and safetensors at runtime but does not declare them as hard dependencies, so make sure they (and tqdm) are installed in your environment:

pip install unifiedefficientloader
pip install torch safetensors tqdm

Usage

Unified Safetensors Loader

from unifiedefficientloader import UnifiedSafetensorsLoader

# Standard mode (preload all)
with UnifiedSafetensorsLoader("model.safetensors", low_memory=False) as loader:
    tensor = loader.get_tensor("weight_name")

# Low memory mode (streaming)
with UnifiedSafetensorsLoader("model.safetensors", low_memory=True) as loader:
    for key in loader.keys():
        tensor = loader.get_tensor(key)
        # Process tensor...
        loader.mark_processed(key) # Frees memory

Incremental Safetensors Writer

You can incrementally stream tensors to disk using a pre-allocated "dummy" output file and a background ThreadPoolExecutor. The entire output model never has to be held in memory at once, which avoids the large RAM spike a conventional save incurs.

from unifiedefficientloader import UnifiedSafetensorsLoader, IncrementalSafetensorsWriter

loader = UnifiedSafetensorsLoader("source_model.safetensors", low_memory=True)

# 1. Initialize with an optional metadata dictionary
# max_header_bytes defaults to 1MB, which is plenty for >10,000 tensors.
writer = IncrementalSafetensorsWriter("merged_or_quantized.safetensors", metadata=loader.metadata())

# 2. Stream tensors into the file
with writer:
    for key in loader.keys():
        t = loader.get_tensor(key)           # 1. Loader -> Memory
        gpu_t = t.to("cuda")                 # 2. Memory -> GPU
        del t                                # <-- Explicit cleanup
        
        out_gpu_t = custom_quantize(gpu_t)   # 3. Process on GPU
        out_t = out_gpu_t.cpu()              # 4. GPU -> Memory
        del gpu_t, out_gpu_t                 # <-- Explicit cleanup
        
        writer.write(key, out_t)             # 5. Memory -> File queue
        del out_t                            # <-- Explicit cleanup
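
For intuition, the pre-allocate-and-write-at-offset pattern described above looks roughly like the sketch below. This is a generic illustration with made-up file names, sizes, and offsets, not the library's actual implementation:

from concurrent.futures import ThreadPoolExecutor

def preallocate(path, total_bytes):
    # Reserve the final file size up front so workers can write at any offset.
    with open(path, "wb") as f:
        f.truncate(total_bytes)

def write_at(path, offset, payload):
    # Each task opens its own handle, seeks, and writes its slice;
    # no tensor has to wait in RAM for the ones before it.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload)

preallocate("out.bin", 1024)
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(write_at, "out.bin", 0, b"\x00" * 512),
        pool.submit(write_at, "out.bin", 512, b"\x01" * 512),
    ]
    for fut in futures:
        fut.result()  # surface any write errors from the workers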

Loading Specific Tensors Dynamically (Header Analysis)

You can analyze the file's header without loading the entire multi-gigabyte safetensors file into memory. This allows you to locate specific data (like embedded JSON dictionaries stored as uint8 tensors) and load only those specific tensors directly from their file offsets.

from unifiedefficientloader import UnifiedSafetensorsLoader, tensor_to_dict

with UnifiedSafetensorsLoader("model.safetensors", low_memory=True) as loader:
    # 1. Analyze the header metadata without loading any tensors
    # loader._header contains the full safetensors header directory
    uint8_tensor_keys = [
        key for key, info in loader._header.items()
        if isinstance(info, dict) and info.get("dtype") == "U8"
    ]

    # 2. Load ONLY those specific tensors using their keys
    for key in uint8_tensor_keys:
        # get_tensor dynamically reads only the bytes for this tensor
        # based on the offsets found in the header
        loaded_tensor = loader.get_tensor(key)

        # 3. Decode the uint8 tensor back into a Python dictionary
        extracted_dict = tensor_to_dict(loaded_tensor)
        print(f"Decoded {key}:", extracted_dict)

Optimized Asynchronous Streaming via ThreadPoolExecutor

For maximum I/O throughput with strict memory backpressure, use async_stream. It uses a ThreadPoolExecutor for background disk reads and a bounded queue to prevent memory exhaustion. With pin_memory=True, memory pinning is performed sequentially in the main thread, avoiding OS-level lock contention and preserving high DMA transfer speeds.

from unifiedefficientloader import UnifiedSafetensorsLoader, transfer_to_gpu_pinned

with UnifiedSafetensorsLoader("model.safetensors", low_memory=True) as loader:
    keys_to_load = loader.keys()

    # Create the continuous streaming generator
    # prefetch_batches controls how many batches to buffer in memory
    stream = loader.async_stream(
        keys_to_load,
        batch_size=8,
        prefetch_batches=2,
        pin_memory=True
    )

    # Iterate directly over the generator
    for batch in stream:
        for key, pinned_tensor in batch:
            # Transfer directly to GPU via DMA (pinning is already done)
            gpu_tensor = transfer_to_gpu_pinned(pinned_tensor, device="cuda")

            # ... process gpu_tensor ...
            loader.mark_processed(key)
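
The backpressure mechanism itself is a generic producer/consumer pattern: a bounded queue makes the background reader block as soon as prefetch_batches batches are buffered. A standalone sketch of that idea (not the library's internals):

import queue
from concurrent.futures import ThreadPoolExecutor

def produce(q, items):
    for item in items:
        q.put(item)   # blocks while the queue is full -> backpressure
    q.put(None)       # sentinel signalling end of stream

q = queue.Queue(maxsize=2)  # at most 2 items buffered at once
with ThreadPoolExecutor(max_workers=1) as pool:
    pool.submit(produce, q, range(10))
    while (item := q.get()) is not None:
        print("consumed", item)  # a slow consumer throttles the producer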

Unified Data Loader

A high-performance, threaded alternative to PyTorch's standard DataLoader. It eliminates multiprocessing IPC overhead and features a zero-copy pipeline capable of streaming batches directly from pinned CPU memory to VRAM (direct_gpu=True).

from unifiedefficientloader import UnifiedDataLoader
from torchvision import datasets, transforms

dataset = datasets.FakeData(transform=transforms.ToTensor())

# Replaces torch.utils.data.DataLoader
# Pre-allocates pinned buffer pools and streams directly to GPU
loader = UnifiedDataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    direct_gpu=True
)

for batch_image, batch_label in loader:
    # batch_image and batch_label are already on the GPU (device="cuda")
    pass
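
The pinned-buffer idea behind direct_gpu looks roughly like the following. This is an illustrative sketch, not the library's code, and it needs a CUDA device since pinned allocation requires one:

import torch

# Pre-allocate one page-locked staging buffer and reuse it every batch.
pinned = torch.empty(32, 3, 224, 224, pin_memory=True)

def stage_and_transfer(cpu_batch):
    pinned.copy_(cpu_batch)                      # pageable -> pinned (reused)
    return pinned.to("cuda", non_blocking=True)  # asynchronous DMA copy
    # Real code must synchronize (or rotate buffers) before the next
    # copy_ into `pinned`, since the async transfer may still be in flight.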

Direct-to-GPU Streaming (Zero-Copy)

For the fastest loading times on CUDA devices, use the direct_gpu=True flag. This creates a pipeline that pre-allocates pinned memory pools and GPU memory slabs. Tensors are loaded from disk directly into pinned buffers and immediately copied to the GPU asynchronously using CUDA streams, hiding the PCIe transfer latency behind the disk I/O.

from unifiedefficientloader import UnifiedSafetensorsLoader

with UnifiedSafetensorsLoader("model.safetensors", low_memory=True, direct_gpu=True) as loader:
    stream = loader.async_stream(
        loader.keys(),
        batch_size=8,
        prefetch_batches=2,
    )
    for batch in stream:
        for key, gpu_tensor in batch:
            # gpu_tensor is already on the GPU
            assert gpu_tensor.device.type == "cuda"
            # ... process gpu_tensor ...
            loader.mark_processed(key)  # releases GPU buffer back to pool
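
The copy/compute overlap with CUDA streams that this pipeline relies on can be reproduced in a few lines (illustrative sketch; needs a CUDA device):

import torch

copy_stream = torch.cuda.Stream()
pinned = torch.randn(1024, 1024, pin_memory=True)

with torch.cuda.stream(copy_stream):
    gpu_t = pinned.to("cuda", non_blocking=True)  # H2D copy on a side stream

# Work queued on the default stream here runs concurrently with the copy.
torch.cuda.current_stream().wait_stream(copy_stream)  # order before first use
result = gpu_t.sum()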

Zero-Copy MMAP Loading

use_mmap=True maps the file into virtual memory via the uel native extension. No data is copied into RAM; PyTorch holds a direct pointer into the OS page cache.

from unifiedefficientloader import UnifiedSafetensorsLoader

with UnifiedSafetensorsLoader("model.safetensors", low_memory=True, use_mmap=True) as loader:
    state_dict = loader.load_all()
    # all tensors are zero-copy views into mapped memory

Requires the uel native extension to be compiled; if it is unavailable, the loader falls back silently to standard I/O. See docs/mmap.md and docs/building.md.
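
The underlying zero-copy idea can be demonstrated with the standard library alone; this generic sketch is independent of the uel extension:

import mmap
import torch

with open("model.safetensors", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# torch.frombuffer wraps the mapping without copying; PyTorch warns that
# the buffer is read-only, so treat the resulting tensor as immutable.
raw = torch.frombuffer(mm, dtype=torch.uint8)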

Tensor/Dict Conversion

from unifiedefficientloader import dict_to_tensor, tensor_to_dict

my_dict = {"param": 1.0, "name": "test"}
tensor = dict_to_tensor(my_dict)
recovered_dict = tensor_to_dict(tensor)
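
A plausible encoding behind these helpers (the library's exact scheme may differ): JSON-serialize the dict and view the UTF-8 bytes as a uint8 tensor, which a safetensors file can then store like any other tensor:

import json
import torch

def encode(d):
    # dict -> JSON -> UTF-8 bytes -> uint8 tensor
    return torch.tensor(list(json.dumps(d).encode("utf-8")), dtype=torch.uint8)

def decode(t):
    # uint8 tensor -> bytes -> JSON -> dict
    return json.loads(bytes(t.tolist()).decode("utf-8"))

assert decode(encode({"param": 1.0, "name": "test"})) == {"param": 1.0, "name": "test"}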

Pinned Memory Transfers

import torch
from unifiedefficientloader import transfer_to_gpu_pinned

tensor = torch.randn(100, 100)
# Transfers using pinned memory if CUDA is available, otherwise falls back gracefully
gpu_tensor = transfer_to_gpu_pinned(tensor, device="cuda:0")
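
What "falls back gracefully" might look like in practice (an illustrative sketch of the pattern, not the library's implementation):

import torch

def to_gpu_pinned_sketch(t, device="cuda:0"):
    if not torch.cuda.is_available():
        return t                      # no CUDA: hand back the CPU tensor
    staged = t if t.is_pinned() else t.pin_memory()
    return staged.to(device, non_blocking=True)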
