Functions to efficiently rechunk multidimensional arrays

rechunkit


Documentation: https://mullenkamp.github.io/rechunkit/

Source Code: https://github.com/mullenkamp/rechunkit


Introduction

Rechunkit is a Python library for efficiently rechunking multidimensional NumPy arrays that are stored as chunks. It takes a generator-based approach: rechunked data is yielded on the fly, so the full target array never has to be held in memory.
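To make the generator idea concrete, here is a minimal sketch of lazily walking the chunk grid of an array. The helper name iter_chunk_slices is hypothetical and is not part of rechunkit's API; it only illustrates how chunk indices can be produced one at a time instead of being built up front.

```python
from itertools import product
from typing import Iterator, Tuple


def iter_chunk_slices(
    shape: Tuple[int, ...], chunk_shape: Tuple[int, ...]
) -> Iterator[Tuple[slice, ...]]:
    """Lazily yield one tuple of slices per chunk (hypothetical helper)."""
    # Per-axis slices; the last slice on each axis is clipped to the array edge.
    per_axis = [
        [slice(start, min(start + size, length)) for start in range(0, length, size)]
        for length, size in zip(shape, chunk_shape)
    ]
    # product() yields the chunk grid one index tuple at a time, last axis fastest.
    yield from product(*per_axis)


chunks = iter_chunk_slices((5, 4), (2, 3))
first = next(chunks)  # (slice(0, 2), slice(0, 3))
```

A consumer can iterate over such a generator and handle one chunk per step, which is the same usage pattern as the for loop in the Quick Example.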

Key Features

  • Efficient On-the-Fly Rechunking: Uses Python generators to yield rechunked data without requiring the full target array to be stored in memory.
  • Memory-Aware Optimization: Employs a smart scaling algorithm to maximize performance within a user-defined memory limit (max_mem).
  • LCM Minimization: Utilizes highly composite numbers for chunk guessing to minimize the Least Common Multiple (LCM) between source and target, significantly reducing redundant reads.
  • Flexible Data Access: Supports subset selection (sel) and works with any source exposed as a NumPy-style __getitem__ callable (a bound method or a plain function).
  • Source-Aligned Selection Reads: When rechunking a subset (sel), read requests are aligned to source chunk boundaries, even when the selection offset doesn't fall on one. This lets source functions backed by chunk-based storage (HDF5, Zarr, cfdb) serve each read from aligned chunks without assembling data across chunk boundaries.
  • Preprocessing Utilities: Includes tools for estimating ideal chunk shapes, calculating memory requirements, and predicting the number of required read operations.
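Two of the ideas above can be sketched in plain Python. The functions lcm_block_shape and aligned_reads are hypothetical illustrations, not rechunkit functions, and the library's actual algorithms may differ. The first computes the per-axis LCM of a source and a target chunk shape (using the shapes from the Quick Example); the second shows one way to split a 1-D selection so that no read crosses a source chunk boundary. math.lcm requires Python 3.9 or later.

```python
from math import lcm, prod


def lcm_block_shape(source_chunks, target_chunks):
    """Per-axis LCM: the smallest block that both chunk grids tile exactly."""
    return tuple(lcm(s, t) for s, t in zip(source_chunks, target_chunks))


def aligned_reads(start, stop, chunk_size):
    """Split [start, stop) into pieces that never cross a source chunk boundary."""
    pieces = []
    pos = start
    while pos < stop:
        # End of the source chunk containing `pos`, clipped to the selection.
        chunk_end = min((pos // chunk_size + 1) * chunk_size, stop)
        pieces.append((pos, chunk_end))
        pos = chunk_end
    return pieces


block = lcm_block_shape((5, 2, 4), (4, 5, 3))  # (20, 10, 12)
block_bytes = prod(block) * 4  # 9600, assuming a 4-byte dtype such as int32
reads = aligned_reads(3, 17, 5)  # [(3, 5), (5, 10), (10, 15), (15, 17)]
```

A smaller LCM block means less data has to be held at once for source and target chunks to line up, which is why chunk shapes with small LCMs tend to need fewer repeated reads under a fixed memory limit.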

Installation

pip install rechunkit

Quick Example

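import numpy as np
from math import prod
from rechunkit import rechunker

# Build an in-memory source array and expose it through a __getitem__ callable
shape = (31, 31, 31)
dtype = np.dtype('int32')
source_data = np.arange(1, prod(shape) + 1, dtype=dtype).reshape(shape)
source = source_data.__getitem__

# Rechunk from (5, 2, 4) source chunks to (4, 5, 3) target chunks,
# writing each yielded chunk into the target array
target = np.zeros(shape, dtype=dtype)
for write_chunk, data in rechunker(source, shape, dtype, (5, 2, 4), (4, 5, 3), max_mem=2000):
    target[write_chunk] = data

# The rechunked target must match the source exactly
assert np.all(source_data == target)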
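Because the source is just a __getitem__-style callable, it can be wrapped to observe how it is used. The sketch below is a hypothetical CountingSource wrapper (not part of rechunkit) that records every key it receives. It is demonstrated on a plain nested list so that it runs without any third-party packages.

```python
class CountingSource:
    """Wrap a __getitem__-style callable and record how it is called."""

    def __init__(self, getitem):
        self._getitem = getitem
        self.calls = 0
        self.keys = []

    def __call__(self, key):
        # Record the request, then delegate to the wrapped callable unchanged.
        self.calls += 1
        self.keys.append(key)
        return self._getitem(key)


# Stand-in source: a nested list indexed by a (row, column-slice) key.
grid = [[1, 2, 3], [4, 5, 6]]
counting = CountingSource(lambda key: grid[key[0]][key[1]])
row = counting((1, slice(0, 2)))  # [4, 5]
```

In the example above, CountingSource(source_data.__getitem__) could be passed in place of source, and counting.calls inspected after the loop to see how many reads were issued. This assumes rechunker only ever calls the source with an index key, as the feature list suggests.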

See the documentation for detailed guides, integration examples (h5py, zarr), and the full API reference.

License

This project is licensed under the terms of the Apache Software License 2.0.
