A package for HDF5-based chunked arrays
A minimal package for saving and reading large HDF5-based chunked arrays.
This package was developed in the Portugues lab for volumetric calcium imaging data. split_dataset is extensively used in the calcium imaging analysis package fimpy; the microscope control libraries sashimi and brunoise save files as split datasets. napari-split-dataset supports the visualization of SplitDataset objects in napari.
Why use split datasets?
Split datasets are numpy-like arrays saved over multiple h5 files. The concept is similar to e.g. zarr arrays; however, relying on h5 files allows for partial reading even within the same file, which is crucial for visualizing volumetric time series, the main application split_dataset has been developed for (see this discussion on the limitations of zarr arrays).
Structure of a split dataset
A split dataset is a folder containing multiple numbered h5 files (one file per chunk) and a metadata json file with information on the shape of the full dataset and of its chunks.
The h5 files are saved with the flammkuchen library (formerly deepdish). Each file contains a dictionary with the data under the stack keyword.
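As a rough sketch, the number of chunk files follows from the full dataset shape and the chunk shape. The metadata field names below are an assumption for illustration only, not the exact on-disk schema:

```python
import json
import math

# Hypothetical metadata for a split dataset; field names are an
# assumption for illustration, not the exact on-disk format.
metadata = {
    "shape_full": [1800, 40, 512, 512],  # full (t, z, y, x) dataset shape
    "shape_block": [20, 40, 512, 512],   # shape of each h5 chunk
}

# One numbered h5 file per chunk: the file count is the product of
# ceil(full / block) along each axis.
n_files = math.prod(
    math.ceil(full / block)
    for full, block in zip(metadata["shape_full"], metadata["shape_block"])
)
print(n_files)  # 90: the dataset is chunked only along the time axis

print(json.dumps(metadata))
```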
SplitDataset objects can then be instantiated from the dataset path, and numpy-style indexing can be used to load data as numpy arrays. Any number of dimensions and any block size are supported in principle; the package has been used mainly with 3D and 4D arrays.
Minimal example
# Load a SplitDataset from its folder:
from split_dataset import SplitDataset

ds = SplitDataset(path_to_dataset)

# Retrieve data in an interval with numpy-style indexing:
data_array = ds[n_start:n_end, :, :, :]
Creating split datasets
New split datasets can be created with the split_dataset.save_to_split_dataset function, provided that the original data is fully loaded in memory. Alternatively, e.g. for time acquisitions, a split dataset can be saved one chunk at a time: it is enough to save correctly formatted .h5 files with flammkuchen, together with the corresponding json metadata file describing the full split dataset shape (this is what happens in sashimi).
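The chunk-at-a-time workflow can be sketched as below. The file naming scheme, the metadata file name, and its field names are assumptions for illustration; the flammkuchen call is shown in a comment rather than executed:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Sketch of saving a split dataset one chunk at a time.
with TemporaryDirectory() as tmp:
    dest = Path(tmp)
    n_chunks = 4
    chunk_shape = [10, 40, 512, 512]  # shape of each saved chunk

    for i in range(n_chunks):
        fname = dest / f"{i:04d}.h5"  # assumed naming: one numbered file per chunk
        # Each chunk would be written with flammkuchen under the "stack" key:
        # fl.save(str(fname), {"stack": chunk_data})
        fname.touch()  # placeholder standing in for the real save

    # Metadata describing the full dataset shape; written once at the end.
    # "stack_metadata.json" and the field names are assumed, not verified.
    metadata = {
        "shape_full": [n_chunks * chunk_shape[0]] + chunk_shape[1:],
        "shape_block": chunk_shape,
    }
    (dest / "stack_metadata.json").write_text(json.dumps(metadata))

    files = sorted(p.name for p in dest.iterdir())
    print(files)
```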
TODO
- provide utilities for partial saving of split datasets
- support more advanced indexing (step and vector indexing)
- support cropping a SplitDataset
- support resolution and frequency metadata
History
0.4.0 (2021-03-23)
- Added support for using a SplitDataset as data in a napari layer.
...
0.1.0 (2020-05-06)
- First release on PyPI.
Credits
Part of this package was inspired by Cookiecutter and this template.