HDF5-backed objects for array- and matrix-like data
hdf5array
Introduction
This is the Python equivalent of Bioconductor's HDF5Array package, providing a representation of HDF5-backed arrays within the delayedarray framework. The idea is to allow users to store, manipulate and operate on large datasets without loading them into memory, in a manner that is trivially compatible with other data structures in the BiocPy ecosystem.
Installation
This package can be installed from PyPI with the usual commands:
pip install hdf5array
Quick start
Let's mock up a dense array:
import numpy
data = numpy.random.rand(40, 50, 100)
import h5py
with h5py.File("whee.h5", "w") as handle:
    handle.create_dataset("yay", data=data)
We can now represent it as a Hdf5DenseArray:
import hdf5array
arr = hdf5array.Hdf5DenseArray("whee.h5", "yay", native_order=True)
## <40 x 50 x 100> Hdf5DenseArray object of type 'float64'
## [[[0.63008796, 0.34849183, 0.75621679, ..., 0.07343495, 0.63095765,
## 0.625732 ],
## [0.68123095, 0.91403054, 0.74737122, ..., 0.17344344, 0.82254404,
## 0.58158815],
## [0.83287116, 0.40738123, 0.89887551, ..., 0.34936481, 0.76600276,
## 0.91991967],
## ...,
This is just a subclass of a DelayedArray and can be used anywhere in the BiocPy framework.
Parts of the NumPy API are also supported - for example, we could apply a variety of delayed operations:
scaling = numpy.random.rand(100)
transformed = numpy.log1p(arr / scaling)
## <40 x 50 x 100> DelayedArray object of type 'float64'
## [[[0.58803887, 0.3458478 , 0.82700531, ..., 0.08224734, 0.65678967,
## 0.56893312],
## [0.62348907, 0.7341526 , 0.82040225, ..., 0.18437718, 0.7932422 ,
## 0.53784637],
## [0.72176703, 0.39407341, 0.92788307, ..., 0.34205035, 0.75487196,
## 0.75456938],
## ...,
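To give a feel for what deferring the work buys, here is a minimal sketch that applies the same log1p(x / scaling) transformation one slab at a time, using only h5py and NumPy. It is an illustration of block-wise evaluation in general, not a description of how hdf5array or delayedarray do it internally; the temporary file path, the slab size and the fixed random seed are assumptions made for the example.

```python
import os
import tempfile

import h5py
import numpy

# Recreate a mock dataset like the one above, in a temporary file.
rng = numpy.random.default_rng(0)
data = rng.random((40, 50, 100))
scaling = rng.random(100) + 0.5  # offset keeps the divisor away from zero

path = os.path.join(tempfile.mkdtemp(), "whee.h5")
with h5py.File(path, "w") as handle:
    handle.create_dataset("yay", data=data)

# Apply log1p(x / scaling) one slab at a time along the first dimension,
# so only a (10, 50, 100) block is held in memory at any point.
slab = 10  # illustrative block size
total = 0.0
with h5py.File(path, "r") as handle:
    dset = handle["yay"]
    for start in range(0, dset.shape[0], slab):
        block = dset[start:start + slab, :, :]  # reads just this slab from disk
        total += numpy.log1p(block / scaling).sum()

print(total)
```

The length-100 scaling vector broadcasts along the last dimension, which is the same behaviour NumPy gives for an in-memory array of this shape.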
Check out the documentation for more details.
Handling sparse matrices
We support a variety of compressed sparse formats where the non-zero elements are held inside three separate datasets, usually data, indices and indptr, based on the 10X Genomics sparse HDF5 format.
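For readers unfamiliar with the layout, the sketch below shows how the three arrays encode a small column-major (CSC) matrix, using plain Python lists. The densify_csc helper is hypothetical and exists only for illustration; it is not part of hdf5array.

```python
# A 4 x 3 matrix in compressed sparse column (CSC) form:
#
#     [[1, 0, 0],
#      [0, 0, 3],
#      [2, 0, 0],
#      [0, 0, 4]]
data = [1.0, 2.0, 3.0, 4.0]  # non-zero values, stored column by column
indices = [0, 2, 1, 3]       # row index of each non-zero value
indptr = [0, 2, 2, 4]        # column j occupies data[indptr[j]:indptr[j + 1]]


def densify_csc(data, indices, indptr, nrow):
    """Hypothetical helper: expand CSC arrays into a dense list of rows."""
    ncol = len(indptr) - 1
    dense = [[0.0] * ncol for _ in range(nrow)]
    for col in range(ncol):
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = data[k]
    return dense


dense = densify_csc(data, indices, indptr, nrow=4)
for row in dense:
    print(row)
```

The number of columns can be inferred from the length of indptr, but the number of rows cannot be recovered from the three arrays alone, so it has to be supplied separately.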
To demonstrate, let's mock up some sparse data using scipy:
import scipy.sparse
mock = scipy.sparse.random(100, 200, 0.1).tocsc()
with h5py.File("sparse_whee.h5", "w") as handle:
    handle.create_dataset("sparse_blah/data", data=mock.data, compression="gzip")
    handle.create_dataset("sparse_blah/indices", data=mock.indices, compression="gzip")
    handle.create_dataset("sparse_blah/indptr", data=mock.indptr, compression="gzip")
We can then create a sparse HDF5-backed matrix. Note that there is some variation in this HDF5 compressed sparse format, notably where the dimensions are stored and whether it is column/row-major. The constructor will not do any auto-detection so we need to provide this information explicitly:
import hdf5array
arr = hdf5array.Hdf5CompressedSparseMatrix(
    "sparse_whee.h5",
    "sparse_blah",
    shape=(100, 200),
    by_column=True
)
## <100 x 200> sparse Hdf5CompressedSparseMatrix object of type 'float64'
## [[0. , 0. , 0.26563417, ..., 0. , 0. ,
## 0. ],
## [0. , 0. , 0. , ..., 0.23896924, 0. ,
## 0. ],
## [0. , 0. , 0. , ..., 0.42236848, 0.3585153 ,
## 0. ],
## ...,
## [0. , 0. , 0.3363087 , ..., 0. , 0. ,
## 0. ],
## [0. , 0. , 0. , ..., 0. , 0. ,
## 0. ],
## [0. , 0. , 0. , ..., 0. , 0. ,
## 0. ]]
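Because the constructor does no auto-detection, the dimensions have to come from somewhere. Some files, including those following the 10X layout, keep them in a shape dataset next to data, indices and indptr. The sketch below assumes that layout and uses only h5py and NumPy; read_sparse_shape is a hypothetical helper, not part of hdf5array, and the file written here is a toy example rather than the mock file above.

```python
import os
import tempfile

import h5py
import numpy


def read_sparse_shape(path, group):
    """Hypothetical helper: return the dimensions in '<group>/shape', or None."""
    with h5py.File(path, "r") as handle:
        grp = handle[group]
        if "shape" not in grp:
            return None  # dimensions are stored elsewhere, or not at all
        return tuple(int(x) for x in grp["shape"][:])


# Write a tiny 4 x 3 column-major matrix together with its shape.
path = os.path.join(tempfile.mkdtemp(), "sparse_with_shape.h5")
with h5py.File(path, "w") as handle:
    handle.create_dataset("sparse_blah/data", data=numpy.array([1.0, 2.0, 3.0]))
    handle.create_dataset("sparse_blah/indices", data=numpy.array([0, 2, 1]))
    handle.create_dataset("sparse_blah/indptr", data=numpy.array([0, 2, 2, 3]))
    handle.create_dataset("sparse_blah/shape", data=numpy.array([4, 3]))

shape = read_sparse_shape(path, "sparse_blah")
print(shape)  # (4, 3)
```

The resulting tuple could then be passed as the shape argument of the constructor. Whether the file is column- or row-major still has to be known from how it was written.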