HDF5-backed objects for array- and matrix-like data

hdf5array

Introduction

This is the Python equivalent of Bioconductor's HDF5Array package, providing a representation of HDF5-backed arrays within the delayedarray framework. The idea is to allow users to store, manipulate and operate on large datasets without loading them into memory, in a manner that is trivially compatible with other data structures in the BiocPy ecosystem.

Installation

This package can be installed from PyPI with the usual commands:

pip install hdf5array

Quick start

Let's mock up a dense array:

import numpy
data = numpy.random.rand(40, 50, 100)

import h5py
with h5py.File("whee.h5", "w") as handle:
    handle.create_dataset("yay", data=data)

We can now represent it as an Hdf5DenseArray:

import hdf5array
arr = hdf5array.Hdf5DenseArray("whee.h5", "yay", native_order=True)
## <40 x 50 x 100> Hdf5DenseArray object of type 'float64'
## [[[0.63008796, 0.34849183, 0.75621679, ..., 0.07343495, 0.63095765,
##    0.625732  ],
##   [0.68123095, 0.91403054, 0.74737122, ..., 0.17344344, 0.82254404,
##    0.58158815],
##   [0.83287116, 0.40738123, 0.89887551, ..., 0.34936481, 0.76600276,
##    0.91991967],
##   ...,

Hdf5DenseArray is a subclass of DelayedArray and can be used anywhere in the BiocPy framework. Parts of the NumPy API are also supported; for example, we can apply a variety of delayed operations:

scaling = numpy.random.rand(100)
transformed = numpy.log1p(arr / scaling)
## <40 x 50 x 100> DelayedArray object of type 'float64'
## [[[0.58803887, 0.3458478 , 0.82700531, ..., 0.08224734, 0.65678967,
##    0.56893312],
##   [0.62348907, 0.7341526 , 0.82040225, ..., 0.18437718, 0.7932422 ,
##    0.53784637],
##   [0.72176703, 0.39407341, 0.92788307, ..., 0.34205035, 0.75487196,
##    0.75456938],
##   ...,
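To make the arithmetic in that expression concrete, here is a minimal in-memory sketch that uses a plain NumPy array as a stand-in for the HDF5-backed array. It assumes the delayed operations follow NumPy's usual broadcasting rules, so that a 1-dimensional `scaling` vector is matched against the last dimension; the small shapes and the `expected_slice` name are illustrative only.

```python
import numpy

# In-memory stand-in for the HDF5-backed array (not an Hdf5DenseArray).
rng = numpy.random.default_rng(0)
dense = rng.random((4, 5, 10))
scaling = rng.random(10) + 0.5  # one positive factor per entry of the last dimension

# Same expression as in the text; NumPy broadcasts `scaling` along the last axis.
transformed = numpy.log1p(dense / scaling)

# Equivalent explicit computation for a single position along the last dimension.
k = 3
expected_slice = numpy.log1p(dense[:, :, k] / scaling[k])
```

Each slab `transformed[:, :, k]` is therefore divided by its own `scaling[k]` before the `log1p` is applied.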

Check out the documentation for more details.

Handling sparse matrices

We support a variety of compressed sparse formats where the non-zero elements are held inside three separate datasets - usually data, indices and indptr, based on the 10X Genomics sparse HDF5 format. To demonstrate, let's mock up some sparse data using scipy:

import scipy.sparse
mock = scipy.sparse.random(100, 200, 0.1).tocsc()

with h5py.File("sparse_whee.h5", "w") as handle:
    handle.create_dataset("sparse_blah/data", data=mock.data, compression="gzip")
    handle.create_dataset("sparse_blah/indices", data=mock.indices, compression="gzip")
    handle.create_dataset("sparse_blah/indptr", data=mock.indptr, compression="gzip")
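To make the role of the three datasets concrete, the following pure-Python sketch decodes a tiny compressed sparse column layout by hand. The values are hypothetical toy data for a 4 x 3 matrix, and `column_entries` is an illustrative helper, not part of hdf5array or scipy.

```python
# Hypothetical toy values for a 4 x 3 matrix in compressed sparse column form.
data = [1.0, 2.0, 3.0, 4.0]  # non-zero values, stored column by column
indices = [0, 2, 1, 0]       # row index of each non-zero value
indptr = [0, 2, 3, 4]        # column j spans positions indptr[j]:indptr[j + 1]

def column_entries(j):
    """Return (row, value) pairs for the non-zero entries of column j."""
    start, end = indptr[j], indptr[j + 1]
    return list(zip(indices[start:end], data[start:end]))

first_column = column_entries(0)  # [(0, 1.0), (2, 2.0)]
last_column = column_entries(2)   # [(0, 4.0)]
n_columns = len(indptr) - 1       # 3
```

Only the non-zero values and their positions are stored, which is why `indptr` has one more entry than the number of columns.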

We can then create a sparse HDF5-backed matrix. Note that there is some variation in this HDF5 compressed sparse format, notably in where the dimensions are stored and whether the layout is column- or row-major. The constructor does not attempt any auto-detection, so we need to provide this information explicitly:

import hdf5array
arr = hdf5array.Hdf5CompressedSparseMatrix(
    "sparse_whee.h5",
    "sparse_blah",
    shape=(100, 200),
    by_column=True
)
## <100 x 200> sparse Hdf5CompressedSparseMatrix object of type 'float64'
## [[0.        , 0.        , 0.26563417, ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.23896924, 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.42236848, 0.3585153 ,
##   0.        ],
##  ...,
##  [0.        , 0.        , 0.3363087 , ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
##   0.        ]]
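The reason the shape and orientation must be supplied can be illustrated without any HDF5 file: the same three arrays describe different matrices depending on how they are interpreted. The sketch below is pure Python with hypothetical toy values; `to_dense` is an illustrative helper and not a function of hdf5array.

```python
def to_dense(data, indices, indptr, shape, by_column):
    """Expand compressed sparse arrays into a list-of-lists dense matrix."""
    n_rows, n_cols = shape
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for primary in range(len(indptr) - 1):
        for k in range(indptr[primary], indptr[primary + 1]):
            secondary = indices[k]
            if by_column:
                # indptr walks over columns, indices hold row positions.
                dense[secondary][primary] = data[k]
            else:
                # indptr walks over rows, indices hold column positions.
                dense[primary][secondary] = data[k]
    return dense

data = [1.0, 2.0, 3.0]
indices = [0, 1, 0]
indptr = [0, 2, 3]

as_csc = to_dense(data, indices, indptr, shape=(2, 2), by_column=True)
as_csr = to_dense(data, indices, indptr, shape=(2, 2), by_column=False)
taller = to_dense(data, indices, indptr, shape=(3, 2), by_column=True)
```

Here `as_csc` and `as_csr` are transposes of each other, and `taller` is an equally valid reading with an extra all-zero row. The arrays alone fix only the size of the compressed dimension (`len(indptr) - 1`), so the other dimension and the orientation have to come from the caller.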
