Skip to main content

File backed objects for array and matrix like data

Project description

Project generated with PyScaffold PyPI-Server Monthly Downloads Unit tests

hdf5array

Introduction

This is the Python equivalent of Bioconductor's HDF5Array package, providing a representation of HDF5-backed arrays within the delayedarray framework. The idea is to allow users to store, manipulate and operate on large datasets without loading them into memory, in a manner that is trivially compatible with other data structures in the BiocPy ecosystem.

Installation

This package can be installed from PyPI with the usual commands:

pip install hdf5array

Quick start

Let's mock up a dense array:

import numpy
data = numpy.random.rand(40, 50, 100)

import h5py
with h5py.File("whee.h5", "w") as handle:
    handle.create_dataset("yay", data=data)

We can now represent it as a Hdf5DenseArray:

import hdf5array
arr = hdf5array.Hdf5DenseArray("whee.h5", "yay", native_order=True)
## <40 x 50 x 100> Hdf5DenseArray object of type 'float64'
## [[[0.63008796, 0.34849183, 0.75621679, ..., 0.07343495, 0.63095765,
##    0.625732  ],
##   [0.68123095, 0.91403054, 0.74737122, ..., 0.17344344, 0.82254404,
##    0.58158815],
##   [0.83287116, 0.40738123, 0.89887551, ..., 0.34936481, 0.76600276,
##    0.91991967],
##   ...,

This is just a subclass of a DelayedArray and can be used anywhere in the BiocPy framework. Parts of the NumPy API are also supported - for example, we could apply a variety of delayed operations:

scaling = numpy.random.rand(100)
transformed = numpy.log1p(arr / scaling)
## <40 x 50 x 100> DelayedArray object of type 'float64'
## [[[0.58803887, 0.3458478 , 0.82700531, ..., 0.08224734, 0.65678967,
##    0.56893312],
##   [0.62348907, 0.7341526 , 0.82040225, ..., 0.18437718, 0.7932422 ,
##    0.53784637],
##   [0.72176703, 0.39407341, 0.92788307, ..., 0.34205035, 0.75487196,
##    0.75456938],
##   ...,

Check out the documentation for more details.

Handling sparse matrices

We support a variety of compressed sparse formats where the non-zero elements are held inside three separate datasets - usually data, indices and indptr, based on the 10X Genomics sparse HDF5 format. To demonstrate, let's mock up some sparse data using scipy:

import scipy.sparse
mock = scipy.sparse.random(1000, 200, 0.1).tocsc()

with h5py.File("sparse_whee.h5", "w") as handle:
    handle.create_dataset("sparse_blah/data", data=mock.data, compression="gzip")
    handle.create_dataset("sparse_blah/indices", data=mock.indices, compression="gzip")
    handle.create_dataset("sparse_blah/indptr", data=mock.indptr, compression="gzip")

We can then create a sparse HDF5-backed matrix. Note that there is some variation in this HDF5 compressed sparse format, notably where the dimensions are stored and whether it is column/row-major. The constructor will not do any auto-detection so we need to provide this information explicitly:

import hdf5array
arr = hdf5array.Hdf5CompressedSparseMatrix(
    "sparse_whee.h5", 
    "sparse_blah", 
    shape=(100, 200), 
    by_column=True
)
## <100 x 200> sparse Hdf5CompressedSparseMatrix object of type 'float64'
## [[0.        , 0.        , 0.26563417, ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.23896924, 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.42236848, 0.3585153 ,
##   0.        ],
##  ...,
##  [0.        , 0.        , 0.3363087 , ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
##   0.        ]]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hdf5array-0.3.0.tar.gz (30.1 kB view details)

Uploaded Source

Built Distribution

hdf5array-0.3.0-py3-none-any.whl (10.6 kB view details)

Uploaded Python 3

File details

Details for the file hdf5array-0.3.0.tar.gz.

File metadata

  • Download URL: hdf5array-0.3.0.tar.gz
  • Upload date:
  • Size: 30.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for hdf5array-0.3.0.tar.gz
Algorithm Hash digest
SHA256 3a1193c616b1ed5a1ebd3331d6f5dd3434b94604737c647d585bdcef57269660
MD5 078368f75568296c14ccca784625cc52
BLAKE2b-256 2b869b3b3049db40cbace05640e2a14550e12b0de0d140cda959fb25db888468

See more details on using hashes here.

File details

Details for the file hdf5array-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: hdf5array-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 10.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for hdf5array-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 450a85573ce93c70e558ee0e3a9b1e8d0245153f2e20449043cdf72a907107a6
MD5 43ce0b33af3c05aeb1be8f948921b274
BLAKE2b-256 e93901747dabd2221a6e6ca967bdbccb5480673019b16062e0c2936017292054

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page