File-backed objects for array- and matrix-like data

hdf5array

Introduction

This is the Python equivalent of Bioconductor's HDF5Array package, providing a representation of HDF5-backed arrays within the delayedarray framework. The idea is to allow users to store, manipulate and operate on large datasets without loading them into memory, in a manner that is trivially compatible with other data structures in the BiocPy ecosystem.

Installation

This package can be installed from PyPI with the usual commands:

pip install hdf5array

Quick start

Let's mock up a dense array:

import numpy
data = numpy.random.rand(40, 50, 100)

import h5py
with h5py.File("whee.h5", "w") as handle:
    handle.create_dataset("yay", data=data)

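We can now represent it as an Hdf5DenseArray. With native_order=True, the dimensions are reported in the same order in which h5py wrote them, as the shape in the output shows: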

import hdf5array
arr = hdf5array.Hdf5DenseArray("whee.h5", "yay", native_order=True)
## <40 x 50 x 100> Hdf5DenseArray object of type 'float64'
## [[[0.63008796, 0.34849183, 0.75621679, ..., 0.07343495, 0.63095765,
##    0.625732  ],
##   [0.68123095, 0.91403054, 0.74737122, ..., 0.17344344, 0.82254404,
##    0.58158815],
##   [0.83287116, 0.40738123, 0.89887551, ..., 0.34936481, 0.76600276,
##    0.91991967],
##   ...,

Hdf5DenseArray is a subclass of DelayedArray, so it can be used anywhere in the BiocPy framework that accepts one. Parts of the NumPy API are also supported; for example, we can apply a variety of delayed operations, which are recorded rather than evaluated immediately:

scaling = numpy.random.rand(100)
transformed = numpy.log1p(arr / scaling)
## <40 x 50 x 100> DelayedArray object of type 'float64'
## [[[0.58803887, 0.3458478 , 0.82700531, ..., 0.08224734, 0.65678967,
##    0.56893312],
##   [0.62348907, 0.7341526 , 0.82040225, ..., 0.18437718, 0.7932422 ,
##    0.53784637],
##   [0.72176703, 0.39407341, 0.92788307, ..., 0.34205035, 0.75487196,
##    0.75456938],
##   ...,
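
To give a feel for what "delayed" means here, the following is a minimal sketch using only NumPy. The LazyArray class is hypothetical and is not how delayedarray is implemented; it only illustrates the idea of queuing operations and running them when a block is requested. To keep broadcasting against scaling valid, the sketch assumes that only the first axis is indexed.

```python
import numpy

class LazyArray:
    """Hypothetical stand-in for a delayed array: wraps a seed and a queue of
    element-wise operations that only run when a block is requested."""

    def __init__(self, seed, operations=()):
        self.seed = seed
        self.operations = tuple(operations)

    @property
    def shape(self):
        return self.seed.shape

    def apply(self, func):
        # Record the operation; nothing is computed at this point.
        return LazyArray(self.seed, self.operations + (func,))

    def __getitem__(self, index):
        # Pull only the requested block from the seed, then replay the queue.
        # Assumption: `index` selects along the first axis only.
        block = numpy.asarray(self.seed[index])
        for func in self.operations:
            block = func(block)
        return block

rng = numpy.random.default_rng(0)
data = rng.random((40, 50, 100))
scaling = rng.random(100) + 0.5  # offset keeps the divisors away from zero

# Queue the same two steps as above: divide by `scaling`, then log1p.
lazy = LazyArray(data).apply(lambda x: x / scaling).apply(numpy.log1p)

# Only the first two slices are extracted and transformed.
block = lazy[:2]
expected = numpy.log1p(data[:2] / scaling)
```

In this sketch the seed is an in-memory NumPy array, but any object that supports slicing (such as an open h5py dataset) could take its place, which is what keeps memory usage proportional to the requested block rather than to the full dataset.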

Check out the documentation for more details.

Handling sparse matrices

We support a variety of compressed sparse formats in which the non-zero elements are held in three separate datasets, usually named data, indices and indptr, following the 10X Genomics sparse HDF5 format. To demonstrate, let's mock up some sparse data using scipy:

import scipy.sparse
mock = scipy.sparse.random(100, 200, 0.1).tocsc()  # 100 rows, to match the shape passed to the constructor

with h5py.File("sparse_whee.h5", "w") as handle:
    handle.create_dataset("sparse_blah/data", data=mock.data, compression="gzip")
    handle.create_dataset("sparse_blah/indices", data=mock.indices, compression="gzip")
    handle.create_dataset("sparse_blah/indptr", data=mock.indptr, compression="gzip")
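
For readers unfamiliar with the layout, here is a small sketch of how the three arrays encode a column-compressed matrix. The csc_to_dense helper is hypothetical and written only for illustration; it is not part of hdf5array.

```python
import numpy

def csc_to_dense(data, indices, indptr, shape):
    """Expand compressed sparse column components into a dense array.
    `shape` is (rows, columns)."""
    dense = numpy.zeros(shape, dtype=numpy.asarray(data).dtype)
    for col in range(shape[1]):
        # indptr[col]:indptr[col + 1] delimits the non-zeros of this column.
        start, end = indptr[col], indptr[col + 1]
        # indices holds the row position of each non-zero value in data.
        dense[indices[start:end], col] = data[start:end]
    return dense

# Components for the 3 x 4 matrix
# [[1, 0, 0, 4],
#  [0, 0, 3, 0],
#  [2, 0, 0, 5]]
data = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
indices = numpy.array([0, 2, 1, 0, 2])
indptr = numpy.array([0, 2, 2, 3, 5])  # the second column is empty

dense = csc_to_dense(data, indices, indptr, (3, 4))
```

Note that indptr has one more entry than the number of columns, and that consecutive equal entries mark an empty column.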

We can then create a sparse HDF5-backed matrix. Note that there is some variation in this HDF5 compressed sparse format, notably in where the dimensions are stored and in whether the data is compressed by column or by row. The constructor does not attempt any auto-detection, so we need to provide this information explicitly:

import hdf5array
arr = hdf5array.Hdf5CompressedSparseMatrix(
    "sparse_whee.h5", 
    "sparse_blah", 
    shape=(100, 200), 
    by_column=True
)
## <100 x 200> sparse Hdf5CompressedSparseMatrix object of type 'float64'
## [[0.        , 0.        , 0.26563417, ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.23896924, 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.42236848, 0.3585153 ,
##   0.        ],
##  ...,
##  [0.        , 0.        , 0.3363087 , ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
##   0.        ]]
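
To see why the shape and layout have to be supplied, consider the sketch below. It uses a hypothetical expand helper (not part of hdf5array) to show that the same three arrays describe different matrices depending on the stated layout, and that the length of the uncompressed dimension cannot be recovered from the arrays alone.

```python
import numpy

def expand(data, indices, indptr, shape, by_column):
    """Hypothetical helper: densify compressed sparse components
    under an explicitly stated layout."""
    dense = numpy.zeros(shape, dtype=data.dtype)
    for i in range(len(indptr) - 1):
        start, end = indptr[i], indptr[i + 1]
        if by_column:
            # i is a column; indices are row positions.
            dense[indices[start:end], i] = data[start:end]
        else:
            # i is a row; indices are column positions.
            dense[i, indices[start:end]] = data[start:end]
    return dense

data = numpy.array([1.0, 2.0, 3.0])
indices = numpy.array([0, 1, 0])
indptr = numpy.array([0, 2, 3])

# Same components, different interpretations.
as_csc = expand(data, indices, indptr, (2, 2), by_column=True)
as_csr = expand(data, indices, indptr, (2, 2), by_column=False)

# The arrays only give a lower bound (max index + 1) on the number of rows
# of a column-compressed matrix, so a taller shape is equally valid.
taller = expand(data, indices, indptr, (5, 2), by_column=True)
```

Here as_csc and as_csr are transposes of each other, and taller is as_csc padded with three all-zero rows, so neither the layout nor the full shape can be inferred reliably from data, indices and indptr.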
