File-backed objects for array-like and matrix-like data

FileBackedArray

Introduction

This is the Python equivalent of Bioconductor's HDF5Array package, providing a representation of HDF5-backed arrays within the delayedarray framework. The idea is to allow users to store, manipulate and operate on large datasets without loading them into memory, in a manner that is trivially compatible with other data structures in the BiocPy ecosystem.

Installation

This package can be installed from PyPI with the usual commands:

pip install filebackedarray

Quick start

Let's mock up a dense array:

import numpy
data = numpy.random.rand(40, 50, 100)

import h5py
with h5py.File("whee.h5", "w") as handle:
    handle.create_dataset("yay", data=data)

We can now represent it as an Hdf5DenseArray:

import filebackedarray
arr = filebackedarray.Hdf5DenseArray("whee.h5", "yay", native_order=True)
## <40 x 50 x 100> Hdf5DenseArray object of type 'float64'
## [[[0.63008796, 0.34849183, 0.75621679, ..., 0.07343495, 0.63095765,
##    0.625732  ],
##   [0.68123095, 0.91403054, 0.74737122, ..., 0.17344344, 0.82254404,
##    0.58158815],
##   [0.83287116, 0.40738123, 0.89887551, ..., 0.34936481, 0.76600276,
##    0.91991967],
##   ...,

This is a subclass of DelayedArray and can be used anywhere in the BiocPy framework. Parts of the NumPy API are also supported; for example, we can apply a variety of delayed operations, which are recorded rather than evaluated immediately:

scaling = numpy.random.rand(100)
transformed = numpy.log1p(arr / scaling)
## <40 x 50 x 100> DelayedArray object of type 'float64'
## [[[0.58803887, 0.3458478 , 0.82700531, ..., 0.08224734, 0.65678967,
##    0.56893312],
##   [0.62348907, 0.7341526 , 0.82040225, ..., 0.18437718, 0.7932422 ,
##    0.53784637],
##   [0.72176703, 0.39407341, 0.92788307, ..., 0.34205035, 0.75487196,
##    0.75456938],
##   ...,
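
To give a rough idea of what "delayed" means here, the sketch below records a seed array and a scaling vector, and only computes the transformation for the block that is actually requested. It uses plain NumPy and a hypothetical class name (DelayedLog1pScale); it is an illustration of the concept under those assumptions, not how delayedarray or filebackedarray are implemented.

```python
import numpy


class DelayedLog1pScale:
    """Hypothetical illustration of a delayed operation.

    Remembers a seed and a scaling vector, and evaluates
    log1p(seed / scaling) only for the block that is requested.
    This sketch supports slice indices only.
    """

    def __init__(self, seed, scaling):
        self.seed = seed        # any object supporting NumPy-style slicing
        self.scaling = scaling  # 1-D vector along the last dimension
        self.shape = seed.shape

    def __getitem__(self, index):
        # Normalise the request to one slice per dimension.
        if not isinstance(index, tuple):
            index = (index,)
        index = index + (slice(None),) * (len(self.shape) - len(index))
        # Only the requested block is pulled from the seed.
        block = numpy.asarray(self.seed[index])
        # The scaling vector is subset along the last dimension to match.
        return numpy.log1p(block / self.scaling[index[-1]])


seed = numpy.random.rand(4, 5, 6)
scaling = numpy.random.rand(6) + 0.5
delayed = DelayedLog1pScale(seed, scaling)

# Nothing has been computed yet; this evaluates a 2 x 5 x 2 block only.
block = delayed[0:2, :, 1:3]
```

With a file-backed seed, the same pattern means that only the requested block needs to be read from disk before the arithmetic is applied.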

Check out the documentation for more details.

Handling sparse matrices

We support a variety of compressed sparse formats where the non-zero elements are held in three separate datasets inside an HDF5 group. These are usually named data, indices and indptr, following the 10X Genomics sparse HDF5 format. To demonstrate, let's mock up some sparse data using scipy:

import scipy.sparse
# 100 x 200 matrix with 10% density, in compressed sparse column (CSC) format.
mock = scipy.sparse.random(100, 200, 0.1).tocsc()

with h5py.File("sparse_whee.h5", "w") as handle:
    handle.create_dataset("sparse_blah/data", data=mock.data, compression="gzip")
    handle.create_dataset("sparse_blah/indices", data=mock.indices, compression="gzip")
    handle.create_dataset("sparse_blah/indptr", data=mock.indptr, compression="gzip")
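
To make the roles of the three datasets concrete, here is a small sketch of the compressed sparse column layout using plain NumPy. The toy matrix and the extract_column helper are hypothetical and only illustrate the general format; they are not part of filebackedarray.

```python
import numpy

# A 3 x 4 matrix with five non-zero values, written out densely for reference.
dense = numpy.array([
    [1.0, 0.0, 0.0, 4.0],
    [0.0, 0.0, 3.0, 0.0],
    [2.0, 0.0, 0.0, 5.0],
])

# Compressed sparse column (CSC) layout of the same matrix.
data = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])  # non-zero values, column by column
indices = numpy.array([0, 2, 1, 0, 2])         # row index of each value in 'data'
indptr = numpy.array([0, 2, 2, 3, 5])          # column j spans data[indptr[j]:indptr[j + 1]]


def extract_column(j, nrow):
    """Rebuild dense column j from the three vectors (hypothetical helper)."""
    start, end = indptr[j], indptr[j + 1]
    column = numpy.zeros(nrow)
    column[indices[start:end]] = data[start:end]
    return column


third_column = extract_column(2, nrow=3)
```

Note that indptr always has one more entry than the number of columns, and that the number of rows cannot be recovered from the three vectors alone.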

We can then create a sparse HDF5-backed matrix. Note that there is some variation in this HDF5 compressed sparse format, notably in where the dimensions are stored and whether the layout is column-major or row-major. The constructor does not attempt any auto-detection, so we need to provide this information explicitly:

import filebackedarray
arr = filebackedarray.Hdf5CompressedSparseMatrix(
    "sparse_whee.h5", 
    "sparse_blah", 
    shape=(100, 200), 
    by_column=True
)
## <100 x 200> sparse Hdf5CompressedSparseMatrix object of type 'float64'
## [[0.        , 0.        , 0.26563417, ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.23896924, 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.42236848, 0.3585153 ,
##   0.        ],
##  ...,
##  [0.        , 0.        , 0.3363087 , ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
##   0.        ],
##  [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
##   0.        ]]
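
Because the shape and orientation are supplied by hand, it can be worth checking them against the stored vectors before constructing the matrix. The sketch below is one possible check in plain NumPy; check_sparse_shape is a hypothetical helper, and whether the constructor performs similar validation itself is not stated here.

```python
import numpy


def check_sparse_shape(indices, indptr, shape, by_column=True):
    """Hypothetical helper: raise ValueError if 'shape' cannot describe
    the given compressed sparse vectors."""
    nrow, ncol = shape
    # The compressed dimension has one pointer interval per column (CSC)
    # or per row (CSR); the other dimension bounds the stored indices.
    n_primary = ncol if by_column else nrow
    n_secondary = nrow if by_column else ncol
    if len(indptr) != n_primary + 1:
        raise ValueError(
            f"expected {n_primary + 1} pointers, found {len(indptr)}"
        )
    if len(indices) > 0 and int(numpy.max(indices)) >= n_secondary:
        raise ValueError(
            f"largest index {int(numpy.max(indices))} does not fit in {n_secondary}"
        )


# Toy CSC vectors for a 3 x 4 matrix.
indices = numpy.array([0, 2, 1, 0, 2])
indptr = numpy.array([0, 2, 2, 3, 5])

check_sparse_shape(indices, indptr, shape=(3, 4), by_column=True)  # passes silently

try:
    check_sparse_shape(indices, indptr, shape=(2, 4), by_column=True)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

For a file on disk, the same check could be run on the indices and indptr datasets after reading them with h5py.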

