HDF5 storage driver for cogent3 sequence collections

cogent3-h5seqs: an HDF5 storage driver for cogent3 sequence collections

cogent3-h5seqs is a sequence storage plug-in for cogent3. It uses HDF5 as the storage format for biological sequences, supporting both unaligned sequence collections and alignments. Storage can be in memory (the default) or on disk, and sequences are compressed using the BLOSC2 compression engine.

The advantage of HDF5 is that once primary sequence formats have been converted from text into numpy arrays, loading and manipulating sequence data is fast and very memory efficient.

Sequences are stored under the hexdigest of their xxhash.hash64(). This means duplicated sequences are stored only once. The mapping of sequence names to hexdigests is stored as well.
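The deduplication scheme described above can be illustrated with a small, self-contained sketch. This is not the package's implementation: the class `DedupStore` and its methods are hypothetical names, plain dictionaries stand in for HDF5 datasets, and `hashlib.blake2b` from the standard library stands in for xxhash so the example runs without extra dependencies.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self._seqs = {}   # hexdigest -> sequence bytes, one entry per distinct sequence
        self._names = {}  # sequence name -> hexdigest

    def add(self, name: str, seq: str) -> str:
        data = seq.encode("ascii")
        # Assumption: blake2b is a stand-in here; the package uses xxhash.
        digest = hashlib.blake2b(data, digest_size=8).hexdigest()
        # Only store the bytes if this digest has not been seen before.
        self._seqs.setdefault(digest, data)
        self._names[name] = digest
        return digest

    def get(self, name: str) -> str:
        # Resolve name -> digest -> sequence.
        return self._seqs[self._names[name]].decode("ascii")

    @property
    def num_names(self) -> int:
        return len(self._names)

    @property
    def num_stored(self) -> int:
        return len(self._seqs)


store = DedupStore()
store.add("human", "ACGTACGT")
store.add("chimp", "ACGTACGT")  # identical to "human", so no new sequence is stored
store.add("mouse", "ACGTTCGT")
```

Three names are registered, but only two distinct sequences occupy storage. Each name still resolves to its full sequence through the name-to-digest mapping.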

Installation

pip install cogent3-h5seqs

Usage

Making cogent3-h5seqs the default storage

Using cogent3.set_storage_defaults(), you can set cogent3-h5seqs as the default storage. This means whenever a sequence collection is loaded from disk or created in memory, it will use the storage within this package.

The following statement makes cogent3-h5seqs the default for both unaligned and aligned sequence collections.

import cogent3

cogent3.set_storage_defaults(unaligned_seqs="h5seqs_unaligned",
                             aligned_seqs="h5seqs_aligned")

You can undo this setting with:

cogent3.set_storage_defaults(unaligned_seqs=None, aligned_seqs=None)

Using cogent3-h5seqs as storage per object

You don't have to make cogent3-h5seqs the default for all instances. You can instead select it on a per-object basis.

coll = cogent3.load_unaligned_seqs(some_path,
                                   moltype="dna",
                                   storage_backend="h5seqs_unaligned")

or, for alignments:

aln = cogent3.load_aligned_seqs(some_path,
                                moltype="dna",
                                storage_backend="h5seqs_aligned")

The same values can also be provided to the make_unaligned_seqs() and make_aligned_seqs() functions in cogent3.

Saving storage to disk

cogent3-h5seqs supports writing to disk, using the filename suffix .c3h5u for unaligned sequences and .c3h5a for aligned sequences. This works whether or not your current object uses cogent3-h5seqs for storage. For example:

import cogent3

sample_aln = cogent3.get_dataset("brca1")  # using the cogent3 builtin storage
outpath = "~/Desktop/alignment_output.c3h5a"
sample_aln.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage

For a sequence collection, do the following.

sample_coll = cogent3.get_dataset("brca1").degap()
# Note the different suffix
outpath = "~/Desktop/alignment_output.c3h5u"
sample_coll.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage

Loading storage from disk

cogent3 selects cogent3-h5seqs for loading based on the filename suffix.

inpath = "~/Desktop/alignment_output.c3h5u"
sample_coll = cogent3.load_unaligned_seqs(inpath, moltype="dna")

Note: You cannot write an alignment to the unaligned storage type, or vice versa. Nor can you load a file of one type as the other.
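The suffix convention and the restriction in the note can be checked before calling cogent3. The sketch below is a hypothetical helper, not part of cogent3 or cogent3-h5seqs: the names `storage_kind` and `check_compatible` are illustrative, and only the two suffixes come from the text above.

```python
from pathlib import Path

# Suffixes documented above; the helper names below are hypothetical.
SUFFIX_TO_KIND = {".c3h5u": "unaligned", ".c3h5a": "aligned"}


def storage_kind(path) -> str:
    """Return 'unaligned' or 'aligned' based on the filename suffix."""
    suffix = Path(path).suffix
    if suffix not in SUFFIX_TO_KIND:
        raise ValueError(f"not a cogent3-h5seqs suffix: {suffix!r}")
    return SUFFIX_TO_KIND[suffix]


def check_compatible(path, aligned: bool) -> None:
    """Raise if the suffix does not match the kind of object being read or written."""
    wanted = "aligned" if aligned else "unaligned"
    found = storage_kind(path)
    if found != wanted:
        raise ValueError(f"{path} holds {found} storage, but {wanted} was requested")


kind_a = storage_kind("~/Desktop/alignment_output.c3h5a")
kind_u = storage_kind("~/Desktop/alignment_output.c3h5u")
```

Calling `check_compatible(outpath, aligned=True)` before writing an alignment would flag a `.c3h5u` path early, rather than relying on the error raised by the library.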
