HDF5 storage driver for cogent3 sequence collections

cogent3-h5seqs: a HDF5 storage driver for cogent3 sequence collections

cogent3-h5seqs is a sequence storage plug-in for cogent3. It uses HDF5 as the storage format for biological sequences, supporting both unaligned sequence collections and alignments. Storage can be in memory (the default) or on disk, and sequences are compressed using the lzf compression engine.

The advantage of HDF5 is that once primary sequence formats have been converted from text into numpy arrays, loading and manipulating sequence data is fast and very memory efficient.

Sequences are stored under the hexdigest of their xxhash.hash64(). This means duplicated sequences are stored only once.
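
The deduplication idea can be illustrated with a toy content-addressed store. This is only a sketch of the principle, not the package's implementation: the DedupStore class and its methods are hypothetical, and hashlib.blake2b from the standard library stands in for xxhash so the example has no third-party dependencies.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: one copy per distinct sequence."""

    def __init__(self):
        self._data = {}   # hex digest -> sequence bytes
        self._names = {}  # sequence name -> hex digest

    def add(self, name: str, seq: str) -> str:
        raw = seq.encode("ascii")
        # Stand-in hash (assumption): the package itself keys on an xxhash digest.
        digest = hashlib.blake2b(raw, digest_size=8).hexdigest()
        # Only stored if this digest has not been seen before.
        self._data.setdefault(digest, raw)
        self._names[name] = digest
        return digest

    def get(self, name: str) -> str:
        return self._data[self._names[name]].decode("ascii")

    def num_stored(self) -> int:
        return len(self._data)


store = DedupStore()
store.add("s1", "ACGTACGT")
store.add("s2", "ACGTACGT")  # same content as s1, so no new entry is created
store.add("s3", "ACGTTTTT")
print(store.num_stored())
```

Three names are registered, but only the two distinct sequences occupy storage.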

Installation

pip install cogent3-h5seqs

Usage

Three types of sequence storage

| Storage Class | Suffix | Compression | Notes |
|---|---|---|---|
| UnalignedSeqsData | .c3h5u | lzf | For variable-length sequences. DNA/RNA moltypes are 2-bit encoded, reducing storage by 75%. The encoding can be turned off using packed=False. |
| AlignedSeqsData | .c3h5a | lzf | Dense storage for equal-length sequences. Every sequence is stored separately. |
| SparseSeqsData | .c3h5s | lzf | Sparse storage for equal-length sequences. Much more memory efficient than AlignedSeqsData. Faster to create and write. |
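
The 75% figure follows from packing four 2-bit codes into each byte instead of using one byte per base. Below is a minimal pure-Python sketch of that arithmetic. The function names, the A/C/G/T code assignment and the bit order are assumptions made for illustration, not the package's actual layout, and the sketch handles only the four canonical bases.

```python
# Illustrative code assignment (assumption); any fixed 2-bit mapping works.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack four bases per byte, first base in the highest two bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)
        out[i // 4] |= CODES[base] << shift
    return bytes(out)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed buffer."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11] for i in range(length)
    )


seq = "ACGTTGCA" * 4  # 32 bases, 32 bytes as ASCII text
packed = pack_2bit(seq)
assert unpack_2bit(packed, len(seq)) == seq
print(len(seq), len(packed))  # 32 bases occupy 8 bytes
```

The original length has to be kept alongside the packed bytes, because the final byte may be only partly filled.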

Making cogent3-h5seqs the default storage

Using cogent3.set_storage_defaults(), you can set cogent3-h5seqs as the default storage. This means whenever a sequence collection is loaded from disk or created in memory, it will use the storage within this package.

The following statement makes cogent3-h5seqs the default for both unaligned and aligned sequence collections.

import cogent3

cogent3.set_storage_defaults(unaligned_seqs="c3h5u",
                             aligned_seqs="c3h5a")

You can undo this setting with:

cogent3.set_storage_defaults(reset=True)

Equivalently, you could define

Using cogent3-h5seqs as storage per object

You don't have to make cogent3-h5seqs the default for all instances; you can select it on a per-object basis.

coll = cogent3.load_unaligned_seqs(some_path,
                                   moltype="dna",
                                   storage_backend="c3h5u")

or, for alignments:

aln = cogent3.load_aligned_seqs(some_path,
                                moltype="dna",
                                storage_backend="c3h5s")

The same values can also be passed to the make_unaligned_seqs() and make_aligned_seqs() functions in cogent3.

Note: With the 2-bit encoding for DNA/RNA sequences, you can safely turn off compression with compression=False. This can speed up operations. You can also turn off the encoding by setting packed=False.

Saving storage to disk

cogent3-h5seqs supports writing to disk. It uses the filename suffix .c3h5u for unaligned sequences, and .c3h5a (dense) or .c3h5s (sparse) for aligned sequences. Writing works whether or not your current object is using cogent3-h5seqs for storage. For example:

sample_aln = cogent3.get_dataset("brca1")  # using the cogent3 builtin storage
outpath = "~/Desktop/alignment_output.c3h5s"
sample_aln.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage

For a sequence collection, do the following.

sample_coll = cogent3.get_dataset("brca1").degap()
# Note the different suffix
outpath = "~/Desktop/alignment_output.c3h5u"
sample_coll.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage

Loading storage from disk

cogent3 selects cogent3-h5seqs for loading based on the filename suffix.

inpath = "~/Desktop/alignment_output.c3h5u"
sample_coll = cogent3.load_unaligned_seqs(inpath, moltype="dna")

Note: You cannot write an alignment instance to an unaligned storage type, or vice versa. Nor can you read a file of one type into the other.
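
The suffix-based routing and the restriction in the note can be pictured with a small lookup. This is a hedged sketch only: the suffixes come from the table above, while SUFFIX_TO_KIND, storage_kind() and check_compatible() are hypothetical names that are not part of cogent3 or cogent3-h5seqs.

```python
from pathlib import Path

# Suffixes as listed in this README; the mapping itself is illustrative.
SUFFIX_TO_KIND = {
    ".c3h5u": "unaligned",
    ".c3h5a": "aligned",
    ".c3h5s": "aligned",
}


def storage_kind(path: str) -> str:
    """Return 'aligned' or 'unaligned' for a cogent3-h5seqs file name."""
    suffix = Path(path).suffix
    try:
        return SUFFIX_TO_KIND[suffix]
    except KeyError:
        raise ValueError(f"unrecognised suffix {suffix!r}") from None


def check_compatible(path: str, is_alignment: bool) -> None:
    """Raise if an alignment is paired with unaligned storage, or vice versa."""
    expected = "aligned" if is_alignment else "unaligned"
    found = storage_kind(path)
    if found != expected:
        raise ValueError(f"{path} is {found} storage, expected {expected}")


check_compatible("alignment_output.c3h5s", is_alignment=True)   # accepted
check_compatible("alignment_output.c3h5u", is_alignment=False)  # accepted
```

Pairing an alignment with a .c3h5u path, or an unaligned collection with .c3h5a or .c3h5s, raises ValueError in this sketch.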
