HDF5 storage driver for cogent3 sequence collections

cogent3-h5seqs: an HDF5 storage driver for cogent3 sequence collections

cogent3-h5seqs is a sequence storage plug-in for cogent3. It uses HDF5 as the storage format for biological sequences and supports both unaligned sequence collections and alignments. Storage can be in memory (the default) or on disk, and sequences are compressed using lzf compression.

The advantage of HDF5 is that once primary sequence formats have been converted from text into numpy arrays, loading and manipulating sequence data is fast and very memory efficient.

Sequences are stored under the hexdigest of their xxhash.hash64(). As a result, duplicated sequences are stored only once. The mapping of sequence names to hexdigests is stored alongside the sequence data.
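The deduplication idea can be sketched as follows. This is a minimal illustration and not the package's implementation: it uses hashlib.blake2b from the standard library in place of xxhash, and the store_seqs helper and its dictionary layout are hypothetical.

```python
import hashlib


def store_seqs(named_seqs: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Return (name -> hexdigest, hexdigest -> sequence).

    Hypothetical helper for illustration only; blake2b stands in for xxhash.
    """
    name_to_digest: dict[str, str] = {}
    digest_to_seq: dict[str, str] = {}
    for name, seq in named_seqs.items():
        digest = hashlib.blake2b(seq.encode("utf8"), digest_size=8).hexdigest()
        name_to_digest[name] = digest
        # identical sequences share a digest, so the data is kept only once
        digest_to_seq.setdefault(digest, seq)
    return name_to_digest, digest_to_seq


names, data = store_seqs({"s1": "ACGT", "s2": "ACGT", "s3": "TTGA"})
print(len(names), len(data))  # 3 names, 2 stored sequences
```

Here s1 and s2 are identical, so they point at the same stored entry.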

Installation

pip install cogent3-h5seqs

Usage

Three types of sequence storage

Unaligned sequences

For sequences that may not be the same length, select c3h5u, or h5seqs_unaligned.

Aligned sequences, full storage

For sequences that must be the same length, select c3h5a, or h5seqs_aligned. This is a dense storage format where every sequence is stored separately.

Aligned sequences, sparse storage

For sequences that must be the same length, select c3h5s, or h5seqs_sparse. This uses a sparse matrix for storage, reducing memory and storage requirements. It is also faster to create and write than the dense variant.

Making cogent3-h5seqs the default storage

Using cogent3.set_storage_defaults(), you can set cogent3-h5seqs as the default storage. This means whenever a sequence collection is loaded from disk or created in memory, it will use the storage within this package.

The following statement makes cogent3-h5seqs the default for both unaligned and aligned sequence collections.

import cogent3

cogent3.set_storage_defaults(unaligned_seqs="c3h5u",
                             aligned_seqs="c3h5a")

You can undo this setting with:

cogent3.set_storage_defaults(reset=True)

Equivalently, you could define

Using cogent3-h5seqs as storage per object

You don't have to make cogent3-h5seqs the default for all instances; you can select it on a per-object basis.

coll = cogent3.load_unaligned_seqs(some_path,
                                   moltype="dna",
                                   storage_backend="h5seqs_unaligned")

or, for alignments:

aln = cogent3.load_aligned_seqs(some_path,
                                moltype="dna",
                                storage_backend="c3h5s")

The same values can also be provided to the make_unaligned_seqs() and make_aligned_seqs() functions in cogent3.

Note: You can turn off compression with compression=False. This can speed up operations, at the cost of larger storage.

Saving storage to disk

cogent3-h5seqs supports writing to disk. It uses the filename suffix .c3h5u for unaligned sequences, and .c3h5a (dense) or .c3h5s (sparse) for aligned sequences. Writing works whether or not your current object is using cogent3-h5seqs for storage. For example:

import cogent3

sample_aln = cogent3.get_dataset("brca1")  # using the cogent3 builtin storage
outpath = "~/Desktop/alignment_output.c3h5s"
sample_aln.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage

For a sequence collection, do the following.

sample_coll = cogent3.get_dataset("brca1").degap()
# Note the different suffix
outpath = "~/Desktop/alignment_output.c3h5u"
sample_coll.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage

Loading storage from disk

cogent3 selects cogent3-h5seqs for loading based on the filename suffix.

inpath = "~/Desktop/alignment_output.c3h5u"
sample_coll = cogent3.load_unaligned_seqs(inpath, moltype="dna")

Note: You cannot write an alignment to an unaligned storage type, or vice versa. Nor can you load a file of one storage type as the other.
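The suffix-based selection and the aligned/unaligned restriction can be pictured with a small lookup. This is only a sketch under stated assumptions: the suffixes and storage names are those mentioned above, but the SUFFIX_TO_STORAGE table and the storage_for_path helper are hypothetical and are not part of cogent3 or cogent3-h5seqs.

```python
from pathlib import Path

# Suffix -> (storage name, whether it holds aligned data). The suffixes and
# storage names come from the text above; the table itself is illustrative.
SUFFIX_TO_STORAGE = {
    ".c3h5u": ("c3h5u", False),
    ".c3h5a": ("c3h5a", True),
    ".c3h5s": ("c3h5s", True),
}


def storage_for_path(path: str, aligned: bool) -> str:
    """Hypothetical helper: pick a storage name from the filename suffix."""
    suffix = Path(path).suffix
    if suffix not in SUFFIX_TO_STORAGE:
        raise ValueError(f"unrecognised suffix {suffix!r}")
    name, holds_aligned = SUFFIX_TO_STORAGE[suffix]
    if holds_aligned != aligned:
        # mirrors the rule that aligned and unaligned types cannot be mixed
        raise ValueError(f"{suffix!r} storage does not match the requested type")
    return name


print(storage_for_path("~/Desktop/alignment_output.c3h5u", aligned=False))  # c3h5u
```

Asking for an aligned object from a .c3h5u path raises a ValueError in this sketch, which corresponds to the restriction described in the note.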
