
HDF5 storage driver for cogent3 sequence collections


cogent3-h5seqs: an HDF5 storage driver for cogent3 sequence collections

cogent3-h5seqs is a sequence storage plug-in for cogent3. It uses HDF5 as the storage format for biological sequences, supporting both unaligned sequence collections and alignments. Storage can be in memory (the default) or on disk and sequences are compressed using the lzf compression engine.

The advantage of HDF5 is that once primary sequence formats have been converted from text into numpy arrays, loading and manipulating sequence data is fast and very memory efficient.

Sequences are stored under the hexdigest of their xxhash.hash64(), so duplicated sequences are stored only once. The mapping of sequence names to hexdigests is stored alongside the sequence data.
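The deduplication idea can be sketched in a few lines. This is an illustration only, not the package's implementation: the DedupStore class and digest() helper are hypothetical, a plain dict stands in for the HDF5 file, and hashlib.blake2b truncated to 8 bytes stands in for the 64-bit xxhash digest so that the example needs only the standard library.

```python
import hashlib


def digest(seq: bytes) -> str:
    """Return a 64-bit hex digest of seq.

    Assumption: blake2b with an 8-byte digest is a stand-in for the
    xxhash-based digest described above.
    """
    return hashlib.blake2b(seq, digest_size=8).hexdigest()


class DedupStore:
    """Hypothetical in-memory store keyed by sequence digest."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}  # hexdigest -> sequence
        self._names: dict[str, str] = {}   # sequence name -> hexdigest

    def add(self, name: str, seq: bytes) -> None:
        key = digest(seq)
        # Identical sequences share one entry; only the name mapping grows.
        self._data.setdefault(key, seq)
        self._names[name] = key

    def get(self, name: str) -> bytes:
        return self._data[self._names[name]]

    @property
    def num_stored(self) -> int:
        return len(self._data)


store = DedupStore()
store.add("human", b"ACGTACGT")
store.add("chimp", b"ACGTACGT")  # duplicate of "human"
store.add("mouse", b"ACGTTCGT")
```

Here three names are registered but only two distinct sequences are held, and both "human" and "chimp" resolve to the same stored entry.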

Installation

pip install cogent3-h5seqs

Usage

Three types of sequence storage

Unaligned sequences

For sequences that may not be the same length, select c3h5u, or h5seqs_unaligned.

Aligned sequences, full storage

For sequences that must be the same length, select c3h5a, or h5seqs_aligned. This is a dense storage format where every sequence is stored separately.

Aligned sequences, sparse storage

For sequences that must be the same length, select c3h5s, or h5seqs_sparse. This uses a sparse matrix for storage, reducing memory and storage requirements. It is also faster to create and write than the dense variant.

Making cogent3-h5seqs the default storage

Using cogent3.set_storage_defaults(), you can set cogent3-h5seqs as the default storage. This means whenever a sequence collection is loaded from disk or created in memory, it will use the storage within this package.

The following statement makes cogent3-h5seqs the default for both unaligned and aligned sequence collections.

import cogent3

cogent3.set_storage_defaults(unaligned_seqs="c3h5u",
                             aligned_seqs="c3h5a")

You can undo this setting with

cogent3.set_storage_defaults(reset=True)

Equivalently, you could define

Using cogent3-h5seqs as storage per object

You don't have to specify the storage as the default for all instances, but can do it on a per object basis.

coll = cogent3.load_unaligned_seqs(some_path,
                                   moltype="dna",
                                   storage_backend="h5seqs_unaligned")

or, for alignments,

aln = cogent3.load_aligned_seqs(some_path,
                                moltype="dna",
                                storage_backend="c3h5s")

The same values can also be provided to the make_unaligned_seqs(), make_aligned_seqs() functions in cogent3.

Note: You can turn off compression with compression=False. This can speed up operations.

Saving storage to disk

cogent3-h5seqs supports writing to disk, using the filename suffix .c3h5u for unaligned sequences and .c3h5a (dense) or .c3h5s (sparse) for aligned sequences. This works whether or not your current object uses cogent3-h5seqs for storage. For example

import cogent3

sample_aln = cogent3.get_dataset("brca1")  # using the cogent3 builtin storage
outpath = "~/Desktop/alignment_output.c3h5s"
sample_aln.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage

For a sequence collection, do the following.

sample_coll = cogent3.get_dataset("brca1").degap()
# Note the different suffix
outpath = "~/Desktop/alignment_output.c3h5u"
sample_coll.write(outpath)  # writes out as cogent3-h5seqs HDF5 storage

Loading storage from disk

cogent3 selects cogent3-h5seqs for loading based on the filename suffix.

inpath = "~/Desktop/alignment_output.c3h5u"
sample_coll = cogent3.load_unaligned_seqs(inpath, moltype="dna")

Note: You cannot write an alignment instance to an unaligned storage type, or vice versa. Nor can you read a file of one type into the other.
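The suffix rules above can be pictured as a small lookup table. The sketch below is not how cogent3 performs the dispatch internally. The SUFFIX_TO_BACKEND table and the backend_for() helper are hypothetical names introduced for illustration, and only the suffixes and backend names come from this document.

```python
from pathlib import Path

# Hypothetical table: suffix -> (collection kind, storage backend name).
SUFFIX_TO_BACKEND = {
    ".c3h5u": ("unaligned", "h5seqs_unaligned"),
    ".c3h5a": ("aligned", "h5seqs_aligned"),
    ".c3h5s": ("aligned", "h5seqs_sparse"),
}


def backend_for(path: str, kind: str) -> str:
    """Return the backend name for path, checking it matches kind.

    kind is "aligned" or "unaligned". A ValueError is raised for an
    unknown suffix or when the suffix belongs to the other kind.
    """
    suffix = Path(path).suffix
    if suffix not in SUFFIX_TO_BACKEND:
        raise ValueError(f"unrecognised suffix {suffix!r}")
    expected_kind, backend = SUFFIX_TO_BACKEND[suffix]
    if kind != expected_kind:
        raise ValueError(f"{suffix} holds {expected_kind} data, not {kind}")
    return backend


unaligned_backend = backend_for("~/Desktop/alignment_output.c3h5u", "unaligned")
sparse_backend = backend_for("~/Desktop/alignment_output.c3h5s", "aligned")
```

Asking for an aligned backend with a .c3h5u path raises ValueError in this sketch, mirroring the restriction in the note above.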
