HDF5 storage driver for cogent3 sequence collections
cogent3-h5seqs: an HDF5 storage driver for cogent3 sequence collections
cogent3-h5seqs is a sequence storage plug-in for cogent3. It uses HDF5 as the storage format for biological sequences, supporting both unaligned sequence collections and alignments. Storage can be in memory (the default) or on disk, and sequences are compressed using the lzf compression filter.
The advantage of HDF5 is that once primary sequence formats have been converted from text into numpy arrays, loading and manipulating sequence data is fast and very memory efficient.
Sequences are stored under the hexdigest of their xxhash.hash64(). As a result, duplicated sequences are stored only once. The mapping of sequence names to hexdigests is stored alongside the sequences.
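The hash-keyed layout described above can be illustrated with a minimal in-memory sketch. This is not the package's implementation: the DedupStore class and its methods are hypothetical names, and hashlib.blake2b with an 8-byte digest stands in for the xxhash call so the example needs only the standard library.

```python
import hashlib


class DedupStore:
    """Hypothetical content-addressed store: one copy per distinct sequence."""

    def __init__(self) -> None:
        self._seqs: dict[str, str] = {}   # hexdigest -> sequence
        self._names: dict[str, str] = {}  # sequence name -> hexdigest

    @staticmethod
    def _digest(seq: str) -> str:
        # Assumption: a 64-bit BLAKE2b digest stands in for the xxhash digest.
        return hashlib.blake2b(seq.encode("ascii"), digest_size=8).hexdigest()

    def add(self, name: str, seq: str) -> None:
        digest = self._digest(seq)
        # setdefault keeps the existing copy if this sequence was seen before
        self._seqs.setdefault(digest, seq)
        self._names[name] = digest

    def get(self, name: str) -> str:
        return self._seqs[self._names[name]]

    @property
    def num_unique(self) -> int:
        return len(self._seqs)


store = DedupStore()
store.add("human", "ACGTACGT")
store.add("chimp", "ACGTACGT")  # duplicate content, stored once
store.add("mouse", "ACGTTCGT")
```

Here three names map onto two stored sequences, and lookups by name go through the name-to-digest mapping.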
Installation
pip install cogent3-h5seqs
Usage
Three types of sequence storage
Unaligned sequences
For sequences that may not be the same length, select c3h5u, or h5seqs_unaligned. Nucleic acid sequences (cogent3 "dna" or "rna" moltypes) are 2-bit encoded, reducing storage requirements (on disk or in memory) by 75% relative to one byte per base.
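To see where the 75% figure comes from, here is a minimal sketch of 2-bit packing in plain Python. The function names and the bit layout (four bases per byte, lowest bits first) are assumptions for illustration; the source does not describe the package's actual layout or how it handles characters outside the four canonical bases.

```python
# Assumed code assignment; the package's own mapping may differ.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack four bases into each byte, lowest two bits first."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover the original sequence; length is needed to drop padding."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


seq = "ACGTACGTACGTACGT"
packed = pack_2bit(seq)
```

Sixteen bases at one byte each become four bytes, which is the 75% reduction. The sequence length has to be kept separately because the final byte may be padded.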
Aligned sequences, full storage
For sequences that must be the same length, select c3h5a, or h5seqs_aligned. This is a dense storage format where every sequence is stored separately.
Aligned sequences, sparse storage
For sequences that must be the same length, select c3h5s, or h5seqs_sparse. This uses a sparse matrix for storage, reducing memory and storage requirements. It is also faster to create and write than the dense variant.
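The source does not say what the sparse matrix holds. One way a sparse scheme can save space for alignments, sketched below purely as an assumption, is to keep a single reference sequence and record only the positions where each sequence differs from it. The function names are hypothetical.

```python
def to_sparse(seqs: dict[str, str]) -> tuple[str, dict[str, dict[int, str]]]:
    """Return a reference sequence plus per-sequence {position: character} differences."""
    names = list(seqs)
    ref = seqs[names[0]]  # assumption: the first sequence serves as the reference
    diffs: dict[str, dict[int, str]] = {}
    for name, seq in seqs.items():
        if len(seq) != len(ref):
            raise ValueError(f"{name!r} is not the same length as the reference")
        diffs[name] = {i: c for i, (r, c) in enumerate(zip(ref, seq)) if r != c}
    return ref, diffs


def from_sparse(ref: str, diffs: dict[str, dict[int, str]], name: str) -> str:
    """Rebuild one aligned sequence from the reference and its differences."""
    chars = list(ref)
    for i, c in diffs[name].items():
        chars[i] = c
    return "".join(chars)


aln = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "AC-TACGT"}
ref, diffs = to_sparse(aln)
```

For closely related sequences most entries are identical to the reference, so only a handful of positions need storing per sequence.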
Making cogent3-h5seqs the default storage
Using cogent3.set_storage_defaults(), you can set cogent3-h5seqs as the default storage. This means whenever a sequence collection is loaded from disk or created in memory, it will use the storage within this package.
The following statement makes cogent3-h5seqs the default for both unaligned and aligned sequence collections.
import cogent3
cogent3.set_storage_defaults(unaligned_seqs="c3h5u",
                             aligned_seqs="c3h5a")
You can undo this setting with:
cogent3.set_storage_defaults(reset=True)
Using cogent3-h5seqs as storage per object
You don't have to make cogent3-h5seqs the default for all instances; you can select it on a per-object basis.
coll = cogent3.load_unaligned_seqs(some_path,
                                   moltype="dna",
                                   storage_backend="h5seqs_unaligned")
Or, for alignments:
aln = cogent3.load_aligned_seqs(some_path,
                                moltype="dna",
                                storage_backend="c3h5s")
The same values can also be provided to the make_unaligned_seqs() and make_aligned_seqs() functions in cogent3.
Note: With the 2-bit encoding for DNA/RNA sequences, you can safely turn off compression with compression=False. This can speed up operations. You can also turn off the encoding by setting packed=False.
Saving storage to disk
cogent3-h5seqs supports writing to disk. It uses the filename suffix .c3h5u for unaligned sequences and .c3h5a for aligned sequences; the example below uses .c3h5s, matching the sparse aligned storage name. Writing works whether or not your current object is using cogent3-h5seqs for storage. For example:
import cogent3
sample_aln = cogent3.get_dataset("brca1") # using the cogent3 builtin storage
outpath = "~/Desktop/alignment_output.c3h5s"
sample_aln.write(outpath) # writes out as cogent3-h5seqs HDF5 storage
For a sequence collection, do the following.
sample_coll = cogent3.get_dataset("brca1").degap()
# Note the different suffix
outpath = "~/Desktop/alignment_output.c3h5u"
sample_coll.write(outpath) # writes out as cogent3-h5seqs HDF5 storage
Loading storage from disk
cogent3 dispatches loading to cogent3-h5seqs based on the filename suffix.
inpath = "~/Desktop/alignment_output.c3h5u"
sample_coll = cogent3.load_unaligned_seqs(inpath, moltype="dna")
Note: You cannot write an alignment instance to an unaligned storage type, or vice versa. Nor can you load a file into a collection type different from the one it was written as.
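If you want to catch a suffix mismatch before calling write() or a load function, a small guard can help. The sketch below is hypothetical and independent of cogent3's own plug-in dispatch: the SUFFIX_KIND table and check_suffix() are illustrative names, and treating .c3h5s as an aligned suffix is inferred from the examples above.

```python
from pathlib import Path

# Assumed mapping, based on the suffixes used in this document.
SUFFIX_KIND = {
    ".c3h5u": "unaligned",
    ".c3h5a": "aligned",
    ".c3h5s": "aligned",
}


def check_suffix(path: str, kind: str) -> str:
    """Return the suffix if it suits `kind` ("aligned" or "unaligned"), else raise."""
    suffix = Path(path).suffix
    expected = SUFFIX_KIND.get(suffix)
    if expected is None:
        raise ValueError(f"unrecognised suffix {suffix!r}")
    if expected != kind:
        raise ValueError(f"{suffix!r} is for {expected} data, not {kind} data")
    return suffix


check_suffix("alignment_output.c3h5u", "unaligned")  # passes
```

Calling check_suffix("alignment_output.c3h5u", "aligned") raises ValueError, mirroring the restriction in the note above.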
File details
Details for the file cogent3_h5seqs-0.8.0.tar.gz.
File metadata
- Download URL: cogent3_h5seqs-0.8.0.tar.gz
- Upload date:
- Size: 80.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c1e25279de3c9784c676a25c5f1ddf56704247ead673ae55c2da871d60946248 |
| MD5 | 7ae4700daf0e538effb6eda3287c12b3 |
| BLAKE2b-256 | 0a0699b06704d1651cfb50210671d9d0ae67c29f4c8980115371890110f09129 |
Provenance
The following attestation bundles were made for cogent3_h5seqs-0.8.0.tar.gz:
Publisher: release.yml on cogent3/cogent3-h5seqs
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cogent3_h5seqs-0.8.0.tar.gz
- Subject digest: c1e25279de3c9784c676a25c5f1ddf56704247ead673ae55c2da871d60946248
- Sigstore transparency entry: 858597131
- Sigstore integration time:
- Permalink: cogent3/cogent3-h5seqs@4d0dc206f927fe4598705537d28668c232d77283
- Branch / Tag: refs/tags/0.8.0
- Owner: https://github.com/cogent3
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@4d0dc206f927fe4598705537d28668c232d77283
- Trigger Event: workflow_dispatch
File details
Details for the file cogent3_h5seqs-0.8.0-py3-none-any.whl.
File metadata
- Download URL: cogent3_h5seqs-0.8.0-py3-none-any.whl
- Upload date:
- Size: 27.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1a169133fd3eee316dbfdb091be31c50bdb68e0a4393b0339a9d313f1757080b |
| MD5 | c37e31e4a92d8e694ecd23e1bcca460f |
| BLAKE2b-256 | e991900a94f3a6c041f9bf364a90598f18b70c741e8c0aa02e5bcadae6bb0a6a |
Provenance
The following attestation bundles were made for cogent3_h5seqs-0.8.0-py3-none-any.whl:
Publisher: release.yml on cogent3/cogent3-h5seqs
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cogent3_h5seqs-0.8.0-py3-none-any.whl
- Subject digest: 1a169133fd3eee316dbfdb091be31c50bdb68e0a4393b0339a9d313f1757080b
- Sigstore transparency entry: 858597213
- Sigstore integration time:
- Permalink: cogent3/cogent3-h5seqs@4d0dc206f927fe4598705537d28668c232d77283
- Branch / Tag: refs/tags/0.8.0
- Owner: https://github.com/cogent3
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@4d0dc206f927fe4598705537d28668c232d77283
- Trigger Event: workflow_dispatch