fasta-checksum-utils
Asynchronous library and command-line utility for checksumming FASTA files and individual contigs.
Implements two checksumming algorithms, MD5 and GA4GH, in order to fulfill the needs of the Refget v2 API specification.
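For orientation, the two checksum styles can be sketched with the Python standard library alone. This is not the library's own implementation: the function names `md5_checksum`, `ga4gh_checksum`, and `normalise` are hypothetical, and the sketch assumes the GA4GH scheme is the usual one (SHA-512 digest truncated to 24 bytes, base64url-encoded, prefixed with `SQ.`) and that sequences are upper-cased with whitespace removed before hashing.

```python
import base64
import hashlib


def normalise(sequence: str) -> bytes:
    """Strip whitespace and upper-case a sequence (assumed normalisation step)."""
    return "".join(sequence.split()).upper().encode("ascii")


def md5_checksum(data: bytes) -> str:
    """Hex-encoded MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def ga4gh_checksum(data: bytes) -> str:
    """'SQ.' followed by base64url of the first 24 bytes of the SHA-512 digest."""
    truncated = hashlib.sha512(data).digest()[:24]
    # 24 bytes encode to exactly 32 base64 characters, so no '=' padding appears.
    return "SQ." + base64.urlsafe_b64encode(truncated).decode("ascii")


if __name__ == "__main__":
    seq = normalise("acgt\nACGT\n")
    print(md5_checksum(seq), ga4gh_checksum(seq))
```

Both functions take raw bytes, so the same helpers could be applied either to a normalised contig sequence or to the bytes of a whole file.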
Installation
To install fasta-checksum-utils, run the following pip command:
pip install fasta-checksum-utils
CLI Usage
To generate a text report of checksums in the FASTA document, run the following command:
fasta-checksum-utils ./my-fasta.fa[.gz]
This will print output in the following tab-delimited format:
file [file size in bytes] md5 [file MD5 hash] ga4gh [file GA4GH hash]
chr1 [chr1 sequence length] md5 [chr1 sequence MD5 hash] ga4gh [chr1 sequence GA4GH hash]
chr2 [chr2 sequence length] md5 [chr2 sequence MD5 hash] ga4gh [chr2 sequence GA4GH hash]
...
The following example is the output generated by running the tool on the SARS-CoV-2 genome FASTA from NCBI:
file 30428 md5 825ab3c54b7a67ff2db55262eb532438 ga4gh SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd
NC_045512.2 29903 md5 105c82802b67521950854a851fc6eefd ga4gh SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D
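If the text report needs to be consumed by another script, a small parser along these lines may be enough. It is a sketch that assumes each row has exactly six tab-separated fields in the order shown above; `ReportRow` and `parse_report` are hypothetical names, not part of fasta-checksum-utils.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    name: str   # "file" for the whole-file row, otherwise the contig name
    size: int   # file size in bytes, or sequence length for contigs
    md5: str
    ga4gh: str


def parse_report(text: str) -> list[ReportRow]:
    """Parse the tab-delimited report into rows, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # Expected layout: name, size, "md5", hash, "ga4gh", hash
        if len(fields) != 6 or fields[2] != "md5" or fields[4] != "ga4gh":
            raise ValueError(f"unexpected report line: {line!r}")
        rows.append(ReportRow(fields[0], int(fields[1]), fields[3], fields[5]))
    return rows
```

The first row returned describes the file itself; the remaining rows describe individual contigs.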
If the --out-format bento-json arguments are passed, the tool will instead output the report in a JSON format designed to be compatible with the requirements of the Bento Reference Service. The following example is the output generated for the SARS-CoV-2 genome:
{
  "fasta": "sars_cov_2.fa",
  "fasta_size": 30428,
  "md5": "825ab3c54b7a67ff2db55262eb532438",
  "ga4gh": "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd",
  "contigs": [
    {
      "name": "NC_045512.2",
      "md5": "105c82802b67521950854a851fc6eefd",
      "ga4gh": "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D",
      "length": 29903
    }
  ]
}
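A report in this shape can be read back with the standard `json` module. The sketch below indexes contigs by name so that a checksum can be looked up directly; `index_contigs` is a hypothetical helper, and it relies only on the keys visible in the example above.

```python
import json


def index_contigs(report_json: str) -> dict[str, dict]:
    """Map each contig name to its entry in a bento-json report."""
    report = json.loads(report_json)
    return {contig["name"]: contig for contig in report["contigs"]}


def total_length(report_json: str) -> int:
    """Sum of all contig lengths listed in the report."""
    return sum(c["length"] for c in json.loads(report_json)["contigs"])
```

For instance, `index_contigs(text)["NC_045512.2"]["ga4gh"]` would give the GA4GH checksum of that contig, where `text` holds the JSON emitted by the tool.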
If an argument like --fai [path or URL] is passed, an additional "fai": "..." property will be added to the JSON object output.
If an argument like --genome-id GRCh38 is provided, an additional "id": "GRCh38" property will be added to the JSON object output.
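When the CLI is driven from a script, the optional flags can be assembled programmatically. The following is a sketch only: `build_command` is a hypothetical helper, it uses just the flags described above, and it assumes the options may follow the positional FASTA path.

```python
from __future__ import annotations

from pathlib import Path


def build_command(
    fasta: str | Path,
    *,
    fai: str | None = None,
    genome_id: str | None = None,
) -> list[str]:
    """Build an argument list requesting a bento-json report."""
    cmd = ["fasta-checksum-utils", str(fasta), "--out-format", "bento-json"]
    if fai is not None:
        cmd += ["--fai", fai]              # path or URL of the FASTA index
    if genome_id is not None:
        cmd += ["--genome-id", genome_id]  # becomes the "id" property
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`, after which the JSON report would be available on the captured standard output.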
Library Usage
Below are some examples of how fasta-checksum-utils can be used as an asynchronous Python library:
import asyncio
import fasta_checksum_utils as fc
import pysam
from pathlib import Path


async def demo():
    covid_genome: Path = Path("./sars_cov_2.fa")

    # calculate an MD5 checksum for a whole file
    file_checksum: str = await fc.algorithms.AlgorithmMD5.checksum_file(covid_genome)
    print(file_checksum)
    # prints "863ee5dba1da0ca3f87783782284d489"

    all_algorithms = (fc.algorithms.AlgorithmMD5, fc.algorithms.AlgorithmGA4GH)

    # calculate multiple checksums for a whole file
    all_checksums: tuple[str, ...] = await fc.checksum_file(file=covid_genome, algorithms=all_algorithms)
    print(all_checksums)
    # prints tuple: ("863ee5dba1da0ca3f87783782284d489", "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd")

    # calculate an MD5 and GA4GH checksum for a specific contig in a PySAM FASTA file:
    fh = pysam.FastaFile(str(covid_genome))
    try:
        contig_checksums: tuple[str, ...] = await fc.checksum_contig(
            fh=fh,
            contig_name="NC_045512.2",
            algorithms=all_algorithms,
        )
        print(contig_checksums)
        # prints tuple: ("105c82802b67521950854a851fc6eefd", "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D")
    finally:
        fh.close()  # always close the file handle


asyncio.run(demo())