fasta-checksum-utils

Asynchronous library and command-line utility for checksumming FASTA files and individual contigs. Implements two checksumming algorithms, MD5 and GA4GH, to meet the requirements of the Refget v2 API specification.
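
For orientation, the GA4GH digest used by Refget v2 is commonly described as sha512t24u: the SHA-512 of the data, truncated to its first 24 bytes and base64url-encoded, with an "SQ." prefix for sequences. The sketch below reproduces both digest types using only the Python standard library. The function names are illustrative and are not part of fasta-checksum-utils, and uppercasing the sequence before hashing is an assumption based on the Refget convention rather than something this README states.

```python
import base64
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the data as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def ga4gh_digest(data: bytes) -> str:
    """Return an sha512t24u digest with the 'SQ.' sequence prefix."""
    # Keep only the first 24 bytes of the SHA-512 digest.
    truncated = hashlib.sha512(data).digest()[:24]
    # 24 bytes encode to exactly 32 base64url characters, so no padding appears.
    return "SQ." + base64.urlsafe_b64encode(truncated).decode("ascii")


def sequence_digests(sequence: str) -> tuple[str, str]:
    """Digest a sequence string; uppercasing here is an assumption."""
    data = sequence.upper().encode("ascii")
    return md5_hex(data), ga4gh_digest(data)


md5_value, ga4gh_value = sequence_digests("acgt")
print(md5_value, ga4gh_value)
```

Because the GA4GH digest is a fixed 24-byte truncation, every identifier has the same length (35 characters including the prefix), which makes it convenient as a database key.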

Installation

To install fasta-checksum-utils, run the following pip command:

pip install fasta-checksum-utils

CLI Usage

To generate a text report of checksums for a FASTA file and each of its contigs, run the following command:

fasta-checksum-utils ./my-fasta.fa[.gz]

This will print output in the following tab-delimited format:

file  [file size in bytes]    md5 [file MD5 hash]           ga4gh  [file GA4GH hash]
chr1  [chr1 sequence length]  md5 [chr1 sequence MD5 hash]  ga4gh  [chr1 sequence GA4GH hash]
chr2  [chr2 sequence length]  md5 [chr2 sequence MD5 hash]  ga4gh  [chr2 sequence GA4GH hash]
...

The following example is the output generated for the SARS-CoV-2 genome FASTA from NCBI:

file	30428	md5	825ab3c54b7a67ff2db55262eb532438	ga4gh	SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd
NC_045512.2	29903	md5	105c82802b67521950854a851fc6eefd	ga4gh	SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D
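
If the text report needs to be consumed by another script, the tab-delimited layout described above can be parsed along these lines. This is a minimal sketch: ReportRow and parse_report are hypothetical helper names, not part of the package, and the parser assumes every row follows the name, size, then algorithm/hash pairs layout shown above.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    name: str                   # "file" for the whole file, else the contig name
    size: int                   # file size in bytes, or sequence length
    checksums: dict[str, str]   # algorithm label -> digest


def parse_report(text: str) -> list[ReportRow]:
    """Parse the tab-delimited text report into structured rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Strip each field so stray padding around tabs is tolerated.
        fields = [field.strip() for field in line.split("\t")]
        name, size, *pairs = fields
        # Remaining fields alternate between algorithm label and digest.
        checksums = dict(zip(pairs[0::2], pairs[1::2]))
        rows.append(ReportRow(name, int(size), checksums))
    return rows


report = (
    "file\t30428\tmd5\t825ab3c54b7a67ff2db55262eb532438"
    "\tga4gh\tSQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd\n"
    "NC_045512.2\t29903\tmd5\t105c82802b67521950854a851fc6eefd"
    "\tga4gh\tSQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D\n"
)
rows = parse_report(report)
print(rows[1].name, rows[1].size, rows[1].checksums["ga4gh"])
```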

If --out-format bento-json is passed, the tool instead outputs the report as JSON, in a format designed to be compatible with the Bento Reference Service. The following example is the output generated for the SARS-CoV-2 genome:

{
  "fasta": "sars_cov_2.fa",
  "fasta_size": 30428,
  "md5": "825ab3c54b7a67ff2db55262eb532438",
  "ga4gh": "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd",
  "contigs": [
    {
      "name": "NC_045512.2",
      "md5": "105c82802b67521950854a851fc6eefd",
      "ga4gh": "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D",
      "length": 29903
    }
  ]
}

If an argument like --fai [path or URL] is passed, an additional "fai": "..." property will be added to the JSON object output.

If an argument like --genome-id GRCh38 is provided, an additional "id": "GRCh38" property will be added to the JSON object output.
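
As a rough illustration of consuming this JSON report downstream, the sketch below loads the example object with the standard json module and builds a lookup from contig name to GA4GH digest. The variable names are illustrative. The "id" and "fai" keys are read with .get() because, per the description above, they only appear when the corresponding arguments are passed.

```python
import json

# The example report from above, embedded as a string for a self-contained demo.
report_json = """
{
  "fasta": "sars_cov_2.fa",
  "fasta_size": 30428,
  "md5": "825ab3c54b7a67ff2db55262eb532438",
  "ga4gh": "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd",
  "contigs": [
    {
      "name": "NC_045512.2",
      "md5": "105c82802b67521950854a851fc6eefd",
      "ga4gh": "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D",
      "length": 29903
    }
  ]
}
"""

report = json.loads(report_json)

# Map each contig name to its GA4GH digest for quick lookups.
contig_digests = {contig["name"]: contig["ga4gh"] for contig in report["contigs"]}
total_length = sum(contig["length"] for contig in report["contigs"])

# Optional properties: present only if --genome-id / --fai were given.
genome_id = report.get("id")
fai_location = report.get("fai")

print(contig_digests, total_length, genome_id, fai_location)
```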

Library Usage

Below are some examples of how fasta-checksum-utils can be used as an asynchronous Python library:

import asyncio
import fasta_checksum_utils as fc
import pysam
from pathlib import Path


async def demo():
    covid_genome: Path = Path("./sars_cov_2.fa")
    
    # calculate an MD5 checksum for a whole file
    file_checksum: str = await fc.algorithms.AlgorithmMD5.checksum_file(covid_genome)
    print(file_checksum)
    # prints "825ab3c54b7a67ff2db55262eb532438"

    all_algorithms = (fc.algorithms.AlgorithmMD5, fc.algorithms.AlgorithmGA4GH)

    # calculate multiple checksums for a whole file
    all_checksums: tuple[str, ...] = await fc.checksum_file(file=covid_genome, algorithms=all_algorithms)
    print(all_checksums)
    # prints tuple: ("825ab3c54b7a67ff2db55262eb532438", "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd")
    
    # calculate an MD5 and GA4GH checksum for a specific contig in a PySAM FASTA file:
    fh = pysam.FastaFile(str(covid_genome))
    try:
        contig_checksums: tuple[str, ...] = await fc.checksum_contig(
            fh=fh, 
            contig_name="NC_045512.2", 
            algorithms=all_algorithms,
        )
        print(contig_checksums)
        # prints tuple: ("105c82802b67521950854a851fc6eefd", "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D")
    finally:
        fh.close()  # always close the file handle


asyncio.run(demo())
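
One reason an asynchronous API is useful here is that several files can be checksummed concurrently. The sketch below shows that pattern with asyncio.gather. To stay self-contained it uses md5_file, a hypothetical stand-in coroutine built on hashlib and asyncio.to_thread (Python 3.9+), rather than the library itself. In real use, a call such as fc.checksum_file from the example above could presumably be awaited in its place.

```python
import asyncio
import hashlib
from pathlib import Path


def _md5_file_sync(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files are never fully in memory."""
    md5 = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


async def md5_file(path: Path) -> str:
    """Hypothetical stand-in coroutine: runs the blocking hash in a worker thread."""
    return await asyncio.to_thread(_md5_file_sync, path)


async def checksum_many(paths: list[Path]) -> dict[str, str]:
    """Checksum several files concurrently and map each path to its digest."""
    # gather() returns results in the same order as the awaitables passed in.
    digests = await asyncio.gather(*(md5_file(path) for path in paths))
    return {str(path): digest for path, digest in zip(paths, digests)}
```

A call such as asyncio.run(checksum_many([Path("a.fa"), Path("b.fa")])) would then return one digest per file, where the two file names are placeholders.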
