
Library and command-line utility for checksumming FASTA files and individual contigs.


fasta-checksum-utils

Asynchronous library and command-line utility for checksumming FASTA files and individual contigs. It implements two checksum algorithms, MD5 and GA4GH, to meet the requirements of the Refget v2 API specification.
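For orientation, the two digest styles can be sketched with the standard library alone. This is a minimal sketch, not the package's own code: it assumes the GA4GH checksum is the `sha512t24u` digest used by GA4GH specifications (SHA-512, truncated to 24 bytes, base64url-encoded) with the `SQ.` prefix seen in the example output further down. The function names `md5_hex` and `ga4gh_digest` are hypothetical, and any sequence normalization (for example uppercasing) is left to the caller.

```python
import base64
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


def ga4gh_digest(data: bytes) -> str:
    """Return an 'SQ.'-prefixed sha512t24u digest of the given bytes.

    Assumption: SHA-512, truncated to the first 24 bytes, then encoded
    with the URL-safe base64 alphabet (24 bytes encode to 32 characters,
    so no padding is produced).
    """
    truncated = hashlib.sha512(data).digest()[:24]
    return "SQ." + base64.urlsafe_b64encode(truncated).decode("ascii")


if __name__ == "__main__":
    sequence = b"ACGT"  # toy sequence, already uppercase
    print(md5_hex(sequence))
    print(ga4gh_digest(sequence))
```

Whether the package hashes file contents and contig sequences exactly this way is not stated here, so treat the sketch as an illustration of the digest format rather than a way to reproduce its output.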

Installation

To install fasta-checksum-utils, run the following pip command:

pip install fasta-checksum-utils

CLI Usage

To generate a text report of checksums for a FASTA file and each of its contigs, run the following command:

fasta-checksum-utils ./my-fasta.fa[.gz]

This will print output in the following tab-delimited format:

file  [file size in bytes]    md5 [file MD5 hash]           ga4gh  [file GA4GH hash]
chr1  [chr1 sequence length]  md5 [chr1 sequence MD5 hash]  ga4gh  [chr1 sequence GA4GH hash]
chr2  [chr2 sequence length]  md5 [chr2 sequence MD5 hash]  ga4gh  [chr2 sequence GA4GH hash]
...

The following example is the output generated by specifying the SARS-CoV-2 genome FASTA from NCBI:

file	30428	md5	825ab3c54b7a67ff2db55262eb532438	ga4gh	SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd
NC_045512.2	29903	md5	105c82802b67521950854a851fc6eefd	ga4gh	SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D
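If the text report needs to be consumed by another script, it can be parsed with a few lines of Python. The following is a minimal sketch, assuming each row has the layout shown above (name, size or length, then alternating algorithm labels and hashes separated by tabs). `ReportRow` and `parse_report` are hypothetical helper names, not part of fasta-checksum-utils.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    name: str                  # "file" for the whole file, otherwise a contig name
    size: int                  # file size in bytes, or sequence length for a contig
    checksums: dict[str, str]  # algorithm label -> hash, e.g. {"md5": "..."}


def parse_report(text: str) -> list[ReportRow]:
    """Parse the tab-delimited text report into a list of rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Strip each field so stray padding around a tab does not matter.
        name, size, *pairs = [field.strip() for field in line.split("\t")]
        # Remaining fields alternate: label, hash, label, hash, ...
        checksums = dict(zip(pairs[0::2], pairs[1::2]))
        rows.append(ReportRow(name, int(size), checksums))
    return rows


example = (
    "file\t30428\tmd5\t825ab3c54b7a67ff2db55262eb532438"
    "\tga4gh\tSQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd\n"
    "NC_045512.2\t29903\tmd5\t105c82802b67521950854a851fc6eefd"
    "\tga4gh\tSQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D\n"
)
rows = parse_report(example)
for row in rows:
    print(row.name, row.size, row.checksums["md5"])
```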

If --out-format bento-json is passed, the tool instead outputs the report as JSON, in a format designed to be compatible with the requirements of the Bento Reference Service. The following example is the output generated for the SARS-CoV-2 genome:

{
  "fasta": "sars_cov_2.fa",
  "fasta_size": 30428,
  "md5": "825ab3c54b7a67ff2db55262eb532438",
  "ga4gh": "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd",
  "contigs": [
    {
      "name": "NC_045512.2",
      "md5": "105c82802b67521950854a851fc6eefd",
      "ga4gh": "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D",
      "length": 29903
    }
  ]
}

If an argument like --fai [path or URL] is passed, an additional "fai": "..." property will be added to the JSON object output.

If an argument like --genome-id GRCh38 is provided, an additional "id": "GRCh38" property will be added to the JSON object output.
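The JSON report can be loaded with Python's standard `json` module. The sketch below assumes the structure shown in the example above and treats "fai" and "id" as optional keys. The variable names and the `index_contigs` helper are illustrative only, not part of the package.

```python
import json

# The example bento-json report from above, as it might be captured from the CLI.
report_text = """
{
  "fasta": "sars_cov_2.fa",
  "fasta_size": 30428,
  "md5": "825ab3c54b7a67ff2db55262eb532438",
  "ga4gh": "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd",
  "contigs": [
    {
      "name": "NC_045512.2",
      "md5": "105c82802b67521950854a851fc6eefd",
      "ga4gh": "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D",
      "length": 29903
    }
  ]
}
"""


def index_contigs(report: dict) -> dict[str, dict]:
    """Map each contig name to its record for quick lookup."""
    return {contig["name"]: contig for contig in report["contigs"]}


report = json.loads(report_text)
contigs = index_contigs(report)

# "fai" and "id" are only present when --fai / --genome-id were passed.
genome_id = report.get("id")
fai_location = report.get("fai")

print(contigs["NC_045512.2"]["ga4gh"], genome_id, fai_location)
```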

Library Usage

Below are some examples of how fasta-checksum-utils can be used as an asynchronous Python library:

import asyncio
import fasta_checksum_utils as fc
import pysam
from pathlib import Path


async def demo():
    covid_genome: Path = Path("./sars_cov_2.fa")
    
    # calculate an MD5 checksum for a whole file
    file_checksum: str = await fc.algorithms.AlgorithmMD5.checksum_file(covid_genome)
    print(file_checksum)
    # prints "863ee5dba1da0ca3f87783782284d489"
    
    all_algorithms = (fc.algorithms.AlgorithmMD5, fc.algorithms.AlgorithmGA4GH)
    
    # calculate multiple checksums for a whole file
    all_checksums: tuple[str, ...] = await fc.checksum_file(file=covid_genome, algorithms=all_algorithms)
    print(all_checksums)
    # prints tuple: ("863ee5dba1da0ca3f87783782284d489", "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd")
    
    # calculate an MD5 and GA4GH checksum for a specific contig in a PySAM FASTA file:
    fh = pysam.FastaFile(str(covid_genome))
    try:
        contig_checksums: tuple[str, ...] = await fc.checksum_contig(
            fh=fh, 
            contig_name="NC_045512.2", 
            algorithms=all_algorithms,
        )
        print(contig_checksums)
        # prints tuple: ("105c82802b67521950854a851fc6eefd", "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D")
    finally:
        fh.close()  # always close the file handle


asyncio.run(demo())

Download files

Source Distribution

fasta_checksum_utils-0.4.3.tar.gz (8.5 kB)

Uploaded Source

Built Distribution

fasta_checksum_utils-0.4.3-py3-none-any.whl (11.0 kB)

Uploaded Python 3
