Library and command-line utility for checksumming FASTA files and individual contigs.

fasta-checksum-utils

Asynchronous library and command-line utility for checksumming FASTA files and individual contigs. Implements two checksumming algorithms, MD5 and GA4GH, to meet the requirements of the Refget v2 API specification.
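For readers unfamiliar with the two digests, the following is a minimal standard-library sketch of how they are commonly computed for a single sequence. It is not taken from this package's source. The function names are hypothetical, and the uppercase normalization and the `SQ.` prefix follow the general GA4GH/Refget conventions (SHA-512 truncated to 24 bytes, then base64url-encoded); whether fasta-checksum-utils normalizes input in exactly this way is an assumption.

```python
import base64
import hashlib


def md5_hex(sequence: str) -> str:
    """MD5 hex digest of a sequence (hypothetical helper, not the package API)."""
    # Assumption: the sequence is uppercased before hashing.
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def ga4gh_sequence_digest(sequence: str) -> str:
    """GA4GH-style digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    # Assumption: the sequence is uppercased before hashing.
    raw = hashlib.sha512(sequence.upper().encode("ascii")).digest()
    # 24 bytes encode to exactly 32 base64url characters, so no padding appears.
    truncated = base64.urlsafe_b64encode(raw[:24]).decode("ascii")
    return "SQ." + truncated


if __name__ == "__main__":
    print(md5_hex("ACGT"))
    print(ga4gh_sequence_digest("ACGT"))
```

The GA4GH form is a 32-character URL-safe string behind the `SQ.` prefix, which matches the shape of the `ga4gh` values shown in the reports below.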

Installation

To install fasta-checksum-utils, run the following pip command:

pip install fasta-checksum-utils

CLI Usage

To generate a text report of checksums for a FASTA file and each contig it contains, run the following command:

fasta-checksum-utils ./my-fasta.fa[.gz]

This will print output in the following tab-delimited format:

file  [file size in bytes]    md5  [file MD5 hash]           ga4gh  [file GA4GH hash]
chr1  [chr1 sequence length]  md5  [chr1 sequence MD5 hash]  ga4gh  [chr1 sequence GA4GH hash]
chr2  [chr2 sequence length]  md5  [chr2 sequence MD5 hash]  ga4gh  [chr2 sequence GA4GH hash]
...

The following example is the output generated for the SARS-CoV-2 genome FASTA from NCBI:

file	    30428	md5	825ab3c54b7a67ff2db55262eb532438	ga4gh	SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd
NC_045512.2	29903	md5	105c82802b67521950854a851fc6eefd	ga4gh	SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D
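The tab-delimited report can be consumed by a script. The sketch below is one possible parser, assuming each line holds a name, a size or length, and then alternating algorithm-label and digest fields, as in the example above. The function name `parse_report` and the dictionary keys are hypothetical and not part of the package.

```python
def parse_report(text: str) -> list[dict]:
    """Parse the tab-delimited checksum report into one dict per line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Fields may carry alignment spaces, so strip each one.
        fields = [field.strip() for field in line.split("\t")]
        name, size = fields[0], int(fields[1])
        # Remaining fields alternate: algorithm label, digest, label, digest, ...
        checksums = dict(zip(fields[2::2], fields[3::2]))
        # "size" is the file size in bytes for the "file" row,
        # and the sequence length for contig rows.
        rows.append({"name": name, "size": size, **checksums})
    return rows


if __name__ == "__main__":
    sample = (
        "file\t    30428\tmd5\t825ab3c54b7a67ff2db55262eb532438"
        "\tga4gh\tSQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd\n"
        "NC_045512.2\t29903\tmd5\t105c82802b67521950854a851fc6eefd"
        "\tga4gh\tSQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D\n"
    )
    for row in parse_report(sample):
        print(row)
```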

If the arguments --out-format bento-json are passed, the tool instead outputs the report as JSON, in a format designed to be compatible with the requirements of the Bento Reference Service. The following example is the output generated for the SARS-CoV-2 genome:

{
  "fasta": "sars_cov_2.fa",
  "fasta_size": 30428,
  "md5": "825ab3c54b7a67ff2db55262eb532438",
  "ga4gh": "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd",
  "contigs": [
    {
      "name": "NC_045512.2",
      "md5": "105c82802b67521950854a851fc6eefd",
      "ga4gh": "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D",
      "length": 29903
    }
  ]
}

If an argument like --fai [path or URL] is passed, an additional "fai": "..." property will be added to the JSON object output.

If an argument like --genome-id GRCh38 is provided, an additional "id": "GRCh38" property will be added to the JSON object output.
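The JSON report can be produced and read from a Python script. The sketch below invokes the CLI with the arguments described above and indexes the contigs by name. The helper names `run_bento_json` and `contig_index` are hypothetical, and the sketch assumes the tool writes only the JSON document to standard output.

```python
import json
import subprocess


def run_bento_json(fasta_path: str) -> dict:
    """Run the CLI with --out-format bento-json and parse its output."""
    # Assumption: stdout contains the JSON report and nothing else.
    result = subprocess.run(
        ["fasta-checksum-utils", fasta_path, "--out-format", "bento-json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def contig_index(report: dict) -> dict[str, dict]:
    """Map each contig name to its entry in the report's "contigs" list."""
    return {contig["name"]: contig for contig in report["contigs"]}
```

With a parsed report, `contig_index(report)["NC_045512.2"]["length"]` would return the contig length from the example above.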

Library Usage

Below are some examples of how fasta-checksum-utils can be used as an asynchronous Python library:

import asyncio
import fasta_checksum_utils as fc
import pysam
from pathlib import Path


async def demo():
    covid_genome: Path = Path("./sars_cov_2.fa")
    
    # calculate an MD5 checksum for a whole file
    file_checksum: str = await fc.algorithms.AlgorithmMD5.checksum_file(covid_genome)
    print(file_checksum)
    # prints "863ee5dba1da0ca3f87783782284d489"
    
    all_algorithms = (fc.algorithms.AlgorithmMD5, fc.algorithms.AlgorithmGA4GH)
    
    # calculate multiple checksums for a whole file
    all_checksums: tuple[str, ...] = await fc.checksum_file(file=covid_genome, algorithms=all_algorithms)
    print(all_checksums)
    # prints tuple: ("863ee5dba1da0ca3f87783782284d489", "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd")
    
    # calculate an MD5 and GA4GH checksum for a specific contig in a PySAM FASTA file:
    fh = pysam.FastaFile(str(covid_genome))
    try:
        contig_checksums: tuple[str, ...] = await fc.checksum_contig(
            fh=fh, 
            contig_name="NC_045512.2", 
            algorithms=all_algorithms,
        )
        print(contig_checksums)
        # prints tuple: ("105c82802b67521950854a851fc6eefd", "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D")
    finally:
        fh.close()  # always close the file handle


asyncio.run(demo())
