Skip to main content

Robust individual and aggregate checksums for nucleotide sequences

Project description

Tests PyPI version

Seqsum

Robust individual and aggregate checksums for nucleotide sequences. Accepts data from either standard input or fast[a|q][.gz|.zst|.xz|.bz2] files. Generates individual checksums for each sequence, and an aggregate checksum-of-checksums for a collection of sequences. Warnings are shown for duplicate sequences and within-collection checksum collisions at the chosen bit depth. Sequences are uppercased prior to hashing with xxHash (xxh3_128) and may optionally be normalised (with -n) to use only the characters ACGTN-. Read IDs and base quality scores do not inform the checksum. Outputs TSV or JSON to stdout.

A typical use case is for determining whether reordered, renamed or otherwise bit-inexact fasta/fastq files have equivalent sequence composition. Another use is generating the shortest possible collision-free identifiers for sequence collections.

Install (Python 3.10+)

# Latest release
pip install seqsum

# Latest commit
git clone https://github.com/bede/seqsum
pip install ./seqsum

# Development
git clone https://github.com/bede/seqsum.git
cd seqsum
pip install --editable '.[dev]'
pytest

CLI usage

# Fasta with one record
$ seqsum nt MN908947.fasta
MN908947.3	ca5e95436b957f93

# Fasta with two records
$ seqsum nt MN908947-BA_2_86_1.fasta
MN908947.3	ca5e95436b957f93
BA.2.86.1		d5f014ee6745cb77
aggregate	837cfd6836b9a406

# Fastq (gzipped) with 1m records, redirected to file, with progress bar
$ seqsum nt illumina.r12.fastq.gz --progress > checksums.tsv
Processed 1000000 records (317kit/s)
INFO: Found duplicate sequences

# Fasta via stdin
% cat MN908947.fasta | seqsum nt -
MN908947.3	ca5e95436b957f93
$ seqsum nt -h                       
usage: seqsum nt [-h] [-n] [-s] [-b BITS] [-i] [-a] [-j] [-p] input

Robust individual and aggregate checksums for nucleotide sequences. Accepts input
from either stdin or fast[a|q][.gz|.zst|.xz|.bz2] files. Generates individual
checksums for each sequence, and an aggregate checksum-of-checksums for a
collection of sequences. Warnings are shown for duplicate sequences and
within-collection checksum collisions at the chosen bit depth. Sequences are
uppercased prior to hashing with xxHash and may optionally be normalised to use only
the characters ACGTN-. Read IDs and base quality scores do not inform the checksum

positional arguments:
  input                 path to fasta/q file (or - for stdin)

options:
  -h, --help            show this help message and exit
  -n, --normalise       replace U with T and characters other than ACGT- with N
                        (default: False)
  -s, --strict          raise error for characters other than ABCDGHKMNRSTVWY-
                        (default: False)
  -b BITS, --bits BITS  displayed checksum length
                        (default: 64)
  -i, --individual      output only individual checksums
                        (default: False)
  -a, --aggregate       output only aggregate checksum
                        (default: False)
  -j, --json            output JSON
                        (default: False)
  -p, --progress        show progress and speed
                        (default: False)

Python usage

from seqsum import lib

fasta_contents = ">read1\nACGT"
checksums, aggregate_checksum = lib.sum_nt(fasta_contents)
print(checksums)
from pathlib import Path
from seqsum import lib

checksums, aggregate_checksum = lib.sum_nt(Path("tests/data/MN908947-BA_2_86_1.fasta"))
print(checksums, aggregate_checksum)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

seqsum-0.2.0.tar.gz (24.9 kB view details)

Uploaded Source

Built Distribution

seqsum-0.2.0-py3-none-any.whl (6.3 kB view details)

Uploaded Python 3

File details

Details for the file seqsum-0.2.0.tar.gz.

File metadata

  • Download URL: seqsum-0.2.0.tar.gz
  • Upload date:
  • Size: 24.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-requests/2.31.0

File hashes

Hashes for seqsum-0.2.0.tar.gz
Algorithm Hash digest
SHA256 398285951979ca9b201bc7382302be623f94dd7fd4f37d24c0f2f21d1ea02a2f
MD5 f70f7a359c1bbe73c9e2f91b409f9e59
BLAKE2b-256 4cd46b94872aca7ddcf8ef45075fe4089e967356971eb22cf8840b8cad90b178

See more details on using hashes here.

File details

Details for the file seqsum-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: seqsum-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 6.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-requests/2.31.0

File hashes

Hashes for seqsum-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 45a4637e646487dc5ee16101fc8e2d6c14627c7d74490965511b38cf99cc15f9
MD5 4a52e6736d91d0af17e6128f1798e33b
BLAKE2b-256 2d3b46bda1bf2eac3f48fb7d51844ec337d5a474a9883d0c6e8efe3121edece7

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page