Robust individual and aggregate checksums for nucleotide sequences
Project description
Seqsum
Robust individual and aggregate checksums for nucleotide sequences. Accepts data from either standard input or fast[a|q][.gz|.zst|.xz|.bz2]
files. Generates individual checksums for each sequence, and an aggregate checksum-of-checksums for a collection of sequences. Warnings are shown for duplicate sequences and within-collection checksum collisions at the chosen bit depth. Sequences are uppercased prior to hashing with xxHash (xxh3_128
) and may optionally be normalised (with -n
) to use only the characters ACGTN-
. Read IDs and base quality scores do not inform the checksum. Outputs TSV or JSON to stdout.
A typical use case is for determining whether reordered, renamed or otherwise bit-inexact fasta/fastq files have equivalent sequence composition. Another use is generating the shortest possible collision-free identifiers for sequence collections.
Install (Python 3.10+)
# Latest release
pip install seqsum
# Latest commit
git clone https://github.com/bede/seqsum
pip install ./seqsum
# Development
git clone https://github.com/bede/seqsum.git
cd seqsum
pip install --editable '.[dev]'
pytest
CLI usage
# Fasta with one record
$ seqsum nt MN908947.fasta
MN908947.3 ca5e95436b957f93
# Fasta with two records
$ seqsum nt MN908947-BA_2_86_1.fasta
MN908947.3 ca5e95436b957f93
BA.2.86.1 d5f014ee6745cb77
aggregate 837cfd6836b9a406
# Fastq (gzipped) with 1m records, redirected to file, with progress bar
$ seqsum nt illumina.r12.fastq.gz --progress > checksums.tsv
Processed 1000000 records (317kit/s)
INFO: Found duplicate sequences
# Fasta via stdin
% cat MN908947.fasta | seqsum nt -
MN908947.3 ca5e95436b957f93
$ seqsum nt -h
usage: seqsum nt [-h] [-n] [-s] [-b BITS] [-i] [-a] [-j] [-p] input
Robust individual and aggregate checksums for nucleotide sequences. Accepts input
from either stdin or fast[a|q][.gz|.zst|.xz|.bz2] files. Generates individual
checksums for each sequence, and an aggregate checksum-of-checksums for a
collection of sequences. Warnings are shown for duplicate sequences and
within-collection checksum collisions at the chosen bit depth. Sequences are
uppercased prior to hashing with xxHash and may optionally be normalised to use only
the characters ACGTN-. Read IDs and base quality scores do not inform the checksum
positional arguments:
input path to fasta/q file (or - for stdin)
options:
-h, --help show this help message and exit
-n, --normalise replace U with T and characters other than ACGT- with N
(default: False)
-s, --strict raise error for characters other than ABCDGHKMNRSTVWY-
(default: False)
-b BITS, --bits BITS displayed checksum length
(default: 64)
-i, --individual output only individual checksums
(default: False)
-a, --aggregate output only aggregate checksum
(default: False)
-j, --json output JSON
(default: False)
-p, --progress show progress and speed
(default: False)
Python usage
from seqsum import lib
fasta_contents = ">read1\nACGT"
checksums, aggregate_checksum = lib.sum_nt(fasta_contents)
print(checksums)
from pathlib import Path
from seqsum import lib
checksums, aggregate_checksum = lib.sum_nt(Path("tests/data/MN908947-BA_2_86_1.fasta"))
print(checksums, aggregate_checksum)
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file seqsum-0.2.0.tar.gz
.
File metadata
- Download URL: seqsum-0.2.0.tar.gz
- Upload date:
- Size: 24.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: python-requests/2.31.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 398285951979ca9b201bc7382302be623f94dd7fd4f37d24c0f2f21d1ea02a2f |
|
MD5 | f70f7a359c1bbe73c9e2f91b409f9e59 |
|
BLAKE2b-256 | 4cd46b94872aca7ddcf8ef45075fe4089e967356971eb22cf8840b8cad90b178 |
File details
Details for the file seqsum-0.2.0-py3-none-any.whl
.
File metadata
- Download URL: seqsum-0.2.0-py3-none-any.whl
- Upload date:
- Size: 6.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: python-requests/2.31.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 45a4637e646487dc5ee16101fc8e2d6c14627c7d74490965511b38cf99cc15f9 |
|
MD5 | 4a52e6736d91d0af17e6128f1798e33b |
|
BLAKE2b-256 | 2d3b46bda1bf2eac3f48fb7d51844ec337d5a474a9883d0c6e8efe3121edece7 |