A zero-dependency, ultra-fast Average Nucleotide Identity (ANI) estimator using Zstandard compression distance.

zani 🧬🗜️🤪

pronounced zany (/ˈzeɪni/)

Average Nucleotide Identity (ANI) estimator using Zstandard compression distance.

About

zani computes pairwise genomic distances using the Normalized Compression Distance (NCD) metric. Inspired by LZ-ANI, zani uses the fast Zstandard (zstd) compression algorithm to estimate Average Nucleotide Identity (ANI) without sequence alignment or k-mer counting.

The Algorithm

At its core, zani treats reference genomes as compression dictionaries. For a given reference genome $x$ and a query genome $y$:

  1. Dictionary Training: A Zstd dictionary is trained on the reference genome $x$.
  2. Baseline Compression: We compute $C(x)$, the size of the reference genome compressed with its own dictionary.
  3. Conditional Compression: The query genome $y$ is compressed using the dictionary trained on $x$. This yields $C(y|x)$, representing the amount of novel information in $y$ not found in $x$.
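The three steps can be sketched with the standard library alone. This is only an illustration, not zani's implementation: it swaps in zlib's preset-dictionary feature (`zdict`) for a trained Zstd dictionary, uses the raw reference bytes as the "dictionary", and works on short random toy sequences. The helper name `compressed_size` is hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes, dictionary: bytes = b"") -> int:
    """Deflate-compressed size of `data`, optionally primed with a preset dictionary."""
    if dictionary:
        comp = zlib.compressobj(level=9, zdict=dictionary)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


random.seed(0)
# Toy "genomes": kept short because zlib only looks back 32 KiB into its dictionary.
reference = "".join(random.choice("ACGT") for _ in range(2000)).encode()

# A close relative: the reference with one substitution every 200 bases.
mutated = bytearray(reference)
for pos in range(0, len(mutated), 200):
    mutated[pos] = ord("C") if mutated[pos] == ord("A") else ord("A")
query_close = bytes(mutated)

# An unrelated sequence of the same length.
query_far = "".join(random.choice("ACGT") for _ in range(2000)).encode()

dictionary = reference                               # 1. stand-in for dictionary training
c_x = compressed_size(reference, dictionary)         # 2. baseline C(x)
c_close = compressed_size(query_close, dictionary)   # 3. C(y|x) for a similar genome
c_far = compressed_size(query_far, dictionary)       # 3. C(y|x) for an unrelated genome

print(c_x, c_close, c_far)
```

The similar query should compress to far fewer bytes than the unrelated one, because most of its content is already present in the dictionary.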

The Math

zani calculates distance using the standard Normalized Compression Distance (NCD) formula:

$$ NCD(x,y) = \frac{C(x,y) - \min(C(x), C(y))}{\max(C(x), C(y))} $$

To achieve maximum execution speed, zani approximates the joint compression size $C(x,y)$ as:

$$ C(x,y) \approx C(x) + C(y|x) $$

Furthermore, to avoid compressing the query genome a second time just to obtain its baseline $C(y)$, zani estimates $C(y)$ from $C(x)$ and the ratio of the uncompressed lengths $|y|$ and $|x|$, which assumes that both genomes compress at a similar ratio:

$$ C(y) \approx C(x) \times \frac{|y|}{|x|} $$
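As a worked example, the two approximations can be substituted into the NCD formula directly. The function below is a minimal sketch of that arithmetic, not zani's code; the name `approx_ncd` and the input numbers are hypothetical.

```python
def approx_ncd(c_x: float, c_y_given_x: float, len_x: int, len_y: int) -> float:
    """NCD(x, y) using the two approximations described above."""
    c_y = c_x * len_y / len_x      # C(y)   ~ C(x) * |y| / |x|
    c_xy = c_x + c_y_given_x       # C(x,y) ~ C(x) + C(y|x)
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)


# Equal-length genomes: C(y) ~ C(x), so the distance reduces to C(y|x) / C(x).
same_length = approx_ncd(c_x=1000, c_y_given_x=50, len_x=5_000_000, len_y=5_000_000)

# Query twice as long as the reference: C(y) ~ 2 * C(x).
longer_query = approx_ncd(c_x=1000, c_y_given_x=400, len_x=5_000_000, len_y=10_000_000)

print(same_length, longer_query)
```

With these inputs the first call evaluates to $50 / 1000 = 0.05$ and the second to $(1400 - 1000) / 2000 = 0.2$.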

This approach, combined with zero-copy memoryviews and thread-local C contexts, lets zani stream thousands of genomes through concurrent worker threads and keep all available CPU cores busy.
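The general pattern of per-thread compression state can be sketched as follows. This is an assumption-laden illustration rather than zani's internals: it again uses zlib as a stand-in for Zstd, and the names `_tls` and `conditional_size` are hypothetical. Each worker thread builds its own dictionary-primed compressor once and clones it for every query, so no compressor object is shared between threads.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

_tls = threading.local()  # every worker thread sees its own attributes on this object


def conditional_size(reference: bytes, query: bytes) -> int:
    """Compress `query` against `reference` using a compressor owned by the current thread."""
    # Prime a compressor with the reference once per thread (and per reference).
    if getattr(_tls, "reference", None) is not reference:
        _tls.reference = reference
        _tls.primed = zlib.compressobj(zdict=reference)
    comp = _tls.primed.copy()  # clone the primed state; the original is never consumed
    # A memoryview hands the buffer to zlib without a Python-level copy.
    return len(comp.compress(memoryview(query))) + len(comp.flush())


reference = b"ACGTTGCA" * 500
queries = [b"ACGTTGCA" * 400, b"TTTTGGGGCCCCAAAA" * 200, b"ACGTAGCA" * 450]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order, so sizes[i] belongs to queries[i].
    sizes = list(pool.map(lambda q: conditional_size(reference, q), queries))

print(sizes)
```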

Installation

zani can be installed with pip:

pip install zani

CLI Usage 💻

zani has a basic CLI:

 uv run zani -h
usage: zani <genomes ...> [options]

🧬🗜️🤪 Average Nucleotide Identity (ANI) estimator using Zstandard compression distance.

📁:
  Input arguments

  <genomes ...>       Paths to genomes in fasta format; Files may be compressed.
  -a, --allvsall      Run all-vs-all comparison

🛠️:
  Other options

  -t, --max-workers   Maximum number of threads to use for parallelization
  -v, --version       Show version number and exit
  -h, --help          Show this help message and exit

API Usage 💻

from pathlib import Path
from zani import ZaniEngine

genomes = Path('genomes').glob('*.fasta.gz')

with ZaniEngine() as engine:
    for result in engine.query(genomes):
        print(result)

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker to report it or ask a question. If you are filing a bug, please include as much information as you can about the issue, and try to recreate it in a simple, easily reproducible situation.
