A zero-dependency, ultra-fast Average Nucleotide Identity (ANI) estimator using Zstandard compression distance.

zani 🧬🗜️🤪

pronounced zany (/ˈzeɪni/)

Average Nucleotide Identity (ANI) estimator using Zstandard compression distance.

About

zani computes pairwise genomic distances using the Normalized Compression Distance (NCD) metric. Inspired by the pioneering work of LZ-ANI, zani leverages the blazing-fast Zstandard (zstd) compression algorithm to estimate Average Nucleotide Identity (ANI) without the need for expensive sequence alignments or k-mer counting.

The Algorithm

At its core, zani treats reference genomes as compression dictionaries. For a given reference genome $x$ and a query genome $y$:

  1. Dictionary Training: A Zstd dictionary is trained on the reference genome $x$.
  2. Baseline Compression: We compute $C(x)$, the size of the reference genome compressed with its own dictionary.
  3. Conditional Compression: The query genome $y$ is compressed using the dictionary trained on $x$. This yields $C(y|x)$, representing the amount of novel information in $y$ not found in $x$.
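The three steps above can be illustrated with a small sketch. It is not zani's implementation: it swaps Zstandard for the standard library's `zlib`, and instead of training a dictionary it passes the raw reference bytes as a DEFLATE preset dictionary (`zdict`). The helper names (`compressed_size`, `random_genome`, `mutate`) and the toy sequences are assumptions made for illustration only.

```python
import random
import zlib


def compressed_size(data: bytes, dictionary: bytes = b"") -> int:
    """Return the DEFLATE-compressed size of `data`, optionally primed with a preset dictionary."""
    if dictionary:
        comp = zlib.compressobj(level=9, zdict=dictionary)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data)) + len(comp.flush())


def random_genome(length: int, rng: random.Random) -> bytes:
    """Toy 'genome': a uniformly random string over A, C, G, T."""
    return bytes(rng.choice(b"ACGT") for _ in range(length))


def mutate(genome: bytes, n_subs: int, rng: random.Random) -> bytes:
    """Copy `genome` and substitute `n_subs` randomly chosen positions."""
    out = bytearray(genome)
    for pos in rng.sample(range(len(out)), n_subs):
        out[pos] = rng.choice([b for b in b"ACGT" if b != out[pos]])
    return bytes(out)


rng = random.Random(0)
x = random_genome(8_000, rng)      # reference, kept well under zlib's 32 KiB window
y_close = mutate(x, 80, rng)       # about 1% substitutions relative to x
y_far = random_genome(8_000, rng)  # unrelated sequence

# Step 1 stand-in: the raw bytes of x act as the dictionary (no training in zlib).
c_x = compressed_size(x, dictionary=x)            # step 2: C(x)
c_close = compressed_size(y_close, dictionary=x)  # step 3: C(y|x) for a similar query
c_far = compressed_size(y_far, dictionary=x)      # step 3: C(y|x) for an unrelated query

print(c_x, c_close, c_far)
```

The closer the query is to the reference, the more of it can be encoded as back-references into the dictionary, so the conditional size shrinks. A real Zstd dictionary is trained and is not limited to a 32 KiB window, so absolute sizes from this sketch say nothing about zani's own numbers.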

The Math

zani calculates distance using the standard Normalized Compression Distance (NCD) formula:

$$ NCD(x,y) = \frac{C(x,y) - \min(C(x), C(y))}{\max(C(x), C(y))} $$

To achieve maximum execution speed, zani approximates the joint compression size $C(x,y)$ as:

$$ C(x,y) \approx C(x) + C(y|x) $$

Furthermore, to avoid the cost of compressing the query genome a second time just to obtain its baseline $C(y)$, zani estimates $C(y)$ by scaling $C(x)$ by the ratio of the two genomes' uncompressed lengths ($|y|$ and $|x|$). This assumes that both genomes compress at a similar rate:

$$ C(y) \approx C(x) \times \frac{|y|}{|x|} $$
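As a worked example, the three formulas above can be combined into one small function. This is a sketch of the arithmetic only: the function name `approx_ncd` and the input sizes are made-up illustrations, not zani's API or measured values.

```python
def approx_ncd(c_x: float, c_y_given_x: float, len_x: int, len_y: int) -> float:
    """Approximate NCD(x, y) from C(x), C(y|x) and the uncompressed lengths."""
    c_y = c_x * len_y / len_x   # C(y) ~ C(x) * |y| / |x|
    c_xy = c_x + c_y_given_x    # C(x,y) ~ C(x) + C(y|x)
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)


# Hypothetical sizes in bytes, chosen only to make the arithmetic easy to follow.
same_length = approx_ncd(1_000, 150, 5_000_000, 5_000_000)
longer_query = approx_ncd(1_000, 150, 5_000_000, 6_000_000)

print(same_length)   # 0.15
print(longer_query)  # 0.125
```

When $|y| \ge |x|$, the estimate of $C(y)$ is at least $C(x)$, so the expression simplifies to $C(y|x) / C(y)$. For equal lengths this is simply $C(y|x) / C(x)$.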

Because each comparison needs only one compression pass over the query, and because zani uses zero-copy memoryviews and thread-local C contexts, it can stream thousands of genomes through concurrent worker threads and keep all available CPU cores busy.
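The threading pattern described here can be sketched with the standard library alone. This is an assumption-laden illustration, not zani's code: `zlib` again stands in for Zstandard, and `conditional_size`, `compare_all` and `_base_compressor` are hypothetical names. Each worker thread keeps its own compressor primed with the reference, and queries are passed around as `memoryview` objects so the threads do not copy the input buffers.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # per-thread storage, one primed compressor per worker


def _base_compressor(reference: bytes):
    """Return this thread's compressor primed with `reference`, creating it on first use."""
    if getattr(_local, "reference", None) is not reference:
        _local.reference = reference
        _local.base = zlib.compressobj(level=6, zdict=reference)
    return _local.base


def conditional_size(reference: bytes, query: memoryview) -> int:
    """Compressed size of `query` given `reference`, using a copy of the thread's compressor."""
    comp = _base_compressor(reference).copy()  # the copy keeps the preset dictionary
    return len(comp.compress(query)) + len(comp.flush())


def compare_all(reference: bytes, queries, max_workers: int = 4):
    """Compute the conditional size of every query against one reference, in parallel."""
    views = [memoryview(q) for q in queries]  # views share memory with the inputs
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda v: conditional_size(reference, v), views))


reference = b"ACGTTGCA" * 500
queries = [reference, b"TTGACCAG" * 500, reference[:2_000]]
print(compare_all(reference, queries, max_workers=2))
```

Keeping the compressor in thread-local storage means no locking is needed around it, and results come back in the same order as the input queries.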

Installation

zani can be installed with pip:

pip install zani

CLI Usage 💻

zani has a very basic CLI. Use it like so:

 uv run zani -h
usage: zani <genomes ...> [options]

🧬🗜️🤪 Average Nucleotide Identity (ANI) estimator using Zstandard compression distance.

📁:
  Input arguments

  <genomes ...>       Paths to genomes in fasta format; Files may be compressed.
  -a, --allvsall      Run all-vs-all comparison

🛠️:
  Other options

  -t, --max-workers   Maximum number of threads to use for parallelization
  -v, --version       Show version number and exit
  -h, --help          Show this help message and exit

API Usage 💻

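from pathlib import Path
from zani import ZaniEngine

# Collect the gzip-compressed FASTA files in the local 'genomes' directory
genomes = Path('genomes').glob('*.fasta.gz')

# Use the engine as a context manager and print each result it yields
with ZaniEngine() as engine:
    for result in engine.query(genomes):
        print(result)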

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
