A zero-dependency, ultra-fast Average Nucleotide Identity (ANI) estimator using Zstandard compression distance.
zani 🧬🗜️🤪
pronounced zany (/ˈzeɪni/)
Average Nucleotide Identity (ANI) estimator using Zstandard compression distance.
About
zani computes pairwise genomic distances using the Normalized Compression Distance (NCD) metric.
Inspired by the pioneering work of LZ-ANI, zani leverages the
blazing-fast Zstandard (zstd) compression algorithm to estimate Average Nucleotide Identity (ANI) without the need
for expensive sequence alignments or k-mer counting.
The Algorithm
At its core, zani treats reference genomes as compression dictionaries. For a given reference genome $x$ and a query genome $y$:
- Dictionary Training: A Zstd dictionary is trained on the reference genome $x$.
- Baseline Compression: We compute $C(x)$, the size of the reference genome compressed with its own dictionary.
- Conditional Compression: The query genome $y$ is compressed using the dictionary trained on $x$. This yields $C(y|x)$, representing the amount of novel information in $y$ not found in $x$.
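The three steps above can be sketched with the standard library alone. This is an illustration, not zani's implementation: it swaps Zstandard for `zlib` (DEFLATE) and passes the raw reference bytes as a preset dictionary instead of training one. The function names `compressed_size` and `conditional_size` are hypothetical. Because priming DEFLATE with the full reference would make $C(x)$ trivially small, the baseline here is taken without a dictionary, which differs from the "own dictionary" baseline described above.

```python
import random
import zlib


def compressed_size(data: bytes, zdict: bytes = b"") -> int:
    """Return the DEFLATE-compressed size of `data`, optionally primed with `zdict`."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data)) + len(comp.flush())


def conditional_size(reference: bytes, query: bytes) -> int:
    """Approximate C(y|x): compress the query with the reference as the dictionary."""
    return compressed_size(query, zdict=reference)


rng = random.Random(0)
# Synthetic 20 kb "genomes"; zlib only looks back 32 KiB, so real genomes need zstd
reference = bytes(rng.choice(b"ACGT") for _ in range(20_000))
# A close relative: the reference with roughly 1% of bases substituted
related = bytes(rng.choice(b"ACGT") if rng.random() < 0.01 else b for b in reference)
# An unrelated random sequence of the same length
unrelated = bytes(rng.choice(b"ACGT") for _ in range(20_000))

print("C(x)           :", compressed_size(reference))
print("C(related|x)   :", conditional_size(reference, related))
print("C(unrelated|x) :", conditional_size(reference, unrelated))
```

The related sequence should compress to far fewer bytes than the unrelated one, because most of its content can be encoded as back-references into the reference.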
The Math
zani calculates distance using the standard Normalized Compression Distance (NCD) formula:
$$ NCD(x,y) = \frac{C(x,y) - \min(C(x), C(y))}{\max(C(x), C(y))} $$
To achieve maximum execution speed, zani approximates the joint compression size $C(x,y)$ as:
$$ C(x,y) \approx C(x) + C(y|x) $$
Furthermore, to avoid compressing the query genome a second time just to find its baseline $C(y)$, zani estimates $C(y)$ from the ratio of the uncompressed lengths ($|x|$ and $|y|$), which assumes both genomes compress at a similar ratio:
$$ C(y) \approx C(x) \times \frac{|y|}{|x|} $$
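Putting the three formulas together, the distance needs only two measured compressed sizes and the two raw lengths. The function below is a minimal sketch of that arithmetic; the name `approx_ncd` and its parameters are hypothetical and not part of zani's API, and how zani maps the resulting distance to an ANI value is not covered here.

```python
def approx_ncd(c_x: float, c_y_given_x: float, len_x: int, len_y: int) -> float:
    """Approximate NCD(x, y) from C(x), C(y|x) and the uncompressed lengths."""
    if len_x <= 0 or c_x <= 0:
        raise ValueError("reference length and compressed size must be positive")
    c_y = c_x * len_y / len_x          # C(y) ~ C(x) * |y| / |x|
    c_xy = c_x + c_y_given_x           # C(x,y) ~ C(x) + C(y|x)
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)


# Equal-length genomes where the query adds 10% novel compressed content
print(approx_ncd(c_x=1000, c_y_given_x=100, len_x=4_000_000, len_y=4_000_000))
```

With equal lengths the estimate reduces to $C(y|x) / C(x)$, so the example evaluates to 0.1.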
These approximations, combined with zero-copy memoryviews and thread-local C contexts, allow zani to stream thousands of genomes through concurrent worker threads while keeping all available CPU cores busy.
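The pattern of per-thread state plus zero-copy views can be sketched as follows. This is an assumption-laden illustration rather than zani's code: `zlib` stands in for Zstandard, and the `Context` class, `worker`, and `conditional_sizes` are hypothetical names. Each worker thread lazily builds its own context through `threading.local`, and queries are handed over as `memoryview` objects so no bytes are copied on the way in.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

_tls = threading.local()  # attributes set here are private to each thread


class Context:
    """Hypothetical stand-in for a reusable per-thread compression context."""

    def __init__(self, reference: bytes):
        self.reference = reference
        self.jobs_done = 0

    def conditional_size(self, query: memoryview) -> int:
        # zlib compressors are single-use, so a fresh one is made per query
        comp = zlib.compressobj(zdict=self.reference)
        self.jobs_done += 1
        return len(comp.compress(query)) + len(comp.flush())


def worker(reference: bytes, query: memoryview) -> int:
    """Compress one query using the calling thread's own context."""
    ctx = getattr(_tls, "ctx", None)
    if ctx is None or ctx.reference is not reference:
        ctx = _tls.ctx = Context(reference)
    return ctx.conditional_size(query)


def conditional_sizes(reference: bytes, queries, max_workers: int = 4) -> list:
    """Return C(y|x) for every query, in input order, using a thread pool."""
    views = [memoryview(q) for q in queries]  # views share the original buffers
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda view: worker(reference, view), views))


ref = b"ACGT" * 500
print(conditional_sizes(ref, [b"ACGT" * 100, b"TTGACCA" * 50, b"ACGT" * 10]))
```

`pool.map` preserves input order, so results line up with the queries regardless of which thread finished first.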
Installation
zani can be installed with pip:
pip install zani
CLI Usage 💻
zani has a very basic CLI. Use it like so:
❯ uv run zani -h
usage: zani <genomes ...> [options]
🧬🗜️🤪 Average Nucleotide Identity (ANI) estimator using Zstandard compression distance.
📁:
Input arguments
<genomes ...> Paths to genomes in fasta format; Files may be compressed.
-a, --allvsall Run all-vs-all comparison
🛠️:
Other options
-t, --max-workers Maximum number of threads to use for parallelization
-v, --version Show version number and exit
-h, --help Show this help message and exit
API Usage 💻
from pathlib import Path

from zani import ZaniEngine

# Collect every gzipped FASTA file in the genomes/ directory
genomes = Path('genomes').glob('*.fasta.gz')

with ZaniEngine() as engine:
    for result in engine.query(genomes):
        print(result)
💭 Feedback
⚠️ Issue Tracker
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.