WhatsGNU protein allele frequency analysis for AllTheBacteria (2.4M+ genomes)
WhatsGNU-ATB
WhatsGNU-ATB is a custom reimplementation of WhatsGNU, optimised for protein allele frequency analysis at the scale of AllTheBacteria (2.4M+ bacterial genomes). It uses LMDB-backed sharded storage (8 shards) with numpy for hashing, and its query tool is custom-built for this database format.
WhatsGNU-ATB builds a sharded LMDB database from Bakta protein annotations and lets you query any bacterial genome to find out, for each protein, how many of the 2,438,285 genomes carry an identical copy — along with which species they belong to and which genomes are most similar.
A pre-built database covering all AllTheBacteria genomes is available on OSF. If you just want to query genomes, skip to Quick Start (Query).
Features
- GNU scores: for every protein in a query genome, reports the exact number of genomes (out of 2.4M+) containing an identical allele
- Species breakdown: top-K species contributing to each allele, with counts (other metadata like MLST contributions are coming soon)
- Genome similarity: ranks all 2.4M+ genomes by shared protein alleles with your query, identifying the closest relatives
- Batch querying: pass a directory of `.faa` files to query hundreds of genomes in one run
- Sequence export: optionally include the amino acid sequence in the output
- Sharded LMDB backend: 8 parallel shards with batched reads for fast lookups
- Optional sequence storage: store a representative amino acid sequence per allele hash in the database
- Allele counts export: dump the full allele frequency table as a TSV
Installation
Option A — Conda (recommended, once available on bioconda)
conda install -c bioconda whatsgnu-atb
Option B — pip
pip install whatsgnu-atb
Option C — From source
git clone https://github.com/microbialARC/WhatsGNU-ATB.git
cd WhatsGNU-ATB
bash setup_whatsgnu_atb.sh
conda activate whatsgnu-atb
Option D — Manual from source
conda create -n whatsgnu-atb -c conda-forge python=3.12
conda activate whatsgnu-atb
pip install numpy lmdb pandas
git clone https://github.com/microbialARC/WhatsGNU-ATB.git
For publication figure generation, also install:
pip install matplotlib seaborn networkx adjustText scipy
Quick Start (Query)
If you just want to query genomes against the pre-built AllTheBacteria database:
1. Download the database from OSF
Use the included downloader (no OSF account or token required):
# Download the database (required for querying)
python scripts/download_osf.py --folder WGNU_ATB_DB --out-dir ./WGNU_ATB_DB
# Download everything
python scripts/download_osf.py --all --out-dir ./whatsgnu_db
# List available folders
python scripts/download_osf.py --list
The downloader skips files that have already been downloaded with the correct size, so it is safe to rerun if interrupted.
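The skip-if-complete behaviour described above can be sketched as a size comparison. This is a minimal illustration, not the code in `download_osf.py`: the function name `needs_download` and the idea that the expected size is known before the transfer starts are assumptions made for the example.

```python
from pathlib import Path


def needs_download(local_path: Path, expected_size: int) -> bool:
    """Return True when the file is missing or its size differs from expected_size.

    Hypothetical helper: the real downloader may use additional checks.
    """
    if not local_path.is_file():
        return True
    return local_path.stat().st_size != expected_size
```

A size check of this kind is cheap and makes reruns safe after an interruption, since a partially written file has the wrong size and is fetched again.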
2. Query a single genome
Your input must be a protein FASTA (.faa) file. See the AllTheBacteria Bakta documentation or the Bakta GitHub if you need to annotate your genome first.
Basic query (GNU scores only — fast, no postings needed):
python scripts/Query_WhatsGNU_ATB.py \
--db_dir WGNU_ATB_DB/ \
--shards 8 \
--faa your_genome.bakta.faa \
--out_dir results/
Full query (GNU scores + species breakdown + genome similarity):
python scripts/Query_WhatsGNU_ATB.py \
--db_dir WGNU_ATB_DB/ \
--shards 8 \
--faa your_genome.bakta.faa \
--include_sequence \
--with_postings \
--samples_tsv WGNU_ATB_DB/samples_with_ids.tsv \
--species_names_tsv WGNU_ATB_DB/samples_with_ids.tsv \
--top_k_species 5 \
--top_k_genomes 10 \
--out_dir results/
3. Query a batch of genomes
Pass a directory instead of a single file:
python scripts/Query_WhatsGNU_ATB.py \
--db_dir WGNU_ATB_DB/ \
--shards 8 \
--faa directory_of_faa_files/ \
--include_sequence \
--with_postings \
--out_dir results_batch/
Note: If you installed via conda or pip, the scripts are on your PATH and you can run `Query_WhatsGNU_ATB.py`, `WhatsGNU_ATB_DB.py`, and `download_osf.py` directly without the `scripts/` prefix.
OSF Data
All data is hosted at https://osf.io/6jr4u/:
| Folder | Description |
|---|---|
| `WGNU_ATB_DB/` | Pre-built LMDB database (8 count + 8 posting shards, genome-to-species index, function lookup table, sample-to-ID mapping (`samples_with_ids.tsv`), build metadata). Required for querying. |
| `Sample_tables/` | List of included genomes (`final_2438285_genomes.txt`), species statistics, and per-genome/per-species allele record counts. |
| `ATB_hash_seq/` | Hash-to-amino-acid-sequence lookup table, split into 20 xz-compressed parts (`hash_to_sequence_part_00.xz` – `part_19.xz`). |
| `ATB_summary_figures_tables/` | Publication figures, per-species GNU histograms, allele frequency tables, species-sharing networks, coverage estimates, cross-species allele analyses, and the pre-computed counts cache. |
Query Output Files
<sample>.whatsgnu.tsv
Per-protein results with one row per protein:
| Column | Description |
|---|---|
| `protein_id` | Protein identifier from the FASTA header |
| `allele_hash` | 128-bit BLAKE2b hash of the amino acid sequence (hex) |
| `sequence` | Amino acid sequence from the query genome (if `--include_sequence`) |
| `GNU_count` | Number of genomes containing this exact allele |
| `top5_species_names` | Top 5 species carrying this allele (if `--with_postings`) |
| `top5_species_counts` | Counts per species (if `--with_postings`) |
| `total_db_hits` | Total genomes in posting list |
| `hits_checked` | Number of postings actually decoded |
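As a rough illustration of working with this output, the sketch below pulls out proteins with a low `GNU_count`. It assumes the file is tab-delimited with a header row using the column names listed above. The function name `rare_proteins`, the threshold, and the example rows are invented for the illustration.

```python
import csv
import io


def rare_proteins(tsv_handle, max_gnu=100):
    """Return (protein_id, GNU_count) pairs whose count is at most max_gnu."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    hits = []
    for row in reader:
        count = int(row["GNU_count"])
        if count <= max_gnu:
            hits.append((row["protein_id"], count))
    return hits


# Made-up rows for illustration only; real files are written by Query_WhatsGNU_ATB.py.
EXAMPLE_TSV = (
    "protein_id\tallele_hash\tGNU_count\n"
    "prot_0001\taaaa\t152340\n"
    "prot_0002\tbbbb\t12\n"
    "prot_0003\tcccc\t0\n"
)

print(rare_proteins(io.StringIO(EXAMPLE_TSV)))
```

For a real result file, pass an open file handle (for example `open("results/sample.whatsgnu.tsv", newline="")`, a hypothetical path) instead of the `io.StringIO` wrapper.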
<sample>.similarity.tsv
Genome similarity ranking (if --with_postings):
| Column | Description |
|---|---|
| `rank` | Rank by shared alleles (1 = most similar) |
| `genome_id` | Integer genome ID |
| `sample_name` | Sample accession (if `--samples_tsv` provided) |
| `species_id` | Species integer ID |
| `species_name` | Species name (if `--species_names_tsv` provided) |
| `shared_alleles` | Number of identical proteins shared with query |
| `percent_of_query` | Shared alleles as percentage of query proteome |
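To make the relationship between `shared_alleles`, `rank`, and `percent_of_query` concrete, here is a small standard-library sketch. It is not the tool's implementation: it assumes one posting list (genome IDs) per query protein, counts each genome at most once per protein, and breaks ties by lower genome ID. The function name `rank_genomes` is hypothetical.

```python
import heapq
from collections import Counter


def rank_genomes(postings_per_protein, top_k=10):
    """Rank genomes by how many query proteins they share an identical allele with.

    postings_per_protein: one iterable of genome IDs per query protein
    (an empty iterable means the allele was not found in the database).
    """
    n_query = len(postings_per_protein)
    shared = Counter()
    for genome_ids in postings_per_protein:
        shared.update(set(genome_ids))  # count each genome once per protein

    # Highest shared count first; ties broken by lower genome ID (an assumption).
    top = heapq.nlargest(top_k, shared.items(), key=lambda kv: (kv[1], -kv[0]))
    return [
        {
            "rank": rank,
            "genome_id": genome_id,
            "shared_alleles": count,
            "percent_of_query": 100.0 * count / n_query,
        }
        for rank, (genome_id, count) in enumerate(top, start=1)
    ]
```

With four query proteins whose posting lists are `[1, 2, 3]`, `[2, 3]`, `[3]`, and `[]`, genome 3 shares three alleles (75% of the query) and genome 2 shares two (50%).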
Query Options Reference
| Option | Description | Default |
|---|---|---|
| `--db_dir` | Root database directory (required) | - |
| `--shards` | Number of shards, must be power of 2 (required) | - |
| `--faa` | Input .faa file or directory of .faa files (required) | - |
| `--out_dir` | Output directory (required) | - |
| `--with_postings` | Enable species breakdown and genome similarity | off |
| `--include_sequence` | Include amino acid sequence in output | off |
| `--top_k_species` | Number of top species to report per protein | 5 |
| `--top_k_genomes` | Number of top similar genomes to report | 10 |
| `--postings_limit` | Max genome IDs to decode per allele (0 = all) | 0 |
| `--species_names_tsv` | TSV mapping SpeciesID → species name | none |
| `--samples_tsv` | TSV mapping SampleID → sample accession | none |
Interpreting GNU Scores
| GNU Score Range | Interpretation |
|---|---|
| >100,000 | Highly conserved ubiquitous allele |
| 1,000–10,000 | Common allele |
| 1–100 | Rare allele, likely strain-specific |
| 0 | Unique to the query genome — not in any AllTheBacteria genome |
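The table can be applied programmatically with a small lookup. The sketch below encodes only the four ranges listed. Scores that fall between them (for example 500 or 50,000) are reported as not covered rather than guessed. The function name `interpret_gnu` and the inclusive boundary handling are assumptions.

```python
def interpret_gnu(score: int) -> str:
    """Map a GNU score to the interpretation given in the table above."""
    if score < 0:
        raise ValueError("GNU score cannot be negative")
    if score == 0:
        return "Unique to the query genome"
    if 1 <= score <= 100:
        return "Rare allele, likely strain-specific"
    if 1000 <= score <= 10000:
        return "Common allele"
    if score > 100000:
        return "Highly conserved ubiquitous allele"
    # Ranges the table does not list are left uninterpreted on purpose.
    return "Not covered by the table"
```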
Building a Database
To build a new database from scratch (e.g., for a custom genome set):
Input Requirements
A sample table TSV with these columns:
| Column | Description |
|---|---|
| `SampleID` | Unique integer ID per genome |
| `Sample` | Sample name (used to find .faa file) |
| `SpeciesID` | Integer species ID |
Optional column: `faa_path` (full path to the FAA file). If the column is absent, each file is looked up as `<faa_dir>/<Sample><faa_suffix>`, using the values of `--faa_dir` and `--faa_suffix`.
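The path lookup rule can be sketched as follows. This is an illustration under stated assumptions, not the build script's code: the function name `resolve_faa_paths` is hypothetical, an empty `faa_path` cell is treated the same as a missing column, and the sample names in the usage example are invented.

```python
import csv
import io
from pathlib import Path


def resolve_faa_paths(table_handle, faa_dir=None, faa_suffix=".bakta.faa"):
    """Map SampleID -> FAA path from a tab-delimited sample table."""
    reader = csv.DictReader(table_handle, delimiter="\t")
    paths = {}
    for row in reader:
        explicit = (row.get("faa_path") or "").strip()
        if explicit:
            path = Path(explicit)  # explicit path wins
        elif faa_dir is not None:
            path = Path(faa_dir) / f"{row['Sample']}{faa_suffix}"
        else:
            raise ValueError(f"No faa_path or faa_dir for sample {row['Sample']}")
        paths[int(row["SampleID"])] = path
    return paths


# Invented sample names, for illustration only.
TABLE = "SampleID\tSample\tSpeciesID\n0\tSAMN01\t7\n1\tSAMN02\t7\n"
print(resolve_faa_paths(io.StringIO(TABLE), faa_dir="/data/faa"))
```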
Build Command
python scripts/WhatsGNU_ATB_DB.py \
--sample_table samples_with_ids.tsv \
--faa_dir /path/to/faa_files/ \
--out_dir WGNU_ATB_DB/ \
--tmp_dir /scratch/tmp/ \
--shards 8 \
--with_postings \
--sort_mem_mb 65536 \
--lmdb_map_gb_counts_per_shard 24 \
--lmdb_map_gb_postings_per_shard 160 \
--export_allele_counts allele_counts.tsv \
--log_file build.log \
--log_level INFO
Build with Sequences
To also store representative amino acid sequences per allele hash:
python scripts/WhatsGNU_ATB_DB.py \
--sample_table samples_with_ids.tsv \
--faa_dir /path/to/faa_files/ \
--out_dir WGNU_ATB_DB/ \
--shards 8 \
--with_postings \
--with_sequences \
--lmdb_map_gb_sequences_per_shard 25 \
--log_level INFO
Build Options Reference
| Option | Description | Default |
|---|---|---|
| `--sample_table` | Sample table TSV (required) | - |
| `--faa_dir` | Directory of .faa files | none |
| `--out_dir` | Output directory (required) | - |
| `--tmp_dir` | Temp directory for intermediate files | `<out_dir>/tmp` |
| `--reduce_tmp_dir` | Local scratch for sort/reduce (faster I/O) | none |
| `--shards` | Number of shards, power of 2 | 16 |
| `--with_postings` | Build posting lists (genome IDs per allele) | off |
| `--with_sequences` | Store representative AA sequence per allele | off |
| `--faa_suffix` | Suffix appended to Sample name for FAA lookup | `.bakta.faa` |
| `--sort_mem_mb` | RAM for external sort per shard (MB) | 65536 |
| `--lmdb_map_gb_counts_per_shard` | LMDB map size for counts (GB) | 24 |
| `--lmdb_map_gb_postings_per_shard` | LMDB map size for postings (GB) | 160 |
| `--lmdb_map_gb_sequences_per_shard` | LMDB map size for sequences (GB) | 25 |
| `--export_allele_counts` | Path to write allele frequency TSV | none |
| `--parse_only` | Only parse FAA → record bins, skip reduce | off |
| `--reduce_only` | Only reduce existing record bins → LMDB | off |
| `--resume` | Auto-detect: skip parse if record bins exist | off |
| `--skip_existing_shards` | Skip shards with existing LMDB output | off |
| `--log_file` | Log file path | `<out_dir>/build.log` |
| `--log_level` | Logging level | INFO |
Database Structure
WGNU_ATB_DB/
├── lmdb_counts/
│ ├── shard_00/ # LMDB: hash → (func_id, GNU_count)
│ ├── shard_01/
│ └── ...
├── lmdb_postings/ # (if --with_postings)
│ ├── shard_00/ # LMDB: hash → varint-encoded genome IDs
│ ├── shard_01/
│ └── ...
├── lmdb_sequences/ # (if --with_sequences)
│ ├── shard_00/ # LMDB: hash → amino acid sequence (UTF-8)
│ └── ...
├── indexes/
│ └── genome_species.u32 # Binary array: genome_id → species_id
└── metadata/
├── build_info.json # Build parameters, stats, version
└── functions.tsv.gz # Function ID → function description
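The posting shards in the layout above store genome IDs in a compact form, described in Technical Details as delta + varint encoded sorted unique IDs. The sketch below shows one common way to do this (LEB128-style varints, 7 bits per byte, low bits first). The exact byte layout used by WhatsGNU-ATB is not stated in this README, so treat the layout and the function names as assumptions.

```python
def encode_postings(genome_ids):
    """Delta + varint encode genome IDs (sorted and deduplicated first)."""
    out = bytearray()
    prev = 0
    for gid in sorted(set(genome_ids)):
        delta = gid - prev  # store the gap to the previous ID, not the ID itself
        prev = gid
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # high bit set: more bytes follow
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_postings(buf):
    """Inverse of encode_postings: return the sorted list of genome IDs."""
    ids = []
    prev = 0
    value = 0
    shift = 0
    for byte in buf:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte
        else:
            prev += value  # undo the delta
            ids.append(prev)
            value = 0
            shift = 0
    return ids
```

Because IDs are sorted, the gaps between neighbours are small for widely shared alleles, so most gaps fit in a single byte.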
Technical Details
- Hashing: BLAKE2b with 128-bit (16-byte) digest of the amino acid sequence
- Sharding: `shard_id = first_byte(hash) & (num_shards - 1)`
- GNU count: number of genomes containing an allele at least once (deduplicated within each genome)
- Postings: delta + varint encoded sorted unique genome IDs
- External sort: numpy structured arrays for memory-efficient sorting; batched multi-pass merge with fanin of 64
- Query optimizations: batched LMDB reads (one transaction per shard), numpy-vectorized species lookups, partial argsort for top-K genome ranking
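The hashing, sharding, and GNU count rules listed above can be sketched with the standard library alone. This is an illustration, not the project's code: it assumes sequences are hashed as plain ASCII bytes with no extra normalisation, and the function names are hypothetical. The real pipeline stores counts in LMDB instead of an in-memory dictionary.

```python
import hashlib
from collections import defaultdict


def allele_hash(sequence: str) -> bytes:
    """128-bit (16-byte) BLAKE2b digest of an amino acid sequence."""
    return hashlib.blake2b(sequence.encode("ascii"), digest_size=16).digest()


def shard_id(digest: bytes, num_shards: int = 8) -> int:
    """shard_id = first_byte(hash) & (num_shards - 1); needs a power-of-2 shard count."""
    if num_shards <= 0 or num_shards & (num_shards - 1):
        raise ValueError("num_shards must be a power of 2")
    return digest[0] & (num_shards - 1)


def gnu_counts(genomes):
    """Count, per allele hash, the genomes that contain it at least once.

    genomes: dict mapping a genome name to its list of protein sequences.
    """
    counts = defaultdict(int)
    for proteins in genomes.values():
        # The set removes duplicate alleles within one genome.
        for digest in {allele_hash(p) for p in proteins}:
            counts[digest] += 1
    return dict(counts)
```

The bit mask only distributes hashes evenly when the shard count is a power of 2, which matches the requirement stated for `--shards`.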
Resource Requirements
Building (2.4M genomes, 8 shards)
| Resource | Recommendation |
|---|---|
| RAM | 250–500 GB |
| CPUs | 4–6 cores |
| Disk (tmp) | ~2 TB scratch |
| Wall time | 6–24 hours (I/O dependent) |
Querying
| Resource | Recommendation |
|---|---|
| RAM | ~2 GB (basic) / ~4 GB (with postings) |
| Wall time | ~5–150 seconds per genome |
Citation
If you use WhatsGNU-ATB in your research, please cite:
Moustafa AM and Planet PJ. WhatsGNU: a tool for identifying proteomic novelty. Genome Biology, 2020. doi:10.1186/s13059-020-01965-w
Hunt M, Lima L, Shen W, Lees J, Iqbal Z. AllTheBacteria - all bacterial genomes assembled, available and searchable. bioRxiv, 2024. https://doi.org/10.1101/2024.03.08.584059
License
GPL-3.0