WhatsGNU protein allele frequency analysis for AllTheBacteria (2.4M+ genomes)

WhatsGNU-ATB

WhatsGNU-ATB is a custom reimplementation of WhatsGNU for protein allele frequency analysis at the scale of AllTheBacteria (2.4M+ bacterial genomes). It uses LMDB-backed sharded storage (8 shards) with numpy for hashing, and its query tool is custom-built for this database format.

WhatsGNU-ATB builds a sharded LMDB database from Bakta protein annotations and lets you query any bacterial genome to find out, for each protein, how many of the 2,438,285 genomes carry an identical copy — along with which species they belong to and which genomes are most similar.

A pre-built database covering all AllTheBacteria genomes is available on OSF. If you just want to query genomes, skip to Quick Start (Query).

Features

  • GNU scores: for every protein in a query genome, reports the exact number of genomes (out of 2.4M+) containing an identical allele
  • Species breakdown: top-K species contributing to each allele, with counts (other metadata like MLST contributions are coming soon)
  • Genome similarity: ranks all 2.4M+ genomes by shared protein alleles with your query, identifying the closest relatives
  • Batch querying: pass a directory of .faa files to query hundreds of genomes in one run
  • Sequence export: optionally include the amino acid sequence in the output
  • Sharded LMDB backend: 8 parallel shards with batched reads for fast lookups
  • Optional sequence storage: store a representative amino acid sequence per allele hash in the database
  • Allele counts export: dump the full allele frequency table as a TSV
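As a toy illustration of what a GNU score measures, the sketch below counts, for each protein sequence, how many genomes in a tiny in-memory "database" carry an identical copy. The function name, the dictionary layout and the sequences are made up for the example; the real tool works on hashed alleles stored in LMDB, not on Python dictionaries.

```python
from collections import defaultdict

def gnu_counts(genomes):
    """Count, for each protein sequence, the number of genomes carrying it.

    `genomes` maps a genome name to a list of protein sequences. A sequence
    that occurs several times in one genome is counted once for that genome.
    """
    counts = defaultdict(int)
    for proteins in genomes.values():
        for seq in set(proteins):  # deduplicate within the genome
            counts[seq] += 1
    return dict(counts)

# Hypothetical three-genome database with very short toy "proteins".
toy_db = {
    "genome_A": ["MKT", "MKT", "MLV"],
    "genome_B": ["MKT", "MQR"],
    "genome_C": ["MLV"],
}
counts = gnu_counts(toy_db)

# Score a query proteome: alleles absent from the database score 0.
query = ["MKT", "MLV", "MAA"]
scores = {seq: counts.get(seq, 0) for seq in query}
print(scores)
```

Note that "MKT" appears twice in genome_A but contributes only one genome to its count.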

Installation

Option A — Conda (recommended, once available on bioconda)

conda install -c bioconda whatsgnu-atb

Option B — pip

pip install whatsgnu-atb

Option C — From source

git clone https://github.com/microbialARC/WhatsGNU-ATB.git
cd WhatsGNU-ATB
bash setup_whatsgnu_atb.sh
conda activate whatsgnu-atb

Option D — Manual from source

conda create -n whatsgnu-atb -c conda-forge python=3.12
conda activate whatsgnu-atb
pip install numpy lmdb pandas

git clone https://github.com/microbialARC/WhatsGNU-ATB.git

For publication figure generation, also install:

pip install matplotlib seaborn networkx adjustText scipy

Quick Start (Query)

If you just want to query genomes against the pre-built AllTheBacteria database:

1. Download the database from OSF

Use the included downloader (no OSF account or token required):

# Download the database (required for querying)
python scripts/download_osf.py --folder WGNU_ATB_DB --out-dir ./WGNU_ATB_DB

# Download everything
python scripts/download_osf.py --all --out-dir ./whatsgnu_db

# List available folders
python scripts/download_osf.py --list

The downloader skips files that have already been downloaded with the correct size, so it is safe to rerun if interrupted.
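The skip-on-rerun behaviour can be pictured with a size check like the one below. This is only a sketch of the idea, not the code in download_osf.py: the function name `needs_download` and the way the expected size is passed in are assumptions.

```python
import os
import tempfile

def needs_download(local_path, expected_size):
    """Return True unless a file of exactly `expected_size` bytes exists."""
    if not os.path.isfile(local_path):
        return True
    return os.path.getsize(local_path) != expected_size

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "part.bin")
    with open(path, "wb") as fh:
        fh.write(b"12345")  # 5 bytes on disk

    complete = needs_download(path, 5)   # sizes match, so skip
    truncated = needs_download(path, 9)  # size differs, so fetch again
    missing = needs_download(os.path.join(tmp, "absent.bin"), 5)

print(complete, truncated, missing)
```

A size comparison catches interrupted transfers but not corrupted content of the right length; a checksum would be needed for that.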

2. Query a single genome

Your input must be a protein FASTA (.faa) file. See the AllTheBacteria Bakta documentation or the Bakta GitHub if you need to annotate your genome first.

Basic query (GNU scores only — fast, no postings needed):

python scripts/Query_WhatsGNU_ATB.py \
    --db_dir WGNU_ATB_DB/ \
    --shards 8 \
    --faa your_genome.bakta.faa \
    --out_dir results/

Full query (GNU scores + species breakdown + genome similarity):

python scripts/Query_WhatsGNU_ATB.py \
    --db_dir WGNU_ATB_DB/ \
    --shards 8 \
    --faa your_genome.bakta.faa \
    --include_sequence \
    --with_postings \
    --samples_tsv WGNU_ATB_DB/samples_with_ids.tsv \
    --species_names_tsv WGNU_ATB_DB/samples_with_ids.tsv \
    --top_k_species 5 \
    --top_k_genomes 10 \
    --out_dir results/

3. Query a batch of genomes

Pass a directory instead of a single file:

python scripts/Query_WhatsGNU_ATB.py \
    --db_dir WGNU_ATB_DB/ \
    --shards 8 \
    --faa directory_of_faa_files/ \
    --include_sequence \
    --with_postings \
    --out_dir results_batch/

Note: If you installed via conda or pip, the scripts are on your PATH and you can run Query_WhatsGNU_ATB.py, WhatsGNU_ATB_DB.py, and download_osf.py directly without the scripts/ prefix.

OSF Data

All data is hosted at https://osf.io/6jr4u/:

| Folder | Description |
| --- | --- |
| WGNU_ATB_DB/ | Pre-built LMDB database (8 count + 8 posting shards, genome-to-species index, function lookup table, sample-to-ID mapping (samples_with_ids.tsv), build metadata). Required for querying. |
| Sample_tables/ | List of included genomes (final_2438285_genomes.txt), species statistics, and per-genome/per-species allele record counts. |
| ATB_hash_seq/ | Hash-to-amino-acid-sequence lookup table, split into 20 xz-compressed parts (hash_to_sequence_part_00.xz to part_19.xz). |
| ATB_summary_figures_tables/ | Publication figures, per-species GNU histograms, allele frequency tables, species-sharing networks, coverage estimates, cross-species allele analyses, and the pre-computed counts cache. |

Query Output Files

<sample>.whatsgnu.tsv

Per-protein results with one row per protein:

| Column | Description |
| --- | --- |
| protein_id | Protein identifier from the FASTA header |
| allele_hash | 128-bit BLAKE2b hash of the amino acid sequence (hex) |
| sequence | Amino acid sequence from the query genome (if --include_sequence) |
| GNU_count | Number of genomes containing this exact allele |
| top5_species_names | Top 5 species carrying this allele (if --with_postings) |
| top5_species_counts | Counts per species (if --with_postings) |
| total_db_hits | Total genomes in posting list |
| hits_checked | Number of postings actually decoded |
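For downstream filtering, the per-protein table can be read with Python's csv module. The sketch below assumes the file is tab-separated with a header row that uses the column names listed above; the helper name `rare_proteins` and the example rows are invented for illustration.

```python
import csv
import io

def rare_proteins(tsv_file, max_count=100):
    """Return (protein_id, GNU_count) pairs with GNU_count <= max_count."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    hits = []
    for row in reader:
        count = int(row["GNU_count"])
        if count <= max_count:
            hits.append((row["protein_id"], count))
    return hits

# Made-up content standing in for a <sample>.whatsgnu.tsv file.
example = io.StringIO(
    "protein_id\tallele_hash\tGNU_count\n"
    "prot_1\taaaa\t250000\n"
    "prot_2\tbbbb\t12\n"
    "prot_3\tcccc\t0\n"
)
rare = rare_proteins(example)
print(rare)
```

With a real result file, pass `open("results/<sample>.whatsgnu.tsv", newline="")` instead of the in-memory example.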

<sample>.similarity.tsv

Genome similarity ranking (if --with_postings):

| Column | Description |
| --- | --- |
| rank | Rank by shared alleles (1 = most similar) |
| genome_id | Integer genome ID |
| sample_name | Sample accession (if --samples_tsv provided) |
| species_id | Species integer ID |
| species_name | Species name (if --species_names_tsv provided) |
| shared_alleles | Number of identical proteins shared with query |
| percent_of_query | Shared alleles as percentage of query proteome |
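How shared_alleles and percent_of_query relate can be sketched with plain Python. The example assumes posting lists are available as a dictionary from allele hash to genome IDs and that every query protein counts once toward the denominator; both are simplifications for illustration, and the actual tool is described as using numpy with a partial argsort rather than a full sort.

```python
from collections import Counter

def rank_similar_genomes(query_hashes, postings, top_k=10):
    """Rank genomes by the number of query alleles they share."""
    n_query = len(query_hashes)
    if n_query == 0:
        return []
    shared = Counter()
    for allele in query_hashes:
        for genome_id in postings.get(allele, ()):
            shared[genome_id] += 1
    # Sort by shared alleles (descending), then genome ID for stable ties.
    ranked = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return [
        {
            "rank": rank,
            "genome_id": genome_id,
            "shared_alleles": n_shared,
            "percent_of_query": 100.0 * n_shared / n_query,
        }
        for rank, (genome_id, n_shared) in enumerate(ranked, start=1)
    ]

# Toy posting lists: allele hash -> genome IDs carrying that allele.
toy_postings = {"h1": [1, 2, 3], "h2": [2, 3], "h3": [3]}
result = rank_similar_genomes(["h1", "h2", "h3", "h4"], toy_postings, top_k=2)
print(result)
```

Here genome 3 shares three of the four query alleles (75%), genome 2 shares two (50%), and genome 1 falls outside the top 2.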

Query Options Reference

| Option | Description | Default |
| --- | --- | --- |
| --db_dir | Root database directory | (required) |
| --shards | Number of shards, must be power of 2 | (required) |
| --faa | Input .faa file or directory of .faa files | (required) |
| --out_dir | Output directory | (required) |
| --with_postings | Enable species breakdown and genome similarity | off |
| --include_sequence | Include amino acid sequence in output | off |
| --top_k_species | Number of top species to report per protein | 5 |
| --top_k_genomes | Number of top similar genomes to report | 10 |
| --postings_limit | Max genome IDs to decode per allele (0 = all) | 0 |
| --species_names_tsv | TSV mapping SpeciesID → species name | none |
| --samples_tsv | TSV mapping SampleID → sample accession | none |

Interpreting GNU Scores

| GNU Score Range | Interpretation |
| --- | --- |
| >100,000 | Highly conserved ubiquitous allele |
| 1,000–10,000 | Common allele |
| 1–100 | Rare allele, likely strain-specific |
| 0 | Unique to the query genome (not in any AllTheBacteria genome) |
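The table can be turned into a small lookup for annotating results. This is a sketch only: the function name is hypothetical, and scores that fall between the listed ranges (for example 500 or 50,000) return None because the table assigns them no label.

```python
def describe_gnu(count):
    """Map a GNU score to the label given in the table, or None."""
    if count < 0:
        raise ValueError("GNU scores cannot be negative")
    if count == 0:
        return "Unique to the query genome"
    if 1 <= count <= 100:
        return "Rare allele, likely strain-specific"
    if 1_000 <= count <= 10_000:
        return "Common allele"
    if count > 100_000:
        return "Highly conserved ubiquitous allele"
    return None  # range not labelled in the table

labels = [describe_gnu(n) for n in (0, 42, 5_000, 50_000, 250_000)]
print(labels)
```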

Building a Database

To build a new database from scratch (e.g., for a custom genome set):

Input Requirements

A sample table TSV with these columns:

| Column | Description |
| --- | --- |
| SampleID | Unique integer ID per genome |
| Sample | Sample name (used to find .faa file) |
| SpeciesID | Integer species ID |

Optional column: faa_path (full path to the FAA file). If it is absent, the file is looked up as <faa_dir>/<Sample><faa_suffix>, using the values of --faa_dir and --faa_suffix.
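The lookup rule for FAA files can be sketched as follows. The helper `resolve_faa_path`, the sample names and the paths are hypothetical; the default suffix is the documented default of --faa_suffix, and treating an empty faa_path cell like a missing column is an assumption.

```python
import csv
import io
import os

def resolve_faa_path(row, faa_dir=None, faa_suffix=".bakta.faa"):
    """Pick the FAA file for one sample-table row."""
    explicit = (row.get("faa_path") or "").strip()
    if explicit:
        return explicit  # the optional column wins when filled in
    if faa_dir is None:
        raise ValueError("no faa_path column value and no faa_dir given")
    return os.path.join(faa_dir, row["Sample"] + faa_suffix)

# Made-up sample table: one row with an explicit path, one without.
table = io.StringIO(
    "SampleID\tSample\tSpeciesID\tfaa_path\n"
    "0\tsample_A\t3\t/data/custom/sample_A.faa\n"
    "1\tsample_B\t3\t\n"
)
rows = list(csv.DictReader(table, delimiter="\t"))
paths = [resolve_faa_path(r, faa_dir="faa_files") for r in rows]
print(paths)
```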

Build Command

python scripts/WhatsGNU_ATB_DB.py \
    --sample_table samples_with_ids.tsv \
    --faa_dir /path/to/faa_files/ \
    --out_dir WGNU_ATB_DB/ \
    --tmp_dir /scratch/tmp/ \
    --shards 8 \
    --with_postings \
    --sort_mem_mb 65536 \
    --lmdb_map_gb_counts_per_shard 24 \
    --lmdb_map_gb_postings_per_shard 160 \
    --export_allele_counts allele_counts.tsv \
    --log_file build.log \
    --log_level INFO

Build with Sequences

To also store representative amino acid sequences per allele hash:

python scripts/WhatsGNU_ATB_DB.py \
    --sample_table samples_with_ids.tsv \
    --faa_dir /path/to/faa_files/ \
    --out_dir WGNU_ATB_DB/ \
    --shards 8 \
    --with_postings \
    --with_sequences \
    --lmdb_map_gb_sequences_per_shard 25 \
    --log_level INFO

Build Options Reference

| Option | Description | Default |
| --- | --- | --- |
| --sample_table | Sample table TSV | (required) |
| --faa_dir | Directory of .faa files | none |
| --out_dir | Output directory | (required) |
| --tmp_dir | Temp directory for intermediate files | <out_dir>/tmp |
| --reduce_tmp_dir | Local scratch for sort/reduce (faster I/O) | none |
| --shards | Number of shards, power of 2 | 16 |
| --with_postings | Build posting lists (genome IDs per allele) | off |
| --with_sequences | Store representative AA sequence per allele | off |
| --faa_suffix | Suffix appended to Sample name for FAA lookup | .bakta.faa |
| --sort_mem_mb | RAM for external sort per shard (MB) | 65536 |
| --lmdb_map_gb_counts_per_shard | LMDB map size for counts (GB) | 24 |
| --lmdb_map_gb_postings_per_shard | LMDB map size for postings (GB) | 160 |
| --lmdb_map_gb_sequences_per_shard | LMDB map size for sequences (GB) | 25 |
| --export_allele_counts | Path to write allele frequency TSV | none |
| --parse_only | Only parse FAA → record bins, skip reduce | off |
| --reduce_only | Only reduce existing record bins → LMDB | off |
| --resume | Auto-detect: skip parse if record bins exist | off |
| --skip_existing_shards | Skip shards with existing LMDB output | off |
| --log_file | Log file path | <out_dir>/build.log |
| --log_level | Logging level | INFO |

Database Structure

WGNU_ATB_DB/
├── lmdb_counts/
│   ├── shard_00/         # LMDB: hash → (func_id, GNU_count)
│   ├── shard_01/
│   └── ...
├── lmdb_postings/        # (if --with_postings)
│   ├── shard_00/         # LMDB: hash → varint-encoded genome IDs
│   ├── shard_01/
│   └── ...
├── lmdb_sequences/       # (if --with_sequences)
│   ├── shard_00/         # LMDB: hash → amino acid sequence (UTF-8)
│   └── ...
├── indexes/
│   └── genome_species.u32   # Binary array: genome_id → species_id
└── metadata/
    ├── build_info.json       # Build parameters, stats, version
    └── functions.tsv.gz      # Function ID → function description
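The genome-to-species index is described as a binary array indexed by genome ID. A reader for such a file might look like the sketch below, which assumes headerless, little-endian, unsigned 32-bit values; the byte order and the absence of a header are guesses based on the .u32 extension, so check build_info.json or the source before relying on them.

```python
import os
import struct
import tempfile

def load_genome_species(path):
    """Read a flat uint32 file where index = genome_id, value = species_id."""
    with open(path, "rb") as fh:
        data = fh.read()
    if len(data) % 4:
        raise ValueError("file size is not a multiple of 4 bytes")
    # "<" = little-endian (assumed), "I" = 4-byte unsigned integer.
    return struct.unpack("<%dI" % (len(data) // 4), data)

# Round trip on a toy index: genomes 0..3 with made-up species IDs.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "genome_species.u32")
    with open(toy_path, "wb") as fh:
        fh.write(struct.pack("<4I", 7, 7, 12, 3))
    species = load_genome_species(toy_path)

print(species[2])  # species ID of genome 2 in the toy index
```

Reading the whole file is fine for a toy example; for the full index, a memory-mapped array (for instance via numpy.memmap) would avoid loading everything at once.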

Technical Details

  • Hashing: BLAKE2b with 128-bit (16-byte) digest of the amino acid sequence
  • Sharding: shard_id = first_byte(hash) & (num_shards - 1)
  • GNU count: number of genomes containing an allele at least once (deduplicated within each genome)
  • Postings: delta + varint encoded sorted unique genome IDs
  • External sort: numpy structured arrays for memory-efficient sorting; batched multi-pass merge with fanin of 64
  • Query optimizations: batched LMDB reads (one transaction per shard), numpy-vectorized species lookups, partial argsort for top-K genome ranking
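The hashing, sharding and posting-list bullets above can be sketched with the standard library alone. Several details are assumptions rather than facts about the tool: how sequences are normalised before hashing (ASCII bytes as-is here), the varint flavour (7 data bits per byte, low bits first, high bit as a continuation flag), and storing the first ID as a delta from zero. All function names are hypothetical.

```python
import hashlib

def allele_hash(sequence):
    """128-bit (16-byte) BLAKE2b digest of an amino acid sequence."""
    return hashlib.blake2b(sequence.encode("ascii"), digest_size=16).digest()

def shard_for(digest, num_shards=8):
    """shard_id = first byte of the hash, masked by (num_shards - 1)."""
    if num_shards <= 0 or num_shards & (num_shards - 1):
        raise ValueError("num_shards must be a power of 2")
    return digest[0] & (num_shards - 1)

def encode_postings(genome_ids):
    """Delta + varint encode the sorted unique genome IDs."""
    out = bytearray()
    previous = 0
    for genome_id in sorted(set(genome_ids)):
        delta = genome_id - previous
        previous = genome_id
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # more bytes follow
            delta >>= 7
        out.append(delta)  # final byte, continuation bit clear
    return bytes(out)

def decode_postings(blob):
    """Invert encode_postings: varint decode, then undo the deltas."""
    ids, current, value, shift = [], 0, 0, 0
    for byte in blob:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            ids.append(current)
            value, shift = 0, 0
    return ids

digest = allele_hash("MKTAYIAKQR")  # toy sequence
shard = shard_for(digest, num_shards=8)
blob = encode_postings([5, 300, 300, 2])
print(digest.hex(), shard, blob.hex(), decode_postings(blob))
```

The power-of-2 requirement on the shard count is what makes the bit mask equivalent to a modulo. Delta encoding pays off because sorted genome IDs tend to have small gaps, which fit in one or two varint bytes.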

Resource Requirements

Building (2.4M genomes, 8 shards)

| Resource | Recommendation |
| --- | --- |
| RAM | 250–500 GB |
| CPUs | 4–6 cores |
| Disk (tmp) | ~2 TB scratch |
| Wall time | 6–24 hours (I/O dependent) |

Querying

| Resource | Recommendation |
| --- | --- |
| RAM | ~2 GB (basic) / ~4 GB (with postings) |
| Wall time | ~5–150 seconds per genome |

Citation

If you use WhatsGNU-ATB in your research, please cite:

Moustafa AM and Planet PJ. WhatsGNU: a tool for identifying proteomic novelty. Genome Biology, 2020. doi:10.1186/s13059-020-01965-w

Hunt M, Lima L, Shen W, Lees J, Iqbal Z. AllTheBacteria - all bacterial genomes assembled, available and searchable. bioRxiv, 2024. https://doi.org/10.1101/2024.03.08.584059

License

GPL-3.0
