
Germline-informed reverse translation of antibody amino acid sequences to nucleotide sequences


abverse



abverse is a companion package to abstar. It takes antibody amino acid sequences — common output from mass spectrometry, proteomics, or databases — and produces nucleotide sequences that are maximally faithful to the inferred germline, so that downstream abstar annotation (V/J assignment, mutation counts, CDR/FWR regions) reflects real somatic hypermutation rather than arbitrary codon choices.


Why abverse?

abstar requires nucleotide input. Researchers with AA sequences have two options:

  1. Naive reverse-translation → pick any codon per amino acid → run abstar → get inflated mutation counts and unreliable CDR boundaries because every codon choice that differs from germline looks like a mutation.
  2. abverse → single-pass algorithm → germline-faithful NT → feed directly into abstar.run().

The abverse approach is provably optimal with respect to germline identity: for each codon position aligned to a germline gene, it picks the synonymous codon with the minimum Hamming distance to the germline codon (ties broken by human codon frequency). Because codons don't overlap and Hamming distance is additive across positions, the global minimum equals the sum of the per-position minima. The entire lookup table is pre-computed at import time, so each position costs a single O(1) lookup at runtime.
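
The per-position argument can be illustrated with a short sketch. It builds the standard genetic code, then fills a table keyed by (germline codon, amino acid) with the synonymous codon closest to the germline codon. The names `build_lookup` and `hamming` and the optional `freq` tie-break mapping are assumptions made for illustration, not abverse internals; when no frequencies are supplied, ties fall back to alphabetical order rather than human codon usage.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_lookup(freq=None):
    """Map (germline_codon, amino_acid) -> closest synonymous codon.

    freq is a hypothetical {codon: usage} mapping used only to break ties;
    without it, ties resolve alphabetically.
    """
    freq = freq or {}
    synonymous = {}
    for codon, aa in CODON_TO_AA.items():
        if aa != "*":  # stop codons are never emitted
            synonymous.setdefault(aa, []).append(codon)
    table = {}
    for germline in CODON_TO_AA:
        for aa, codons in synonymous.items():
            table[(germline, aa)] = min(
                codons,
                key=lambda c: (hamming(c, germline), -freq.get(c, 0.0), c),
            )
    return table


LOOKUP = build_lookup()
print(len(LOOKUP))           # 1280 = 64 germline codons x 20 amino acids
print(LOOKUP[("AGC", "S")])  # AGC: the germline codon already encodes Ser
print(LOOKUP[("GTG", "S")])  # TCG: the only Ser codon within 2 mismatches of GTG
```

Since each entry is chosen independently of its neighbours, concatenating the per-position winners yields a sequence whose total mismatch count against the germline cannot be lowered without changing the encoded amino acids.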


Installation

pip install abverse

Requirements: Python ≥ 3.10, abutils ≥ 0.5.1, abstar (for germline databases), polars ≥ 0.20, MMseqs2 (bundled via abutils).


Quick start

import abutils
import abverse
import abstar

# Load your AA sequences (FASTA file, list of strings, or list of abutils.Sequence)
aa_seqs = abutils.io.read_fasta("antibodies_aa.fasta")

# Reverse-translate to germline-faithful NT sequences
nt_seqs = abverse.reverse_translate(aa_seqs)

# Feed directly into abstar — results will have meaningful mutation counts
results = abstar.run(nt_seqs)

The returned nt_seqs is a list[abutils.Sequence]. Each sequence carries three annotations:

Annotation             Description
v_call                 Assigned V germline gene
j_call                 Assigned J germline gene
reconstruction_method  germline_vj, germline_v_only, germline_j_only, or codon_frequency

API

abverse.reverse_translate(sequences, ...)

abverse.reverse_translate(
    sequences,              # FASTA path | list[str] | list[abutils.Sequence]
    species="human",        # germline species
    receptor="bcr",         # receptor type
    n_processes=None,       # worker processes (default: cpu_count)
    threads=None,           # MMseqs2 threads
    chunksize=500,          # sequences per worker batch
    force_rebuild_db=False, # force re-build of germline AA databases
    output_fasta=None,      # optional path to write NT FASTA
    verbose=False,          # print progress
) -> list[abutils.Sequence]

abverse.build_germline_aa_db(species, receptor, force_rebuild)

Pre-builds (or validates the cache of) the germline amino acid databases used internally. Call this once on first install to populate ~/.abverse/germline_dbs/. Subsequent calls reuse the cache unless the source germline files change (SHA-256 invalidation).


How it works

Algorithm

1. MMseqs2 protein–protein search (all AA sequences vs. V germline AA DB)
   → best V assignment per sequence

2. Extract post-V region (aa_seq[v_qend+1:]) per sequence
   → MMseqs2 protein–protein search vs. J germline AA DB
   → best J assignment per sequence

3. Parallel reconstruction (ProcessPoolExecutor):
   • 5' overhang (before V alignment)  → most frequent human codon
   • V region                           → argmin_c[Hamming(c, germline_codon)] per position
   • CDR3 (V end → J start)            → most frequent human codon
   • J region                           → argmin_c[Hamming(c, germline_codon)] per position
   • 3' overhang (after J alignment)   → most frequent human codon

4. Validate: assert translate(output_nt) == input_aa for every sequence
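
Steps 3 and 4 can be sketched on a toy sequence. The example below handles only a V-aligned region flanked by overhangs, assumes a gapless in-frame alignment, and uses a two-entry `PREFERRED` mapping as a stand-in for a full human codon-usage table. The function names and arguments are hypothetical and do not reflect the signatures in abverse's `_reconstruct.py`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Illustrative stand-in for a human codon-usage table (only residues used below).
PREFERRED = {"M": "ATG", "K": "AAG"}


def translate(nt):
    """Translate an in-frame nucleotide string whose length is a multiple of 3."""
    return "".join(CODON_TO_AA[nt[i:i + 3]] for i in range(0, len(nt), 3))


def closest_codon(aa, germline_codon):
    """Synonymous codon for aa with the fewest mismatches to germline_codon."""
    candidates = sorted(c for c, a in CODON_TO_AA.items() if a == aa)
    return min(candidates, key=lambda c: sum(x != y for x, y in zip(c, germline_codon)))


def reconstruct(aa_seq, v_start, v_germline_nt):
    """Rebuild NT for aa_seq; residues from v_start onward align to v_germline_nt."""
    n_aligned = len(v_germline_nt) // 3
    codons = []
    for i, aa in enumerate(aa_seq):
        j = i - v_start
        if 0 <= j < n_aligned:
            # Aligned region: stay as close to the germline codon as possible.
            codons.append(closest_codon(aa, v_germline_nt[3 * j:3 * j + 3]))
        else:
            # Overhang: fall back to the preferred codon.
            codons.append(PREFERRED[aa])
    nt = "".join(codons)
    # Step 4: the output must translate back to the input exactly.
    assert translate(nt) == aa_seq
    return nt


nt = reconstruct("MEVQLSK", v_start=1, v_germline_nt="GAGGTGCAGCTGGTG")
print(nt)  # ATGGAGGTGCAGCTGTCGAAG
```

Here the germline encodes EVQLV while the input has EVQLS, so the only germline-aligned position that differs is the final one, where TCG is the Ser codon nearest to the germline GTG.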

Germline database cache

On first use, abverse translates abstar's nucleotide V/J germlines to amino acid FASTA files, builds MMseqs2 protein databases, and caches everything under ~/.abverse/germline_dbs/. The cache is automatically invalidated and rebuilt if abstar's germline files change (checked via SHA-256).
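
A minimal sketch of SHA-256 based invalidation follows, assuming the cache stores a small JSON manifest alongside the built databases. The manifest layout and the helper names `digest_files`, `write_manifest`, and `cache_is_current` are assumptions for illustration; abverse's actual cache format is not documented here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def digest_files(paths):
    """SHA-256 over the names and contents of the source files, in sorted order."""
    h = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        h.update(path.name.encode())
        h.update(path.read_bytes())
    return h.hexdigest()


def write_manifest(source_files, manifest_path):
    """Record the digest of the sources the cache was built from."""
    Path(manifest_path).write_text(json.dumps({"sha256": digest_files(source_files)}))


def cache_is_current(source_files, manifest_path):
    """True only if a manifest exists and its digest matches the current sources."""
    manifest_path = Path(manifest_path)
    if not manifest_path.exists():
        return False
    stored = json.loads(manifest_path.read_text())
    return stored.get("sha256") == digest_files(source_files)


with tempfile.TemporaryDirectory() as tmp:
    germline = Path(tmp) / "v.fasta"
    manifest = Path(tmp) / "manifest.json"
    germline.write_text(">toy_v\nGAGGTGCAGCTG\n")
    before = cache_is_current([germline], manifest)   # no manifest yet
    write_manifest([germline], manifest)
    fresh = cache_is_current([germline], manifest)    # digest matches
    germline.write_text(">toy_v\nGAGGTGCAGCTC\n")     # source file changed
    stale = not cache_is_current([germline], manifest)
```

Hashing file contents rather than comparing modification times means the cache survives a reinstall that rewrites identical germline files, and is rebuilt only when the sequences themselves change.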

Frame detection for J genes uses the conserved WG.G (IGH) / FG.G (IGK/IGL) motif; a stop-free-frame fallback covers unusual alleles.
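
The motif-based frame detection can be sketched as follows: translate the J nucleotide sequence in all three frames, return the first frame whose translation contains the locus motif, and otherwise return the first stop-free frame. The function names are hypothetical, and the example sequence is an IGHJ4-like sequence used purely for illustration.

```python
import re
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Conserved J motifs: W-G-x-G for heavy chains, F-G-x-G for light chains.
J_MOTIF = {
    "IGH": re.compile(r"WG.G"),
    "IGK": re.compile(r"FG.G"),
    "IGL": re.compile(r"FG.G"),
}


def translate_frame(nt, frame):
    """Translate nt starting at offset frame, ignoring any trailing partial codon."""
    usable = len(nt) - (len(nt) - frame) % 3
    return "".join(CODON_TO_AA[nt[i:i + 3]] for i in range(frame, usable, 3))


def find_j_frame(nt, locus):
    """Return the reading frame (0, 1, or 2) of a J gene, or None if undetermined."""
    translations = [translate_frame(nt, f) for f in range(3)]
    for frame, aa in enumerate(translations):
        if J_MOTIF[locus].search(aa):
            return frame
    # Fallback for unusual alleles: first frame without a stop codon.
    for frame, aa in enumerate(translations):
        if "*" not in aa:
            return frame
    return None


j_nt = "ACTACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG"
frame = find_j_frame(j_nt, "IGH")
print(frame, translate_frame(j_nt, frame))  # 2 YFDYWGQGTLVTVSS
```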


Performance

Benchmarked on a single CPU core with 10,000 BCR AA sequences:

Metric                                Value
Throughput                            ~775 sequences/second/core
abstar calls in critical path         0
translate(output) == input guarantee  100% (validated per sequence)

No iterative abstar calls occur during reverse_translate — the algorithm is a single-pass pipeline.


Integration test results

Tested on 100 real human BCR sequences with known germline assignments:

Metric                   Result  Threshold
V-gene family agreement  ≥ 90%   90%
J-gene family agreement  ≥ 80%   80%
Exact V-call match       75%     informational
Exact J-call match       91%     informational

The exact V-call rate of 75% reflects the fundamental ambiguity of assigning a specific allele from an amino acid sequence alone: multiple alleles can share the same AA sequence. Gene-family agreement, the metric that matters for mutation analysis, meets its threshold for both V and J.
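
The difference between the two kinds of metric can be made concrete with a small sketch. The `gene_family` and `agreement` helpers and the toy call lists below are assumptions for illustration; they are not taken from the abverse test suite, and the numbers are not the integration results.

```python
import re

# Family prefix of an IMGT-style call, e.g. "IGHV3-23*01" -> "IGHV3".
FAMILY_RE = re.compile(r"^(IG[HKL][VDJ]\d+)")


def gene_family(call):
    """Return the family portion of a germline call, or the gene name if unparseable."""
    match = FAMILY_RE.match(call)
    return match.group(1) if match else call.split("*")[0]


def agreement(predicted, expected):
    """Return (exact-call match rate, family match rate) over paired calls."""
    pairs = list(zip(predicted, expected))
    exact = sum(p == e for p, e in pairs) / len(pairs)
    family = sum(gene_family(p) == gene_family(e) for p, e in pairs) / len(pairs)
    return exact, family


# Toy calls: the second pair differs only at the allele level.
predicted = ["IGHV3-23*01", "IGHV3-23*04", "IGHV1-69*01", "IGHV4-34*01"]
expected = ["IGHV3-23*01", "IGHV3-23*01", "IGHV1-69*01", "IGHV3-30*01"]
exact, family = agreement(predicted, expected)
print(exact, family)  # 0.5 0.75
```

An allele-level disagreement such as the second pair lowers the exact-match rate but leaves family agreement untouched.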


Edge cases

Situation                              Handling
No V assignment                        Human codon frequency for all positions; reconstruction_method='codon_frequency'
No J assignment                        Germline lookup for V region; fallback elsewhere
5′ / 3′ overhangs                      Human codon frequency
Germline codon truncated at gene edge  Human codon frequency
Non-standard AA (X, B, Z)              NNN
Stop codon in input AA                 ValueError with position and sequence ID
V/J alignment overlap                  V takes priority; J starts after V end
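
Two of these rules, the stop-codon check and the NNN substitution, are simple enough to sketch directly. The helper names, the error message wording, and the 0-based position are assumptions; abverse's actual exception text may differ.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def check_input(seq_id, aa_seq):
    """Reject input containing a stop codon, reporting where it occurs."""
    pos = aa_seq.find("*")
    if pos != -1:
        # Position is 0-based in this sketch.
        raise ValueError(f"stop codon at position {pos} in sequence {seq_id!r}")


def fallback_codon(aa, preferred):
    """Codon for an unaligned residue: NNN for non-standard residues."""
    if aa not in STANDARD_AA:
        return "NNN"
    return preferred[aa]


try:
    check_input("seq1", "AC*D")
    message = None
except ValueError as err:
    message = str(err)

print(message)                             # stop codon at position 2 in sequence 'seq1'
print(fallback_codon("X", {}))             # NNN
print(fallback_codon("M", {"M": "ATG"}))   # ATG
```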

Development

git clone https://github.com/bnemoz/abverse.git
cd abverse
pip install -e . --no-build-isolation
pip install pytest

# Run all tests (unit + integration + scaling benchmark)
python3 -m pytest abverse/tests/ -v

The test suite (59 tests) covers the codon lookup table, germline database building, per-sequence reconstruction with all edge cases, the end-to-end pipeline, integration with real BCR sequences, and a 10k-sequence throughput benchmark.


Package structure

abverse/
├── pyproject.toml
└── abverse/
    ├── __init__.py          # public API: reverse_translate, build_germline_aa_db
    ├── _codons.py           # 1280-entry optimal codon lookup table
    ├── _germline_db.py      # germline translation, MMseqs2 DB build, cache
    ├── _search.py           # V + J protein–protein search wrappers (Polars)
    ├── _reconstruct.py      # per-sequence NT reconstruction (pure, picklable)
    ├── _pipeline.py         # orchestration and parallel dispatch
    └── tests/
        ├── test_codons.py
        ├── test_germline_db.py
        ├── test_reconstruct.py
        ├── test_pipeline.py
        ├── test_integration.py
        └── test_scaling.py

License

MIT
