Antigen Receptor Domain Annotation — fast TCR/BCR FR/CDR region annotation

arda — Antigen Receptor Domain Annotation

Versatile, fast, exact FR/CDR annotation of TCR and BCR sequences. arda handles mRNA and protein in FASTA and reads in FASTQ from both amplicon and bulk RNA-seq, accepts nucleotide and amino-acid input, and covers all loci at once.

arda does the expensive IgBLAST work once, offline — building a pre-aligned reference database of every in-frame V·J germline scaffold with FR1–4 / CDR1–3 markup — then at runtime maps your sequences to that database with MMseqs2 and transfers the markup through the alignment in a small C++ hot path. The result is an AIRR-formatted annotation that matches IgBLAST (≈97% region concordance on real GenBank mRNA), from a plain CLI + Python library — no Docker, no workflow engine.

Why

IgBLAST is the gold standard but is slow to invoke per-batch and awkward to embed. arda keeps IgBLAST-quality region calls while being:

  • Fast & scalable — MMseqs2 search + a C++ projection step; multiprocessing and SLURM-friendly from small FASTA to large FASTQ.
  • Embeddable: import arda; arda.annotate_sequences(...).
  • Easy to install — conda for the mmseqs binary, pip install -e . for the package + C++ extension; IgBLAST is fetched into a gitignored bin/ and is only needed to (re)build the reference DB, not at runtime.

Install

bash setup.sh            # creates conda env `arda`, fetches IgBLAST, pip install -e .
conda activate arda

Flags: --no-conda (use the active env), --build-db (rebuild references after install), --tests (run the fast suites). The committed database/vdj/<organism>/ references mean most users never need to build anything.

Supported organisms: human, mouse (full IG + TR), rat, rabbit, rhesus_monkey (IG only — IgBLAST ships no TR internal annotation for these).

CLI

arda info                                   # resolved paths + tool availability
arda annotate -i reads.fastq -o out.airr.tsv --organism human --seqtype nt
arda annotate -i prot.fasta  -o out.airr.tsv --organism human --seqtype aa
arda annotate -i reads.fastq -o out.airr.tsv --strand forward   # plus-strand only
arda build-db   --organism all              # rebuild references (needs IgBLAST)
arda build-index --organism all             # (re)build the precompiled mmseqs DBs
arda slurm -i big.fastq -o big.airr.tsv --shards 50 --partition cpu   # cluster scale

See examples/ for a runnable per-locus demo and benchmarks/RESULTS.md for measured speed/accuracy.

The reference database ships with precompiled MMseqs2 indexes (database/vdj/<organism>/mmseqs/), so annotation runs out of the box with no build step. They are used automatically when the local MMseqs2 version matches the shipped one; otherwise arda transparently rebuilds a private cache on first run (arda build-index regenerates the shipped DBs for your version).

Input may be FASTA or FASTQ, plain or gzipped. Nucleotide input is searched on both strands by default (reverse-complement reads are re-oriented and flagged rev_comp=T); a single search annotates a mixed bulk RNA-seq file across all loci.
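As an illustration of consuming the output, the sketch below reads an AIRR TSV with the standard library and counts V calls and re-oriented reads. The sample rows are invented, and the presence of a sequence_id column is an assumption based on the AIRR Rearrangement schema rather than something stated above; a real file would come from arda annotate -o out.airr.tsv.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of an AIRR TSV (values invented for illustration).
SAMPLE_TSV = (
    "sequence_id\trev_comp\tproductive\tv_call\tjunction_aa\n"
    "read1\tF\tT\tIGHV3-23*01\tCARDYW\n"
    "read2\tT\tT\tTRBV9*01\tCASSVGF\n"
    "read3\tT\tF\tIGHV3-23*01\t\n"
)


def summarize_airr(handle):
    """Count V calls and reverse-complemented reads in an AIRR TSV stream."""
    v_counts = Counter()
    n_rev = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        v_counts[row["v_call"]] += 1
        # AIRR TSV encodes booleans as "T"/"F"; rev_comp=T marks re-oriented reads.
        if row["rev_comp"] == "T":
            n_rev += 1
    return v_counts, n_rev


v_counts, n_rev = summarize_airr(io.StringIO(SAMPLE_TSV))
print(v_counts.most_common(1), n_rev)
```

For a file on disk, the same function can be called on open("out.airr.tsv", newline="").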

Library

import arda

records = arda.annotate_sequences(
    ["GACGTGCAG...", ("clone7", "CAGGTG...")],  # strings or (id, seq) pairs
    seqtype="nt", organism="human",
)
# -> list of AIRR record dicts: v_call, d_call/d2_call, j_call, fwr1..fwr4,
#    cdr1..cdr3, *_start/*_end (1-based closed), *_aa, junction(_aa), np1/np2/np3,
#    v_sequence_end, j_sequence_start, productive, rev_comp, ...
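
Because the *_start/*_end coordinates are 1-based and closed, slicing a region out of a Python string needs a small offset. The sketch below shows the conversion on a hypothetical record; the values are invented, and it assumes the record also carries the AIRR sequence field that the coordinates refer to.

```python
def region_slice(sequence, start, end):
    """Return the subsequence for a 1-based, closed [start, end] interval."""
    # 1-based closed [start, end] maps to the 0-based half-open slice [start-1, end).
    return sequence[start - 1:end]


# Hypothetical record: field names follow AIRR, values are invented.
record = {
    "sequence": "AAACCCGGGTTT",
    "cdr1_start": 4,
    "cdr1_end": 9,
}

cdr1 = region_slice(record["sequence"], record["cdr1_start"], record["cdr1_end"])
print(cdr1)  # CCCGGG
```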

Annotating bare germline segments

There is no coverage filter, so a V-only or J-only query maps to its scaffold and only the regions inside the query's coverage are returned. This lets you annotate isolated germline V or J alleles without synthesising a rearrangement — a bare V yields fwr1..fwr3, a bare J yields fwr4:

from arda.annotate.mapper import annotate_records

recs = annotate_records(
    [("TRBV9*01", v_germline_nt), ("TRBJ2-7*01", j_germline_nt)],
    organism="human", seqtype="nt", strand="forward", map_d=False,
)
# V record -> fwr1/cdr1/fwr2/cdr2/fwr3 (+ v_sequence_end = CDR3 start)
# J record -> fwr4 (+ j_sequence_start = CDR3 end / FR4 start)

(mirpy uses exactly this to bake per-allele FR/CDR subsequences into its gene library; see tests/synthetic/test_germline_segments.py.)

How it works

  1. Reference build (arda.refbuild, offline): download IMGT/V-QUEST germlines → enumerate deduplicated in-frame V×J scaffolds (D only affects CDR3 interior, so it isn't enumerated) → annotate with igblastn -outfmt 19 → translate → write database/vdj/<organism>/{alleles.fasta, alleles.aa.fasta, markup.tsv, markup.aa.tsv, combinations.tsv, build.log}.
  2. Runtime (arda.annotate): MMseqs2 search query→scaffolds → best hit → C++ transfer_regions projects scaffold region coordinates onto the query (handling indels, truncation, mid-codon alignment starts, reverse strand) → for VDJ loci a gapless C++ local alignment of the CDR3 interior against the D germlines adds d_call/d2_call + np* → AIRR TSV. Out-of-frame junctions are reported with an N-bridge (_) so FR4 still reads.
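
The markup-transfer idea in step 2 can be sketched in pure Python. This is not arda's C++ code: the function names, the (op, length) alignment encoding, and the 0-based half-open coordinates are all assumptions chosen for illustration. The sketch walks an alignment, maps scaffold positions to query positions, then clips each region to the aligned part, so indels shift coordinates and truncated regions shrink or disappear.

```python
def build_position_map(ref_start, query_start, ops):
    """Map 0-based scaffold positions to 0-based query positions.

    ops is a list of (op, length): 'M' consumes both sequences, 'I' consumes
    the query only (insertion in the query), 'D' consumes the scaffold only
    (deletion in the query).
    """
    mapping = {}
    r, q = ref_start, query_start
    for op, n in ops:
        if op == "M":
            for i in range(n):
                mapping[r + i] = q + i
            r += n
            q += n
        elif op == "I":
            q += n
        elif op == "D":
            r += n
        else:
            raise ValueError(f"unknown op {op!r}")
    return mapping


def project_regions(regions, mapping):
    """Project {name: (start, end)} scaffold regions (end exclusive) onto the query.

    Regions with no aligned position are dropped; partly covered ones are clipped.
    """
    out = {}
    for name, (start, end) in regions.items():
        hits = [mapping[p] for p in range(start, end) if p in mapping]
        if hits:
            out[name] = (min(hits), max(hits) + 1)
    return out


# Hypothetical scaffold markup and alignment (alignment starts at scaffold position 4).
regions = {"fwr1": (0, 10), "cdr1": (10, 16), "fwr2": (16, 30), "fwr3": (30, 40)}
ops = [("M", 8), ("I", 3), ("M", 6), ("D", 2), ("M", 5)]

mapping = build_position_map(ref_start=4, query_start=0, ops=ops)
projected = project_regions(regions, mapping)
print(projected)  # {'fwr1': (0, 6), 'cdr1': (6, 15), 'fwr2': (15, 22)}
```

Here fwr1 is clipped because the alignment starts inside it, cdr1 grows by the three inserted bases, and fwr3 is absent because the query ends before it.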

See memory/ for design rationale and gotchas. Fast sequence primitives (translate, detect_coding_frame, reverse_complement, back_translate) live in the C++ extension and are re-exported from arda.refbuild.translate — mirpy-API-compatible, so mirpy can import arda and reuse them.
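
For readers unfamiliar with these primitives, here is a pure-Python reference for what two of them compute. It is only a sketch: the signatures of arda's C++ functions are not given here, and the handling of ambiguous bases (emitting X) and of a trailing partial codon (dropped) are assumptions of this example, not documented arda behaviour.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered T, C, A, G at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq):
    """Translate frame 0; unknown codons become 'X', a partial last codon is dropped."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


print(reverse_complement("ATGC"))  # GCAT
print(translate("ATGGCCTAA"))      # MA*
```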

Performance

arda produces annotation that matches IgBLAST while running several times faster, and it scales to large FASTQ. Measured on synthetic human IGH with 16 threads (scripts/bench_vs_igblast.py):

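| sequences | arda time | arda rate | speedup vs IgBLAST | region concordance |
|-----------|-----------|-----------|--------------------|--------------------|
| 10,000    | 5.5 s     | ~1.8k/s   | 4.4×               | 98.9%              |
| 50,000    | 16 s      | ~3.0k/s   | 7.3×               |                    |
| 100,000   | 30 s      | ~3.3k/s   | 7.9×               |                    |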

On ~7.3k real GenBank mRNA records spanning all five organisms and their loci (committed, gzipped test fixtures), region concordance with IgBLAST on productive records is 98–99.7% per organism; junction_aa/cdr3_aa match IgBLAST ~99% and satisfy the AIRR invariants exactly. V-gene assignment agrees ~100%. (GenBank also contains genomic/partial/non-productive entries that confuse both tools; those are excluded from the comparison.)

Bulk RNA-seq is much faster than amplicon, because mmseqs prefilters by k-mer matching — reads with no receptor k-mer are rejected before alignment. At 150 nt reads, 16 threads (scripts/bench_prefilter.py):

| receptor content   | throughput    |
|--------------------|---------------|
| 100% (amplicon)    | ~5.7k reads/s |
| 10%                | ~19k reads/s  |
| 1% (blood RNA-seq) | ~25k reads/s  |
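
The reason non-receptor reads are cheap can be shown with a toy k-mer prefilter. This is a deliberate simplification with exact k-mer matching and hypothetical function names; MMseqs2's real prefilter is more elaborate. The point is only that a read sharing no k-mer with the reference is rejected by set lookups before any alignment is attempted.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Collect every k-mer that occurs in any reference sequence."""
    index = set()
    for ref in references:
        index |= kmers(ref, k)
    return index


def passes_prefilter(read, index, k, min_hits=1):
    """Keep a read only if at least min_hits of its k-mers occur in the index."""
    hits = sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in index)
    return hits >= min_hits


# Toy reference and reads (invented sequences).
index = build_index(["ACGTACGTGGCCTTAA"], k=5)
print(passes_prefilter("TTACGTGGCCAA", index, k=5))  # True: shares ACGTG, CGTGG, ...
print(passes_prefilter("TTTTTTTTTTTT", index, k=5))  # False: no shared 5-mer
```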

Extrapolated to a 32-core node, a 30M-read bulk RNA-seq library (~1% receptor) annotates in roughly 10–20 min — the same order of magnitude as a STAR genome alignment pass on the same data (STAR is faster per read, but arda maps only to a tiny germline DB and the non-receptor majority costs just prefilter rejection). Large FASTQ is streamed in bounded chunks (a background reader prefetches the next chunk while the current one is annotated), so memory stays flat regardless of input size — --chunk-size tunes it.
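
The 10–20 min estimate follows from the measured throughput. The short calculation below reproduces it, assuming the ~25k reads/s figure for 1% receptor content at 16 threads and, as an optimistic bound, perfect linear scaling to 32 threads.

```python
reads = 30_000_000
rate_16_threads = 25_000  # reads/s at ~1% receptor content, 16 threads

seconds_16 = reads / rate_16_threads   # 1200 s
minutes_16 = seconds_16 / 60           # 20 min with no scaling beyond 16 threads

# Assumption: throughput doubles when going from 16 to 32 threads.
minutes_32 = minutes_16 * 16 / 32      # 10 min

print(minutes_32, minutes_16)  # 10.0 20.0
```

Real scaling will fall between these two bounds, which is consistent with the quoted range.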

Roadmap / TODO

See ROADMAP.md. Done: V·J reference build (5 organisms), MMseqs2 mapping, C++ markup transfer, reverse-complement, all-loci querying, streaming I/O, out-of-frame junctions, D-segment mapping incl. D-D fusions, precompiled indexes, multi-node (SLURM) sharding. Next: full AIRR productivity.

Development

pip install -e .                                  # rebuilds the C++ ext on import
python -m pytest tests/unit tests/synthetic -q    # fast suite
env ARDA_REALWORLD=1 python -m pytest tests/realworld -s   # vs IgBLAST (network)
env RUN_BENCHMARK=1   python -m pytest tests/benchmark -s  # timing/memory/scaling

Layout: src/arda/{refbuild,annotate}, C++ in src/_markup/markup.cpp, references in database/, downloads in gitignored bin/ + data/.
