Synthetic immune-receptor-sequence simulator for benchmarking alignment models and sequence analysis.
Project description
GenAIRR
Synthetic Adaptive Immune Receptor Repertoire Generator
High-performance BCR and TCR sequence simulation with full ground-truth annotations.
Rust kernel · 23 species · constraint-aware sampling · cross-platform wheels
Installation
pip install GenAIRR
GenAIRR ships as a single wheel that bundles both the Python API and the Rust simulation kernel — no extra packages, no compiler needed. Pre-built wheels are published for Linux (x86_64, aarch64), macOS (Intel + Apple Silicon), and Windows (x64), supporting Python 3.9+.
Building from source needs a stable Rust toolchain (rustup install stable) — see CONTRIBUTING.md.
Quick Start
import GenAIRR as ga
# Generate 1,000 human heavy-chain sequences via standard V(D)J recombination.
outcomes = ga.Experiment.on("human_igh").recombine().run(n=1000, seed=42)
# Each item is an Outcome wrapping the full pipeline state.
o = outcomes[0]
sim = o.final_simulation()
sim.bases() # b'gaggtgcagctggtg...' — the assembled sequence
sim.regions() # [<Region V [0..296) frame_phase=0>,
# <Region NP1 [296..309) frame_phase=2>,
# <Region D [309..336) frame_phase=0>,
# <Region NP2 [336..342) frame_phase=0>,
# <Region J [342..388) frame_phase=0>]
sim.v_allele_id() # 146 (index into the V pool of the active refdata)
sim.d_allele_id() # 5
sim.j_allele_id() # 1
# The trace records every random draw the engine made.
o.trace().find("np.np1.length").value # 13
o.pass_names() # ['sample_allele.v', 'sample_allele.d',
# 'sample_allele.j', 'trim.v_3', ...]
Experiment.on(...) accepts a config-name string (e.g. "human_igh", "mouse_tcrb"), a DataConfig loaded from the bundled species pickles, or a RefDataConfig for custom reference data.
See the full walkthrough in the docs: Quick Start · Interpreting Results
Constraint-aware sampling
GenAIRR's signature feature is constraint-aware sampling: contracts that prune the candidate distribution at sample time, not retries after the fact. The canonical bundle is productive() (in-frame junction + no stop codons + V/J anchors preserved):
import GenAIRR as ga
# Every sequence is productive by construction. No retry loops, no
# post-hoc filtering — the engine only ever picks NP lengths, NP bases,
# and mutation substitutions that satisfy the bundle.
outcomes = (
ga.Experiment.on("human_igh")
.recombine()
.run(n=1000, seed=42, respect=ga.productive())
)
Docs: Productive sequences
Strict vs permissive mode
By default, if a contract can't admit any candidate at a sampling step the runtime falls back to unconstrained sampling and the run continues. Pass strict=True to surface the failure as an exception instead — useful for catching unsatisfiable plans early during development:
import GenAIRR as ga
try:
ga.Experiment.on("human_igh").recombine().run(
n=10, seed=42, respect=ga.productive(), strict=True
)
except ga.StrictSamplingError as e:
pass_name, address, reason = e.args
# pass_name e.g. "generate_np.np1", address e.g. "np.np1.length",
# reason in {"empty_admissible_support", "support_unavailable", ...}
print(f"{pass_name} could not satisfy the contract at {address}: {reason}")
Reproducibility
import GenAIRR as ga
# Same seed → byte-identical outcomes across runs and platforms.
a = ga.Experiment.on("human_igh").recombine().run(n=100, seed=42)
b = ga.Experiment.on("human_igh").recombine().run(n=100, seed=42)
assert a[0].final_simulation().bases() == b[0].final_simulation().bases()
# `n` runs use seeds [seed, seed+1, ..., seed+n-1] so consecutive
# batches stitch together by offsetting the starting seed.
batch_a = ga.Experiment.on("human_igh").recombine().run(n=100, seed=0)
batch_b = ga.Experiment.on("human_igh").recombine().run(n=100, seed=100)
# batch_a[50] is byte-equal to a one-off run at seed=50.
Docs: Reproducibility
Compile once, run many times
For a hot loop, compile() once and reuse the plan. Contracts (respect=) are baked into the compiled plan, so they only need to be passed once:
import GenAIRR as ga
compiled = (
ga.Experiment.on("human_igk")
.recombine()
.compile(respect=ga.productive())
)
# Run 10 batches of 100, seeded so they don't overlap.
for batch in range(10):
outcomes = compiled.run(n=100, seed=batch * 100)
What you get back
Every outcome in the returned list is an Outcome with:
| Accessor | Returns | Description |
|---|---|---|
outcome.final_simulation() |
Simulation |
The end-of-pipeline IR snapshot. |
outcome.revision(i) |
Simulation |
The IR after the i-th pass — full step-by-step history. |
outcome.revision_after(name) |
Simulation | None |
First revision produced by the named pass. |
outcome.pass_names() |
list[str] |
Names of every pass that ran, in order. |
outcome.trace() |
Trace |
Addressed log of every random draw. |
Each Simulation exposes len(sim) (pool length), sim.bases() → bytes, sim.regions() → list[Region], sim.germline_position(i), sim.v_allele_id() / .d_allele_id() / .j_allele_id(). Each Region carries segment ("V"/"D"/"J"/"NP1"/"NP2"), start/end/len(), frame_phase, and amino_acids() → bytes (codon-rail translation, including stop markers and ambiguous codons).
outcome.trace() supports find(address), prefix_query(prefix), and prefix_count(prefix) — every random draw the engine made is keyed by a hierarchical address ("sample_allele.v", "np.np1.length", "np.np1.bases[3]", ...). This is the same trace the engine uses internally for replay determinism, so your downstream tooling sees exactly what the kernel saw.
Docs: Simulation Pipeline · Metadata Accuracy · Interpreting Results
Supported Species & Chains
GenAIRR ships with 106 built-in configurations covering 23 species (sourced from IMGT and OGRDB).
import GenAIRR as ga
print(ga.list_configs()) # all available configs
| Species | BCR | TCR |
|---|---|---|
| Human | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Mouse | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Rat | IGH, IGK, IGL | — |
| Rabbit | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Dog | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Cat | IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Rhesus | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
All 23 species
Alpaca, Cat, Chicken, Cow, Cynomolgus, Dog, Dromedary, Ferret, Goat, Gorilla, Horse, Human, Mouse (generic + C57BL/6J), Pig, Platypus, Rabbit, Rat, Rhesus, Salmon, Sheep, Trout, Zebrafish.
import GenAIRR as ga
ga.Experiment.on("mouse_igh").recombine().run(n=500)
ga.Experiment.on("rabbit_tcrb").recombine().run(n=500)
ga.Experiment.on("rhesus_igk").recombine().run(n=500)
Docs: Choosing a config · Chain types
Custom reference data
For non-builtin alleles (custom IMGT pulls, in-house references, etc.) you can build a RefDataConfig directly and pass it to Experiment.on(...):
import GenAIRR as ga
cfg = ga.RefDataConfig.vj()
cfg.add_v_allele("v_custom*01", "v_custom", b"GAAGTACAGCTGGTGCAG...", anchor=288)
cfg.add_v_allele("v_custom*02", "v_custom", b"GAAGTACAGCTAGTGCAG...", anchor=288)
cfg.add_j_allele("j_custom*01", "j_custom", b"TGGGGCCAAGGG...", anchor=10)
outcomes = ga.Experiment.on(cfg).recombine().run(n=100, seed=42)
RefDataConfig.vdj() builds a heavy-chain-shaped refdata (with a D pool); add_d_allele(...) populates it. Anchors are 0-based offsets of the V Cys / J W or F codon's first base, used to keep the junction frame-aligned during recombination.
Key Features
- Rust simulation kernel — persistent IR with full revision history, addressed-trace introspection,
cargo test-grade unit coverage. - Constraint-aware sampling — contracts prune candidate distributions at sample time so productive sequences come out of the engine by construction; no retry loops.
- Strict-mode opt-in — surface unsatisfiable plans as
StrictSamplingErrorinstead of silently relaxing the bundle. - Deterministic seeds — same seed reproduces every byte of the pool and every entry of the trace, across runs and platforms.
- Full revision history —
outcome.revision(i)exposes the IR after each pass for fine-grained debugging. - Addressed trace — every random draw is keyed by a hierarchical string (
"np.np1.bases[3]") and survives end-to-end into the returnedOutcome. - 23 species, 106 configs — built-in IMGT + OGRDB reference pickles ship with the wheel.
- Zero mandatory Python dependencies — one wheel, everything in the box.
Optional Extras
pip install GenAIRR[all] # numpy, scipy, graphviz, tqdm
pip install GenAIRR[dataconfig] # numpy + scipy (custom DataConfig analysis)
pip install GenAIRR[viz] # graphviz
Documentation
The full documentation site is at mutejester.github.io/GenAIRR. Useful starting points:
- Getting started — Quick Start · Choosing a Config · Interpreting Results
- Concepts — Simulation Pipeline · Metadata Accuracy
- Guides — Experiment DSL · Chain Types · Export
- Options — Productive · Reproducibility · SHM · Biology · Artifacts
Citing GenAIRR
If GenAIRR is useful in your research, please cite:
Konstantinovsky T, Peres A, Polak P, Yaari G. An unbiased comparison of immunoglobulin sequence aligners. Briefings in Bioinformatics. 2024;25(6):bbae556. doi:10.1093/bib/bbae556
Contributing
Contributions are welcome. See CONTRIBUTING.md for development setup and guidelines.
License
GPL-3.0. See LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file genairr-2.0.0.tar.gz.
File metadata
- Download URL: genairr-2.0.0.tar.gz
- Upload date:
- Size: 3.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
609d5bae6c51bd0bcf436e97a2648d35ab99fb920f1fbfc16b2384dd32256dbd
|
|
| MD5 |
57d1d4fa3417a792e6fc6c7aec418305
|
|
| BLAKE2b-256 |
08eb6ae72962f790ab676e12457df87de5834973a93f474e64d4e111118700f0
|
File details
Details for the file genairr-2.0.0-cp39-abi3-win_amd64.whl.
File metadata
- Download URL: genairr-2.0.0-cp39-abi3-win_amd64.whl
- Upload date:
- Size: 3.6 MB
- Tags: CPython 3.9+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f25cb9cd32eb910c45fd9494fc7fa8c22e601beb8a964a5d09260b1a7808c40e
|
|
| MD5 |
7c4ac19f10bcb762e71ccc86e2331ed9
|
|
| BLAKE2b-256 |
2d8ed2d9838eeec48b1384024d0ec3b0a68c72eee743af6724ed6dbec29adde5
|
File details
Details for the file genairr-2.0.0-cp39-abi3-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: genairr-2.0.0-cp39-abi3-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 3.7 MB
- Tags: CPython 3.9+, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c0db687b4b6cf347675d8c6f70244ac9d5028a718fd35219d5390313a8e429a5
|
|
| MD5 |
5e953df8235453c652304a48a52b07d6
|
|
| BLAKE2b-256 |
8bea655e36ddd531916fb399c4644e6e97a0213009fd353c75294d1cb2305fc1
|
File details
Details for the file genairr-2.0.0-cp39-abi3-manylinux_2_28_aarch64.whl.
File metadata
- Download URL: genairr-2.0.0-cp39-abi3-manylinux_2_28_aarch64.whl
- Upload date:
- Size: 3.7 MB
- Tags: CPython 3.9+, manylinux: glibc 2.28+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
540cdc96e40d31a9934aad39209342121e4a3ecb9ef303fa1740cc1e2637daf7
|
|
| MD5 |
9f50ca9784604d2fd75a129820882875
|
|
| BLAKE2b-256 |
21c368e4277c70b0b7b7aef163fa8be9990d595b1d05b9811890443d5bc291ca
|
File details
Details for the file genairr-2.0.0-cp39-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: genairr-2.0.0-cp39-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 3.7 MB
- Tags: CPython 3.9+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6c9f124d386b283d829022725c567045ee62974fd665599824cee231ca6418da
|
|
| MD5 |
bc15ae34db01b90ab30f7c3c5854eb11
|
|
| BLAKE2b-256 |
46ef1b29c92fdac475ac8242ea294ee85a6a722abaffdf989696ffd7a6f4a8ab
|
File details
Details for the file genairr-2.0.0-cp39-abi3-macosx_10_12_x86_64.whl.
File metadata
- Download URL: genairr-2.0.0-cp39-abi3-macosx_10_12_x86_64.whl
- Upload date:
- Size: 3.7 MB
- Tags: CPython 3.9+, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
088c7a10bf70c6000abf731422b91e52f765989d023fabb67cdd1ca165ccce17
|
|
| MD5 |
ca902e095037c902688a4b46ffe0b503
|
|
| BLAKE2b-256 |
3b698f647d62319d1ed12457cea21fbba582b2567a7253470c07e2dc9bcdfdc5
|