Synthetic immune-receptor-sequence simulator for benchmarking alignment models and sequence analysis.
GenAIRR
Synthetic Adaptive Immune Receptor Repertoire Generator
High-performance BCR and TCR sequence simulation with full ground-truth annotations.
Rust kernel · 23 species · constraint-aware sampling · cross-platform wheels
Installation
pip install GenAIRR
GenAIRR ships as a single wheel that bundles both the Python API and the Rust simulation kernel — no extra packages, no compiler needed. Pre-built wheels are published for Linux (x86_64, aarch64), macOS (Intel + Apple Silicon), and Windows (x64), supporting Python 3.9+.
Building from source needs a stable Rust toolchain (rustup install stable) — see CONTRIBUTING.md.
Quick Start
import GenAIRR as ga
# Generate 1,000 productive human heavy-chain sequences. Every sequence
# comes back with the full AIRR-format annotation block — gene calls,
# junction, productive flag, identity, mutation counts.
result = (
    ga.Experiment.on("human_igh")
    .recombine()
    .run_records(n=1000, seed=42, respect=ga.productive())
)
# `result` is a SimulationResult — list-like over AIRR record dicts.
# Each dict has the 50+ standard AIRR fields per row.
len(result) # 1000
rec = result[0]
rec["sequence"] # 'gaggtgcagctggtggagtctgggggaggc...' (nucleotide)
rec["sequence_aa"] # 'EVQLVESGGGLVQPGGSLRLSCSAS...' (translated)
rec["locus"] # 'IGH'
rec["v_call"] # 'IGHVF10-G38*04' (comma-separated if the call ties)
rec["d_call"] # 'IGHD2-15*01'
rec["j_call"] # 'IGHJ2*01'
rec["junction_aa"] # 'CVKDDGNRGYCSGGSCYGRCCALDYWYFDLW'
rec["productive"] # True
rec["v_identity"] # 1.0 (matches/total over the V segment)
rec["n_mutations"] # 0
# Export in any of the standard formats. TSV/FASTA/FASTQ are dependency-free;
# to_dataframe() needs pandas (pip install GenAIRR[all]).
result.to_tsv("repertoire.tsv") # AIRR-spec TSV (50+ columns)
result.to_fasta("sequences.fasta") # FASTA with v_call/j_call in the headers
result.to_fastq("sequences.fastq") # FASTQ with Illumina-shaped quality scores
df = result.to_dataframe() # one row per record, AIRR columns
Experiment.on(...) accepts a config-name string (e.g. "human_igh", "mouse_tcrb"), a DataConfig loaded from the bundled species pickles, or a RefDataConfig for custom reference data. respect=ga.productive() is the constraint-aware bundle — covered in the next section. Drop it to allow non-productive sequences (~30% of records will then have stop codons in the junction).
See the full walkthrough in the docs: Quick Start · Interpreting Results
A realistic pipeline — everything in one place
The Experiment DSL is a fluent builder. Each step appends to the pipeline; the same Experiment is returned so calls chain. The example below uses every major feature GenAIRR offers — recombination, clonal expansion, per-descendant somatic hypermutation, primer-trimming, structural indels, PCR errors, N-base injection, custom metadata, and the productive constraint:
import GenAIRR as ga
result = (
    ga.Experiment.on("human_igh")
    # 1. V(D)J recombination: sample alleles, trim, fill NP1/NP2, assemble.
    .recombine()
    # 2. Clonal structure: 50 lineages × 20 sister sequences each.
    #    Passes BEFORE this point apply to the parent rearrangement;
    #    passes AFTER apply per-descendant. So each clone shares the
    #    same V(D)J recombination but accumulates its own SHM + errors.
    .with_clonal_structure(n_clones=50, size=20)
    # 3. Somatic hypermutation per descendant: S5F context-dependent
    #    model, 5–15 mutations per sequence sampled uniformly.
    .mutate(model="s5f", count=(5, 15))
    # 4. Sequencing artefacts per descendant: primer trimming, structural
    #    indels, PCR substitution errors, quality-driven N injection.
    .corrupt_5prime_loss(length=(0, 8))
    .corrupt_3prime_loss(length=(0, 4))
    .corrupt_indels(count=(0, 2), insertion_prob=0.5)
    .corrupt_pcr(count=(0, 3))
    .corrupt_ns(count=(0, 2))
    # 5. Stamp arbitrary metadata onto every record.
    .with_metadata(experiment_id="exp001", tissue="peripheral_blood")
    # Constraint-aware sampling: the productive() bundle is enforced at
    # rearrangement + SHM time. Corruption passes can still introduce
    # stop codons / frameshifts post-hoc, so expect ~70% productive
    # when aggressive corruption is in the chain; that mirrors real
    # wet-lab data, where a productive B-cell can sequence as a
    # non-productive read because of an indel during library prep.
    .run_records(seed=42, respect=ga.productive())
)
len(result) # 1000 (= n_clones × size)
sum(1 for r in result if r["productive"]) # 697 (~70% under this corruption load)
# Same clone, different descendants — same V(D)J recombination,
# independent SHM + errors:
result[0]["clone_id"], result[1]["clone_id"] # (0, 0)
result[0]["v_call"], result[1]["v_call"] # both 'IGHVF10-G38*04'
result[0]["n_mutations"], result[1]["n_mutations"] # (13, 15) — independent SHM
result[0]["n_pcr_errors"], result[1]["n_pcr_errors"] # (1, 1) — independent errors
# Custom metadata propagated:
result[0]["experiment_id"], result[0]["tissue"] # ('exp001', 'peripheral_blood')
result.to_tsv("repertoire.tsv")
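Because each record is a plain dict, per-clone summaries need nothing beyond the standard library. The sketch below is illustrative only: `summarize_clones` is a hypothetical helper (not part of GenAIRR), and it runs on hand-written stand-in records that reuse the `clone_id`, `productive`, and `n_mutations` field names shown above.

```python
from collections import defaultdict


def summarize_clones(records):
    """Group AIRR-style record dicts by clone_id and summarize each clone.

    Hypothetical helper for illustration; it only assumes each record has
    the 'clone_id', 'productive', and 'n_mutations' keys.
    """
    by_clone = defaultdict(list)
    for rec in records:
        by_clone[rec["clone_id"]].append(rec)

    summary = {}
    for clone_id, members in by_clone.items():
        size = len(members)
        summary[clone_id] = {
            "size": size,
            # `productive` may be True / False / None; only True counts here.
            "productive_fraction": sum(1 for r in members if r["productive"] is True) / size,
            "mean_mutations": sum(r["n_mutations"] for r in members) / size,
        }
    return summary


# Hand-written stand-in records (not real simulator output).
mock_records = [
    {"clone_id": 0, "productive": True, "n_mutations": 13},
    {"clone_id": 0, "productive": False, "n_mutations": 15},
    {"clone_id": 1, "productive": True, "n_mutations": 7},
]
clone_summary = summarize_clones(mock_records)
```

The same function should accept a real result directly, since a SimulationResult is described as iterating over record dicts.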
Other feature flags worth knowing:
| Step | What it does |
|---|---|
| .corrupt_contaminants(prob=0.02) | Replace ~2% of records with unrelated background sequences. |
| .corrupt_quality(count=(0, 5)) | Lowercase 0–5 bases per sequence to mark sequencer-low-quality positions. |
| .corrupt_reverse_complement(prob=0.5) | Flip ~50% of records to the reverse strand (with the rev_comp flag set). |
| .using(v=[...], d=[...], j=[...]) | Restrict allele sampling to a specific subset; useful for benchmarking against a known repertoire. |
| .mutate(model="uniform", count=(0, 30)) | Use a uniform-rate mutation model instead of S5F. |
| compile() then compiled.run_records(...) | Compile the plan once, reuse it across many batches; see Compile once. |
Constraint-aware sampling
GenAIRR's signature feature is constraint-aware sampling: contracts that prune the candidate distribution at sample time, not retries after the fact. The canonical bundle is productive() (in-frame junction + no stop codons + V/J anchors preserved):
import GenAIRR as ga
# Every sequence is productive by construction. No retry loops, no
# post-hoc filtering — the engine only ever picks NP lengths, NP bases,
# and mutation substitutions that satisfy the bundle.
result = (
    ga.Experiment.on("human_igh")
    .recombine()
    .run_records(n=1000, seed=42, respect=ga.productive())
)
assert all(rec["productive"] for rec in result)
Docs: Productive sequences
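To make "prune at sample time" concrete, here is a toy sketch of the idea for a single draw: choosing an NP length so the junction stays in frame. It is not GenAIRR's engine. The function name, the weight table, and the `fixed_len` parameter are all assumptions made up for illustration. The point is that inadmissible lengths are removed from the distribution before drawing, so no retry loop is needed.

```python
import random


def sample_np_length_pruned(rng, weights, fixed_len):
    """Draw an NP length from only those lengths that keep the junction in frame.

    weights:   dict of candidate NP length -> unnormalized probability (toy values).
    fixed_len: junction bases contributed by everything except this NP region.
    Hypothetical helper; it mimics the pruning idea, not the real kernel.
    """
    # Prune first: keep candidates whose total junction length is a multiple of 3.
    admissible = {n: w for n, w in weights.items() if (fixed_len + n) % 3 == 0}
    if not admissible:
        # Loosely analogous to the "empty_admissible_support" failure reason.
        raise ValueError("empty admissible support")
    lengths = list(admissible)
    # random.choices renormalizes the surviving weights for us.
    return rng.choices(lengths, weights=[admissible[n] for n in lengths], k=1)[0]


rng = random.Random(42)
np_weights = {n: 1.0 / (1 + n) for n in range(13)}  # toy prior over lengths 0..12
draws = [sample_np_length_pruned(rng, np_weights, fixed_len=40) for _ in range(200)]
```

With `fixed_len=40`, only lengths 2, 5, 8, and 11 survive the pruning step, and their relative weights are preserved.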
Strict vs permissive mode
By default, if a contract can't admit any candidate at a sampling step the runtime falls back to unconstrained sampling and the run continues. Pass strict=True to surface the failure as an exception instead — useful for catching unsatisfiable plans early during development:
import GenAIRR as ga
try:
    ga.Experiment.on("human_igh").recombine().run_records(
        n=10, seed=42, respect=ga.productive(), strict=True
    )
except ga.StrictSamplingError as e:
    pass_name, address, reason = e.args
    # pass_name e.g. "generate_np.np1", address e.g. "np.np1.length",
    # reason in {"empty_admissible_support", "support_unavailable", ...}
    print(f"{pass_name} could not satisfy the contract at {address}: {reason}")
Reproducibility
import GenAIRR as ga
# Same seed → byte-identical records across runs and platforms.
a = ga.Experiment.on("human_igh").recombine().run_records(n=100, seed=42)
b = ga.Experiment.on("human_igh").recombine().run_records(n=100, seed=42)
assert a[0]["sequence"] == b[0]["sequence"]
# `n` runs use seeds [seed, seed+1, ..., seed+n-1] so consecutive
# batches stitch together by offsetting the starting seed.
batch_a = ga.Experiment.on("human_igh").recombine().run_records(n=100, seed=0)
batch_b = ga.Experiment.on("human_igh").recombine().run_records(n=100, seed=100)
# batch_a[50] is byte-equal to a one-off run at seed=50.
Docs: Reproducibility
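The seed-offset scheme can be illustrated without the simulator. In the sketch below, `toy_record` and `toy_run` are hypothetical stand-ins: each "record" is just a random 30-mer determined entirely by its seed. The property shown, that two batches with offset seeds equal one longer run, is the one described above. The random generator itself bears no relation to GenAIRR's.

```python
import random


def toy_record(seed):
    """Stand-in for one simulated record: fully determined by its seed."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(30))


def toy_run(n, seed):
    """Record i of a batch uses seed + i (the scheme described above)."""
    return [toy_record(seed + i) for i in range(n)]


batch_a = toy_run(n=100, seed=0)
batch_b = toy_run(n=100, seed=100)
one_shot = toy_run(n=200, seed=0)

# Stitching two offset batches reproduces the single 200-record run.
stitched = batch_a + batch_b
```

Under this scheme, batches overlap only if their seed ranges do, which is why the starting seed is offset by the batch size.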
Compile once, run many times
For a hot loop, compile() once and reuse the plan. Contracts (respect=) are baked into the compiled plan, so they only need to be passed once:
import GenAIRR as ga
compiled = (
    ga.Experiment.on("human_igk")
    .recombine()
    .compile(respect=ga.productive())
)
# Run 10 batches of 100, seeded so they don't overlap.
for batch in range(10):
    result = compiled.run_records(n=100, seed=batch * 100)
    result.to_tsv(f"batch_{batch:02d}.tsv")
What you get back
.run_records(...) returns a SimulationResult — a list-like wrapper around a batch of AIRR record dicts:
| Method / attribute | Returns | Description |
|---|---|---|
| len(result) | int | Number of records in the batch. |
| result[i] | dict | The i-th AIRR record. Standard 0-based indexing + slicing. |
| for rec in result: | iterates dicts | Records in [seed, seed+1, …, seed+n-1] order. |
| result.records | list[dict] | The underlying list. Mutate-through is fine. |
| result.to_tsv(path, *, airr_strict=False) | (none) | AIRR-format TSV. airr_strict=True converts coordinates to 1-based-inclusive per spec. |
| result.to_csv(path, *, airr_strict=False) | (none) | Comma-separated. Same options as to_tsv. |
| result.to_fasta(path, *, prefix="seq") | (none) | FASTA. Headers include v_call and j_call. |
| result.to_fastq(path, *, quality="illumina", **kw) | (none) | FASTQ. Quality models: "illumina" (smoothed trapezoid) or "constant". |
| result.to_dataframe(*, airr_strict=False) | pandas.DataFrame | One row per record. Requires pandas (pip install GenAIRR[all]). |
| result.outcomes | list[Outcome] \| None | The underlying Outcome objects, for advanced introspection (see below). |
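The airr_strict coordinate conversion can be pictured with a two-line helper. This is a sketch under a stated assumption: it treats the default coordinates as 0-based, half-open `[start, end)` (Python slice style), which the text above does not confirm. The function names are hypothetical and are not GenAIRR API.

```python
def to_airr_strict(start, end):
    """0-based half-open [start, end) -> 1-based inclusive [start, end].

    ASSUMPTION: the non-strict convention is 0-based half-open.
    """
    return start + 1, end


def from_airr_strict(start, end):
    """Inverse: 1-based inclusive -> 0-based half-open."""
    return start - 1, end


sequence = "ACGTACGTAC"
v_start, v_end = 2, 6                      # slice-style coordinates
segment = sequence[v_start:v_end]          # bases at indices 2, 3, 4, 5
strict_start, strict_end = to_airr_strict(v_start, v_end)
# In 1-based inclusive form the segment length is end - start + 1.
strict_length = strict_end - strict_start + 1
```

Under that assumption only the start shifts by one, while the end value is numerically unchanged.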
Each record dict has 50+ AIRR fields. The most commonly used:
| Field | Example value | Description |
|---|---|---|
| sequence | 'gaggtgcagctggtg…' | Assembled nucleotide sequence (uppercase + lowercase corruption markers). |
| sequence_aa | 'EVQLVESGGG…' | Codon-rail translation. Stops emit *, ambiguous codons emit X. |
| locus | 'IGH' | Locus code derived from v_call / j_call. |
| v_call / d_call / j_call | 'IGHV3-23*01' | Gene calls. Comma-separated tie set when the evidence walker can't disambiguate. |
| junction / junction_aa | 'TGC…GAC' / 'CAR…D' | Junction nucleotide + AA. AA includes the V Cys (anchor) through J W/F+3. |
| productive | True / False / None | In-frame junction AND no stop codons AND anchors preserved. None when undefined (e.g. junction not present). |
| v_identity / d_identity / j_identity | 0.987 | Match rate over each segment's CIGAR M/D ops. |
| v_cigar / d_cigar / j_cigar | '17D279M' | CIGAR strings. Only M/I/D ops are emitted; no soft-clips. |
| n_mutations / n_pcr_errors / n_quality_errors / n_indels | 4 / 0 / 2 / 1 | Per-record error counts from the trace. |
The full schema (plus the *_sequence_start/end, *_alignment_start/end, *_germline_start/end coordinate fields, vj_in_frame, stop_codon, rev_comp, and others) is documented at Interpreting Results.
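For downstream tooling, the M/I/D-only CIGAR strings are simple to parse. The sketch below uses the standard CIGAR convention (M and D consume germline positions, M and I consume read positions). `parse_cigar` and `cigar_spans` are hypothetical helper names, not functions provided by GenAIRR.

```python
import re

# Only M/I/D are expected, per the field description above.
_CIGAR_OP = re.compile(r"(\d+)([MID])")


def parse_cigar(cigar):
    """Split a CIGAR string like '17D279M' into [(17, 'D'), (279, 'M')]."""
    ops = [(int(count), op) for count, op in _CIGAR_OP.findall(cigar)]
    # Reject anything the pattern skipped (unknown ops, stray characters).
    if "".join(f"{count}{op}" for count, op in ops) != cigar:
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return ops


def cigar_spans(cigar):
    """Total bases per op, plus the germline and read spans they imply."""
    totals = {"M": 0, "I": 0, "D": 0}
    for count, op in parse_cigar(cigar):
        totals[op] += count
    return {
        **totals,
        "germline_span": totals["M"] + totals["D"],  # M and D consume germline
        "read_span": totals["M"] + totals["I"],      # M and I consume the read
    }


spans = cigar_spans("17D279M")
```

For the example value '17D279M' this gives a germline span of 296 and a read span of 279, i.e. 17 germline bases are absent from the read.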
Advanced: full pipeline state via Outcome
When you need step-by-step IR history or the raw trace of every random draw — debugging an engine bug, building a custom alignment tool, replaying a specific seed — use .run() instead of .run_records(). It returns a list of Outcome objects that carry the full pipeline state:
| Accessor | Returns | Description |
|---|---|---|
| outcome.final_simulation() | Simulation | End-of-pipeline IR snapshot. |
| outcome.revision(i) | Simulation | IR after the i-th pass; full step-by-step history. |
| outcome.revision_after(name) | Simulation \| None | First revision produced by the named pass. |
| outcome.pass_names() | list[str] | Names of every pass that ran, in order. |
| outcome.trace() | Trace | Addressed log of every random draw. |
Each Simulation exposes len(sim) (pool length), sim.bases() → bytes, sim.regions() → list[Region], sim.germline_position(i), sim.v_allele_id() / .d_allele_id() / .j_allele_id(). Each Region carries segment ("V"/"D"/"J"/"NP1"/"NP2"), start/end/len(), frame_phase, and amino_acids() → bytes (codon-rail translation, including stop markers and ambiguous codons).
outcome.trace() supports find(address), prefix_query(prefix), and prefix_count(prefix) — every random draw is keyed by a hierarchical address ("sample_allele.v", "np.np1.length", "np.np1.bases[3]", …). This is the same trace the engine uses internally for replay determinism.
.run_records(...) also exposes these via result.outcomes[i] — so you can have both the AIRR records and the deep introspection from a single call.
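The hierarchical-address idea is easy to picture with a toy model. `ToyTrace` below is a hypothetical stand-in that borrows the three method names mentioned above. Its return types are assumptions for illustration, and the recorded values are made up, so treat it as a mental model and not as the real Trace API.

```python
class ToyTrace:
    """Toy addressed log: maps hierarchical address strings to drawn values."""

    def __init__(self, draws):
        # dict preserves insertion order, i.e. the order the draws were made.
        self._draws = dict(draws)

    def find(self, address):
        """Value recorded at an exact address, or None if absent."""
        return self._draws.get(address)

    def prefix_query(self, prefix):
        """All (address, value) pairs whose address starts with the prefix."""
        return [(a, v) for a, v in self._draws.items() if a.startswith(prefix)]

    def prefix_count(self, prefix):
        return len(self.prefix_query(prefix))


# Made-up draws using the address shapes quoted above.
toy_trace = ToyTrace([
    ("sample_allele.v", "IGHV3-23*01"),
    ("np.np1.length", 3),
    ("np.np1.bases[0]", "G"),
    ("np.np1.bases[1]", "A"),
    ("np.np1.bases[2]", "C"),
])
np1_bases = "".join(v for _, v in toy_trace.prefix_query("np.np1.bases"))
```

Keying every draw by address means a whole region's draws (here, all NP1 bases) can be pulled out with one prefix lookup.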
Docs: Simulation Pipeline · Metadata Accuracy · Interpreting Results
Supported Species & Chains
GenAIRR ships with 106 built-in configurations covering 23 species (sourced from IMGT and OGRDB).
import GenAIRR as ga
print(ga.list_configs()) # all available configs
| Species | BCR | TCR |
|---|---|---|
| Human | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Mouse | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Rat | IGH, IGK, IGL | — |
| Rabbit | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Dog | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Cat | IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Rhesus | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
All 23 species
Alpaca, Cat, Chicken, Cow, Cynomolgus, Dog, Dromedary, Ferret, Goat, Gorilla, Horse, Human, Mouse (generic + C57BL/6J), Pig, Platypus, Rabbit, Rat, Rhesus, Salmon, Sheep, Trout, Zebrafish.
import GenAIRR as ga
ga.Experiment.on("mouse_igh").recombine().run_records(n=500)
ga.Experiment.on("rabbit_tcrb").recombine().run_records(n=500)
ga.Experiment.on("rhesus_igk").recombine().run_records(n=500)
Docs: Choosing a config · Chain types
Custom reference data
For non-builtin alleles (custom IMGT pulls, in-house references, etc.) you can build a RefDataConfig directly and pass it to Experiment.on(...):
import GenAIRR as ga
cfg = ga.RefDataConfig.vj()
cfg.add_v_allele("v_custom*01", "v_custom", b"GAAGTACAGCTGGTGCAG...", anchor=288)
cfg.add_v_allele("v_custom*02", "v_custom", b"GAAGTACAGCTAGTGCAG...", anchor=288)
cfg.add_j_allele("j_custom*01", "j_custom", b"TGGGGCCAAGGG...", anchor=10)
result = ga.Experiment.on(cfg).recombine().run_records(n=100, seed=42)
RefDataConfig.vdj() builds a heavy-chain-shaped refdata (with a D pool); add_d_allele(...) populates it. Anchors are 0-based offsets of the V Cys / J W or F codon's first base, used to keep the junction frame-aligned during recombination.
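Since an off-by-one anchor silently breaks the junction frame, it can be worth checking anchors before registering custom alleles. The sketch below is a hypothetical pre-flight check, not a GenAIRR function. It assumes the conserved residues are encoded by the standard codons (TGT/TGC for Cys, TGG for Trp, TTT/TTC for Phe) and uses short made-up sequences instead of real alleles.

```python
CYS_CODONS = {b"TGT", b"TGC"}              # V anchor: conserved Cys
J_ANCHOR_CODONS = {b"TGG", b"TTT", b"TTC"}  # J anchor: Trp or Phe


def codon_at(seq, anchor):
    """Return the uppercase codon whose first base sits at the 0-based anchor."""
    codon = seq[anchor:anchor + 3].upper()
    if anchor < 0 or len(codon) != 3:
        raise ValueError(f"anchor {anchor} does not leave a full codon")
    return codon


def v_anchor_ok(seq, anchor):
    return codon_at(seq, anchor) in CYS_CODONS


def j_anchor_ok(seq, anchor):
    return codon_at(seq, anchor) in J_ANCHOR_CODONS


# Made-up toy sequences (not real germline alleles).
toy_v = b"GAAGTACAGTGTGCG"          # 'TGT' starts at index 9
toy_j = b"ACTACTTTGACTACTGGGGCCAA"  # 'TGG' starts at index 14

v_ok = v_anchor_ok(toy_v, 9)
j_ok = j_anchor_ok(toy_j, 14)
```

Running such a check on every allele before calling add_v_allele / add_j_allele catches shifted anchors early.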
Key Features
- Rust simulation kernel: persistent IR with full revision history, addressed-trace introspection, cargo test-grade unit coverage.
- Constraint-aware sampling: contracts prune candidate distributions at sample time so productive sequences come out of the engine by construction; no retry loops.
- Strict-mode opt-in: surface unsatisfiable plans as StrictSamplingError instead of silently relaxing the bundle.
- Deterministic seeds: same seed reproduces every byte of the pool and every entry of the trace, across runs and platforms.
- Full revision history: outcome.revision(i) exposes the IR after each pass for fine-grained debugging.
- Addressed trace: every random draw is keyed by a hierarchical string ("np.np1.bases[3]") and survives end-to-end into the returned Outcome.
- 23 species, 106 configs: built-in IMGT + OGRDB reference pickles ship with the wheel.
- Zero mandatory Python dependencies: one wheel, everything in the box.
Optional Extras
pip install GenAIRR[all] # numpy, scipy, graphviz, tqdm, fastmcp
pip install GenAIRR[dataconfig] # numpy + scipy (custom DataConfig analysis)
pip install GenAIRR[viz] # graphviz
pip install GenAIRR[mcp] # fastmcp (for the MCP server, see next section)
MCP server — drive GenAIRR from an LLM agent
GenAIRR ships an MCP server that exposes 14 tools an LLM agent (Claude, Cursor, etc.) can call to discover configs, simulate repertoires, validate AIRR records, and replay specific seeds — all without writing Python. Install the extra, then point your MCP client at python -m GenAIRR.mcp_server:
pip install GenAIRR[mcp]
Config snippets
Claude Code — .mcp.json in the project root (or ~/.claude/mcp.json globally):
{
"mcpServers": {
"genairr": {
"type": "stdio",
"command": "python",
"args": ["-m", "GenAIRR.mcp_server"]
}
}
}
Claude Desktop — ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"genairr": {
"command": "/path/to/venv/bin/python",
"args": ["-m", "GenAIRR.mcp_server"]
}
}
}
Cursor — ~/.cursor/mcp.json or per-project:
{
"mcpServers": {
"genairr": {
"command": "python",
"args": ["-m", "GenAIRR.mcp_server"]
}
}
}
Use the full path to the venv's python if GenAIRR isn't installed in the system interpreter — the MCP server inherits the launching process's Python environment.
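A quick way to confirm which interpreter a config entry points at, and whether it can see the package, is to run a short standard-library check with that same python. The helper name `module_available` is hypothetical. `importlib.util.find_spec` is the stdlib call doing the work.

```python
import importlib.util
import sys


def module_available(name):
    """True if the current interpreter can locate the top-level module `name`."""
    return importlib.util.find_spec(name) is not None


# Run this with the exact interpreter named in the MCP config's "command".
interpreter_path = sys.executable
genairr_visible = module_available("GenAIRR")
print(f"{interpreter_path}: GenAIRR importable = {genairr_visible}")
```

If this prints False, point "command" at the virtual environment's python instead.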
What you get
After reloading the client, the agent has 14 tools available under the genairr namespace:
| Category | Tools |
|---|---|
| Discovery | list_configs, config_info, list_alleles, inspect_allele |
| Simulation | simulate_repertoire, simulate_preset, simulate_allele |
| Analysis | validate_records, align_to_germline, score_allele_calls, analyze_mutations, classify_regions, summarize_dataset |
| Reproducibility | replay_seed |
Every tool returns a uniform {ok, tool, elapsed_ms, result | error} envelope; failures carry a stable error-code token (config_not_found, allele_not_found, invalid_preset, invalid_parameter, malformed_record, seed_replay_mismatch) the agent can branch on.
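On the client side, the uniform envelope makes error handling a single branch. The sketch below is illustrative: the top-level keys and the error-code tokens come from the description above, but the inner shape of `error` (a dict with a `"code"` key), the `ToolError` class, and the `unwrap` helper are assumptions, and the envelopes are hand-written mocks.

```python
KNOWN_ERROR_CODES = {
    "config_not_found", "allele_not_found", "invalid_preset",
    "invalid_parameter", "malformed_record", "seed_replay_mismatch",
}


class ToolError(Exception):
    """Hypothetical client-side exception carrying the stable error code."""

    def __init__(self, tool, code):
        super().__init__(f"{tool} failed with {code}")
        self.tool = tool
        self.code = code


def unwrap(envelope):
    """Return the result of a successful call, or raise ToolError.

    ASSUMPTION: on failure, envelope["error"] is a dict with a "code" key.
    """
    if envelope["ok"]:
        return envelope["result"]
    code = envelope["error"]["code"]
    if code not in KNOWN_ERROR_CODES:
        code = "unknown"
    raise ToolError(envelope["tool"], code)


# Hand-written mock envelopes (not real server output).
ok_envelope = {"ok": True, "tool": "list_configs", "elapsed_ms": 3,
               "result": ["human_igh", "mouse_tcrb"]}
bad_envelope = {"ok": False, "tool": "config_info", "elapsed_ms": 1,
                "error": {"code": "config_not_found"}}

configs = unwrap(ok_envelope)
try:
    unwrap(bad_envelope)
    failure_code = None
except ToolError as exc:
    failure_code = exc.code
```

Branching on the code token, not on message text, keeps the handling stable if the wording of messages changes.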
A quick smoke-test prompt to verify the install: "List the available GenAIRR configs, then simulate 100 productive human heavy-chain sequences with moderate SHM and summarise the V-gene usage." — the agent should chain list_configs → simulate_repertoire(config="human_igh", n=100, productive_only=true, mutation_model="s5f", mutation_count_min=5, mutation_count_max=15) and read v_usage_top from the result.
Documentation
The full documentation site is at mutejester.github.io/GenAIRR. Useful starting points:
- Getting started — Quick Start · Choosing a Config · Interpreting Results
- Concepts — Simulation Pipeline · Metadata Accuracy
- Guides — Experiment DSL · Chain Types · Export
- Options — Productive · Reproducibility · SHM · Biology · Artifacts
Citing GenAIRR
If GenAIRR is useful in your research, please cite:
Konstantinovsky T, Peres A, Polak P, Yaari G. An unbiased comparison of immunoglobulin sequence aligners. Briefings in Bioinformatics. 2024;25(6):bbae556. doi:10.1093/bib/bbae556
Contributing
Contributions are welcome. See CONTRIBUTING.md for development setup and guidelines.
License
GPL-3.0. See LICENSE.