Synthetic immune-receptor-sequence simulator for benchmarking alignment models and sequence analysis.

GenAIRR

Synthetic Adaptive Immune Receptor Repertoire Generator

High-performance BCR and TCR sequence simulation with full ground-truth annotations.
Rust kernel · 23 species · constraint-aware sampling · cross-platform wheels

📖 Documentation


Installation

pip install GenAIRR

GenAIRR ships as a single wheel that bundles both the Python API and the Rust simulation kernel — no extra packages, no compiler needed. Pre-built wheels are published for Linux (x86_64, aarch64), macOS (Intel + Apple Silicon), and Windows (x64), supporting Python 3.9+.

Building from source needs a stable Rust toolchain (rustup install stable) — see CONTRIBUTING.md.


Quick Start

import GenAIRR as ga

# Generate 1,000 productive human heavy-chain sequences. Every sequence
# comes back with the full AIRR-format annotation block — gene calls,
# junction, productive flag, identity, mutation counts.
result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .run_records(n=1000, seed=42, respect=ga.productive())
)

# `result` is a SimulationResult — list-like over AIRR record dicts.
# Each dict has the 50+ standard AIRR fields per row.
len(result)                 # 1000
rec = result[0]

rec["sequence"]             # 'gaggtgcagctggtggagtctgggggaggc...' (nucleotide)
rec["sequence_aa"]          # 'EVQLVESGGGLVQPGGSLRLSCSAS...'      (translated)
rec["locus"]                # 'IGH'
rec["v_call"]               # 'IGHVF10-G38*04'   (comma-separated if the call ties)
rec["d_call"]               # 'IGHD2-15*01'
rec["j_call"]               # 'IGHJ2*01'
rec["junction_aa"]          # 'CVKDDGNRGYCSGGSCYGRCCALDYWYFDLW'
rec["productive"]           # True
rec["v_identity"]           # 1.0  (matches/total over the V segment)
rec["n_mutations"]          # 0

# Export in any of the standard formats. TSV/FASTA/FASTQ are dependency-free;
# to_dataframe() needs pandas (pip install GenAIRR[all]).
result.to_tsv("repertoire.tsv")        # AIRR-spec TSV (50+ columns)
result.to_fasta("sequences.fasta")     # FASTA with v_call/j_call in the headers
result.to_fastq("sequences.fastq")     # FASTQ with illumina-shaped quality scores
df = result.to_dataframe()             # one row per record, AIRR columns

Experiment.on(...) accepts a config-name string (e.g. "human_igh", "mouse_tcrb"), a DataConfig loaded from the bundled species pickles, or a RefDataConfig for custom reference data. respect=ga.productive() is the constraint-aware bundle — covered in the next section. Drop it to allow non-productive sequences (~30% of records will then have stop codons in the junction).

See the full walkthrough in the docs: Quick Start · Interpreting Results


A realistic pipeline — everything in one place

The Experiment DSL is a fluent builder. Each step appends to the pipeline and returns the same Experiment, so calls chain. The example below combines most of GenAIRR's major features: recombination, clonal expansion, per-descendant somatic hypermutation, primer trimming, structural indels, PCR errors, N-base injection, custom metadata, and the productive constraint:

import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
      # 1. V(D)J recombination — sample alleles, trim, fill NP1/NP2, assemble.
      .recombine()
      # 2. Clonal structure — 50 lineages × 20 sister sequences each.
      #    Passes BEFORE this point apply to the parent rearrangement;
      #    passes AFTER apply per-descendant. So each clone shares the
      #    same V(D)J recombination but accumulates its own SHM + errors.
      .with_clonal_structure(n_clones=50, size=20)
      # 3. Somatic hypermutation per descendant — S5F context-dependent
      #    model, 5–15 mutations per sequence sampled uniformly.
      .mutate(model="s5f", count=(5, 15))
      # 4. Sequencing artefacts per descendant: primer trimming, structural
      #    indels, PCR substitution errors, quality-driven N injection.
      .corrupt_5prime_loss(length=(0, 8))
      .corrupt_3prime_loss(length=(0, 4))
      .corrupt_indels(count=(0, 2), insertion_prob=0.5)
      .corrupt_pcr(count=(0, 3))
      .corrupt_ns(count=(0, 2))
      # 5. Stamp arbitrary metadata onto every record.
      .with_metadata(experiment_id="exp001", tissue="peripheral_blood")
      # Constraint-aware sampling: the productive() bundle is enforced at
      # rearrangement + SHM time. Corruption passes can still introduce
      # stop codons / frameshifts post-hoc, so expect ~70% productive
      # when aggressive corruption is in the chain — that mirrors real
      # wet-lab data, where a productive B-cell can sequence as a
      # non-productive read because of an indel during library prep.
      .run_records(seed=42, respect=ga.productive())
)

len(result)                                  # 1000  (= n_clones × size)
sum(1 for r in result if r["productive"])    # 697   (~70% under this corruption load)

# Same clone, different descendants — same V(D)J recombination,
# independent SHM + errors:
result[0]["clone_id"], result[1]["clone_id"]              # (0, 0)
result[0]["v_call"],   result[1]["v_call"]                # both 'IGHVF10-G38*04'
result[0]["n_mutations"], result[1]["n_mutations"]        # (13, 15) — independent SHM
result[0]["n_pcr_errors"], result[1]["n_pcr_errors"]      # (1, 1)   — independent errors

# Custom metadata propagated:
result[0]["experiment_id"], result[0]["tissue"]           # ('exp001', 'peripheral_blood')

result.to_tsv("repertoire.tsv")
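
Because every descendant carries a clone_id, per-clone summaries need nothing beyond the standard library. The sketch below groups records by clone and reports each clone's size, productive fraction, and mean mutation count. It runs on hand-written stand-in dicts (the `records` list is illustrative, not simulator output) so that it works without GenAIRR installed; the assumption is that iterating a real `result` yields dicts with the same `clone_id`, `productive`, and `n_mutations` keys shown above.

```python
from collections import defaultdict

# Stand-in records: only the keys this sketch reads. Real records carry
# the full AIRR field set; these values are made up for illustration.
records = [
    {"clone_id": 0, "productive": True,  "n_mutations": 13},
    {"clone_id": 0, "productive": False, "n_mutations": 15},
    {"clone_id": 1, "productive": True,  "n_mutations": 7},
    {"clone_id": 1, "productive": True,  "n_mutations": 9},
]

def summarize_clones(records):
    """Return {clone_id: (size, productive_fraction, mean_mutations)}."""
    by_clone = defaultdict(list)
    for rec in records:
        by_clone[rec["clone_id"]].append(rec)

    summary = {}
    for clone_id, members in by_clone.items():
        size = len(members)
        # `productive` may be None when undefined, so count explicit True only.
        n_productive = sum(1 for m in members if m["productive"] is True)
        mean_mutations = sum(m["n_mutations"] for m in members) / size
        summary[clone_id] = (size, n_productive / size, mean_mutations)
    return summary

clone_summary = summarize_clones(records)
for clone_id, (size, frac, mean_mut) in sorted(clone_summary.items()):
    print(f"clone {clone_id}: size={size} productive={frac:.2f} mean_mut={mean_mut:.1f}")
```

To use it on simulator output, pass the result object (or `result.records`) in place of the stand-in list.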

Other feature flags worth knowing:

| Step | What it does |
| --- | --- |
| .corrupt_contaminants(prob=0.02) | Replace ~2% of records with unrelated background sequences. |
| .corrupt_quality(count=(0, 5)) | Lowercase 0–5 bases per sequence to mark sequencer-low-quality positions. |
| .corrupt_reverse_complement(prob=0.5) | Flip ~50% of records to the reverse strand (with the rev_comp flag set). |
| .using(v=[...], d=[...], j=[...]) | Restrict allele sampling to a specific subset; useful for benchmarking against a known repertoire. |
| .mutate(model="uniform", count=(0, 30)) | Use a uniform-rate mutation model instead of S5F. |
| compile() then compiled.run_records(...) | Compile the plan once, reuse it across many batches; see Compile once. |

Constraint-aware sampling

GenAIRR's signature feature is constraint-aware sampling: contracts that prune the candidate distribution at sample time, not retries after the fact. The canonical bundle is productive() (in-frame junction + no stop codons + V/J anchors preserved):

import GenAIRR as ga

# Every sequence is productive by construction. No retry loops, no
# post-hoc filtering — the engine only ever picks NP lengths, NP bases,
# and mutation substitutions that satisfy the bundle.
result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .run_records(n=1000, seed=42, respect=ga.productive())
)
assert all(rec["productive"] for rec in result)

Docs: Productive sequences
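
The difference between pruning at sample time and retrying after the fact can be seen in a toy setting. The sketch below is not GenAIRR's engine: it is a minimal illustration, under the assumption that the only constraint is an in-frame junction and that the V and J contributions to the junction have a fixed length. The weights, the 40-nt figure, and all function names are hypothetical. Inadmissible NP lengths are removed from the distribution before drawing, so every draw is valid and no retry loop is needed.

```python
import random

def admissible_np_lengths(weights, fixed_len):
    """Keep only NP lengths that leave the junction length a multiple of 3."""
    return {n: w for n, w in weights.items() if (fixed_len + n) % 3 == 0}

def sample_pruned(weights, fixed_len, rng):
    """Sample-time pruning: one draw from the admissible support.

    random.choices normalises the weights itself, so restricting the
    support is all that is needed to renormalise the distribution.
    """
    support = admissible_np_lengths(weights, fixed_len)
    if not support:
        raise ValueError("no NP length keeps the junction in frame")
    lengths = list(support)
    return rng.choices(lengths, weights=[support[n] for n in lengths], k=1)[0]

# Toy NP-length distribution (made-up weights) and a fixed 40-nt V+J part.
np_weights = {0: 1.0, 1: 2.0, 2: 3.0, 3: 2.0, 4: 1.5, 5: 1.0, 6: 0.5}
FIXED_LEN = 40

rng = random.Random(42)
draws = [sample_pruned(np_weights, FIXED_LEN, rng) for _ in range(1000)]
in_frame = all((FIXED_LEN + n) % 3 == 0 for n in draws)
print(sorted(set(draws)), in_frame)
```

A rejection sampler would draw from all seven lengths and discard roughly three quarters of the attempts under these weights; pruning spends exactly one draw per sample.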

Strict vs permissive mode

By default, if a contract can't admit any candidate at a sampling step, the runtime falls back to unconstrained sampling and the run continues. Pass strict=True to surface the failure as an exception instead, which is useful for catching unsatisfiable plans early during development:

import GenAIRR as ga

try:
    ga.Experiment.on("human_igh").recombine().run_records(
        n=10, seed=42, respect=ga.productive(), strict=True
    )
except ga.StrictSamplingError as e:
    pass_name, address, reason = e.args
    # pass_name e.g. "generate_np.np1", address e.g. "np.np1.length",
    # reason in {"empty_admissible_support", "support_unavailable", ...}
    print(f"{pass_name} could not satisfy the contract at {address}: {reason}")
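
The two modes can be pictured with a small stand-alone sketch. Everything here is hypothetical and only mirrors the behaviour described above: `ToyStrictSamplingError` and `constrained_choice` are illustrative names, not part of GenAIRR, and the real runtime may differ in detail. When no candidate satisfies the contract, permissive mode falls back to an unconstrained draw, while strict mode raises an exception whose args carry the pass name, the address, and a reason token.

```python
import random

class ToyStrictSamplingError(Exception):
    """Hypothetical stand-in; args are (pass_name, address, reason)."""

def constrained_choice(candidates, admits, rng, *, strict, pass_name, address):
    """Pick a candidate that satisfies `admits`, or handle the empty case.

    Returns (value, constrained) where `constrained` is False when the
    permissive fallback to unconstrained sampling was used.
    """
    admissible = [c for c in candidates if admits(c)]
    if admissible:
        return rng.choice(admissible), True
    if strict:
        raise ToyStrictSamplingError(pass_name, address, "empty_admissible_support")
    return rng.choice(candidates), False

def in_frame(n):
    return n % 3 == 0

rng = random.Random(0)
np_lengths = [1, 2, 4]  # deliberately unsatisfiable: none is a multiple of 3

# Permissive: the run continues with an unconstrained draw.
value, constrained = constrained_choice(
    np_lengths, in_frame, rng,
    strict=False, pass_name="generate_np.np1", address="np.np1.length",
)

# Strict: the same situation surfaces as an exception.
try:
    constrained_choice(
        np_lengths, in_frame, rng,
        strict=True, pass_name="generate_np.np1", address="np.np1.length",
    )
    caught = None
except ToyStrictSamplingError as e:
    caught = e.args

print(value, constrained, caught)
```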

Reproducibility

import GenAIRR as ga

# Same seed → byte-identical records across runs and platforms.
a = ga.Experiment.on("human_igh").recombine().run_records(n=100, seed=42)
b = ga.Experiment.on("human_igh").recombine().run_records(n=100, seed=42)
assert a[0]["sequence"] == b[0]["sequence"]

# `n` runs use seeds [seed, seed+1, ..., seed+n-1] so consecutive
# batches stitch together by offsetting the starting seed.
batch_a = ga.Experiment.on("human_igh").recombine().run_records(n=100, seed=0)
batch_b = ga.Experiment.on("human_igh").recombine().run_records(n=100, seed=100)
# batch_a[50] is byte-equal to a one-off run at seed=50.

Docs: Reproducibility
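
The seed-offset rule is easy to check with a toy model. In the sketch below, `simulate_one` and `run_batch` are hypothetical stand-ins built on Python's `random` module (GenAIRR's generator lives in the Rust kernel and is not `random.Random`); the only thing carried over from the text above is the scheme itself, where record i of a batch uses seed + i.

```python
import random

def simulate_one(seed):
    """Toy stand-in for one simulated record: 12 random bases from `seed`."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(12))

def run_batch(n, seed):
    """Record i of the batch is generated from seed + i."""
    return [simulate_one(seed + i) for i in range(n)]

batch_a = run_batch(100, seed=0)     # seeds 0..99
batch_b = run_batch(100, seed=100)   # seeds 100..199
stitched = batch_a + batch_b

# Two offset batches are indistinguishable from one run of 200,
# and any single record can be reproduced from its own seed.
print(stitched == run_batch(200, seed=0))
print(batch_a[50] == simulate_one(50))
```

Under this scheme, overlapping seed ranges produce duplicate records, which is why consecutive batches should start at seed + n rather than seed + 1.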


Compile once, run many times

For a hot loop, compile() once and reuse the plan. Contracts (respect=) are baked into the compiled plan, so they only need to be passed once:

import GenAIRR as ga

compiled = (
    ga.Experiment.on("human_igk")
      .recombine()
      .compile(respect=ga.productive())
)

# Run 10 batches of 100, seeded so they don't overlap.
for batch in range(10):
    result = compiled.run_records(n=100, seed=batch * 100)
    result.to_tsv(f"batch_{batch:02d}.tsv")

What you get back

.run_records(...) returns a SimulationResult — a list-like wrapper around a batch of AIRR record dicts:

| Method / attribute | Returns | Description |
| --- | --- | --- |
| len(result) | int | Number of records in the batch. |
| result[i] | dict | The i-th AIRR record. Standard 0-based indexing + slicing. |
| for rec in result: | iterates dicts | Records in [seed, seed+1, …, seed+n-1] order. |
| result.records | list[dict] | The underlying list. Mutate-through is fine. |
| result.to_tsv(path, *, airr_strict=False) | | AIRR-format TSV. airr_strict=True converts coordinates to 1-based-inclusive per spec. |
| result.to_csv(path, *, airr_strict=False) | | Comma-separated. Same options as to_tsv. |
| result.to_fasta(path, *, prefix="seq") | | FASTA. Headers include v_call and j_call. |
| result.to_fastq(path, *, quality="illumina", **kw) | | FASTQ. Quality models: "illumina" (smoothed trapezoid) or "constant". |
| result.to_dataframe(*, airr_strict=False) | pandas.DataFrame | One row per record. Requires pandas (pip install GenAIRR[all]). |
| result.outcomes | list[Outcome] or None | The underlying Outcome objects, for advanced introspection (see below). |
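
The coordinate conversion behind airr_strict=True can be sketched in a few lines. This assumes the default coordinates are 0-based and half-open (Python slice style); the text above only states the target convention, so treat the source convention as an assumption. `to_one_based_inclusive` is a hypothetical helper, not a GenAIRR function.

```python
def to_one_based_inclusive(start, end):
    """Convert a 0-based half-open span [start, end) to 1-based inclusive.

    ASSUMPTION: the non-strict coordinates are 0-based half-open.
    The start shifts by one; the exclusive end already equals the
    inclusive 1-based end, so it is unchanged.
    """
    if not 0 <= start <= end:
        raise ValueError(f"invalid span: ({start}, {end})")
    return start + 1, end

sequence = "GAGGTGCAGCTGGTGGAG"
start0, end0 = 3, 9                      # 0-based half-open
start1, end1 = to_one_based_inclusive(start0, end0)

# Both conventions describe the same bases and the same length.
segment_half_open = sequence[start0:end0]
segment_inclusive = sequence[start1 - 1:end1]
print((start1, end1), segment_half_open == segment_inclusive)
```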

Each record dict has 50+ AIRR fields. The most commonly used:

| Field | Example value | Description |
| --- | --- | --- |
| sequence | 'gaggtgcagctggtg…' | Assembled nucleotide sequence (uppercase + lowercase corruption markers). |
| sequence_aa | 'EVQLVESGGG…' | Codon-rail translation. Stops emit *, ambiguous codons emit X. |
| locus | 'IGH' | Locus code derived from v_call / j_call. |
| v_call / d_call / j_call | 'IGHV3-23*01' | Gene calls. Comma-separated tie set when the evidence walker can't disambiguate. |
| junction / junction_aa | 'TGC…GAC' / 'CAR…D' | Junction nucleotide + AA. AA includes the V Cys (anchor) through J W/F+3. |
| productive | True / False / None | In-frame junction AND no stop codons AND anchors preserved. None when undefined (e.g. junction not present). |
| v_identity / d_identity / j_identity | 0.987 | Match rate over each segment's CIGAR M/D ops. |
| v_cigar / d_cigar / j_cigar | '17D279M' | CIGAR strings. Only M/I/D ops are emitted; no soft-clips. |
| n_mutations / n_pcr_errors / n_quality_errors / n_indels | 4 / 0 / 2 / 1 | Per-record error counts from the trace. |

The full schema (plus the *_sequence_start/end, *_alignment_start/end, *_germline_start/end coordinate fields, vj_in_frame, stop_codon, rev_comp, and others) is documented at Interpreting Results.
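
Since the CIGAR fields only use M, I, and D, they can be parsed with a short regular expression. The sketch below is an illustrative helper, not part of GenAIRR; it follows the usual SAM convention that M consumes both germline and read, D consumes germline only, and I consumes read only. How GenAIRR turns these spans into the identity values is not spelled out here, so the sketch stops at the spans.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MID])")

def parse_cigar(cigar):
    """Split an M/I/D-only CIGAR string into (length, op) pairs."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Round-trip check: reject anything the pattern did not fully consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return ops

def cigar_spans(cigar):
    """Return (germline_span, read_span) covered by the alignment."""
    ops = parse_cigar(cigar)
    germline_span = sum(n for n, op in ops if op in "MD")
    read_span = sum(n for n, op in ops if op in "MI")
    return germline_span, read_span

# The example value from the table: 17 germline bases absent from the
# read, followed by 279 aligned positions.
print(parse_cigar("17D279M"))
print(cigar_spans("17D279M"))
```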

Advanced: full pipeline state via Outcome

When you need step-by-step IR history or the raw trace of every random draw — debugging an engine bug, building a custom alignment tool, replaying a specific seed — use .run() instead of .run_records(). It returns a list of Outcome objects that carry the full pipeline state:

| Accessor | Returns | Description |
| --- | --- | --- |
| outcome.final_simulation() | Simulation | End-of-pipeline IR snapshot. |
| outcome.revision(i) | Simulation | IR after the i-th pass (full step-by-step history). |
| outcome.revision_after(name) | Simulation or None | First revision produced by the named pass. |
| outcome.pass_names() | list[str] | Names of every pass that ran, in order. |
| outcome.trace() | Trace | Addressed log of every random draw. |

Each Simulation exposes len(sim) (pool length), sim.bases() → bytes, sim.regions() → list[Region], sim.germline_position(i), sim.v_allele_id() / .d_allele_id() / .j_allele_id(). Each Region carries segment ("V"/"D"/"J"/"NP1"/"NP2"), start/end/len(), frame_phase, and amino_acids() → bytes (codon-rail translation, including stop markers and ambiguous codons).

outcome.trace() supports find(address), prefix_query(prefix), and prefix_count(prefix) — every random draw is keyed by a hierarchical address ("sample_allele.v", "np.np1.length", "np.np1.bases[3]", …). This is the same trace the engine uses internally for replay determinism.

.run_records(...) also exposes these via result.outcomes[i] — so you can have both the AIRR records and the deep introspection from a single call.
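
To make the idea of hierarchical trace addresses concrete, here is a toy model of an addressed trace. `ToyTrace` is a hypothetical class that only borrows the three method names mentioned above; the return types and matching rules of the real Trace object are assumptions, and the recorded values are made up.

```python
class ToyTrace:
    """Toy addressed log: (address, value) pairs kept in draw order."""

    def __init__(self, entries):
        self._entries = list(entries)

    def find(self, address):
        """Value of the first draw recorded at exactly `address`, else None."""
        for addr, value in self._entries:
            if addr == address:
                return value
        return None

    def prefix_query(self, prefix):
        """All (address, value) pairs whose address starts with `prefix`."""
        return [(a, v) for a, v in self._entries if a.startswith(prefix)]

    def prefix_count(self, prefix):
        return len(self.prefix_query(prefix))

trace = ToyTrace([
    ("sample_allele.v", "IGHV3-23*01"),
    ("np.np1.length", 3),
    ("np.np1.bases[0]", "G"),
    ("np.np1.bases[1]", "A"),
    ("np.np1.bases[2]", "C"),
])

np1_bases = "".join(v for _, v in trace.prefix_query("np.np1.bases"))
print(trace.find("np.np1.length"), np1_bases, trace.prefix_count("np.np1"))
```

Keying every draw by address is what makes replay possible: a draw can be looked up or overridden by name rather than by its position in the random stream.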

Docs: Simulation Pipeline · Metadata Accuracy · Interpreting Results


Supported Species & Chains

GenAIRR ships with 106 built-in configurations covering 23 species (sourced from IMGT and OGRDB).

import GenAIRR as ga
print(ga.list_configs())  # all available configs

| Species | BCR | TCR |
| --- | --- | --- |
| Human | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Mouse | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Rat | IGH, IGK, IGL | |
| Rabbit | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Dog | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Cat | IGK, IGL | TCRA, TCRB, TCRD, TCRG |
| Rhesus | IGH, IGK, IGL | TCRA, TCRB, TCRD, TCRG |

All 23 species

Alpaca, Cat, Chicken, Cow, Cynomolgus, Dog, Dromedary, Ferret, Goat, Gorilla, Horse, Human, Mouse (generic + C57BL/6J), Pig, Platypus, Rabbit, Rat, Rhesus, Salmon, Sheep, Trout, Zebrafish.

import GenAIRR as ga

ga.Experiment.on("mouse_igh").recombine().run_records(n=500)
ga.Experiment.on("rabbit_tcrb").recombine().run_records(n=500)
ga.Experiment.on("rhesus_igk").recombine().run_records(n=500)

Docs: Choosing a config · Chain types


Custom reference data

For non-builtin alleles (custom IMGT pulls, in-house references, etc.) you can build a RefDataConfig directly and pass it to Experiment.on(...):

import GenAIRR as ga

cfg = ga.RefDataConfig.vj()
cfg.add_v_allele("v_custom*01", "v_custom", b"GAAGTACAGCTGGTGCAG...", anchor=288)
cfg.add_v_allele("v_custom*02", "v_custom", b"GAAGTACAGCTAGTGCAG...", anchor=288)
cfg.add_j_allele("j_custom*01", "j_custom", b"TGGGGCCAAGGG...",       anchor=10)

result = ga.Experiment.on(cfg).recombine().run_records(n=100, seed=42)

RefDataConfig.vdj() builds a heavy-chain-shaped refdata (with a D pool); add_d_allele(...) populates it. Anchors are 0-based offsets of the V Cys / J W or F codon's first base, used to keep the junction frame-aligned during recombination.
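
A wrong anchor shifts the junction frame, so it can be worth sanity-checking offsets before registering alleles. The sketch below is a stand-alone check, not a GenAIRR API: the helper names and the toy sequences are hypothetical. It reads the codon at a 0-based anchor and confirms that it encodes Cys (TGT/TGC) for V, or Trp/Phe (TGG, TTT/TTC) for J.

```python
CYS_CODONS = {"TGT", "TGC"}
J_ANCHOR_CODONS = {"TGG", "TTT", "TTC"}  # Trp (W) or Phe (F)

def anchor_codon(seq: bytes, anchor: int) -> str:
    """Codon starting at the 0-based `anchor` offset, upper-cased."""
    return seq[anchor:anchor + 3].decode("ascii").upper()

def check_v_anchor(seq: bytes, anchor: int) -> bool:
    """True when the V anchor points at the first base of a Cys codon."""
    return anchor_codon(seq, anchor) in CYS_CODONS

def check_j_anchor(seq: bytes, anchor: int) -> bool:
    """True when the J anchor points at the first base of a W or F codon."""
    return anchor_codon(seq, anchor) in J_ANCHOR_CODONS

# Toy sequences (made up, far shorter than real alleles).
toy_v = b"GAAGTACAGTGTGCG"   # Cys codon TGT starts at offset 9
toy_j = b"ACTGGGGCCAAGGG"    # Trp codon TGG starts at offset 2

print(check_v_anchor(toy_v, 9), check_v_anchor(toy_v, 0), check_j_anchor(toy_j, 2))
```

An anchor that runs past the end of the sequence yields a codon shorter than three bases, which fails both checks rather than raising.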


Key Features

  • Rust simulation kernel — persistent IR with full revision history, addressed-trace introspection, cargo test-grade unit coverage.
  • Constraint-aware sampling — contracts prune candidate distributions at sample time so productive sequences come out of the engine by construction; no retry loops.
  • Strict-mode opt-in — surface unsatisfiable plans as StrictSamplingError instead of silently relaxing the bundle.
  • Deterministic seeds — same seed reproduces every byte of the pool and every entry of the trace, across runs and platforms.
  • Full revision history — outcome.revision(i) exposes the IR after each pass for fine-grained debugging.
  • Addressed trace — every random draw is keyed by a hierarchical string ("np.np1.bases[3]") and survives end-to-end into the returned Outcome.
  • 23 species, 106 configs — built-in IMGT + OGRDB reference pickles ship with the wheel.
  • Zero mandatory Python dependencies — one wheel, everything in the box.

Optional Extras

pip install GenAIRR[all]          # numpy, scipy, graphviz, tqdm, fastmcp
pip install GenAIRR[dataconfig]   # numpy + scipy (custom DataConfig analysis)
pip install GenAIRR[viz]          # graphviz
pip install GenAIRR[mcp]          # fastmcp (for the MCP server, see next section)

MCP server — drive GenAIRR from an LLM agent

GenAIRR ships an MCP server that exposes 14 tools an LLM agent (Claude, Cursor, etc.) can call to discover configs, simulate repertoires, validate AIRR records, and replay specific seeds — all without writing Python. Install the extra, then point your MCP client at python -m GenAIRR.mcp_server:

pip install GenAIRR[mcp]

Config snippets

Claude Code, in .mcp.json in the project root (or ~/.claude/mcp.json globally):

{
  "mcpServers": {
    "genairr": {
      "type": "stdio",
      "command": "python",
      "args": ["-m", "GenAIRR.mcp_server"]
    }
  }
}

Claude Desktop, in ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "genairr": {
      "command": "/path/to/venv/bin/python",
      "args": ["-m", "GenAIRR.mcp_server"]
    }
  }
}

Cursor, in ~/.cursor/mcp.json or per-project:

{
  "mcpServers": {
    "genairr": {
      "command": "python",
      "args": ["-m", "GenAIRR.mcp_server"]
    }
  }
}

Use the full path to the venv's python if GenAIRR isn't installed in the system interpreter — the MCP server inherits the launching process's Python environment.

What you get

After reloading the client, the agent has 14 tools available under the genairr namespace:

| Category | Tools |
| --- | --- |
| Discovery | list_configs, config_info, list_alleles, inspect_allele |
| Simulation | simulate_repertoire, simulate_preset, simulate_allele |
| Analysis | validate_records, align_to_germline, score_allele_calls, analyze_mutations, classify_regions, summarize_dataset |
| Reproducibility | replay_seed |

Every tool returns a uniform {ok, tool, elapsed_ms, result | error} envelope; failures carry a stable error-code token (config_not_found, allele_not_found, invalid_preset, invalid_parameter, malformed_record, seed_replay_mismatch) the agent can branch on.
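
Client code that consumes these envelopes can branch on the error token. The sketch below is a hedged illustration: the envelope keys and the six tokens come from the description above, but the internal shape of the error payload is not specified there, so the `"code"` and `"message"` keys are assumptions, as are the `ToolError` class and the `unwrap` helper.

```python
KNOWN_ERROR_CODES = {
    "config_not_found", "allele_not_found", "invalid_preset",
    "invalid_parameter", "malformed_record", "seed_replay_mismatch",
}

class ToolError(Exception):
    """Hypothetical client-side exception carrying the error token."""

    def __init__(self, tool, code, message=""):
        super().__init__(f"{tool}: {code} {message}".strip())
        self.tool = tool
        self.code = code

def unwrap(envelope):
    """Return the result of a successful call, or raise ToolError."""
    if envelope["ok"]:
        return envelope["result"]
    error = envelope["error"]
    # ASSUMPTION: the payload is a dict with "code" and optional "message".
    code = error.get("code")
    if code not in KNOWN_ERROR_CODES:
        code = "unknown"
    raise ToolError(envelope["tool"], code, error.get("message", ""))

# Made-up envelopes for illustration.
ok_env = {"ok": True, "tool": "list_configs", "elapsed_ms": 3,
          "result": {"configs": ["human_igh"]}}
bad_env = {"ok": False, "tool": "config_info", "elapsed_ms": 1,
           "error": {"code": "config_not_found", "message": "no such config"}}

configs = unwrap(ok_env)["configs"]
try:
    unwrap(bad_env)
    failure_code = None
except ToolError as e:
    failure_code = e.code

print(configs, failure_code)
```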

A quick smoke-test prompt to verify the install: "List the available GenAIRR configs, then simulate 100 productive human heavy-chain sequences with moderate SHM and summarise the V-gene usage." The agent should chain list_configs → simulate_repertoire(config="human_igh", n=100, productive_only=true, mutation_model="s5f", mutation_count_min=5, mutation_count_max=15) and read v_usage_top from the result.


Documentation

The full documentation site is at mutejester.github.io/GenAIRR.


Citing GenAIRR

If GenAIRR is useful in your research, please cite:

Konstantinovsky T, Peres A, Polak P, Yaari G. An unbiased comparison of immunoglobulin sequence aligners. Briefings in Bioinformatics. 2024;25(6):bbae556. doi:10.1093/bib/bbae556


Contributing

Contributions are welcome. See CONTRIBUTING.md for development setup and guidelines.

License

GPL-3.0. See LICENSE.
