Skip to main content

Synthetic immune-receptor-sequence simulator for benchmarking alignment models and sequence analysis.

Project description

GenAIRR

Synthetic Adaptive Immune Receptor Repertoire Generator

PyPI Tests Python License

High-performance BCR and TCR sequence simulation with full ground-truth annotations.
Rust kernel · 23 species · constraint-aware sampling · cross-platform wheels

📖 Documentation


Installation

pip install GenAIRR

GenAIRR ships as a single wheel that bundles both the Python API and the Rust simulation kernel — no extra packages, no compiler needed. Pre-built wheels are published for Linux (x86_64, aarch64), macOS (Intel + Apple Silicon), and Windows (x64), supporting Python 3.9+.

Building from source needs a stable Rust toolchain (rustup install stable) — see CONTRIBUTING.md.


Quick Start

import GenAIRR as ga

# Generate 1,000 human heavy-chain sequences via standard V(D)J recombination.
outcomes = ga.Experiment.on("human_igh").recombine().run(n=1000, seed=42)

# Each item is an Outcome wrapping the full pipeline state.
o = outcomes[0]
sim = o.final_simulation()

sim.bases()          # b'gaggtgcagctggtg...' — the assembled sequence
sim.regions()        # [<Region V [0..296) frame_phase=0>,
                     #  <Region NP1 [296..309) frame_phase=2>,
                     #  <Region D [309..336) frame_phase=0>,
                     #  <Region NP2 [336..342) frame_phase=0>,
                     #  <Region J [342..388) frame_phase=0>]
sim.v_allele_id()    # 146  (index into the V pool of the active refdata)
sim.d_allele_id()    # 5
sim.j_allele_id()    # 1

# The trace records every random draw the engine made.
o.trace().find("np.np1.length").value     # 13
o.pass_names()                            # ['sample_allele.v', 'sample_allele.d',
                                          #  'sample_allele.j', 'trim.v_3', ...]

Experiment.on(...) accepts a config-name string (e.g. "human_igh", "mouse_tcrb"), a DataConfig loaded from the bundled species pickles, or a RefDataConfig for custom reference data.

See the full walkthrough in the docs: Quick Start · Interpreting Results


Constraint-aware sampling

GenAIRR's signature feature is constraint-aware sampling: contracts that prune the candidate distribution at sample time, not retries after the fact. The canonical bundle is productive() (in-frame junction + no stop codons + V/J anchors preserved):

import GenAIRR as ga

# Every sequence is productive by construction. No retry loops, no
# post-hoc filtering — the engine only ever picks NP lengths, NP bases,
# and mutation substitutions that satisfy the bundle.
outcomes = (
    ga.Experiment.on("human_igh")
      .recombine()
      .run(n=1000, seed=42, respect=ga.productive())
)

Docs: Productive sequences

Strict vs permissive mode

By default, if a contract can't admit any candidate at a sampling step the runtime falls back to unconstrained sampling and the run continues. Pass strict=True to surface the failure as an exception instead — useful for catching unsatisfiable plans early during development:

import GenAIRR as ga

try:
    ga.Experiment.on("human_igh").recombine().run(
        n=10, seed=42, respect=ga.productive(), strict=True
    )
except ga.StrictSamplingError as e:
    pass_name, address, reason = e.args
    # pass_name e.g. "generate_np.np1", address e.g. "np.np1.length",
    # reason in {"empty_admissible_support", "support_unavailable", ...}
    print(f"{pass_name} could not satisfy the contract at {address}: {reason}")

Reproducibility

import GenAIRR as ga

# Same seed → byte-identical outcomes across runs and platforms.
a = ga.Experiment.on("human_igh").recombine().run(n=100, seed=42)
b = ga.Experiment.on("human_igh").recombine().run(n=100, seed=42)
assert a[0].final_simulation().bases() == b[0].final_simulation().bases()

# `n` runs use seeds [seed, seed+1, ..., seed+n-1] so consecutive
# batches stitch together by offsetting the starting seed.
batch_a = ga.Experiment.on("human_igh").recombine().run(n=100, seed=0)
batch_b = ga.Experiment.on("human_igh").recombine().run(n=100, seed=100)
# batch_a[50] is byte-equal to a one-off run at seed=50.

Docs: Reproducibility


Compile once, run many times

For a hot loop, compile() once and reuse the plan. Contracts (respect=) are baked into the compiled plan, so they only need to be passed once:

import GenAIRR as ga

compiled = (
    ga.Experiment.on("human_igk")
      .recombine()
      .compile(respect=ga.productive())
)

# Run 10 batches of 100, seeded so they don't overlap.
for batch in range(10):
    outcomes = compiled.run(n=100, seed=batch * 100)

What you get back

Every outcome in the returned list is an Outcome with:

Accessor Returns Description
outcome.final_simulation() Simulation The end-of-pipeline IR snapshot.
outcome.revision(i) Simulation The IR after the i-th pass — full step-by-step history.
outcome.revision_after(name) Simulation | None First revision produced by the named pass.
outcome.pass_names() list[str] Names of every pass that ran, in order.
outcome.trace() Trace Addressed log of every random draw.

Each Simulation exposes len(sim) (pool length), sim.bases() → bytes, sim.regions() → list[Region], sim.germline_position(i), sim.v_allele_id() / .d_allele_id() / .j_allele_id(). Each Region carries segment ("V"/"D"/"J"/"NP1"/"NP2"), start/end/len(), frame_phase, and amino_acids() → bytes (codon-rail translation, including stop markers and ambiguous codons).

outcome.trace() supports find(address), prefix_query(prefix), and prefix_count(prefix) — every random draw the engine made is keyed by a hierarchical address ("sample_allele.v", "np.np1.length", "np.np1.bases[3]", ...). This is the same trace the engine uses internally for replay determinism, so your downstream tooling sees exactly what the kernel saw.

Docs: Simulation Pipeline · Metadata Accuracy · Interpreting Results


Supported Species & Chains

GenAIRR ships with 106 built-in configurations covering 23 species (sourced from IMGT and OGRDB).

import GenAIRR as ga
print(ga.list_configs())  # all available configs
Species BCR TCR
Human IGH, IGK, IGL TCRA, TCRB, TCRD, TCRG
Mouse IGH, IGK, IGL TCRA, TCRB, TCRD, TCRG
Rat IGH, IGK, IGL
Rabbit IGH, IGK, IGL TCRA, TCRB, TCRD, TCRG
Dog IGH, IGK, IGL TCRA, TCRB, TCRD, TCRG
Cat IGK, IGL TCRA, TCRB, TCRD, TCRG
Rhesus IGH, IGK, IGL TCRA, TCRB, TCRD, TCRG
All 23 species

Alpaca, Cat, Chicken, Cow, Cynomolgus, Dog, Dromedary, Ferret, Goat, Gorilla, Horse, Human, Mouse (generic + C57BL/6J), Pig, Platypus, Rabbit, Rat, Rhesus, Salmon, Sheep, Trout, Zebrafish.

import GenAIRR as ga

ga.Experiment.on("mouse_igh").recombine().run(n=500)
ga.Experiment.on("rabbit_tcrb").recombine().run(n=500)
ga.Experiment.on("rhesus_igk").recombine().run(n=500)

Docs: Choosing a config · Chain types


Custom reference data

For non-builtin alleles (custom IMGT pulls, in-house references, etc.) you can build a RefDataConfig directly and pass it to Experiment.on(...):

import GenAIRR as ga

cfg = ga.RefDataConfig.vj()
cfg.add_v_allele("v_custom*01", "v_custom", b"GAAGTACAGCTGGTGCAG...", anchor=288)
cfg.add_v_allele("v_custom*02", "v_custom", b"GAAGTACAGCTAGTGCAG...", anchor=288)
cfg.add_j_allele("j_custom*01", "j_custom", b"TGGGGCCAAGGG...",       anchor=10)

outcomes = ga.Experiment.on(cfg).recombine().run(n=100, seed=42)

RefDataConfig.vdj() builds a heavy-chain-shaped refdata (with a D pool); add_d_allele(...) populates it. Anchors are 0-based offsets of the V Cys / J W or F codon's first base, used to keep the junction frame-aligned during recombination.


Key Features

  • Rust simulation kernel — persistent IR with full revision history, addressed-trace introspection, cargo test-grade unit coverage.
  • Constraint-aware sampling — contracts prune candidate distributions at sample time so productive sequences come out of the engine by construction; no retry loops.
  • Strict-mode opt-in — surface unsatisfiable plans as StrictSamplingError instead of silently relaxing the bundle.
  • Deterministic seeds — same seed reproduces every byte of the pool and every entry of the trace, across runs and platforms.
  • Full revision historyoutcome.revision(i) exposes the IR after each pass for fine-grained debugging.
  • Addressed trace — every random draw is keyed by a hierarchical string ("np.np1.bases[3]") and survives end-to-end into the returned Outcome.
  • 23 species, 106 configs — built-in IMGT + OGRDB reference pickles ship with the wheel.
  • Zero mandatory Python dependencies — one wheel, everything in the box.

Optional Extras

pip install GenAIRR[all]          # numpy, scipy, graphviz, tqdm
pip install GenAIRR[dataconfig]   # numpy + scipy (custom DataConfig analysis)
pip install GenAIRR[viz]          # graphviz

Documentation

The full documentation site is at mutejester.github.io/GenAIRR. Useful starting points:


Citing GenAIRR

If GenAIRR is useful in your research, please cite:

Konstantinovsky T, Peres A, Polak P, Yaari G. An unbiased comparison of immunoglobulin sequence aligners. Briefings in Bioinformatics. 2024;25(6):bbae556. doi:10.1093/bib/bbae556


Contributing

Contributions are welcome. See CONTRIBUTING.md for development setup and guidelines.

License

GPL-3.0. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genairr-2.0.0.tar.gz (3.4 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

genairr-2.0.0-cp39-abi3-win_amd64.whl (3.6 MB view details)

Uploaded CPython 3.9+Windows x86-64

genairr-2.0.0-cp39-abi3-manylinux_2_28_x86_64.whl (3.7 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.28+ x86-64

genairr-2.0.0-cp39-abi3-manylinux_2_28_aarch64.whl (3.7 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.28+ ARM64

genairr-2.0.0-cp39-abi3-macosx_11_0_arm64.whl (3.7 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

genairr-2.0.0-cp39-abi3-macosx_10_12_x86_64.whl (3.7 MB view details)

Uploaded CPython 3.9+macOS 10.12+ x86-64

File details

Details for the file genairr-2.0.0.tar.gz.

File metadata

  • Download URL: genairr-2.0.0.tar.gz
  • Upload date:
  • Size: 3.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for genairr-2.0.0.tar.gz
Algorithm Hash digest
SHA256 609d5bae6c51bd0bcf436e97a2648d35ab99fb920f1fbfc16b2384dd32256dbd
MD5 57d1d4fa3417a792e6fc6c7aec418305
BLAKE2b-256 08eb6ae72962f790ab676e12457df87de5834973a93f474e64d4e111118700f0

See more details on using hashes here.

File details

Details for the file genairr-2.0.0-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: genairr-2.0.0-cp39-abi3-win_amd64.whl
  • Upload date:
  • Size: 3.6 MB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for genairr-2.0.0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 f25cb9cd32eb910c45fd9494fc7fa8c22e601beb8a964a5d09260b1a7808c40e
MD5 7c4ac19f10bcb762e71ccc86e2331ed9
BLAKE2b-256 2d8ed2d9838eeec48b1384024d0ec3b0a68c72eee743af6724ed6dbec29adde5

See more details on using hashes here.

File details

Details for the file genairr-2.0.0-cp39-abi3-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for genairr-2.0.0-cp39-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 c0db687b4b6cf347675d8c6f70244ac9d5028a718fd35219d5390313a8e429a5
MD5 5e953df8235453c652304a48a52b07d6
BLAKE2b-256 8bea655e36ddd531916fb399c4644e6e97a0213009fd353c75294d1cb2305fc1

See more details on using hashes here.

File details

Details for the file genairr-2.0.0-cp39-abi3-manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for genairr-2.0.0-cp39-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 540cdc96e40d31a9934aad39209342121e4a3ecb9ef303fa1740cc1e2637daf7
MD5 9f50ca9784604d2fd75a129820882875
BLAKE2b-256 21c368e4277c70b0b7b7aef163fa8be9990d595b1d05b9811890443d5bc291ca

See more details on using hashes here.

File details

Details for the file genairr-2.0.0-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for genairr-2.0.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6c9f124d386b283d829022725c567045ee62974fd665599824cee231ca6418da
MD5 bc15ae34db01b90ab30f7c3c5854eb11
BLAKE2b-256 46ef1b29c92fdac475ac8242ea294ee85a6a722abaffdf989696ffd7a6f4a8ab

See more details on using hashes here.

File details

Details for the file genairr-2.0.0-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for genairr-2.0.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 088c7a10bf70c6000abf731422b91e52f765989d023fabb67cdd1ca165ccce17
MD5 ca902e095037c902688a4b46ffe0b503
BLAKE2b-256 3b698f647d62319d1ed12457cea21fbba582b2567a7253470c07e2dc9bcdfdc5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page