
An advanced immunoglobulin sequence simulation suite for benchmarking alignment models and sequence analysis.

GenAIRR

Adaptive Immune Receptor Repertoire sequence simulator
Generate realistic BCR & TCR repertoires in a single line of Python.


📑 Table of Contents

  1. Why GenAIRR?
  2. Key Features
  3. Installation
  4. Quick Start
  5. Examples
  6. Mutation Models
  7. Roadmap
  8. Contributing
  9. Citing GenAIRR
  10. License
  11. Acknowledgements

🧐 Why GenAIRR?

Benchmarking modern aligners, exploring somatic hypermutation, or stress-testing novel ML pipelines requires large, perfectly annotated repertoires, not snippets of real data peppered with sequencing error.
GenAIRR fills that gap with a plug-and-play, fully extensible simulation engine that produces sequences together with complete ground-truth labels.


✨ Key Features

| Category | Highlights |
|---|---|
| Realistic Simulation | Context-aware S5F mutations, indels, allele-specific trimming, NP-region modelling |
| Composable Pipelines | Chain together built-in & custom AugmentationSteps into simulation pipelines |
| Multi-Chain Support | Heavy & light BCRs plus TCR-β out of the box |
| Research-ready Output | JSON / pandas export, built-in plotting stubs, deterministic seeds |
| Docs & Tutorials | Rich API docs, Jupyter notebooks, step-by-step guides |

⚡ Installation

# Python ≥ 3.9
pip install GenAIRR
# or the bleeding edge
pip install git+https://github.com/MuteJester/GenAIRR.git

🚀 Quick Start

Below is a 60-second tour. See /examples for notebooks and CLI usage.

from GenAIRR.pipeline import AugmentationPipeline
from GenAIRR.steps import SimulateSequence, FixVPositionAfterTrimmingIndexAmbiguity
from GenAIRR.mutation import S5F
from GenAIRR.data import HUMAN_IGH_OGRDB
from GenAIRR.steps.StepBase import AugmentationStep

# 1️⃣  Configure built-in germline data
AugmentationStep.set_dataconfig(HUMAN_IGH_OGRDB)

# 2️⃣  Build a minimal pipeline
pipeline = AugmentationPipeline([
    SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), True),
    FixVPositionAfterTrimmingIndexAmbiguity()
])

# 3️⃣  Simulate!
sim = pipeline.execute()
print(sim.get_dict())
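
Each call to pipeline.execute() yields one simulated sequence, so a repertoire is built by calling it repeatedly. The sketch below collects the per-sequence dictionaries and serialises them as JSON Lines. Only execute() and get_dict() come from the Quick Start above; simulate_many, to_jsonl, and the stand-in _FakePipeline (included so the snippet runs without GenAIRR installed) are hypothetical helpers, and the sketch assumes the dictionary values can be serialised to JSON.

```python
import json
from typing import Any, Callable, Dict, List


def simulate_many(execute: Callable[[], Any], n: int) -> List[Dict[str, Any]]:
    """Call `execute` n times and collect each result's get_dict() output."""
    return [execute().get_dict() for _ in range(n)]


def to_jsonl(records: List[Dict[str, Any]]) -> str:
    """Serialise records as JSON Lines: one JSON object per line."""
    # default=str is a fallback for values json cannot encode natively.
    return "\n".join(json.dumps(r, default=str) for r in records)


# Stand-ins so the sketch runs without GenAIRR. With the real library,
# pass `pipeline.execute` from the Quick Start instead.
class _FakeResult:
    def __init__(self, idx: int) -> None:
        self._idx = idx

    def get_dict(self) -> Dict[str, Any]:
        return {"sequence": "ACGT", "index": self._idx}


class _FakePipeline:
    def __init__(self) -> None:
        self._count = 0

    def execute(self) -> _FakeResult:
        self._count += 1
        return _FakeResult(self._count)


records = simulate_many(_FakePipeline().execute, 3)
jsonl_text = to_jsonl(records)
print(jsonl_text)
```

With GenAIRR installed, simulate_many(pipeline.execute, 1000) would be the equivalent call; the resulting list of dictionaries can also be handed to pandas.DataFrame if a tabular view is preferred.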

🧑‍💻 Examples

1. Full Heavy-Chain Pipeline

from GenAIRR.steps import (
    FixDPositionAfterTrimmingIndexAmbiguity, FixJPositionAfterTrimmingIndexAmbiguity,
    CorrectForVEndCut, CorrectForDTrims, CorruptSequenceBeginning,
    InsertNs, InsertIndels, ShortDValidation, DistillMutationRate
)

pipeline = AugmentationPipeline([
    SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), True),
    FixVPositionAfterTrimmingIndexAmbiguity(),
    FixDPositionAfterTrimmingIndexAmbiguity(),
    FixJPositionAfterTrimmingIndexAmbiguity(),
    CorrectForVEndCut(),
    CorrectForDTrims(),
    CorruptSequenceBeginning(0.7, [0.4, 0.4, 0.2], 576, 210, 310, 50),
    InsertNs(0.02, 0.5),
    ShortDValidation(),
    InsertIndels(0.5, 5, 0.5, 0.5),
    DistillMutationRate()
])
result = pipeline.execute()
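
Because every simulated record carries its ground truth, scoring an aligner reduces to comparing its calls against the simulated labels. The following is a minimal sketch of that comparison, not part of GenAIRR itself: call_accuracy is a hypothetical helper, the "v_call" key and its shape (a single allele name or a list of equally valid alleles) are assumptions about the get_dict() output, and the records and aligner calls are made-up illustrations.

```python
from typing import Iterable, List, Union

Truth = Union[str, List[str]]


def call_accuracy(truth: Iterable[Truth], predicted: Iterable[str]) -> float:
    """Fraction of predictions that match the ground truth.

    A ground-truth entry may be one allele name or a list of equally valid
    alleles; a prediction counts as correct if it is among them.
    """
    truth = list(truth)
    predicted = list(predicted)
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        return 0.0
    hits = 0
    for t, p in zip(truth, predicted):
        accepted = {t} if isinstance(t, str) else set(t)
        hits += p in accepted
    return hits / len(truth)


# Hypothetical records shaped like get_dict() output ("v_call" is an assumed key).
simulated = [
    {"v_call": ["IGHV1-2*02"]},
    {"v_call": ["IGHV3-23*01", "IGHV3-23D*01"]},
    {"v_call": "IGHV4-34*01"},
]
# Hypothetical aligner output, one call per simulated sequence.
aligner_calls = ["IGHV1-2*02", "IGHV3-23D*01", "IGHV4-59*01"]

accuracy = call_accuracy((r["v_call"] for r in simulated), aligner_calls)
print(f"V-call accuracy: {accuracy:.2f}")
```

Treating the ground truth as a set of acceptable alleles is a design choice: trimming can make several alleles indistinguishable, and penalising an aligner for picking any one of them would understate its accuracy.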

2. Naïve Sequence (no SHM)

from GenAIRR.mutation import Uniform
naive_step = SimulateSequence(Uniform(0, 0), True)
pipeline = AugmentationPipeline([naive_step])
naive_seq = pipeline.execute()
print(naive_seq.sequence)

3. Custom Allele Combination

custom_step = SimulateSequence(
    S5F(0.003, 0.25),
    True,
    specific_v=HUMAN_IGH_OGRDB.v_alleles['IGHV1-2*02'][0],  # specific V allele
    specific_d=HUMAN_IGH_OGRDB.d_alleles['IGHD3-10*01'][0], # specific D allele  
    specific_j=HUMAN_IGH_OGRDB.j_alleles['IGHJ4*02'][0]     # specific J allele
)
pipeline = AugmentationPipeline([custom_step])
print(pipeline.execute().get_dict())

🔬 Mutation Models

| Model | Description | When to use |
|---|---|---|
| S5F | Context-specific somatic hyper-mutation | Antibody maturation studies |
| Uniform | Evenly random mutations | Baselines / ablation |
| Your Model ➕ | Implement BaseMutationModel | Custom evolutionary scenarios |

from GenAIRR.mutation import S5F
s5f = S5F(min_mutation_rate=0.01, max_mutation_rate=0.05)
mut_seq, muts, rate = s5f.apply_mutation(naive_seq)
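
To give a feel for what a custom model involves, here is a toy, standalone sketch that mirrors the three-value return of apply_mutation shown above (mutated sequence, mutation record, mutation rate). It is an illustration under assumptions, not GenAIRR's interface: UniformToyModel is a hypothetical class that does not subclass BaseMutationModel, it operates on a plain string, and the real base class may require additional methods or a different input type.

```python
import random
from typing import Dict, Optional, Tuple


class UniformToyModel:
    """Toy stand-in for a custom mutation model (not a BaseMutationModel subclass)."""

    def __init__(self, min_mutation_rate: float, max_mutation_rate: float,
                 seed: Optional[int] = None) -> None:
        if not 0 <= min_mutation_rate <= max_mutation_rate <= 1:
            raise ValueError("need 0 <= min_mutation_rate <= max_mutation_rate <= 1")
        self.min_mutation_rate = min_mutation_rate
        self.max_mutation_rate = max_mutation_rate
        self._rng = random.Random(seed)  # private RNG keeps runs reproducible

    def apply_mutation(self, sequence: str) -> Tuple[str, Dict[int, str], float]:
        """Return (mutated sequence, {position: "X>Y"}, realised mutation rate)."""
        # Draw a target rate, then mutate that share of positions uniformly.
        target_rate = self._rng.uniform(self.min_mutation_rate, self.max_mutation_rate)
        n_mutations = round(target_rate * len(sequence))
        positions = self._rng.sample(range(len(sequence)), n_mutations)
        bases = list(sequence)
        mutations: Dict[int, str] = {}
        for pos in sorted(positions):
            original = bases[pos]
            # Always substitute a different base so every chosen site changes.
            replacement = self._rng.choice([b for b in "ACGT" if b != original])
            bases[pos] = replacement
            mutations[pos] = f"{original}>{replacement}"
        realised = len(mutations) / len(sequence) if sequence else 0.0
        return "".join(bases), mutations, realised


model = UniformToyModel(0.05, 0.10, seed=42)
germline = "ACGT" * 25  # 100-nt placeholder, not a real germline sequence
mutated, muts, rate = model.apply_mutation(germline)
print(rate, muts)
```

Passing a seed fixes the private random generator, so two models built with the same seed produce identical mutations; that is one simple way to get reproducible output.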

🗺️ Roadmap

  • 🚧 More Complex Mutation Model (With Selection)
  • 🚧 More Built-in Data Configs (e.g., TCR, custom germlines)
  • 🚧 More Built-in Steps (e.g., more mutation models, more data augmentation)
  • 🚧 Deeper Docs (e.g., more examples, more tutorials)

See the open issues. Feel something’s missing? Open a feature request.


🤝 Contributing

Contributions are welcome! 💙 Please read our contributing guide and check the good first issue label.


✏️ Citing GenAIRR

If GenAIRR helps your research, please cite:

Konstantinovsky T, Peres A, Polak P, Yaari G.  
An unbiased comparison of immunoglobulin sequence aligners.
Briefings in Bioinformatics. 2024 Sep 23; 25(6): bbae556.  
https://doi.org/10.1093/bib/bbae556  
PMID: 39489605 | PMCID: PMC11531861

📜 License

Distributed under the GPLv3 license. See LICENSE for details.


🙏 Acknowledgements

GenAIRR is inspired by and builds upon amazing work from the immunoinformatics community—especially AIRRship.
