An advanced immunoglobulin sequence simulation suite for benchmarking alignment models and sequence analysis.
GenAIRR
Adaptive Immune Receptor Repertoire sequence simulator
Generate realistic BCR & TCR repertoires in a single line of Python.
📑 Table of Contents
- Why GenAIRR?
- Key Features
- Installation
- Quick Start
- Examples
- Mutation Models
- Roadmap
- Contributing
- Citing GenAIRR
- License
- Acknowledgements
🧐 Why GenAIRR?
Benchmarking modern aligners, exploring somatic hypermutation, or stress-testing novel ML pipelines requires large, perfectly annotated repertoires, not snippets of real data peppered with sequencing error.
GenAIRR fills that gap with a plug-and-play, fully-extensible simulation engine that produces sequences while giving you full ground-truth labels.
✨ Key Features
| Category | Highlights |
|---|---|
| Realistic Simulation | Context-aware S5F mutations, indels, allele-specific trimming, NP-region modelling |
| Composable Pipelines | Chain together built-in & custom AugmentationSteps into simulation pipelines |
| Multi-Chain Support | Heavy & light BCRs plus TCR-β out of the box |
| Research-ready Output | JSON / pandas export, built-in plotting stubs, deterministic seeds |
| Docs & Tutorials | Rich API docs, Jupyter notebooks, step-by-step guides |
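The "Composable Pipelines" row can be pictured with a small, library-free sketch. Every name here (`ToyRecord`, `ToyPipeline`, `uppercase`, `mask_prefix`) is hypothetical. It only illustrates the general pattern of chaining steps that each transform a shared record, and it is not GenAIRR's `AugmentationStep` interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyRecord:
    """Hypothetical container: a sequence plus a log of applied steps."""
    sequence: str
    history: List[str] = field(default_factory=list)


# A step is any callable that takes a record and returns the (modified) record.
Step = Callable[[ToyRecord], ToyRecord]


def uppercase(record: ToyRecord) -> ToyRecord:
    """Upper-case the sequence and note the step in the history."""
    record.sequence = record.sequence.upper()
    record.history.append("uppercase")
    return record


def mask_prefix(n: int) -> Step:
    """Build a step that replaces the first n bases with 'N'."""
    def step(record: ToyRecord) -> ToyRecord:
        k = min(n, len(record.sequence))
        record.sequence = "N" * k + record.sequence[k:]
        record.history.append(f"mask_prefix({n})")
        return record
    return step


class ToyPipeline:
    """Apply steps in order, passing the record from one to the next."""

    def __init__(self, steps: List[Step]):
        self.steps = steps

    def execute(self, record: ToyRecord) -> ToyRecord:
        for step in self.steps:
            record = step(record)
        return record


result = ToyPipeline([uppercase, mask_prefix(3)]).execute(ToyRecord("acgtacgt"))
```

Because each step records itself in `history`, the final record carries both the altered sequence and the ground truth of what was done to it, which is the property a benchmarking simulator needs.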
⚡ Installation
# Python ≥ 3.9
pip install GenAIRR
# or the bleeding edge
pip install git+https://github.com/your-org/GenAIRR.git
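After installing, one way to confirm which release ended up in the environment is to query the package metadata from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative, and the distribution name `GenAIRR` is taken from the `pip install` line above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if GenAIRR is installed, otherwise None.
print(installed_version("GenAIRR"))
```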
🚀 Quick Start
Below is a 60-second tour. See /examples for notebooks and CLI usage.
from GenAIRR.pipeline import AugmentationPipeline
from GenAIRR.parameters import ChainType, CHAIN_TYPE_INFO
from GenAIRR.steps import SimulateSequence, FixVPositionAfterTrimmingIndexAmbiguity
from GenAIRR.mutation import S5F
from GenAIRR.data import builtin_heavy_chain_data_config
from GenAIRR.steps.StepBase import AugmentationStep

# 1️⃣ Configure built-in germline & chain type
data_cfg = builtin_heavy_chain_data_config()
AugmentationStep.set_dataconfig(config=data_cfg, chain_type=ChainType.BCR_HEAVY)

# 2️⃣ Build a minimal pipeline
pipeline = AugmentationPipeline([
    SimulateSequence(mutation_model=S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
    FixVPositionAfterTrimmingIndexAmbiguity()
])

# 3️⃣ Simulate!
sim = pipeline.execute()
print(sim.get_dict())
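The quick start produces one record per `execute()` call, so building a repertoire means calling it repeatedly. The sketch below shows one way to collect many records and serialise them as JSON Lines. The helper names `collect_records` and `to_json_lines` are illustrative and not part of GenAIRR. The simulator is passed in as a plain callable so the helpers run without the library installed.

```python
import json
from typing import Callable, Dict, Iterable, List


def collect_records(simulate_one: Callable[[], Dict], n: int) -> List[Dict]:
    """Call simulate_one n times and gather the resulting dictionaries."""
    return [simulate_one() for _ in range(n)]


def to_json_lines(records: Iterable[Dict]) -> str:
    """Serialise records as JSON Lines, one JSON object per line.

    default=str stringifies any value the json module cannot encode natively.
    """
    return "\n".join(json.dumps(record, default=str) for record in records)
```

With the pipeline defined above, a call such as `collect_records(lambda: pipeline.execute().get_dict(), 1000)` would gather 1000 annotated records, assuming each dictionary's values are JSON-serialisable or have a meaningful string form.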
🧑💻 Examples
1. Full Heavy-Chain Pipeline
from GenAIRR.steps import (
    FixDPositionAfterTrimmingIndexAmbiguity, FixJPositionAfterTrimmingIndexAmbiguity,
    CorrectForVEndCut, CorrectForDTrims, CorruptSequenceBeginning,
    InsertNs, InsertIndels, ShortDValidation, DistillMutationRate
)

pipeline = AugmentationPipeline([
    SimulateSequence(mutation_model=S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
    FixVPositionAfterTrimmingIndexAmbiguity(),
    FixDPositionAfterTrimmingIndexAmbiguity(),
    FixJPositionAfterTrimmingIndexAmbiguity(),
    CorrectForVEndCut(),
    CorrectForDTrims(),
    CorruptSequenceBeginning(
        corruption_probability=0.7,
        corrupt_events_proba=[0.4, 0.4, 0.2],
        max_sequence_length=576,
        nucleotide_add_coefficient=210,
        nucleotide_remove_coefficient=310,
        nucleotide_add_after_remove_coefficient=50,
        random_sequence_add_proba=1
    ),
    InsertNs(n_ratio=0.02, proba=0.5),
    ShortDValidation(short_d_length=5),
    InsertIndels(indel_probability=0.5, max_indels=5, insertion_proba=0.5, deletion_proba=0.5),
    DistillMutationRate()
])
result = pipeline.execute()
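The README does not explain the parameters of steps such as `InsertNs(n_ratio=0.02, proba=0.5)`. One plausible reading, stated here as an assumption and not as GenAIRR's documented behaviour, is that with probability `proba` a fraction `n_ratio` of positions is overwritten with `N`. The toy function `insert_ns` below is hypothetical and only illustrates that reading.

```python
import random


def insert_ns(sequence: str, n_ratio: float, proba: float, rng: random.Random) -> str:
    """With probability `proba`, overwrite round(n_ratio * len) random positions with 'N'.

    Assumed semantics for illustration only; not taken from GenAIRR's source.
    """
    if rng.random() >= proba:
        return sequence  # this sequence is left untouched
    n_positions = round(n_ratio * len(sequence))
    bases = list(sequence)
    # sample() picks distinct positions, so exactly n_positions bases are masked
    for i in rng.sample(range(len(bases)), n_positions):
        bases[i] = "N"
    return "".join(bases)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
seq = "ACGT" * 25       # 100 bases
masked = insert_ns(seq, n_ratio=0.02, proba=1.0, rng=rng)
```

Under this reading, the sequence length never changes, which separates N-masking from the length-altering effect of `InsertIndels`.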
2. Naïve Sequence (no SHM)
from GenAIRR.mutation import Uniform
naive_step = SimulateSequence(mutation_model=Uniform(0, 0), productive=True)
pipeline = AugmentationPipeline([naive_step])
naive_seq = pipeline.execute()
print(naive_seq.sequence)
3. Custom Allele Combination
custom_step = SimulateSequence(
    mutation_model=S5F(0.003, 0.25),
    productive=True,
    specific_v=data_cfg.allele_list('v')[0],  # specific V allele (as Allele object)
    specific_d=data_cfg.allele_list('d')[0],  # specific D allele (as Allele object)
    specific_j=data_cfg.allele_list('j')[0]   # specific J allele (as Allele object)
)
pipeline = AugmentationPipeline([custom_step])
print(pipeline.execute().get_dict())
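The example above picks alleles by list index. In practice it is often more convenient to pick them by name. The helper below is a sketch that assumes each allele object exposes a `name` attribute. That attribute is an assumption not confirmed by this README, so inspect the objects returned by `allele_list` in your installed version. `find_allele` is an illustrative name, and the demo uses stand-in objects so it runs without GenAIRR.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def find_allele(alleles: Iterable[Any], name: str) -> Any:
    """Return the first allele whose `name` attribute equals `name`."""
    for allele in alleles:
        if getattr(allele, "name", None) == name:
            return allele
    raise KeyError(f"allele {name!r} not found")


# Stand-in objects so the sketch runs without GenAIRR installed.
demo = [SimpleNamespace(name="IGHV1-2*02"), SimpleNamespace(name="IGHV3-23*01")]
chosen = find_allele(demo, "IGHV3-23*01")
```

If the assumption holds, `find_allele(data_cfg.allele_list('v'), "IGHV3-23*01")` could replace the `[0]` index in `specific_v`.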
🔬 Mutation Models
| Model | Description | When to use |
|---|---|---|
| S5F | Context-specific somatic hypermutation | Antibody maturation studies |
| Uniform | Evenly random mutations | Baselines / ablation |
| Your Model ➕ | Implement BaseMutationModel | Custom evolutionary scenarios |
from GenAIRR.mutation import S5F
s5f = S5F(min_mutation_rate=0.01, max_mutation_rate=0.05)
mut_seq, muts, rate = s5f.apply_mutation(naive_seq)
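To make the "Your Model" row more concrete, here is a standalone toy model whose `apply_mutation` returns the same three-part shape as the call above (mutated sequence, mutations, rate). It deliberately does not subclass `BaseMutationModel`, because the methods that class requires are not shown in this README. `ToyUniformModel` and its mutation-record format are hypothetical.

```python
import random
from typing import Dict, Tuple


class ToyUniformModel:
    """Hypothetical stand-in: mutate a uniformly drawn fraction of positions."""

    def __init__(self, min_rate: float, max_rate: float, seed: int = 0):
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.rng = random.Random(seed)  # private RNG keeps runs reproducible

    def apply_mutation(self, sequence: str) -> Tuple[str, Dict[int, str], float]:
        # Draw a per-sequence rate, then mutate that fraction of positions.
        rate = self.rng.uniform(self.min_rate, self.max_rate)
        n_mut = round(rate * len(sequence))
        bases = list(sequence)
        mutations: Dict[int, str] = {}
        for pos in self.rng.sample(range(len(bases)), n_mut):
            old = bases[pos]
            # Always substitute a different base so every event is a real change.
            new = self.rng.choice([b for b in "ACGT" if b != old])
            bases[pos] = new
            mutations[pos] = f"{old}>{new}"
        return "".join(bases), mutations, rate


model = ToyUniformModel(0.05, 0.05)
mutated, mutations, rate = model.apply_mutation("ACGT" * 25)
```

A context-aware model such as S5F differs in that the chance of mutating a position depends on its neighbouring bases, so it would replace the uniform `sample` call with context-weighted sampling.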
🗺️ Roadmap
- 🚧 More Complex Mutation Model (With Selection)
- 🚧 More Built-in Data Configs (e.g., TCR, custom germlines)
- 🚧 More Built-in Steps (e.g., more mutation models, more data augmentation)
- 🚧 Deeper Docs (e.g., more examples, more tutorials)
See the open issues. Feel something’s missing? Open a feature request.
🤝 Contributing
Contributions are welcome! 💙 Please read our contributing guide and check the good first issue label.
✏️ Citing GenAIRR
If GenAIRR helps your research, please cite:
Konstantinovsky T, Peres A, Polak P, Yaari G.
An unbiased comparison of immunoglobulin sequence aligners.
Briefings in Bioinformatics. 2024 Sep 23; 25(6): bbae556.
https://doi.org/10.1093/bib/bbae556
PMID: 39489605 | PMCID: PMC11531861
📜 License
Distributed under the MIT License. See LICENSE for details.
🙏 Acknowledgements
GenAIRR is inspired by and builds upon amazing work from the immunoinformatics community—especially AIRRship.