An advanced immunoglobulin sequence simulation suite for benchmarking alignment models and sequence analysis.
GenAIRR
Adaptive Immune Receptor Repertoire Sequence Simulator
Generate realistic BCR & TCR repertoires with full ground-truth annotations in Python.
Why GenAIRR?
Benchmarking sequence aligners, studying somatic hypermutation, or training ML models on immune repertoires requires large, perfectly-annotated datasets — not noisy snippets of real sequencing data.
GenAIRR is a plug-and-play, fully-extensible simulation engine that produces realistic immunoglobulin and TCR sequences while giving you complete ground-truth labels for every position, mutation, and gene segment.
Key Features
| Category | Highlights |
|---|---|
| Realistic Simulation | Context-aware S5F mutations, indels, allele-specific trimming, NP-region modelling |
| Composable Pipelines | Chain together built-in & custom steps into simulation pipelines |
| Multi-Chain Support | Heavy chain, kappa/lambda light chains, and TCR-beta out of the box |
| Research-ready Output | Full ground-truth annotations, JSON/pandas export, deterministic seeds |
| Docs & Tutorials | Step-by-step guides, Jupyter notebooks, API reference |
Installation
# Python >= 3.9
pip install GenAIRR
Quick Start
One-liner
from GenAIRR import simulate, HUMAN_IGH_OGRDB, S5F
result = simulate(HUMAN_IGH_OGRDB, S5F(0.003, 0.25))
print(result.sequence)
CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGGGACCCTGTCCCTCACCTGCGCTG...
Generate multiple sequences at once:
results = simulate(HUMAN_IGH_OGRDB, S5F(0.003, 0.25), n=100)
Pipeline (Full Control)
For complete control over the simulation, use the Pipeline API:
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        steps.DistillMutationRate(),
    ]
)
sim = pipeline.execute()
print(sim.get_dict())
{
    'sequence': 'CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCG...',
    'v_call': ['IGHVF3-G8*04'],
    'd_call': ['IGHD6-6*01'],
    'j_call': ['IGHJ4*02'],
    'productive': True,
    'mutation_rate': 0.0027,
    'mutations': {142: 'T>C'},
    'v_sequence_start': 0,
    'v_sequence_end': 293,
    'd_sequence_start': 298,
    'd_sequence_end': 316,
    'j_sequence_start': 323,
    'j_sequence_end': 367,
    # ... and more fields
}
Every output includes the full sequence, V/D/J gene calls, mutation positions, region boundaries, and quality metrics — ready for downstream analysis.
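As a rough illustration of how these fields might be used downstream, the sketch below slices a record into its V, NP1, D, NP2, and J segments and recomputes the fraction of mutated positions. It assumes the `*_sequence_start` / `*_sequence_end` values are 0-based and end-exclusive (Python slice style); that is an assumption, so check the API reference before relying on it. The record is a hand-written stand-in with a placeholder sequence, and `split_regions` is a hypothetical helper, not part of GenAIRR.

```python
# Hand-written stand-in for the dictionary returned by get_dict(). Field names
# follow the example output above; the sequence is a placeholder of Ns.
record = {
    "sequence": "N" * 367,
    "mutations": {142: "T>C"},
    "v_sequence_start": 0,
    "v_sequence_end": 293,
    "d_sequence_start": 298,
    "d_sequence_end": 316,
    "j_sequence_start": 323,
    "j_sequence_end": 367,
}


def split_regions(rec):
    """Slice the sequence into V, NP1, D, NP2 and J segments.

    Assumes 0-based, end-exclusive coordinates (an assumption, not something
    the documentation above confirms).
    """
    seq = rec["sequence"]
    return {
        "v": seq[rec["v_sequence_start"]:rec["v_sequence_end"]],
        "np1": seq[rec["v_sequence_end"]:rec["d_sequence_start"]],
        "d": seq[rec["d_sequence_start"]:rec["d_sequence_end"]],
        "np2": seq[rec["d_sequence_end"]:rec["j_sequence_start"]],
        "j": seq[rec["j_sequence_start"]:rec["j_sequence_end"]],
    }


regions = split_regions(record)
region_lengths = {name: len(segment) for name, segment in regions.items()}
print(region_lengths)

# Fraction of positions that carry a recorded mutation.
observed_rate = len(record["mutations"]) / len(record["sequence"])
print(round(observed_rate, 4))
```

Under the half-open assumption the five segments tile the whole 367-base placeholder sequence, and one mutation over 367 positions rounds to the 0.0027 shown in the example output.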
Examples
Full Heavy-Chain Pipeline
A production-ready pipeline that simulates sequences with biological corrections and sequencing artifacts:
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        # Core: generate sequence with somatic hypermutation
        steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
        # Correct ground-truth positions after trimming ambiguities
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        # Calculate final mutation rate
        steps.DistillMutationRate(),
        # Simulate sequencing artifacts
        steps.CorruptSequenceBeginning(),  # 5' end degradation
        steps.EnforceSequenceLength(),     # read-length limit
        steps.InsertNs(),                  # ambiguous base calls
        steps.ShortDValidation(),          # D-region QC
        steps.InsertIndels(),              # sequencing indels
    ]
)
result = pipeline.execute()
Naive Sequence (No Mutations)
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, Uniform
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[steps.SimulateSequence(Uniform(0, 0), productive=True)]
)
naive_seq = pipeline.execute()
Light Chain
from GenAIRR import Pipeline, steps, HUMAN_IGK_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGK_OGRDB,  # kappa light chain (no D segment)
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.DistillMutationRate(),
    ]
)
Custom Allele Combination
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(
            S5F(0.003, 0.25),
            productive=True,
            specific_v=HUMAN_IGH_OGRDB.v_alleles['IGHVF1-G1'][0],
            specific_d=HUMAN_IGH_OGRDB.d_alleles['IGHD1-1'][0],
            specific_j=HUMAN_IGH_OGRDB.j_alleles['IGHJ1'][0]
        )
    ]
)
Batch Generation
import pandas as pd
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        steps.DistillMutationRate(),
    ]
)
# Generate 1000 sequences as a DataFrame
df = pd.DataFrame([pipeline.execute().get_dict() for _ in range(1000)])
df.to_csv('simulated_repertoire.csv', index=False)
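In the example output shown earlier, `v_call`, `d_call`, and `j_call` are lists and `mutations` is a dictionary, so writing records straight to CSV stores their Python string representations. One possible way to flatten them first is sketched below using only the standard library. `flatten_record` is a hypothetical helper, the two records are hand-written stand-ins with placeholder allele names, and the comma and semicolon encodings are arbitrary choices.

```python
import csv
import io


def flatten_record(rec):
    """Return a copy of rec with list and dict fields turned into strings."""
    flat = dict(rec)
    for key in ("v_call", "d_call", "j_call"):
        if key in flat:
            flat[key] = ",".join(flat[key])
    # Encode mutations as "position:change" pairs, ordered by position.
    flat["mutations"] = ";".join(
        f"{pos}:{change}" for pos, change in sorted(flat.get("mutations", {}).items())
    )
    return flat


# Hand-written stand-ins for pipeline.execute().get_dict() results;
# sequences and allele names are placeholders.
records = [
    {"sequence": "ACGT", "v_call": ["IGHVF3-G8*04"], "d_call": ["IGHD6-6*01"],
     "j_call": ["IGHJ4*02"], "mutations": {2: "T>G"}},
    {"sequence": "ACGA", "v_call": ["IGHVF3-G8*04", "IGHVF3-G8*05"],
     "d_call": ["IGHD6-6*01"], "j_call": ["IGHJ4*02"], "mutations": {}},
]

flat_records = [flatten_record(r) for r in records]

# Write to an in-memory buffer; swap in open("file.csv", "w", newline="")
# to write to disk instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat_records[0].keys()))
writer.writeheader()
writer.writerows(flat_records)
csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` module quotes any field that contains a comma, so multi-allele calls survive a round trip through `csv.DictReader`.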
Mutation Models
| Model | Description | When to use |
|---|---|---|
| S5F | Context-dependent somatic hypermutation based on empirical 5-mer frequencies | Realistic antibody maturation studies |
| Uniform | Uniform random mutations | Baselines, ablation experiments |
| Custom | Implement BaseMutationModel | Your own evolutionary scenarios |
from GenAIRR import S5F, Uniform
# Realistic context-aware SHM
s5f = S5F(min_mutation_rate=0.01, max_mutation_rate=0.05)
# Simple uniform mutations
uniform = Uniform(min_mutation_rate=0.01, max_mutation_rate=0.05)
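To make the distinction concrete, here is a toy sketch of how a context-dependent model differs from a uniform one when choosing where to mutate. It is not GenAIRR's implementation: the two motif weights are invented for illustration (a real S5F table covers all 5-mers with empirically estimated values), and `position_weights` and `pick_positions` are hypothetical helpers.

```python
import random

# Invented mutability weights for two 5-mers, centred on the mutated base.
# These numbers are made up purely for illustration.
TOY_MUTABILITY = {"AAGCT": 5.0, "TAGCA": 4.0}
DEFAULT_WEIGHT = 1.0


def position_weights(seq, context_aware):
    """Return one sampling weight per position in seq."""
    weights = []
    for i in range(len(seq)):
        if context_aware and 2 <= i < len(seq) - 2:
            fivemer = seq[i - 2:i + 3]
            weights.append(TOY_MUTABILITY.get(fivemer, DEFAULT_WEIGHT))
        else:
            # Uniform model, or edge positions without a full 5-mer context.
            weights.append(DEFAULT_WEIGHT)
    return weights


def pick_positions(seq, n, context_aware, rng):
    """Draw n positions (with replacement) proportionally to their weights."""
    weights = position_weights(seq, context_aware)
    return rng.choices(range(len(seq)), weights=weights, k=n)


seq = "CCAAGCTCCTAGCACC"
rng = random.Random(0)
uniform_w = position_weights(seq, context_aware=False)
context_w = position_weights(seq, context_aware=True)
print(context_w)
picks = pick_positions(seq, 5, context_aware=True, rng=rng)
print(picks)
```

In the context-aware case, positions 4 and 11 (the centres of the two toy motifs) are drawn more often than the rest, whereas the uniform weights treat every position alike.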
Available Data Configurations
| Config | Chain | Source |
|---|---|---|
| HUMAN_IGH_OGRDB | Heavy chain (BCR) | OGRDB |
| HUMAN_IGH_EXTENDED | Heavy chain extended | OGRDB |
| HUMAN_IGK_OGRDB | Kappa light chain | OGRDB |
| HUMAN_IGL_OGRDB | Lambda light chain | OGRDB |
| HUMAN_TCRB_IMGT | TCR-beta | IMGT |
Reproducibility
from GenAIRR import set_seed, get_seed, reset_seed
set_seed(42) # deterministic results
print(get_seed()) # check current seed
reset_seed() # back to random
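The idea behind seeding can be illustrated with Python's standard `random` module: a fixed seed makes a pseudo-random generator replay the same draws. This is a generic sketch of that principle, not a description of how `set_seed` works internally; `draw_bases` is a hypothetical helper.

```python
import random


def draw_bases(seed, n=10):
    """Draw n pseudo-random nucleotides from a generator seeded with seed."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))


first = draw_bases(42)
second = draw_bases(42)
print(first == second)  # True: same seed, same draws
```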
Documentation
- Getting Started — Overview and first pipeline
- Step-by-Step Tutorial — Build a pipeline from scratch
- API Reference — All classes, parameters, and defaults
- Migration Guide — Upgrading from older versions
- Biological Context — What biological processes are simulated
Roadmap
- Selection-aware mutation model
- Additional germline databases
- Sphinx auto-generated API docs from docstrings
See open issues. Feel something's missing? Open a feature request.
Contributing
Contributions are welcome! Please read our contributing guide and check the good first issue label.
Citing GenAIRR
If GenAIRR helps your research, please cite:
Konstantinovsky T, Peres A, Polak P, Yaari G.
An unbiased comparison of immunoglobulin sequence aligners.
Briefings in Bioinformatics. 2024 Sep 23; 25(6): bbae556.
https://doi.org/10.1093/bib/bbae556
PMID: 39489605 | PMCID: PMC11531861
License
Distributed under the GPL-3.0 License. See LICENSE for details.
Acknowledgements
GenAIRR is inspired by and builds upon work from the immunoinformatics community — especially AIRRship.