An advanced immunoglobulin sequence simulation suite for benchmarking alignment models and sequence analysis.

GenAIRR

Adaptive Immune Receptor Repertoire Sequence Simulator
Generate realistic BCR & TCR repertoires with full ground-truth annotations in Python.

Why GenAIRR?

Benchmarking sequence aligners, studying somatic hypermutation, or training ML models on immune repertoires requires large, perfectly annotated datasets, not noisy snippets of real sequencing data.

GenAIRR is a plug-and-play, fully extensible simulation engine that produces realistic immunoglobulin and TCR sequences while giving you complete ground-truth labels for every position, mutation, and gene segment.

Key Features

Category	Highlights
Realistic Simulation	Context-aware S5F mutations, indels, allele-specific trimming, NP-region modelling
Composable Pipelines	Chain together built-in & custom steps into simulation pipelines
Multi-Chain Support	Heavy chain, kappa/lambda light chains, and TCR-beta out of the box
Research-ready Output	Full ground-truth annotations, JSON/pandas export, deterministic seeds
Docs & Tutorials	Step-by-step guides, Jupyter notebooks, API reference

Installation

# Python >= 3.9
pip install GenAIRR

Quick Start

One-liner

from GenAIRR import simulate, HUMAN_IGH_OGRDB, S5F

result = simulate(HUMAN_IGH_OGRDB, S5F(0.003, 0.25))
print(result.sequence)

CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGGGACCCTGTCCCTCACCTGCGCTG...

Generate multiple sequences at once:

results = simulate(HUMAN_IGH_OGRDB, S5F(0.003, 0.25), n=100)

Pipeline (Full Control)

For complete control over the simulation, use the Pipeline API:

from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F

pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        steps.DistillMutationRate(),
    ]
)

sim = pipeline.execute()
print(sim.get_dict())

{
    'sequence': 'CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCG...',
    'v_call': ['IGHVF3-G8*04'],
    'd_call': ['IGHD6-6*01'],
    'j_call': ['IGHJ4*02'],
    'productive': True,
    'mutation_rate': 0.0027,
    'mutations': {142: 'T>C'},
    'v_sequence_start': 0,
    'v_sequence_end': 293,
    'd_sequence_start': 298,
    'd_sequence_end': 316,
    'j_sequence_start': 323,
    'j_sequence_end': 367,
    # ... and more fields
}

Every output includes the full sequence, V/D/J gene calls, mutation positions, region boundaries, and quality metrics — ready for downstream analysis.
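
The region boundaries in a record can be turned into segment and junction lengths with plain arithmetic. The sketch below copies only the coordinate fields from the example record above and assumes that each start is inclusive and each end is exclusive (Python slice style); that convention is an assumption, not something stated here, so check it against the API reference before relying on it. `region_lengths` is a hypothetical helper, not part of GenAIRR.

```python
# Boundary fields copied from the example record above (sequence omitted).
record = {
    "v_sequence_start": 0,
    "v_sequence_end": 293,
    "d_sequence_start": 298,
    "d_sequence_end": 316,
    "j_sequence_start": 323,
    "j_sequence_end": 367,
}


def region_lengths(rec):
    """Return V/D/J segment lengths and the two junction (NP) gap lengths.

    Assumption: coordinates are half-open, i.e. start inclusive, end exclusive.
    """
    return {
        "v": rec["v_sequence_end"] - rec["v_sequence_start"],
        "np1": rec["d_sequence_start"] - rec["v_sequence_end"],  # gap between V and D
        "d": rec["d_sequence_end"] - rec["d_sequence_start"],
        "np2": rec["j_sequence_start"] - rec["d_sequence_end"],  # gap between D and J
        "j": rec["j_sequence_end"] - rec["j_sequence_start"],
    }


lengths = region_lengths(record)
print(lengths)
```

Under that assumption the five lengths add up to `j_sequence_end`, which is a quick sanity check that no position is counted twice or dropped.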

Examples

Full Heavy-Chain Pipeline

A production-ready pipeline that simulates sequences with biological corrections and sequencing artifacts:

from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F

pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        # Core: generate sequence with somatic hypermutation
        steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),

        # Correct ground-truth positions after trimming ambiguities
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),

        # Calculate final mutation rate
        steps.DistillMutationRate(),

        # Simulate sequencing artifacts
        steps.CorruptSequenceBeginning(),   # 5' end degradation
        steps.EnforceSequenceLength(),      # read-length limit
        steps.InsertNs(),                   # ambiguous base calls
        steps.ShortDValidation(),           # D-region QC
        steps.InsertIndels(),               # sequencing indels
    ]
)

result = pipeline.execute()

Naive Sequence (No Mutations)

from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, Uniform

pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[steps.SimulateSequence(Uniform(0, 0), productive=True)]
)
naive_seq = pipeline.execute()

Light Chain

from GenAIRR import Pipeline, steps, HUMAN_IGK_OGRDB, S5F

pipeline = Pipeline(
    config=HUMAN_IGK_OGRDB,  # kappa light chain (no D segment)
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.DistillMutationRate(),
    ]
)

Custom Allele Combination

from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F

pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(
            S5F(0.003, 0.25),
            productive=True,
            specific_v=HUMAN_IGH_OGRDB.v_alleles['IGHVF1-G1'][0],
            specific_d=HUMAN_IGH_OGRDB.d_alleles['IGHD1-1'][0],
            specific_j=HUMAN_IGH_OGRDB.j_alleles['IGHJ1'][0]
        )
    ]
)

Batch Generation

import pandas as pd
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F

pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        steps.DistillMutationRate(),
    ]
)

# Generate 1000 sequences as a DataFrame
df = pd.DataFrame([pipeline.execute().get_dict() for _ in range(1000)])
df.to_csv('simulated_repertoire.csv', index=False)
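
Once a batch of records exists, one possible use is scoring an aligner's allele calls against the ground truth. The sketch below is a minimal, standard-library-only illustration: it treats a prediction as correct when it appears in the record's `v_call` list, which is shown as a list in the example output. `call_accuracy` is a hypothetical helper, and the records and predictions are illustrative placeholders (the `*01`/`*02` allele suffixes on `IGHVF1-G1` are made up for the example), not real simulator or aligner output.

```python
def call_accuracy(truth_records, predicted_calls, field="v_call"):
    """Fraction of records whose predicted allele is among the ground-truth calls."""
    if len(truth_records) != len(predicted_calls):
        raise ValueError("truth and predictions must have the same length")
    if not truth_records:
        return 0.0
    hits = sum(
        1
        for rec, pred in zip(truth_records, predicted_calls)
        if pred in rec[field]  # any allele in the ground-truth list counts as a hit
    )
    return hits / len(truth_records)


# Placeholder ground truth shaped like the 'v_call' field of get_dict().
truth = [
    {"v_call": ["IGHVF3-G8*04"]},
    {"v_call": ["IGHVF1-G1*01", "IGHVF1-G1*02"]},  # hypothetical ambiguous call
    {"v_call": ["IGHVF3-G8*04"]},
    {"v_call": ["IGHVF1-G1*01"]},
]

# Placeholder aligner output, one predicted allele per record.
predicted = ["IGHVF3-G8*04", "IGHVF1-G1*02", "IGHVF1-G1*01", "IGHVF1-G1*01"]

accuracy = call_accuracy(truth, predicted)
print(f"V-call accuracy: {accuracy:.2f}")
```

Passing `field="d_call"` or `field="j_call"` would score the other segments the same way, assuming those fields are also lists as in the example output.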

Mutation Models

Model	Description	When to use
`S5F`	Context-dependent somatic hypermutation based on empirical 5-mer frequencies	Realistic antibody maturation studies
`Uniform`	Uniform random mutations	Baselines, ablation experiments
Custom	Implement `BaseMutationModel`	Your own evolutionary scenarios

from GenAIRR import S5F, Uniform

# Realistic context-aware SHM
s5f = S5F(min_mutation_rate=0.01, max_mutation_rate=0.05)

# Simple uniform mutations
uniform = Uniform(min_mutation_rate=0.01, max_mutation_rate=0.05)
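
To make the idea of a uniform mutation model concrete, here is a standalone sketch using only the standard library. It is not GenAIRR's implementation: it assumes the rate is drawn uniformly between the minimum and maximum, that mutated positions are chosen with equal probability, and that each mutation is recorded as `position: 'X>Y'` as in the example output. `mutate_uniform` is a hypothetical function, and the input is just the visible prefix of the example sequence.

```python
import random


def mutate_uniform(sequence, min_rate, max_rate, rng):
    """Apply position-independent substitutions to a nucleotide sequence.

    Assumption: the per-sequence rate is sampled uniformly in [min_rate, max_rate].
    Returns the mutated sequence and a {position: 'old>new'} dictionary.
    """
    rate = rng.uniform(min_rate, max_rate)
    n_mut = min(len(sequence), round(rate * len(sequence)))
    positions = rng.sample(range(len(sequence)), n_mut)

    bases = list(sequence)
    mutations = {}
    for pos in sorted(positions):
        original = bases[pos]
        # Pick any base other than the current one so every mutation is a real change.
        new = rng.choice([b for b in "ACGT" if b != original])
        bases[pos] = new
        mutations[pos] = f"{original}>{new}"
    return "".join(bases), mutations


# Visible prefix of the example sequence shown earlier (not a full receptor sequence).
germline = "CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCG"

mutated, mutations = mutate_uniform(germline, 0.1, 0.2, random.Random(42))
print(mutated)
print(mutations)
```

A context-dependent model such as S5F differs in that the chance of mutating a position, and the replacement base, depend on the surrounding 5-mer rather than being equal everywhere.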

Available Data Configurations

Config	Chain	Source
`HUMAN_IGH_OGRDB`	Heavy chain (BCR)	OGRDB
`HUMAN_IGH_EXTENDED`	Heavy chain extended	OGRDB
`HUMAN_IGK_OGRDB`	Kappa light chain	OGRDB
`HUMAN_IGL_OGRDB`	Lambda light chain	OGRDB
`HUMAN_TCRB_IMGT`	TCR-beta	IMGT

Reproducibility

from GenAIRR import set_seed, get_seed, reset_seed

set_seed(42)         # deterministic results
print(get_seed())    # check current seed
reset_seed()         # back to random

Documentation

Getting Started — Overview and first pipeline
Step-by-Step Tutorial — Build a pipeline from scratch
API Reference — All classes, parameters, and defaults
Migration Guide — Upgrading from older versions
Biological Context — What biological processes are simulated

Roadmap

Selection-aware mutation model
Additional germline databases
Sphinx auto-generated API docs from docstrings

See open issues. Feel something's missing? Open a feature request.

Contributing

Contributions are welcome! Please read our contributing guide and check the good first issue label.

Citing GenAIRR

If GenAIRR helps your research, please cite:

Konstantinovsky T, Peres A, Polak P, Yaari G.
An unbiased comparison of immunoglobulin sequence aligners.
Briefings in Bioinformatics. 2024 Sep 23; 25(6): bbae556.
https://doi.org/10.1093/bib/bbae556
PMID: 39489605 | PMCID: PMC11531861

License

Distributed under the GPL-3.0 License. See LICENSE for details.

Acknowledgements

GenAIRR is inspired by and builds upon work from the immunoinformatics community — especially AIRRship.

Release history

Version	Release date
1.0.0	Mar 10, 2026
0.6.3 (this version)	Feb 25, 2026
0.6.1	Jan 27, 2026
0.6.0	Sep 20, 2025
0.5.2	Aug 6, 2025
0.5.1	Jul 17, 2025
0.5	Jul 17, 2025
0.4.1	May 21, 2025
0.4	May 21, 2025
0.3.2	Dec 7, 2024
0.3.0	Nov 12, 2024
0.2.0	Sep 2, 2024
0.1.4	Jun 28, 2024
0.1.3	May 26, 2024
0.1.2	May 25, 2024
0.1.1	May 25, 2024
0.1.0	May 25, 2024

Download files

Source Distribution

genairr-0.6.3.tar.gz (2.5 MB view details)

Uploaded Feb 25, 2026 Source

Built Distribution

genairr-0.6.3-py3-none-any.whl (2.4 MB view details)

Uploaded Feb 25, 2026 Python 3

File details

Details for the file genairr-0.6.3.tar.gz.

File metadata

Download URL: genairr-0.6.3.tar.gz
Upload date: Feb 25, 2026
Size: 2.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for genairr-0.6.3.tar.gz
Algorithm	Hash digest
SHA256	`4219849b1f49441660e86d6a0a1db5736f297edfd4b816a91b8150ae8815bb50`
MD5	`091bd13e2663f72032185acedfaf4793`
BLAKE2b-256	`5cf0d7ec48e0fe49a629dae7597d71946ffbff1ee5490bc784fe601a9533e167`

File details

Details for the file genairr-0.6.3-py3-none-any.whl.

File metadata

Download URL: genairr-0.6.3-py3-none-any.whl
Upload date: Feb 25, 2026
Size: 2.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for genairr-0.6.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`569de4b3b38ca8f86550edee88cde4ce2d58994b187bce6c38a1d2964e193905`
MD5	`51bf00d2da68b0137a893c4e7aeb0c67`
BLAKE2b-256	`c8bff63c75fd6fb81b353ac172ef9e49f84a829c51aab6e3a3d85bc7f4163cf9`
