An advanced immunoglobulin sequence simulation suite for benchmarking alignment models and sequence analysis.

GenAIRR: AIRR Sequence Simulator

GenAIRR is a Python module that generates synthetic Adaptive Immune Receptor Repertoire (AIRR) sequences for benchmarking alignment algorithms and conducting unbiased sequence analysis.

  • Realistic Sequence Simulation: Generate heavy and light immunoglobulin chain sequences with extensive customization options.
  • Advanced Mutation and Augmentation: Introduce mutations and augment sequences to closely mimic the natural diversity and sequencing artifacts.
  • Precision in Allele-Specific Corrections: Utilize sophisticated correction maps to accurately handle allele-specific trimming events and ambiguities.
  • Indel Simulation Capability: Reflect the intricacies of sequencing data by simulating insertions and deletions within sequences.

Visit GenAIRR's Documentation

GenAIRR's ReadTheDocs

Acknowledgements

Some parts of the code were inspired by and adapted from https://github.com/Cowanlab/airrship

Quick Start Guide to GenAIRR

Welcome to the Quick Start Guide for GenAIRR, a Python module designed for generating synthetic Adaptive Immune Receptor Repertoire (AIRR) sequences. This guide will walk you through the basic usage of GenAIRR, including setting up your environment, simulating heavy and light chain sequences, and customizing your simulations.

Installation

Before you begin, ensure that Python 3.x is installed on your system. GenAIRR can be installed with pip, Python's package installer. Run the following command in your terminal (in a Jupyter notebook, prefix it with !):

# Install GenAIRR using pip
pip install GenAIRR
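
To confirm that the installation succeeded, the installed version can be queried from Python. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of GenAIRR, and it assumes the distribution is registered under the name used in the pip command.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# "GenAIRR" is the distribution name used in the pip command above.
print("GenAIRR version:", installed_version("GenAIRR"))
```

A result of None suggests that pip installed the package into a different Python environment than the one running the check.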

Setting Up

To start using GenAIRR, you need to import the necessary classes from the module. We'll also set up a DataConfig object to specify our configuration.

# Importing GenAIRR classes
from GenAIRR.simulation import HeavyChainSequenceAugmentor, LightChainSequenceAugmentor, SequenceAugmentorArguments
from GenAIRR.utilities import DataConfig
from GenAIRR.data import builtin_heavy_chain_data_config, builtin_kappa_chain_data_config, builtin_lambda_chain_data_config

# Initialize DataConfig with the path to your configuration
# data_config = DataConfig('/path/to/your/config')
# Or use one of the built-in data configs
data_config_builtin = builtin_heavy_chain_data_config()


# Set up augmentation arguments (if you have specific requirements)
args = SequenceAugmentorArguments()

Simulating Heavy Chain Sequences

Let's simulate a heavy chain sequence using HeavyChainSequenceAugmentor. This example demonstrates a simple simulation with default settings.

# Initialize the HeavyChainSequenceAugmentor
heavy_augmentor = HeavyChainSequenceAugmentor(data_config_builtin, args)

# Simulate a heavy chain sequence (the method must be called, so keep the parentheses)
heavy_sequence = heavy_augmentor.simulate_augmented_sequence()

# Print the simulated heavy chain sequence (a dict holding the sequence and its metadata)
print("Simulated Heavy Chain Sequence:", heavy_sequence)

Customizing Simulations

GenAIRR allows for extensive customization to closely mimic the natural diversity of immune sequences. Below is an example of how to customize mutation rates and indel simulations.

# Customize augmentation arguments
custom_args = SequenceAugmentorArguments(min_mutation_rate=0.01, max_mutation_rate=0.05, simulate_indels=True, max_indels=3,
                                         corrupt_proba=0.7, save_ns_record=True, save_mutations_record=True)

# Use custom arguments to simulate a heavy chain sequence
custom_heavy_augmentor = HeavyChainSequenceAugmentor(data_config_builtin, custom_args)
custom_heavy_sequence = custom_heavy_augmentor.simulate_augmented_sequence()

# Print the customized heavy chain sequence
print("Customized Simulated Heavy Chain Sequence:", custom_heavy_sequence)
Customized Simulated Heavy Chain Sequence: {'sequence': 'GTGTTGGAGTACGAACGCGGAGTTCTGTTGTGAATTGGGCGGTGAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGCCCCTGNGACTCTCCTGTGCAGCCTCTGGANTCACCTTTAGTAGCTATTGGNTGAGGTGNGTCCGCCAGGCTCCAGGGAAGGGACTGGAGTGGGTGGCCAACATAAAACAAGATGGAAGTGAGAAATACTATGTNGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACNCGGCNGTGTATTACTGTGCGAGAGTCCGACAGGAGCAGCCAAATCGTCTCTTCGGCTACTCAGGGACCCTTTCTGGTTNGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG', 'v_sequence_start': 43, 'v_sequence_end': 338, 'd_sequence_start': 347, 'd_sequence_end': 353, 'j_sequence_start': 386, 'j_sequence_end': 433, 'v_call': 'IGHVF10-G49*03,IGHVF10-G49*04', 'd_call': 'IGHD6-13*01,IGHD6-25*01,IGHD6-6*01', 'j_call': 'IGHJ5*02', 'mutation_rate': 0.02771362586605081, 'v_trim_5': 0, 'v_trim_3': 1, 'd_trim_5': 6, 'd_trim_3': 6, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'add', 'corruption_add_amount': 43, 'corruption_remove_amount': 0, 'mutations': b'ezkxOiAnVD5DJywgMTQ3OiAnQz5HJywgMTc0OiAnRz5BJywgMTk4OiAnRz5BJ30=', 'Ns': b'ezk3OiAnQT5OJywgMTIxOiAnVD5OJywgMTQyOiAnQT5OJywgMTUwOiAnRz5OJywgMjI1OiAnRz5OJywgMzEzOiAnQT5OJywgMzE4OiAnVD5OJywgMzkyOiAnQz5OJ30=', 'indels': {}}
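
In the output above, the 'mutations' and 'Ns' fields appear as byte strings rather than readable dictionaries. Judging from the example values, they look like base64-encoded text of a Python dict literal; that is an inference from this output, not documented behavior. Under that assumption, a minimal sketch for decoding them with the standard library (decode_record is a hypothetical helper, not a GenAIRR function):

```python
import ast
import base64


def decode_record(encoded):
    """Decode a base64-encoded dict literal such as the 'mutations' or 'Ns' field."""
    text = base64.b64decode(encoded).decode("ascii")
    return ast.literal_eval(text)  # parses a Python literal without executing code


# Value copied from the 'mutations' field of the example output above.
encoded_mutations = b'ezkxOiAnVD5DJywgMTQ3OiAnQz5HJywgMTc0OiAnRz5BJywgMTk4OiAnRz5BJ30='
mutations = decode_record(encoded_mutations)
print(mutations)
```

For the value shown, this yields {91: 'T>C', 147: 'C>G', 174: 'G>A', 198: 'G>A'}, which maps each position to a substitution written as original>new.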

Generating Naïve Sequences

In immunogenetics, a naïve sequence refers to an antibody sequence that has not undergone the process of somatic hypermutation. GenAIRR allows you to simulate such naïve sequences using the HeavyChainSequence class. Let's start by generating a naïve heavy chain sequence.

from GenAIRR.sequence import HeavyChainSequence

# Create a naive heavy chain sequence
naive_heavy_sequence = HeavyChainSequence.create_random(data_config_builtin)

# Access the generated naive sequence
naive_sequence = naive_heavy_sequence

print("Naïve Heavy Chain Sequence:", naive_sequence)
print('Ungapped Sequence: ')
print(naive_sequence.ungapped_seq)
Naïve Heavy Chain Sequence: 0|-----------------------------------------------------------------------------V(IGHVF3-G8*01)|294|296|----D(IGHD2-8*02)|312|332|------------J(IGHJ2*01)|381
Ungapped Sequence: 
CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTCCGGGGACCCTGTCCCTCACCTGCGCTGTCTCTGGTGGCTCCATCAGCAGTAGTAACTGGTGGAGTTGGGTCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCTATCATAGTGGGAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTGCTGTGCGAGAAAACTGGTGGTGTATGCTCCTATCTCCCACATAGGGCTTGGTACTTCGATCTCTGGGGCCGTGGCACCCTGGTCACTGTCTCCTCAG

Applying Mutations

To mimic the natural diversity and evolution of immune sequences, GenAIRR supports the simulation of mutations through various models. Here, we demonstrate how to apply mutations to a naïve sequence using the S5F and Uniform mutation models from the mutation submodule.

Using the S5F Mutation Model

The S5F model is a sophisticated mutation model that considers context-dependent mutation probabilities. It's particularly useful for simulating realistic somatic hypermutations.

from GenAIRR.mutation import S5F

# Initialize the S5F mutation model with custom mutation rates
s5f_model = S5F(min_mutation_rate=0.01, max_mutation_rate=0.05)

# Apply mutations to the naive sequence using the S5F model
s5f_mutated_sequence, mutations, mutation_rate = s5f_model.apply_mutation(naive_heavy_sequence)

print("S5F Mutated Heavy Chain Sequence:", s5f_mutated_sequence)
print("S5F Mutation Details:", mutations)
print("S5F Mutation Rate:", mutation_rate)
S5F Mutated Heavy Chain Sequence: CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTCCGGGGACCCTGTCCCTCACCTGCGCTGTCTCTGCTGGCTCCATCAGCAGTAGTAACTGGTGGAGTTGGGTCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCTATCATAGTGGGAGCACCAACTACAACCCGTCCCTCTAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCAGAGCTCTGTGACCGCCGCGGACTCGGCCGTGTATTGCTGTGCGAGAAAACTGGTGGTGTATGCTCCTATCTCCCACATAGGGCTTGGTACTTCGATCTCTGGGGCCGTGGCACCCTGGTCACTGTCTCCTCAG
S5F Mutation Details: {270: 'A>T', 192: 'A>T', 247: 'T>A', 76: 'G>C'}
S5F Mutation Rate: 0.011222406361310347
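
To make "context-dependent" concrete, here is a toy sketch in which the chance of mutating a position depends on the 5-mer centered on it. It is not GenAIRR's S5F implementation: the mutability table holds made-up values, and the names TOY_MUTABILITY, context_weight, and mutate_by_context are hypothetical.

```python
import random

BASES = "ACGT"
# Hypothetical relative mutabilities for a few 5-mer contexts (made-up values).
TOY_MUTABILITY = {"AAGCT": 5.0, "TAGCA": 4.0, "GGGGG": 0.2}


def context_weight(seq, i):
    """Weight of position i from its 5-mer context; 1.0 when the context is unlisted."""
    if i < 2 or i > len(seq) - 3:
        return 1.0  # positions too close to an edge have no full 5-mer
    return TOY_MUTABILITY.get(seq[i - 2:i + 3], 1.0)


def mutate_by_context(seq, n_mutations, rng):
    """Substitute n_mutations distinct positions, sampled in proportion to context weight."""
    bases = list(seq)
    record = {}
    for _ in range(n_mutations):
        current = "".join(bases)
        candidates = [i for i in range(len(bases)) if i not in record]
        weights = [context_weight(current, i) for i in candidates]
        pos = rng.choices(candidates, weights=weights, k=1)[0]
        new_base = rng.choice([b for b in BASES if b != bases[pos]])
        record[pos] = f"{bases[pos]}>{new_base}"
        bases[pos] = new_base
    return "".join(bases), record


naive = "CAGGTGCAGCTGAAGCTGGAGTCGGGGGGA"
mutated, record = mutate_by_context(naive, 3, random.Random(0))
print(mutated, record)
```

The record uses the same position-to-substitution layout as the "S5F Mutation Details" output above.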

Using the Uniform Mutation Model

The Uniform mutation model applies mutations at a uniform rate across the sequence, providing a simpler alternative to the context-dependent models.

from GenAIRR.mutation import Uniform

# Initialize the Uniform mutation model with custom mutation rates
uniform_model = Uniform(min_mutation_rate=0.01, max_mutation_rate=0.05)

# Apply mutations to the naive sequence using the Uniform model
uniform_mutated_sequence, mutations, mutation_rate = uniform_model.apply_mutation(naive_heavy_sequence)

print("Uniform Mutated Heavy Chain Sequence:", uniform_mutated_sequence)
print("Uniform Mutation Details:", mutations)
print("Uniform Mutation Rate:", mutation_rate)
Uniform Mutated Heavy Chain Sequence: CAGGTGCACCTGCAGGAGTCGGGCCGAGGAGTGGTGAAGCCTCCGGGGACCCTGTCCCTCACCTGCGCTGTCTCTGGTGGCTCCATCAGCAGTAGTTACTTGTGGAGTTGGGTCCGCCAGCCACCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCTATCATAGTGGGAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTGCTGTGCGAGAAAACTGGTGGTGTATGCTCCTATCTCCCACATAGGGCTTGGTACTTCGATCTGTGGGGCCGTGGCACCCTGGTCACTGTCTCCTCAG
Uniform Mutation Details: {122: 'C>A', 8: 'G>C', 100: 'G>T', 96: 'A>T', 25: 'C>G', 346: 'C>G', 30: 'C>G'}
Uniform Mutation Rate: 0.019802269687583134
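
For comparison, the idea behind a uniform model can be sketched in a few lines: draw a rate between the minimum and maximum, then substitute that fraction of positions, with every position equally likely. This is an illustrative sketch only. The function uniform_mutate is hypothetical, and GenAIRR's own Uniform class may sample and round differently.

```python
import random


def uniform_mutate(seq, min_rate, max_rate, rng):
    """Draw a rate uniformly from [min_rate, max_rate] and mutate that fraction of positions."""
    rate = rng.uniform(min_rate, max_rate)
    n_mutations = round(rate * len(seq))
    positions = rng.sample(range(len(seq)), n_mutations)  # every position equally likely
    bases = list(seq)
    record = {}
    for pos in positions:
        new_base = rng.choice([b for b in "ACGT" if b != bases[pos]])
        record[pos] = f"{bases[pos]}>{new_base}"
        bases[pos] = new_base
    return "".join(bases), record, rate


toy_naive = "ACGT" * 25  # 100-nucleotide stand-in sequence
toy_mutated, toy_record, toy_rate = uniform_mutate(toy_naive, 0.01, 0.05, random.Random(1))
print(toy_record, toy_rate)
```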

Common Use Cases

GenAIRR is a versatile tool designed to meet a broad range of needs in immunogenetics research. This section provides examples and explanations for some common use cases, including generating multiple sequences, simulating specific allele combinations, and more.

Generating Many Sequences

One common requirement is to generate a large dataset of synthetic AIRR sequences for analysis or benchmarking. Below is an example of how to generate multiple sequences using GenAIRR in a loop.

num_sequences = 5  # Number of sequences to generate

heavy_sequences = []
for _ in range(num_sequences):
    # Simulate a heavy chain sequence
    heavy_sequence = heavy_augmentor.simulate_augmented_sequence()
    heavy_sequences.append(heavy_sequence)

# Display the generated sequences
for i, seq in enumerate(heavy_sequences, start=1):
    print(f"Heavy Chain Sequence {i}: {seq}")
Heavy Chain Sequence 1: {'sequence': 'TTGGNAAGCCAGGCCCTGGAGTGACTTTCACACACTGATCGGTGCGANGCTGAACTCCACAACGCCTCCCTCAAAACCTGACCCACCACGTCCAGGGACCCGTCCGGTAGTCACATGGTCCTGACACTGTCGAACATGGACCCTGTGGACACAGTCACACATTACTGTGCACCGATNCCCCCCCCTACGANGATTCCGGCCGGGCCCTGGCTAATCCAATCACTTGTTGGAGGTCTGGGGCAAAGGGACCACGGCCACCGACTCNTAAG', 'v_sequence_start': 9, 'v_sequence_end': 176, 'd_sequence_start': 185, 'd_sequence_end': 199, 'j_sequence_start': 207, 'j_sequence_end': 269, 'v_call': 'IGHVF1-G3*06,IGHVF1-G3*05,IGHVF1-G3*04', 'd_call': 'IGHD3-10*03', 'j_call': 'IGHJ6*03', 'mutation_rate': 0.241635687732342, 'v_trim_5': 0, 'v_trim_3': 2, 'd_trim_5': 4, 'd_trim_3': 13, 'j_trim_5': 2, 'j_trim_3': 0, 'corruption_event': 'remove_before_add', 'corruption_add_amount': 9, 'corruption_remove_amount': 132, 'indels': {}}
Heavy Chain Sequence 2: {'sequence': 'CAGCTGCAGTTGCAGGAGTCGGGCCCNGGACTGGTGAAGCCTTTGGAGGCCCAGTCCCTCTCGTACCCTGTCTCTGGTGACTCCATCAGCAATAGTGGTTACTCCTGGGGCTGAATCCGTCCCCCCNCAGGGAAGGGGCTGGAGTGGATNGCGACTATANATTATAGGGGCAGCTCCTGCTACAACCCGTCCCTCAAGAGTCGAGTCACCATCTCCACAGACACGTCCAAGAAGCAGGTCTCCCTGATGCTGAGCTCTATGACCGCCGCANACACGACTGTNTATTACTGTGCGAGAGTCATGGTTCTGATGTTTTGGAGCAACTGGTTCGACCCCTGGGACCAGGGAAGCCTGGTCACCCTCTCCTCAN', 'v_sequence_start': 0, 'v_sequence_end': 297, 'd_sequence_start': 308, 'd_sequence_end': 317, 'j_sequence_start': 320, 'j_sequence_end': 370, 'v_call': 'IGHVF3-G10*06', 'd_call': 'IGHD3-9*01', 'j_call': 'IGHJ5*02', 'mutation_rate': 0.11621621621621622, 'v_trim_5': 0, 'v_trim_3': 2, 'd_trim_5': 8, 'd_trim_3': 15, 'j_trim_5': 1, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}
Heavy Chain Sequence 3: {'sequence': 'CAGAAGAGACTGGTGCAGTCTGGGGTTGACATGAAGACGACTGGGTCGTAATTGAAACTTTCACGAAAGACTTCTGAATACACTCGCACANACCGCTATCTGCACTGGGTCCGACAGGCCCCCAGACGGGCGTTTGAGTGGGTGGGGNGGATCACGCCTTTCAGTGGTAACACCCACTACGTGCAGACGTCCCAGGACAGAGTCCCCATTACCAGGNACAAGTNTACGAGTCCAGCCTATATAGAACTGAACACCCTNAAATGCGAGGACACAGACATATATTAATGCGCANGATCCACGGGAACCCCAGCNGAGAACTGGTACTTCGATCTTTGGGGCCGTGGCCCCCTGATCACCGTCTACTCTG', 'v_sequence_start': 0, 'v_sequence_end': 295, 'd_sequence_start': 295, 'd_sequence_end': 305, 'j_sequence_start': 316, 'j_sequence_end': 367, 'v_call': 'IGHVF6-G20*02', 'd_call': 'IGHD4-11*01,IGHD4-4*01', 'j_call': 'IGHJ2*01', 'mutation_rate': 0.1989100817438692, 'v_trim_5': 0, 'v_trim_3': 1, 'd_trim_5': 3, 'd_trim_3': 3, 'j_trim_5': 3, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}
Heavy Chain Sequence 4: {'sequence': 'CAGTTTCAGCTGGTGCCGTCTGGAGCTGAGGTGAAGAAGNCTGNGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGANTACACCTTCACCAGGTATGATATNAGCTNGGTTCGACAGGCCCCTGGACAAGGGCTTGAGTGGGTGGGATGGATCAGCGCTTACAAGGGTAACACAAACTATGAACAGAAGCTCCAGGGCAGAGTCACCATGACCACTGACACATCCACGAGCACAGCCTACATAGAGCTGAGGAGTCTGAGATCTGACGACACGGCCGTGTATCACTGTGCGAGAATCGGCGGCAGGGACGAGTCCGCAGATATCTCGCATCCCTATTGCTACTCCGGTATGGACGTCTGGGGCCAAGNNACCACGGTCACCGTCTCCTCAG', 'v_sequence_start': 0, 'v_sequence_end': 294, 'd_sequence_start': 316, 'd_sequence_end': 323, 'j_sequence_start': 332, 'j_sequence_end': 391, 'v_call': 'IGHVF6-G25*02', 'd_call': 'IGHD5-18*01,IGHD5-5*01', 'j_call': 'IGHJ6*02', 'mutation_rate': 0.0639386189258312, 'v_trim_5': 0, 'v_trim_3': 2, 'd_trim_5': 8, 'd_trim_3': 6, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}
Heavy Chain Sequence 5: {'sequence': 'GAGGTGCAACTGCTGCAGACCGGGTCAGACTTGATACAGCCAGGGGAGTCCCTCANACTGCCCTGTGCAGCCTCTGGATTCACCTGGTGAANGNATGCCGTGAATTGGGGCCGGCGGCCTCCAGGGATGGGACTTGATTGGGTCTCAGTTCTNAGTGCTAGTGGTGAGAGAACNTTCTCCATAGACTCCATGAAGGGCCGGGTCACCACCTCCAGGGTCAATTGCAAGAGTACGCTGTATCTGAAAATGAAGGGCCTGAGAGCCGAGGACGCGGCTGTTTATTATTGAGCGAGAGAGGCCTTAGGGTCGGATTACTACTCCTTTTACATGGACGTCTGGGGCACAGGGACCGCGGNCACCGTCTCGTCAC', 'v_sequence_start': 0, 'v_sequence_end': 296, 'd_sequence_start': 302, 'd_sequence_end': 306, 'j_sequence_start': 310, 'j_sequence_end': 370, 'v_call': 'IGHVF10-G41*02', 'd_call': 'Short-D', 'j_call': 'IGHJ6*03', 'mutation_rate': 0.1972972972972973, 'v_trim_5': 0, 'v_trim_3': 0, 'd_trim_5': 7, 'd_trim_3': 6, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'no-corruption', 'corruption_add_amount': 0, 'corruption_remove_amount': 0, 'indels': {}}
# Collect the records into a DataFrame for tabular inspection
import pandas as pd
pd.DataFrame(heavy_sequences)

| | sequence | v_sequence_start | v_sequence_end | d_sequence_start | d_sequence_end | j_sequence_start | j_sequence_end | v_call | d_call | j_call | ... | v_trim_5 | v_trim_3 | d_trim_5 | d_trim_3 | j_trim_5 | j_trim_3 | corruption_event | corruption_add_amount | corruption_remove_amount | indels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | TTGGNAAGCCAGGCCCTGGAGTGACTTTCACACACTGATCGGTGCG... | 9 | 176 | 185 | 199 | 207 | 269 | IGHVF1-G3*06,IGHVF1-G3*05,IGHVF1-G3*04 | IGHD3-10*03 | IGHJ6*03 | ... | 0 | 2 | 4 | 13 | 2 | 0 | remove_before_add | 9 | 132 | {} |
| 1 | CAGCTGCAGTTGCAGGAGTCGGGCCCNGGACTGGTGAAGCCTTTGG... | 0 | 297 | 308 | 317 | 320 | 370 | IGHVF3-G10*06 | IGHD3-9*01 | IGHJ5*02 | ... | 0 | 2 | 8 | 15 | 1 | 0 | no-corruption | 0 | 0 | {} |
| 2 | CAGAAGAGACTGGTGCAGTCTGGGGTTGACATGAAGACGACTGGGT... | 0 | 295 | 295 | 305 | 316 | 367 | IGHVF6-G20*02 | IGHD4-11*01,IGHD4-4*01 | IGHJ2*01 | ... | 0 | 1 | 3 | 3 | 3 | 0 | no-corruption | 0 | 0 | {} |
| 3 | CAGTTTCAGCTGGTGCCGTCTGGAGCTGAGGTGAAGAAGNCTGNGG... | 0 | 294 | 316 | 323 | 332 | 391 | IGHVF6-G25*02 | IGHD5-18*01,IGHD5-5*01 | IGHJ6*02 | ... | 0 | 2 | 8 | 6 | 4 | 0 | no-corruption | 0 | 0 | {} |
| 4 | GAGGTGCAACTGCTGCAGACCGGGTCAGACTTGATACAGCCAGGGG... | 0 | 296 | 302 | 306 | 310 | 370 | IGHVF10-G41*02 | Short-D | IGHJ6*03 | ... | 0 | 0 | 7 | 6 | 4 | 0 | no-corruption | 0 | 0 | {} |

5 rows × 21 columns
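
For larger runs it is often convenient to write the records to disk instead of printing them. The sketch below serializes a list of record dicts to tab-separated text using only the standard library. The helper records_to_tsv is hypothetical, and the records are shortened stand-ins for the dicts returned by simulate_augmented_sequence().

```python
import csv
import io


def records_to_tsv(records):
    """Serialize a list of dicts to TSV text, using the union of keys as columns."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record are written as empty cells
    return buffer.getvalue()


# Shortened stand-in records; real ones come from simulate_augmented_sequence().
toy_records = [
    {"sequence": "CAGCTG", "v_call": "IGHVF3-G10*06", "j_call": "IGHJ5*02", "mutation_rate": 0.116},
    {"sequence": "GAGGTG", "v_call": "IGHVF10-G41*02", "j_call": "IGHJ6*03", "mutation_rate": 0.197},
]
tsv_text = records_to_tsv(toy_records)
print(tsv_text)
```

Writing tsv_text to a file with open(path, "w") gives a table that spreadsheet tools and pandas.read_csv(path, sep="\t") can read back.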

Generating a Specific Allele Combination Sequence

In some cases, you might want to simulate sequences with specific V, D, and J allele combinations. Here's how to specify alleles for your simulations.

# Define the names of your specific alleles
v_name = 'IGHVF6-G21*01'
d_name = 'IGHD5-18*01'
j_name = 'IGHJ6*03'

# Look up the matching allele objects in data_config (None if a name is not found)
v_allele = next((allele for family in data_config_builtin.v_alleles.values() for allele in family if allele.name == v_name), None)
d_allele = next((allele for family in data_config_builtin.d_alleles.values() for allele in family if allele.name == d_name), None)
j_allele = next((allele for family in data_config_builtin.j_alleles.values() for allele in family if allele.name == j_name), None)

# Check if all alleles were found
if not v_allele or not d_allele or not j_allele:
    raise ValueError("One or more specified alleles could not be found in the data config.")


# Generate a sequence with the specified allele combination
specific_allele_sequence = HeavyChainSequence([v_allele, d_allele, j_allele], data_config_builtin)

# Apply mutations using the S5F model created earlier
specific_allele_sequence.mutate(s5f_model)

print("Specific Allele Combination Sequence:", specific_allele_sequence.mutated_seq)
Specific Allele Combination Sequence: CAGGTGCAGTTGGTGCAGTCTGGGACTGAGTTGAAGACGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCTAGAGGCACCTTCAGCAGCTCTGCTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGGATCATCCCTATCTTTGGTACAGCAAACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGATAAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGATACGGCCGTGTATTACTGTGCGAGAGAGGATGGGTCCGGATCCCACCCCATTTACTATTACTACTACTACATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCCTCAG
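
The three lookups above follow the same pattern, so they can be folded into one small helper. This sketch assumes only what the example shows: the allele collections are dicts mapping a family to a list of objects with a name attribute. The helper find_allele and the family keys in the stand-in data are hypothetical.

```python
from types import SimpleNamespace


def find_allele(alleles_by_family, name):
    """Return the first allele object whose .name matches, or None."""
    for family in alleles_by_family.values():
        for allele in family:
            if allele.name == name:
                return allele
    return None


# Stand-in for data_config_builtin.d_alleles: a dict of family -> list of allele objects.
toy_d_alleles = {
    "IGHD5": [SimpleNamespace(name="IGHD5-18*01"), SimpleNamespace(name="IGHD5-5*01")],
    "IGHD6": [SimpleNamespace(name="IGHD6-13*01")],
}

print(find_allele(toy_d_alleles, "IGHD5-18*01"))
print(find_allele(toy_d_alleles, "IGHD9-99*01"))  # an absent name gives None
```

With a real data config, the call would look like find_allele(data_config_builtin.v_alleles, 'IGHVF6-G21*01').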

Simulating Sequences with Custom Mutation Rates

Adjusting mutation rates allows for the simulation of sequences at various stages of affinity maturation. Here's how to customize mutation rates in your simulations.

# Customize augmentation arguments with your desired mutation rates
custom_args = SequenceAugmentorArguments(min_mutation_rate=0.15, max_mutation_rate=0.3)

# Initialize the augmentor with custom arguments
custom_augmentor = HeavyChainSequenceAugmentor(data_config_builtin, custom_args)

# Generate a sequence with the custom mutation rates
custom_mutation_sequence = custom_augmentor.simulate_augmented_sequence()

print("Custom Mutation Rate Sequence:", custom_mutation_sequence)
Custom Mutation Rate Sequence: {'sequence': 'GGTTGGAGCTCATTGGGAGCTNCTATTCTAGTGGGACTACCTAGTACAACCTGTCCCTCAAGAATCGCGTCACCATATCAGTCGACACGTCCAAGAATCANTCCTCCCTGGAGCTGAGCTCCGTGACCGCAGCGGACACGGCCGTGCCTNGTTGNGCGGGAAAGTTGAATATAGTGGCTAACTCTGCCTTTTGCTCTCTGGGGCCAGGGGACAGTGGCCACTGTTTTTTCAG', 'v_sequence_start': 0, 'v_sequence_end': 161, 'd_sequence_start': 165, 'd_sequence_end': 180, 'j_sequence_start': 186, 'j_sequence_end': 232, 'v_call': 'IGHVF3-G10*04', 'd_call': 'IGHD5-12*01,IGHD5-18*02', 'j_call': 'IGHJ3*02', 'mutation_rate': 0.15517241379310345, 'v_trim_5': 0, 'v_trim_3': 1, 'd_trim_5': 4, 'd_trim_3': 8, 'j_trim_5': 4, 'j_trim_3': 0, 'corruption_event': 'remove', 'corruption_add_amount': 0, 'corruption_remove_amount': 136, 'indels': {}}
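
Each simulated record carries start and end coordinates for the V, D, and J segments, which makes it straightforward to slice the segments out of the full sequence. The sketch below assumes 0-based, end-exclusive coordinates; that convention is an assumption worth checking against the documentation before relying on it. The helper extract_segments is hypothetical, and the record is a shortened stand-in.

```python
def extract_segments(record):
    """Slice the V, D, and J regions out of a simulated record.

    Assumes 0-based, end-exclusive coordinates; confirm before relying on it.
    """
    seq = record["sequence"]
    return {
        gene: seq[record[f"{gene}_sequence_start"]:record[f"{gene}_sequence_end"]]
        for gene in ("v", "d", "j")
    }


# Shortened stand-in record using the same key names as the output above.
toy_record = {
    "sequence": "AAAACCCCGGGGTTTT",
    "v_sequence_start": 0, "v_sequence_end": 6,
    "d_sequence_start": 8, "d_sequence_end": 11,
    "j_sequence_start": 12, "j_sequence_end": 16,
}
segments = extract_segments(toy_record)
print(segments)
```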

Generating Naïve vs. Mutated Sequence Pairs

Comparing naïve and mutated versions of the same sequence can be useful for studying somatic hypermutation effects. Here's how to generate such pairs with GenAIRR.

# Generate a naive sequence
sequence_object = HeavyChainSequence.create_random(data_config_builtin)

# Mutate it with the S5F model; the naive sequence stays available as ungapped_seq
sequence_object.mutate(s5f_model)

print("Naïve Sequence:", sequence_object.ungapped_seq)
print("Mutated Sequence:", sequence_object.mutated_seq)
Naïve Sequence: CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGACACCCTGTCCCTCACCTGCGCTGTCTCTGGTTACTCCATCAGCAGTAGTAACTGGTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTACATCTATTATAGTGGGAGCATCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGTGGACACGGCCGTGTATTACTGTGCGAGAAAGCCACTCGGTCACACTACGGTGGTAACTCATGCTTTTGATATCTGGGGCCAAGGGACAATGGTCACCGTCTCTTCAG
Mutated Sequence: CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGACACCCTGTCCCTCACCTGCGCTGTCTCTGGTTACTCCATCAGCAGTAGTAACTGGTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGAACATCCATTATAGTGGGAGCATCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATGTCACTAGACACGTCCAAGAACCAGTTCTCCCTGAAACTGAGCTCTGTGGCCGCCGTGGACACGGCCGTGTATTACTGTGCGAGAACGCCACTCGGTCACACTACGGTGGTAATTCATGCTTTTGATATCTGGGGCCAAGGGACAATGGTCACCGTCTCTTCAG
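
The two sequences above differ at only a handful of positions, which are hard to spot by eye. A short position-by-position comparison lists them in the same position-to-substitution layout used by the mutation records earlier in this guide. The helper diff_sequences is hypothetical and handles substitutions only, so both inputs must have the same length.

```python
def diff_sequences(naive, mutated):
    """Return {position: 'X>Y'} for every substitution between two equal-length sequences."""
    if len(naive) != len(mutated):
        raise ValueError("sequences must have the same length (substitutions only)")
    return {
        i: f"{a}>{b}"
        for i, (a, b) in enumerate(zip(naive, mutated))
        if a != b
    }


# Short stand-in sequences; pass ungapped_seq and mutated_seq for real data.
naive_toy = "CAGGTGCAGC"
mutated_toy = "CAGCTGCAGT"
changes = diff_sequences(naive_toy, mutated_toy)
print(changes, len(changes) / len(naive_toy))
```

Dividing the number of changes by the sequence length gives the observed substitution fraction for the pair.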

Simulating TCR-Beta Sequences

GenAIRR also supports TCR-beta (TCRB) sequence simulation. Here's how to simulate TCRB sequences.

# Import the TCRB augmentor, its arguments class, and the built-in TCRB data config
from GenAIRR.TCR.simulation import TCRHeavyChainSequenceAugmentor, SequenceAugmentorArguments
from GenAIRR.data import builtin_tcrb_data_config

tcr_data_config = builtin_tcrb_data_config()
custom_args = SequenceAugmentorArguments(simulate_indels=0.2)

# Initialize the augmentor with custom arguments
custom_augmentor = TCRHeavyChainSequenceAugmentor(tcr_data_config, custom_args)

# Generate 100 sequences
generated_seqs = []
for _ in range(100):
    generated_seqs.append(custom_augmentor.simulate_augmented_sequence())

print("Generated Sequences:", generated_seqs)
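
Printing 100 records at once is hard to read, so a quick summary is often more useful. The sketch below tallies allele calls and splits comma-separated ambiguous calls into individual alleles. It assumes the TCRB records expose a 'v_call' key like the heavy chain records shown earlier. The helper count_calls is hypothetical, and the allele names in the stand-in records are illustrative only.

```python
from collections import Counter


def count_calls(records, field):
    """Count allele calls in a field, splitting comma-separated ambiguous calls."""
    counts = Counter()
    for record in records:
        for call in record[field].split(","):
            counts[call] += 1
    return counts


# Stand-in records with illustrative allele names; real ones come from the augmentor.
toy_seqs = [
    {"v_call": "TRBV1*01"},
    {"v_call": "TRBV1*01,TRBV2*01"},
    {"v_call": "TRBV2*01"},
]
v_counts = count_calls(toy_seqs, "v_call")
print(v_counts.most_common())
```

With real output, the call would be count_calls(generated_seqs, "v_call").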

Conclusion

This section highlighted some common use cases for GenAIRR, demonstrating its flexibility in simulating AIRR sequences for various research purposes. Whether you need large datasets, specific allele combinations, custom mutation rates, or comparative analyses of naïve and mutated sequences, GenAIRR provides the necessary tools to achieve your objectives.
