Profiling genes encoding the adaptive immune receptor repertoire with gAIRR Suite.
Project description
Updated: Nov 02, 2025
gAIRR-suite: Profiling genes encoding the adaptive immune receptor repertoire
gAIRR-annotate provides annotations of IG and TR genes on personal assembly data
gAIRR-call genotypes IG and TR genes from short-read data sequenced from germline cells. It is designed for probe-captured gAIRR-seq data, but it can also be applied to other types of short reads enriched for IG and TR regions.
- The database materials are updated to the latest IMGT/GENE-DB (as of 2025/11/02).
- Single-end read input is allowed in this version.
Prerequisite programs:
- BWA aligner (0.7.17)
- SPAdes assembler (v3.13.0)
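Before installing, it can help to confirm that both prerequisite programs are reachable on the PATH. The following is a minimal sketch; the executable names `bwa` and `spades.py` are assumptions based on how these tools are commonly installed, and the check does not verify the versions listed above.

```python
import shutil

# Assumed executable names; adjust them if your installation differs.
PREREQUISITES = {"BWA": "bwa", "SPAdes": "spades.py"}


def find_prerequisites(tools=PREREQUISITES):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(exe) for name, exe in tools.items()}


if __name__ == "__main__":
    for name, path in find_prerequisites().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```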
Installation
pip install gAIRR-suite
Alternatively, install from source:
git clone https://github.com/maojanlin/gAIRRsuite.git
cd gAIRRsuite
Although optional, it is good practice to install in a virtual environment to manage the dependencies:
python -m venv venv
source venv/bin/activate
Now a virtual environment (named venv) is activated:
python setup.py install
gAIRR-annotate usage
$ gAIRR_annotate -wd <work_dir> -id <sample_id> -a1 <assembly_h1.fa> -a2 <assembly_h2.fa>
The -a2 argument is optional; use it to supply the second haplotype of a diploid personal assembly.
The final annotation report is work_dir/sample_id/group_genes.1.bed, and work_dir/sample_id/group_genes.2.bed if the second haplotype of the assembly is provided.
Novel alleles in the annotation are marked with parentheses that indicate the edit distance between the novel allele and the documented gene. For example, TRAV8-3*01(i:1) means the annotated gene has an edit distance of 1 to TRAV8-3*01, while TRAV19*01(i:0,h:17) has an edit distance of 0 to the original allele but has 17 bases clipped.
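When post-processing the annotation, the parenthesized suffix can be split into its fields. The parser below is a minimal sketch whose pattern is inferred only from the two examples above (`i` for edit distance, `h` for clipped bases); real output may contain other layouts, and the function name is hypothetical.

```python
import re

# Allele name, optionally followed by "(key:value,key:value)".
# This layout is an assumption drawn from the two documented examples.
_LABEL_RE = re.compile(r"^(?P<allele>[^()]+?)(?:\((?P<tags>[^()]*)\))?$")


def parse_allele_label(label):
    """Split a label such as 'TRAV19*01(i:0,h:17)' into its parts."""
    match = _LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized allele label: {label!r}")
    tags = {}
    if match.group("tags"):
        for item in match.group("tags").split(","):
            key, _, value = item.partition(":")
            tags[key.strip()] = int(value)
    return {
        "allele": match.group("allele"),
        "edit_distance": tags.get("i"),   # None when no suffix is present
        "clipped_bases": tags.get("h"),   # None when no 'h' field is present
    }


if __name__ == "__main__":
    for example in ("TRAV8-3*01(i:1)", "TRAV19*01(i:0,h:17)"):
        print(parse_allele_label(example))
```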
The novel allele sequence can be found in work_dir/sample_id/novel_sample_ID_geneLocus.fasta.1/2.bed
If only IG or TR genes are preferred, the option -lc IG or -lc TR can be specified.
gAIRR-call usage
$ gAIRR_call -wd <work_dir> -id <sample_ID> -rd1 <read.R1.fastq.gz> -rd2 <read.R2.fastq.gz> -lc <TRV TRJ TRD IGV IGJ IGD>
The preferred IG/TR loci and V/D/J genes can be specified with the -lc option.
The final calling report is work_dir/sample_ID/gene-locus/gAIRR-call_report.rpt, where gene-locus is the targeted gene set, such as TCRV or TCRJ.
The novel allele sequence from the gAIRR-call_report.rpt can be found in work_dir/sample_ID/gene-locus_novel/gene-locus_with_novel.fasta
The --flanking option can be specified to run the flanking sequence assembly algorithm. The flanking sequence information can be found in work_dir/sample_ID/gene-locus_flanking/flanking_result/flanking_haplotypes.fasta
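For batch runs over many samples, the gAIRR_call command line can be assembled programmatically. This is a minimal sketch that uses only the options documented above (-wd, -id, -rd1, -rd2, -lc, --flanking) for paired-end input; the helper name and the file names in the example are hypothetical.

```python
import shlex


def build_gairr_call_command(work_dir, sample_id, read1, read2,
                             loci=(), flanking=False):
    """Return the gAIRR_call argument list for one paired-end sample."""
    cmd = ["gAIRR_call", "-wd", str(work_dir), "-id", sample_id,
           "-rd1", str(read1), "-rd2", str(read2)]
    if loci:
        # Loci are passed as separate values after -lc, as in the usage line.
        cmd += ["-lc", *loci]
    if flanking:
        cmd.append("--flanking")
    return cmd


if __name__ == "__main__":
    command = build_gairr_call_command(
        "target_call", "sample_A",                     # hypothetical names
        "sample_A.R1.fastq.gz", "sample_A.R2.fastq.gz",
        loci=("TRV", "TRJ"), flanking=True,
    )
    print(shlex.join(command))
```

The returned list can be handed to `subprocess.run(command, check=True)` once gAIRR-suite is installed.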
Note that the serial numbers assigned by gAIRR_call and gAIRR_annotate do not necessarily correspond. When there are multiple novel alleles, the serial numbers, which start from 0, can be in a different order.
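Because the serial numbers may differ between the two tools, one way to pair up novel alleles is to compare the sequences themselves rather than the names. The sketch below assumes both outputs are plain FASTA and that corresponding alleles have exactly identical sequences (no clipping or strand differences); the record names in the example are hypothetical.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def match_by_sequence(call_records, annotate_records):
    """Map each gAIRR-call name to the gAIRR-annotate names sharing its sequence."""
    by_sequence = {}
    for name, seq in annotate_records:
        by_sequence.setdefault(seq, []).append(name)
    return {name: by_sequence.get(seq, []) for name, seq in call_records}


if __name__ == "__main__":
    # Hypothetical record names illustrating swapped serial numbers.
    call = read_fasta([">allele_novel_0", "ACGTAC", ">allele_novel_1", "TTGACA"])
    annotate = read_fasta([">allele_novel_0", "TTGACA", ">allele_novel_1", "ACGTAC"])
    print(match_by_sequence(call, annotate))
```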
Checking RSS
Usage:
./scripts/check_RSS.sh
The check_RSS.sh pipeline uses the RSS and the separate heptamer and nonamer sequences downloaded from the IMGT database to check whether the flanking sequences contain proper RSS patterns.
To run the check_RSS.sh pipeline, BWA aligner should be installed.
The check_RSS.sh pipeline first aligns all the known IMGT RSSs to the flanking sequences to check for identical or near-identical RSS patterns. The flanking sequences missing RSSs are then recorded in RSS_checking/first_scan/missing_RSS_HG002-part_TCRJ_first_scan.fasta and passed to the second scan. The second scan aligns heptamer and nonamer sequences separately to the flanking sequences and tries to identify heptamer-nonamer pairs that resemble proper RSSs.
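The idea behind the second scan can be illustrated with a small stand-alone sketch. It is not the check_RSS.sh implementation: instead of aligning IMGT heptamers and nonamers with BWA, it scans for the canonical consensus heptamer (CACAGTG) and nonamer (ACAAAAACC) with a mismatch tolerance and pairs them when they are separated by a spacer of roughly 12 or 23 bases. The thresholds and function names are assumptions, and only the heptamer-spacer-nonamer orientation on the given strand is searched.

```python
HEPTAMER = "CACAGTG"     # canonical consensus heptamer
NONAMER = "ACAAAAACC"    # canonical consensus nonamer
SPACERS = (12, 23)       # canonical spacer lengths in bases


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def approx_hits(seq, motif, max_mismatch):
    """Start positions where motif occurs with at most max_mismatch mismatches."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1)
            if hamming(seq[i:i + k], motif) <= max_mismatch]


def find_rss_candidates(seq, hept_mm=1, non_mm=2, spacer_slack=1):
    """Return (heptamer_start, nonamer_start, spacer_length) candidates."""
    seq = seq.upper()
    nonamer_starts = set(approx_hits(seq, NONAMER, non_mm))
    found = []
    for h in approx_hits(seq, HEPTAMER, hept_mm):
        for spacer in SPACERS:
            for delta in range(-spacer_slack, spacer_slack + 1):
                n = h + len(HEPTAMER) + spacer + delta
                if n in nonamer_starts:
                    found.append((h, n, spacer + delta))
    return found


if __name__ == "__main__":
    flank = "TTT" + HEPTAMER + "G" * 12 + NONAMER + "TTT"
    print(find_rss_candidates(flank))
```

To search the opposite orientation (nonamer, spacer, heptamer), the same function can be applied to the reverse complement of the flanking sequence.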
Generated files:
RSS_checking/first_scan/database_HG002-part_TCRJ_first_scan.csv is the RSS report file of HG002-part. It indicates whether each RSS is known, novel, or could not be found after the first scan.
RSS_checking/second_scan/database_HG002-part_TCRJ_second_scan.csv is the RSS report file of the flanking sequences that missed RSS in the first scan.
Data collection pipeline
Usage:
./scripts/database_collect.sh
./scripts/allele_consensus.sh
The database_collect.sh pipeline collects the novel and flanking sequences into database files. Duplicated novel or flanking sequences are collapsed into one. Taking TRV novel alleles as an example, the generated file database_novel_TRV.tsv indicates which samples possess which novel alleles, and database_novel_TRV.fasta records the novel allele sequences.
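The collapsing step can be pictured as grouping records by sequence and remembering which samples carry each one. The following is a minimal sketch of that idea, not the database_collect.sh code; the input structure, function name, and the second sample ID are hypothetical, and the real .tsv layout is not reproduced here.

```python
from collections import defaultdict


def collapse_sequences(sample_to_records):
    """Collapse duplicates: {sample: [(name, seq), ...]} -> {seq: [samples]}."""
    owners = defaultdict(set)
    for sample, records in sample_to_records.items():
        for _name, seq in records:
            owners[seq.upper()].add(sample)
    # Sort sample IDs so the output is stable between runs.
    return {seq: sorted(samples) for seq, samples in owners.items()}


if __name__ == "__main__":
    collected = collapse_sequences({
        "HG002": [("novel_0", "ACGTAC"), ("novel_1", "TTGACA")],
        "sample_B": [("novel_0", "ttgaca")],   # hypothetical second sample
    })
    for seq, samples in collected.items():
        print(seq, ",".join(samples))
```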
For samples with multiple assemblies, consensus allele results can be obtained from the allele_consensus.sh pipeline. Taking database_novel_TRV.tsv and database_novel_TRV.fasta as input, allele_consensus.sh will generate database_novel_TRV_consensus.tsv and database_novel_TRV_consensus.fasta as output according to ./example/samples/consensus_name_HGSVC.log.
In ./example/samples/consensus_name_HGSVC.log, terms are separated by spaces. The first term is the consensus name, while the later terms indicate the IDs of the sample's different assemblies.
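A file in this layout is straightforward to read. The sketch below parses space-separated lines into a mapping from consensus name to assembly IDs and also builds the reverse lookup; the function names and the example IDs are hypothetical and are not taken from the shipped log file.

```python
def parse_consensus_log(lines):
    """Map each consensus name to the list of assembly IDs that follow it."""
    groups = {}
    for line in lines:
        terms = line.split()
        if not terms:
            continue  # skip blank lines
        groups[terms[0]] = terms[1:]
    return groups


def assembly_to_consensus(groups):
    """Invert the mapping: assembly ID -> consensus name."""
    return {assembly: name
            for name, assemblies in groups.items()
            for assembly in assemblies}


if __name__ == "__main__":
    # Hypothetical content in the described format.
    example = ["sampleA sampleA-asm1 sampleA-asm2", "sampleB sampleB-asm1"]
    groups = parse_consensus_log(example)
    print(groups)
    print(assembly_to_consensus(groups))
```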
Example
The gAIRR_suite/material/ directory contains IMGT allele sequences and RSS information.
The example/samples/ directory contains two miniature samples. HG002_part_gAIRR-seq_R1.fasta and HG002_part_gAIRR-seq_R2.fasta are a small part of the paired-end gAIRR-seq reads sequenced from HG002. HG002-S22-H1-000000F_1900000-2900000.fasta is a genome assembly sequence extracted from (Garg, S. et al, 2021). The genome sequence is the 1900000:2900000 segment from the contig HG002-S22-H1 of HG002's maternal haplotype assembly.
In the example settings, running
$ gAIRR_call -wd target_call -id HG002-part -rd1 gAIRR_suite/example/samples/HG002_part_gAIRR-seq_R1.fastq.gz -rd2 gAIRR_suite/example/samples/HG002_part_gAIRR-seq_R2.fastq.gz
or ./scripts/AIRRCall.sh
will gAIRR-call HG002's AIRR alleles based on HG002_part_gAIRR-seq_R1.fasta and HG002_part_gAIRR-seq_R2.fasta.
Running
$ gAIRR_annotate -wd target_annotate -id HG002-part -a1 gAIRR_suite/example/samples/HG002-S22-H1-000000F_1900000-2900000.fasta
or ./scripts/AIRRAnnotate.sh
will gAIRR-annotate part of HG002's genome assembly, HG002-S22-H1-000000F_1900000-2900000.fasta. In ./scripts/AIRRAnnotate.sh, several shell commands are commented out. The commented-out commands are the settings for gAIRR-annotating two phased assemblies, whereas the example gAIRR-annotates a single-strand genome assembly.
Source Distribution
File details
Details for the file gairr_suite-0.3.2.tar.gz.
File metadata
- Download URL: gairr_suite-0.3.2.tar.gz
- Upload date:
- Size: 217.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4700055835efcb2ba6a507a23c42754a1cf7e90cb31d3dfe4310222138efde36 |
| MD5 | 5ebac31b93bc1d7fa422991959dae88b |
| BLAKE2b-256 | 20d7eff3643ae33ca9fce0349914f2a7ceee34d8352d702ad2628fe4d25cb6c1 |
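A downloaded source distribution can be checked against the SHA256 digest listed above. This is a minimal sketch using the standard library; the helper names are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest listed for gairr_suite-0.3.2.tar.gz in the table above.
EXPECTED_SHA256 = "4700055835efcb2ba6a507a23c42754a1cf7e90cb31d3dfe4310222138efde36"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest equals the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("gairr_suite-0.3.2.tar.gz")` should return True for an intact download of that file.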