Profiling genes encoding the adaptive immune receptor repertoire with gAIRR Suite.

Updated: Nov 02, 2025

gAIRR-suite: Profiling genes encoding the adaptive immune receptor repertoire

gAIRR-annotate provides annotations of IG and TR genes on personal assembly data.

gAIRR-call genotypes IG and TR genes from short-read data sequenced from germline cells. It is designed for probe-captured gAIRR-seq data, but it can also be applied to other types of short reads enriched from IG and TR regions.

The database materials are updated to the latest IMGT/GENE-DB (as of 2025/11/02).
Single-end read input is allowed in this version.

Prerequisite programs:

BWA aligner (0.7.17)
SPAdes assembler (v3.13.0)

Installation

pip install gAIRR-suite

GitHub

git clone https://github.com/maojanlin/gAIRRsuite.git
cd gAIRRsuite

Though optional, it is good practice to install in a virtual environment to manage the dependencies:

python -m venv venv
source venv/bin/activate

With the virtual environment (named venv) activated, install the package:

python setup.py install

gAIRR-annotate usage

$ gAIRR_annotate -wd <work_dir> -id <sample_id> -a1 <assembly_h1.fa> -a2 <assembly_h2.fa>

The -a2 argument is optional; provide it to annotate the second haplotype of a diploid personal assembly.

The final annotation report is work_dir/sample_id/group_genes.1.bed, and work_dir/sample_id/group_genes.2.bed if the second haplotype of the assembly is provided.

Novel alleles in the annotation are marked with parentheses that indicate the edit distance between the novel allele and the documented gene. For example, TRAV8-3*01(i:1) means the annotated gene has an edit distance of 1 from TRAV8-3*01, while TRAV19*01(i:0,h:17) has an edit distance of 0 from the original allele but has 17 clipped bases.
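The labels described above can be split into the documented allele name and its tags. The following is a minimal sketch, assuming the label always has the form `NAME(key:value,...)` with integer values, as in the two examples; the function name `parse_allele_label` is hypothetical and not part of gAIRR-suite.

```python
import re
from typing import Dict, Tuple

# Allele name, optionally followed by one parenthesised, comma-separated tag list.
_LABEL = re.compile(r"^(?P<allele>[^()]+?)(?:\((?P<tags>[^()]*)\))?$")


def parse_allele_label(label: str) -> Tuple[str, Dict[str, int]]:
    """Split e.g. 'TRAV19*01(i:0,h:17)' into ('TRAV19*01', {'i': 0, 'h': 17}).

    Hypothetical helper: the tag grammar is inferred from the README examples only.
    """
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised allele label: {label!r}")
    tags: Dict[str, int] = {}
    if match.group("tags"):
        for item in match.group("tags").split(","):
            key, _, value = item.partition(":")
            tags[key.strip()] = int(value)  # raises ValueError on a malformed tag
    return match.group("allele"), tags


if __name__ == "__main__":
    for example in ("TRAV8-3*01(i:1)", "TRAV19*01(i:0,h:17)"):
        print(parse_allele_label(example))
```

A label without parentheses is returned with an empty tag dictionary, which is how this sketch treats a documented (non-novel) allele.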

The novel allele sequence can be found in work_dir/sample_id/novel_sample_ID_geneLocus.fasta.1/2.bed

If only IG or TR genes are of interest, the option -lc IG or -lc TR can be specified.
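When gAIRR_annotate is driven from a script, the optional arguments can be added conditionally. The sketch below only builds the argument list from the flags documented above (-wd, -id, -a1, -a2, -lc) and does not execute anything; the helper name `build_annotate_command` is an assumption for illustration.

```python
import shlex
from typing import List, Optional


def build_annotate_command(
    work_dir: str,
    sample_id: str,
    assembly_h1: str,
    assembly_h2: Optional[str] = None,
    locus: Optional[str] = None,
) -> List[str]:
    """Assemble a gAIRR_annotate argument list (hypothetical helper)."""
    cmd = ["gAIRR_annotate", "-wd", work_dir, "-id", sample_id, "-a1", assembly_h1]
    if assembly_h2 is not None:
        cmd += ["-a2", assembly_h2]  # second haplotype, optional
    if locus is not None:
        cmd += ["-lc", locus]  # "IG" or "TR", per the usage notes
    return cmd


if __name__ == "__main__":
    cmd = build_annotate_command("target_annotate", "HG002-part", "assembly_h1.fa")
    # Print a shell-ready string; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a string avoids shell quoting problems if a path contains spaces.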

gAIRR-call usage

$ gAIRR_call -wd <work_dir> -id <sample_ID> -rd1 <read.R1.fastq.gz> -rd2 <read.R2.fastq.gz> -lc <TRV TRJ TRD IGV IGJ IGD>

The preferred IG/TR loci and V/D/J genes can be specified with the -lc option.

The final calling report is work_dir/sample_ID/gene-locus/gAIRR-call_report.rpt, where gene-locus is the targeted gene set, such as TCRV or TCRJ.

The novel allele sequence from the gAIRR-call_report.rpt can be found in work_dir/sample_ID/gene-locus_novel/gene-locus_with_novel.fasta

The --flanking option can be specified to run the flanking sequence assembly algorithm. The flanking sequence information can be found in work_dir/sample_ID/gene-locus_flanking/flanking_result/flanking_haplotypes.fasta
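The three output locations named above follow one pattern, so a script can derive them from the work directory, sample ID, and gene-locus name. This is a sketch that only composes the paths as written in this section; the helper name `call_output_paths` is hypothetical, and the gene-locus string (for example TCRV) must match whatever directory name gAIRR_call actually creates.

```python
from pathlib import PurePosixPath
from typing import Dict


def call_output_paths(work_dir: str, sample_id: str, gene_locus: str) -> Dict[str, PurePosixPath]:
    """Compose the gAIRR-call output paths described in the usage notes."""
    base = PurePosixPath(work_dir) / sample_id
    return {
        # Final calling report
        "report": base / gene_locus / "gAIRR-call_report.rpt",
        # Novel allele sequences referenced by the report
        "novel": base / f"{gene_locus}_novel" / f"{gene_locus}_with_novel.fasta",
        # Only produced when --flanking is specified
        "flanking": base / f"{gene_locus}_flanking" / "flanking_result" / "flanking_haplotypes.fasta",
    }


if __name__ == "__main__":
    for name, path in call_output_paths("target_call", "HG002-part", "TCRV").items():
        print(f"{name}: {path}")
```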

Note that the serial numbers provided by gAIRR_call and gAIRR_annotate do not necessarily correspond. When there are multiple novel alleles, the serial numbers, which start from 0, can be in a different order.

Checking RSS

Usage:

./scripts/check_RSS.sh

The check_RSS.sh pipeline uses the RSS sequences and the separate heptamer and nonamer sequences downloaded from the IMGT database to check whether the flanking sequences contain proper RSS patterns.

To run the check_RSS.sh pipeline, BWA aligner should be installed.

The check_RSS.sh pipeline first aligns all known IMGT RSSs to the flanking sequences to check for identical or near-identical RSS patterns. The flanking sequences missing an RSS are then recorded in RSS_checking/first_scan/missing_RSS_HG002-part_TCRJ_first_scan.fasta and passed to the second scan. The second scan aligns heptamer and nonamer sequences separately to the flanking sequences and tries to identify heptamer-nonamer pairs that resemble proper RSSs.
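The idea behind the second scan, pairing a heptamer with a nonamer at a plausible distance, can be illustrated with a much simpler stand-in. The sketch below is not how check_RSS.sh works (the pipeline aligns IMGT sequences with BWA); it uses exact string matching against the textbook consensus heptamer CACAGTG and nonamer ACAAAAACC with 12 bp or 23 bp spacers, in the heptamer-first orientation only. All names and constants here are assumptions for illustration.

```python
from typing import List, Tuple

HEPTAMER = "CACAGTG"    # consensus heptamer (assumption: exact match only)
NONAMER = "ACAAAAACC"   # consensus nonamer (assumption: exact match only)
SPACERS = (12, 23)      # canonical spacer lengths in bp


def find_rss_candidates(flanking: str) -> List[Tuple[int, int]]:
    """Return (heptamer_start, spacer_length) for each heptamer-spacer-nonamer hit."""
    seq = flanking.upper()
    hits: List[Tuple[int, int]] = []
    start = seq.find(HEPTAMER)
    while start != -1:
        for spacer in SPACERS:
            nonamer_start = start + len(HEPTAMER) + spacer
            if seq[nonamer_start:nonamer_start + len(NONAMER)] == NONAMER:
                hits.append((start, spacer))
        # Continue searching one base further so overlapping heptamers are not missed.
        start = seq.find(HEPTAMER, start + 1)
    return hits


if __name__ == "__main__":
    toy = "GG" + HEPTAMER + "T" * 12 + NONAMER + "AA"
    print(find_rss_candidates(toy))
```

A real scan would also need to tolerate mismatches and handle the reverse orientation (nonamer first), which is why the pipeline relies on an aligner instead of exact matching.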

Generated files: RSS_checking/first_scan/database_HG002-part_TCRJ_first_scan.csv is the RSS report file of HG002-part. It indicates whether each RSS is known, novel, or could not be found in the first scan. RSS_checking/second_scan/database_HG002-part_TCRJ_second_scan.csv is the RSS report file for the flanking sequences whose RSS was missed in the first scan.

Data collection pipeline

Usage:

./scripts/database_collect.sh
./scripts/allele_consensus.sh

The database_collect.sh pipeline collects the novel and flanking sequences into database files. Duplicated novel or flanking sequences are collapsed into one. Taking TRV novel alleles as an example, the generated file database_novel_TRV.tsv indicates which samples possess which novel allele, and database_novel_TRV.fasta records the novel allele sequences.
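The collapsing step can be pictured as grouping samples by sequence. The following sketch shows that idea only; it is not the implementation in database_collect.sh, and the function name, input shape, and sample IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def collapse_sequences(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each distinct sequence to the sorted list of samples carrying it.

    `records` yields (sample_id, sequence) pairs; sequences are compared
    case-insensitively (an assumption of this sketch).
    """
    samples_by_seq: Dict[str, Set[str]] = defaultdict(set)
    for sample_id, sequence in records:
        samples_by_seq[sequence.upper()].add(sample_id)
    return {seq: sorted(samples) for seq, samples in samples_by_seq.items()}


if __name__ == "__main__":
    # Hypothetical sample IDs and toy sequences.
    records = [("sampleA", "ACGT"), ("sampleB", "acgt"), ("sampleB", "TTGA")]
    for seq, samples in collapse_sequences(records).items():
        print(seq, samples)
```

The resulting mapping corresponds to the two outputs described above: the keys are what a FASTA file would record, and the sample lists are what a TSV of sample-to-allele membership would record.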

For samples with multiple assemblies, consensus allele results can be obtained from the allele_consensus.sh pipeline. Taking database_novel_TRV.tsv and database_novel_TRV.fasta as input, allele_consensus.sh generates database_novel_TRV_consensus.tsv and database_novel_TRV_consensus.fasta as output according to ./example/samples/consensus_name_HGSVC.log.

In ./example/samples/consensus_name_HGSVC.log, terms are separated by spaces. The first term is the consensus name, while the later terms indicate the IDs of the sample's different assemblies.
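The log format described above is simple enough to read with a few lines of code. This is a minimal sketch, assuming one consensus entry per line; the function name and the sample IDs in the example are hypothetical and are not taken from the actual log file.

```python
from typing import Dict, List


def parse_consensus_names(text: str) -> Dict[str, List[str]]:
    """Map each consensus name to the assembly IDs listed after it."""
    mapping: Dict[str, List[str]] = {}
    for line in text.splitlines():
        terms = line.split()  # terms are separated by spaces
        if not terms:
            continue  # skip blank lines
        mapping[terms[0]] = terms[1:]
    return mapping


if __name__ == "__main__":
    # Hypothetical content in the described format.
    example = "sampleX sampleX-asm1 sampleX-asm2\nsampleY sampleY-asm1\n"
    print(parse_consensus_names(example))
```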

Example

The gAIRR_suite/material/ directory contains IMGT allele sequences and RSS information. The example/samples/ directory contains two miniature samples. HG002_part_gAIRR-seq_R1.fasta and HG002_part_gAIRR-seq_R2.fasta are a small part of the paired-end gAIRR-seq reads sequenced from HG002. HG002-S22-H1-000000F_1900000-2900000.fasta is a genome assembly sequence extracted from (Garg, S. et al, 2021). The genome sequence is the 1900000:2900000 segment from the contig HG002-S22-H1 of HG002's maternal haplotype assembly.

In the example settings, running

$ gAIRR_call -wd target_call -id HG002-part -rd1 gAIRR_suite/example/samples/HG002_part_gAIRR-seq_R1.fastq.gz -rd2 gAIRR_suite/example/samples/HG002_part_gAIRR-seq_R2.fastq.gz

or ./scripts/AIRRCall.sh will run gAIRR-call on HG002's AIRR alleles based on HG002_part_gAIRR-seq_R1.fasta and HG002_part_gAIRR-seq_R2.fasta.

Running

$ gAIRR_annotate -wd target_annotate -id HG002-part -a1 gAIRR_suite/example/samples/HG002-S22-H1-000000F_1900000-2900000.fasta

or ./scripts/AIRRAnnotate.sh will run gAIRR-annotate on part of HG002's genome assembly, HG002-S22-H1-000000F_1900000-2900000.fasta. In ./scripts/AIRRAnnotate.sh, several shell commands are commented out. The commented-out commands are the settings for annotating two phased assemblies, whereas the example annotates a single-strand genome assembly.
