VDJ-Insights provides a robust framework for the accurate annotation of complex genomic immune regions.
VDJ-Insights
Introduction
VDJ-Insights is a software package for accurate annotation of the V, D, and J gene segments within immunoglobulin (IG) and T-cell receptor (TCR) genomic regions. In addition to segment annotation, it evaluates gene functionality, detects recombination signal sequences (RSS), and annotates complementarity-determining regions 1 and 2 (CDR1 and CDR2). These features extend VDJ-Insights beyond gene annotation, providing a framework for functional immunogenetics and enabling evolutionary and comparative analyses at the individual, population, and species levels.
Installation
VDJ-Insights is currently only supported on Linux systems. Before running the pipeline, please ensure that Python (version 3.7 or higher) and Conda are installed on your system. You can install VDJ-Insights using one of the following methods:
Option 1: Clone the repository
- Clone the VDJ-Insights repository:
    git clone https://github.com/BPRC-CGR/VDJ-insights
- Navigate to the repository directory:
    cd VDJ-insights
- Run the pipeline using Python's -m option:
    python -m vdj_insights <annotation|html> [arguments]

Note: When running from a cloned repository, always execute the pipeline with the python -m option. This ensures that Python recognizes the package structure and runs the pipeline without additional installation steps.
Option 2: Install via pip
- Use pip to install VDJ-Insights:
pip install vdj_insights
- Run the pipeline:
vdj_insights <annotation|html> [arguments]
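
The prerequisites named in this section (Linux, Python 3.7 or higher, Conda) can be checked before installing. The following is a minimal sketch and not part of VDJ-Insights; the function name `find_problems` is hypothetical, and Conda is assumed to be discoverable as `conda` on the PATH.

```python
import platform
import shutil
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the installation notes


def find_problems(python_version, system, conda_path):
    """Return a list of unmet prerequisites (empty when everything is fine).

    Hypothetical helper: the arguments are passed in explicitly so the
    checks can be exercised without touching the real environment.
    """
    problems = []
    if system != "Linux":
        problems.append(f"unsupported operating system: {system}")
    if tuple(python_version[:2]) < MIN_PYTHON:
        found = ".".join(str(part) for part in python_version[:2])
        problems.append(f"Python {found} is older than 3.7")
    if not conda_path:
        problems.append("conda was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = find_problems(
        sys.version_info,
        platform.system(),
        shutil.which("conda"),  # assumption: the executable is named 'conda'
    )
    for issue in issues:
        print("Problem:", issue)
    if not issues:
        print("All prerequisites found.")
```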
Using VDJ-Insights
Use the following command to run the annotation script:
python -m vdj_insights annotation -a <assembly_directory> | -i <region_directory> -l <library_directory/library.fasta> -r <receptor_type> -s <species_name> -f <flanking_genes> -t <threads> -m <mapping_tool> -M <metadata_directory> -o <output_directory> --default
Required Arguments:
| Argument | Description | Example |
|---|---|---|
| `-r`, `--receptor-type` | Type of receptor to analyze. Choices: IG (immunoglobulin) or TR (T-cell receptor). Required when using `--default`. | `-r TR` |
| `-i`, `--input` or `-a`, `--assembly` | Directory containing either extracted sequence regions (`--input`), that is, sequences of the region of interest already isolated from a genome assembly, or complete genome assembly files (`--assembly`). | `-i /path/to/region` or `-a /path/to/assembly` |
| `-l`, `--library` | Path to the FASTA library file containing reference V(D)J segment sequences. | `-l /path/to/library.fasta` |
| `-f`, `--flanking-genes` | Comma-separated list of flanking genes provided as key-value pairs in JSON format. If only one flanking gene is present, use "-" as a placeholder for the missing side. | `-f '{"IGH": ["PACS2", "-"], "IGK": ["RPIA", "PAX8"], "IGL": ["GANZ", "TOP3B"]}'` |
| `-s`, `--species` | Scientific species name (e.g., Homo sapiens). | `-s "Homo sapiens"` |
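
Hand-writing the JSON value for `-f` is error-prone because of the nested quoting. The sketch below builds it with Python's `json` module. It is an illustration and not part of VDJ-Insights: the helper name `flanking_genes_argument` is hypothetical, and using `None` for a missing flank is a convention chosen for this example.

```python
import json
import shlex


def flanking_genes_argument(flanks):
    """Serialize {region: (left_gene, right_gene)} into the JSON text for -f.

    A missing flank (None or empty string) is written as "-", the
    placeholder described in the arguments table.
    """
    payload = {}
    for region, pair in flanks.items():
        if len(pair) != 2:
            raise ValueError(f"{region}: expected exactly two flanking genes")
        if not pair[0] and not pair[1]:
            raise ValueError(f"{region}: at least one flanking gene is needed")
        payload[region] = [gene if gene else "-" for gene in pair]
    return json.dumps(payload)


if __name__ == "__main__":
    value = flanking_genes_argument({
        "IGH": ("PACS2", None),
        "IGK": ("RPIA", "PAX8"),
    })
    # shlex.quote adds the single quotes needed to paste the value into a shell.
    print("-f", shlex.quote(value))
```

Passing the result as a single list element to `subprocess.run` avoids shell quoting altogether; `shlex.quote` is only needed when the command is pasted into a terminal.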
Optional Arguments:
| Argument | Description | Example |
|---|---|---|
| `-M`, `--metadata` | Path to the metadata file (.xlsx). An example template is available for download. | `-M metadata.xlsx` |
| `-o`, `--output` | Output directory for the results (Default: annotation_results). | `-o /path/to/output` |
| `-m`, `--mapping-tool` | Available mapping tools: minimap2, bowtie, bowtie2. (Default: all). | `-m minimap2` |
| `-t`, `--threads` | Number of threads for parallel processing (Default: 8). | `-t 16` |
| `--default` | Use default settings (cannot be used with `--flanking-genes`). | `--default` |
| `-S`, `--scaffolding` | Path to reference genome (FASTA). Only supported for phased assembly files. | `-S /path/to/reference.fasta` |
Important notes
- If using the `-i/--input` flag, do not specify `-f/--flanking-genes`, as flanking genes are only required when defining regions of interest from a complete genome assembly using `-a/--assembly`.
- If using the `-i/--input` flag, input file(s) should be named in the format `<sample-name>_<region>.fasta` and must be located in the indicated directory.
- If using the `--default` flag, do not specify `-f/--flanking-genes`, as they are mutually exclusive.
- If using the `--default` flag, the annotation tool automatically downloads the appropriate V(D)J gene segment library based on the specified receptor type (`-r`) and species (`-s`). There is no need to define flanking genes manually or provide a local library file.
- If using the `--scaffolding` flag, RagTag scaffolding requires a phased assembly as input. If the input assembly contains contigs of both haplotypes, it should be phased beforehand.
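
Two of the notes above lend themselves to a quick pre-flight check: the flag combinations that exclude `-f/--flanking-genes`, and the `<sample-name>_<region>.fasta` naming format. The sketch below is illustrative only and does not reflect how VDJ-Insights validates its input. Both function names are hypothetical, and the region is assumed to be the text after the last underscore, so that sample names may themselves contain underscores.

```python
import re

# Assumption: the region is everything between the last "_" and ".fasta".
INPUT_NAME = re.compile(r"^(?P<sample>.+)_(?P<region>[^_]+)\.fasta$")


def split_input_name(filename):
    """Split '<sample-name>_<region>.fasta' into (sample, region)."""
    match = INPUT_NAME.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} is not named <sample-name>_<region>.fasta")
    return match["sample"], match["region"]


def check_flag_combination(uses_input, uses_default, has_flanking_genes):
    """Return the rules from the notes that a set of flags would violate."""
    problems = []
    if uses_input and has_flanking_genes:
        problems.append("-f/--flanking-genes must not be combined with -i/--input")
    if uses_default and has_flanking_genes:
        problems.append("-f/--flanking-genes must not be combined with --default")
    return problems


if __name__ == "__main__":
    print(split_input_name("Sample_001_IGH.fasta"))
    print(check_flag_combination(uses_input=True, uses_default=False,
                                 has_flanking_genes=True))
```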
Example
- Download the T2T-CHM13v2.0 assembly file from the T2T Consortium (GCA_009914755.4) using the following command:
    wget https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/genome/Homo_sapiens-GCA_009914755.4-unmasked.fa.gz
- Extract the assembly file:
    gunzip Homo_sapiens-GCA_009914755.4-unmasked.fa.gz
- Run VDJ-Insights using the T2T assembly:
    python -m vdj_insights annotation -a /path/to/Homo_sapiens-GCA_009914755.4-unmasked.fa -r IG -s "Homo sapiens" --default
  or
    vdj_insights annotation -a /path/to/Homo_sapiens-GCA_009914755.4-unmasked.fa -r IG -s "Homo sapiens" --default
When the --default flag is used, VDJ-Insights automatically downloads the appropriate V(D)J segment library for the specified receptor type (-r) and species (-s) from IMGT, when available. It is not necessary to specify flanking genes or provide a local library file.
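
When the annotation step is launched from a script, for example once per assembly, the command line can be assembled as a list so that values containing spaces, such as the species name, need no manual quoting. This is a sketch under assumptions: `build_annotation_command` is a hypothetical helper, it uses only the flags documented above, and it assumes the pip-installed `vdj_insights` entry point is on the PATH.

```python
import shlex


def build_annotation_command(assembly, receptor, species, threads=None, output=None):
    """Build the argument list for a --default annotation run.

    Only flags described in this README are used; nothing is executed here.
    """
    if receptor not in ("IG", "TR"):
        raise ValueError("receptor must be 'IG' or 'TR'")
    command = [
        "vdj_insights", "annotation",
        "-a", str(assembly),
        "-r", receptor,
        "-s", species,
        "--default",
    ]
    if threads is not None:
        command += ["-t", str(threads)]
    if output is not None:
        command += ["-o", str(output)]
    return command


def render(command):
    """Render the list as a shell-safe string for logging or copy-paste."""
    return " ".join(shlex.quote(part) for part in command)


if __name__ == "__main__":
    cmd = build_annotation_command("/path/to/assembly", "IG", "Homo sapiens", threads=16)
    print(render(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```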
Annotation results
The results generated by VDJ-Insights are stored in the annotation directory. This directory includes the following Excel files:
- `annotation_report_known.xlsx` contains information on known V, D, and J gene segments, including recombination signal sequences.
- `annotation_report_novel.xlsx` contains information on novel V, D, and J gene segments, including recombination signal sequences.
- `annotation_report_all.xlsx` combines information on both known and novel V, D, and J gene segments.
- `tmp/blast_results.xlsx` contains the BLAST search results used for validation of annotations.
- `tmp/report.xlsx` provides a summary of the overall findings from the alignment analyses.
Each annotation report (known or novel) includes the following columns, providing detailed information about the identified segments:
| Column | Explanation | Example |
|---|---|---|
| Sample | The name of the sample. | Sample_001 |
| Haplotype | The haplotype ID (for example, maternal or paternal). | 1 or mat |
| Region | The annotated region. | IGHV |
| Segment | The gene segment type. | V |
| Start coord | The start coordinate on the annotated contig. | 12345 |
| End coord | The end coordinate on the annotated contig. | 12789 |
| Strand | Segment orientation: + indicates 5' to 3' direction, and - indicates 3' to 5' direction. | + |
| Library name | The closest reference gene segment name associated with the identified segment. | IGHV3-23*01 |
| Target name | The name assigned to the novel gene segment, based on the closest reference gene, with "like" appended to indicate similarity. | IGHV3-23-like |
| Short name | The gene name, as defined by IMGT nomenclature standards. | IGHV3*01 |
| Similar references | Other reference gene segments sharing the same start and end coordinates; the best match is selected based on the mutation count and the reference gene name. | IGHV3-33*02 |
| Target sequence | The nucleotide sequence of the novel gene segment. | ATGGTGCAAGC... |
| Library sequence | The nucleotide sequence of the closest reference gene segment. | ATGGTGCAAAC... |
| Mismatches | The total number of mismatches observed between the novel segment and the reference sequence. | 3 |
| % Mismatches of total alignment | The percentage of mismatches relative to the total alignment length between the identified segment and the reference. | 1.5% |
| % identity | The percentage of identical bases between the identified segment and the reference over the full alignment. | 98.5% |
| BTOP | BLAST traceback string that describes the exact location of substitutions, insertions, and deletions in the alignment. | 10A5G3T |
| SNPs | The number of single nucleotide polymorphisms (SNPs) relative to the reference. | 2 |
| Insertions | The number of insertions relative to the reference. | 1 |
| Deletions | The number of deletions relative to the reference. | 0 |
| Mapping tool | The name(s) of the mapping tool(s) used for gene segment annotation. | Minimap2 |
| Function | The functional classification of the segment: "F/ORF" for functional/open reading frame, "P" for potentially functional/open reading frame, or "pseudogene" if an early stop codon is detected. | F/ORF |
| Status | Indicates whether the gene segment is classified as Known or Novel. | Novel |
| Message | A generated message for the segment if stop codons are detected at critical positions. | The STOP-CODON at the 3' end of the V-REGION can be deleted by rearrangement |
| Population | The population group associated with the sample, if metadata is provided. | Dutch |
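
The BTOP, SNPs, and % identity columns are related: the counts can be recovered from the traceback string. The sketch below assumes the standard BLAST BTOP encoding, in which a number is a run of matching positions and each pair of characters is one aligned column (query character first, subject character second, with "-" marking a gap). It is not taken from the VDJ-Insights source; the function names are hypothetical, and because it is not stated here which sequence VDJ-Insights uses as the query, gaps are reported per side and not as insertions or deletions.

```python
import re

# A token is either a run of matches (digits) or one two-character column.
BTOP_TOKEN = re.compile(r"(\d+)|([A-Za-z*-]{2})")


def summarize_btop(btop):
    """Count matches, substitutions, and gaps in a BLAST BTOP string."""
    counts = {"matches": 0, "snps": 0, "query_gaps": 0, "subject_gaps": 0}
    position = 0
    for token in BTOP_TOKEN.finditer(btop):
        if token.start() != position:
            raise ValueError(f"unexpected character at offset {position}")
        position = token.end()
        if token.group(1):
            counts["matches"] += int(token.group(1))
            continue
        query_char, subject_char = token.group(2)
        if query_char == "-" and subject_char == "-":
            raise ValueError("a column cannot be a gap on both sides")
        if query_char == "-":
            counts["query_gaps"] += 1
        elif subject_char == "-":
            counts["subject_gaps"] += 1
        else:
            counts["snps"] += 1
    if position != len(btop):
        raise ValueError(f"unexpected character at offset {position}")
    return counts


def percent_identity(counts):
    """Matches divided by the total number of alignment columns, as a percentage."""
    total = sum(counts.values())
    return 100.0 * counts["matches"] / total if total else 0.0


if __name__ == "__main__":
    summary = summarize_btop("10AG5-T3C-2")
    print(summary, round(percent_identity(summary), 2))
```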
Web interface report
The pipeline includes an interactive web interface for visualizing and exploring the annotation results. The web-based Flask report can be generated and opened using the following command:
python -m vdj_insights html -i /path/to/output --show
or
vdj_insights html -i /path/to/output --show
Citing VDJ-Insights
If VDJ-Insights contributes to your research, please cite:
Acknowledgements
VDJ-Insights was developed by the department of Comparative Genetics & Refinement of the Biomedical Primate Research Centre (BPRC) in Rijswijk, the Netherlands.