VDJ-Insights provides a robust framework for the accurate annotation of complex genomic immune regions.

VDJ-Insights

Introduction

VDJ-Insights is a software package for accurate annotation of the V, D, and J gene segments within immunoglobulin (IG) and T-cell receptor (TCR) genomic regions. In addition to segment annotation, it evaluates gene functionality, detects recombination signal sequences (RSS), and annotates complementarity-determining regions 1 and 2 (CDR1 and CDR2). These features extend VDJ-Insights beyond gene annotation to functional immunogenetics, and support evolutionary and comparative analyses at the individual, population, and species levels.


Installation

VDJ-Insights is currently only supported on Linux systems. Before running the pipeline, please ensure that Python (version 3.7 or higher) and Conda are installed on your system. You can install VDJ-Insights using one of the following methods:
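
The prerequisites above can be checked from Python before installing. This is a minimal sketch and not part of VDJ-Insights: the helper name check_prerequisites is hypothetical, and Conda is detected only by looking for a conda executable on the PATH.

```python
import platform
import shutil
import sys


def check_prerequisites(system=None, version=None, conda_path=None):
    """Return a list of unmet prerequisites (empty when all are met).

    Hypothetical helper, not a VDJ-Insights function. The arguments default
    to the values of the running interpreter and can be overridden to
    exercise the logic.
    """
    system = platform.system() if system is None else system
    version = sys.version_info[:2] if version is None else tuple(version)
    # shutil.which returns None when no `conda` executable is on the PATH.
    conda_path = shutil.which("conda") if conda_path is None else conda_path

    problems = []
    if system != "Linux":
        problems.append(f"unsupported operating system: {system}")
    if version < (3, 7):
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.7")
    if not conda_path:
        problems.append("conda was not found on the PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

An empty list means the three documented requirements (Linux, Python 3.7 or higher, Conda) appear to be satisfied.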

Option 1: Clone the repository

  1. Clone the VDJ-Insights repository:

    git clone https://github.com/BPRC-CGR/VDJ-insights
    
  2. Navigate to the repository directory:

    cd VDJ-insights
    
  3. Run the pipeline using Python's -m option:

    python -m vdj_insights <annotation|html> [arguments]
    

Note: When cloning the repository, the pipeline must always be executed using the python -m option. This ensures that Python correctly recognizes the package structure and runs the pipeline without additional installation steps.

Option 2: Install via pip

  1. Use pip to install VDJ-Insights:
    pip install vdj_insights
    
  2. Run the pipeline:
    vdj_insights <annotation|html> [arguments]
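
With either option, a quick way to confirm that Python can locate the package is to ask the import system for it. The sketch below uses only the standard library; is_importable is a hypothetical helper name, and vdj_insights is the package name taken from the commands above.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # For a cloned repository this is expected to succeed only when run from
    # the repository directory; after `pip install` it should succeed anywhere.
    print("vdj_insights importable:", is_importable("vdj_insights"))
```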
    

Using VDJ-Insights

Use the following command to run the annotation pipeline. The line lists all available flags; the notes under "Important notes" describe which flags cannot be combined:

    python -m vdj_insights annotation (-a <assembly_directory> | -i <region_directory>) -l <library_directory/library.fasta> -r <receptor_type> -s <species_name> -f <flanking_genes> -t <threads> -m <mapping_tool> -M <metadata_file> -o <output_directory> --default

Required Arguments:

Argument | Description | Example
-r, --receptor-type | Type of receptor to analyze. Choices: IG (immunoglobulin) or TR (T-cell receptor). Required when using --default. | -r TR
-i, --input or -a, --assembly | Directory containing either extracted sequence regions (--input), that is, sequences of the region of interest already isolated from a genome assembly, or complete genome assembly files (--assembly). | -i /path/to/region or -a /path/to/assembly
-l, --library | Path to the FASTA library file containing reference V(D)J segment sequences. | -l /path/to/library.fasta
-f, --flanking-genes | Flanking genes for each region, provided as key-value pairs in JSON format (region name mapped to its two flanking genes). If only one flanking gene is present, use "-" as a placeholder for the missing side. | -f '{"IGH": ["PACS2", "-"], "IGK": ["RPIA", "PAX8"], "IGL": ["GANZ", "TOP3B"]}'
-s, --species | Scientific species name (e.g., Homo sapiens). | -s "Homo sapiens"
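
Writing the -f value by hand is error-prone because of the nested quoting. As a sketch, the JSON string can be generated with Python's json module. The helper name flanking_genes_argument is hypothetical, and representing a missing gene as None is an assumption of this example, not a VDJ-Insights convention.

```python
import json


def flanking_genes_argument(flanking):
    """Serialise a {region: (left_gene, right_gene)} mapping for -f.

    A missing gene (None) is written as "-", the placeholder described in
    the table above. Hypothetical helper, not part of VDJ-Insights.
    """
    payload = {}
    for region, genes in flanking.items():
        if len(genes) != 2:
            raise ValueError(
                f"{region}: expected two flanking genes, got {len(genes)}"
            )
        payload[region] = ["-" if gene is None else gene for gene in genes]
    return json.dumps(payload)


if __name__ == "__main__":
    value = flanking_genes_argument({"IGH": ("PACS2", None), "IGK": ("RPIA", "PAX8")})
    print(value)
```

When the command is launched through Python's subprocess module with an argument list, the returned string can be passed directly after -f without additional shell quoting.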

Optional Arguments:

Argument | Description | Example
-M, --metadata | Path to the metadata file (.xlsx). An example template is available for download. | -M metadata.xlsx
-o, --output | Output directory for the results (Default: annotation_results). | -o /path/to/output
-m, --mapping-tool | Available mapping tools: minimap2, bowtie, bowtie2 (Default: all). | -m minimap2
-t, --threads | Number of threads for parallel processing (Default: 8). | -t 16
--default | Use default settings (cannot be used with --flanking-genes). | --default
-S, --scaffolding | Path to reference genome (FASTA). Only supported for phased assembly files. | -S /path/to/reference.fasta
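
For scripted runs, the flags in the two tables can be assembled into an argument list. The following is a sketch for the --default mode only. The function build_annotation_command is hypothetical, it uses only flags documented above, and it assumes the vdj_insights entry point from the pip installation.

```python
import shlex


def build_annotation_command(assembly, receptor, species, output=None,
                             threads=None, mapping_tool=None, metadata=None):
    """Build an argument list for the `annotation` subcommand in --default mode.

    Hypothetical helper. Optional values left as None are omitted, so the
    tool's own defaults apply.
    """
    if receptor not in ("IG", "TR"):
        raise ValueError("receptor must be 'IG' or 'TR'")
    if mapping_tool is not None and mapping_tool not in ("minimap2", "bowtie", "bowtie2"):
        raise ValueError("mapping_tool must be minimap2, bowtie or bowtie2")

    command = ["vdj_insights", "annotation", "-a", str(assembly),
               "-r", receptor, "-s", species, "--default"]
    optional = [("-o", output), ("-t", threads),
                ("-m", mapping_tool), ("-M", metadata)]
    for flag, value in optional:
        if value is not None:
            command += [flag, str(value)]
    return command


if __name__ == "__main__":
    cmd = build_annotation_command("/path/to/assembly", "IG", "Homo sapiens", threads=16)
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be handed to subprocess.run(cmd, check=True), which avoids quoting problems with values such as "Homo sapiens".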

Important notes

  • If using the -i/--input flag, do not specify -f/--flanking-genes, as flanking genes are only required when defining regions of interest from a complete genome assembly using -a/--assembly.
  • If using the -i/--input flag, input file(s) should be named in the format <sample-name>_<region>.fasta and must be located in the indicated directory.
  • If using the --default flag, do not specify -f/--flanking-genes as they are mutually exclusive.
  • If using the --default flag, the annotation tool automatically downloads the appropriate V(D)J gene segment library based on the specified receptor type (-r) and species (-s). There is no need to define flanking genes manually or provide a local library file.
  • If using the --scaffolding flag, RagTag scaffolding requires a phased assembly as input. If the input assembly contains contigs of both haplotypes, it should be phased beforehand.
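
The rules in these notes can be expressed as a small pre-flight check. This sketch is not the tool's own argument parser: both function names are hypothetical, and splitting the file name at the last underscore is an assumption about how <sample-name>_<region>.fasta is meant to be read.

```python
from pathlib import Path


def check_flag_combination(use_input, use_assembly, use_default, has_flanking_genes):
    """Return a list of violations of the documented flag rules."""
    errors = []
    if use_input == use_assembly:
        errors.append("specify exactly one of -i/--input or -a/--assembly")
    if use_input and has_flanking_genes:
        errors.append("-f/--flanking-genes must not be combined with -i/--input")
    if use_default and has_flanking_genes:
        errors.append("--default and -f/--flanking-genes are mutually exclusive")
    return errors


def split_region_filename(filename):
    """Split '<sample-name>_<region>.fasta' into (sample_name, region).

    Assumption: the region is the text after the last underscore, so sample
    names may themselves contain underscores.
    """
    path = Path(filename)
    sample, sep, region = path.stem.rpartition("_")
    if path.suffix != ".fasta" or not sep or not sample or not region:
        raise ValueError(f"{filename!r} does not match <sample-name>_<region>.fasta")
    return sample, region


if __name__ == "__main__":
    print(check_flag_combination(True, False, True, False))
    print(split_region_filename("/path/to/region/Sample_001_IGH.fasta"))
```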

Example

  1. Download the T2T-CHM13v2.0 assembly file from the T2T Consortium (GCA_009914755.4) using the following command:

    wget https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/genome/Homo_sapiens-GCA_009914755.4-unmasked.fa.gz
    
  2. Extract the assembly file:

    gunzip Homo_sapiens-GCA_009914755.4-unmasked.fa.gz
    
  3. Run VDJ-Insights using the T2T assembly:

    python -m vdj_insights annotation -a /path/to/Homo_sapiens-GCA_009914755.4-unmasked.fa -r IG -s "Homo sapiens" --default
    

    or

    vdj_insights annotation -a /path/to/Homo_sapiens-GCA_009914755.4-unmasked.fa -r IG -s "Homo sapiens" --default
    

When the --default flag is used, VDJ-Insights automatically downloads the appropriate V(D)J segment library for the specified receptor type (-r) and species (-s) from IMGT, when available. It is not necessary to specify flanking genes or provide a local library file.
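
If the example is driven from a Python script, the extraction in step 2 can be done with the standard library instead of gunzip. This is a sketch; gunzip_file is a hypothetical helper and, unlike the gunzip command, it leaves the compressed file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path):
    """Decompress `gz_path` next to the original and return the new path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name!r}")
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(gz_path, "rb") as source, open(out_path, "wb") as target:
        # Stream in chunks so a multi-gigabyte assembly is never held in memory.
        shutil.copyfileobj(source, target)
    return out_path


if __name__ == "__main__":
    archive = Path("Homo_sapiens-GCA_009914755.4-unmasked.fa.gz")
    if archive.exists():
        print("Extracted to", gunzip_file(archive))
```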

Annotation results

The results generated by VDJ-Insights are stored in the annotation directory. This directory includes the following Excel files:

  • annotation_report_known.xlsx contains information on known V, D, and J gene segments, including recombination signal sequences.
  • annotation_report_novel.xlsx contains information on novel V, D, and J gene segments, including recombination signal sequences.
  • annotation_report_all.xlsx combines information on both known and novel V, D, and J gene segments.
  • tmp/blast_results.xlsx contains the BLAST search results used for validation of annotations.
  • tmp/report.xlsx provides a summary of the overall findings from the alignment analyses.
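
After a run, a script can confirm that the files listed above were produced. This is a minimal sketch: missing_reports is a hypothetical helper, and the caller supplies the location of the annotation directory, since its exact path depends on the chosen output directory.

```python
from pathlib import Path

# File names as listed above, relative to the annotation directory.
REPORT_FILES = (
    "annotation_report_known.xlsx",
    "annotation_report_novel.xlsx",
    "annotation_report_all.xlsx",
    "tmp/blast_results.xlsx",
    "tmp/report.xlsx",
)


def missing_reports(annotation_dir):
    """Return the listed report files that are absent from `annotation_dir`."""
    base = Path(annotation_dir)
    return [name for name in REPORT_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    for name in missing_reports("/path/to/output/annotation"):
        print("Not found:", name)
```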

Each annotation report (known or novel) includes the following columns, providing detailed information about the identified segments:

Column | Explanation | Example
Sample | The name of the sample. | Sample_001
Haplotype | The haplotype ID (for example, maternal or paternal). | 1 or mat
Region | The annotated region. | IGHV
Segment | The gene segment type. | V
Start coord | The start coordinate on the annotated contig. | 12345
End coord | The end coordinate on the annotated contig. | 12789
Strand | Segment orientation: + indicates 5' to 3' direction, and - indicates 3' to 5' direction. | +
Library name | The closest reference gene segment name associated with the identified segment. | IGHV3-23*01
Target name | The name assigned to the novel gene segment, based on the closest reference gene, with "like" appended to indicate similarity. | IGHV3-23-like
Short name | The gene name, as defined by IMGT nomenclature standards. | IGHV3*01
Similar references | Other reference gene segments sharing the same start and end coordinates; the best match is selected based on the mutation count and the reference gene name. | IGHV3-33*02
Target sequence | The nucleotide sequence of the novel gene segment. | ATGGTGCAAGC...
Library sequence | The nucleotide sequence of the closest reference gene segment. | ATGGTGCAAAC...
Mismatches | The total number of mismatches observed between the novel segment and the reference sequence. | 3
% Mismatches of total alignment | The percentage of mismatches relative to the total alignment length between the identified segment and the reference. | 1.5%
% identity | The percentage of identical bases between the identified segment and the reference over the full alignment. | 98.5%
BTOP | BLAST traceback string that describes the exact location of substitutions, insertions, and deletions in the alignment. | 10A5G3T
SNPs | The number of single nucleotide polymorphisms (SNPs) relative to the reference. | 2
Insertions | The number of insertions relative to the reference. | 1
Deletions | The number of deletions relative to the reference. | 0
Mapping tool | The name(s) of the mapping tool(s) used for gene segment annotation. | Minimap2
Function | The functional classification of the segment: "F/ORF" for functional/open reading frame, "P" for potentially functional/open reading frame, or "pseudogene" if an early stop codon is detected. | F/ORF
Status | Indicates whether the gene segment is classified as Known or Novel. | Novel
Message | A generated message for the segment if stop codons are detected at critical positions. | The STOP-CODON at the 3' end of the V-REGION can be deleted by rearrangement
Population | The population group associated with the sample, if metadata is provided. | Dutch
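
To illustrate how the mismatch, gap, and identity columns relate to a traceback string, the sketch below summarises a string in the standard BLAST BTOP encoding, where a number is a run of matching positions and each two-character pair is one aligned column (query base, then subject base, with "-" marking a gap). It is not the code VDJ-Insights uses. The function name is hypothetical, and because it is an assumption which side of the alignment the reports treat as the reference, gaps are reported per side and not labelled as insertions or deletions.

```python
import re

# One aligned column: a substitution, or a gap on exactly one side.
_PAIR = r"[A-Za-z*]{2}|[A-Za-z*]-|-[A-Za-z*]"
_TOKEN = re.compile(rf"\d+|{_PAIR}")
_WHOLE = re.compile(rf"(?:\d+|{_PAIR})*")


def summarise_btop(btop):
    """Count matches, substitutions and gaps in a BLAST BTOP string."""
    if not _WHOLE.fullmatch(btop):
        raise ValueError(f"not a valid BTOP string: {btop!r}")
    counts = {"matches": 0, "substitutions": 0, "query_gaps": 0, "subject_gaps": 0}
    for token in _TOKEN.findall(btop):
        if token.isdigit():
            counts["matches"] += int(token)   # run of identical positions
        elif token[0] == "-":
            counts["query_gaps"] += 1         # base present only in the subject
        elif token[1] == "-":
            counts["subject_gaps"] += 1       # base present only in the query
        else:
            counts["substitutions"] += 1
    length = sum(counts.values())             # every column counted exactly once
    counts["alignment_length"] = length
    counts["percent_identity"] = 100.0 * counts["matches"] / length if length else 0.0
    return counts


if __name__ == "__main__":
    print(summarise_btop("7AG39"))
```

Here percent identity is computed as matching columns divided by alignment length, which mirrors the description of the % identity column.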

Web interface report

The pipeline includes an interactive web interface for visualizing and exploring the annotation results. The web-based Flask report can be generated and opened using the following command:

    python -m vdj_insights html -i /path/to/output --show

or

    vdj_insights html -i /path/to/output --show

Citing VDJ-Insights

If VDJ-Insights contributes to your research, please cite:

Acknowledgements

VDJ-Insights was developed by the department of Comparative Genetics & Refinement of the Biomedical Primate Research Centre (BPRC) in Rijswijk, the Netherlands.
