An integrated tool for annotating the motif variation and complex patterns in tandem repeats.

VAMPIRE

Getting Started

# Install VAMPIRE
pip install vampire

# Annotate STRs
vampire anno tests/001-anno_STR.fa tests/001-anno_STR

# Generate simulated TR sequences
vampire generator -m GGC -l 1000 -r 0.01 -p tests/002-generator_reference
vampire generator -m GGC GGT -l 1000 -r 0.01 -p tests/002-generator_reference

# Create reference motifset from VAMPIRE annotation files
vampire mkref tests/003-mkref_data tests/003-mkref_reference.fa

# Evaluate the quality of annotation
vampire evaluate tests/001-anno_STR tests/004-evaluate

# Refine the annotation
vampire refine tests/001-anno_STR tests/005-refine_action.tsv -o tests/005-anno_STR.revised

# Plotting sequence logos to visualize motif variation
vampire logo tests/001-anno_STR tests/006-anno_STR_motif
vampire logo --type annotation tests/001-anno_STR tests/006-anno_STR_annotation

# Calculate the identity matrix for TR sequences
vampire identity -w 5 tests/001-anno_STR tests/007-anno_STR

See Cookbook for more details.

Introduction

VAMPIRE is a unified framework for de novo TR motif annotation, structural decomposition, and variation profiling.

Why VAMPIRE?

  • VAMPIRE reveals tandem repeat (TR) variation beyond simple copy number differences.
  • VAMPIRE's flexible parameter settings allow annotation of TRs with diverse characteristics, ranging from STRs and VNTRs to megabase-scale satellites.
  • VAMPIRE produces detailed results in standard .tsv format, enabling seamless integration with custom analyses and in-depth research.

Installation

# Use singularity (recommended)
singularity pull docker://zikun-yang/vampire:latest

# Install by pip
pip install vampire # mafft must be installed separately to use logo

# Install by conda
conda install vampire

Usage

VAMPIRE currently provides seven subcommands: anno, generator, mkref, evaluate, refine, logo, and identity.

anno - Annotate TR sequences

One basic use of VAMPIRE is to annotate TR sequences in FASTA format. A typical command is as follows:

# de novo annotate TR sequences
vampire anno -t 8 tests/001-anno_STR.fa tests/001-anno_STR

where -t sets the number of threads, tests/001-anno_STR.fa is the input FASTA file, and tests/001-anno_STR is the output prefix. By default, VAMPIRE uses its built-in base motif database to refine and label motifs. This database includes pCht/StSat in Pan and human alpha-satellite monomers from the paper:

Altemose N, Logsdon GA, Bzikadze AV, et al. Complete genomic and epigenetic maps of human centromeres. Science, 2022, 376(6588): eabl4178.

To use a custom motif database, specify it with the -m option.

This command will generate five output files:

  • tests/001-anno_STR.settings.json: annotation parameters used.
  • tests/001-anno_STR.anno.tsv: detailed annotation, including motif, strand, and actual sequence.
  • tests/001-anno_STR.concise.tsv: brief annotation results.
  • tests/001-anno_STR.motif.tsv: motif statistics.
  • tests/001-anno_STR.dist.tsv: motif distance in plus and minus strands.
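
Because these outputs are plain tab-separated text, they can be loaded with standard tools. The following is a minimal Python sketch and not part of VAMPIRE: parse_tsv is a hypothetical helper, and the demo rows are placeholders rather than real VAMPIRE columns, since the column layout is not described here.

```python
import csv
import io


def parse_tsv(handle):
    """Return every row of a tab-separated text stream as a list of strings."""
    return [row for row in csv.reader(handle, delimiter="\t")]


# Placeholder content standing in for a VAMPIRE output file (not real columns).
demo = io.StringIO("field_a\tfield_b\nvalue_1\tvalue_2\n")
rows = parse_tsv(demo)
print(len(rows), "rows,", len(rows[0]), "columns")

# For a real file, open it in text mode first, for example:
# with open("tests/001-anno_STR.concise.tsv", newline="") as handle:
#     rows = parse_tsv(handle)
```

Returning raw rows avoids assuming whether a given file starts with a header line; check the Cookbook before mapping columns to names.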

VAMPIRE can also add motifs from a motif database to the motif set, so TR sequences can be annotated either by combining database motifs with de novo annotation or by using database motifs alone.

# Use motifs from database to annotate (combined with de novo annotation)
vampire anno -f -t 8 [prefix] [output_prefix]

# Only use motifs from database to annotate (without de novo annotation)
vampire anno -f --no-denovo -t 8 [prefix] [output_prefix]

For more detailed instructions and examples, refer to the VAMPIRE Cookbook.

generator - Generate simulated TR sequences

VAMPIRE can generate simulated TR sequences from one or more given motifs, with a user-defined length and mutation rate. The default random seed is 42. To change the random seed, use the -s option.

# Generate simulated TR sequences
vampire generator -m GGC -l 1000 -r 0.01 -p tests/002-generator_reference
vampire generator -m GGC GGT -l 1000 -r 0.01 -p tests/002-generator_reference

This command will output three files:

  • tests/002-generator_reference.fa: the simulated TR sequences in FASTA format.
  • tests/002-generator_reference.anno.tsv: the annotation results with mutations.
  • tests/002-generator_reference.fa.anno_woMut.tsv: the annotation results without mutations.
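
To make the idea of a simulated TR concrete, here is a minimal Python sketch under stated assumptions. It is not VAMPIRE's generator: simulate_tr is a hypothetical function, motifs are picked uniformly at random, and only substitutions are introduced. It returns both the mutation-free and the mutated sequence, loosely mirroring the two annotation files listed above.

```python
import random

BASES = "ACGT"


def simulate_tr(motifs, length, mutation_rate, seed=42):
    """Concatenate randomly chosen motifs up to `length` bases, then
    substitute each base independently with probability `mutation_rate`.

    Returns (clean_sequence, mutated_sequence).
    """
    rng = random.Random(seed)  # fixed seed makes the output reproducible

    # Assumption: each repeat unit is drawn uniformly from the given motifs.
    pieces, total = [], 0
    while total < length:
        motif = rng.choice(motifs)
        pieces.append(motif)
        total += len(motif)
    clean = "".join(pieces)[:length]

    # Assumption: substitutions only; a substituted base always changes.
    mutated = []
    for base in clean:
        if rng.random() < mutation_rate:
            mutated.append(rng.choice([b for b in BASES if b != base]))
        else:
            mutated.append(base)
    return clean, "".join(mutated)


clean, mutated = simulate_tr(["GGC", "GGT"], length=1000, mutation_rate=0.01)
print(len(mutated), "bases,", sum(a != b for a, b in zip(clean, mutated)), "substitutions")
```

Using a private random.Random instance rather than the module-level functions keeps the seed local to the call, so repeated calls with the same arguments give the same sequence.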

mkref - Create reference motifset

The mkref subcommand builds a motif database (in FASTA format) from VAMPIRE annotation results. It can be combined with anno in a two-step approach: first, annotate the TR sequences with anno and use mkref to build a motif database from the results; then, rerun anno with this database in non-de novo mode. This approach produces a population-level motif database and annotates TR sequences with high accuracy.

# Create reference motif set from annotation results
vampire mkref tests/003-mkref_data tests/003-mkref_reference.fa

evaluate - Evaluate annotation quality

VAMPIRE evaluates annotation quality using an edit-distance matrix method. See the VAMPIRE Cookbook for more details.

# Evaluate the quality of annotation
vampire evaluate tests/001-anno_STR tests/004-evaluate

Four figures are generated, one for each combination of mode (raw or normalized) and strand setting (merge or separate). For the mechanisms, usage, and interpretation of these modes and strand settings, refer to the VAMPIRE Cookbook.
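
As background on what an edit-distance matrix contains, the sketch below computes pairwise Levenshtein distances between motifs in Python. It is illustrative only and does not reproduce VAMPIRE's evaluation: edit_distance and distance_matrix are hypothetical names, and dividing by the longer motif length is just one possible normalization, assumed here for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def distance_matrix(motifs, normalize=False):
    """Pairwise edit distances; optionally scaled by the longer motif length."""
    matrix = []
    for a in motifs:
        row = []
        for b in motifs:
            dist = edit_distance(a, b)
            if normalize:
                # Assumed normalization, not necessarily the one VAMPIRE uses.
                dist = dist / max(len(a), len(b), 1)
            row.append(dist)
        matrix.append(row)
    return matrix


motifs = ["GGC", "GGT", "GGCGGT"]
for row in distance_matrix(motifs):
    print(row)
```

The raw matrix favors short motifs, because two long motifs can differ by more edits in absolute terms; a length-scaled version makes motifs of different sizes comparable.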

refine - Refine annotation

The refinement process applies user-provided refinement actions and generates a new annotation file in the same format as the input. Three operations are supported: MERGE, REPLACE, and DELETE.

# Refine the annotation
vampire refine tests/001-anno_STR tests/005-refine_action.tsv -o tests/005-anno_STR.revised

logo - Plot sequence logos to visualize motif variation

VAMPIRE plots three types of sequence logo: count, probability, and information score. By default, VAMPIRE plots sequence logos from the motif statistics file *.motif.tsv. To plot sequence logos from the annotation file *.anno.tsv and show the true motif variation, use the --type annotation option.

# Plotting sequence logos to visualize motif variation
vampire logo tests/001-anno_STR tests/006-anno_STR_motif
vampire logo --type annotation tests/001-anno_STR tests/006-anno_STR_annotation
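
The three logo types correspond to three per-position matrices. The Python sketch below shows the standard way such matrices are derived from aligned sequences; it is an illustration under assumptions, not VAMPIRE's plotting code. logo_matrices is a hypothetical function, the input sequences must already have equal length, and the information score uses the plain Shannon formula without small-sample correction.

```python
import math
from collections import Counter


def logo_matrices(sequences, alphabet="ACGT"):
    """Return (counts, probabilities, information) as lists of per-position dicts."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    counts, probs, info = [], [], []
    for pos in range(length):
        column = Counter(seq[pos] for seq in sequences)
        count_row = {base: column[base] for base in alphabet}
        total = sum(count_row.values())  # ignores symbols outside the alphabet
        prob_row = {base: (count_row[base] / total if total else 0.0)
                    for base in alphabet}
        # Information content in bits: log2(alphabet size) minus Shannon entropy.
        entropy = -sum(p * math.log2(p) for p in prob_row.values() if p > 0)
        content = math.log2(len(alphabet)) - entropy
        info_row = {base: prob_row[base] * content for base in alphabet}
        counts.append(count_row)
        probs.append(prob_row)
        info.append(info_row)
    return counts, probs, info


counts, probs, info = logo_matrices(["GGC", "GGC", "GGT", "GGC"])
for pos in range(3):
    print(pos, counts[pos], probs[pos], info[pos])
```

A fully conserved position reaches the maximum of 2 bits for a four-letter alphabet, while a variable position is drawn shorter, which is why the information logo highlights conserved bases.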

identity - Calculate the identity matrix for TR sequences

VAMPIRE uses an alignment-based method to calculate the identity matrix for TR sequences.

# Calculate the identity matrix for TR sequences
vampire identity -t 20 -w 30 tests/001-anno_STR tests/007-anno_STR

By default, VAMPIRE does not account for insertion and deletion events when generating the identity matrix. To include such events within a specific length range, use the --max-indel and --min-indel options to set the maximum and minimum indel lengths to consider.
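
For intuition about an indel-free identity matrix, the Python sketch below splits a sequence into fixed-size windows and scores each window pair by the fraction of matching positions. This is a simplified illustration, not VAMPIRE's method: window_identity_matrix is a hypothetical function, and treating the window as a number of bases is an assumption made for the example.

```python
def window_identity_matrix(sequence, window):
    """Pairwise identity between consecutive, non-overlapping windows.

    Identity is the fraction of positions that match; indels are ignored,
    so windows are compared position by position.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    # A trailing fragment shorter than `window` is dropped.
    windows = [sequence[i:i + window]
               for i in range(0, len(sequence) - window + 1, window)]
    return [[sum(x == y for x, y in zip(a, b)) / window for b in windows]
            for a in windows]


matrix = window_identity_matrix("GGCGGCGGTGGC", window=3)
for row in matrix:
    print([round(value, 2) for value in row])
```

Without indel handling, a single inserted base shifts every following position and lowers the score of otherwise identical windows, which is the case the indel options are meant to address.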

After generating the identity matrix, you can plot it as a heatmap together with RepeatMasker annotation and TR strand information using these commands:

python scripts/get_visualization_data.py --prefix [annotation_prefix] --repeat [repeatmasker_annotation] --output [output_prefix]
Rscript scripts/SG_aln_plot.R -t 30 -b [identity_bed_file] -a [visualization_data] -p [figure_output_prefix]

Results

[Figure: heatmap]

Getting Help

For a detailed description of the options, please see the VAMPIRE Cookbook. If you have further questions, want to report a bug, or want to suggest a new feature, please raise an issue on the issue page.

Limitations

  • VAMPIRE is designed for annotating the variation of TRs. While it can be used for genome-wide TR annotation with basic information, it can be time-consuming due to the additional data it processes. To address this, we plan to develop a scan function optimized for whole-genome TR annotation.
  • TRs with very low copy numbers may be challenging to annotate accurately due to the limited availability of k-mers.
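
The k-mer limitation can be illustrated with a short Python sketch. It only demonstrates why low copy numbers leave little repeated k-mer signal and does not reflect VAMPIRE's internals: repeated_kmers is a hypothetical helper, and the motif ACGTT and k = 5 are arbitrary choices for the example.

```python
from collections import Counter


def repeated_kmers(sequence, k):
    """Return the k-mers that occur more than once, with their counts."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}


motif = "ACGTT"  # arbitrary example motif
low_copy = repeated_kmers(motif * 2, k=5)
high_copy = repeated_kmers(motif * 20, k=5)
print("2 copies: ", low_copy)
print("20 copies:", high_copy)
```

With two copies, only one 5-mer is seen twice, whereas twenty copies make every rotation of the motif recur many times, giving a k-mer-based method far more evidence to work with.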

Citing VAMPIRE

If you use VAMPIRE in your work, please cite:

To be updated
