Skip to main content

STR annotation tool for VCF files

Project description

STR (Short Tandem Repeat) annotation tool for VCF files.

strvcf_annotator is a Python library and CLI tool for annotating variants in VCF files that overlap short tandem repeat (STR) regions. The tool converts SNPs, SNVs and indels into full repeat sequences and adds STR metadata.

Features

  • Dual Usage: Works as both a library and CLI tool

  • Extensible: Easy to add custom parsers for different VCF formats

  • Efficient: Streaming support for large files

Installation

Package is available through PyPI. To install, type:

pip install strvcf-annotator
# Install from source
git clone https://github.com/acg-team/strvcf_annotator.git
cd strvcf_annotator
pip install -e .
# Dev dependencies
pip install -r requirements_dev.txt

Quick Start

Command Line

# Annotate a single VCF
strvcf-annotator --input input.vcf --str-bed repeats.bed --output output.vcf

# Batch-process a directory
strvcf-annotator --input-dir vcf_files/ --str-bed repeats.bed --output-dir annotated/

# With verbose logging
strvcf-annotator --input input.vcf --str-bed repeats.bed --output output.vcf --verbose

Library Usage

from strvcf_annotator import STRAnnotator

# Create the annotator
annotator = STRAnnotator('repeats.bed')

# Annotate a single file
annotator.annotate_vcf_file('input.vcf', 'output.vcf')

# Batch processing
annotator.process_directory('vcf_files/', 'annotated/')

# Streaming processing
import pysam
vcf_in = pysam.VariantFile('input.vcf')
for record in annotator.annotate_vcf_stream(vcf_in):
    print(f"Repeat unit: {record.info['RU']}")

Input format

BED file with STR regions

CHROM   START   END     PERIOD  RU
chr1    100     115     3       CAG
chr1    200     212     4       ATCG
chr2    300     318     3       GAT
  • CHROM: Chromosome name

  • START: Start position (0-based, BED format)

  • END: End position (0-based, exclusive)

  • PERIOD: Repeat unit length

  • RU: Repeat unit sequence

VCF file

A standard VCF with variants. Must contain:

  • FORMAT field GT (genotype)

  • Optional: AD (allelic depth), DP (total depth)

Output format

The annotated VCF contains additional fields:

INFO fields

  • RU: Repeat unit

  • PERIOD: Repeat period (unit length)

  • REF: Reference copy number

  • PERFECT: TRUE if both alleles are perfect repeats

FORMAT fields

  • REPCN: Genotype expressed as repeat copy numbers

Example

##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit">
##INFO=<ID=PERIOD,Number=1,Type=Integer,Description="Repeat period">
##INFO=<ID=REF,Number=1,Type=Integer,Description="Reference copy number">
##INFO=<ID=PERFECT,Number=1,Type=String,Description="Perfect repeat indicator">
##FORMAT=<ID=REPCN,Number=2,Type=Integer,Description="Repeat copy number">

#CHROM  POS  ID  REF         ALT             QUAL  FILTER  INFO                              FORMAT      Sample1
chr1    101  .   CAGCAGCAG   CAGCAGCAGCAG    .     .       RU=CAG;PERIOD=3;REF=3;PERFECT=TRUE  GT:REPCN    0/1:3,4

Architecture

src/strvcf_annotator/
├── __init__.py          # Public API
├── api.py               # Library API
├── cli.py               # CLI interface
├── core/                # Core functionality
│   ├── annotation.py    # Annotation engine
│   ├── vcf_processor.py # VCF processing
│   ├── str_reference.py # STR reference management
│   └── repeat_utils.py  # Repeat sequence utilities
├── parsers/             # Parser system
│   ├── base.py          # Abstract parser interface
│   └── generic.py       # Generic VCF parser
└── utils/               # Utilities
    ├── vcf_utils.py     # VCF helpers
    └── validation.py    # Input validation

Extending functionality

Creating a custom parser

from strvcf_annotator.parsers.base import BaseVCFParser

class CustomParser(BaseVCFParser):
    def get_genotype(self, record, sample_idx):
        # Your logic for extracting the genotype
        pass

    def has_variant(self, record, sample_idx):
        # Your logic for determining if there is a variant
        pass

    def extract_info(self, record, sample_idx):
        # Your logic for extracting additional fields
        pass

    def validate_record(self, record):
        # Your logic for validating the record
        pass

# Usage
annotator = STRAnnotator('repeats.bed', parser=CustomParser())

Performance

  • Streaming processing: Does not load the entire VCF into memory

  • Efficient lookup: Uses sorted data for fast STR searches

  • Batch processing: Supports processing multiple files

Troubleshooting

Issue: ModuleNotFoundError

# Install the package in editable (dev) mode
pip install -e .

Issue: Unnormalized VCF

This tool only accepts normalized VCFs. Please normalize with bcftools before running. Example (produces a normalized, indexed VCF):

# Replace reference.fa with the exact reference used for the VCF
bcftools norm -f reference.fa -m input.vcf

Issue: Unsorted VCF

The tool automatically sorts the VCF in memory, but for large files pre-sorting is recommended:

bcftools sort input.vcf -o sorted.vcf

Issue: Reference mismatch

If you see warnings about a reference mismatch, check:

  • The correctness of the STR BED file

  • Matching reference genome versions

Documentation

Contributing

Contributions are welcome! For major changes, please open an issue first to discuss what you’d like to change. Please ensure:

  1. All tests pass

  2. Code follows existing style

  3. New features include tests

  4. Documentation is updated

License

MIT License

Credits

Test bed files were taken from ConSTRain repository https://github.com/acg-team/ConSTRain.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

strvcf_annotator-0.2.0.tar.gz (25.0 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

strvcf_annotator-0.2.0-py3-none-any.whl (25.4 kB view details)

Uploaded Python 3

File details

Details for the file strvcf_annotator-0.2.0.tar.gz.

File metadata

  • Download URL: strvcf_annotator-0.2.0.tar.gz
  • Upload date:
  • Size: 25.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for strvcf_annotator-0.2.0.tar.gz
Algorithm Hash digest
SHA256 e135f4c6ebc84c8674346b7c84aa6c345125aca277fadda1c7810fbdf8beac8b
MD5 169fc8ff25842fba60097aaeb9733723
BLAKE2b-256 16e42509e1c5e0ab0a8bd01b7f06ab5b3cecbb1f98f79bc0ef8653cf92211d21

See more details on using hashes here.

File details

Details for the file strvcf_annotator-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for strvcf_annotator-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3abf6fb4a696367231e4d09fb4896cb4a96fece6ade809e472b08f25fb62a875
MD5 4444d91d59230f5f9d2b48daae701a06
BLAKE2b-256 73cd3ff65f0220e1bd34dd4d9ac775a13abdeb49291c2735ee64f00cc0e88e39

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page