strvcf-annotator

STR annotation tool for VCF files

STR (Short Tandem Repeat) annotation tool for VCF files.

strvcf_annotator is a Python library and CLI tool for annotating variants in VCF files that overlap short tandem repeat (STR) regions. It converts SNPs, SNVs and indels that fall inside these regions into full repeat-sequence alleles and adds STR metadata (repeat unit, period, reference copy number, and per-genotype copy numbers).
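
To make the conversion concrete, here is a minimal sketch of how a small variant inside an STR region could be expanded into full repeat alleles. The function name `expand_to_repeat_alleles`, its parameters, and the in-memory region string are assumptions made for illustration; the tool's actual implementation may differ.

```python
def expand_to_repeat_alleles(region_seq, region_start, pos, ref, alt):
    """Return (full_ref, full_alt) for a variant inside an STR region.

    region_seq   -- reference sequence of the whole STR region
    region_start -- 0-based start of the region (BED convention)
    pos          -- 1-based VCF position of the variant
    ref, alt     -- VCF REF and ALT alleles
    """
    offset = (pos - 1) - region_start  # 0-based offset of the variant in the region
    if offset < 0 or offset + len(ref) > len(region_seq):
        raise ValueError("variant is not fully contained in the STR region")
    if region_seq[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    # Splice the ALT allele into the region in place of the REF allele.
    full_alt = region_seq[:offset] + alt + region_seq[offset + len(ref):]
    return region_seq, full_alt


# Hypothetical region: five CAG copies starting at 0-based position 100.
# The variant inserts one extra CAG after the G at 1-based position 103.
full_ref, full_alt = expand_to_repeat_alleles("CAG" * 5, 100, 103, "G", "GCAG")
print(full_ref)  # CAGCAGCAGCAGCAG
print(full_alt)  # CAGCAGCAGCAGCAGCAG
```

In this sketch a one-unit insertion turns a five-copy reference allele into a six-copy alternate allele, which is the kind of full-sequence representation the description refers to.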

Features

Dual Usage: Works as both a library and CLI tool
Extensible: Easy to add custom parsers for different VCF formats
Efficient: Streaming support for large files

Installation

# Install from source
git clone https://github.com/KondratievaOlesya/strvcf_annotator.git
cd strvcf_annotator
pip install -e .

# Dev dependencies
pip install -r requirements_dev.txt

Quick Start

Command Line

# Annotate a single VCF
strvcf-annotator --input input.vcf --str-bed repeats.bed --output output.vcf

# Batch-process a directory
strvcf-annotator --input-dir vcf_files/ --str-bed repeats.bed --output-dir annotated/

# With verbose logging
strvcf-annotator --input input.vcf --str-bed repeats.bed --output output.vcf --verbose

Library Usage

from strvcf_annotator import STRAnnotator

# Create the annotator
annotator = STRAnnotator('repeats.bed')

# Annotate a single file
annotator.annotate_vcf_file('input.vcf', 'output.vcf')

# Batch processing
annotator.process_directory('vcf_files/', 'annotated/')

# Streaming processing
import pysam
vcf_in = pysam.VariantFile('input.vcf')
for record in annotator.annotate_vcf_stream(vcf_in):
    print(f"Repeat unit: {record.info['RU']}")

Input format

BED file with STR regions

CHROM   START   END     PERIOD  RU
chr1    100     115     3       CAG
chr1    200     212     4       ATCG
chr2    300     318     3       GAT

CHROM: Chromosome name
START: Start position (0-based, BED format)
END: End position (0-based, exclusive)
PERIOD: Repeat unit length
RU: Repeat unit sequence
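
As an illustration of the coordinate convention, the following sketch parses the data rows above and derives the number of repeat units each region spans. The `STRRegion` type and both helper functions are hypothetical names, not part of the package's API, and the sketch assumes whitespace-separated rows without a header line.

```python
from typing import NamedTuple


class STRRegion(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    period: int
    ru: str


def parse_bed_line(line):
    """Parse one whitespace-separated BED data row into an STRRegion."""
    chrom, start, end, period, ru = line.split()[:5]
    return STRRegion(chrom, int(start), int(end), int(period), ru)


def reference_copy_number(region):
    """Number of whole repeat units spanned by the half-open interval."""
    return (region.end - region.start) // region.period


bed_text = """\
chr1    100     115     3       CAG
chr1    200     212     4       ATCG
chr2    300     318     3       GAT
"""
regions = [parse_bed_line(row) for row in bed_text.splitlines() if row.strip()]
copies = [reference_copy_number(r) for r in regions]
print(copies)  # [5, 3, 6]
```

Because END is exclusive, the region length is simply END minus START, with no off-by-one correction.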

VCF file

A standard VCF with variants. Must contain:

FORMAT field GT (genotype)
Optional: AD (allelic depth), DP (total depth)
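
A quick pre-flight check for these fields can be sketched in plain Python by scanning the header lines. The function name `declared_format_ids` and the sample header are assumptions for illustration; this is not how the package itself validates input.

```python
def declared_format_ids(vcf_header_lines):
    """Collect the IDs declared by ##FORMAT lines in a VCF header."""
    prefix = "##FORMAT=<ID="
    ids = set()
    for line in vcf_header_lines:
        if line.startswith(prefix):
            # The ID runs up to the first comma (or the closing bracket).
            ids.add(line[len(prefix):].split(",", 1)[0].rstrip(">"))
    return ids


header = [
    '##fileformat=VCFv4.2',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">',
]
found = declared_format_ids(header)
missing_required = {"GT"} - found
missing_optional = {"AD", "DP"} - found
print(sorted(missing_required), sorted(missing_optional))  # [] ['AD']
```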

Output format

The annotated VCF contains additional fields:

INFO fields

RU: Repeat unit
PERIOD: Repeat period (unit length)
REF: Reference copy number
PERFECT: TRUE if both alleles are perfect repeats

FORMAT fields

REPCN: Genotype expressed as repeat copy numbers

Example

##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit">
##INFO=<ID=PERIOD,Number=1,Type=Integer,Description="Repeat period">
##INFO=<ID=REF,Number=1,Type=Integer,Description="Reference copy number">
##INFO=<ID=PERFECT,Number=1,Type=String,Description="Perfect repeat indicator">
##FORMAT=<ID=REPCN,Number=2,Type=Integer,Description="Repeat copy number">

#CHROM  POS  ID  REF        ALT           QUAL  FILTER  INFO                                FORMAT    Sample1
chr1    101  .   CAGCAGCAG  CAGCAGCAGCAG  .     .       RU=CAG;PERIOD=3;REF=3;PERFECT=TRUE  GT:REPCN  0/1:3,4
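
The REPCN and PERFECT values in the example record can be reproduced with a short sketch. The helpers `copy_number` and `is_perfect` are hypothetical, and the definition of a perfect repeat used here (the allele is an exact whole-number tiling of the repeat unit) is an assumption about what the tool means by the term.

```python
def copy_number(allele, ru):
    """Count non-overlapping occurrences of the repeat unit in the allele."""
    return allele.count(ru)


def is_perfect(allele, ru):
    """True if the allele consists solely of whole copies of the repeat unit."""
    n, remainder = divmod(len(allele), len(ru))
    return remainder == 0 and allele == ru * n


ru = "CAG"
alleles = ["CAGCAGCAG", "CAGCAGCAGCAG"]  # REF and ALT from the example record
genotype = (0, 1)                         # GT 0/1

repcn = [copy_number(alleles[i], ru) for i in genotype]
perfect = all(is_perfect(alleles[i], ru) for i in genotype)
print(repcn, perfect)  # [3, 4] True
```

Indexing the allele list by the genotype is what maps `0/1` to the copy numbers `3,4`.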

Architecture

src/strvcf_annotator/
├── __init__.py          # Public API
├── api.py               # Library API
├── cli.py               # CLI interface
├── core/                # Core functionality
│   ├── annotation.py    # Annotation engine
│   ├── vcf_processor.py # VCF processing
│   ├── str_reference.py # STR reference management
│   └── repeat_utils.py  # Repeat sequence utilities
├── parsers/             # Parser system
│   ├── base.py          # Abstract parser interface
│   └── generic.py       # Generic VCF parser
└── utils/               # Utilities
    ├── vcf_utils.py     # VCF helpers
    └── validation.py    # Input validation

Extending functionality

Creating a custom parser

from strvcf_annotator import STRAnnotator
from strvcf_annotator.parsers.base import BaseVCFParser

class CustomParser(BaseVCFParser):
    def get_genotype(self, record, sample_idx):
        # Your logic for extracting the genotype
        pass

    def has_variant(self, record, sample_idx):
        # Your logic for determining if there is a variant
        pass

    def extract_info(self, record, sample_idx):
        # Your logic for extracting additional fields
        pass

    def validate_record(self, record):
        # Your logic for validating the record
        pass

# Usage
annotator = STRAnnotator('repeats.bed', parser=CustomParser())

Performance

Streaming processing: Does not load the entire VCF into memory
Efficient lookup: Uses sorted data for fast STR searches
Batch processing: Supports processing multiple files
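
One common way to get fast lookups from sorted data is binary search over region start positions. The sketch below shows that general technique with the standard `bisect` module; the function names are hypothetical, it assumes regions on a chromosome do not overlap, and it is not taken from the package's source.

```python
import bisect
from collections import defaultdict


def build_index(regions):
    """Group (chrom, start, end, ru) tuples by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end, ru in regions:
        by_chrom[chrom].append((start, end, ru))
    index = {}
    for chrom, items in by_chrom.items():
        items.sort()
        index[chrom] = ([start for start, _, _ in items], items)
    return index


def find_region(index, chrom, pos):
    """Return the region containing 1-based VCF position pos, or None."""
    if chrom not in index:
        return None
    starts, items = index[chrom]
    pos0 = pos - 1  # convert to the 0-based BED coordinate system
    i = bisect.bisect_right(starts, pos0) - 1  # last region starting at or before pos0
    if i >= 0 and items[i][0] <= pos0 < items[i][1]:
        return items[i]
    return None


index = build_index([
    ("chr1", 200, 212, "ATCG"),
    ("chr1", 100, 115, "CAG"),
    ("chr2", 300, 318, "GAT"),
])
print(find_region(index, "chr1", 101))  # (100, 115, 'CAG')
print(find_region(index, "chr1", 150))  # None
```

Each lookup costs O(log n) per chromosome after a one-time sort, which matters when a VCF contains many more records than there are STR regions.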

Troubleshooting

Issue: ModuleNotFoundError

# Install the package in editable (dev) mode
pip install -e .

Issue: Unnormalized VCF

This tool only accepts normalized VCFs. Please normalize with bcftools before running. Example (produces a normalized, indexed VCF):

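# Replace reference.fa with the exact reference used for the VCF
# -m -any splits multiallelic sites (adjust the mode to your data); -Oz writes bgzip-compressed output
bcftools norm -f reference.fa -m -any -Oz -o normalized.vcf.gz input.vcf
bcftools index -t normalized.vcf.gz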

Issue: Unsorted VCF

The tool automatically sorts the VCF in memory, but for large files pre-sorting is recommended:

bcftools sort input.vcf -o sorted.vcf

Issue: Reference mismatch

If you see warnings about a reference mismatch, check:

The correctness of the STR BED file
Matching reference genome versions
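
One way to check a BED entry against a reference is to confirm that the reference bases in the interval really are copies of the stated repeat unit. The sketch below uses a toy in-memory sequence; the function `bed_entry_matches_reference` is a hypothetical helper, and a real check would read the sequence from the reference FASTA instead.

```python
def bed_entry_matches_reference(ref_seq, start, end, ru):
    """True if ref_seq[start:end] consists of whole copies of ru.

    start and end follow the BED convention (0-based, end exclusive).
    """
    segment = ref_seq[start:end].upper()
    n, remainder = divmod(len(segment), len(ru))
    in_bounds = len(segment) == end - start  # False if the interval runs off the sequence
    return in_bounds and remainder == 0 and segment == ru.upper() * n


# Toy chromosome: 10 flanking bases, five CAG copies, 10 flanking bases.
toy_chrom = "ACGTACGTAC" + "CAG" * 5 + "TTGACCATGA"
ok = bed_entry_matches_reference(toy_chrom, 10, 25, "CAG")
shifted = bed_entry_matches_reference(toy_chrom, 11, 26, "CAG")
print(ok, shifted)  # True False
```

The second call mimics an off-by-one in the coordinates, for example 1-based starts written into a BED file; a different genome build typically produces the same kind of failure.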

Documentation

Contributing

Contributions are welcome! For major changes, please open an issue first to discuss what you’d like to change. Please ensure:

All tests pass
Code follows existing style
New features include tests
Documentation is updated

License

MIT License

Credits

Test BED files were taken from the ConSTRain repository: https://github.com/acg-team/ConSTRain

Release history

1.0.0: Feb 26, 2026
0.3.0: Jan 12, 2026
0.2.2: Dec 5, 2025
0.2.1: Nov 27, 2025
0.2.0: Nov 25, 2025
0.1.0 (this version): Nov 7, 2025

Download files

Source Distribution

strvcf_annotator-0.1.0.tar.gz (25.0 MB)

Uploaded Nov 7, 2025 Source

Built Distribution

strvcf_annotator-0.1.0-py3-none-any.whl (24.9 kB)

Uploaded Nov 7, 2025 Python 3

File details

Details for the file strvcf_annotator-0.1.0.tar.gz.

File metadata

Download URL: strvcf_annotator-0.1.0.tar.gz
Upload date: Nov 7, 2025
Size: 25.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for strvcf_annotator-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`873d073664a2a4012b5fa84d818ebd57e47538892bb621cc235579755427d0c1`
MD5	`409a0e543fd08247d9767c08d864cb17`
BLAKE2b-256	`5db9e672412f2538b347b4ce60d3501b8a23df2c150df095a03ba4804bf3d6f3`

File details

Details for the file strvcf_annotator-0.1.0-py3-none-any.whl.

File metadata

Download URL: strvcf_annotator-0.1.0-py3-none-any.whl
Upload date: Nov 7, 2025
Size: 24.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for strvcf_annotator-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`07eacb40212db719b6dad0b6700e8d2a1de157ba0d6049c86c07d3afcd3697a6`
MD5	`70131bfb358293d983a86f70deb6a094`
BLAKE2b-256	`8c53d0c2981fbdd19fd50e467a8e1b9c7bcb4bbdcefb90477c8dec383ca7a3a6`
