Extract STR signatures from annotated VCF
STR Mutation Signature Analysis.
Python package for analysis of Short Tandem Repeat (STR) mutation signatures from VCF files. It extracts somatic STR mutation events from paired tumor–normal VCFs, builds count matrices, filters them, performs NMF-based signature decomposition, and projects new samples onto learned STR mutation signatures.
Contents
Installation
Quick start
Input format
Somatic STR calls (tumor–normal)
Annotating standard VCFs
Matrix construction
Filtering mutation matrices
NMF signatures and projection
Command line interface
Python API
Output
Contributing
License
Installation
From PyPI
The package is available through PyPI. Install with:
pip install str_mut_signatures
From source
git clone https://github.com/acg-team/str_mut_signatures
cd str_mut_signatures
pip install -e .
Development installation
pip install -r requirements_dev.txt
Quick start
Command Line
Extract somatic STR mutation counts from paired tumor–normal VCFs:
str_mut_signatures extract \
--vcf-dir data/vcfs/ \
--out-matrix counts_raw.tsv \
--ru length \
--ref-length \
--change
This produces a count matrix (TSV) with:
rows = samples
columns = STR mutation features such as:
LEN{motif_length}_{ref_length}_{change}
For example: LEN1_10_+1 means:
motif length = 1 bp
reference repeat length = 10 copies
tumor has +1 copy relative to normal.
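To make the label layout concrete, here is a minimal sketch that splits such a column name back into its three parts. The `StrFeature` tuple and `parse_length_label` helper are hypothetical names used only for illustration and are not part of the package; the sketch assumes labels produced with `--ru length --ref-length --change`.

```python
import re
from typing import NamedTuple


class StrFeature(NamedTuple):
    motif_length: int  # repeat unit length in bp
    ref_length: int    # reference repeat length in copies
    change: int        # tumor minus normal, in copies


# Assumed label layout: LEN{motif_length}_{ref_length}_{change}, signed change.
_LABEL = re.compile(r"^LEN(\d+)_(\d+)_([+-]\d+)$")


def parse_length_label(label: str) -> StrFeature:
    """Split a LEN-style feature label into integer components."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unexpected feature label: {label!r}")
    motif, ref, change = match.groups()
    return StrFeature(int(motif), int(ref), int(change))


feature = parse_length_label("LEN1_10_+1")
print(feature)  # StrFeature(motif_length=1, ref_length=10, change=1)
```

A parser like this can be handy when grouping columns of the count matrix by motif length or by the sign of the change.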
Filter the matrix to remove extremely rare features/samples:
str_mut_signatures filter \
--matrix counts_raw.tsv \
--out-matrix counts_filtered.tsv \
--feature-method elbow
Run NMF to learn STR mutation signatures:
str_mut_signatures nmf \
--matrix counts_filtered.tsv \
--outdir nmf_results \
--n-signatures 5
This writes:
nmf_results/signatures.tsv – STR mutation signatures (features x K)
nmf_results/exposures.tsv – sample exposures (samples x K)
nmf_results/metadata.json – parameters and metadata.
Project new samples onto existing signatures:
str_mut_signatures project \
--matrix new_counts.tsv \
--nmf-dir nmf_results \
--out-exposures new_exposures.tsv
Python Library Usage
Basic pipeline
from str_mut_signatures import (
parse_vcf_files,
build_mutation_matrix,
filter_mutation_matrix,
run_nmf,
save_nmf_result,
load_nmf_result,
project_onto_signatures,
)
# 1) Parse annotated paired tumor–normal VCF files into a long table
mutations = parse_vcf_files("vcf_directory/")
# 2) Build a mutation count matrix
# ru:
# None -> ignore motif
# "length" -> use only motif length (LEN1, LEN2, ...)
# "ru" -> use full repeat unit sequence (e.g. AT, AAT)
# "AT" -> AT-rich vs non-AT-rich classification
# ref_length:
# include reference repeat length as a feature component
# change:
# include tumor–normal repeat-length change
matrix = build_mutation_matrix(
mutations,
ru="length",
ref_length=True,
change=True,
)
# 3) Filter the matrix (e.g. manual thresholds)
matrix_filt, summary = filter_mutation_matrix(
matrix,
feature_method="manual",
min_feature_total=10,
min_samples_with_feature=3,
min_sample_total=0,
)
# 4) Run NMF
nmf_res = run_nmf(matrix_filt, n_signatures=5, random_state=11)
# Access signatures and exposures
signatures = nmf_res.signatures # features x K
exposures = nmf_res.exposures # samples x K
# 5) Save NMF result for reuse
save_nmf_result(nmf_res, "nmf_results")
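# 6) Load NMF result and project new samples
# new_counts_df: samples x features count matrix for the new samples,
# built with the same build_mutation_matrix options as `matrix`
nmf_loaded = load_nmf_result("nmf_results")
new_exposures = project_onto_signatures(
new_matrix=new_counts_df,
signatures=nmf_loaded.signatures,
method="nnls",
)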
Input format
To be processed by str_mut_signatures, VCF files must:
Contain paired samples (normal and tumor) per record.
Be annotated with STR-specific fields that describe the repeat unit and allele-level repeat counts.
Required structure
Each VCF record must contain at least two samples:
Sample 1 (first column after FORMAT): normal
Sample 2 (second column after FORMAT): tumor
By default, str_mut_signatures assumes this order and computes somatic changes as tumor vs normal. Only loci with differences between tumor and normal are used (somatic STR mutations).
Required annotations:
INFO fields
RU: Repeat unit / motif (e.g. A, AT, AAT).
REF: Reference repeat count, i.e. the copy number of the motif in the reference genome (or equivalent information, if available). This is an INFO key and is distinct from the REF allele column of the VCF.
FORMAT fields
REPCN: Comma-separated repeat copy numbers for each allele in that sample.
Example schema
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit">
##INFO=<ID=REF,Number=1,Type=Integer,Description="Reference repeat count">
##FORMAT=<ID=REPCN,Number=R,Type=Integer,Description="Per-allele repeat copy number">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR
chr1 100 . A AT . . RU=A;REF=10 REPCN 10,10 10,11
From this, str_mut_signatures:
Compares NORMAL vs TUMOR REPCN.
Identifies loci where tumor repeat copy number differs from normal.
Encodes the net repeat-length change as tumor minus normal (e.g. +1).
Uses only these somatic events for downstream count matrices and signatures.
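The comparison described above can be sketched in plain Python on the example record. This is an illustration only, not the package's parser: `parse_record`, `is_somatic`, and `net_change` are hypothetical helpers. The sketch assumes whitespace-delimited fields, NORMAL before TUMOR, and no missing (`.`) REPCN values.

```python
def parse_record(line: str) -> dict:
    """Extract RU, REF and per-sample REPCN from one VCF data line."""
    fields = line.split()
    # INFO column: semicolon-separated key=value pairs
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    # FORMAT column tells us where REPCN sits inside each sample column
    repcn_index = fields[8].split(":").index("REPCN")

    def repcn(sample: str) -> list:
        return [int(x) for x in sample.split(":")[repcn_index].split(",")]

    return {
        "ru": info["RU"],
        "ref": int(info["REF"]),
        "normal": repcn(fields[9]),   # first sample column
        "tumor": repcn(fields[10]),   # second sample column
    }


def is_somatic(record: dict) -> bool:
    """True when the tumor allele set differs from the normal allele set."""
    return sorted(record["tumor"]) != sorted(record["normal"])


def net_change(record: dict) -> int:
    """Total tumor copies minus total normal copies."""
    return sum(record["tumor"]) - sum(record["normal"])


record = parse_record("chr1 100 . A AT . . RU=A;REF=10 REPCN 10,10 10,11")
print(is_somatic(record), net_change(record))  # True 1
```

For real files, a dedicated VCF reader is a safer choice than splitting lines by hand, since it handles headers, missing values, and multi-field FORMAT columns.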
Somatic STR calls (tumor–normal)
Key points:
The package focuses on somatic STR mutations.
For each locus, tumor and normal alleles are compared:
If there is no difference between tumor and normal (based on REPCN), the site is ignored.
If there is a difference, a somatic STR event is recorded.
The change feature encodes the repeat-length difference between tumor and normal (tumor minus normal), not the difference between a sample and the reference:
change = f(REPCN_tumor, REPCN_normal)
Heterozygous states and phasing are handled internally:
If phased genotypes are present (GT uses |), allele-specific changes are used when possible.
If unphased or no phasing information is available (/ or missing), a combined per-locus change (total tumor vs total normal) is used.
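One possible reading of these rules is sketched below. The package's internal logic may differ: `somatic_changes` is a hypothetical helper, and the sketch assumes that phased alleles are listed in the same haplotype order in both samples, so that they can be paired by position.

```python
from typing import List, Optional


def is_phased(gt: Optional[str]) -> bool:
    """A genotype string counts as phased when it uses the '|' separator."""
    return gt is not None and "|" in gt


def somatic_changes(
    normal_gt: Optional[str],
    tumor_gt: Optional[str],
    normal_repcn: List[int],
    tumor_repcn: List[int],
) -> List[int]:
    """Return the non-zero tumor-minus-normal changes for one locus."""
    both_phased = is_phased(normal_gt) and is_phased(tumor_gt)
    if both_phased and len(normal_repcn) == len(tumor_repcn):
        # Allele-specific: compare each haplotype with its counterpart
        changes = [t - n for n, t in zip(normal_repcn, tumor_repcn)]
    else:
        # Combined: total tumor copies versus total normal copies
        changes = [sum(tumor_repcn) - sum(normal_repcn)]
    return [c for c in changes if c != 0]


print(somatic_changes("0|1", "0|1", [10, 12], [10, 14]))  # [2]
print(somatic_changes("0/1", "0/1", [10, 12], [11, 14]))  # [3]
```

In the phased case only the second haplotype changed, so one event of +2 is reported; in the unphased case the two differences are merged into a single +3.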
Annotating standard VCFs
If your VCFs lack RU, REF, or REPCN, you can annotate them using the companion tool strvcf_annotator:
Takes standard VCF + STR reference.
Produces STR-annotated VCFs compatible with str_mut_signatures.
For details see: strvcf_annotator.
Matrix construction
build_mutation_matrix provides a flexible way to define the feature space (columns) using simple flags.
Core components
Given:
RU: repeat unit sequence (e.g. A, AT, AAT)
len(RU): motif length
REF: reference repeat count
change: tumor–normal repeat-length change at that locus
you can select:
ru:
None:
Do not use motif information.
"length":
Use only motif length.
Features start with LEN{motif_length}.
Example: LEN1_10_+1 for motif length 1, REF=10, change=+1.
"ru":
Use full repeat unit sequence.
Example: A_10_+1, AT_20_-2.
"AT":
Collapse motifs into AT-rich vs non-AT-rich:
AT_rich: motif consists only of A and T.
non_AT_rich: motif contains any C or G.
ref_length (bool):
If True, include the reference repeat length as part of the feature key.
For phased genotypes, this is per-allele normal repeat count.
For unphased genotypes, this is the combined normal repeat count.
change (bool):
If True, include tumor–normal change as part of the feature key.
Only somatic events (non-zero change) are counted.
If False, no change term is added and loci are not filtered by change.
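The way these options combine into a column name can be sketched as follows. `motif_component` and `feature_key` are hypothetical helpers written to match the label examples in this section; they are not the package's code. The sketch does not define what happens when all three components are switched off (it simply returns an empty string).

```python
from typing import Optional


def motif_component(ru: str, mode: Optional[str]) -> Optional[str]:
    """Translate a repeat unit into the motif part of a feature key."""
    if mode is None:
        return None
    if mode == "length":
        return f"LEN{len(ru)}"
    if mode == "ru":
        return ru
    if mode == "AT":
        # AT-rich: the motif consists only of A and T
        return "AT_rich" if set(ru.upper()) <= {"A", "T"} else "non_AT_rich"
    raise ValueError(f"unknown ru mode: {mode!r}")


def feature_key(
    ru: str,
    ref: int,
    change: int,
    *,
    mode: Optional[str] = "length",
    ref_length: bool = True,
    use_change: bool = True,
) -> str:
    """Join the enabled components with underscores."""
    parts = []
    motif = motif_component(ru, mode)
    if motif is not None:
        parts.append(motif)
    if ref_length:
        parts.append(str(ref))
    if use_change:
        parts.append(f"{change:+d}")  # always signed, e.g. +1 or -2
    return "_".join(parts)


print(feature_key("A", 10, 1))                               # LEN1_10_+1
print(feature_key("AT", 20, -2, mode="ru"))                  # AT_20_-2
print(feature_key("ACG", 20, -2, mode="AT"))                 # non_AT_rich_20_-2
print(feature_key("AT", 10, 2, mode="ru", ref_length=False)) # AT_+2
```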
Examples
Motif length + ref length + somatic change:
m = build_mutation_matrix(
mutations,
ru="length",
ref_length=True,
change=True,
)
# Columns: LEN{motif_length}_{ref_length}_{change}
# e.g. LEN1_10_+1
Full motif + change only:
m = build_mutation_matrix(
mutations,
ru="ru",
ref_length=False,
change=True,
)
# Columns: {RU}_{change}
# e.g. AT_+2
AT-rich vs non-AT-rich, with ref length and change:
m = build_mutation_matrix(
mutations,
ru="AT",
ref_length=True,
change=True,
)
# Columns: AT_rich_10_+1, non_AT_rich_20_-2, ...
Motif length only (no change, e.g. for presence/absence-style summaries):
m = build_mutation_matrix(
mutations,
ru="length",
ref_length=False,
change=False,
)
# Columns: LEN1, LEN2, ...
Filtering mutation matrices
Large STR feature spaces can be sparse. filter_mutation_matrix provides several strategies to reduce noise before NMF.
Supported methods
from str_mut_signatures import filter_mutation_matrix
filtered, summary = filter_mutation_matrix(
matrix,
feature_method="manual",
min_feature_total=10,
min_samples_with_feature=3,
min_sample_total=0,
feature_percentile=0.9, # used for percentile method
)
Methods:
feature_method="manual"
Keep features with:
total count across samples >= min_feature_total
present (non-zero) in at least min_samples_with_feature samples.
Drop samples with total counts < min_sample_total.
feature_method="elbow"
Compute feature totals.
Use an “elbow” heuristic to choose a count threshold.
Keep features above that threshold.
Apply min_samples_with_feature and min_sample_total as in manual mode.
feature_method="percentile"
Compute feature totals.
Keep features above a chosen percentile of totals (e.g. feature_percentile=0.9 keeps the top 10% by total count).
Apply min_samples_with_feature and min_sample_total as in manual mode.
The function returns:
filtered – filtered count matrix.
summary – small dataclass with filtering statistics (e.g. numbers of features/samples before/after, thresholds used).
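For intuition, the manual and percentile rules can be approximated with a few lines of pandas. This is a sketch under stated assumptions rather than the package's implementation: `filter_manual` and `filter_percentile` are hypothetical names, sample totals are assumed to be computed after feature filtering, and "above a percentile" is read as strictly greater than that quantile. The elbow heuristic is not reproduced here because its exact rule is not described in this document.

```python
import pandas as pd


def filter_manual(
    matrix: pd.DataFrame,
    min_feature_total: int = 10,
    min_samples_with_feature: int = 3,
    min_sample_total: int = 0,
) -> pd.DataFrame:
    """Keep features passing both thresholds, then drop low-count samples."""
    feature_total = matrix.sum(axis=0)
    samples_with_feature = (matrix > 0).sum(axis=0)
    keep_features = (feature_total >= min_feature_total) & (
        samples_with_feature >= min_samples_with_feature
    )
    kept = matrix.loc[:, keep_features]
    keep_samples = kept.sum(axis=1) >= min_sample_total
    return kept.loc[keep_samples]


def filter_percentile(matrix: pd.DataFrame, feature_percentile: float = 0.9) -> pd.DataFrame:
    """Keep features whose total count exceeds the given quantile of totals."""
    totals = matrix.sum(axis=0)
    threshold = totals.quantile(feature_percentile)
    return matrix.loc[:, totals > threshold]


counts = pd.DataFrame(
    {
        "LEN1_10_+1": [5, 4, 3, 0],   # total 12, present in 3 samples
        "LEN1_10_-1": [12, 0, 0, 0],  # total 12, present in 1 sample
        "LEN2_5_+1": [1, 0, 1, 0],    # total 2, present in 2 samples
    },
    index=["s1", "s2", "s3", "s4"],
)
filtered = filter_manual(counts, min_sample_total=1)
print(filtered)
```

With these toy counts, only `LEN1_10_+1` passes both feature thresholds, and sample `s4` is dropped because it has no remaining counts.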
NMF signatures and projection
NMF decomposition
NMF is used to decompose the filtered matrix into:
Signatures: STR mutation patterns (features x K)
Exposures: how much each sample uses each signature (samples x K)
from str_mut_signatures import run_nmf
nmf_res = run_nmf(
matrix,
n_signatures=5,
init="nndsvd",
max_iter=200,
random_state=0,
alpha_W=0.0,
alpha_H=0.0,
l1_ratio=0.0,
)
signatures = nmf_res.signatures # DataFrame: features x K
exposures = nmf_res.exposures # DataFrame: samples x K
params = nmf_res.model_params
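To show how the two outputs relate to the input, here is a small NumPy sketch of the factorization: a count matrix V (samples x features) is approximated by exposures times the transpose of signatures. It uses plain multiplicative updates with random initialization as a stand-in; this is not the solver behind `run_nmf`, which this document does not describe, and `nmf_multiplicative` is a hypothetical name.

```python
import numpy as np


def nmf_multiplicative(V, n_signatures, max_iter=200, random_state=0, eps=1e-12):
    """Factor V (samples x features) into non-negative exposures and signatures.

    Minimizes the Frobenius reconstruction error with Lee-Seung style
    multiplicative updates. Returns (exposures, signatures) with shapes
    (samples x K) and (features x K).
    """
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(random_state)
    n_samples, n_features = V.shape
    W = rng.random((n_samples, n_signatures)) + eps   # exposures
    H = rng.random((n_signatures, n_features)) + eps  # signatures, transposed
    for _ in range(max_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H.T


# Toy data: 6 samples generated from 2 known non-negative patterns
rng = np.random.default_rng(1)
true_exposures = rng.random((6, 2)) * 10
true_signatures = rng.random((8, 2))
V = true_exposures @ true_signatures.T

exposures, signatures = nmf_multiplicative(V, n_signatures=2)
relative_error = np.linalg.norm(V - exposures @ signatures.T) / np.linalg.norm(V)
print(exposures.shape, signatures.shape, relative_error)
```

The factorization is only unique up to scaling and reordering of the K components, so recovered signatures are usually normalized before being compared across runs.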
Saving and loading NMF results
You can save and reload NMF results in a stable format (TSV + JSON):
from str_mut_signatures import save_nmf_result, load_nmf_result
save_nmf_result(nmf_res, "nmf_results")
nmf_loaded = load_nmf_result("nmf_results")
# nmf_loaded.signatures, nmf_loaded.exposures, nmf_loaded.model_params
Projecting new samples
Given a previously learned set of signatures, you can compute exposures for new samples (e.g. a new cohort or single sample):
from str_mut_signatures import project_onto_signatures
new_exposures = project_onto_signatures(
new_matrix=new_counts_df,
signatures=nmf_loaded.signatures,
method="nnls", # non-negative least squares
)
Rows in new_exposures are new samples, columns are signatures.
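Non-negative least squares projection can be illustrated with `scipy.optimize.nnls`: for each new sample, find the non-negative exposure vector that best reconstructs its counts from the fixed signatures. The sketch below is a minimal stand-in, not the package's code. `project_nnls` is a hypothetical name, and the sketch assumes the feature order of the new counts already matches the rows of the signature matrix.

```python
import numpy as np
from scipy.optimize import nnls


def project_nnls(new_counts, signatures):
    """Solve min ||signatures @ x - sample|| with x >= 0 for every sample.

    new_counts: samples x features, signatures: features x K.
    Returns an array of exposures with shape samples x K.
    """
    new_counts = np.asarray(new_counts, dtype=float)
    signatures = np.asarray(signatures, dtype=float)
    exposures = np.zeros((new_counts.shape[0], signatures.shape[1]))
    for i, sample in enumerate(new_counts):
        # nnls returns the solution vector and the residual norm
        exposures[i], _residual = nnls(signatures, sample)
    return exposures


# Two toy signatures over three features
toy_signatures = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
known_exposures = np.array([[2.0, 3.0], [0.0, 4.0]])
toy_counts = known_exposures @ toy_signatures.T

recovered = project_nnls(toy_counts, toy_signatures)
print(recovered)
```

Because the toy counts were generated exactly from the signatures, the recovered exposures match the known ones up to numerical precision. With real data, features missing from the new matrix would first need to be aligned to the signature rows (for example, filled with zeros).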
Command line interface
Global options
-v / --verbose: Enable verbose logging.
--version: Show package version.
Extract
str_mut_signatures extract \
--vcf-dir PATH \
--out-matrix OUTPUT.tsv \
[--ru {none,length,ru,AT}] \
[--ref-length] \
[--change]
Key options:
--vcf-dir: Directory with STR-annotated, paired tumor–normal VCF files.
--ru:
none: ignore motif.
length: use motif length (LEN1, LEN2, …).
ru: use full motif sequence.
AT: use AT-rich vs non-AT-rich labeling.
--ref-length: Include reference repeat length in feature labels.
--change: Encode tumor–normal repeat-length change and restrict to somatic events.
--out-matrix: Output TSV with samples as rows and STR mutation features as columns.
Filter
str_mut_signatures filter \
--matrix INPUT.tsv \
--out-matrix FILTERED.tsv \
[--feature-method {manual,elbow,percentile}] \
[--min-feature-total INT] \
[--min-samples-with-feature INT] \
[--min-sample-total INT] \
[--feature-percentile FLOAT]
Examples:
# Simple manual thresholds
str_mut_signatures filter \
--matrix counts_raw.tsv \
--out-matrix counts_filtered.tsv \
--feature-method manual \
--min-feature-total 10 \
--min-samples-with-feature 3 \
--min-sample-total 0
# Percentile-based filtering
str_mut_signatures filter \
--matrix counts_raw.tsv \
--out-matrix counts_filtered.tsv \
--feature-method percentile \
--feature-percentile 0.9
NMF
str_mut_signatures nmf \
--matrix counts_filtered.tsv \
--outdir nmf_results \
--n-signatures 5 \
[--max-iter 200] \
[--random-state 0] \
[--init nndsvd] \
[--alpha-W 0.0] \
[--alpha-H 0.0] \
[--l1-ratio 0.0]
Outputs:
nmf_results/signatures.tsv – signatures (features x K)
nmf_results/exposures.tsv – exposures (samples x K)
nmf_results/metadata.json – parameters and metadata
Project
str_mut_signatures project \
--matrix NEW_COUNTS.tsv \
--nmf-dir nmf_results \
--out-exposures NEW_EXPOSURES.tsv
--matrix: New count matrix (samples x features).
--nmf-dir: Directory with an existing NMF result (signatures.tsv, metadata.json).
--out-exposures: Output TSV with new sample exposures.
Python API
Main functions
from str_mut_signatures import (
parse_vcf_files,
build_mutation_matrix,
filter_mutation_matrix,
run_nmf,
save_nmf_result,
load_nmf_result,
project_onto_signatures,
)
parse_vcf_files(vcf_dir) → DataFrame of per-locus STR mutation data.
build_mutation_matrix(mutations, ...) → samples x features count matrix.
filter_mutation_matrix(matrix, ...) → filtered matrix + summary.
run_nmf(matrix, n_signatures, ...) → NMFResult(signatures, exposures, model_params).
save_nmf_result(result, outdir) / load_nmf_result(outdir) for persistence.
project_onto_signatures(new_matrix, signatures, method="nnls") → new exposures.
Output
Typical outputs include:
Count matrices (TSV): samples x STR mutation features.
Filtered matrices (TSV): reduced feature space for robust NMF.
NMF signatures and exposures (TSV).
Metadata (JSON) describing NMF runs and parameters.
These can be used to:
Characterize somatic STR mutation processes.
Compare STR signatures across cohorts.
Associate STR signatures with clinical or genomic features.
Apply learned STR signatures to new datasets.
Contributing
Contributions are welcome!
For major changes, please open an issue first to discuss what you’d like to change.
Please ensure:
All tests pass (including integration tests).
Code follows existing style and module structure.
New features include unit tests and, where appropriate, integration tests.
Documentation and examples are updated.
License
MIT License