VariantFlow
A production-quality platform for downstream genomic variant interpretation and prioritization
VariantFlow accepts ANNOVAR multianno files and automatically performs variant filtering, ACMG classification, candidate gene prioritization, and pathway enrichment, then generates publication-ready reports and interactive dashboards with full reproducibility tracking.
Made with ❤️ in INDIA || Dr Prabudh Goel Lab, AIIMS New Delhi
Table of Contents
- Overview
- Key Features
- Architecture
- Installation
- Quick Start
- CLI Reference
- Input Formats
- Output Structure
- Configuration
- Variant Scoring
- Pathway Enrichment
- Dashboard
- Cohort and Family Analysis
- Reproducibility
- Citation
- Contributing
- License
Overview
VariantFlow is a modular, extensible Python platform designed for downstream analysis of ANNOVAR-annotated genomic variant files. It is built for clinical genomics research, and its reports and auto-generated methods text are aimed at manuscripts for journals such as BMC Genomics, Bioinformatics, and Briefings in Bioinformatics.
The platform takes a standard ANNOVAR multianno file as input and executes a complete, auditable analysis pipeline — from raw variant filtering through to HTML, Excel, and PDF reports — without requiring any manual column mapping or configuration.
Key Features
| Feature | Description |
|---|---|
| Automatic column detection | ColumnMapper uses regex pattern matching across 50+ field types — never hardcodes ANNOVAR column names. Supports gnomAD v2/v3/v4.1.1, ClinVar date-stamped columns, SIFT4G, and more |
| Multi-tier filtering | Sequential quality → population frequency → functional consequence → exonic consequence → ACMG benign removal pipeline with full audit trail |
| ClinVar interpretation | Parses CLNSIG text; auto-detects presence-flag columns and falls back to InterVar classification |
| InterVar ACMG | Full evidence extraction — PVS1, PS1–4, PM1–6, PP1–5, BA1, BS1–4, BP1–7 |
| Transparent scoring | Configurable multi-factor variant score with per-variant breakdown |
| Gene prioritization | Ranked candidate gene tables with natural-language score explanations |
| GO / KEGG / Reactome | Enrichment via gseapy with bubble plots and bar charts per database |
| Interactive dashboard | Multi-page Dash app with live filters, drill-down tables, and export |
| Multi-format reports | HTML, Excel (multi-sheet), PDF — all with lab branding |
| Cohort analysis | Shared/unique variants, gene burden, recurrent genes across samples |
| Family analysis | De novo, autosomal recessive, compound het, X-linked detection |
| Reproducibility | project.json manifest + auto-generated methods text for manuscripts |
| 3D visualizations | 3D variant landscape (Score × CADD × REVEL) and 3D pathway landscape |
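The multi-tier filtering row describes a sequential pipeline that records an audit trail. The sketch below illustrates that idea only and is not VariantFlow's implementation. The dictionary keys (`dp`, `af`, `func`, `exonic_func`, `intervar`), the strict `<` comparison on allele frequency, and the choice to keep variants with a missing frequency are all assumptions made for the example. The thresholds mirror the documented CLI defaults (`--min-dp 10`, `--af 0.01`).

```python
from typing import Callable

Variant = dict

# (stage name, keep-predicate) pairs in the documented filter order.
# Keys such as "dp" and "exonic_func" are hypothetical field names.
STAGES: list[tuple[str, Callable[[Variant], bool]]] = [
    ("quality", lambda v: v["dp"] >= 10),
    ("population_frequency", lambda v: v["af"] is None or v["af"] < 0.01),
    ("functional_consequence", lambda v: v["func"] in {"exonic", "splicing"}),
    ("exonic_consequence", lambda v: v["exonic_func"] != "synonymous SNV"),
    ("acmg_benign_removal",
     lambda v: v["intervar"].lower() not in {"benign", "likely benign"}),
]


def run_filters(variants: list[Variant]) -> tuple[list[Variant], list[dict]]:
    """Apply each stage in order and log how many variants survive it."""
    audit = [{"stage": "input", "remaining": len(variants)}]
    for name, keep in STAGES:
        variants = [v for v in variants if keep(v)]
        audit.append({"stage": name, "remaining": len(variants)})
    return variants, audit


def make(vid, dp, af, func, exonic_func, intervar):
    return {"id": vid, "dp": dp, "af": af, "func": func,
            "exonic_func": exonic_func, "intervar": intervar}


demo = [
    make("v1", 35, 0.0002, "exonic", "nonsynonymous SNV", "Uncertain significance"),
    make("v2", 6, 0.0002, "exonic", "nonsynonymous SNV", "Uncertain significance"),
    make("v3", 40, 0.12, "exonic", "nonsynonymous SNV", "Uncertain significance"),
    make("v4", 50, None, "intronic", ".", "Uncertain significance"),
    make("v5", 22, 0.001, "exonic", "synonymous SNV", "Uncertain significance"),
    make("v6", 30, 0.003, "exonic", "stopgain", "Benign"),
]

kept, audit = run_filters(demo)
for entry in audit:
    print(f"{entry['stage']:<24}{entry['remaining']}")
```

The per-stage counts in `audit` are the kind of data a filtering funnel figure can be drawn from.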
Architecture
variantflow/
├── core/ # Data models (Pydantic), exceptions, logging, pipeline orchestrator
├── io/ # ColumnMapper, MultiannoReader — auto-detect all ANNOVAR fields
├── filters/ # Quality, population frequency, functional, exonic, ACMG filters
├── annotations/ # ClinVar engine, InterVar ACMG evidence parser
├── scoring/ # Transparent multi-factor variant scorer
├── prioritization/ # Gene ranker with score explanation
├── enrichment/ # GO BP/MF/CC, KEGG, Reactome via gseapy
├── statistics/ # Summary statistics engine → statistics.json
├── visualization/ # Plotly 2D figures + 3D landscapes + enrichment plots
├── dashboard/ # Multi-page Dash app with live callbacks
├── reports/ # HTML (Jinja2), Excel (openpyxl), PDF (ReportLab)
├── cohort/ # Multi-sample shared/unique/burden analysis
├── family/ # Pedigree-based inheritance detection
├── config/ # Pydantic v2 settings — fully configurable, env-var overridable
└── cli/ # Typer CLI — analyze / dashboard / cohort / family
Installation
From source (recommended)
git clone https://github.com/imrobintomar/VariantFlow.git
cd VariantFlow
pip install -e . --no-build-isolation
Dependencies
pip install pandas numpy scipy plotly dash dash-bootstrap-components \
gseapy openpyxl reportlab jinja2 pydantic pydantic-settings \
typer rich loguru tqdm
Docker
docker build -t variantflow:1.0.0 .
docker run --rm -v "$(pwd)/data":/data -v "$(pwd)/results":/results \
variantflow:1.0.0 analyze /data/sample.hg38_multianno.txt --output /results
Quick Start
# Single-sample analysis
python variantflow_run.py analyze sample.hg38_multianno.txt \
--output results/ --sample-id SAMPLE01
# Launch interactive dashboard
python variantflow_run.py dashboard results/ --port 8050
# Cohort analysis (directory of multianno files)
python variantflow_run.py cohort cohort_dir/ --output cohort_results/
# Family / trio analysis
python variantflow_run.py family family_dir/ \
--proband PROBAND01 --father DAD01 --mother MOM01
CLI Reference
analyze
python variantflow_run.py analyze <input_file> [OPTIONS]
Arguments:
input_file ANNOVAR multianno file (.txt or .txt.gz)
Options:
-o, --output Output directory [default: variantflow_results]
-s, --sample-id Sample identifier [default: sample]
-g, --genome Genome build: hg38 / hg19 [default: hg38]
--af AF threshold (rare variant) [default: 0.01]
--min-dp Minimum read depth [default: 10]
--nonframeshift Include nonframeshift indels
--no-enrichment Skip pathway enrichment
--no-pdf Skip PDF report
-c, --config JSON configuration file
-v, --verbose Verbose logging
dashboard
python variantflow_run.py dashboard <results_dir> [OPTIONS]
Options:
--host Dashboard host [default: 127.0.0.1]
--port Dashboard port [default: 8050]
--debug Enable debug mode
cohort
python variantflow_run.py cohort <cohort_dir> [OPTIONS]
Options:
-o, --output Output directory [default: cohort_results]
--pattern File glob pattern [default: *.txt]
family
python variantflow_run.py family <family_dir> [OPTIONS]
Options:
-p, --proband Proband sample ID [required]
-f, --father Father sample ID
-m, --mother Mother sample ID
-o, --output Output directory [default: family_results]
Input Formats
VariantFlow accepts standard ANNOVAR multianno files:
| Format | Example |
|---|---|
| Plain text | sample.hg38_multianno.txt |
| Plain text | sample.hg19_multianno.txt |
| Gzip compressed | sample.hg38_multianno.txt.gz |
| Tab-separated | sample.hg38_multianno.tsv |
Automatically detected fields include:
- Genomic coordinates: Chr, Start, End, Ref, Alt
- Gene annotations: Gene.refGene, Func.refGene, ExonicFunc.refGene, AAChange.refGene
- Population frequencies: gnomad411_exome_AF, gnomAD_exome_ALL, ExAC_ALL, 1000g2015aug_all
- ClinVar: CLNSIG, clinvar_20260503 (date-stamped), CLNDN
- InterVar: InterVar_automated, InterVar_ACMG
- Predictors: REVEL_score, CADD_phred, SIFT_score, SIFT4G_score, Polyphen2_HDIV_score
- Other: GERP++_RS, phyloP100way_vertebrate, MutationTaster_pred, SpliceAI_DS_max
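Detecting fields by pattern rather than by fixed name can be sketched as follows. This is an illustration under assumptions, not the actual ColumnMapper: the logical field names (`gene`, `gnomad_af`, and so on), the regular expressions, and the `map_columns` helper are all hypothetical, and only a handful of the field types are covered.

```python
import re

# Hypothetical patterns keyed by a logical field name. Each one tolerates
# version or date suffixes instead of hardcoding a single column name.
FIELD_PATTERNS = {
    "gene": re.compile(r"^Gene\.refGene$", re.I),
    "gnomad_af": re.compile(
        r"^gnomad\d*_(exome|genome)_AF$|^gnomAD_(exome|genome)_ALL$", re.I
    ),
    "clinvar_sig": re.compile(r"^CLNSIG$|^clinvar_\d{8}$", re.I),
    "revel": re.compile(r"^REVEL(_score)?$", re.I),
    "cadd_phred": re.compile(r"^CADD_phred$", re.I),
}


def map_columns(header: list[str]) -> dict[str, str]:
    """Map each logical field to the first header column matching its pattern."""
    mapping = {}
    for field, pattern in FIELD_PATTERNS.items():
        for column in header:
            if pattern.search(column):
                mapping[field] = column
                break
    return mapping


new_style = ["Chr", "Start", "End", "Ref", "Alt", "Gene.refGene",
             "gnomad411_exome_AF", "clinvar_20260503", "REVEL_score", "CADD_phred"]
old_style = ["Chr", "Start", "Gene.refGene", "gnomAD_exome_ALL", "CLNSIG"]

new_mapping = map_columns(new_style)
old_mapping = map_columns(old_style)
print(new_mapping)
print(old_mapping)
```

Fields with no matching column are simply absent from the mapping, so downstream steps can check for a key before using it.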
Output Structure
results/
├── report.html # Self-contained interactive HTML report
├── VariantFlow_Report.xlsx # Multi-sheet Excel workbook
│ ├── Summary # Key statistics and metadata
│ ├── CandidateVariants # Top 500 variants ranked by score
│ ├── CandidateGenes # Ranked candidate genes
│ ├── ClinVar_Pathogenic # Pathogenic / Likely Pathogenic variants
│ ├── InterVar_Pathogenic # ACMG Pathogenic / LP variants
│ └── go_* / kegg / reactome # Enrichment results per database
├── report.pdf # PDF report with tables and methods
├── CandidateVariants.tsv # Tab-separated candidate variants
├── CandidateGenes.tsv # Tab-separated ranked genes
├── statistics.json # Full summary statistics
├── project.json # Reproducibility manifest
├── methods.txt # Auto-generated methods section
└── figures/
├── filtering_funnel.html
├── clinvar_distribution.html
├── intervar_distribution.html
├── gene_ranking.html
├── chromosome_distribution.html
├── variant_score_histogram.html
├── af_distribution.html
├── acmg_evidence.html
├── variant_landscape_3d.html
├── enrichment_go_biological_process_dot.html
├── enrichment_go_biological_process_bar.html
├── enrichment_go_cellular_component_dot.html
├── enrichment_kegg_dot.html
├── enrichment_reactome_dot.html
└── pathway_landscape_3d.html
Configuration
VariantFlow uses a Pydantic v2 settings system. All parameters can be overridden via:
- JSON config file (--config my_config.json)
- Environment variables (prefix VF_)
Example config.json
{
"project_name": "Rare Disease Study",
"sample_id": "PATIENT_001",
"genome_build": "hg38",
"output_dir": "results/",
"filters": {
"active_af_threshold": 0.001,
"min_dp": 20,
"include_nonframeshift": true
},
"scoring": {
"clinvar_pathogenic": 10.0,
"revel_high": 3.0,
"cadd_very_high": 3.0
},
"enrichment": {
"organism": "human",
"qvalue_cutoff": 0.05,
"top_n_terms": 20
}
}
Environment variable override
export VF_FILTERS__ACTIVE_AF_THRESHOLD=0.001
export VF_FILTERS__MIN_DP=20
export VF_LOG_LEVEL=DEBUG
python variantflow_run.py analyze sample.txt
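The variable names above follow a convention: the `VF_` prefix is dropped, a double underscore separates nesting levels, and the remaining name is matched to a lowercase config key. The sketch below reproduces that mapping with the standard library only, to show which config entry each variable would reach. It is an assumption-laden illustration: `apply_env_overrides` is a hypothetical helper, and the real settings class may parse and validate values differently.

```python
import json


def apply_env_overrides(config, environ, prefix="VF_", delimiter="__"):
    """Return a copy of config with prefixed environment variables merged in."""
    merged = json.loads(json.dumps(config))  # deep copy for JSON-like data
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variable
        path = name[len(prefix):].lower().split(delimiter)
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})  # walk or create nested sections
        try:
            node[path[-1]] = json.loads(raw)  # "20" -> 20, "0.001" -> 0.001
        except json.JSONDecodeError:
            node[path[-1]] = raw  # plain strings such as "DEBUG"
    return merged


defaults = {"filters": {"active_af_threshold": 0.01, "min_dp": 10}}
environment = {
    "VF_FILTERS__ACTIVE_AF_THRESHOLD": "0.001",
    "VF_FILTERS__MIN_DP": "20",
    "VF_LOG_LEVEL": "DEBUG",
    "HOME": "/home/user",
}

settings = apply_env_overrides(defaults, environment)
print(json.dumps(settings, indent=2))
```

Passing the environment as a plain dictionary (rather than reading `os.environ` directly) keeps the helper easy to test.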
Variant Scoring
VariantFlow uses a transparent, configurable multi-factor scoring system. Every score contribution is stored in a score_breakdown column for full auditability.
| Source | Criterion | Score |
|---|---|---|
| ClinVar | Pathogenic | +10.0 |
| ClinVar | Likely Pathogenic | +8.0 |
| ClinVar | VUS | +3.0 |
| ClinVar | Likely Benign | -2.0 |
| ClinVar | Benign | -5.0 |
| InterVar | Pathogenic | +8.0 |
| InterVar | Likely Pathogenic | +6.0 |
| Consequence | Stop-gain / Stop-loss / Start-loss | +5.0 |
| Consequence | Frameshift indel | +5.0 |
| Consequence | Splicing | +3.0 |
| Consequence | Nonsynonymous SNV | +2.0 |
| Population AF | < 0.0001 (ultra-rare) | +4.0 |
| Population AF | < 0.001 (very rare) | +3.0 |
| Population AF | < 0.01 (rare) | +1.5 |
| REVEL | ≥ 0.75 | +3.0 |
| REVEL | 0.50 – 0.75 | +1.5 |
| CADD | ≥ 30 | +3.0 |
| CADD | 20 – 30 | +2.0 |
| SIFT | Deleterious | +1.0 |
| PolyPhen-2 | Damaging | +1.0 |
All weights are configurable in config.json under the scoring key.
Note: Variants classified as Benign or Likely Benign by InterVar are automatically removed from the candidate set after annotation.
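As a worked example of the table, the sketch below adds up the listed contributions for a single variant and keeps the per-factor breakdown. It is a minimal illustration, not the package's scorer. The input keys (`clinvar`, `intervar`, `consequence`, `af`, `revel`, `cadd`, `sift_deleterious`, `polyphen_damaging`) are hypothetical, and several choices are assumptions: a missing allele frequency contributes nothing, only the rarest matching frequency tier counts, and the REVEL and CADD lower bounds are inclusive.

```python
# Weights copied from the scoring table; the dictionary layout is an assumption.
WEIGHTS = {
    "clinvar": {"Pathogenic": 10.0, "Likely Pathogenic": 8.0, "VUS": 3.0,
                "Likely Benign": -2.0, "Benign": -5.0},
    "intervar": {"Pathogenic": 8.0, "Likely Pathogenic": 6.0},
    "consequence": {"stopgain": 5.0, "stoploss": 5.0, "startloss": 5.0,
                    "frameshift": 5.0, "splicing": 3.0, "nonsynonymous SNV": 2.0},
}
AF_TIERS = ((0.0001, 4.0), (0.001, 3.0), (0.01, 1.5))  # rarest tier first


def score_variant(variant):
    """Return (total_score, breakdown) for one variant dictionary."""
    breakdown = {}
    for source in ("clinvar", "intervar", "consequence"):
        weight = WEIGHTS[source].get(variant.get(source))
        if weight is not None:
            breakdown[source] = weight
    af = variant.get("af")
    if af is not None:
        for limit, weight in AF_TIERS:
            if af < limit:
                breakdown["population_af"] = weight
                break
    revel = variant.get("revel")
    if revel is not None and revel >= 0.50:
        breakdown["revel"] = 3.0 if revel >= 0.75 else 1.5
    cadd = variant.get("cadd")
    if cadd is not None and cadd >= 20:
        breakdown["cadd"] = 3.0 if cadd >= 30 else 2.0
    if variant.get("sift_deleterious"):
        breakdown["sift"] = 1.0
    if variant.get("polyphen_damaging"):
        breakdown["polyphen2"] = 1.0
    return sum(breakdown.values()), breakdown


truncating = {"clinvar": "Likely Pathogenic", "intervar": "Pathogenic",
              "consequence": "stopgain", "af": 0.00005, "cadd": 35}
missense = {"clinvar": "VUS", "consequence": "nonsynonymous SNV", "af": 0.004,
            "revel": 0.6, "cadd": 24, "sift_deleterious": True,
            "polyphen_damaging": True}

total_truncating, breakdown_truncating = score_variant(truncating)
total_missense, breakdown_missense = score_variant(missense)
print(total_truncating, breakdown_truncating)  # 28.0
print(total_missense, breakdown_missense)      # 12.0
```

The stop-gain variant scores 8 + 8 + 5 + 4 + 3 = 28.0, and the missense variant scores 3 + 2 + 1.5 + 1.5 + 2 + 1 + 1 = 12.0.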
Pathway Enrichment
Enrichment analysis is performed using gseapy against:
| Database | Gene Sets |
|---|---|
| Gene Ontology | GO Biological Process 2023 |
| Gene Ontology | GO Molecular Function 2023 |
| Gene Ontology | GO Cellular Component 2023 |
| KEGG | KEGG 2021 Human |
| Reactome | Reactome 2022 |
Each database produces:
- Bubble plot — x = -log₁₀(adj. p-value), size = gene count, color = odds ratio
- Bar chart — ranked terms colored by gene count
- Full results table with export
Significance threshold: adjusted p-value ≤ 0.2 (Benjamini-Hochberg). Requires ≥ 5 candidate genes.
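To make the threshold concrete, the sketch below applies the Benjamini-Hochberg step-up adjustment to a small list of raw p-values and keeps the terms with an adjusted value of 0.2 or less. The term names and p-values are invented for illustration, and the function is a plain-Python rendering of the standard procedure rather than code taken from VariantFlow or gseapy.

```python
import math


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


terms = ["term_1", "term_2", "term_3", "term_4", "term_5"]  # hypothetical terms
raw_p = [0.001, 0.02, 0.04, 0.30, 0.90]

adj_p = benjamini_hochberg(raw_p)
significant = [t for t, q in zip(terms, adj_p) if q <= 0.2]
bubble_x = [-math.log10(q) for q in adj_p]  # x-axis value used by the bubble plot

for term, p, q in zip(terms, raw_p, adj_p):
    print(f"{term}: p={p:.3f} adj={q:.4f}")
print("significant:", significant)
```

With five terms, the adjusted values are 0.005, 0.05, about 0.067, 0.375, and 0.9, so the first three pass the 0.2 cutoff.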
Dashboard
The interactive Dash dashboard (http://127.0.0.1:8050) provides eight analysis pages:
| Page | Content |
|---|---|
| Overview | KPI cards, filtering funnel, ClinVar/InterVar/chromosome distribution |
| Variant Explorer | Live-filtered table with score slider, ClinVar and InterVar dropdowns, histogram |
| Genes | Ranked bar chart (color-coded by ClinVar), Top N slider, full gene table |
| ClinVar | Classification distribution pie chart, filtered variant table |
| InterVar | ACMG classification bar chart, evidence criterion heatmap |
| Enrichment | Bubble + bar plots per database (GO CC, GO BP, GO MF, KEGG, Reactome) |
| 3D Landscape | 3D variant landscape (Score × CADD × REVEL) and 3D pathway landscape |
| Chromosome | Variant density by chromosome |
All tables support column filtering, sorting, and Excel export.
Cohort and Family Analysis
Cohort
python variantflow_run.py cohort cohort_dir/ --output cohort_results/
Outputs:
- cohort_shared_variants.tsv: variants present in ≥ 2 samples
- cohort_unique_variants.tsv: sample-private variants
- cohort_gene_burden.tsv: per-gene variant counts per sample
- cohort_recurrent_genes.tsv: genes affected in ≥ 2 samples
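The four cohort outputs correspond to simple set operations over per-sample variant lists. The following sketch shows one way to compute them, assuming each variant is reduced to a `(variant_key, gene)` pair. The key format, sample IDs, gene names, and the `cohort_summary` helper are hypothetical and do not reflect the package's internals.

```python
from collections import defaultdict


def cohort_summary(samples):
    """samples maps sample_id -> list of (variant_key, gene) pairs."""
    carriers = defaultdict(set)       # variant -> samples carrying it
    gene_samples = defaultdict(set)   # gene -> samples with any variant in it
    burden = defaultdict(lambda: defaultdict(int))  # gene -> sample -> count
    for sample_id, variants in samples.items():
        for key, gene in set(variants):  # de-duplicate within a sample
            carriers[key].add(sample_id)
            gene_samples[gene].add(sample_id)
            burden[gene][sample_id] += 1
    return {
        "shared": sorted(k for k, s in carriers.items() if len(s) >= 2),
        "unique": sorted(k for k, s in carriers.items() if len(s) == 1),
        "recurrent_genes": sorted(g for g, s in gene_samples.items() if len(s) >= 2),
        "gene_burden": {g: dict(counts) for g, counts in burden.items()},
    }


cohort = {
    "S1": [("chr1:100:A:G", "GENE_A"), ("chr2:200:C:T", "GENE_B")],
    "S2": [("chr1:100:A:G", "GENE_A"), ("chr2:250:G:A", "GENE_B")],
    "S3": [("chr3:300:T:C", "GENE_C")],
}

summary = cohort_summary(cohort)
print(summary["shared"])
print(summary["unique"])
print(summary["recurrent_genes"])
```

Note that GENE_B is recurrent even though no single GENE_B variant is shared: recurrence is counted per gene, sharing per variant.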
Family (Trio/Quad)
python variantflow_run.py family family_dir/ \
--proband PROBAND --father FATHER --mother MOTHER
Detects and outputs:
- family_de_novo.tsv
- family_autosomal_recessive.tsv
- family_compound_heterozygous.tsv
- family_x_linked.tsv
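Three of these inheritance patterns can be expressed as genotype rules on a trio. The sketch below encodes genotypes as alternate-allele counts (0, 1, or 2) and applies textbook definitions. It is a simplified illustration under assumptions: the record layout and `classify_trio` helper are hypothetical, X-linked detection is omitted because it also needs chromosome and sex information, and real pipelines add depth and quality checks.

```python
from collections import defaultdict


def classify_trio(variants):
    """variants: dicts with id, gene, and alt-allele counts for
    proband, father, and mother (0 = hom-ref, 1 = het, 2 = hom-alt)."""
    de_novo, recessive = [], []
    paternal, maternal = defaultdict(list), defaultdict(list)
    for v in variants:
        p, f, m = v["proband"], v["father"], v["mother"]
        if p >= 1 and f == 0 and m == 0:
            de_novo.append(v["id"])             # absent from both parents
        elif p == 2 and f == 1 and m == 1:
            recessive.append(v["id"])           # one copy from each carrier parent
        elif p == 1 and f == 1 and m == 0:
            paternal[v["gene"]].append(v["id"])  # het, inherited from father
        elif p == 1 and f == 0 and m == 1:
            maternal[v["gene"]].append(v["id"])  # het, inherited from mother
    # Compound het: the same gene holds a paternal and a maternal het variant.
    compound_het = {
        gene: sorted(paternal[gene] + maternal[gene])
        for gene in paternal if gene in maternal
    }
    return {"de_novo": de_novo, "autosomal_recessive": recessive,
            "compound_heterozygous": compound_het}


trio = [
    {"id": "v1", "gene": "GENE_A", "proband": 1, "father": 0, "mother": 0},
    {"id": "v2", "gene": "GENE_B", "proband": 2, "father": 1, "mother": 1},
    {"id": "v3", "gene": "GENE_C", "proband": 1, "father": 1, "mother": 0},
    {"id": "v4", "gene": "GENE_C", "proband": 1, "father": 0, "mother": 1},
    {"id": "v5", "gene": "GENE_D", "proband": 1, "father": 1, "mother": 0},
]

calls = classify_trio(trio)
print(calls)
```

GENE_D is not reported as compound heterozygous because its only heterozygous variant comes from the father.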
Reproducibility
Every analysis generates a project.json manifest containing:
{
"run_id": "16e51f54",
"variantflow_version": "1.0.0",
"created_at": "2026-06-03T10:36:51",
"python_version": "3.13.11",
"genome_build": "hg38",
"input_files": ["sample.hg38_multianno.txt"],
"filters_applied": ["quality", "population_frequency", "functional_consequence",
"exonic_consequence", "acmg_benign_removal"],
"total_input_variants": 86299,
"total_output_variants": 917,
"total_candidate_genes": 100,
"config": { "..." }
}
A methods.txt file is also generated, ready to paste into a manuscript Methods section.
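Because the manifest is plain JSON, it can be read back with the standard library to summarise a run. The sketch below uses the keys and values from the example manifest above. The `summarize_manifest` helper and its output wording are hypothetical and are not the format of the generated methods.txt.

```python
import json

# Subset of the example manifest shown above, embedded so the sketch is
# self-contained; in practice this would be read from results/project.json.
MANIFEST_TEXT = """
{
  "run_id": "16e51f54",
  "variantflow_version": "1.0.0",
  "genome_build": "hg38",
  "filters_applied": ["quality", "population_frequency", "functional_consequence",
                      "exonic_consequence", "acmg_benign_removal"],
  "total_input_variants": 86299,
  "total_output_variants": 917,
  "total_candidate_genes": 100
}
"""


def summarize_manifest(manifest):
    """Build a one-line run summary from a parsed project.json manifest."""
    retained = manifest["total_output_variants"] / manifest["total_input_variants"]
    return (
        f"Run {manifest['run_id']} (VariantFlow {manifest['variantflow_version']}, "
        f"{manifest['genome_build']}): {manifest['total_input_variants']} input variants, "
        f"{manifest['total_output_variants']} candidates ({retained:.2%}) in "
        f"{manifest['total_candidate_genes']} genes after "
        f"{len(manifest['filters_applied'])} filters."
    )


manifest = json.loads(MANIFEST_TEXT)
summary_line = summarize_manifest(manifest)
print(summary_line)
```

For the example run, 917 of 86299 input variants remain, which is about 1.06%.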
Citation
If you use VariantFlow in your research, please cite:
Tomar R. (2024). VariantFlow: A production-quality platform for genomic variant interpretation and prioritization. Dr Prabudh Goel Lab, AIIMS New Delhi. GitHub. https://github.com/imrobintomar/VariantFlow
Contributing
Contributions are welcome. Please open an issue before submitting a pull request. All contributors must follow the existing code style (black, ruff) and include unit tests.
# Run tests
pytest tests/unit/ -v --cov=variantflow
# Lint
ruff check variantflow/
black variantflow/
License
MIT License © 2024 Robin Tomar — Dr Prabudh Goel Lab, AIIMS New Delhi
See LICENSE for full terms.
github.com/imrobintomar/VariantFlow