VariantFlow

A production-quality platform for downstream genomic variant interpretation and prioritization

Python 3.11+ · License: MIT · Code style: black


VariantFlow accepts ANNOVAR multianno files and automatically performs variant filtering, ACMG classification, candidate gene prioritization, and pathway enrichment, then generates publication-ready reports and interactive dashboards, with full reproducibility tracking.


Made with ❤️ in INDIA  ||  Dr Prabudh Goel Lab, AIIMS New Delhi


Overview

VariantFlow is a modular, extensible Python platform for downstream analysis of ANNOVAR-annotated genomic variant files. It is built for clinical genomics research, and its reports and auto-generated methods text are intended to support publication in journals such as BMC Genomics, Bioinformatics, and Briefings in Bioinformatics.

The platform takes a standard ANNOVAR multianno file as input and executes a complete, auditable analysis pipeline — from raw variant filtering through to HTML, Excel, and PDF reports — without requiring any manual column mapping or configuration.


Key Features

Feature                     Description
Automatic column detection  ColumnMapper uses regex pattern matching across 50+ field types and never hardcodes ANNOVAR column names. Supports gnomAD v2/v3/v4.1.1, ClinVar date-stamped columns, SIFT4G, and more
Multi-tier filtering        Sequential quality → population frequency → functional consequence → exonic consequence → ACMG benign removal pipeline with full audit trail
ClinVar interpretation      Parses CLNSIG text; auto-detects presence-flag columns and falls back to InterVar classification
InterVar ACMG               Full evidence extraction: PVS1, PS1–4, PM1–6, PP1–5, BA1, BS1–4, BP1–7
Transparent scoring         Configurable multi-factor variant score with per-variant breakdown
Gene prioritization         Ranked candidate gene tables with natural-language score explanations
GO / KEGG / Reactome        Enrichment via gseapy with bubble plots and bar charts per database
Interactive dashboard       Multi-page Dash app with live filters, drill-down tables, and export
Multi-format reports        HTML, Excel (multi-sheet), and PDF, all with lab branding
Cohort analysis             Shared/unique variants, gene burden, recurrent genes across samples
Family analysis             De novo, autosomal recessive, compound het, X-linked detection
Reproducibility             project.json manifest + auto-generated methods text for manuscripts
3D visualizations           3D variant landscape (Score × CADD × REVEL) and 3D pathway landscape
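
The multi-tier filtering row describes a sequential pipeline with an audit trail. A minimal sketch of that idea follows. The stage names come from the feature list and the thresholds from the CLI defaults, but the field names, the `run_filters` and `make` helpers, and the treatment of missing frequencies are assumptions made for illustration, not VariantFlow's actual implementation.

```python
from typing import Any, Callable

Variant = dict[str, Any]

# Hypothetical stages in the order given in the feature table. The field names
# ("dp", "af", "func", "exonic_func", "acmg") are illustrative only.
STAGES: list[tuple[str, Callable[[Variant], bool]]] = [
    ("quality", lambda v: v["dp"] >= 10),
    # Assumption: a variant with no population frequency counts as rare.
    ("population_frequency", lambda v: v["af"] is None or v["af"] < 0.01),
    ("functional_consequence", lambda v: v["func"] in {"exonic", "splicing"}),
    ("exonic_consequence", lambda v: v["exonic_func"] != "synonymous SNV"),
    ("acmg_benign_removal", lambda v: v["acmg"] not in {"Benign", "Likely benign"}),
]


def run_filters(variants: list[Variant]) -> tuple[list[Variant], list[dict]]:
    """Apply each stage in order and record how many variants survive it."""
    audit = [{"stage": "input", "remaining": len(variants)}]
    for name, keep in STAGES:
        variants = [v for v in variants if keep(v)]
        audit.append({"stage": name, "remaining": len(variants)})
    return variants, audit


def make(dp=40, af=0.001, func="exonic", exonic_func="stopgain", acmg="Pathogenic"):
    """Build a toy variant; the defaults pass all five stages."""
    return {"dp": dp, "af": af, "func": func, "exonic_func": exonic_func, "acmg": acmg}


# One variant that passes everything, then one failure per stage.
toy = [make(), make(dp=5), make(af=0.2), make(func="intronic"),
       make(exonic_func="synonymous SNV"), make(acmg="Benign")]
kept, audit = run_filters(toy)
for step in audit:
    print(f'{step["stage"]:<24}{step["remaining"]}')
```

Recording the count after every stage is what makes a filtering funnel plot possible: each audit entry becomes one bar.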

Architecture

variantflow/
├── core/               # Data models (Pydantic), exceptions, logging, pipeline orchestrator
├── io/                 # ColumnMapper, MultiannoReader — auto-detect all ANNOVAR fields
├── filters/            # Quality, population frequency, functional, exonic, ACMG filters
├── annotations/        # ClinVar engine, InterVar ACMG evidence parser
├── scoring/            # Transparent multi-factor variant scorer
├── prioritization/     # Gene ranker with score explanation
├── enrichment/         # GO BP/MF/CC, KEGG, Reactome via gseapy
├── statistics/         # Summary statistics engine → statistics.json
├── visualization/      # Plotly 2D figures + 3D landscapes + enrichment plots
├── dashboard/          # Multi-page Dash app with live callbacks
├── reports/            # HTML (Jinja2), Excel (openpyxl), PDF (ReportLab)
├── cohort/             # Multi-sample shared/unique/burden analysis
├── family/             # Pedigree-based inheritance detection
├── config/             # Pydantic v2 settings — fully configurable, env-var overridable
└── cli/                # Typer CLI — analyze / dashboard / cohort / family

Installation

From source (recommended)

git clone https://github.com/imrobintomar/VariantFlow.git
cd VariantFlow
pip install -e . --no-build-isolation

Dependencies

pip install pandas numpy scipy plotly dash dash-bootstrap-components \
            gseapy openpyxl reportlab jinja2 pydantic pydantic-settings \
            typer rich loguru tqdm

Docker

docker build -t variantflow:1.0.0 .
docker run --rm -v "$(pwd)/data:/data" -v "$(pwd)/results:/results" \
  variantflow:1.0.0 analyze /data/sample.hg38_multianno.txt --output /results

Quick Start

# Single-sample analysis
python variantflow_run.py analyze sample.hg38_multianno.txt \
  --output results/ --sample-id SAMPLE01

# Launch interactive dashboard
python variantflow_run.py dashboard results/ --port 8050

# Cohort analysis (directory of multianno files)
python variantflow_run.py cohort cohort_dir/ --output cohort_results/

# Family / trio analysis
python variantflow_run.py family family_dir/ \
  --proband PROBAND01 --father DAD01 --mother MOM01

CLI Reference

analyze

python variantflow_run.py analyze <input_file> [OPTIONS]

Arguments:
  input_file          ANNOVAR multianno file (.txt or .txt.gz)

Options:
  -o, --output        Output directory           [default: variantflow_results]
  -s, --sample-id     Sample identifier          [default: sample]
  -g, --genome        Genome build: hg38 / hg19  [default: hg38]
  --af                AF threshold (rare variant) [default: 0.01]
  --min-dp            Minimum read depth          [default: 10]
  --nonframeshift     Include nonframeshift indels
  --no-enrichment     Skip pathway enrichment
  --no-pdf            Skip PDF report
  -c, --config        JSON configuration file
  -v, --verbose       Verbose logging

dashboard

python variantflow_run.py dashboard <results_dir> [OPTIONS]

Options:
  --host    Dashboard host  [default: 127.0.0.1]
  --port    Dashboard port  [default: 8050]
  --debug   Enable debug mode

cohort

python variantflow_run.py cohort <cohort_dir> [OPTIONS]

Options:
  -o, --output   Output directory  [default: cohort_results]
  --pattern      File glob pattern [default: *.txt]

family

python variantflow_run.py family <family_dir> [OPTIONS]

Options:
  -p, --proband  Proband sample ID  [required]
  -f, --father   Father sample ID
  -m, --mother   Mother sample ID
  -o, --output   Output directory  [default: family_results]

Input Formats

VariantFlow accepts standard ANNOVAR multianno files:

Format           Example
Plain text       sample.hg38_multianno.txt
Plain text       sample.hg19_multianno.txt
Gzip compressed  sample.hg38_multianno.txt.gz
Tab-separated    sample.hg38_multianno.tsv
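
To show how plain and gzip-compressed inputs can share one code path, here is a small standard-library sketch. The `read_multianno` function is a hypothetical helper written for this example; it is not the project's MultiannoReader.

```python
import csv
import gzip
from pathlib import Path


def read_multianno(path):
    """Yield one dict per variant row from a plain or gzip-compressed file."""
    path = Path(path)
    # Pick the opener from the final suffix, so .txt.gz goes through gzip.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", newline="") as handle:
        # ANNOVAR multianno files are tab-separated with a header row.
        yield from csv.DictReader(handle, delimiter="\t")
```

Because the function is a generator, rows are read lazily, which keeps memory use flat for large files: `for row in read_multianno("sample.hg38_multianno.txt.gz"): ...`.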

Automatically detected fields include:

  • Genomic coordinates: Chr, Start, End, Ref, Alt
  • Gene annotations: Gene.refGene, Func.refGene, ExonicFunc.refGene, AAChange.refGene
  • Population frequencies: gnomad411_exome_AF, gnomAD_exome_ALL, ExAC_ALL, 1000g2015aug_all
  • ClinVar: CLNSIG, clinvar_20260503 (date-stamped), CLNDN
  • InterVar: InterVar_automated, InterVar_ACMG
  • Predictors: REVEL_score, CADD_phred, SIFT_score, SIFT4G_score, Polyphen2_HDIV_score
  • Other: GERP++_RS, phyloP100way_vertebrate, MutationTaster_pred, SpliceAI_DS_max
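
The following sketch illustrates how regex-based detection can map varying ANNOVAR header names onto stable internal field names. The patterns, the canonical field names, and `detect_columns` are assumptions chosen to match the examples listed above; the real ColumnMapper covers many more fields and may work differently.

```python
import re

# Hypothetical patterns: canonical field name -> regex over header names.
PATTERNS = {
    "gene": re.compile(r"Gene\.(refGene|ensGene|knownGene)", re.IGNORECASE),
    "gnomad_af": re.compile(r"gnomad\w*_AF|gnomAD_(exome|genome)_ALL", re.IGNORECASE),
    # Matches both the plain CLNSIG column and date-stamped ClinVar columns.
    "clinvar_sig": re.compile(r"CLNSIG|clinvar_\d{8}", re.IGNORECASE),
    "revel": re.compile(r"REVEL(_score)?", re.IGNORECASE),
    "cadd": re.compile(r"CADD(_phred)?", re.IGNORECASE),
}


def detect_columns(header):
    """Map each canonical field to the first header name that matches it."""
    mapping = {}
    for field, pattern in PATTERNS.items():
        for name in header:
            if pattern.fullmatch(name):
                mapping[field] = name
                break
    return mapping


header = ["Chr", "Start", "End", "Ref", "Alt", "Gene.refGene",
          "gnomad411_exome_AF", "clinvar_20260503", "REVEL_score", "CADD_phred"]
print(detect_columns(header))
```

Fields with no matching column are simply absent from the result, so downstream code can check for a key instead of failing on a missing column.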

Output Structure

results/
├── report.html                  # Self-contained interactive HTML report
├── VariantFlow_Report.xlsx      # Multi-sheet Excel workbook
│   ├── Summary                  # Key statistics and metadata
│   ├── CandidateVariants        # Top 500 variants ranked by score
│   ├── CandidateGenes           # Ranked candidate genes
│   ├── ClinVar_Pathogenic       # Pathogenic / Likely Pathogenic variants
│   ├── InterVar_Pathogenic      # ACMG Pathogenic / LP variants
│   └── go_* / kegg / reactome   # Enrichment results per database
├── report.pdf                   # PDF report with tables and methods
├── CandidateVariants.tsv        # Tab-separated candidate variants
├── CandidateGenes.tsv           # Tab-separated ranked genes
├── statistics.json              # Full summary statistics
├── project.json                 # Reproducibility manifest
├── methods.txt                  # Auto-generated methods section
└── figures/
    ├── filtering_funnel.html
    ├── clinvar_distribution.html
    ├── intervar_distribution.html
    ├── gene_ranking.html
    ├── chromosome_distribution.html
    ├── variant_score_histogram.html
    ├── af_distribution.html
    ├── acmg_evidence.html
    ├── variant_landscape_3d.html
    ├── enrichment_go_biological_process_dot.html
    ├── enrichment_go_biological_process_bar.html
    ├── enrichment_go_cellular_component_dot.html
    ├── enrichment_kegg_dot.html
    ├── enrichment_reactome_dot.html
    └── pathway_landscape_3d.html

Configuration

VariantFlow uses a Pydantic v2 settings system. All parameters can be overridden via:

  1. JSON config file (--config my_config.json)
  2. Environment variables (prefix VF_)

Example config.json

{
  "project_name": "Rare Disease Study",
  "sample_id": "PATIENT_001",
  "genome_build": "hg38",
  "output_dir": "results/",
  "filters": {
    "active_af_threshold": 0.001,
    "min_dp": 20,
    "include_nonframeshift": true
  },
  "scoring": {
    "clinvar_pathogenic": 10.0,
    "revel_high": 3.0,
    "cadd_very_high": 3.0
  },
  "enrichment": {
    "organism": "human",
    "qvalue_cutoff": 0.05,
    "top_n_terms": 20
  }
}

Environment variable override

export VF_FILTERS__ACTIVE_AF_THRESHOLD=0.001
export VF_FILTERS__MIN_DP=20
export VF_LOG_LEVEL=DEBUG
python variantflow_run.py analyze sample.txt
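
The variable names above suggest a `VF_` prefix with a double underscore separating nesting levels, which is the convention pydantic-settings supports through its prefix and nested-delimiter options. The standard-library sketch below only illustrates how such names map onto the nested config structure; `apply_env_overrides` is a hypothetical helper, and the JSON-based type coercion is an assumption.

```python
import json
import os


def apply_env_overrides(config, environ=None, prefix="VF_", delimiter="__"):
    """Return a copy of config with prefixed environment variables merged in."""
    environ = os.environ if environ is None else environ
    merged = json.loads(json.dumps(config))  # deep copy of JSON-like data
    for key, raw in environ.items():
        if not key.startswith(prefix):
            continue
        # VF_FILTERS__MIN_DP -> ["filters", "min_dp"]
        path = key[len(prefix):].lower().split(delimiter)
        try:
            value = json.loads(raw)  # "0.001" -> 0.001, "20" -> 20
        except json.JSONDecodeError:
            value = raw  # keep plain strings such as "DEBUG"
        node = merged
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return merged


base = {"filters": {"active_af_threshold": 0.01, "min_dp": 10}}
env = {"VF_FILTERS__ACTIVE_AF_THRESHOLD": "0.001", "VF_FILTERS__MIN_DP": "20",
       "VF_LOG_LEVEL": "DEBUG", "HOME": "/home/user"}
print(apply_env_overrides(base, env))
```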

Variant Scoring

VariantFlow uses a transparent, configurable multi-factor scoring system. Every score contribution is stored in a score_breakdown column for full auditability.

Source       Criterion                           Score
ClinVar      Pathogenic                          +10.0
ClinVar      Likely Pathogenic                   +8.0
ClinVar      VUS                                 +3.0
ClinVar      Likely Benign                       -2.0
ClinVar      Benign                              -5.0
InterVar     Pathogenic                          +8.0
InterVar     Likely Pathogenic                   +6.0
Consequence  Stop-gain / Stop-loss / Start-loss  +5.0
Consequence  Frameshift indel                    +5.0
Consequence  Splicing                            +3.0
Consequence  Nonsynonymous SNV                   +2.0
Population   AF < 0.0001 (ultra-rare)            +4.0
Population   AF < 0.001 (very rare)              +3.0
Population   AF < 0.01 (rare)                    +1.5
REVEL        ≥ 0.75                              +3.0
REVEL        0.50 – 0.75                         +1.5
CADD         ≥ 30                                +3.0
CADD         20 – 30                             +2.0
SIFT         Deleterious                         +1.0
PolyPhen-2   Damaging                            +1.0

All weights are configurable in config.json under the scoring key.

Note: Variants classified as Benign or Likely Benign by InterVar are automatically removed from the candidate set after annotation.
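
As a worked illustration of the table, the sketch below adds up the listed weights for one variant and keeps the per-source breakdown. The input keys, the category labels, and `score_variant` are hypothetical. It also assumes that the AF, REVEL, and CADD tiers are mutually exclusive (only the strongest matching tier counts) and that a missing value contributes nothing, neither of which the table states.

```python
CLINVAR = {"Pathogenic": 10.0, "Likely Pathogenic": 8.0, "VUS": 3.0,
           "Likely Benign": -2.0, "Benign": -5.0}
INTERVAR = {"Pathogenic": 8.0, "Likely Pathogenic": 6.0}
CONSEQUENCE = {"stopgain": 5.0, "stoploss": 5.0, "startloss": 5.0,
               "frameshift": 5.0, "splicing": 3.0, "nonsynonymous SNV": 2.0}


def tiered(value, tiers, below=False):
    """Return the points of the first tier the value satisfies, else None."""
    if value is None:
        return None
    for limit, points in tiers:
        if (value < limit) if below else (value >= limit):
            return points
    return None


def score_variant(v):
    """Return (total, breakdown) for one variant given as a dict."""
    candidates = {
        "clinvar": CLINVAR.get(v.get("clinvar")),
        "intervar": INTERVAR.get(v.get("intervar")),
        "consequence": CONSEQUENCE.get(v.get("consequence")),
        # Assumption: rarest tier first, only one tier applies.
        "population": tiered(v.get("af"), [(0.0001, 4.0), (0.001, 3.0), (0.01, 1.5)], below=True),
        "revel": tiered(v.get("revel"), [(0.75, 3.0), (0.50, 1.5)]),
        "cadd": tiered(v.get("cadd"), [(30, 3.0), (20, 2.0)]),
        "sift": 1.0 if v.get("sift") == "Deleterious" else None,
        "polyphen2": 1.0 if v.get("polyphen2") == "Damaging" else None,
    }
    breakdown = {k: pts for k, pts in candidates.items() if pts is not None}
    return sum(breakdown.values()), breakdown


strong = {"clinvar": "Pathogenic", "intervar": "Likely Pathogenic",
          "consequence": "stopgain", "af": 0.00005, "revel": 0.8, "cadd": 35,
          "sift": "Deleterious", "polyphen2": "Damaging"}
total, breakdown = score_variant(strong)
print(total, breakdown)
```

Returning the breakdown alongside the total mirrors the idea of a score_breakdown column: every point in the final score can be traced to a single row of the table.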


Pathway Enrichment

Enrichment analysis is performed using gseapy against:

Database       Gene Sets
Gene Ontology  GO Biological Process 2023
Gene Ontology  GO Molecular Function 2023
Gene Ontology  GO Cellular Component 2023
KEGG           KEGG 2021 Human
Reactome       Reactome 2022

Each database produces:

  • Bubble plot — x = -log₁₀(adj. p-value), size = gene count, color = odds ratio
  • Bar chart — ranked terms colored by gene count
  • Full results table with export

Significance threshold: adjusted p-value ≤ 0.2 (Benjamini-Hochberg). Requires ≥ 5 candidate genes.
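
For readers unfamiliar with the Benjamini-Hochberg adjustment, the sketch below computes adjusted p-values from raw ones, applies the 0.2 cutoff, and derives the -log10 value used on the bubble plot's x-axis. In VariantFlow the adjusted values come from gseapy; this standalone version, including the names `benjamini_hochberg` and `significant_terms`, is only an illustration of the arithmetic.

```python
import math


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_terms(terms, pvalues, cutoff=0.2):
    """Return (term, adjusted p, -log10 adjusted p) for terms at or under the cutoff."""
    rows = []
    for term, q in zip(terms, benjamini_hochberg(pvalues)):
        if q <= cutoff:
            # Guard against log10(0) for extremely small adjusted p-values.
            rows.append((term, q, -math.log10(max(q, 1e-300))))
    return sorted(rows, key=lambda row: row[1])


adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(q, 4) for q in adj])
```

Each raw p-value is scaled by n / rank, and the running minimum ensures that a term with a smaller raw p-value never ends up with a larger adjusted one.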


Dashboard

The interactive Dash dashboard (served at http://127.0.0.1:8050 by default) provides eight analysis pages:

Page              Content
Overview          KPI cards, filtering funnel, ClinVar/InterVar/chromosome distribution
Variant Explorer  Live-filtered table with score slider, ClinVar and InterVar dropdowns, histogram
Genes             Ranked bar chart (color-coded by ClinVar), Top N slider, full gene table
ClinVar           Classification distribution pie chart, filtered variant table
InterVar          ACMG classification bar chart, evidence criterion heatmap
Enrichment        Bubble + bar plots per database (GO CC, GO BP, GO MF, KEGG, Reactome)
3D Landscape      3D variant landscape (Score × CADD × REVEL) and 3D pathway landscape
Chromosome        Variant density by chromosome

All tables support column filtering, sorting, and Excel export.


Cohort and Family Analysis

Cohort

python variantflow_run.py cohort cohort_dir/ --output cohort_results/

Outputs:

  • cohort_shared_variants.tsv — variants present in ≥ 2 samples
  • cohort_unique_variants.tsv — sample-private variants
  • cohort_gene_burden.tsv — per-gene variant counts per sample
  • cohort_recurrent_genes.tsv — genes affected in ≥ 2 samples
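
The shared/unique split can be pictured as counting how many samples carry each variant. The sketch below assumes variants are identified by a (chromosome, start, end, ref, alt) tuple and that each sample has already been reduced to a set of such keys; `split_shared_unique` is a hypothetical name, not part of the VariantFlow API.

```python
from collections import defaultdict


def split_shared_unique(samples):
    """samples maps sample ID -> set of (chr, start, end, ref, alt) keys."""
    carriers = defaultdict(set)
    for sample_id, variants in samples.items():
        for key in variants:
            carriers[key].add(sample_id)
    # Shared: present in two or more samples. Unique: private to one sample.
    shared = {k: sorted(ids) for k, ids in carriers.items() if len(ids) >= 2}
    unique = {k: next(iter(ids)) for k, ids in carriers.items() if len(ids) == 1}
    return shared, unique


v1 = ("chr1", 100, 100, "A", "G")
v2 = ("chr2", 200, 200, "C", "T")
v3 = ("chr3", 300, 300, "G", "A")
cohort = {"S1": {v1, v2}, "S2": {v1}, "S3": {v1, v3}}
shared, unique = split_shared_unique(cohort)
print(shared)
print(unique)
```

The same carrier index, keyed by gene instead of by variant, would give the recurrent-gene list.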

Family (Trio/Quad)

python variantflow_run.py family family_dir/ \
  --proband PROBAND --father FATHER --mother MOTHER

Detects and outputs:

  • family_de_novo.tsv
  • family_autosomal_recessive.tsv
  • family_compound_heterozygous.tsv
  • family_x_linked.tsv
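
Two of these inheritance patterns reduce to simple set logic over a trio, sketched below. The data layout (one dict per family member mapping a variant key to "het" or "hom"), the `gene_of` lookup, and both function names are assumptions for illustration. A real caller would also need to confirm adequate parental read depth before trusting a de novo call, which this sketch omits.

```python
def find_de_novo(proband, father, mother):
    """Variants carried by the proband and absent from both parents."""
    return {key for key in proband if key not in father and key not in mother}


def find_compound_het(proband, father, mother, gene_of):
    """Genes where the proband has a het variant inherited from each parent."""
    by_gene = {}
    for key, genotype in proband.items():
        if genotype != "het":
            continue
        from_father = key in father and key not in mother
        from_mother = key in mother and key not in father
        if from_father or from_mother:
            side = "paternal" if from_father else "maternal"
            by_gene.setdefault(gene_of[key], {}).setdefault(side, []).append(key)
    # Keep only genes with at least one variant from each side.
    return {gene: sides for gene, sides in by_gene.items() if len(sides) == 2}


k1, k2 = ("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")
k3, k4 = ("chr2", 300, "G", "A"), ("chr3", 400, "T", "C")
proband = {k1: "het", k2: "het", k3: "het", k4: "hom"}
father = {k2: "het", k4: "het"}
mother = {k3: "het", k4: "het"}
gene_of = {k1: "GENE1", k2: "GENE2", k3: "GENE2", k4: "GENE3"}

print(find_de_novo(proband, father, mother))
print(find_compound_het(proband, father, mother, gene_of))
```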

Reproducibility

Every analysis generates a project.json manifest containing:

{
  "run_id": "16e51f54",
  "variantflow_version": "1.0.0",
  "created_at": "2026-06-03T10:36:51",
  "python_version": "3.13.11",
  "genome_build": "hg38",
  "input_files": ["sample.hg38_multianno.txt"],
  "filters_applied": ["quality", "population_frequency", "functional_consequence",
                       "exonic_consequence", "acmg_benign_removal"],
  "total_input_variants": 86299,
  "total_output_variants": 917,
  "total_candidate_genes": 100,
  "config": { "..." }
}

A methods.txt file is also generated, ready to paste into a manuscript Methods section.
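
A manifest like the one above can be assembled from values available at run time. The sketch below uses only the standard library and mirrors the field names shown in the example. `build_manifest` and `write_manifest` are hypothetical helpers, and the way the run ID is derived here (a truncated UUID) is an assumption.

```python
import json
import platform
import uuid
from datetime import datetime


def build_manifest(input_files, filters_applied, n_input, n_output, n_genes,
                   config, genome_build="hg38", version="1.0.0"):
    """Collect run metadata into a JSON-serializable dict."""
    return {
        "run_id": uuid.uuid4().hex[:8],  # assumption: short random identifier
        "variantflow_version": version,
        "created_at": datetime.now().isoformat(timespec="seconds"),
        "python_version": platform.python_version(),
        "genome_build": genome_build,
        "input_files": list(input_files),
        "filters_applied": list(filters_applied),
        "total_input_variants": n_input,
        "total_output_variants": n_output,
        "total_candidate_genes": n_genes,
        "config": config,
    }


def write_manifest(manifest, path):
    """Write the manifest as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2)


manifest = build_manifest(["sample.hg38_multianno.txt"],
                          ["quality", "population_frequency"],
                          86299, 917, 100, {"filters": {"min_dp": 10}})
print(json.dumps(manifest, indent=2))
```

Storing the full config next to the input and output counts is what lets a later reader rerun the analysis with identical settings and check that the numbers match.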


Citation

If you use VariantFlow in your research, please cite:

Tomar R. (2024). VariantFlow: A production-quality platform for genomic variant interpretation and prioritization. Dr Prabudh Goel Lab, AIIMS New Delhi. GitHub. https://github.com/imrobintomar/VariantFlow


Contributing

Contributions are welcome. Please open an issue before submitting a pull request. All contributors must follow the existing code style (black, ruff) and include unit tests.

# Run tests
pytest tests/unit/ -v --cov=variantflow

# Lint
ruff check variantflow/
black variantflow/

License

MIT License © 2024 Robin Tomar — Dr Prabudh Goel Lab, AIIMS New Delhi

See LICENSE for full terms.


