ExomeFlow

Production-quality Whole Exome Sequencing (WES) analysis pipeline

Author: Robin Tomar, AIIMS New Delhi
License: MIT


Overview

ExomeFlow is a Python package that wraps a complete WES analysis workflow into a single, reproducible CLI command. It handles cohort-level processing (multiple samples), checkpointing for resumable runs, structured logging, and parallel execution.

FASTQ
 └─ fastp (QC + trimming)
     └─ BWA MEM (alignment)
         └─ GATK SortSam (coordinate sort)
             └─ samtools flagstat (alignment QC)
                 └─ GATK MarkDuplicates
                     └─ GATK BuildBamIndex
                         └─ GATK BQSR (BaseRecalibrator + ApplyBQSR)
                             └─ GATK HaplotypeCaller (variant calling)
                                 └─ GATK VariantFiltration (hard filters)
                                     └─ ANNOVAR (functional annotation)

Requirements

System dependencies (must be on PATH)

Tool              Version tested
bwa               ≥ 0.7.17
samtools          ≥ 1.17
gatk              4.6.x
fastp             ≥ 0.23
Perl + ANNOVAR    table_annovar.pl
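
Before a long run it can help to confirm that the tools are reachable. The following is a minimal sketch, not part of ExomeFlow itself: the function name `missing_tools` and the `REQUIRED_TOOLS` tuple are made up for illustration, and only the executable names come from the list above.

```python
import shutil

# Executable names taken from the requirements list. table_annovar.pl is left
# out because the usage examples pass its location through --annovar-bin.
REQUIRED_TOOLS = ("bwa", "samtools", "gatk", "fastp", "perl")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` with no arguments returns an empty list when all five executables are found. It does not check versions.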

Python

  • Python ≥ 3.9
  • See requirements.txt for Python dependencies

Installation

From PyPI

pip install exomeflow

From source

git clone https://github.com/robintomar/exomeflow.git
cd exomeflow
pip install -e .

Reference files required

File                                                Description
hg38.fa                                             BWA-indexed reference genome
dbsnp.vcf.gz                                        dbSNP (bgzipped + tabix-indexed)
Mills_and_1000G_gold_standard.indels.hg38.vcf.gz    Mills indels
Homo_sapiens_assembly38.known_indels.vcf.gz         Known indels
Exome capture BED                                   e.g. Illumina_Exome_TargetedRegions_v1.2.hg38.bed
ANNOVAR humandb                                     hg38 annotation databases

Input FASTQ naming convention

ExomeFlow automatically detects samples from paired-end FASTQ files:

fastq/
├── sample1_1.fastq.gz
├── sample1_2.fastq.gz
├── sample2_1.fastq.gz
└── sample2_2.fastq.gz

Pattern: <sample_id>_1.fastq.gz / <sample_id>_2.fastq.gz
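
To preview which samples this naming pattern would pick up, the convention can be sketched in a few lines of Python. This is an illustration of the pattern only, not ExomeFlow's own detection code; `find_samples` and `FASTQ_RE` are hypothetical names, and skipping samples with a missing mate is an assumption of this sketch.

```python
import re
from pathlib import Path

# Matches <sample_id>_1.fastq.gz and <sample_id>_2.fastq.gz
FASTQ_RE = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")


def find_samples(fastq_dir):
    """Return {sample_id: (read1_path, read2_path)} for complete pairs only."""
    reads = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        match = FASTQ_RE.match(path.name)
        if match:
            reads.setdefault(match["sample"], {})[match["read"]] = path
    # Keep only samples that have both mates (assumption of this sketch).
    return {
        sample: (mates["1"], mates["2"])
        for sample, mates in reads.items()
        if "1" in mates and "2" in mates
    }
```

Files that do not match the pattern, and samples with only one mate present, are ignored by this sketch.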


Usage

Minimal example

exomeflow run \
  --input-dir fastq/ \
  --output results/ \
  --reference /refs/hg38.fa \
  --dbsnp /refs/dbsnp.vcf.gz \
  --mills /refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --known-indels /refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
  --annovar-bin /tools/annovar \
  --annovar-db /tools/annovar/humandb

Full example with all options

exomeflow run \
  --input-dir fastq/ \
  --output results/ \
  --reference /refs/hg38.fa \
  --dbsnp /refs/dbsnp.vcf.gz \
  --mills /refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --known-indels /refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
  --intervals /refs/Illumina_Exome_TargetedRegions_v1.2.hg38.bed \
  --interval-padding 100 \
  --annovar-bin /tools/annovar \
  --annovar-db /tools/annovar/humandb \
  --threads 32 \
  --fastp-threads 8 \
  --annovar-threads 24 \
  --max-workers 2 \
  --java-opts "-Xmx80g"

Check version

exomeflow --version

Help

exomeflow run --help

Output files

After a successful run the results/ directory contains:

results/
├── QC/                          # fastp HTML/JSON reports (reserved)
├── filtered_fastp/
│   ├── <sample>_1_filtered.fastq.gz
│   ├── <sample>_2_filtered.fastq.gz
│   ├── <sample>_fastp.html
│   └── <sample>_fastp.json
├── Mapsam/
│   ├── <sample>_recalibrated.bam   ← use in IGV for variant validation
│   └── <sample>_recalibrated.bam.bai
├── VCF/
│   ├── <sample>.vcf                          ← raw HaplotypeCaller output
│   ├── <sample>_PASS.vcf                     ← PASS-only hard-filtered variants
│   ├── <sample>.annovar.hg38_multianno.vcf   ← annotated VCF
│   └── <sample>.annovar.hg38_multianno.txt   ← annotated tab-delimited table
├── logs/
│   ├── analysis_<timestamp>.log   ← full pipeline log
│   ├── errors_<timestamp>.log     ← errors only
│   └── <sample>_<timestamp>.log   ← per-sample log
└── .checkpoints/                  ← resume state (do not delete during a run)
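
For cohorts with many samples, checking the tree by hand gets tedious. The sketch below lists the per-sample files shown above and reports any that are absent. It is a convenience script under the assumption that the layout is exactly as drawn; `expected_outputs` and `missing_outputs` are hypothetical helpers, not ExomeFlow functions.

```python
from pathlib import Path


def expected_outputs(results_dir, sample):
    """Return the per-sample output paths shown in the results/ tree."""
    root = Path(results_dir)
    return [
        root / "filtered_fastp" / f"{sample}_1_filtered.fastq.gz",
        root / "filtered_fastp" / f"{sample}_2_filtered.fastq.gz",
        root / "filtered_fastp" / f"{sample}_fastp.html",
        root / "filtered_fastp" / f"{sample}_fastp.json",
        root / "Mapsam" / f"{sample}_recalibrated.bam",
        root / "Mapsam" / f"{sample}_recalibrated.bam.bai",
        root / "VCF" / f"{sample}.vcf",
        root / "VCF" / f"{sample}_PASS.vcf",
        root / "VCF" / f"{sample}.annovar.hg38_multianno.vcf",
        root / "VCF" / f"{sample}.annovar.hg38_multianno.txt",
    ]


def missing_outputs(results_dir, sample):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(results_dir, sample) if not p.exists()]
```

Timestamped log files are left out because their names cannot be predicted in advance.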

Checkpointing & resuming

ExomeFlow writes a checkpoint file for every completed step. If the pipeline is interrupted (power failure, wall-time limit, etc.), re-run the exact same command; completed steps are skipped automatically.
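
The general idea behind this kind of resumable run can be sketched with marker files. This is a generic illustration of the technique, not ExomeFlow's actual implementation: the `run_step` helper and the `<sample>.<step>.done` marker naming are assumptions made for the example, and the real contents of .checkpoints/ may differ.

```python
from pathlib import Path


def run_step(checkpoint_dir, sample, step, action):
    """Run action() unless a marker file says this step already finished.

    Returns True if the step ran, False if it was skipped.
    """
    # Hypothetical marker name; one empty file per completed (sample, step).
    marker = Path(checkpoint_dir) / f"{sample}.{step}.done"
    if marker.exists():
        return False
    action()  # if this raises, no marker is written and the step reruns later
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True
```

Writing the marker only after the action returns means an interrupted step is repeated on the next run rather than silently skipped.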


GATK hard-filter thresholds

SNPs

Filter name            Expression
SNP_LowQD              QD < 2.0
SNP_StrandBias         FS > 60.0
SNP_StrandOddsRatio    SOR > 3.0
SNP_LowMQ              MQ < 40.0
SNP_MQRankSum          MQRankSum < -12.5
SNP_ReadPosRankSum     ReadPosRankSum < -8.0
LowDepth               DP < 10
LowGQ (genotype)       GQ < 20

INDELs

Filter name              Expression
INDEL_LowQD              QD < 2.0
INDEL_StrandBias         FS > 200.0
INDEL_StrandOddsRatio    SOR > 10.0
INDEL_ReadPosRankSum     ReadPosRankSum < -20.0
LowDepth                 DP < 10
LowGQ (genotype)         GQ < 20
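
To see how these thresholds act on a single variant, the site-level expressions can be written out as plain comparisons. The sketch below is for illustration only and does not call GATK; `failed_filters` and the two filter lists are hypothetical names. It assumes the INFO annotations have already been parsed into a dict of numbers, skips annotations that are absent from the record, and leaves out the genotype-level LowGQ filter because that applies per sample rather than per site.

```python
import operator

# (filter name, INFO key, comparison, threshold), copied from the tables above
SNP_FILTERS = [
    ("SNP_LowQD", "QD", operator.lt, 2.0),
    ("SNP_StrandBias", "FS", operator.gt, 60.0),
    ("SNP_StrandOddsRatio", "SOR", operator.gt, 3.0),
    ("SNP_LowMQ", "MQ", operator.lt, 40.0),
    ("SNP_MQRankSum", "MQRankSum", operator.lt, -12.5),
    ("SNP_ReadPosRankSum", "ReadPosRankSum", operator.lt, -8.0),
    ("LowDepth", "DP", operator.lt, 10),
]

INDEL_FILTERS = [
    ("INDEL_LowQD", "QD", operator.lt, 2.0),
    ("INDEL_StrandBias", "FS", operator.gt, 200.0),
    ("INDEL_StrandOddsRatio", "SOR", operator.gt, 10.0),
    ("INDEL_ReadPosRankSum", "ReadPosRankSum", operator.lt, -20.0),
    ("LowDepth", "DP", operator.lt, 10),
]


def failed_filters(info, filters):
    """Return the names of filters whose expression is true for this site.

    Annotations missing from `info` are skipped (assumption of this sketch).
    """
    return [
        name
        for name, key, compare, threshold in filters
        if key in info and compare(info[key], threshold)
    ]
```

A site with an empty result from `failed_filters` would be the kind of record that ends up in `<sample>_PASS.vcf`, provided its genotype also clears the GQ threshold.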

ANNOVAR annotation databases (default)

refGene, dbnsfp47a, clinvar_20240416, gnomad41_exome,
gnomad41_genome, avsnp150, cosmic84_coding, exac03
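
For readers who want to reproduce the annotation step by hand, the list above maps onto the `-protocol` and `-operation` arguments of table_annovar.pl. The sketch below only builds those two arguments. It assumes refGene is run as a gene-based annotation (`g`) and every other listed database as filter-based (`f`), which is ANNOVAR's usual convention but is not stated in this README; `annovar_protocol_args` is a hypothetical helper, and how ExomeFlow assembles its own command line is not shown here.

```python
DEFAULT_PROTOCOLS = (
    "refGene", "dbnsfp47a", "clinvar_20240416", "gnomad41_exome",
    "gnomad41_genome", "avsnp150", "cosmic84_coding", "exac03",
)


def annovar_protocol_args(protocols=DEFAULT_PROTOCOLS):
    """Build the -protocol / -operation argument pair for table_annovar.pl."""
    # Assumption: refGene is gene-based ("g"); the rest are filter-based ("f").
    operations = ["g" if name == "refGene" else "f" for name in protocols]
    return [
        "-protocol", ",".join(protocols),
        "-operation", ",".join(operations),
    ]
```

The two comma-separated lists must have the same length and order, which is why they are derived from one sequence.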

Citation

If you use ExomeFlow in your research, please cite:

Robin Tomar. ExomeFlow: a production-quality whole exome sequencing pipeline. AIIMS New Delhi, 2025.
