ExomeFlow

Production-quality Whole Exome Sequencing (WES) analysis pipeline

Author: Robin Tomar, AIIMS New Delhi
License: MIT


Overview

ExomeFlow is a Python package that wraps a complete WES analysis workflow into a single, reproducible CLI command. It handles cohort-level processing (multiple samples), checkpointing for resumable runs, structured logging, and parallel execution.

FASTQ
 └─ fastp (QC + trimming)
     └─ BWA MEM (alignment)
         └─ GATK SortSam (coordinate sort)
             └─ samtools flagstat (alignment QC)
                 └─ GATK MarkDuplicates
                     └─ GATK BuildBamIndex
                         └─ GATK BQSR (BaseRecalibrator + ApplyBQSR)
                             └─ GATK HaplotypeCaller (variant calling)
                                 └─ GATK VariantFiltration (hard filters)
                                     └─ ANNOVAR (functional annotation)
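
The workflow above can be read as an ordered list of steps applied to each sample, with several samples processed at once. The following is a minimal sketch of that idea, not ExomeFlow's actual implementation: the step identifiers, `run_sample`, `run_cohort`, and the `run_step` callback are all hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Step names mirror the diagram above; the identifiers themselves are made up.
STEPS = [
    "fastp",
    "bwa_mem",
    "sort_sam",
    "flagstat",
    "mark_duplicates",
    "build_bam_index",
    "bqsr",
    "haplotype_caller",
    "variant_filtration",
    "annovar",
]


def run_sample(sample_id, run_step):
    """Run every step in order for one sample and return the steps executed."""
    done = []
    for step in STEPS:
        run_step(sample_id, step)  # hypothetical callback that would invoke the tool
        done.append(step)
    return sample_id, done


def run_cohort(sample_ids, run_step, max_workers=2):
    """Process samples concurrently while keeping steps sequential per sample."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda s: run_sample(s, run_step), sample_ids))


if __name__ == "__main__":
    # A no-op callback stands in for the real tool invocations.
    results = run_cohort(["sample1", "sample2"], lambda sample, step: None)
    print(results["sample1"][0], "->", results["sample1"][-1])
```

Threads are sufficient in a sketch like this because the heavy work would happen in external processes (bwa, gatk, and so on), not in Python itself.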

Requirements

System dependencies (must be on PATH)

Tool              Version tested
bwa               ≥ 0.7.17
samtools          ≥ 1.17
gatk              4.6.x
fastp             ≥ 0.23
Perl + ANNOVAR    table_annovar.pl
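
Before starting a long run, it can help to confirm that the executables listed above resolve on PATH. The sketch below uses only the standard library; the function name `missing_tools` is illustrative, and the check does not verify tool versions. `table_annovar.pl` is left out because its location is passed separately via `--annovar-bin`.

```python
import shutil

# Executable names taken from the table above.
REQUIRED_TOOLS = ["bwa", "samtools", "gatk", "fastp", "perl"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```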

Python

  • Python ≥ 3.9
  • See requirements.txt for Python dependencies

Installation

From PyPI

pip install exomeflow

From source

git clone https://github.com/robintomar/exomeflow.git
cd exomeflow
pip install -e .

Reference files required

File                                                Description
hg38.fa                                             BWA-indexed reference genome
dbsnp.vcf.gz                                        dbSNP (bgzipped + tabix-indexed)
Mills_and_1000G_gold_standard.indels.hg38.vcf.gz    Mills indels
Homo_sapiens_assembly38.known_indels.vcf.gz         Known indels
Exome capture BED                                   e.g. Illumina_Exome_TargetedRegions_v1.2.hg38.bed
ANNOVAR humandb                                     hg38 annotation databases
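
A quick pre-flight check of the reference files and their companion indexes can catch missing files early. The sketch below assumes the usual conventions: BWA index files next to the FASTA (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`), a samtools `.fai` index, a GATK sequence dictionary named by replacing the FASTA extension with `.dict`, and a `.tbi` index next to each bgzipped VCF. Whether ExomeFlow performs the same checks is not stated here, and `missing_reference_files` is a hypothetical helper.

```python
from pathlib import Path

# Files produced by `bwa index` alongside the reference FASTA.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_reference_files(reference, known_sites=()):
    """Return expected reference, index, and known-sites paths that do not exist."""
    ref = Path(reference)
    expected = [
        ref,
        Path(str(ref) + ".fai"),   # samtools faidx index
        ref.with_suffix(".dict"),  # sequence dictionary (assumed naming)
    ]
    expected += [Path(str(ref) + suffix) for suffix in BWA_INDEX_SUFFIXES]
    for vcf in known_sites:
        expected += [Path(vcf), Path(str(vcf) + ".tbi")]  # tabix index
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    problems = missing_reference_files(
        "/refs/hg38.fa",
        known_sites=[
            "/refs/dbsnp.vcf.gz",
            "/refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
            "/refs/Homo_sapiens_assembly38.known_indels.vcf.gz",
        ],
    )
    for path in problems:
        print("missing:", path)
```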

Input FASTQ naming convention

ExomeFlow automatically detects samples from paired-end FASTQ files:

fastq/
├── sample1_1.fastq.gz
├── sample1_2.fastq.gz
├── sample2_1.fastq.gz
└── sample2_2.fastq.gz

Pattern: <sample_id>_1.fastq.gz / <sample_id>_2.fastq.gz
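
The naming pattern above can be matched with a small regular expression. This is a sketch of one way to pair the files, not ExomeFlow's own detection code; `find_samples` is a hypothetical name, and samples with only one mate present are skipped here as a design choice.

```python
import re
from pathlib import Path

# <sample_id>_1.fastq.gz / <sample_id>_2.fastq.gz
FASTQ_RE = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fastq\.gz$")


def find_samples(input_dir):
    """Return {sample_id: (r1_path, r2_path)} for complete pairs only."""
    mates = {}
    for path in sorted(Path(input_dir).iterdir()):
        match = FASTQ_RE.match(path.name)
        if match:
            mates.setdefault(match["sample"], {})[match["mate"]] = path
    return {
        sample: (files["1"], files["2"])
        for sample, files in mates.items()
        if files.keys() >= {"1", "2"}  # keep only samples with both mates
    }


if __name__ == "__main__":
    for sample, (r1, r2) in find_samples("fastq").items():
        print(sample, r1.name, r2.name)
```

Because the sample part of the pattern is greedy, sample IDs that contain underscores (for example `patient_A_1.fastq.gz`) still resolve to `patient_A`.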


Usage

Minimal example

exomeflow run \
  --input-dir fastq/ \
  --output results/ \
  --reference /refs/hg38.fa \
  --dbsnp /refs/dbsnp.vcf.gz \
  --mills /refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --known-indels /refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
  --annovar-bin /tools/annovar \
  --annovar-db /tools/annovar/humandb
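
When the pipeline is launched from a Python script (for example, a wrapper that submits jobs), the same command can be assembled as an argument list. The sketch below uses only the flags shown in the minimal example; `build_run_command` and the `refs` dictionary keys are hypothetical.

```python
import shlex


def build_run_command(input_dir, output_dir, refs, annovar_bin, annovar_db):
    """Assemble the `exomeflow run` argument list from the minimal example."""
    return [
        "exomeflow", "run",
        "--input-dir", str(input_dir),
        "--output", str(output_dir),
        "--reference", refs["reference"],
        "--dbsnp", refs["dbsnp"],
        "--mills", refs["mills"],
        "--known-indels", refs["known_indels"],
        "--annovar-bin", annovar_bin,
        "--annovar-db", annovar_db,
    ]


if __name__ == "__main__":
    cmd = build_run_command(
        "fastq/",
        "results/",
        {
            "reference": "/refs/hg38.fa",
            "dbsnp": "/refs/dbsnp.vcf.gz",
            "mills": "/refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
            "known_indels": "/refs/Homo_sapiens_assembly38.known_indels.vcf.gz",
        },
        "/tools/annovar",
        "/tools/annovar/humandb",
    )
    print(shlex.join(cmd))
    # To execute: subprocess.run(cmd, check=True)
```

Passing a list instead of a shell string avoids quoting problems with paths that contain spaces.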

Full example with all options

exomeflow run \
  --input-dir fastq/ \
  --output results/ \
  --reference /refs/hg38.fa \
  --dbsnp /refs/dbsnp.vcf.gz \
  --mills /refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --known-indels /refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
  --intervals /refs/Illumina_Exome_TargetedRegions_v1.2.hg38.bed \
  --interval-padding 100 \
  --annovar-bin /tools/annovar \
  --annovar-db /tools/annovar/humandb \
  --threads 32 \
  --fastp-threads 8 \
  --annovar-threads 24 \
  --max-workers 2 \
  --java-opts "-Xmx80g"

Check version

exomeflow --version

Help

exomeflow run --help

Output files

After a successful run the results/ directory contains:

results/
├── QC/                          # fastp HTML/JSON reports (reserved)
├── filtered_fastp/
│   ├── <sample>_1_filtered.fastq.gz
│   ├── <sample>_2_filtered.fastq.gz
│   ├── <sample>_fastp.html
│   └── <sample>_fastp.json
├── Mapsam/
│   ├── <sample>_recalibrated.bam   ← use in IGV for variant validation
│   └── <sample>_recalibrated.bam.bai
├── VCF/
│   ├── <sample>.vcf                          ← raw HaplotypeCaller output
│   ├── <sample>_PASS.vcf                     ← PASS-only hard-filtered variants
│   ├── <sample>.annovar.hg38_multianno.vcf   ← annotated VCF
│   └── <sample>.annovar.hg38_multianno.txt   ← annotated tab-delimited table
├── logs/
│   ├── analysis_<timestamp>.log   ← full pipeline log
│   ├── errors_<timestamp>.log     ← errors only
│   └── <sample>_<timestamp>.log   ← per-sample log
└── .checkpoints/                  ← resume state (do not delete during a run)
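
To confirm that a sample finished cleanly, the per-sample files in the tree above can be checked for existence. The path templates below are copied from that listing; `missing_outputs` is an illustrative helper, not part of the ExomeFlow CLI.

```python
from pathlib import Path

# Per-sample output paths, relative to the results directory.
OUTPUT_TEMPLATES = [
    "filtered_fastp/{sample}_1_filtered.fastq.gz",
    "filtered_fastp/{sample}_2_filtered.fastq.gz",
    "filtered_fastp/{sample}_fastp.html",
    "filtered_fastp/{sample}_fastp.json",
    "Mapsam/{sample}_recalibrated.bam",
    "Mapsam/{sample}_recalibrated.bam.bai",
    "VCF/{sample}.vcf",
    "VCF/{sample}_PASS.vcf",
    "VCF/{sample}.annovar.hg38_multianno.vcf",
    "VCF/{sample}.annovar.hg38_multianno.txt",
]


def missing_outputs(results_dir, sample):
    """Return the expected output files that are absent for one sample."""
    root = Path(results_dir)
    expected = [template.format(sample=sample) for template in OUTPUT_TEMPLATES]
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_outputs("results", "sample1"):
        print("missing:", rel)
```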

Checkpointing & resuming

ExomeFlow writes a checkpoint file for every completed step. If the pipeline is interrupted (power failure, wall-time limit, etc.), re-run the exact same command; completed steps are skipped automatically.
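
The general idea behind step-level checkpointing can be sketched with marker files. The layout used here (`.checkpoints/<sample>.<step>.done`) and the names `Checkpoints` and `run_with_resume` are assumptions made for illustration; the real on-disk format of ExomeFlow's checkpoints is not documented in this README.

```python
from pathlib import Path


class Checkpoints:
    """Track completed steps as empty marker files (hypothetical layout)."""

    def __init__(self, output_dir):
        self.root = Path(output_dir) / ".checkpoints"
        self.root.mkdir(parents=True, exist_ok=True)

    def _marker(self, sample, step):
        return self.root / f"{sample}.{step}.done"

    def is_done(self, sample, step):
        return self._marker(sample, step).exists()

    def mark_done(self, sample, step):
        self._marker(sample, step).touch()


def run_with_resume(checkpoints, sample, steps):
    """Run (name, callable) steps in order, skipping ones already completed."""
    executed = []
    for name, func in steps:
        if checkpoints.is_done(sample, name):
            continue
        func()
        # Reached only if func() did not raise, so a failed step is retried.
        checkpoints.mark_done(sample, name)
        executed.append(name)
    return executed
```

Writing the marker only after a step returns means an interrupted step leaves no checkpoint behind and is repeated on the next run.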


GATK hard-filter thresholds

SNPs

Filter name            Expression
SNP_LowQD              QD < 2.0
SNP_StrandBias         FS > 60.0
SNP_StrandOddsRatio    SOR > 3.0
SNP_LowMQ              MQ < 40.0
SNP_MQRankSum          MQRankSum < -12.5
SNP_ReadPosRankSum     ReadPosRankSum < -8.0
LowDepth               DP < 10
LowGQ (genotype)       GQ < 20

INDELs

Filter name              Expression
INDEL_LowQD              QD < 2.0
INDEL_StrandBias         FS > 200.0
INDEL_StrandOddsRatio    SOR > 10.0
INDEL_ReadPosRankSum     ReadPosRankSum < -20.0
LowDepth                 DP < 10
LowGQ (genotype)         GQ < 20
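
To see how the site-level thresholds above combine, the sketch below evaluates them against a dictionary of annotation values. It illustrates the logic only and does not replace GATK VariantFiltration. `failed_filters` is a hypothetical helper, annotations missing from the input are skipped instead of counted as failures (an assumption), and the genotype-level LowGQ filter is left out because it applies per sample, not per site.

```python
import operator

# (filter name, annotation, comparison, threshold), taken from the tables above.
SNP_FILTERS = [
    ("SNP_LowQD", "QD", operator.lt, 2.0),
    ("SNP_StrandBias", "FS", operator.gt, 60.0),
    ("SNP_StrandOddsRatio", "SOR", operator.gt, 3.0),
    ("SNP_LowMQ", "MQ", operator.lt, 40.0),
    ("SNP_MQRankSum", "MQRankSum", operator.lt, -12.5),
    ("SNP_ReadPosRankSum", "ReadPosRankSum", operator.lt, -8.0),
    ("LowDepth", "DP", operator.lt, 10),
]

INDEL_FILTERS = [
    ("INDEL_LowQD", "QD", operator.lt, 2.0),
    ("INDEL_StrandBias", "FS", operator.gt, 200.0),
    ("INDEL_StrandOddsRatio", "SOR", operator.gt, 10.0),
    ("INDEL_ReadPosRankSum", "ReadPosRankSum", operator.lt, -20.0),
    ("LowDepth", "DP", operator.lt, 10),
]


def failed_filters(annotations, is_indel=False):
    """Return the names of the filters a variant fails; an empty list means PASS."""
    rules = INDEL_FILTERS if is_indel else SNP_FILTERS
    return [
        name
        for name, key, compare, threshold in rules
        if key in annotations and compare(annotations[key], threshold)
    ]


if __name__ == "__main__":
    site = {"QD": 1.5, "FS": 70.0, "SOR": 1.0, "MQ": 60.0, "DP": 30}
    print(failed_filters(site))                 # ['SNP_LowQD', 'SNP_StrandBias']
    print(failed_filters(site, is_indel=True))  # ['INDEL_LowQD']
```

The same site fails fewer INDEL filters because the INDEL thresholds for FS and SOR are looser than the SNP ones.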

ANNOVAR annotation databases (default)

refGene, dbnsfp47a, clinvar_20240416, gnomad41_exome,
gnomad41_genome, avsnp150, cosmic84_coding, exac03
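
For reference, the database list above corresponds to a `table_annovar.pl` call in which `refGene` is a gene-based annotation (operation `g`) and the remaining databases are filter-based (operation `f`). The sketch below builds such a command with commonly used options; it is an assumption about a typical invocation, not a transcript of what ExomeFlow runs, and `build_annovar_command` is a hypothetical name.

```python
# (protocol, operation) pairs: g = gene-based, f = filter-based.
PROTOCOLS = [
    ("refGene", "g"),
    ("dbnsfp47a", "f"),
    ("clinvar_20240416", "f"),
    ("gnomad41_exome", "f"),
    ("gnomad41_genome", "f"),
    ("avsnp150", "f"),
    ("cosmic84_coding", "f"),
    ("exac03", "f"),
]


def build_annovar_command(vcf, out_prefix, annovar_bin, annovar_db):
    """Build a table_annovar.pl argument list for a VCF input on hg38."""
    return [
        "perl", f"{annovar_bin}/table_annovar.pl", vcf, annovar_db,
        "-buildver", "hg38",
        "-out", out_prefix,
        "-remove",  # delete intermediate files
        "-protocol", ",".join(name for name, _ in PROTOCOLS),
        "-operation", ",".join(op for _, op in PROTOCOLS),
        "-nastring", ".",  # placeholder for missing annotations
        "-vcfinput",
    ]


if __name__ == "__main__":
    cmd = build_annovar_command(
        "results/VCF/sample1_PASS.vcf",
        "results/VCF/sample1.annovar",
        "/tools/annovar",
        "/tools/annovar/humandb",
    )
    print(" ".join(cmd))
```

With `-vcfinput` and an output prefix ending in `.annovar`, ANNOVAR writes `<prefix>.hg38_multianno.vcf` and `<prefix>.hg38_multianno.txt`, which matches the file names listed under Output files.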

Publishing to PyPI

pip install build twine

# Build source + wheel distributions
python -m build

# Upload to PyPI (requires ~/.pypirc or TWINE_USERNAME / TWINE_PASSWORD env vars)
twine upload dist/*

To publish to TestPyPI first:

twine upload --repository testpypi dist/*
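# --extra-index-url lets pip resolve dependencies that are not hosted on TestPyPI
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ exomeflow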

Development

# Install in editable mode with dev extras
pip install -e ".[dev]"

# Lint
flake8 exomeflow/
mypy exomeflow/

Citation

If you use ExomeFlow in your research, please cite:

Robin Tomar. ExomeFlow: a production-quality whole exome sequencing pipeline. AIIMS New Delhi, 2025.
