ExomeFlow
Production-quality Whole Exome Sequencing (WES) analysis pipeline
Author: Robin Tomar, AIIMS New Delhi
License: MIT
Overview
ExomeFlow is a Python package that wraps a complete WES analysis workflow into a single, reproducible CLI command. It handles cohort-level processing (multiple samples), checkpointing for resumable runs, structured logging, and parallel execution.
FASTQ
└─ fastp (QC + trimming)
└─ BWA MEM (alignment)
└─ GATK SortSam (coordinate sort)
└─ samtools flagstat (alignment QC)
└─ GATK MarkDuplicates
└─ GATK BuildBamIndex
└─ GATK BQSR (BaseRecalibrator + ApplyBQSR)
└─ GATK HaplotypeCaller (variant calling)
└─ GATK VariantFiltration (hard filters)
└─ ANNOVAR (functional annotation)
Requirements
System dependencies (must be on PATH)
| Tool | Version tested |
|---|---|
| bwa | ≥ 0.7.17 |
| samtools | ≥ 1.17 |
| gatk | 4.6.x |
| fastp | ≥ 0.23 |
| Perl + ANNOVAR | table_annovar.pl |
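Before starting a long run it can help to confirm that the executables in the table resolve on PATH. The following is a minimal sketch, not part of ExomeFlow itself; the function name `find_missing_tools` and the tool list are assumptions for illustration, and ANNOVAR is left out because its location is passed with `--annovar-bin`.

```python
import shutil

# Executable names taken from the table above. ANNOVAR is not checked here
# because its directory is supplied separately via --annovar-bin.
REQUIRED_TOOLS = ["bwa", "samtools", "gatk", "fastp", "perl"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be located on PATH.

    `which` is injectable so the check can be exercised without the
    real tools installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

This checks presence only; it does not verify that the installed versions match the ones tested.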
Python
- Python ≥ 3.9
- See requirements.txt for Python dependencies
Installation
From PyPI
pip install exomeflow
From source
git clone https://github.com/robintomar/exomeflow.git
cd exomeflow
pip install -e .
Reference files required
| File | Description |
|---|---|
| hg38.fa | BWA-indexed reference genome |
| dbsnp.vcf.gz | dbSNP (bgzipped + tabix-indexed) |
| Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | Mills indels |
| Homo_sapiens_assembly38.known_indels.vcf.gz | Known indels |
| Exome capture BED | e.g. Illumina_Exome_TargetedRegions_v1.2.hg38.bed |
| ANNOVAR humandb | hg38 annotation databases |
Input FASTQ naming convention
ExomeFlow automatically detects samples from paired-end FASTQ files:
fastq/
├── sample1_1.fastq.gz
├── sample1_2.fastq.gz
├── sample2_1.fastq.gz
└── sample2_2.fastq.gz
Pattern: <sample_id>_1.fastq.gz / <sample_id>_2.fastq.gz
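The naming convention above can be expressed as a regular expression. The sketch below shows one way paired files could be grouped by sample ID; it is an illustration of the convention, not ExomeFlow's actual detection code, and `detect_samples` is a hypothetical name.

```python
import re
from pathlib import Path

# Matches <sample_id>_1.fastq.gz and <sample_id>_2.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")


def detect_samples(filenames):
    """Group FASTQ names into {sample_id: (read1, read2)}.

    Files that do not match the pattern, and samples with only one
    mate present, are ignored.
    """
    reads = {}
    for name in filenames:
        match = FASTQ_PATTERN.match(Path(name).name)
        if match:
            reads.setdefault(match["sample"], {})[match["read"]] = name
    return {
        sample: (mates["1"], mates["2"])
        for sample, mates in sorted(reads.items())
        if "1" in mates and "2" in mates
    }
```

For a real directory, the function could be called as `detect_samples(str(p) for p in Path("fastq").iterdir())`.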
Usage
Minimal example
exomeflow run \
--input-dir fastq/ \
--output results/ \
--reference /refs/hg38.fa \
--dbsnp /refs/dbsnp.vcf.gz \
--mills /refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-indels /refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
--annovar-bin /tools/annovar \
--annovar-db /tools/annovar/humandb
Full example with all options
exomeflow run \
--input-dir fastq/ \
--output results/ \
--reference /refs/hg38.fa \
--dbsnp /refs/dbsnp.vcf.gz \
--mills /refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-indels /refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
--intervals /refs/Illumina_Exome_TargetedRegions_v1.2.hg38.bed \
--interval-padding 100 \
--annovar-bin /tools/annovar \
--annovar-db /tools/annovar/humandb \
--threads 32 \
--fastp-threads 8 \
--annovar-threads 24 \
--max-workers 2 \
--java-opts "-Xmx80g"
Check version
exomeflow --version
Help
exomeflow run --help
Output files
After a successful run the results/ directory contains:
results/
├── QC/ # fastp HTML/JSON reports (reserved)
├── filtered_fastp/
│ ├── <sample>_1_filtered.fastq.gz
│ ├── <sample>_2_filtered.fastq.gz
│ ├── <sample>_fastp.html
│ └── <sample>_fastp.json
├── Mapsam/
│ ├── <sample>_recalibrated.bam ← use in IGV for variant validation
│ └── <sample>_recalibrated.bam.bai
├── VCF/
│ ├── <sample>.vcf ← raw HaplotypeCaller output
│ ├── <sample>_PASS.vcf ← PASS-only hard-filtered variants
│ ├── <sample>.annovar.hg38_multianno.vcf ← annotated VCF
│ └── <sample>.annovar.hg38_multianno.txt ← annotated tab-delimited table
├── logs/
│ ├── analysis_<timestamp>.log ← full pipeline log
│ ├── errors_<timestamp>.log ← errors only
│ └── <sample>_<timestamp>.log ← per-sample log
└── .checkpoints/ ← resume state (do not delete during a run)
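To confirm that a sample finished end to end, the per-sample files in the tree above can be checked programmatically. This is a small sketch under the assumption that the layout is exactly as listed; `missing_outputs` is a hypothetical helper, not an ExomeFlow function.

```python
from pathlib import Path

# Per-sample output paths, relative to the results directory,
# copied from the directory tree above.
OUTPUT_TEMPLATES = [
    "filtered_fastp/{sample}_1_filtered.fastq.gz",
    "filtered_fastp/{sample}_2_filtered.fastq.gz",
    "filtered_fastp/{sample}_fastp.html",
    "filtered_fastp/{sample}_fastp.json",
    "Mapsam/{sample}_recalibrated.bam",
    "Mapsam/{sample}_recalibrated.bam.bai",
    "VCF/{sample}.vcf",
    "VCF/{sample}_PASS.vcf",
    "VCF/{sample}.annovar.hg38_multianno.vcf",
    "VCF/{sample}.annovar.hg38_multianno.txt",
]


def missing_outputs(results_dir, sample):
    """Return the expected per-sample outputs that are not present as files."""
    results = Path(results_dir)
    expected = (template.format(sample=sample) for template in OUTPUT_TEMPLATES)
    return [rel for rel in expected if not (results / rel).is_file()]
```

An empty list from `missing_outputs("results", "sample1")` would indicate that every listed file for `sample1` exists.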
Checkpointing & resuming
ExomeFlow writes a checkpoint file for every completed step. If the pipeline is interrupted (power failure, wall-time limit, etc.), re-run the exact same command; completed steps are skipped automatically.
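The general idea behind marker-file checkpointing can be sketched as follows. This illustrates the technique only; the marker naming (`<sample>.<step>.done`) and the `run_step` function are assumptions, and ExomeFlow's real checkpoint format may differ.

```python
from pathlib import Path


def run_step(checkpoint_dir, sample, step, action):
    """Run `action` unless a marker file shows the step already completed.

    Returns True if the step ran, False if it was skipped.
    """
    # Hypothetical marker naming, chosen for this sketch only.
    marker = Path(checkpoint_dir) / f"{sample}.{step}.done"
    if marker.exists():
        return False
    # If the action raises, no marker is written, so the step is
    # attempted again on the next run.
    action()
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True
```

Because the marker is written only after the action returns, an interrupted step leaves no marker and is repeated on resume, while finished steps are skipped.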
GATK hard-filter thresholds
SNPs
| Filter name | Expression |
|---|---|
| SNP_LowQD | QD < 2.0 |
| SNP_StrandBias | FS > 60.0 |
| SNP_StrandOddsRatio | SOR > 3.0 |
| SNP_LowMQ | MQ < 40.0 |
| SNP_MQRankSum | MQRankSum < -12.5 |
| SNP_ReadPosRankSum | ReadPosRankSum < -8.0 |
| LowDepth | DP < 10 |
| LowGQ (genotype) | GQ < 20 |
INDELs
| Filter name | Expression |
|---|---|
| INDEL_LowQD | QD < 2.0 |
| INDEL_StrandBias | FS > 200.0 |
| INDEL_StrandOddsRatio | SOR > 10.0 |
| INDEL_ReadPosRankSum | ReadPosRankSum < -20.0 |
| LowDepth | DP < 10 |
| LowGQ (genotype) | GQ < 20 |
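For readers who want to reproduce these filters by hand, the tables map onto GATK VariantFiltration's `--filter-name` / `--filter-expression` and `--genotype-filter-name` / `--genotype-filter-expression` options. The sketch below builds such an argument list from the thresholds above; how ExomeFlow itself separates SNPs from INDELs and assembles the command is not shown in this README, so `variant_filtration_args` is a hypothetical helper.

```python
# Site-level thresholds copied from the tables above.
SNP_FILTERS = {
    "SNP_LowQD": "QD < 2.0",
    "SNP_StrandBias": "FS > 60.0",
    "SNP_StrandOddsRatio": "SOR > 3.0",
    "SNP_LowMQ": "MQ < 40.0",
    "SNP_MQRankSum": "MQRankSum < -12.5",
    "SNP_ReadPosRankSum": "ReadPosRankSum < -8.0",
    "LowDepth": "DP < 10",
}
INDEL_FILTERS = {
    "INDEL_LowQD": "QD < 2.0",
    "INDEL_StrandBias": "FS > 200.0",
    "INDEL_StrandOddsRatio": "SOR > 10.0",
    "INDEL_ReadPosRankSum": "ReadPosRankSum < -20.0",
    "LowDepth": "DP < 10",
}
# Genotype-level threshold shared by both tables.
GENOTYPE_FILTERS = {"LowGQ": "GQ < 20"}


def variant_filtration_args(site_filters, genotype_filters=GENOTYPE_FILTERS):
    """Build name/expression argument pairs for GATK VariantFiltration."""
    args = []
    for name, expression in site_filters.items():
        args += ["--filter-name", name, "--filter-expression", expression]
    for name, expression in genotype_filters.items():
        args += [
            "--genotype-filter-name", name,
            "--genotype-filter-expression", expression,
        ]
    return args
```

The returned list is meant to be appended to a `gatk VariantFiltration` command that already carries the reference, input, and output options, for example when passing arguments to `subprocess.run`.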
ANNOVAR annotation databases (default)
refGene, dbnsfp47a, clinvar_20240416, gnomad41_exome,
gnomad41_genome, avsnp150, cosmic84_coding, exac03
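In `table_annovar.pl`, each database named in `-protocol` needs a matching entry in `-operation` (`g` for gene-based, `f` for filter-based annotation). The sketch below derives both arguments from the default list above, assuming refGene is the only gene-based database in it; `annovar_protocol_args` is a hypothetical helper and not necessarily how ExomeFlow builds its command.

```python
# Default databases copied from the list above.
DEFAULT_PROTOCOLS = [
    "refGene", "dbnsfp47a", "clinvar_20240416", "gnomad41_exome",
    "gnomad41_genome", "avsnp150", "cosmic84_coding", "exac03",
]

# Assumption for this sketch: refGene is gene-based, everything else
# in the default list is filter-based.
GENE_BASED = {"refGene"}


def annovar_protocol_args(protocols=DEFAULT_PROTOCOLS):
    """Return matching -protocol / -operation arguments for table_annovar.pl."""
    operations = ["g" if name in GENE_BASED else "f" for name in protocols]
    return [
        "-protocol", ",".join(protocols),
        "-operation", ",".join(operations),
    ]
```

Keeping the operation codes derived from the protocol list avoids the two arguments drifting out of step when a database is added or removed.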
Publishing to PyPI
pip install build twine
# Build source + wheel distributions
python -m build
# Upload to PyPI (requires ~/.pypirc or TWINE_USERNAME / TWINE_PASSWORD env vars)
twine upload dist/*
To publish to TestPyPI first:
twine upload --repository testpypi dist/*
pip install --index-url https://test.pypi.org/simple/ exomeflow
Development
# Install in editable mode with dev extras
pip install -e ".[dev]"
# Lint
flake8 exomeflow/
mypy exomeflow/
Citation
If you use ExomeFlow in your research, please cite:
Robin Tomar. ExomeFlow: a production-quality whole exome sequencing pipeline. AIIMS New Delhi, 2025.