Production-quality Whole Exome Sequencing analysis pipeline
What is it?
ExomeFlow is a Python package that provides a complete, automated Whole Exome Sequencing (WES) analysis workflow — from raw FASTQ files to functionally annotated variants — in a single reproducible CLI command.
It aims to be the standard high-level pipeline for WES analysis in Python, combining GATK best-practice variant calling, hard filtering, and ANNOVAR annotation into one modular, maintainable package. It handles cohort-level processing (multiple samples), checkpointing for resumable runs, structured logging, and parallel execution out of the box.
Table of Contents
- What is it?
- Main Features
- Pipeline Workflow
- Where to get it
- System Requirements
- Python Dependencies
- Quick Start
- Reference Files
- Input Convention
- Output Files
- Documentation
- Getting Help
- License
- Citation
Main Features
Here are the things ExomeFlow does well:
- Automatic sample detection — scans an input directory and detects all paired-end samples from FASTQ filenames; no manifest file required
- Complete GATK best-practice workflow — fastp QC → BWA MEM alignment → coordinate sorting → duplicate marking → BQSR → HaplotypeCaller → hard filtering → ANNOVAR annotation
- Cohort processing — processes any number of samples sequentially or in parallel with --max-workers
- Checkpointing and resume — every completed step is recorded; an interrupted run resumes exactly where it left off without repeating work
- Automatic requirements check — verifies all system tools and Python packages before the pipeline starts, reporting every missing dependency at once
- Structured logging — per-sample log files plus a pipeline-wide log with INFO / WARNING / ERROR / SUCCESS levels
- GATK hard filters — applies GATK best-practice SNP and INDEL hard-filter thresholds and extracts PASS-only variants automatically
- ANNOVAR functional annotation — annotates variants against 8 databases: refGene, ClinVar, gnomAD, dbNSFP, COSMIC, ExAC, avSNP150, and dbscSNV
- Modular architecture — each pipeline step is an independent Python module; easy to extend or modify individual steps without touching the rest
- PyPI installable — pip install exomeflow; no Docker or Nextflow required
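The checkpoint-and-resume behaviour listed above can be pictured with a minimal sketch. It assumes a JSON file that stores the names of completed steps per sample; the `CheckpointStore` class, the `run_steps` helper, and the file layout are hypothetical and are not taken from ExomeFlow's source.

```python
import json
from pathlib import Path


class CheckpointStore:
    """Hypothetical store: records finished steps per sample in a JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier progress if the file exists, otherwise start empty.
        self.done = json.loads(self.path.read_text()) if self.path.exists() else {}

    def is_done(self, sample, step):
        return step in self.done.get(sample, [])

    def mark_done(self, sample, step):
        steps = self.done.setdefault(sample, [])
        if step not in steps:
            steps.append(step)
        # Persist after every step so an interrupted run keeps its progress.
        self.path.write_text(json.dumps(self.done, indent=2))


def run_steps(store, sample, steps):
    """Run (name, callable) pairs in order, skipping steps already recorded."""
    executed = []
    for name, func in steps:
        if store.is_done(sample, name):
            continue
        func()
        store.mark_done(sample, name)
        executed.append(name)
    return executed
```

Re-creating the store from the same file and calling `run_steps` again skips everything that was already recorded, which is the essence of a resumable run.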
Pipeline Workflow
                       Raw FASTQ
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 1   fastp    Quality control & adapter trim       │
│           length ≥ 50 bp · base quality ≥ Q30           │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 2   BWA MEM  Read alignment to hg38               │
│           -Y -K 100000000 · read-group tags set         │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 3   GATK SortSam         Coordinate-sort BAM      │
│  Step 4   samtools             Flagstat alignment QC    │
│  Step 5   GATK MarkDuplicates  PCR duplicate removal    │
│  Step 6   GATK BuildBamIndex   BAI index                │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 7   GATK BQSR  BaseRecalibrator + ApplyBQSR       │
│           Known sites: dbSNP · Mills · known indels     │
│           → recalibrated.bam (IGV-ready)                │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 8   GATK HaplotypeCaller  Variant calling         │
│           Exome intervals + padding · dbSNP annotation  │
└──────────────────────────┬──────────────────────────────┘
                           │
                    ┌──────┴──────┐
                    ▼             ▼
               SNP filters  INDEL filters
                (Step 9)      (Step 10)
                    └──────┬──────┘
                           │ MergeVcfs
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 11  SelectVariants  Extract PASS-only variants    │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 12  ANNOVAR  Functional annotation                │
│           refGene · ClinVar · gnomAD · dbNSFP · COSMIC  │
│           → multianno.vcf + multianno.txt               │
└─────────────────────────────────────────────────────────┘
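To make Steps 9 and 10 more concrete, here is a minimal sketch of how hard-filter expressions could be turned into a `gatk VariantFiltration` command line. The thresholds are the generic starting points from GATK's public hard-filtering guidance; whether ExomeFlow applies exactly these values is an assumption, and the `variant_filtration_cmd` helper and filter names are hypothetical.

```python
# Generic starting thresholds from GATK's hard-filtering guidance.
# Assumption: ExomeFlow's own thresholds may differ from these.
SNP_FILTERS = {
    "QD2": "QD < 2.0",
    "QUAL30": "QUAL < 30.0",
    "SOR3": "SOR > 3.0",
    "FS60": "FS > 60.0",
    "MQ40": "MQ < 40.0",
    "MQRankSum-12.5": "MQRankSum < -12.5",
    "ReadPosRankSum-8": "ReadPosRankSum < -8.0",
}
INDEL_FILTERS = {
    "QD2": "QD < 2.0",
    "QUAL30": "QUAL < 30.0",
    "FS200": "FS > 200.0",
    "ReadPosRankSum-20": "ReadPosRankSum < -20.0",
}


def variant_filtration_cmd(in_vcf, out_vcf, filters):
    """Build an argument list for `gatk VariantFiltration` (hypothetical helper)."""
    cmd = ["gatk", "VariantFiltration", "-V", in_vcf, "-O", out_vcf]
    for name, expression in filters.items():
        # Each expression is paired with the name written to the FILTER column.
        cmd += ["--filter-name", name, "--filter-expression", expression]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Records that trip none of the expressions are marked PASS, which is what Step 11 then selects.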
Where to get it
The source code is hosted on GitHub at: https://github.com/imrobintomar/exomeflow
Source and wheel distributions for the latest released version are available at the Python Package Index (PyPI).
# PyPI
pip install exomeflow
# Install latest development version from GitHub
pip install git+https://github.com/imrobintomar/exomeflow.git
The list of changes between each release can be found in the Release History.
System Requirements
ExomeFlow calls the following external tools via the command line.
They must be installed separately and available on your PATH.
| Tool | Minimum Version | Install |
|---|---|---|
| BWA | ≥ 0.7.17 | conda install -c bioconda bwa |
| SAMtools | ≥ 1.13 | conda install -c bioconda samtools |
| GATK | ≥ 4.6.0 | Download jar + add to PATH |
| fastp | ≥ 0.20.1 | conda install -c bioconda fastp |
| Perl | ≥ 5.26 | conda install perl |
| ANNOVAR | latest | Register + download |
Run python check_requirements.py to verify that all tools are installed and meet the minimum version requirements before starting the pipeline. This check also runs automatically as Step 0 of every pipeline run.
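The idea of reporting every missing tool at once, rather than stopping at the first failure, can be sketched as follows. This is not the content of check_requirements.py: the `missing_tools` function is hypothetical, it only checks presence on PATH, and version checks are left out. ANNOVAR is omitted because its location is passed explicitly with --annovar-bin.

```python
import shutil

# Executable names for the PATH-based tools in the table above.
REQUIRED_TOOLS = ["bwa", "samtools", "gatk", "fastp", "perl"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return every tool that cannot be found on PATH, not just the first one."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests.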
Python Dependencies
- typer — Builds the CLI interface
- rich — Provides coloured terminal output and structured logging
- pandas — Data handling for variant count summaries
- matplotlib — Variant summary figure generation
See requirements.txt for exact minimum versions.
Quick Start
1. Install
pip install exomeflow
2. Check requirements
python check_requirements.py
3. Prepare FASTQ files
fastq/
├── sample1_1.fastq.gz
├── sample1_2.fastq.gz
├── sample2_1.fastq.gz
└── sample2_2.fastq.gz
4. Run the pipeline
exomeflow run \
--input-dir fastq/ \
--output results/ \
--reference refs/hg38.fa \
--dbsnp refs/dbsnp.vcf.gz \
--mills refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-indels refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
--intervals refs/Illumina_Exome_TargetedRegions_v1.2.hg38.bed \
--annovar-bin /path/to/annovar \
--annovar-db /path/to/annovar/humandb \
--threads 32 \
--max-workers 2
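How a --max-workers setting can bound the number of samples processed at the same time is sketched below. ExomeFlow's actual implementation is not described here, so the `process_cohort` function and its callback are hypothetical. A thread pool is a reasonable fit for this kind of workload because the heavy lifting happens in external processes such as BWA and GATK.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_cohort(samples, process_sample, max_workers=1):
    """Run process_sample on each sample, at most max_workers at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_sample, s): s for s in samples}
        for future in as_completed(futures):
            sample = futures[future]
            try:
                results[sample] = future.result()
            except Exception as exc:
                # Keep the error so one failing sample does not hide the others.
                results[sample] = exc
    return results
```

With `max_workers=1` the samples run one after another; larger values let several samples progress concurrently.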
Reference Files
| File | Description |
|---|---|
| hg38.fa | BWA-indexed reference genome (UCSC / GATK resource bundle) |
| dbsnp.vcf.gz | dbSNP VCF, bgzipped and tabix-indexed |
| Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | Mills gold standard indels |
| Homo_sapiens_assembly38.known_indels.vcf.gz | Known indels for BQSR |
| Exome capture BED | From your capture kit vendor (Illumina / Twist / Agilent) |
| ANNOVAR humandb | hg38 annotation databases |
Download the GATK resource bundle:
gsutil -m cp -r gs://gatk-best-practices/somatic-hg38/ .
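Before a long run it can help to confirm that the reference files carry the indexes the tools expect. The sketch below assumes the usual companion files: the five BWA index files and a .fai index next to the FASTA, a .dict sequence dictionary sharing the FASTA's base name, and a .tbi index next to each bgzipped VCF. The `missing_index_files` function is hypothetical, and your file layout may differ.

```python
from pathlib import Path

# Files written by `bwa index` next to the reference FASTA.
BWA_INDEX_SUFFIXES = [".amb", ".ann", ".bwt", ".pac", ".sa"]


def missing_index_files(reference, vcfs):
    """List expected index files that do not exist (assumed default layout)."""
    ref = Path(reference)
    expected = [Path(str(ref) + suffix) for suffix in BWA_INDEX_SUFFIXES]
    expected.append(Path(str(ref) + ".fai"))      # samtools faidx index
    expected.append(ref.with_suffix(".dict"))     # GATK sequence dictionary
    expected += [Path(str(vcf) + ".tbi") for vcf in vcfs]  # tabix indexes
    return [path for path in expected if not path.exists()]
```

An empty list means every expected companion file was found.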
Input Convention
ExomeFlow automatically detects samples from paired-end FASTQ filenames. Files must follow the pattern:
<sample_id>_1.fastq.gz ← Read 1
<sample_id>_2.fastq.gz ← Read 2
The sample_id can be any string — SRR accession, patient ID, etc.
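The naming convention above is enough to pair files without a manifest. A minimal sketch of such detection follows; the `detect_samples` function is hypothetical and only illustrates the pattern, it is not ExomeFlow's own code. How unpaired files are handled here (silently skipped) is an assumption.

```python
from pathlib import Path

R1_SUFFIX = "_1.fastq.gz"
R2_SUFFIX = "_2.fastq.gz"


def detect_samples(input_dir):
    """Map each sample_id to its (read1, read2) paths; skip unpaired files."""
    input_dir = Path(input_dir)
    samples = {}
    for read1 in sorted(input_dir.glob("*" + R1_SUFFIX)):
        # Everything before the "_1.fastq.gz" suffix is the sample ID.
        sample_id = read1.name[: -len(R1_SUFFIX)]
        read2 = input_dir / (sample_id + R2_SUFFIX)
        if read2.exists():
            samples[sample_id] = (read1, read2)
    return samples
```

For the fastq/ directory shown in Quick Start, this would yield the keys sample1 and sample2.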
Output Files
| File | Description |
|---|---|
| Mapsam/<sample>_recalibrated.bam | Analysis-ready BAM (open in IGV) |
| VCF/<sample>.vcf | Raw HaplotypeCaller output |
| VCF/<sample>_PASS.vcf | PASS-only hard-filtered variants |
| VCF/<sample>.annovar.hg38_multianno.vcf | Annotated VCF |
| VCF/<sample>.annovar.hg38_multianno.txt | Annotated tab-delimited table |
| filtered_fastp/<sample>_fastp.html | fastp QC report |
| Mapsam/<sample>_flagstat.txt | Alignment statistics |
| logs/analysis_<timestamp>.log | Full pipeline log |
| logs/<sample>_<timestamp>.log | Per-sample log |
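As a quick sanity check on the VCF outputs listed above, records can be tallied by their FILTER value with nothing but the standard library. The sketch assumes an uncompressed, tab-delimited VCF such as VCF/<sample>.vcf; the `count_by_filter` function is hypothetical and is not part of ExomeFlow.

```python
from collections import Counter


def count_by_filter(vcf_path):
    """Count records per FILTER value in an uncompressed VCF file."""
    counts = Counter()
    with open(vcf_path) as handle:
        for line in handle:
            # Skip meta-information lines, the column header, and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            counts[fields[6]] += 1  # the seventh VCF column is FILTER
    return counts
```

On a PASS-only file every record should fall under the single key PASS.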
Documentation
Full usage documentation is available in USAGE.md, including:
- Complete CLI option reference
- How to resume interrupted runs
- How to tune parallel processing
- Common errors and fixes
- Quick reference card
Getting Help
For usage questions, please open a GitHub Issue.
Bug reports, feature requests, and general questions are all welcome.
License
Citation
If you use ExomeFlow in your research, please cite:
Robin Tomar. ExomeFlow: A Production-Quality Python Package for Automated Whole Exome Sequencing Analysis. AIIMS New Delhi, 2025. https://pypi.org/project/exomeflow/
Download files
File details
Details for the file exomeflow-1.0.2.tar.gz.
File metadata
- Download URL: exomeflow-1.0.2.tar.gz
- Upload date:
- Size: 24.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 76612fcc6c540481e16c2de4b4efed9b39f883a9f4be0430e8650cfe29b9b903 |
| MD5 | 443854490ae9c4af5db96292069c9334 |
| BLAKE2b-256 | ef962cd89c5587d1928d3c1bd2400c75d719f55507c5b6e3785f6f786b086247 |
File details
Details for the file exomeflow-1.0.2-py3-none-any.whl.
File metadata
- Download URL: exomeflow-1.0.2-py3-none-any.whl
- Upload date:
- Size: 26.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0cdc8868863433f4949dc54d51c3aa09f48089de3299d31f3de4a4ca8ca1410a |
| MD5 | e7f88ea382f0c05b6a8f2df4065b0a12 |
| BLAKE2b-256 | d9af28660611aa37ad7cc376139a9534060997014e2bf7e2cbb484d2eddfcaa6 |