
Production-quality Whole Exome Sequencing analysis pipeline



ExomeFlow: A Production-Quality Python WES Analysis Toolkit


What is it?

ExomeFlow is a Python package that provides a complete, automated Whole Exome Sequencing (WES) analysis workflow — from raw FASTQ files to functionally annotated variants — in a single reproducible CLI command.

It aims to be the standard high-level pipeline for WES analysis in Python, combining GATK best-practice variant calling, hard filtering, and ANNOVAR annotation into one modular, maintainable package. It handles cohort-level processing (multiple samples), checkpointing for resumable runs, structured logging, and parallel execution out of the box.


Main Features

Here are the things ExomeFlow does well:

  • Automatic sample detection — scans an input directory and detects all paired-end samples from FASTQ filenames; no manifest file required
  • Complete GATK best-practice workflow — fastp QC → BWA MEM alignment → coordinate sorting → duplicate marking → BQSR → HaplotypeCaller → hard filtering → ANNOVAR annotation
  • Cohort processing — processes any number of samples sequentially or in parallel with --max-workers
  • Checkpointing and resume — every completed step is recorded; an interrupted run resumes exactly where it left off without repeating work
  • Automatic requirements check — verifies all system tools and Python packages before the pipeline starts, reporting every missing dependency at once
  • Structured logging — per-sample log files plus a pipeline-wide log with INFO / WARNING / ERROR / SUCCESS levels
  • GATK hard filters — applies GATK best-practice SNP and INDEL hard-filter thresholds and extracts PASS-only variants automatically
  • ANNOVAR functional annotation — annotates variants against 8 databases: refGene, ClinVar, gnomAD, dbNSFP, COSMIC, ExAC, avSNP150, and dbscSNV
  • Modular architecture — each pipeline step is an independent Python module; easy to extend or modify individual steps without touching the rest
  • PyPI installable — pip install exomeflow; no Docker or Nextflow required
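The checkpointing bullet above can be illustrated with a small sketch: record completed step names in a JSON file and consult it on resume. The class and file layout here are hypothetical; ExomeFlow's real checkpoint format may differ.

```python
import json
from pathlib import Path

class Checkpoint:
    """Illustrative checkpoint store: one JSON file per sample listing
    the names of completed pipeline steps."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previously completed steps if a checkpoint file exists.
        self.done = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    def mark(self, step):
        # Record a step as completed and persist immediately, so an
        # interruption right after this point loses no progress.
        self.done.add(step)
        self.path.write_text(json.dumps(sorted(self.done)))

    def should_run(self, step):
        return step not in self.done
```

On resume, the driver constructs a `Checkpoint` from the same file and skips every step for which `should_run` returns False, which is how an interrupted run avoids repeating work.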

Pipeline Workflow

Raw FASTQ
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  Step 1   fastp         Quality control & adapter trim   │
│           length ≥ 50 bp · base quality ≥ Q30            │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 2   BWA MEM        Read alignment to hg38          │
│           -Y -K 100000000 · read-group tags set          │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 3   GATK SortSam   Coordinate-sort BAM             │
│  Step 4   samtools       Flagstat alignment QC           │
│  Step 5   GATK MarkDuplicates   PCR duplicate marking    │
│  Step 6   GATK BuildBamIndex    BAI index                │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 7   GATK BQSR      BaseRecalibrator + ApplyBQSR    │
│           Known sites: dbSNP · Mills · known indels      │
│           → recalibrated.bam  (IGV-ready)                │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 8   GATK HaplotypeCaller   Variant calling         │
│           Exome intervals + padding · dbSNP annotation   │
└──────────────────────────┬──────────────────────────────┘
                           │
                    ┌──────┴──────┐
                    ▼             ▼
               SNP filters   INDEL filters
               (Step 9)       (Step 10)
                    └──────┬──────┘
                           │  MergeVcfs
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 11  SelectVariants  Extract PASS-only variants     │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 12  ANNOVAR         Functional annotation          │
│           refGene · ClinVar · gnomAD · dbNSFP · COSMIC   │
│           → multianno.vcf  +  multianno.txt              │
└─────────────────────────────────────────────────────────┘
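Steps 9 and 10 apply hard filters via GATK VariantFiltration. The thresholds below are GATK's widely published best-practice defaults for germline SNPs and INDELs; whether ExomeFlow uses exactly these cutoffs and filter names is an assumption, so check the source for the authoritative values.

```python
# Published GATK best-practice hard-filter expressions for germline
# short variants. ExomeFlow's exact values and filter names may differ.
SNP_FILTERS = {
    "QD2":              "QD < 2.0",
    "FS60":             "FS > 60.0",
    "MQ40":             "MQ < 40.0",
    "MQRankSum-12.5":   "MQRankSum < -12.5",
    "ReadPosRankSum-8": "ReadPosRankSum < -8.0",
    "SOR3":             "SOR > 3.0",
}
INDEL_FILTERS = {
    "QD2":               "QD < 2.0",
    "FS200":             "FS > 200.0",
    "ReadPosRankSum-20": "ReadPosRankSum < -20.0",
}

def filter_args(filters):
    """Build the repeated --filter-name / --filter-expression argument
    pairs passed to `gatk VariantFiltration`."""
    args = []
    for name, expr in filters.items():
        args += ["--filter-name", name, "--filter-expression", expr]
    return args
```

Step 11 then keeps only records whose FILTER column is PASS, i.e. variants that failed none of these expressions.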

Where to get it

The source code is hosted on GitHub at: https://github.com/imrobintomar/exomeflow

Binary installers for the latest released version are available at the Python Package Index (PyPI).

# PyPI
pip install exomeflow

# Install latest development version from GitHub
pip install git+https://github.com/imrobintomar/exomeflow.git

The list of changes between each release can be found in the Release History.


System Requirements

ExomeFlow calls the following external tools via the command line. They must be installed separately and available on your PATH.

Tool       Minimum Version   Install
BWA        ≥ 0.7.17          conda install -c bioconda bwa
SAMtools   ≥ 1.13            conda install -c bioconda samtools
GATK       ≥ 4.6.0           Download jar + add to PATH
fastp      ≥ 0.20.1          conda install -c bioconda fastp
Perl       ≥ 5.26            conda install perl
ANNOVAR    latest            Register + download

Run python check_requirements.py to verify all tools are installed and meet minimum version requirements before starting the pipeline. This check also runs automatically as Step 0 of every pipeline run.
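A stripped-down version of such a check can be built on `shutil.which`. This sketch only tests presence on PATH, not minimum versions, and the tool list is taken from the table above rather than from ExomeFlow's own check_requirements.py.

```python
import shutil

# Tool names assumed from the System Requirements table above.
REQUIRED_TOOLS = ["bwa", "samtools", "gatk", "fastp", "perl"]

def check_tools(tools=REQUIRED_TOOLS):
    """Return the list of tools NOT found on PATH, so every missing
    dependency can be reported at once instead of failing one by one."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = check_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
```

Collecting all failures before exiting matches the behaviour described above: one run of the check reports every missing dependency at once.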


Python Dependencies

  • typer — Builds the CLI interface
  • rich — Provides coloured terminal output and structured logging
  • pandas — Data handling for variant count summaries
  • matplotlib — Variant summary figure generation

See requirements.txt for exact minimum versions.


Quick Start

1. Install

pip install exomeflow

2. Check requirements

python check_requirements.py

3. Prepare FASTQ files

fastq/
├── sample1_1.fastq.gz
├── sample1_2.fastq.gz
├── sample2_1.fastq.gz
└── sample2_2.fastq.gz

4. Run the pipeline

exomeflow run \
  --input-dir    fastq/ \
  --output       results/ \
  --reference    refs/hg38.fa \
  --dbsnp        refs/dbsnp.vcf.gz \
  --mills        refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --known-indels refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
  --intervals    refs/Illumina_Exome_TargetedRegions_v1.2.hg38.bed \
  --annovar-bin  /path/to/annovar \
  --annovar-db   /path/to/annovar/humandb \
  --threads      32 \
  --max-workers  2
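The --threads / --max-workers split in the command above can be pictured as: each worker processes one sample end-to-end while the per-sample tools receive the thread count, so plan for roughly threads × max-workers cores in total. That reading of the two flags is an assumption, and the sketch below is an illustration of the orchestration pattern, not ExomeFlow's internal driver.

```python
from concurrent.futures import ThreadPoolExecutor

def run_sample(sample_id, threads):
    # Placeholder for the 12-step per-sample pipeline; a real driver
    # would invoke fastp, bwa, gatk, etc. with `threads` each.
    return f"{sample_id}: done with {threads} threads"

def run_cohort(samples, threads=32, max_workers=2):
    # At most `max_workers` samples in flight at once; results come
    # back in input order because pool.map preserves ordering.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: run_sample(s, threads), samples))
```

With the example values above (32 threads, 2 workers), up to two samples run concurrently and the run may use up to 64 threads at peak.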

Reference Files

File                                               Description
hg38.fa                                            BWA-indexed reference genome (UCSC / GATK resource bundle)
dbsnp.vcf.gz                                       dbSNP VCF — bgzipped + tabix-indexed
Mills_and_1000G_gold_standard.indels.hg38.vcf.gz   Mills gold-standard indels
Homo_sapiens_assembly38.known_indels.vcf.gz        Known indels for BQSR
Exome capture BED                                  From your capture kit vendor (Illumina / Twist / Agilent)
ANNOVAR humandb                                    hg38 annotation databases

Download the GATK resource bundle:

gsutil -m cp -r gs://gatk-best-practices/somatic-hg38/ .

Input Convention

ExomeFlow automatically detects samples from paired-end FASTQ filenames. Files must follow the pattern:

<sample_id>_1.fastq.gz   ← Read 1
<sample_id>_2.fastq.gz   ← Read 2

The sample_id can be any string — SRR accession, patient ID, etc.
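A minimal sketch of that detection, assuming only the naming convention above: the function and regex are illustrative, not ExomeFlow internals.

```python
import re
from pathlib import Path

# Pair files named <sample_id>_1.fastq.gz / <sample_id>_2.fastq.gz.
PAIR_RE = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")

def detect_samples(fastq_dir):
    """Scan a directory and return {sample_id: (read1_path, read2_path)}
    for every complete paired-end sample found."""
    samples = {}
    for path in Path(fastq_dir).iterdir():
        m = PAIR_RE.match(path.name)
        if m:
            samples.setdefault(m.group("sample"), {})[m.group("read")] = path
    # Keep only complete pairs: both _1 and _2 must be present.
    return {s: (reads["1"], reads["2"])
            for s, reads in samples.items()
            if {"1", "2"} <= reads.keys()}
```

Because the sample ID is captured greedily up to the final `_1`/`_2`, IDs containing underscores (e.g. patient_042) pair correctly; a lone `_1` file with no mate is silently dropped in this sketch, whereas a real pipeline would likely warn about it.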


Output Files

File                                      Description
Mapsam/<sample>_recalibrated.bam          Analysis-ready BAM — open in IGV
VCF/<sample>.vcf                          Raw HaplotypeCaller output
VCF/<sample>_PASS.vcf                     PASS-only hard-filtered variants
VCF/<sample>.annovar.hg38_multianno.vcf   Annotated VCF
VCF/<sample>.annovar.hg38_multianno.txt   Annotated tab-delimited table
filtered_fastp/<sample>_fastp.html        fastp QC report
Mapsam/<sample>_flagstat.txt              Alignment statistics
logs/analysis_<timestamp>.log             Full pipeline log
logs/<sample>_<timestamp>.log             Per-sample log

Documentation

Full usage documentation is available in USAGE.md, including:

  • Complete CLI option reference
  • How to resume interrupted runs
  • How to tune parallel processing
  • Common errors and fixes
  • Quick reference card

Getting Help

For usage questions, please open a GitHub Issue.

Bug reports, feature requests, and general questions are all welcome.


License

MIT


Citation

If you use ExomeFlow in your research, please cite:

Robin Tomar. ExomeFlow: A Production-Quality Python Package for Automated Whole Exome Sequencing Analysis. AIIMS New Delhi, 2025. https://pypi.org/project/exomeflow/



Built for the bioinformatics community · Robin Tomar, AIIMS New Delhi
