Production-quality Whole Exome Sequencing analysis pipeline

ExomeFlow: A Production-Quality Python WES Analysis Toolkit


What is it?

ExomeFlow is a Python package that provides a complete, automated Whole Exome Sequencing (WES) analysis workflow — from raw FASTQ files to functionally annotated variants — in a single reproducible CLI command.

It aims to be the standard high-level pipeline for WES analysis in Python, combining GATK best-practice variant calling, hard filtering, and ANNOVAR annotation into one modular, maintainable package. It handles cohort-level processing (multiple samples), checkpointing for resumable runs, structured logging, and parallel execution out of the box.

Table of Contents

What is it?
Main Features
Pipeline Workflow
Where to get it
System Requirements
Python Dependencies
Quick Start
Reference Files
Input Convention
Output Files
Documentation
Getting Help
License
Citation

Main Features

Here are the things ExomeFlow does well:

Automatic sample detection — scans an input directory and detects all paired-end samples from FASTQ filenames; no manifest file required
Complete GATK best-practice workflow — fastp QC → BWA MEM alignment → coordinate sorting → duplicate marking → BQSR → HaplotypeCaller → hard filtering → ANNOVAR annotation
Cohort processing — processes any number of samples sequentially or in parallel with --max-workers
Checkpointing and resume — every completed step is recorded; an interrupted run resumes exactly where it left off without repeating work
Automatic requirements check — verifies all system tools and Python packages before the pipeline starts, reporting every missing dependency at once
Structured logging — per-sample log files plus a pipeline-wide log with INFO / WARNING / ERROR / SUCCESS levels
GATK hard filters — applies GATK best-practice SNP and INDEL hard-filter thresholds and extracts PASS-only variants automatically
ANNOVAR functional annotation — annotates variants against 8 databases: refGene, ClinVar, gnomAD, dbNSFP, COSMIC, ExAC, avSNP150, and dbscSNV
Modular architecture — each pipeline step is an independent Python module; easy to extend or modify individual steps without touching the rest
PyPI installable — pip install exomeflow; no Docker or Nextflow required
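
The checkpointing behaviour listed above can be pictured with a short sketch. This is not ExomeFlow's implementation: the `Checkpoint` class, the `run_steps` helper, and the JSON file layout are all hypothetical, chosen only to show how recording each completed step lets a rerun skip finished work.

```python
import json
from pathlib import Path


class Checkpoint:
    """Record completed steps per sample in a JSON file (hypothetical format)."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier progress if the file exists, otherwise start empty.
        self.done = json.loads(self.path.read_text()) if self.path.exists() else {}

    def is_done(self, sample, step):
        return step in self.done.get(sample, [])

    def mark_done(self, sample, step):
        self.done.setdefault(sample, []).append(step)
        # Write to a temporary file first, then rename it, so an interruption
        # mid-write cannot leave a truncated checkpoint behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.done, indent=2))
        tmp.replace(self.path)


def run_steps(sample, steps, checkpoint):
    """Run each (name, function) step unless the checkpoint says it finished."""
    executed = []
    for name, func in steps:
        if checkpoint.is_done(sample, name):
            continue  # completed in an earlier run
        func(sample)
        checkpoint.mark_done(sample, name)
        executed.append(name)
    return executed
```

A step is marked only after its function returns, so a step that was interrupted midway is repeated on the next run rather than skipped.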

Pipeline Workflow

Raw FASTQ
    │
    ▼
┌─────────────────────────────────────────────────────────┐
│  Step 1   fastp         Quality control & adapter trim   │
│           length ≥ 50 bp · base quality ≥ Q30            │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 2   BWA MEM        Read alignment to hg38          │
│           -Y -K 100000000 · read-group tags set          │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 3   GATK SortSam   Coordinate-sort BAM             │
│  Step 4   samtools       Flagstat alignment QC           │
│  Step 5   GATK MarkDuplicates   PCR duplicate marking    │
│  Step 6   GATK BuildBamIndex    BAI index                │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 7   GATK BQSR      BaseRecalibrator + ApplyBQSR    │
│           Known sites: dbSNP · Mills · known indels      │
│           → recalibrated.bam  (IGV-ready)                │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 8   GATK HaplotypeCaller   Variant calling         │
│           Exome intervals + padding · dbSNP annotation   │
└──────────────────────────┬──────────────────────────────┘
                           │
                    ┌──────┴──────┐
                    ▼             ▼
               SNP filters   INDEL filters
               (Step 9)       (Step 10)
                    └──────┬──────┘
                           │  MergeVcfs
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 11  SelectVariants  Extract PASS-only variants     │
└──────────────────────────┬──────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Step 12  ANNOVAR         Functional annotation          │
│           refGene · ClinVar · gnomAD · dbNSFP · COSMIC   │
│           → multianno.vcf  +  multianno.txt              │
└─────────────────────────────────────────────────────────┘
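
Steps 9 to 11 of the diagram can be illustrated with the generic hard-filter thresholds that GATK publishes for germline short variants. The sketch below is an assumption about what such filtering looks like, not ExomeFlow's code: the threshold tables and the `failed_filters` helper are illustrative, and the cut-offs ExomeFlow actually applies may differ.

```python
import operator

# (annotation, failure test, threshold), following GATK's generic
# hard-filtering recommendations. Assumption: ExomeFlow may use other values.
SNP_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("SOR", operator.gt, 3.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]
INDEL_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 200.0),
    ("ReadPosRankSum", operator.lt, -20.0),
]


def is_snp(ref, alt):
    """Simplified test: REF and every ALT allele are a single base."""
    return len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))


def failed_filters(ref, alt, info):
    """Return the annotations on which a variant fails its hard filters."""
    rules = SNP_FILTERS if is_snp(ref, alt) else INDEL_FILTERS
    failed = []
    for name, fails, threshold in rules:
        value = info.get(name)
        # A missing annotation is not treated as a failure in this sketch.
        if value is not None and fails(value, threshold):
            failed.append(name)
    return failed
```

A variant whose list comes back empty corresponds to a PASS record, which is what Step 11 extracts.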

Where to get it

The source code is hosted on GitHub at: https://github.com/imrobintomar/exomeflow

Binary installers for the latest released version are available at the Python Package Index (PyPI).

# PyPI
pip install exomeflow

# Install latest development version from GitHub
pip install git+https://github.com/imrobintomar/exomeflow.git

The list of changes between each release can be found in the Release History.

System Requirements

ExomeFlow calls the following external tools via the command line. They must be installed separately and available on your PATH.

Tool	Minimum Version	Install
BWA	≥ 0.7.17	`conda install -c bioconda bwa`
SAMtools	≥ 1.13	`conda install -c bioconda samtools`
GATK	≥ 4.6.0	Download jar + add to `PATH`
fastp	≥ 0.20.1	`conda install -c bioconda fastp`
Perl	≥ 5.26	`conda install perl`
ANNOVAR	latest	Register + download

Run python check_requirements.py to verify all tools are installed and meet minimum version requirements before starting the pipeline. This check also runs automatically as Step 0 of every pipeline run.
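
For readers who want to see what such a check involves, here is a minimal sketch. It does not reproduce check_requirements.py: the executable names in `REQUIRED_TOOLS` and the helper functions are assumptions made for illustration. It shows the two ideas from the paragraph above, collecting every missing tool before reporting and comparing a detected version against a minimum.

```python
import re
import shutil

# Assumed executable names; ANNOVAR is a set of Perl scripts and would need
# a separate path check.
REQUIRED_TOOLS = ["bwa", "samtools", "gatk", "fastp", "perl"]


def find_missing(tools, which=shutil.which):
    """Return every tool that cannot be found on PATH, not just the first."""
    return [tool for tool in tools if which(tool) is None]


def version_tuple(text):
    """Extract the first dotted version number from a tool's version output."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    return tuple(int(part) for part in match.group().split(".")) if match else None


def meets_minimum(found, minimum):
    """Compare version tuples after padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(found) >= pad(minimum)


def check_or_exit(tools=REQUIRED_TOOLS):
    """Stop with one message that names all missing tools."""
    missing = find_missing(tools)
    if missing:
        raise SystemExit("Missing tools: " + ", ".join(missing))
```

Padding matters because Python compares `(4, 6)` as smaller than `(4, 6, 0)`, which would otherwise reject a tool reporting a two-part version.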

Python Dependencies

typer — Builds the CLI interface
rich — Provides coloured terminal output and structured logging
pandas — Data handling for variant count summaries
matplotlib — Variant summary figure generation

See requirements.txt for exact minimum versions.

Quick Start

1. Install

pip install exomeflow

2. Check requirements

python check_requirements.py

3. Prepare FASTQ files

fastq/
├── sample1_1.fastq.gz
├── sample1_2.fastq.gz
├── sample2_1.fastq.gz
└── sample2_2.fastq.gz

4. Run the pipeline

exomeflow run \
  --input-dir    fastq/ \
  --output       results/ \
  --reference    refs/hg38.fa \
  --dbsnp        refs/dbsnp.vcf.gz \
  --mills        refs/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
  --known-indels refs/Homo_sapiens_assembly38.known_indels.vcf.gz \
  --intervals    refs/Illumina_Exome_TargetedRegions_v1.2.hg38.bed \
  --annovar-bin  /path/to/annovar \
  --annovar-db   /path/to/annovar/humandb \
  --threads      32 \
  --max-workers  2
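
The effect of --max-workers can be sketched with the standard library. This is an illustration only: `process_cohort` and its `process_sample` argument are hypothetical names, and whether --threads applies per sample (so that up to threads × max-workers cores are in use at once) is an assumption to confirm in USAGE.md.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_cohort(samples, process_sample, max_workers=1):
    """Run process_sample(sample) for each sample, max_workers at a time.

    Returns (results, failures) so one failing sample does not hide the rest.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_sample, s): s for s in samples}
        for future in as_completed(futures):
            sample = futures[future]
            try:
                results[sample] = future.result()
            except Exception as exc:
                # Keep going; the error is reported alongside the successes.
                failures[sample] = exc
    return results, failures
```

Threads are enough here because the heavy work would happen in external processes such as BWA and GATK; the Python side mostly waits for them.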

Reference Files

File	Description
`hg38.fa`	BWA-indexed reference genome (UCSC / GATK resource bundle); GATK also needs the `.fai` index and `.dict` sequence dictionary
`dbsnp.vcf.gz`	dbSNP VCF — bgzipped + tabix-indexed
`Mills_and_1000G_gold_standard.indels.hg38.vcf.gz`	Mills gold standard indels
`Homo_sapiens_assembly38.known_indels.vcf.gz`	Known indels for BQSR
Exome capture BED	From your capture kit vendor (Illumina / Twist / Agilent)
ANNOVAR humandb	hg38 annotation databases

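Download the known-sites files from the GATK resource bundle (requires gsutil from the Google Cloud SDK). The germline hg38 resources are in the Broad references bucket; the somatic-hg38 bucket does not contain the Mills or known-indels files. The trailing * also fetches each file's index:

gsutil -m cp \
  "gs://gcp-public-data--broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz*" \
  "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz*" \
  .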
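
Before a long run it can help to confirm that each reference file has its companion index files. The sketch below is a suggestion rather than part of ExomeFlow: `missing_files` is a hypothetical helper, and it assumes the usual naming produced by `bwa index`, `samtools faidx`, `gatk CreateSequenceDictionary`, and `tabix`.

```python
from pathlib import Path

# Files written by `bwa index`, appended to the full FASTA file name.
BWA_INDEX_SUFFIXES = [".amb", ".ann", ".bwt", ".pac", ".sa"]


def expected_files(fasta, vcfs):
    """List the reference files plus the index files expected next to them."""
    fasta = Path(fasta)
    files = [
        fasta,
        Path(str(fasta) + ".fai"),   # samtools faidx
        fasta.with_suffix(".dict"),  # GATK sequence dictionary, e.g. hg38.dict
    ]
    files += [Path(str(fasta) + suffix) for suffix in BWA_INDEX_SUFFIXES]
    for vcf in vcfs:
        files += [Path(vcf), Path(str(vcf) + ".tbi")]  # tabix index
    return files


def missing_files(fasta, vcfs):
    """Return the expected paths that do not exist on disk."""
    return [str(p) for p in expected_files(fasta, vcfs) if not p.exists()]
```

For example, `missing_files("refs/hg38.fa", ["refs/dbsnp.vcf.gz"])` returns an empty list only when the FASTA, its three kinds of index, the VCF, and its `.tbi` file are all present.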

Input Convention

ExomeFlow automatically detects samples from paired-end FASTQ filenames. Files must follow the pattern:

<sample_id>_1.fastq.gz   ← Read 1
<sample_id>_2.fastq.gz   ← Read 2

The sample_id can be any string — SRR accession, patient ID, etc.
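
The naming rule above can be expressed as a regular expression. The following is a minimal sketch of pair detection under that rule, not ExomeFlow's own code; `PAIR_PATTERN` and `detect_samples` are hypothetical names, and skipping samples that lack one mate is a choice made here for illustration.

```python
import re
from pathlib import Path

# <sample_id>_1.fastq.gz or <sample_id>_2.fastq.gz; the greedy sample group
# lets IDs that themselves contain underscores match correctly.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")


def detect_samples(input_dir):
    """Map sample_id -> (read1_path, read2_path) for complete pairs only."""
    found = {}
    for path in sorted(Path(input_dir).iterdir()):
        match = PAIR_PATTERN.match(path.name)
        if match:
            found.setdefault(match["sample"], {})[match["read"]] = path
    return {
        sample: (reads["1"], reads["2"])
        for sample, reads in found.items()
        if "1" in reads and "2" in reads
    }
```

With the Quick Start layout, `detect_samples("fastq/")` would yield the keys `sample1` and `sample2`, each paired with its two read files.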

Output Files

File	Description
`Mapsam/<sample>_recalibrated.bam`	Analysis-ready BAM — open in IGV
`VCF/<sample>.vcf`	Raw HaplotypeCaller output
`VCF/<sample>_PASS.vcf`	PASS-only hard-filtered variants
`VCF/<sample>.annovar.hg38_multianno.vcf`	Annotated VCF
`VCF/<sample>.annovar.hg38_multianno.txt`	Annotated tab-delimited table
`filtered_fastp/<sample>_fastp.html`	fastp QC report
`Mapsam/<sample>_flagstat.txt`	Alignment statistics
`logs/analysis_<timestamp>.log`	Full pipeline log
`logs/<sample>_<timestamp>.log`	Per-sample log
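
As a quick sanity check on the VCF outputs, the records can be tallied with a few lines of Python. This sketch is independent of ExomeFlow's own summaries: `count_variants` is a hypothetical helper that relies only on the standard VCF column order (REF, ALT, and FILTER in columns 4, 5, and 7).

```python
import gzip
from collections import Counter


def count_variants(vcf_path):
    """Count records per FILTER value and per simple variant class."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    filters, classes = Counter(), Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            fields = line.rstrip("\n").split("\t")
            ref, alt, flt = fields[3], fields[4], fields[6]
            filters[flt] += 1
            # Simplified classification: single-base REF and ALT alleles.
            is_snp = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
            classes["SNP" if is_snp else "INDEL"] += 1
    return filters, classes
```

Run on a `_PASS.vcf` file, the first counter should contain only the `PASS` key; any other key would suggest the PASS extraction did not behave as expected.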

Documentation

Full usage documentation is available in USAGE.md, including:

Complete CLI option reference
How to resume interrupted runs
How to tune parallel processing
Common errors and fixes
Quick reference card

Getting Help

For usage questions, please open a GitHub Issue.

Bug reports, feature requests, and general questions are all welcome.

License

MIT

Citation

If you use ExomeFlow in your research, please cite:

Robin Tomar. ExomeFlow: A Production-Quality Python Package for Automated Whole Exome Sequencing Analysis. AIIMS New Delhi, 2025. https://pypi.org/project/exomeflow/

Built with ❤️ for the bioinformatics community · Robin Tomar, AIIMS New Delhi

Release history

Version	Release date
1.0.7	Apr 16, 2026
1.0.6	Apr 13, 2026
1.0.5	Apr 13, 2026
1.0.4	Apr 10, 2026
1.0.3	Apr 9, 2026
1.0.2	Apr 9, 2026 (this version)
1.0.1	Apr 9, 2026
1.0.0	Apr 9, 2026
