A ChIP-seq pipeline from raw reads to peaks

Python 3.10 | 3.11 | 3.12

This is the chipseq pipeline from the Sequana project.

Overview:

ChIP-seq pipeline from raw reads to peaks, IDR statistics, and functional annotation

Input:

Paired or single-end FastQ files and a CSV experimental design file

Output:

HTML summary report, narrow/broad peak files, IDR statistics, bigwig tracks, annotation tables, and IGV session file

Status:

Production

Citation:

Cokelaer et al. (2017), 'Sequana': a Set of Snakemake NGS pipelines, Journal of Open Source Software, 2(16), 352, https://doi.org/10.21105/joss.00352

[Figures: pipeline DAG (sequana_pipelines/chipseq/dag.png) and complete DAG (sequana_pipelines/chipseq/dag_complete.png)]

Installation

pip install sequana_chipseq --upgrade

You will also need the third-party tools listed under Requirements below.

Quick Start

1. Prepare a design file design.csv:

type,condition,replicat,sample_name
IP,EXP1,1,IP_EXP1_rep1
IP,EXP1,2,IP_EXP1_rep2
Input,EXP1,1,Input_EXP1
  • type must be IP (immunoprecipitated) or Input (control).

  • sample_name must match the prefix of the corresponding FastQ file (e.g. IP_EXP1_rep1 matches IP_EXP1_rep1_R1_.fastq.gz).

  • At least two IP replicates per condition are required for IDR analysis.
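The three rules above can be checked before launching anything. The following is a minimal sketch, not part of the pipeline: `check_design` and its `fastq_names` argument are hypothetical names chosen for illustration, and the checks only mirror the constraints stated in this section.

```python
import csv
import io
from collections import Counter

ALLOWED_TYPES = {"IP", "Input"}
REQUIRED_COLUMNS = ["type", "condition", "replicat", "sample_name"]


def check_design(handle, fastq_names=()):
    """Return a list of human-readable problems found in a design file.

    handle      -- open text file (or StringIO) containing the CSV
    fastq_names -- optional FastQ file names used to check sample_name prefixes
    """
    problems = []
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    ip_per_condition = Counter()
    for row in reader:
        name, kind = row["sample_name"], row["type"]
        if kind not in ALLOWED_TYPES:
            problems.append(f"{name}: type must be IP or Input, got {kind!r}")
        elif kind == "IP":
            ip_per_condition[row["condition"]] += 1
        # Prefix check is skipped when no FastQ names are supplied.
        if fastq_names and not any(f.startswith(name) for f in fastq_names):
            problems.append(f"{name}: no FastQ file starts with this prefix")

    # Only conditions that have at least one IP sample are examined here.
    for condition, count in sorted(ip_per_condition.items()):
        if count < 2:
            problems.append(f"{condition}: only {count} IP replicate(s); IDR needs at least two")
    return problems


DESIGN = """type,condition,replicat,sample_name
IP,EXP1,1,IP_EXP1_rep1
IP,EXP1,2,IP_EXP1_rep2
Input,EXP1,1,Input_EXP1
"""

print(check_design(io.StringIO(DESIGN)))  # prints []
```

An empty list means the design passes these checks; the pipeline itself may apply further validation that this sketch does not cover.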

2. Prepare a genome directory named after the genome, containing:

  • <name>.fa — reference genome FASTA

  • <name>.gff or <name>.gff3 — gene annotation

Example:

ecoli_MG1655/
├── ecoli_MG1655.fa
└── ecoli_MG1655.gff
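The naming convention can be verified with a few lines of Python. This is an illustrative sketch only: `check_genome_directory` is a hypothetical helper, and it assumes nothing beyond the layout described above (files named after the directory, annotation ending in .gff or .gff3).

```python
import tempfile
from pathlib import Path


def check_genome_directory(path):
    """Return the FASTA and annotation paths, or raise FileNotFoundError."""
    directory = Path(path)
    name = directory.name  # files are expected to share the directory name
    fasta = directory / f"{name}.fa"
    if not fasta.is_file():
        raise FileNotFoundError(f"expected {fasta}")
    for suffix in (".gff", ".gff3"):
        annotation = directory / f"{name}{suffix}"
        if annotation.is_file():
            return fasta, annotation
    raise FileNotFoundError(f"expected {name}.gff or {name}.gff3 in {directory}")


# Demonstration on a throwaway directory mirroring the example layout.
with tempfile.TemporaryDirectory() as tmp:
    genome = Path(tmp) / "ecoli_MG1655"
    genome.mkdir()
    (genome / "ecoli_MG1655.fa").write_text(">chr\nACGT\n")
    (genome / "ecoli_MG1655.gff").write_text("##gff-version 3\n")
    fasta, annotation = check_genome_directory(genome)
    print(fasta.name, annotation.name)
```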

3. Set up the pipeline:

sequana_chipseq \
    --input-directory DATAPATH \
    --genome-directory /path/to/ecoli_MG1655 \
    --design-file design.csv

4. Run the pipeline:

cd chipseq
sh chipseq.sh

Usage

sequana_chipseq --help

Key pipeline-specific options:

--genome-directory

Path to the genome directory (must contain <name>.fa and either <name>.gff or <name>.gff3).

--design-file

CSV experimental design file (see Quick Start above).

--aligner-choice

Aligner to use. Currently only bowtie2 is supported.

--blacklist-file

BED3 file of genomic regions to exclude from analysis (tab-separated: chromosome, start, end).

--genome-size

Effective genome size for macs3 peak calling. Automatically computed from the FASTA file if not provided; override with a plain integer.

--do-fingerprints

Enable plotFingerprint QC to assess ChIP enrichment quality.
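For the file passed to --blacklist-file, a quick format check can catch space-separated or inverted coordinates early. The sketch below is an assumption about what a well-formed BED3 blacklist looks like (exactly three tab-separated fields, integer coordinates, non-empty regions); `parse_bed3` is a hypothetical helper and the pipeline may be more or less permissive.

```python
def parse_bed3(lines):
    """Yield (chromosome, start, end) tuples from BED3 lines.

    Raises ValueError on malformed lines. Blank lines and lines starting
    with '#', 'track' or 'browser' are skipped.
    """
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 tab-separated fields")
        # int() itself raises ValueError on non-numeric coordinates.
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Empty regions are rejected here because they would exclude nothing.
        if start < 0 or end <= start:
            raise ValueError(f"line {number}: need 0 <= start < end")
        yield chrom, start, end


regions = list(parse_bed3(["chr1\t100\t200\n", "chr1\t5000\t5600\n"]))
print(regions)  # [('chr1', 100, 200), ('chr1', 5000, 5600)]
```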

Run on a SLURM cluster:

cd chipseq
sbatch chipseq.sh

Or drive Snakemake directly:

snakemake -s chipseq.rules --cores 4 --stats stats.txt

Usage with Apptainer

Run every tool inside pre-built containers — no local tool installation needed:

sequana_chipseq \
    --input-directory DATAPATH \
    --genome-directory /path/to/genome \
    --design-file design.csv \
    --use-apptainer

Store images in a shared location to avoid re-downloading:

sequana_chipseq ... --use-apptainer --apptainer-prefix ~/.sequana/apptainers

Then run as usual:

cd chipseq
sh chipseq.sh

Requirements

The following tools must be available. They can be installed via conda/bioconda using the project's environment.yml:

mamba env create -f environment.yml

Required tools:
  • bowtie2 — read alignment

  • fastp — adapter trimming and quality filtering

  • fastqc — per-read quality control

  • samtools — BAM sorting, indexing, and flagstat

  • bedtools — bedGraph generation from BAM files (genomeCoverageBed)

  • ucsc-bedgraphtobigwig — bedGraph to bigWig conversion (bedGraphToBigWig)

  • deeptools — fingerprint QC (plotFingerprint) and multi-sample bigwig summary (multiBigwigSummary)

  • macs3 — narrow and broad peak calling

  • homer — peak annotation (annotatePeaks.pl)

  • idr — Irreproducible Discovery Rate between replicates (installed from sequana/idr fork via pip; the upstream bioconda package is Python 3.10-only)

  • multiqc — aggregated QC report

Pipeline overview

  1. Trimming — fastp removes low-quality reads and adapters.

  2. QC — FastQC on raw and cleaned reads.

  3. Alignment — bowtie2 maps reads to the reference genome.

  4. [Optional] Mark duplicates — Picard marks PCR duplicates.

  5. [Optional] Blacklist removal — bedtools removes artefact-prone regions.

  6. bigwig — per-sample coverage tracks for genome browsers (bedtools genomeCoverageBed → UCSC bedGraphToBigWig); an IGV session file (igv.xml) is generated to preload all tracks.

  7. [Optional] Fingerprints — plotFingerprint QC to assess ChIP enrichment.

  8. Phantom peak — strand cross-correlation analysis (NSC, RSC, Qtag scores).

  9. Peak calling — macs3 detects narrow and broad peaks for each IP vs Input pair.

  10. FRiP — Fraction of Reads in Peaks per sample and comparison.

  11. IDR — Irreproducible Discovery Rate on true replicates, pseudo-replicates, and self-pseudo-replicates.

  12. Annotation — homer annotates peaks relative to genomic features.

  13. MultiQC — aggregated QC across all samples.

  14. HTML report — summary with phantom peaks, FRiP plots, IDR tables, and annotation plots.
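To make the FRiP step (step 10) concrete, here is a minimal sketch of the metric itself: the fraction of reads that overlap at least one called peak. It is not the pipeline's implementation. The function names are hypothetical, intervals are assumed to be half-open (chrom, start, end) tuples, and a read counts as soon as it overlaps a peak by one base, which may differ from the rule the pipeline applies.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_peaks(peaks):
    """Group half-open (chrom, start, end) peaks by chromosome and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in sorted(peaks):
        merged = by_chrom[chrom]
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return by_chrom


def frip(reads, peaks):
    """Fraction of reads overlapping at least one peak by one base or more."""
    merged = merge_peaks(peaks)
    in_peaks = total = 0
    for chrom, start, end in reads:
        total += 1
        intervals = merged.get(chrom, [])
        # Index of the last merged peak that starts before the read ends.
        i = bisect_right(intervals, [end - 1, float("inf")]) - 1
        if i >= 0 and intervals[i][1] > start:
            in_peaks += 1
    return in_peaks / total if total else 0.0


peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
reads = [("chr1", 90, 140), ("chr1", 290, 340), ("chr1", 300, 350), ("chr2", 60, 110)]
print(frip(reads, peaks))  # 0.5
```

Merging the peaks first keeps the lookup to one binary search per read; on real data the reads would come from a BAM file and the peaks from the macs3 output rather than from in-memory lists.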

Configuration

The pipeline ships with a documented configuration file. Key sections:

  • general — aligner choice and genome directory path

  • fastp — trimming options (length, quality, adapters)

  • fastqc — FastQC options and threads

  • bowtie2_mapping / bowtie2_index — mapping options, threads, memory

  • macs3 — peak calling parameters (genome size, bandwidth, q-value, broad cutoff)

  • idr — IDR thresholds, rank metric, number of pseudo-replicates

  • fingerprints — enable/disable and number of bins

  • mark_duplicates — enable/disable PCR duplicate marking

  • remove_blacklist — enable/disable and path to BED blacklist

  • trimming — enable/disable read trimming and choice of trimming tool

  • phantom — use SPP (use_spp: true) instead of the built-in sequana phantom-peak detection

  • igv — enable/disable generation of the IGV session file (igv.xml)

  • multiqc — MultiQC options

Changelog

Version

Description

0.12.0

  • Fix macs3, self_pseudo_replicate_peaks, and pseudo_replicate_peaks rules: macs3 exits non-zero on sparse CI data; added || true + conditional touch so the pipeline continues and downstream rules handle empty peak files gracefully

  • Add container: sequana_tools to all macs3 rules so peak calling runs consistently inside the apptainer container

  • Replace bioconda idr with pip install from sequana/idr fork; fixes CI failures on Python 3.11/3.12 (upstream package is Python 3.10-only due to Cython 3.x incompatibility)

  • Fix plot_FRiP: was iterating over all comparisons inside each rule invocation causing FileNotFoundError in parallel runs; now processes only its own wildcard

  • Fix IDR rules (idr_NT, self_pseudo_replicate_idr, pseudo_replicate_idr): IDR exits non-zero on sparse data; added || true + conditional mv so the pipeline continues and downstream Python rules handle empty results gracefully (empty peak files; Homer returns an empty DataFrame)

  • Fix fastp rule: use input.fastq / output.r1 / output.r2 to match the sequana-wrappers fastp shell interface; split into paired/single-end branches

  • Add log: directives and stderr redirection to rules that were missing them: phantom_align, chrom_sizes, fingerprints, bam_to_bed, bed_to_bigwig, pseudo_replicate_idr

  • Update sequana_tools container to 26.1.14

  • Update CI: Python 3.10/3.11/3.12; actions/checkout@v4

0.11.0

  • Switch to click and new sequana_pipetools

0.10.0

  • Fix design in case of samples that start with the same prefix

  • Include final IDR plots and tables

  • Fix containers and wrappers in the config file

  • Better HTML report

0.9.1

  • Fix requirements and setup.py (remove wrong idr package)

0.9.0

  • Use latest wrappers and apptainer (for rulegraph)

0.8.0

First release.

Contribute & Code of Conduct

To contribute to this project, please take a look at the Contributing Guidelines first. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
