Pipeline for using IDR to identify a set of reproducible peaks given an eCLIP dataset with two or three replicates.
Project description
eCLIP-Peak
Installation
- For Van Nostrand Lab:
  The pipeline has already been installed. Activate its environment by issuing the following command:
  source /storage/vannostrand/software/eclip/venv/environment.sh
- For all others:
  - Install Python (3.6+)
  - Install peak (pip install eclip-peak)
  - Install IDR (2.0.3+)
  - Install Perl (5.10.1+) with the following packages:
    - Statistics::Basic (cpanm install Statistics::Basic)
    - Statistics::Distributions (cpanm install Statistics::Distributions)
    - Statistics::R (cpanm install Statistics::R)
Usage
- For Van Nostrand Lab:
  After activating peak's environment, call peak -h to see the detailed usage.
- For all others:
  After successfully installing Python, peak, and Perl (with the required packages), call peak -h in your terminal to see the following detailed usage:
    $ peak -h
    usage: peak [-h]
                [--ip_bams IP_BAMS [IP_BAMS ...]]
                [--input_bams INPUT_BAMS [INPUT_BAMS ...]]
                [--peak_beds PEAK_BEDS [PEAK_BEDS ...]]
                [--read_type READ_TYPE] [--outdir OUTDIR]
                [--species SPECIES]
                [--l2fc L2FC] [--l10p L10P] [--idr IDR]
                [--dry_run] [--cores] [--debug]

    Pipeline for using IDR to identify a set of reproducible peaks given eClIP dataset
    with two or three replicates.

    optional arguments:
      -h, --help            show this help message and exit
      --ip_bams IP_BAMS [IP_BAMS ...]
                            Space separated IP bam files (at least 2 files).
      --input_bams INPUT_BAMS [INPUT_BAMS ...]
                            Space separated INPUT bam files (at least 2 files).
      --peak_beds PEAK_BEDS [PEAK_BEDS ...]
                            Space separated peak bed files (at least 2 files).
      --ids IDS [IDS ...]   Optional space separated short IDs (e.g., S1, S2, S3) for datasets.
      --read_type READ_TYPE
                            Read type of eCLIP experiment, either SE or PE.
      --outdir OUTDIR       Path to output directory.
      --species SPECIES     Short code for species, e.g., hg19, mm10.
      --l2fc L2FC           Only consider peaks at or above this l2fc cutoff, default: 3.
      --l10p L10P           Only consider peaks at or above this l10p cutoff, default: 3.
      --idr IDR             Only consider peaks at or above this idr score cutoff, default: 0.01.
      --cores CORES         Maximum number of CPU cores for parallel processing, default: 1.
      --dry_run             Print out steps and inputs/outputs of each step without
                            actually running the pipeline.
      --debug               Invoke debug mode (only for develop purpose).
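To make the --l2fc and --l10p cutoffs concrete, here is a minimal Python sketch of the kind of filtering the help text describes. It is an illustration only, not the pipeline's own code: the Peak record and both helper functions are hypothetical names, and the comparison simply follows the wording above (keep peaks at or above each cutoff, default 3 for both). The --idr cutoff is left out because the help text does not say how that score is scaled.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    """Hypothetical record for one peak; not the pipeline's data structure."""
    name: str
    l2fc: float  # log2 fold change of IP over INPUT
    l10p: float  # -log10 p-value


def passes_cutoffs(peak: Peak, l2fc: float = 3.0, l10p: float = 3.0) -> bool:
    """Return True when the peak is at or above both cutoffs."""
    return peak.l2fc >= l2fc and peak.l10p >= l10p


def filter_peaks(peaks: List[Peak], l2fc: float = 3.0, l10p: float = 3.0) -> List[Peak]:
    """Keep only the peaks that satisfy both cutoffs, preserving input order."""
    return [p for p in peaks if passes_cutoffs(p, l2fc, l10p)]
```

Raising either cutoff makes the filter stricter; a peak sitting exactly at the cutoff is kept.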
Outline of workflow
- Normalize CLIP IP BAM over INPUT for each replicate
- Peak compression/merging on input-normalized peaks for each replicate
- Entropy calculation on IP and INPUT read probabilities within each peak for each replicate
- Run IDR on peaks ranked by entropy
- Normalize IP BAM over INPUT using new IDR peak regions
- Identify reproducible peaks within IDR regions
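The entropy step in the outline above can be illustrated with a short Python sketch. The formula used here, p_ip * log2(p_ip / p_input) with each probability taken as the reads in the peak divided by the total mapped reads, is an assumption about how such a score is commonly computed for eCLIP peaks; this document does not state the exact formula peak uses. The function names and arguments are hypothetical.

```python
import math


def peak_entropy(ip_reads, input_reads, ip_total, input_total):
    """Relative-entropy-style score for one peak (assumed formula)."""
    if min(ip_reads, input_reads, ip_total, input_total) <= 0:
        raise ValueError("all read counts must be positive")
    p_ip = ip_reads / ip_total            # fraction of IP reads in the peak
    p_input = input_reads / input_total   # fraction of INPUT reads in the peak
    return p_ip * math.log2(p_ip / p_input)


def rank_by_entropy(entropies):
    """Given {peak name: entropy}, return the names sorted highest first."""
    return sorted(entropies, key=entropies.get, reverse=True)
```

A ranking like this is what IDR would then compare between replicates: peaks that rank consistently high in every replicate are the ones considered reproducible.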
Examples
- eCLIP with 2 replicates
  Assuming the eCLIP pipeline has run successfully and generated the following files for species hg19:
    replicate 1:
      IP BAM: ip1.bam
      INPUT BAM: input1.bam
      Peak BED: clip1.peak.clusters.bed
    replicate 2:
      IP BAM: ip2.bam
      INPUT BAM: input2.bam
      Peak BED: clip2.peak.clusters.bed
  The pipeline can then be called like this to identify reproducible peaks:
    peak \
      --ip_bams ip1.bam ip2.bam \
      --input_bams input1.bam input2.bam \
      --peak_beds clip1.peak.clusters.bed clip2.peak.clusters.bed \
      --species hg19
- eCLIP with 3 replicates
  Assuming the eCLIP pipeline has run successfully and generated the following files for species hg19:
    replicate 1:
      IP BAM: ip1.bam
      INPUT BAM: input1.bam
      Peak BED: clip1.peak.clusters.bed
    replicate 2:
      IP BAM: ip2.bam
      INPUT BAM: input2.bam
      Peak BED: clip2.peak.clusters.bed
    replicate 3:
      IP BAM: ip3.bam
      INPUT BAM: input3.bam
      Peak BED: clip3.peak.clusters.bed
  The pipeline can then be called like this to identify reproducible peaks:
    peak \
      --ip_bams ip1.bam ip2.bam ip3.bam \
      --input_bams input1.bam input2.bam input3.bam \
      --peak_beds clip1.peak.clusters.bed clip2.peak.clusters.bed clip3.peak.clusters.bed \
      --species hg19
Note:
- The indentation of the command does not matter; you can write it on a single line.
- The order of the BAM and peak files passed to --ip_bams, --input_bams, and --peak_beds DOES matter; make sure you pass them in a consistent order for these three parameters.
- Three cutoffs (--l2fc, --l10p, and --idr) can be set to fine-tune the peak filtering; see the Usage section for more details.
- If the pipeline fails, check the log to identify the error and make the necessary changes. Re-running the pipeline will skip the successfully processed parts and continue with only the failed and unprocessed parts.
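Because the file order must match across --ip_bams, --input_bams, and --peak_beds, one way to avoid mistakes is to keep each replicate's files together and derive the three argument lists from that. The Python sketch below does this; build_peak_command is a hypothetical helper, not part of peak, and it uses only flags shown in the usage text.

```python
import shlex


def build_peak_command(replicates, species, outdir=None):
    """Build the peak argument list from (ip_bam, input_bam, peak_bed) tuples.

    Keeping one tuple per replicate guarantees the three file lists
    end up in the same replicate order.
    """
    if len(replicates) not in (2, 3):
        raise ValueError("peak expects two or three replicates")
    ip_bams, input_bams, peak_beds = (list(column) for column in zip(*replicates))
    cmd = ["peak",
           "--ip_bams", *ip_bams,
           "--input_bams", *input_bams,
           "--peak_beds", *peak_beds,
           "--species", species]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    return cmd


replicates = [
    ("ip1.bam", "input1.bam", "clip1.peak.clusters.bed"),
    ("ip2.bam", "input2.bam", "clip2.peak.clusters.bed"),
]
# Print a shell-ready, single-line version of the command.
print(" ".join(shlex.quote(arg) for arg in build_peak_command(replicates, "hg19")))
```

The returned list can also be passed directly to subprocess.run, which avoids shell quoting issues with unusual file names.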
Output
The peak pipeline outputs 5 different types of files into the current working directory or into a user-specified output directory (via --outdir):
- *.bed: a 6-column or 9-column BED file that stores peak information.
- *.tsv: tab-separated text file that stores additional information beyond what is in the BED file.
- *.txt: text file that stores the mapped read counts.
- *.out: tab-separated text file generated by IDR.
- *.png: plot generated by IDR.
All output filenames are self-explanatory. Only the basename of each peak BED file (after removal of .peak.clusters.bed) is used as the name of each replicate.
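The naming rule for replicates can be sketched in a few lines of Python. This is an illustration of the rule as described here, not the pipeline's code: replicate_name is a hypothetical helper, and leaving a file name unchanged when it lacks the .peak.clusters.bed suffix is an assumption.

```python
import os

SUFFIX = ".peak.clusters.bed"


def replicate_name(peak_bed):
    """Derive a replicate name from a peak BED path by dropping the directory and suffix."""
    base = os.path.basename(peak_bed)
    if base.endswith(SUFFIX):
        base = base[:-len(SUFFIX)]
    return base
```

With this rule, two peak BED files that share a basename but live in different directories would map to the same replicate name, so giving each replicate a distinct file name is the safer choice.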
The reproducible peaks can be found in *.reproducible.peaks.bed, and additional information can be found in *.reproducible.peaks.custom.tsv. The former is a 6-column BED file; the latter is a tab-separated text file with the following columns, in order:
- IDR region (entire IDR identified reproducible region)
- Peak (reproducible peak region)
- Geometric mean of the l2fc
- Columns of log2 fold change (2 or 3 columns for 2- or 3-replicate experiments, respectively)
- Columns of -log10 p-value (2 or 3 columns for 2- or 3-replicate experiments, respectively)
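The column order above can be read positionally. The Python sketch below splits one row of *.reproducible.peaks.custom.tsv into named parts for a given number of replicates. Several points are assumptions rather than documented behavior: that the IDR region and the peak each occupy a single field, that the file has no header row (skip the first row if yours has one), and that no extra columns are present. The function names are hypothetical.

```python
import csv


def parse_custom_row(fields, n_replicates):
    """Split one TSV row into named parts, following the documented column order."""
    expected = 3 + 2 * n_replicates
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, got {len(fields)}")
    l2fc_end = 3 + n_replicates
    return {
        "idr_region": fields[0],                          # entire IDR region
        "peak": fields[1],                                # reproducible peak region
        "geomean_l2fc": float(fields[2]),
        "l2fc": [float(x) for x in fields[3:l2fc_end]],   # one value per replicate
        "l10p": [float(x) for x in fields[l2fc_end:]],    # one value per replicate
    }


def read_custom_tsv(handle, n_replicates):
    """Parse every non-empty row from an open tab-separated file handle."""
    reader = csv.reader(handle, delimiter="\t")
    return [parse_custom_row(row, n_replicates) for row in reader if row]
```

Checking the column count up front catches the common mistake of passing the wrong replicate count for a given file.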