eCLIP-Peak
Pipeline for using IDR to identify a set of reproducible peaks given an eCLIP dataset with two or three replicates.
Installation
- For Van Nostrand Lab
  The pipeline has already been installed. Activate its environment by issuing the following command:
      source /storage/vannostrand/software/eclip/venv/environment.sh
- For all others:
  - Install Python (3.6+)
  - Install peak (pip install eclip-peak)
  - Install IDR (2.0.3+)
  - Install Perl (5.10.1+) with the following packages:
    - Statistics::Basic (cpanm Statistics::Basic)
    - Statistics::Distributions (cpanm Statistics::Distributions)
    - Statistics::R (cpanm Statistics::R)
Usage
- For Van Nostrand Lab
  After activating peak's environment, call peak -h to see the detailed usage.
- For all others:
  After successfully installing Python, peak, IDR, and Perl (with the required packages), call peak -h in your terminal to see the following detailed usage:
    $ peak -h
    usage: peak [-h]
                [--ip_bams IP_BAMS [IP_BAMS ...]]
                [--input_bams INPUT_BAMS [INPUT_BAMS ...]]
                [--peak_beds PEAK_BEDS [PEAK_BEDS ...]]
                [--read_type READ_TYPE] [--outdir OUTDIR]
                [--species SPECIES]
                [--l2fc L2FC] [--l10p L10P] [--idr IDR]
                [--dry_run] [--cores] [--debug]

    Pipeline for using IDR to identify a set of reproducible peaks given eClIP dataset
    with two or three replicates.

    optional arguments:
      -h, --help            show this help message and exit
      --ip_bams IP_BAMS [IP_BAMS ...]
                            Space separated IP bam files (at least 2 files).
      --input_bams INPUT_BAMS [INPUT_BAMS ...]
                            Space separated INPUT bam files (at least 2 files).
      --peak_beds PEAK_BEDS [PEAK_BEDS ...]
                            Space separated peak bed files (at least 2 files).
      --ids IDS [IDS ...]   Optional space separated short IDs (e.g., S1, S2, S3)
                            for datasets.
      --read_type READ_TYPE
                            Read type of eCLIP experiment, either SE or PE.
      --outdir OUTDIR       Path to output directory.
      --species SPECIES     Short code for species, e.g., hg19, mm10.
      --l2fc L2FC           Only consider peaks at or above this l2fc cutoff, default: 3.
      --l10p L10P           Only consider peaks at or above this l10p cutoff, default: 3.
      --idr IDR             Only consider peaks at or above this idr score cutoff, default: 0.01.
      --cores CORES         Maximum number of CPU cores for parallel processing, default: 1.
      --dry_run             Print out steps and inputs/outputs of each step without
                            actually running the pipeline.
      --debug               Invoke debug mode (only for develop purpose).
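The --l2fc and --l10p options keep only peaks whose values are at or above the given cutoffs (both default to 3). The following is a minimal sketch of that rule, not the pipeline's actual code; the Peak record and the passes_cutoffs helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    """Hypothetical record holding the two values the cutoffs apply to."""
    name: str
    l2fc: float   # log2 fold change of IP over INPUT
    l10p: float   # -log10 p-value


def passes_cutoffs(peak, l2fc_cutoff=3.0, l10p_cutoff=3.0):
    """Return True if the peak is at or above both cutoffs."""
    return peak.l2fc >= l2fc_cutoff and peak.l10p >= l10p_cutoff


# Made-up values, chosen only to exercise the rule.
peaks = [Peak("a", 4.2, 5.0), Peak("b", 2.9, 8.0), Peak("c", 3.0, 3.0)]
kept = [p.name for p in peaks if passes_cutoffs(p)]
print(kept)
```

Peak "c" is kept because the comparison is inclusive ("at or above"), while "b" is dropped for falling just below the l2fc cutoff.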
Outline of workflow
- Normalize CLIP IP BAM over INPUT for each replicate
- Peak compression/merging on input-normalized peaks for each replicate
- Entropy calculation on IP and INPUT read probabilities within each peak for each replicate
- Run IDR on peaks ranked by entropy
- Normalize IP BAM over INPUT using new IDR peak regions
- Identify reproducible peaks within IDR regions
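The entropy step can be pictured with a small sketch. It assumes the relative-entropy form often used in eCLIP analysis, p * log2(p / q), where p and q are the fractions of IP and INPUT reads that fall within the peak. Whether peak computes exactly this quantity is an assumption, and the function and argument names below are hypothetical.

```python
import math


def peak_entropy(ip_reads, ip_total, input_reads, input_total):
    """Relative entropy of one peak: p * log2(p / q).

    p = fraction of all IP reads that fall in the peak
    q = fraction of all INPUT reads that fall in the peak
    Assumed formulation; not taken from the peak source code.
    """
    if min(ip_reads, ip_total, input_reads, input_total) <= 0:
        raise ValueError("all read counts must be positive")
    p = ip_reads / ip_total
    q = input_reads / input_total
    return p * math.log2(p / q)


# Made-up counts: the peak holds 4x more of the IP library than of INPUT.
value = peak_entropy(ip_reads=100, ip_total=100_000,
                     input_reads=25, input_total=100_000)
print(value)
```

Under this formulation, peaks that are both enriched over INPUT and well covered in IP rank highest, which is the ordering handed to IDR in the next step.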
Examples
- eCLIP with 2 replicates
  Assuming the eCLIP pipeline has run successfully and generated the following files for species hg19:
      replicate 1:
          IP BAM: ip1.bam
          INPUT BAM: input1.bam
          Peak BED: clip1.peak.clusters.bed
      replicate 2:
          IP BAM: ip2.bam
          INPUT BAM: input2.bam
          Peak BED: clip2.peak.clusters.bed
  The pipeline can then be called like this to identify reproducible peaks:
      peak \
          --ip_bams ip1.bam ip2.bam \
          --input_bams input1.bam input2.bam \
          --peak_beds clip1.peak.clusters.bed clip2.peak.clusters.bed \
          --species hg19
- eCLIP with 3 replicates
  Assuming the eCLIP pipeline has run successfully and generated the following files for species hg19:
      replicate 1:
          IP BAM: ip1.bam
          INPUT BAM: input1.bam
          Peak BED: clip1.peak.clusters.bed
      replicate 2:
          IP BAM: ip2.bam
          INPUT BAM: input2.bam
          Peak BED: clip2.peak.clusters.bed
      replicate 3:
          IP BAM: ip3.bam
          INPUT BAM: input3.bam
          Peak BED: clip3.peak.clusters.bed
  The pipeline can then be called like this to identify reproducible peaks:
      peak \
          --ip_bams ip1.bam ip2.bam ip3.bam \
          --input_bams input1.bam input2.bam input3.bam \
          --peak_beds clip1.peak.clusters.bed clip2.peak.clusters.bed clip3.peak.clusters.bed \
          --species hg19
Note:
- The indentation of the command does not matter; you can also write it on a single line.
- The order of the BAM and peak files passed to --ip_bams, --input_bams, and --peak_beds DOES matter: make sure you list the replicates in the same order for all three parameters.
- Three cutoffs (--l2fc, --l10p, and --idr) can be set to fine-tune the peak filtering; see the Usage section for more details.
- If the pipeline fails, check the log to identify the error and make the necessary changes. Re-running the pipeline skips the successfully processed parts and continues with only the failed and unprocessed parts.
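One way to keep the three file lists in a consistent order is to describe each replicate once and derive the arguments from that single list. The sketch below only assembles and prints the command and does not run it. The build_peak_command helper is a hypothetical name, and the flags it uses are those shown in the Usage section.

```python
import shlex


def build_peak_command(replicates, species, outdir=None):
    """Build the peak argument list from (ip_bam, input_bam, peak_bed) tuples.

    Deriving all three lists from one sequence guarantees that the
    replicate order is identical for every parameter.
    """
    if not 2 <= len(replicates) <= 3:
        raise ValueError("peak expects two or three replicates")
    ip_bams = [rep[0] for rep in replicates]
    input_bams = [rep[1] for rep in replicates]
    peak_beds = [rep[2] for rep in replicates]
    cmd = ["peak",
           "--ip_bams", *ip_bams,
           "--input_bams", *input_bams,
           "--peak_beds", *peak_beds,
           "--species", species]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    return cmd


replicates = [
    ("ip1.bam", "input1.bam", "clip1.peak.clusters.bed"),
    ("ip2.bam", "input2.bam", "clip2.peak.clusters.bed"),
]
cmd = build_peak_command(replicates, species="hg19")
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list could be passed to subprocess.run(cmd, check=True) once peak is installed and the input files exist.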
Output
The peak pipeline will output 5 different types of files into the current working directory or into a user-specified output directory (via --outdir):
- *.bed: BED file with either 6 or 9 columns that stores peak information.
- *.tsv: tab-separated text file that stores additional information beyond the BED file.
- *.txt: text file that stores the mapped read counts.
- *.out: tab-separated text file generated by IDR.
- *.png: plot generated by IDR.
The output filenames are self-explanatory. Only the basename of each peak bed file (after removal of .peak.clusters.bed) is used to name each replicate.
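The naming rule can be sketched as follows. The replicate_name helper is a hypothetical name, not part of peak, and returning the bare basename when the suffix is absent is an assumption made for this sketch.

```python
import os

SUFFIX = ".peak.clusters.bed"


def replicate_name(peak_bed):
    """Return the basename of a peak bed file with SUFFIX removed."""
    base = os.path.basename(peak_bed)
    if base.endswith(SUFFIX):
        return base[:-len(SUFFIX)]
    # Assumption: fall back to the unchanged basename for other filenames.
    return base


names = [replicate_name(path) for path in
         ["/data/run1/clip1.peak.clusters.bed", "clip2.peak.clusters.bed"]]
print(names)
```

Because only the basename is used, peak bed files with the same name in different directories would map to the same replicate name, so distinct filenames are advisable.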
The reproducible peaks can be found in *.reproducible.peaks.bed, and additional information can be found in *.reproducible.peaks.custom.tsv. While the former is a 6-column BED file, the latter is a tab-separated text file with the following columns, in order:
- IDR region (entire IDR identified reproducible region)
- Peak (reproducible peak region)
- Geomean of the l2fc
- Columns of log2 fold change (2 or 3 columns for 2 or 3 replicates experiment, respectively)
- Columns of -log10 p-value (2 or 3 columns for 2 or 3 replicates experiment, respectively)
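A row of the custom TSV can be split according to the column order above. This sketch assumes that the IDR region and the peak each occupy a single tab-separated field, so that a row has 3 + 2n fields for n replicates, and that rows carry no header; neither assumption is confirmed by the text above. The split_custom_row helper and the sample values are made up for illustration.

```python
def split_custom_row(line, n_replicates):
    """Split one row of *.reproducible.peaks.custom.tsv into named parts."""
    fields = line.rstrip("\n").split("\t")
    expected = 3 + 2 * n_replicates
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    n = n_replicates
    return {
        "idr_region": fields[0],
        "peak": fields[1],
        "geomean_l2fc": float(fields[2]),
        "l2fc": [float(x) for x in fields[3:3 + n]],           # one per replicate
        "l10p": [float(x) for x in fields[3 + n:3 + 2 * n]],   # one per replicate
    }


# Placeholder row for a 2-replicate experiment (values are not real results).
row = "regionA\tpeakA\t4.0\t3.5\t4.5\t6.0\t7.0\n"
parsed = split_custom_row(row, n_replicates=2)
print(parsed["l2fc"], parsed["l10p"])
```

Checking the field count first surfaces a mismatch between the expected and actual number of replicates instead of silently misassigning columns.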