Skip to main content

A tool for identifying differentially bound peaks in CLIP/CRAC data

Project description

DBPeaks: Differential RNA-Binding Site Analysis Tool

Contents

Introduction

DBPeaks is a Python-based command-line tool designed for the identification and analysis of differential RNA-binding sites in CLIP/CRAC datasets. It integrates various bioinformatics tools and methods to process sequencing data, identify peaks, and perform statistical analyses to detect significant differences in RNA-binding across conditions. It does so by first analysing the peaks in each individual file and it then looks whether peaks are found in the same regions. These peaks need to have overlapping genome mapping coordinates. All overlapping peaks will then be merged into a single peak interval and for each interval the program will then calculate the total number of reads covering that genomic interval. DESeq2 will then be used to determine if the read counts for that interval is statistically significantly different between sample and control files.

It requires CLIP/CRAC data BAM files as input as well as GTF and genome files for the model organism. Make sure your GTF annotation file does not have any silly formatting mistakes, otherwise the program will not run. Example genome files for yeast are available in this repository.

NOTE! DBPeaks was SPECIFICALLY designed to analyse CLIP/CRAC datasets from two different conditions or by comparing data from WT vs mutant RBPs. It was NOT designed to compare RBP CLIP datasets to control datasets that have substantially lower read counts (i.e. data from untagged strains or no UV cross-linking controls). Should you be stubborn and still decide to use DBPeaks for this purpose, you will get rubbish results!

It is really important that all bam files have a good number of reads and that there is not a huge difference in read depth between the files. This will make DESeq2 much happier and will therefore improve the results.

We have tried many different tools that do similar things. However, we were either not able to get them running on our servers or they were not able to detect clearly differentially bound (DB) peaks in our data. We have not benchmarked DBPeaks to existing tools so we do not yet know how well it performs compared to most popular peak calling methods. All I can say is that on OUR data where we removed an RBP binding site in the genome it performs better than existing tools such as MACS3 and Peakachu. DBPeaks was able to detect loss of binding in that single genomic location. The other tools we tested could not. DBPeaks is, however, slower than most existing tools. This is because it relies on pyCalculateFDRs from the pyCRAC package to call peaks. This script looks for peaks in each individual gene annotated in the genome and takes read coverage of the gene into consideration for this. So this part is rather slow if you have many features annotated in your genome file.

DBPeaks uses multiple CPUs to process the data and has the added advantage that it can also use replicates.

Repo Contents

Features

  • Comprehensive Analysis Pipeline: From reading BAM files to statistical analysis with DESeq2.
  • Parallel Processing: Utilises multiple CPUs to speed up the analysis.
  • Flexible Input Options: Supports various configurations and customisations through command-line options.
  • Integrated Peak Calling and Filtering: Includes functionality for peak detection, filtering based on reproducibility, and adjustment of peak widths.
  • Statistical Analysis: Incorporates DESeq2 for rigorous differential analysis.

System Requirements

  • Python 3.8 or higher
  • Dependencies: pybedtools, numpy, pandas, pydeseq2, pyCRAC (installed automatically)

Installation Guide

Install DBPeaks from PyPI:

pip install DBPeaks

Install DBPeaks from the repository:

git clone https://git.ecdf.ed.ac.uk/sgrannem/dbpeaks.git
cd dbpeaks
pip install .

For development (changes to the source are reflected immediately without reinstalling):

pip install -e .

Running DBPeaks

DBPeaks is run from the command line. Below are some common usage examples.

Basic run with two replicates per condition:

dbpeaks --samples sample_rep1.bam sample_rep2.bam \
        --controls control_rep1.bam control_rep2.bam \
        --gtf annotation.gtf \
        --chromfile chrominfo.txt \
        --jobname WT_vs_mutant

Using more CPUs and a stricter FDR threshold:

dbpeaks --samples sample_rep1.bam sample_rep2.bam sample_rep3.bam \
        --controls control_rep1.bam control_rep2.bam control_rep3.bam \
        --gtf annotation.gtf \
        --chromfile chrominfo.txt \
        --jobname WT_vs_mutant \
        --minfdr 0.01 \
        --cpu 8

Collapsing PCR duplicates (low-complexity libraries):

dbpeaks --samples sample_rep1.bam sample_rep2.bam \
        --controls control_rep1.bam control_rep2.bam \
        --gtf annotation.gtf \
        --chromfile chrominfo.txt \
        --jobname WT_vs_mutant \
        --blocks

Allowing peaks present in only 50% of replicates:

dbpeaks --samples sample_rep1.bam sample_rep2.bam sample_rep3.bam \
        --controls control_rep1.bam control_rep2.bam control_rep3.bam \
        --gtf annotation.gtf \
        --chromfile chrominfo.txt \
        --jobname WT_vs_mutant \
        --rep 50

Command-Line Options

All available options can be viewed with:

dbpeaks --help

File input options:

  • --samples: Paths to the BAM files for the sample group.
  • --controls: Paths to the BAM files for the control group.
  • --gtf: Path to the GTF annotation file.
  • --chromfile: Location of the chromosome info file. This file should have two columns: first column is the names of the chromosomes, second column is the length of the chromosomes.
  • -j / --jobname: A name for the job to organise output files. Default = WT_vs_mutant.

Peak calling settings:

  • -m / --minfdr 0.05: Minimal FDR threshold for filtering interval data. Default is 0.05.

    This setting is used when running pyCalculateFDRs. If you end up getting a lot of peaks in your data, it is recommended to lower this threshold (e.g. to 0.01) to reduce the number of significantly enriched peaks.

  • --padj 0.05: DESeq2 threshold for calling a peak DB. Default is 0.05.

    If you hardly get any DB peaks, then it may be worth slightly adjusting this threshold. However, in this scenario, it may also be the case that your samples just have too much variability. It may then be wise to do a PCA analysis on your data to see if replicates are indeed grouped together.

  • --min 5: Minimal read coverage for a region. Regions with coverage less than this minimum will be ignored.

  • --blocks: Add this flag if you want to consider reads with identical mapping coordinates once, regardless of sequence.

    NOTE! This is a HUGELY important flag! Setting --blocks will remove any 'towers' in your data and collapse them into one single interval. This can completely change the shape and height of the peak and the peak may no longer be detected. However, if you suspect that your library is of low complexity and you see many of these blocks or towers in your genome browser, then I would recommend adding this flag as I have seen that this can improve the reliability of the final DESeq2 analyses.

  • --iterations 100: Number of iterations for randomisation of read coordinates. Default = 100.

    This is important for the peak calling analysis by the pyCalculateFDR.py script.

  • -r / --rep 90: Percentage of replicates in which a peak must be detected. Default = 100.

    For example, if you have three sample and three control bam files and you set -r to 50, then peaks that are present in the sample files but absent in the control files will also be considered. If you set -r to 100, the tool will expect to find overlapping peaks at any given position for ALL samples, so you may miss peaks that were only present in your sample but not in the control.

  • --filter mean: Filter peaks in GTF files by a specific threshold. Options are mean, median, or mean_plus_std (mean plus one standard deviation). Default is no filtering.

    It is recommended to start with no filtering. If you get too many DB peaks, start with --filter mean and then --filter median.

  • --min_peak_width 20: Minimum width of a called peak. Default = 20.

Logging options:

  • -v / --verbose: Print status messages to a log file.

CPU options:

  • --cpu 1: Number of processors to use for the analyses. Default = 1.

Contributing to further improving DBPeaks

Contributions to DBPeaks are welcome! Please fork the repository and submit pull requests with your enhancements. We will also be including some test data on the repository soon!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

DBPeaks was developed to analyse CRAC data for a manuscript that we are about to submit. This will be updated once the paper has been accepted or put on a preprint server.

Contact

For support or to report issues, please contact Sander Granneman at Sander.Granneman@ed.ac.uk, University of Edinburgh.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dbpeaks-0.0.6.tar.gz (20.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dbpeaks-0.0.6-py3-none-any.whl (17.1 kB view details)

Uploaded Python 3

File details

Details for the file dbpeaks-0.0.6.tar.gz.

File metadata

  • Download URL: dbpeaks-0.0.6.tar.gz
  • Upload date:
  • Size: 20.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for dbpeaks-0.0.6.tar.gz
Algorithm Hash digest
SHA256 c058268b84ffefe6976e243659e6542c040fbac9752508e4127655d34650fe85
MD5 44a7335c040de07708487fe4d9ecf7b2
BLAKE2b-256 64844fda097fb82d85d7fa207b4cdf1091b947e321a97a9b197310dbe0774573

See more details on using hashes here.

File details

Details for the file dbpeaks-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: dbpeaks-0.0.6-py3-none-any.whl
  • Upload date:
  • Size: 17.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for dbpeaks-0.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 d47195a9031573163f2f83166c28e1de0f71f9c39dda5fa8dd164d1dca905927
MD5 3f6210ebb2bb1d943e62577d2472a801
BLAKE2b-256 471a145a03b75f3f34f125be1fa693029ebfe7c70489ea554502e8d45e0b3883

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page