scanRBP: RNA-protein binding toolkit

scanRBP loads RNA-protein binding motif PWMs (position weight matrices), computes log-odds scores for all loaded RBPs (RNA-binding proteins) across a given genomic sequence, and draws a heatmap of the scores.

The scores can be described as follows (quoted from the Biopython docs):

Here we can see positive values for symbols more frequent in the motif than in the background and negative for symbols more frequent in the background. 0.0 means that it's equally likely to see a symbol in the background and in the motif.

Using the background distribution and PWM with pseudo-counts added, it's easy to compute the log-odds ratios, telling us what are the log odds of a particular symbol to be coming from a motif against the background.

For more information, see the Biopython docs.
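
To make the log-odds idea concrete, here is a minimal sketch in plain Python. It is not scanRBP's or Biopython's implementation: the function names `log_odds_matrix` and `scan`, the pseudocount of 0.5, the uniform background, and the 3-position motif are all assumptions chosen for illustration.

```python
import math

ALPHABET = "ACGU"

def log_odds_matrix(counts, background=None, pseudocount=0.5):
    """Convert per-position symbol counts into log2-odds scores against a background."""
    if background is None:
        # Assumption: uniform background, 0.25 per symbol
        background = {s: 0.25 for s in ALPHABET}
    pssm = []
    for column in counts:
        # Add the pseudocount to every symbol before normalising to probabilities
        total = sum(column[s] + pseudocount for s in ALPHABET)
        pssm.append({
            s: math.log2(((column[s] + pseudocount) / total) / background[s])
            for s in ALPHABET
        })
    return pssm

def scan(sequence, pssm):
    """Score every window of len(pssm) along the sequence, one score per start position."""
    seq = sequence.upper().replace("T", "U")  # accept DNA or RNA input
    width = len(pssm)
    return [
        sum(pssm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]

# Hypothetical 3-position motif that prefers "UGU"
counts = [
    {"A": 1, "C": 1, "G": 1, "U": 17},
    {"A": 1, "C": 1, "G": 17, "U": 1},
    {"A": 1, "C": 1, "G": 1, "U": 17},
]
pssm = log_odds_matrix(counts)
scores = scan("AATGTCC", pssm)
for start, score in enumerate(scores):
    print(start, round(score, 2))
```

In this sketch the highest score falls on the window that spells UGU (TGT in the DNA input), and a column with equal counts for every symbol scores exactly 0.0, in line with the description above. Symbols outside A, C, G, U/T (such as N) are not handled here.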

Installation

The easiest way to install scanRBP is to simply run:

$ pip install scanRBP

Note that on some systems pip installs executable scripts under ~/.local/bin. If this folder is not in your PATH, running scanRBP on the command line fails with "command not found". To fix this, run export PATH="$PATH:$HOME/.local/bin" (and add the same line to your .profile). Alternatively, install scanRBP inside a virtual environment (for example with virtualenv).

To install scanRBP directly from this repository, clone it into a folder, for example ~/software/scanRBP, and add that folder to $PYTHONPATH (export PYTHONPATH=$PYTHONPATH:~/software/scanRBP).

Example run

scanRBP quick start:

Usage for single sequence: scanRBP sequence output [options]
     * one sequence provided on the command line, generates output.png/pdf + output.tab

Usage for processing FASTA file: scanRBP filename.fasta [options]
     * one heatmap/matrix will be generated per sequence
     * output name of the files will be sequence ids provided in the fasta file

Options:
     -annotate               Annotate each heatmap cell with the number
     -xlabels                Display sequence (x-labels), default False
     -only_protein TARDBP    Only analyze binding for the specific protein / search by name
     -all_protein TARDBP     Additionally to one motif per protein (for all proteins), also include all motifs (PWMs) for this specific protein (search by name)
                             (note that one protein can have several PWMs)
     -figsize "(10,20)"      Change matplotlib/seaborn figure size for the heatmap, example width=10, height=20
     -heatmap title          Make heatmap (png+pdf) with title
     -output_folder folder   Store all results to the output folder (default: current folder)
     -nonzero                All negative vector values are set to 0, not enabled by default

Examples:

# taking a random sequence, will produce binding scores and a heatmap
# output: example1_PWM.tab # file with log-odds vectors for all proteins for the given command line sequence
# output: example1.png/pdf # heatmap image with clustering of protein binding vectors
./scanRBP AAAGCGGCGACTTATTATATCCCCATATATTATATCTTCTTCTCTTATATATAAACCAGAGATAGATGTGTGTGGTGG example1 -heatmap example1

# instead of taking one single sequence, the input can be a fasta file with multiple sequences
./scanRBP data.fasta
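
The FASTA mode described above produces one result set per record, named after the sequence id. The sketch below shows one way such per-record iteration could look in plain Python. The helper `read_fasta`, the example records, and the choice to take the first whitespace-delimited header token as the id are assumptions for illustration, not scanRBP's actual code.

```python
def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the id is the first whitespace-delimited token of the header
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)

# Hypothetical two-record input
example = """>seq1 first example
AAAGCGGCGACTT
ATTATATCC
>seq2
CCAGAGATAGATG
""".splitlines()

records = list(read_fasta(example))
for seq_id, sequence in records:
    # Each record would be scored separately and its outputs named after seq_id
    print(seq_id, len(sequence))
```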

Motif PWM database

scanRBP uses the mCross database of 112 RBPs from the paper:

Feng H, Bao S et al.
Modeling RNA-Binding Protein Specificity In Vivo by Precisely Registering Protein-RNA Crosslink Sites
Molecular Cell, 2019

To download the PWMs:

wget http://zhanglab.c2b2.columbia.edu/data/mCross/eCLIP_mCross_PWM.tgz --no-check-certificate
tar xfz eCLIP_mCross_PWM.tgz

Additional PWM datasets

Article: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02913-0

Supplementary file: https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-023-02913-0/MediaObjects/13059_2023_2913_MOESM6_ESM.txt

CLIP datasets

The list of bedGraph files comes from:

https://www.encodeproject.org/metadata/?status=released&internal_tags=ENCORE&assay_title=eCLIP&biosample_ontology.term_name=K562&biosample_ontology.term_name=HepG2&type=Experiment&files.analyses.status=released&files.preferred_default=true

Any other bedGraph file of called CLIP peaks for a specific genome can be added to the database.
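
For readers unfamiliar with the format, a bedGraph data line holds four whitespace-separated fields: chromosome, start (0-based), end (exclusive), and a numeric value. The sketch below parses such lines in plain Python. The helper `parse_bedgraph` and the example intervals are hypothetical, and how scanRBP itself reads or stores these files is not shown here.

```python
def parse_bedgraph(lines):
    """Yield (chrom, start, end, value) tuples, skipping track, browser, and comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)

# Hypothetical peak intervals
example = [
    "track type=bedGraph name=example",
    "chr1\t1000\t1050\t3.5",
    "chr1\t2000\t2040\t1.25",
]

peaks = list(parse_bedgraph(example))
for chrom, start, end, value in peaks:
    print(chrom, start, end, value)
```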

Gene data

Gene metadata (names, aliases) downloaded from https://www.ncbi.nlm.nih.gov/gene/?term=human[organism]
