Python package wrapping over FANTOM and ENCODE labels for cis regulatory regions.

How do I install this package?

As usual, just install it using pip:

pip install crr_labels

Tests Coverage

Because coverage tools sometimes report slightly different results, here are three of them: Coveralls, SonarCloud, and Code Climate.

Usage examples

Currently only FANTOM CAGE data is supported; Roadmap data will be included in the future.

FANTOM

To retrieve the FANTOM enhancers and promoters you can proceed as follows:

from crr_labels import fantom

enhancers, promoters = fantom(
    cell_lines=["HelaS3", "GM12878"],  # list of cell lines to be considered.
    window_size=200,  # window size to use for the various regions.
    genome="hg19",  # genome version to consider. Currently only "hg19" is supported.
    center_enhancers="peak",  # how to center the enhancer window: around the "peak" or the "center" of the region.
    enhancers_threshold=0,  # activation threshold for the enhancers.
    promoters_threshold=5,  # activation threshold for the promoters.
    drop_always_inactive_rows=True,  # whether to drop the rows that are inactive in every chosen cell line.
    nrows=None  # number of rows to read; useful for creating smaller datasets when testing pipelines.
)

The library will download and parse the FANTOM project raw data and return two dataframes, one for enhancers and one for promoters, for the requested cell lines. Consider reading the method docstring for more in-depth information about the method.

The main steps are the following:

  • The raw files of the FANTOM dataset are retrieved from the links specified in the fantom_data.json file.

  • The windows for the enhancers and promoters are expanded or compressed to the given window size. In particular:

    • The enhancers window can either be centered on the region center with the “center” mode or around the “peak” with the “peak” mode.

    • The promoters window extends upstream from the end of the promoter on the positive strand and downstream from the start of the promoter on the negative strand.

  • When multiple experiments are present for a cell line, for instance for “HelaS3”, the activation peaks are averaged.

  • Optionally (and by default) the rows that are always inactive for the chosen cell lines are dropped. You can control this behaviour using the parameter “drop_always_inactive_rows”.
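The steps above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names (center_window, peak_window, promoter_window, label_regions) and the experiment names are hypothetical. It also assumes that coordinates increase along the positive strand, and that a region counts as active when its averaged value is strictly greater than the threshold.

```python
from statistics import mean


def center_window(start, end, window_size):
    """Resize a region to window_size around its midpoint ("center" mode)."""
    midpoint = (start + end) // 2
    new_start = midpoint - window_size // 2
    return new_start, new_start + window_size


def peak_window(peak, window_size):
    """Build a window of window_size around a peak position ("peak" mode)."""
    new_start = peak - window_size // 2
    return new_start, new_start + window_size


def promoter_window(start, end, strand, window_size):
    """Assumed reading: the window ends at the promoter end on the positive
    strand and begins at the promoter start on the negative strand."""
    if strand == "+":
        return end - window_size, end
    return start, start + window_size


def label_regions(raw_rows, experiments, threshold, drop_always_inactive=True):
    """Average the experiments of each cell line, then optionally drop the
    rows that are inactive (value <= threshold) in every cell line."""
    labelled = []
    for row in raw_rows:
        averaged = {
            cell_line: mean(row[name] for name in names)
            for cell_line, names in experiments.items()
        }
        if drop_always_inactive and all(v <= threshold for v in averaged.values()):
            continue
        labelled.append(averaged)
    return labelled


# Hypothetical experiment names: two replicates for HelaS3, one for GM12878.
experiments = {"HelaS3": ["hela_rep1", "hela_rep2"], "GM12878": ["gm_rep1"]}
raw_rows = [
    {"hela_rep1": 4.0, "hela_rep2": 8.0, "gm_rep1": 0.0},
    {"hela_rep1": 0.0, "hela_rep2": 0.0, "gm_rep1": 0.0},
]

print(center_window(1000, 1500, 200))         # (1150, 1350)
print(peak_window(1400, 200))                 # (1300, 1500)
print(promoter_window(1000, 1500, "+", 200))  # (1300, 1500)
print(promoter_window(1000, 1500, "-", 200))  # (1000, 1200)
print(label_regions(raw_rows, experiments, threshold=0))
# [{'HelaS3': 6.0, 'GM12878': 0.0}]
```

The second row is dropped because neither cell line exceeds the threshold, which mirrors the effect described for “drop_always_inactive_rows”.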

Rendered datasets

Some datasets are already available pre-processed. See the table below for more information. If you think that another pre-processed dataset could be useful, let me know and I will render it and add it to the table.

FANTOM

Cell line        200bp      300bp      500bp      1000bp
All cell lines   Download   Download   Download   Download
HelaS3           Download   Download   Download   Download
GM12878          Download   Download   Download   Download
HepG2            Download   Download   Download   Download
K562             Download   Download   Download   Download
A549             Download   Download   Download   Download
MCF7             Download   Download   Download   Download

Future work

In the future, more datasets containing labels for cis-regulatory regions will be added to this project.

Download files

Source Distribution

crr_labels-1.0.0.tar.gz (9.2 kB)
