Python package wrapping over FANTOM and Roadmap labels for cis regulatory regions.

How do I install this package?

As usual, just install it using pip:

pip install crr_labels

Tests Coverage

Since coverage tools sometimes report slightly different results, here are three of them:

Coveralls Coverage, SonarCloud Coverage, Code Climate Coverage

Usage examples

Currently we support FANTOM CAGE data and Roadmap; an additional cis-regulatory dataset based on open chromatin data will be added in the future.

FANTOM

To retrieve the FANTOM promoters and enhancers you can proceed as follows:

from crr_labels import fantom

enhancers, promoters = fantom(
    cell_lines=["HelaS3", "GM12878"],  # List of cell lines to be considered.
    window_size=200,  # Window size to use for the various regions.
    genome="hg19",  # Genome version to use. Currently only "hg19" is supported.
    center_enhancers="peak",  # How to center the enhancer window: around the "peak" or the "center" of the region.
    enhancers_threshold=0,  # Activation threshold for the enhancers.
    promoters_threshold=5,  # Activation threshold for the promoters.
    drop_always_inactive_rows=True,  # Whether to drop the rows that are inactive in all the chosen cell lines.
    binarize=True,  # Whether to return the data binary-encoded: zero for inactive, one for active.
    nrows=None  # Number of rows to read; useful for creating smaller datasets when testing pipelines.
)

The library downloads and parses the raw FANTOM project data and returns two dataframes for the requested cell lines. Consider reading the method docstring for more in-depth information about the method.
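To make the meaning of the threshold, binarize and drop_always_inactive_rows parameters more concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the helper name label_activity, the dictionary-based input and the strict "greater than" comparison are all assumptions made for illustration.

```python
from typing import Dict, List


def label_activity(
    scores: Dict[str, List[float]],
    threshold: float,
    binarize: bool = True,
    drop_always_inactive_rows: bool = True,
) -> Dict[str, List[float]]:
    """Hypothetical helper: turn per-cell-line activation scores into labels.

    `scores` maps a cell line name to one score per region, in the same order.
    """
    n_rows = len(next(iter(scores.values())))
    # Assumption: a region is active when its score is strictly above the threshold.
    active = {
        cell_line: [value > threshold for value in values]
        for cell_line, values in scores.items()
    }
    # Keep a row if dropping is disabled or at least one cell line is active there.
    keep = [
        (not drop_always_inactive_rows) or any(active[cl][i] for cl in scores)
        for i in range(n_rows)
    ]
    result = {}
    for cell_line, values in scores.items():
        # Binary labels (0/1) when binarize is set, raw scores otherwise.
        column = [int(a) for a in active[cell_line]] if binarize else list(values)
        result[cell_line] = [v for v, k in zip(column, keep) if k]
    return result


example = {"HelaS3": [0.0, 2.5, 0.0], "GM12878": [0.0, 0.0, 1.0]}
labels = label_activity(example, threshold=0)
print(labels)
```

In this toy input the first region is inactive in both cell lines, so it is dropped and only the last two regions remain.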

The main steps are the following:

  • The raw files of the FANTOM dataset are retrieved from the links specified in the fantom_data.json file.
  • The windows for the enhancers and promoters are expanded or compressed to the given window size. In particular:
    • The enhancer window can be centered either on the region center with the “center” mode or around the peak with the “peak” mode.
    • The promoter window extends upstream from the end of the promoter on the positive strand and downstream from the start of the promoter on the negative strand.
  • When multiple experiments are present for a cell line, for instance for “HelaS3”, the activation peaks are averaged.
  • Optionally (and by default) the rows that are always inactive for the chosen cell lines are dropped. You can control this behaviour using the parameter “drop_always_inactive_rows”.
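The window and averaging steps above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, coordinates are plain integers, and "downstream on the negative strand from the start" is read in genomic coordinates (the window begins at the promoter start and extends towards higher positions).

```python
from statistics import mean
from typing import List, Tuple


def enhancer_window(start: int, end: int, peak: int, window_size: int,
                    center_enhancers: str = "peak") -> Tuple[int, int]:
    """Hypothetical helper: fixed-size window around the peak or the region center."""
    if center_enhancers == "peak":
        middle = peak
    elif center_enhancers == "center":
        middle = (start + end) // 2
    else:
        raise ValueError('center_enhancers must be "peak" or "center"')
    half = window_size // 2
    # The returned window always spans exactly window_size nucleotides.
    return middle - half, middle - half + window_size


def promoter_window(start: int, end: int, strand: str,
                    window_size: int) -> Tuple[int, int]:
    """Hypothetical helper: fixed-size promoter window depending on the strand."""
    if strand == "+":
        # Positive strand: the window ends at the promoter end.
        return end - window_size, end
    if strand == "-":
        # Negative strand: the window begins at the promoter start (assumption).
        return start, start + window_size
    raise ValueError('strand must be "+" or "-"')


def average_experiments(experiments: List[List[float]]) -> List[float]:
    """Average the per-region values of several experiments of one cell line."""
    return [mean(values) for values in zip(*experiments)]


print(enhancer_window(1000, 1400, peak=1100, window_size=200))
print(enhancer_window(1000, 1400, peak=1100, window_size=200, center_enhancers="center"))
print(promoter_window(5000, 5050, "+", window_size=200))
print(promoter_window(5000, 5050, "-", window_size=200))
print(average_experiments([[0.0, 4.0], [2.0, 6.0]]))
```

Whatever the original region length, every returned window has the requested size, which is what "expanded or compressed" refers to.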

Roadmap

To retrieve the Roadmap promoters and enhancers you can proceed as follows:

from crr_labels import roadmap

enhancers, promoters = roadmap(
    cell_lines=["HelaS3", "GM12878"],  # List of cell lines to be considered.
    window_size=200,  # Window size to use for the various regions.
    genome="hg19",  # Genome version to use. Currently only "hg19" is supported.
    states=18,  # Number of states of the model to consider. Currently only 15 and 18 are supported.
    enhancers_labels=("7_Enh", "9_EnhA1", "10_EnhA2"),  # Labels to encode as active enhancers.
    promoters_labels=("1_TssA",),  # Labels to encode as active promoters.
    nrows=None  # Number of rows to read; useful for creating smaller datasets when testing pipelines.
)

Consider reading the method docstring for more in-depth information about the method.
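As a rough illustration of what the enhancers_labels and promoters_labels parameters express, the sketch below encodes a list of chromatin state labels as binary activity labels. The helper encode_states and the toy list of states are hypothetical; only the label names are taken from the call above, and the way the library actually assigns a state to each window is not shown here.

```python
from typing import List, Sequence


def encode_states(states: Sequence[str], active_labels: Sequence[str]) -> List[int]:
    """Hypothetical helper: 1 where the state is one of the active labels, else 0."""
    active = set(active_labels)
    return [int(state in active) for state in states]


# Toy chromatin states for four windows of one cell line (made-up order).
window_states = ["1_TssA", "9_EnhA1", "7_Enh", "1_TssA"]

enhancer_labels = encode_states(window_states, ("7_Enh", "9_EnhA1", "10_EnhA2"))
promoter_labels = encode_states(window_states, ("1_TssA",))
print(enhancer_labels)
print(promoter_labels)
```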

Rendered datasets

The following two datasets have labels for 7 common cell lines (GM12878, HelaS3, HepG2, K562, A549, H1, H9) and for various others that are available in only one of the two datasets.

FANTOM

The following datasets contain data for the cell lines GM12878, HelaS3, HepG2, K562, A549, H1, H9, JURKAT, MCF7, HEK293, Caco2, HL60 and PC3.

Nucleotides window | Genome | Region-centered enhancers | Peak-centered enhancers | Promoters
200 | hg19 | Download | Download | Download
300 | hg19 | Download | Download | Download
500 | hg19 | Download | Download | Download
1000 | hg19 | Download | Download | Download
2000 | hg19 | Download | Download | Download

Roadmap

The following datasets contain data for the cell lines GM12878, HelaS3, HepG2, K562, A549, H1, H9, DND41, HUES48, HUES6, HUES64 and IMR90.

Nucleotides window | Genome | 15-states model enhancers | 15-states model promoters | 18-states model enhancers | 18-states model promoters
200 | hg19 | Download | Download | Download | Download
300 | hg19 | Download | Download | Download | Download
500 | hg19 | Download | Download | Download | Download
1000 | hg19 | Download | Download | Download | Download
2000 | hg19 | Download | Download | Download | Download
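When combining the two rendered datasets, it can help to check which cell lines they share. The short sketch below uses only the cell line names listed above; the variable names are illustrative.

```python
# Cell lines listed for each rendered dataset.
fantom_cell_lines = {
    "GM12878", "HelaS3", "HepG2", "K562", "A549", "H1", "H9",
    "JURKAT", "MCF7", "HEK293", "Caco2", "HL60", "PC3",
}
roadmap_cell_lines = {
    "GM12878", "HelaS3", "HepG2", "K562", "A549", "H1", "H9",
    "DND41", "HUES48", "HUES6", "HUES64", "IMR90",
}

# Cell lines labelled in both datasets, and those exclusive to each one.
common = sorted(fantom_cell_lines & roadmap_cell_lines)
fantom_only = sorted(fantom_cell_lines - roadmap_cell_lines)
roadmap_only = sorted(roadmap_cell_lines - fantom_cell_lines)

print(len(common), common)
print(fantom_only)
print(roadmap_only)
```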


Files for crr-labels, version 1.0.5
Filename: crr_labels-1.0.5.tar.gz (12.8 kB), file type: Source
