
Python package wrapping ENCODE epigenomic data for several reference cell lines.

How do I install this package?

As usual, just install it using pip:

pip install epigenomic_dataset

Tests Coverage

Since different coverage tools can report slightly different results, here are three of them:

Coveralls Coverage SonarCloud Coverage Code Climate Coverage

Preprocessed data for cis-regulatory regions

We have already downloaded the data and computed the maximum window value for each promoter and enhancer region, taking into consideration all the target features listed in the complete table of epigenomes. For the Fantom dataset this covers the cell lines A549, GM12878, H1, HEK293, HepG2, K562 and MCF-7; for the Roadmap dataset it covers A549, GM12878, H1, HepG2 and K562.

The thresholds used to classify the activation of enhancers and promoters in Fantom are the defaults described in the sister pipeline CRR labels, which handles the download and preprocessing of the data from Fantom and Roadmap.

Dataset   Cell line   Promoters (window sizes)   Enhancers (window sizes)
Fantom    A549        200, 1000                  200, 1000
Fantom    GM12878     200, 1000                  200, 1000
Fantom    H1          200, 1000                  200, 1000
Fantom    HEK293      200, 1000                  200, 1000
Fantom    HepG2       200, 1000                  200, 1000
Fantom    K562        200, 1000                  200, 1000
Fantom    MCF-7       200, 1000                  200, 1000
Roadmap   A549        200, 1000                  200, 1000
Roadmap   GM12878     200, 1000                  200, 1000
Roadmap   H1          200, 1000                  200, 1000
Roadmap   HepG2       200, 1000                  200, 1000
Roadmap   K562        200, 1000                  200, 1000

Here are the labels for all the considered cell lines.

Dataset   Promoters (window sizes)   Enhancers (window sizes)
Fantom    200, 1000                  200, 1000
Roadmap   200, 1000                  200, 1000

TODO: align promoters and enhancers in a reference labels dataset.

The complete pipeline used to retrieve the CRR epigenomic data is available here.

Automatic retrieval of preprocessed data

You can automatically retrieve the data as follows:

from epigenomic_dataset import load_epigenomes

X, y = load_epigenomes(
    cell_line = "K562",
    dataset = "fantom",
    regions = "promoters",
    window_size = 200,
    root = "datasets" # Path where to download data
)
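
For example, to retrieve the enhancers of GM12878 from the Roadmap dataset with the 1000 window size and take a quick look at the result. This is a minimal sketch, assuming the Roadmap dataset is selected with the identifier "roadmap" (analogous to "fantom" above) and that the loader returns pandas DataFrames; check the actual identifiers and return types in your installed version:

from epigenomic_dataset import load_epigenomes

X, y = load_epigenomes(
    cell_line = "GM12878",
    dataset = "roadmap",    # assumed identifier, analogous to "fantom" above
    regions = "enhancers",
    window_size = 1000,
    root = "datasets"
)

# Assuming pandas DataFrames: inspect shapes and the first rows
print(X.shape, y.shape)
print(X.head())
print(y.head())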

Pipeline for epigenomic data

The raw data considered here come from this query on the ENCODE project.

You can find the complete table of the available epigenomes here. These datasets were selected to have, at the time of writing (07/02/2020), as few known problems as possible, such as low read resolution.

You can run the pipeline as follows. Suppose you want to extract the epigenomic features for the cell lines HepG2 and H1:

from epigenomic_dataset import build

build(
    bed_path="path/to/my/bed/file.bed",
    cell_lines=["HepG2", "H1"]
)

If you want to specify where to store the files, use:

from epigenomic_dataset import build

build(
    bed_path="path/to/my/bed/file.bed",
    cell_lines=["HepG2", "H1"],
    path="path/to/my/target"
)

By default, the downloaded bigWig files are not deleted. You can choose to delete the files as follows:

from epigenomic_dataset import build

build(
    bed_path="path/to/my/bed/file.bed",  # BED file with the regions to extract
    cell_lines=["HepG2", "H1"],          # Cell lines to process
    path="path/to/my/target",            # Directory where the extracted data are stored
    clear_download=True                  # Delete the downloaded bigWig files once processed
)
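
If you need to extract the features for both promoters and enhancers, you can simply loop over the two BED files. This is a sketch using only the parameters shown above; the BED file paths and the target directory are hypothetical placeholders:

from epigenomic_dataset import build

# Hypothetical BED file paths: replace with your own region files
bed_files = {
    "promoters": "path/to/promoters.bed",
    "enhancers": "path/to/enhancers.bed"
}

for region, bed_path in bed_files.items():
    build(
        bed_path=bed_path,
        cell_lines=["HepG2", "H1"],
        path="path/to/my/target/{}".format(region),  # hypothetical per-region target directory
        clear_download=True
    )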

