A package to download, load, and process multiple benchmark multi-omic drug response datasets

Project description

Cancer drug benchmark dataset

There is a recent explosion of deep learning algorithms that

This package collects diverse sets of paired molecular datasets with corresponding drug sensitivity data. All data here is reprocessed and standardized so it can be easily used as a benchmark dataset. This repository leverages existing datasets to collect the data required for CANDLE data analysis. Since each deep learning model has distinct data requirements, the goal of this repository is to collect and format all data into a schema that existing models can use.

The goal of this repository is twofold. First, it aims to collate and standardize the data for all CANDLE-related models. This requires running a series of scripts to build and append to a standardized data model. Second, it provides a series of scripts that pull from the data model to create model-specific data files that can be run by the data infrastructure.

IMPROVE Data Model

The goal of the data model is to collate drug response data together with molecular data in a way that can be easily ingested by machine learning models. The overall schema is shown below.

We will store the data in tables that are represented by the files below. Each data-specific model can be generated from a smaller set of these tables. The schema for these tables is represented below.

The files are comma-delimited unless noted otherwise and are named as follows:

  1. genes.csv
  2. drugs.tsv.gz --> tab-delimited, because drug names contain commas and quotes
  3. samples.csv
  4. experiments.csv.gz --> compressed to fit on GitHub
  5. transcriptomics.csv.gz
  6. mutations.csv.gz
  7. copy_number.csv.gz
  8. methylation.csv.gz
  9. mirnas.csv.gz
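
As a minimal sketch of how the tables listed above could be read, the helper below picks the delimiter from the file name (tab for .tsv, comma for .csv) and opens gzip-compressed files transparently. It uses only the Python standard library. The function name `read_table` and the choice to return a list of dictionaries are illustrative assumptions, not part of this package, and the column names depend on the actual files.

```python
import csv
import gzip
from pathlib import Path


def read_table(path):
    """Read one data-model table into a list of row dictionaries.

    Hypothetical helper (not part of the package): the delimiter is
    inferred from the file name and .gz files are decompressed on the fly.
    """
    path = Path(path)
    suffixes = path.suffixes  # e.g. ['.tsv', '.gz'] for drugs.tsv.gz
    delimiter = "\t" if ".tsv" in suffixes else ","
    opener = gzip.open if path.suffix == ".gz" else open
    # newline="" lets the csv module handle quoted line breaks itself.
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

For example, `read_table("drugs.tsv.gz")` would parse the tab-delimited drug table, while `read_table("genes.csv")` would parse a plain comma-delimited file.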

Building the data model

Below is a description of how the data model is built.

Data model step | Description/Dependencies | Script | Destination
--- | --- | --- | ---
Build cell line data | Runs through PGX and existing CCLE data to compile all values | cell_line/buildInitialDataset.py | [./cell_line]
Build cptac data | This uses the genes files created in the [./cell_line] directory but generates additional samples. | cptac/getCptacData.py | [./cptac]
Get HCMI data | This uses a fixed manifest to download the data into the proper schema | hcmi/getHCMIData.py | [./hcmi]

Current data

What data is stored here?

Using the data model

Files are stored on FigShare. We need to build a script that pulls those data as needed.
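
Since that script does not exist yet, the following is only a sketch of what such a helper might look like, assuming each table is reachable at a direct download URL. The function name `fetch_table`, its parameters, and the local cache directory are hypothetical, and no real FigShare URL is given here; the caller must supply it.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_table(url, filename, dest_dir="data"):
    """Download one file into dest_dir unless it is already cached.

    Hypothetical helper: `url` must be a direct download link supplied
    by the caller, and `filename` is the local name to store it under.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / filename
    if target.exists():
        return target  # already downloaded, skip the network call
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target
```

Caching by file name keeps repeated runs cheap, which matters for the larger compressed tables; a real implementation would probably also verify a checksum after download.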

Download files

Source Distribution

coderdata-0.1.2.tar.gz (9.7 kB)

Built Distribution

coderdata-0.1.2-py3-none-any.whl (13.3 kB)
