
Preprocessing for various cancer genomics datasets

Project description

cancer_data

This package provides unified methods for accessing popular datasets used in cancer research.

Full documentation

Installation

pip install cancer_data

System requirements

The raw downloaded files occupy approximately 15 GB, and the processed HDF5 files take up about 10 GB. On a relatively recent machine with a fast SSD, processing all of the files after download takes about 3-4 hours. At least 16 GB of RAM is recommended for handling the large splicing tables.

Datasets

A complete description of the datasets may be found in schema.csv.

  • Cancer Cell Line Encyclopedia (CCLE): many datasets (see portal). Portal: https://portals.broadinstitute.org/ccle/data (registration required)
  • Cancer Dependency Map (DepMap): genome-wide CRISPR-Cas9 and RNAi screens, gene expression, mutations, and copy number. Portal: https://depmap.org/portal/download/
  • The Cancer Genome Atlas (TCGA): mutations, RNAseq expression and splicing, and copy number. Portal: https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
  • The Genotype-Tissue Expression (GTEx) Project: RNAseq expression and splicing. Portal: https://gtexportal.org/home/datasets

Features

The goal of this package is to make statistical analysis and coordination of these datasets easier. To that end, it provides the following features:

  1. Harmonization: datasets within a collection have sample IDs reduced to the same format. For instance, all CCLE+DepMap datasets have been modified to use Achilles/Arxspan IDs, rather than cell line names.
  2. Speed: processed datasets are all stored in high-performance HDF5 format, allowing large tables to be loaded orders of magnitude faster than with CSV or TSV formats.
  3. Space: tables of purely numerical values (e.g. gene expression, methylation, drug sensitivities) are stored in half-precision format. Compression is used for all tables, resulting in size reductions by factors of over 10 for sparse matrices such as mutation tables, and over 50 for highly-redundant tables such as gene-level copy number estimates.
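The half-precision storage behind feature 3 is easy to demonstrate with NumPy (a toy matrix standing in for a real expression table; the package itself additionally compresses these values inside HDF5 files):

```python
import numpy as np

# A dense "expression-like" matrix stored as 64-bit floats.
rng = np.random.default_rng(0)
expression = rng.random((1000, 500))  # dtype float64, 8 bytes per value

# Downcasting to half precision (float16) quarters the memory footprint
# before any compression is applied.
half = expression.astype(np.float16)

print(expression.nbytes // half.nbytes)  # → 4
```

Half precision sacrifices accuracy beyond roughly three significant digits, which is an acceptable trade-off for measurements like expression or methylation values that carry substantial experimental noise.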

How it works

The schema serves as the reference point for all datasets used. Each dataset is identified by a unique id column, which also serves as its access identifier.
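Indexing the schema by the id column can be sketched with the standard library (the column names id, download_url, downloaded_md5, and dependencies come from this document; the URLs and hashes in the toy rows below are made up):

```python
import csv
from io import StringIO

# A toy two-row schema standing in for schema.csv.
SCHEMA_CSV = """\
id,download_url,downloaded_md5,dependencies
ccle_annotations,https://example.org/annotations.csv,d41d8cd98f00b204e9800998ecf8427e,
ccle_proteomics,https://example.org/proteomics.csv,0cc175b9c0f1b6a831c399e269772661,ccle_annotations
"""

# Index every dataset by its unique id so the id doubles as an access identifier.
schema = {row["id"]: row for row in csv.DictReader(StringIO(SCHEMA_CSV))}

print(schema["ccle_proteomics"]["dependencies"])  # → ccle_annotations
```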

Datasets are downloaded from the location specified in download_url, after which they are checked against the provided downloaded_md5 hash.
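The verification step amounts to hashing the downloaded file and comparing against downloaded_md5. A minimal sketch (md5_of is an illustrative helper, not the package's actual function name):

```python
import hashlib
import tempfile

def md5_of(path, chunk_size=1 << 20):
    """Compute a file's MD5 digest, reading in 1 MiB chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstrate on a throwaway file; b"abc" has a well-known MD5 digest.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"abc")

print(md5_of(f.name))  # → 900150983cd24fb0d6963f7d28e17f72
```

A mismatch against the schema's downloaded_md5 value indicates a corrupted or truncated download.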

The next steps depend on the type of the dataset:

  • reference datasets, such as the hg19 FASTA files, are left as-is.
  • primary_dataset objects are preprocessed and converted into HDF5 format.
  • secondary_dataset objects are derived from one or more primary_dataset objects. These are also processed and converted into HDF5 format.

To keep track of which datasets are necessary for producing others, the dependencies column specifies the dataset ids required to build a given dataset. For instance, the ccle_proteomics dataset depends on the ccle_annotations dataset for converting cell line names to Achilles IDs. When running the processing pipeline, the package automatically checks that dependencies are met and raises an error if any are missing.
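The dependency check can be sketched as a recursive walk over the schema (a simplified stand-in for the package's internal logic; it assumes the dependency graph is acyclic):

```python
def check_dependencies(dataset_id, schema):
    """Recursively verify that every dependency of dataset_id exists in the schema."""
    for dep in schema[dataset_id]["dependencies"]:
        if dep not in schema:
            raise KeyError(f"{dataset_id} requires missing dataset {dep}")
        check_dependencies(dep, schema)

# A toy schema mirroring the ccle_proteomics -> ccle_annotations example.
schema = {
    "ccle_annotations": {"dependencies": []},
    "ccle_proteomics": {"dependencies": ["ccle_annotations"]},
}

check_dependencies("ccle_proteomics", schema)  # passes silently: all dependencies present
```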

Notes

Some datasets have filtering applied to reduce their size. These are listed below:

  • CCLE, GTEx, and TCGA splicing datasets have been filtered to remove splicing events with many missing values as well as those with low standard deviations.
  • When constructing binary mutation matrices (depmap_damaging and depmap_hotspot), a minimum mutation frequency is used to remove especially rare mutations (those present in fewer than four samples).
  • The TCGA MX splicing dataset is extremely large (approximately 10,000 rows by 900,000 columns), so it has been split column-wise into 8 chunks.
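The filtering and chunking steps above can be sketched with NumPy (the cutoffs here are illustrative only; the actual thresholds live in the processing code):

```python
import numpy as np

# Toy splicing matrix: 4 samples x 3 events, with missing values as NaN.
psi = np.array([
    [0.9, np.nan, 0.5],
    [0.8, np.nan, 0.5],
    [0.1, 0.2,    0.5],
    [0.2, np.nan, 0.5],
])

# Illustrative cutoffs for missingness and variability.
min_fraction_present = 0.75
min_std = 0.05

present = np.mean(~np.isnan(psi), axis=0) >= min_fraction_present  # drops event 1 (mostly NaN)
variable = np.nanstd(psi, axis=0) >= min_std                       # drops event 2 (constant)
kept = psi[:, present & variable]
print(kept.shape)  # → (4, 1)

# Column-wise chunking, as done for the TCGA MX splicing table (8 chunks there).
big = np.arange(24).reshape(2, 12)
chunks = np.array_split(big, 3, axis=1)
print([c.shape for c in chunks])  # → [(2, 4), (2, 4), (2, 4)]
```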

Project details


Download files

Download the file for your platform.

Source Distribution

cancer_data-0.3.6.tar.gz (20.9 kB)

Uploaded Source

Built Distribution

cancer_data-0.3.6-py3-none-any.whl (24.4 kB)

Uploaded Python 3

File details

Details for the file cancer_data-0.3.6.tar.gz.

File metadata

  • Download URL: cancer_data-0.3.6.tar.gz
  • Size: 20.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.10.6 Darwin/23.4.0

File hashes

Hashes for cancer_data-0.3.6.tar.gz
  • SHA256: a4dbcb804c2aff71be8ebda0baea4083b786432e5f4c6e3ec79459af96e60a1f
  • MD5: 58c2fd2eb3a03b0803fb55c0a859a36b
  • BLAKE2b-256: 695b46b96e864b44786df7ea88e3e654a690ccf7f716e1b36604666509eae468


File details

Details for the file cancer_data-0.3.6-py3-none-any.whl.

File metadata

  • Download URL: cancer_data-0.3.6-py3-none-any.whl
  • Size: 24.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.10.6 Darwin/23.4.0

File hashes

Hashes for cancer_data-0.3.6-py3-none-any.whl
  • SHA256: b5053ec75bfc95e4a35fac603b23ba60f54cb3e5aa819fff981e0813809d3fc4
  • MD5: ab12c1b10b440c6413b4aaf75d8f5aa6
  • BLAKE2b-256: 0b0c31ca191221fa5cf1ae15a605d4efe4d28abb994d662c049bdd925d511ca0

