This package contains utilities to process TCGA data for fedpydeseq2.
Datasets organisation
This directory contains the data, assets and scripts necessary to:
- download the raw data necessary to run the tests and experiments, when not available in the repository, in the download_data directory;
- open the data when performing a Substra experiment, in the assets directory;
- store the data, in the data directory.
Data download
For a detailed description of the data download process, please refer to the README.
To run the download pipeline directly, use the fedpydeseq2-download-data script, which is installed with the distribution:
fedpydeseq2-download-data
By default, this script downloads the data into the data/raw directory at the root of the GitHub repository.
To change the location of the raw data download, add the following option:
fedpydeseq2-download-data --raw_data_output_path <path>
If you only want the LUAD dataset, add the --only_luad flag.
You can pass the conda activation path as an argument as well, for example:
fedpydeseq2-download-data --raw_data_output_path <path> --conda_activate_path /opt/miniconda/bin/activate
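The options above can also be assembled programmatically. The following is a minimal sketch, not part of the package: the helper name build_download_command is hypothetical, and only the three flags documented above are used.

```python
import shlex
from typing import List, Optional


def build_download_command(
    raw_data_output_path: Optional[str] = None,
    only_luad: bool = False,
    conda_activate_path: Optional[str] = None,
) -> List[str]:
    """Build the argument list for the fedpydeseq2-download-data script.

    Hypothetical helper: it only strings together the flags described
    in this README and does not run anything itself.
    """
    command = ["fedpydeseq2-download-data"]
    if raw_data_output_path is not None:
        command += ["--raw_data_output_path", raw_data_output_path]
    if only_luad:
        command.append("--only_luad")
    if conda_activate_path is not None:
        command += ["--conda_activate_path", conda_activate_path]
    return command


# Show the shell command that would be executed.
print(shlex.join(build_download_command("data/raw", only_luad=True)))
```

The returned list can be passed to subprocess.run(command, check=True) once the package is installed.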
Origin of the data
- The Counts_raw.parquet and recount3_metadata.csv files are downloaded from the RECOUNT3 database.
- The tumor_purity_metadata.csv file is downloaded from the "Systematic pan-cancer analysis of tumour purity" paper.
- The cleaned_clinical_metadata.csv file is downloaded from the "An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics" paper.
For more detailed references, see the References section.
Assets
The assets directory contains a TCGA opener necessary to open the data on each center
when performing a federated experiment with Substra.
In particular, the fedpydeseq2_datasets/assets/tcga directory contains the following files:
assets/tcga
├── description.md
├── opener.py
The opener is a Python script that opens the data and makes it available to the
Substra platform. The description.md file contains a description of the data.
For more details on how the opener works, please refer to the Substra documentation.
Raw data organisation
The data directory contains the raw data.
The raw directory contains the data downloaded from the original sources,
with the download_data scripts.
It is organized as follows:
data
├── raw
│ └── tcga
│ ├── <COHORT NAME>
│ │ ├── Counts_raw.parquet
│ │ └── recount3_metadata.csv
│ ├── centers.csv
│ ├── cleaned_clinical_metadata.csv
│ └── tumor_purity_metadata.csv
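As a quick sanity check after downloading, the layout above can be verified with a few lines of Python. This is a sketch under assumptions: the function name missing_raw_files is hypothetical, and the cohort directory is assumed to be named exactly as the cohort string passed in (for example LUAD).

```python
from pathlib import Path
from typing import List, Union

# File names taken from the directory tree above.
SHARED_FILES = [
    "centers.csv",
    "cleaned_clinical_metadata.csv",
    "tumor_purity_metadata.csv",
]
COHORT_FILES = ["Counts_raw.parquet", "recount3_metadata.csv"]


def missing_raw_files(raw_data_path: Union[str, Path], cohort: str) -> List[Path]:
    """Return the expected raw files that are absent under raw_data_path.

    Assumption: the cohort directory name equals the `cohort` argument.
    """
    tcga_dir = Path(raw_data_path) / "tcga"
    expected = [tcga_dir / name for name in SHARED_FILES]
    expected += [tcga_dir / cohort / name for name in COHORT_FILES]
    return [path for path in expected if not path.exists()]
```

An empty list means every file in the tree is present for that cohort.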
Data preprocessing
This module not only provides the raw data on which to test fedpydeseq2; it also provides the necessary
preprocessing functions, to organise the data according to their center of origin, and to aggregate the raw
data into metadata and counts acceptable to run pydeseq2 or fedpydeseq2.
These preprocessing functions usually write the preprocessed data to a processed_data_path directory, with the following
structure (the files shown below are created by different preprocessing functions).
└── <processed_data_path>
├── tcga
│ ├── <COHORT NAME>
│ │ ├── counts.parquet
│ │ └── clinical_data.csv
├── centers_data
│ └── tcga
│ ├── <dataset_and_dge_experiment_id>
│ │ ├── <center_i>
│ │ │ ├── counts_data.csv
│ │ │ ├── metadata.csv
│ │ │ └── ground_truth_dds.pkl
└── pooled_data
└── tcga
├── <dataset_and_dge_experiment_id>
│ ├── counts_data.csv
│ ├── metadata.csv
│ └── ground_truth_dds.pkl
These files are automatically generated from the raw files if they are not
already present, or if the force option is on.
Note that the centers are always indexed by an integer, starting from 0. For example, one would
have center_0, ..., center_3 if there are 4 centers in the experiment.
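To illustrate the indexing convention, here is a small sketch that lists the per-center directories for an experiment. The helper name center_directories is hypothetical; the path components follow the tree shown above.

```python
from pathlib import Path
from typing import List, Union


def center_directories(
    processed_data_path: Union[str, Path],
    experiment_id: str,
    n_centers: int,
) -> List[Path]:
    """List the center directories center_0, ..., center_{n_centers - 1}.

    Hypothetical helper: it only builds paths and does not create them.
    """
    base = Path(processed_data_path) / "centers_data" / "tcga" / experiment_id
    return [base / f"center_{index}" for index in range(n_centers)]
```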
The <dataset_and_dge_experiment_id> is an identifier of a differential gene expression task (and its specific hyperparameters)
and a TCGA dataset. fedpydeseq2 or deseq2 can then be run on the data corresponding to that DGE task.
Details on the processed data
In this repository, we study the following cofactors:
- the gender of the patients, which is obtained from the cleaned_clinical_metadata.csv file;
- the CPE of the samples, which is obtained from the tumor_purity_metadata.csv file;
- the stage of the patients, which is obtained from the cleaned_clinical_metadata.csv file. The stage originally ranges from I to IV, but we group it into I-II-III (Non-advanced) and IV (Advanced) stages, to obtain a binary covariate;
- the center_id of the samples, which is obtained from the centers.csv file and used to create natural centers for the federated experiments.
The processing is done by functions in the fedpydeseq2_datasets directory. There are three main functions:
- the common_preprocessing_tcga function in the fedpydeseq2_datasets/common_preprocessing.py file;
- the setup_tcga_dataset function in the fedpydeseq2_datasets/process_and_split_data.py file;
- the setup_tcga_ground_truth_dds function in the tcga_preprocessing/create_reference_dds.py file.
The role of the common_preprocessing_tcga function is to generate counts and processed
clinical data for a given cohort (e.g. LUAD), from the raw data.
└── processed
├── tcga
│ ├── <COHORT NAME>
│ │ ├── counts.parquet
│ │ └── clinical_data.csv
The counts.parquet file contains the counts data, indexed by TCGA sample barcode,
and with columns corresponding to the gene_id in ENSEMBL convention.
Note that we filter out the PAR_Y genes, as they are not common to all patients.
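The PAR_Y filter can be pictured with the sketch below. It assumes that PAR_Y genes are recognisable by a "_PAR_Y" suffix on the gene identifier, as is the case in GENCODE annotations; the package may identify them differently, and the function name drop_par_y_genes is hypothetical.

```python
from typing import Iterable, List


def drop_par_y_genes(gene_ids: Iterable[str]) -> List[str]:
    """Keep only the gene ids that do not carry the _PAR_Y suffix.

    Assumption: PAR_Y genes are marked by a "_PAR_Y" suffix on the id.
    """
    return [gene_id for gene_id in gene_ids if not gene_id.endswith("_PAR_Y")]


# Illustrative identifiers only.
example_ids = ["ENSG00000000003.14", "ENSG00000182378.13_PAR_Y"]
print(drop_par_y_genes(example_ids))
```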
The clinical_data.csv file aggregates the different metadata from the different sources
described above in a per-cohort fashion. This csv is indexed by the TCGA sample barcode.
It contains the following columns:
- gender: the gender of the patient;
- CPE: CPE stands for "consensus measurement of purity estimations", and is an aggregate of different purity estimations for the sample;
- stage: the stage of the patient, as an integer between 1 and 4;
- center_id: the center id of the sample, as an integer;
- is_normal_tissue: a boolean indicating whether the sample is a normal tissue;
- T: the tumor grade of the patient, as an integer between 1 and 4;
- N: the nodal status of the patient, as an integer between 0 and 3;
- M: the metastasis status of the patient, as an integer between 0 and 1.
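The integer ranges listed for these columns can be checked row by row. The following is a sketch only: the function name out_of_range_columns is hypothetical, rows are represented as plain dictionaries, and missing values (None) are assumed to be acceptable and are skipped.

```python
from typing import Any, Dict, List

# Inclusive integer ranges, as described for clinical_data.csv.
INTEGER_RANGES = {"stage": (1, 4), "T": (1, 4), "N": (0, 3), "M": (0, 1)}


def out_of_range_columns(row: Dict[str, Any]) -> List[str]:
    """Return the columns of `row` whose value is outside its documented range.

    Assumption: a missing value (None or absent key) is tolerated.
    """
    problems = []
    for column, (low, high) in INTEGER_RANGES.items():
        value = row.get(column)
        if value is None:
            continue  # missing values are skipped, not flagged
        if not isinstance(value, int) or not low <= value <= high:
            problems.append(column)
    return problems
```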
The role of the setup_tcga_dataset function and the setup_tcga_ground_truth_dds function
is to generate the data necessary for the federated experiments and the corresponding pooled experiments, creating
this part of the directory tree:
└── processed
├── centers_data
│ └── tcga
│ ├── <dataset_and_dge_experiment_id>
│ │ ├── <center_i>
│ │ │ ├── counts_data.csv
│ │ │ ├── metadata.csv
│ │ │ └── ground_truth_dds.pkl
└── pooled_data
└── tcga
├── <dataset_and_dge_experiment_id>
│ ├── counts_data.csv
│ ├── metadata.csv
│ └── ground_truth_dds.pkl
The <dataset_and_dge_experiment_id> identifies an experiment. It concatenates
not only the dataset name (TCGA cohort), but also the design factors, continuous factors
as well as other parameters used to filter the data.
The counts_data.csv file contains the counts data, indexed by TCGA sample barcode,
and with columns corresponding to the gene_id in ENSEMBL convention.
The metadata.csv file contains the clinical data, indexed by the TCGA sample barcode, and
containing only the columns corresponding to a design factor.
The ground_truth_dds.pkl file contains the ground truth for the differential expression
analysis, as a dds object from the DESeq2 package.
For more details on these functions, please refer to their respective documentation.
Note: the setup_tcga_dataset function will binarize the stage into two categories: Advanced and Non-advanced. Advanced corresponds to stage IV, and Non-advanced corresponds to stages I, II and III. For the TCGA-PRAD cohort, we do not have the stage information, but we infer the stage from the T, N and M columns. If the N or M columns are > 0, the stage is IV (see the following reference) and hence the Advanced stage. Otherwise, it is Non-advanced.
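The binarization rule in the note above can be sketched as follows. This is an illustration rather than the code of setup_tcga_dataset: the function name binarize_stage is hypothetical, and a missing N or M value is assumed to count as "not greater than 0".

```python
from typing import Optional


def binarize_stage(
    stage: Optional[int],
    n: Optional[int] = None,
    m: Optional[int] = None,
) -> str:
    """Map a stage (1 to 4) to "Advanced" or "Non-advanced".

    When the stage is missing (as for TCGA-PRAD), it is inferred from
    the N and M columns: N > 0 or M > 0 means stage IV.
    Assumption: a missing N or M is treated as not greater than 0.
    """
    if stage is not None:
        return "Advanced" if stage == 4 else "Non-advanced"
    if (n is not None and n > 0) or (m is not None and m > 0):
        return "Advanced"
    return "Non-advanced"
```

For example, binarize_stage(None, n=1, m=0) falls back to the N and M rule, while binarize_stage(3) uses the stage directly.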
References
The data downloaded here has mainly been obtained from TCGA and processed by the following works.
[1] Aran D, Sirota M, Butte AJ. Systematic pan-cancer analysis of tumour purity. Nat Commun. 2015 Dec 4;6:8971. doi: 10.1038/ncomms9971. Erratum in: Nat Commun. 2016 Feb 05;7:10707. doi: 10.1038/ncomms10707. PMID: 26634437; PMCID: PMC4671203. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4671203/
[2] Jianfang Liu, Tara Lichtenberg, Katherine A. Hoadley, Laila M. Poisson, Alexander J. Lazar, Andrew D. Cherniack, Albert J. Kovatich, Christopher C. Benz, Douglas A. Levine, Adrian V. Lee, Larsson Omberg, Denise M. Wolf, Craig D. Shriver, Vesteinn Thorsson et al. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics, Cell, Volume 173, Issue 2, 2018, Pages 400-416.e11, ISSN 0092-8674, https://doi.org/10.1016/j.cell.2018.02.05 https://www.sciencedirect.com/science/article/pii/S0092867418302290
[3] Wilks C, Zheng SC, Chen FY, Charles R, Solomon B, Ling JP, Imada EL, Zhang D, Joseph L, Leek JT, Jaffe AE, Nellore A, Collado-Torres L, Hansen KD, Langmead B (2021). "recount3: summaries and queries for large-scale RNA-seq expression and splicing." Genome Biol. doi:10.1186/s13059-021-02533-6 https://doi.org/10.1186/s13059-021-02533-6