Tools for imputation, segmentation, analysis, and plotting of Copy Number Segments (CNS).
Project description
CNSistent is a Python tool for processing and analyzing copy number data from a variety of sources. It aims to be easy to use while providing a comprehensive set of analyses and visualizations.
This repository also contains data from the PCAWG, TCGA, and TRACERx datasets, as well as gene sets from COSMIC and Ensembl, so git clone will download around 1 GB of data. If you only want to use the library, download via PyPI [TODO].
Repository Data Quickstart
This repository contains raw data from PCAWG, TCGA, TRACERx, as well as genomic locations. The data needs to be processed first before it can be used.
Requirements
- Git LFS
- Python 3.8+
- Pip 21.3+
- (Optional) Conda for environment creation
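The requirements above can be checked before cloning. The following is a minimal sketch using only the standard library; check_requirements is a hypothetical helper and not part of CNSistent. It only tests that the tools can be found and does not verify the pip version.

```python
import shutil
import sys


def check_requirements(min_python=(3, 8)):
    """Report which prerequisites from the list above appear to be available."""
    return {
        # The interpreter running this script must be Python 3.8 or newer.
        "python": sys.version_info[:2] >= min_python,
        # Git LFS installs a `git-lfs` executable that must be on PATH.
        "git_lfs": shutil.which("git-lfs") is not None,
        # Conda is optional, so a missing executable is not an error.
        "conda": shutil.which("conda") is not None,
    }


if __name__ == "__main__":
    for name, available in check_requirements().items():
        print(f"{name}: {'found' if available else 'missing'}")
```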
Processing
- Clone the repository:
  git clone https://bitbucket.org/schwarzlab/cnsistent
- Install dependencies:
  pip install -r requirements.txt
  or create a Conda environment:
  conda env create -f cnsistent.yml
- Install the package from its location:
  pip install -e .
  The -e flag makes sure that the data files can be accessed under the cns package.
- [Optional] Process data:
  bash ./scripts/data_process.sh
  This creates imputed data and sample statistics.
- [Optional] Aggregate data:
  bash ./scripts/data_aggregate.sh
  This aggregates the imputed data using 15 different segmentation strategies.
Usage
To load the data use:
from cns.data_utils import main_load
samples_df, cns_df = main_load("imp")
This will load the imputed and filtered data for all datasets.
The samples_df and cns_df are Pandas dataframes.
The former contains information about each sample as well as its statistics (e.g. ane_both_ all for homozygous aneuploidy across all chromosomes).
The latter contains the copy number segments for each sample in the form of sample_id, chrom, start, end, major_cn, minor_cn, name where name identifies each segment.
For example, to load CNs for the COSMIC genes, you can use the same function:
samples_df, cns_df = main_load("COSMIC")
cns_df.head()
would produce
   sample_id  chrom    start      end  major_cn  minor_cn      name
0   SP101724   chr1  2160133  2241558         2         2       SKI
1   SP101724   chr1  2487077  2496821         2         2  TNFRSF14
2   SP101724   chr1  2985731  3355185         2         2    PRDM16
3   SP101724   chr1  6241328  6269449         2         2     RPL22
4   SP101724   chr1  6845383  7829766         2         2    CAMTA1
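Because cns_df is a regular Pandas dataframe, derived columns can be added with ordinary column arithmetic. The sketch below rebuilds the five rows shown above by hand so that it runs without the repository data. The total_cn and length columns are illustrative names, not columns produced by CNSistent, and treating end as exclusive when computing the length is an assumption.

```python
import pandas as pd

# The five rows shown above, rebuilt by hand so the example runs without the repository data.
cns_df = pd.DataFrame(
    {
        "sample_id": ["SP101724"] * 5,
        "chrom": ["chr1"] * 5,
        "start": [2160133, 2487077, 2985731, 6241328, 6845383],
        "end": [2241558, 2496821, 3355185, 6269449, 7829766],
        "major_cn": [2, 2, 2, 2, 2],
        "minor_cn": [2, 2, 2, 2, 2],
        "name": ["SKI", "TNFRSF14", "PRDM16", "RPL22", "CAMTA1"],
    }
)

# Total copy number is the sum of the two allele-specific values.
cns_df["total_cn"] = cns_df["major_cn"] + cns_df["minor_cn"]
# Segment length in base pairs; treating end as exclusive is an assumption.
cns_df["length"] = cns_df["end"] - cns_df["start"]

# Select the segment of a single gene by its name.
ski = cns_df[cns_df["name"] == "SKI"]
print(ski[["sample_id", "name", "total_cn", "length"]])
```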
Alternatively, you can call:
- main_load to only load samples,
- main_load("raw") to load the raw data,
- main_load("imp") to load the imputed data,
- main_load(agg_type) to load the aggregated bins, if the aggregation has been done, where agg_type is one of: ["1MB", "2MB", "3MB", "5MB", "10MB", "250KB", "500KB", "whole", "arms", "bands", "COSMIC", "ENSEMBL"].
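Since the data type is passed as a plain string, a typo only surfaces at load time. A small guard can catch it earlier. This is a sketch only: check_data_type is a hypothetical helper that is not part of the cns package, and the names are copied from the list above. Whether main_load itself treats names case-sensitively is not stated here, so the sketch compares them exactly.

```python
# Aggregation names copied from the list above.
AGG_TYPES = ["1MB", "2MB", "3MB", "5MB", "10MB", "250KB", "500KB",
             "whole", "arms", "bands", "COSMIC", "ENSEMBL"]
# The two non-aggregated data types accepted by main_load.
BASE_TYPES = ["raw", "imp"]


def check_data_type(data_type):
    """Return data_type unchanged if it is one of the documented names."""
    if data_type in BASE_TYPES or data_type in AGG_TYPES:
        return data_type
    valid = ", ".join(BASE_TYPES + AGG_TYPES)
    raise ValueError(f"Unknown data type {data_type!r}; expected one of: {valid}")
```

With the package installed, the guard would wrap the argument, for example main_load(check_data_type("COSMIC")).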
Notes
- By default, 16 threads are used. If that causes problems (crashes), reduce the number of threads in the data_process.sh and data_aggregate.sh scripts.
- The example_API.py is split into cells that can be run individually in an IDE.
- You can also install the package with pip install ., however there is a set of utility functions for loading data in cns.data_utils.py that will not be accessible then.
- Conda is optional; you can also install the required packages manually using pip, based on the list in cnsistent.yml.
- Additionally, 5 of the PCAWG medulloblastoma samples have been labeled as female in the source, however they contained CN calls for chromosome Y and we have therefore re-labelled them as male.
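The kind of consistency check behind the last note can be sketched with Pandas. This is not the code that was used for the re-labelling. The column name sex, the values "female" and "male", the chromosome label "chrY", and the helper name relabel_female_with_chrY are all assumptions made for illustration, and the sketch counts any segment on chromosome Y as a call.

```python
import pandas as pd


def relabel_female_with_chrY(samples_df, cns_df, sex_col="sex"):
    """Return a copy of samples_df where female samples with chrY segments are set to male."""
    # Sample IDs that have at least one segment on chromosome Y.
    with_y = set(cns_df.loc[cns_df["chrom"] == "chrY", "sample_id"])
    fixed = samples_df.copy()
    # Rows labeled female (hypothetical column) that nevertheless have chrY calls.
    mismatch = (fixed[sex_col] == "female") & fixed["sample_id"].isin(with_y)
    fixed.loc[mismatch, sex_col] = "male"
    return fixed


# Toy data: S2 is labeled female but has a call on chrY.
samples = pd.DataFrame({"sample_id": ["S1", "S2", "S3"], "sex": ["female", "female", "male"]})
segments = pd.DataFrame({"sample_id": ["S1", "S2", "S3"], "chrom": ["chrX", "chrY", "chrY"]})
print(relabel_female_with_chrY(samples, segments))
```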
Repository Structure
.
- cnsistent.yml: Conda environment file for the CNSistent package, references requirements.txt.
- requirements.txt: Packages required to run the CNSistent package.
- example_API.py: Example code for using the CNSistent package.
- example_CLI.sh: Example code for using the CNSistent package from the command line.
- pyproject.toml: Configuration for packaging tools.
cns/
Contains the main code for the CNSistent package.
data/
Contains the raw data from PCAWG, TCGA, and TRACERx, as well as genomic locations and a notebook used to obtain them or merge source files.
docs/
Contains the documentation for the CNSistent package. The documentation is built using Sphinx, with the source in the ./docs/source folder. The documentation can be built using the make html command in the ./docs folder, provided the requirements in ./docs/requirements.txt are met.
notebooks/
Contains notebooks used for data processing and analysis:
- analyze_break_clusters.ipynb: A notebook used to analyze the breakpoint clustering, based on the distance between merged breakpoints.
- analyze_CN_clipping.ipynb: Evaluation of the result of clipping the CN segment values, in particular the effects on the distribution and the proportion that is clipped off.
- analyze_coverage.ipynb: Calculates the proportion of the genome that is covered by segments and locations where it applies.
- analyze_features.ipynb: Calculates and plots features across datasets.
- analyze_lung.ipynb: Plots the lung cancer data across datasets and cancer types, in particular for chromosome 3 and genes that have been established as important by the IG method.
- analyze_SOX2_overlay.ipynb: Plots the SOX2 gene overlay on the lung cancer data.
- analyze_types.ipynb: Plots the distribution of cancer types and overall CN across datasets.
- data_obtain.ipynb: A notebook that has been used to obtain the raw data and potentially merge files where needed.
- docs_illustrations.ipynb: A notebook used to create illustrations for the documentation.
- docs_knee_detection.ipynb: A demo of the kneepoint detection algorithm.
- docs_runtime.ipynb: Calculates the runtime of the data processing across 1-32 threads (log scale).
scripts/
- data_process.sh: Fills and imputes the raw data. Also calculates the data stats, in particular coverage and aneuploidy.
- data_aggregate.sh: Creates various segmentations, and aggregates the preprocessed data based on these segmentations. Depends on data_cluster.py for breakpoint clustering.
- data_time.sh: Runtime tests for the data processing across 1-32 threads (log scale).
tests/
- in and out: Contain the input and output data for the tests. Output is generated using example_CLI.sh.
- test_cli.sh: Executes the tests and outputs to ./tests/temp.
- test_*: unittest-based tests of the public API.
Reference
Data
The contents of the data folder were obtained by processing the following sources, accessed in December 2023.
TCGA data obtained from ASCATv3 at: https://github.com/VanLoo-lab/ascat/tree/master/ReleasedData
Cite: https://www.pnas.org/doi/full/10.1073/pnas.1009843107
The results published here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
PCAWG data obtained from: https://dcc.icgc.org/releases/PCAWG/consensus_cnv
Cite: https://www.nature.com/articles/s41587-019-0055-9
TRACERx data obtained from: https://zenodo.org/records/7649257
Cite: https://www.nature.com/articles/s41586-023-05729-x
COSMIC cancer set obtained from: https://cancer.sanger.ac.uk/census
Cite: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6450507
Human genome gene set obtained using PyENSEMBL (2023).
Cite: https://academic.oup.com/nar/article/51/D1/D933/6786199
Cytoband, Gap data obtained from: https://genome.ucsc.edu
Cite: https://www.nature.com/articles/35057062
Please cite
TBD, pre-print expected 2024.
The MIT License
Copyright © 2023 Dr. Adam Streck, adam.streck@gmail.com
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
File details
Details for the file cnsistent-0.6.3.tar.gz.
File metadata
- Download URL: cnsistent-0.6.3.tar.gz
- Upload date:
- Size: 29.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 362c716fce80fdc02929d66e7550fb32d8c38c4802584dfc6b02e15d50e414a5 |
| MD5 | b213135542ab5eec328133f0384525c8 |
| BLAKE2b-256 | 6982a597eabcbbc88231cc8b2aca3bddb467272bb9fab68a666179100a295d48 |
File details
Details for the file cnsistent-0.6.3-py3-none-any.whl.
File metadata
- Download URL: cnsistent-0.6.3-py3-none-any.whl
- Upload date:
- Size: 74.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b8ce8998b9ed8ed708a017a7c39aa786db1559c4059eb8f9ab8323682d52fbcf |
| MD5 | 09b038e07c1b7dda9065c52186a4ba0a |
| BLAKE2b-256 | cf9ed5ef906023b39db23b4573341bac5b0f8acec0d719f61082f62b753e9458 |
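The published SHA256 digests can be used to check a downloaded file before installing it. The following is a minimal sketch with the standard hashlib module; sha256_of_file and matches_published are hypothetical helper names, and the file path is whatever location the file was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's digest with a published one, ignoring letter case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published("cnsistent-0.6.3-py3-none-any.whl", "b8ce8998b9ed8ed708a017a7c39aa786db1559c4059eb8f9ab8323682d52fbcf") should return True for an intact copy of the wheel.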