Tools for imputation, segmentation, analysis, and plotting of Copy Number Segments (CNS).

CNSistent is a Python tool for processing and analyzing copy number data from a variety of sources. It aims to be easy to use and to provide a comprehensive set of analyses and visualizations.

This repository also contains data from the PCAWG, TCGA, and TRACERx datasets, as well as gene sets from COSMIC and Ensembl, so git clone will download around 1 GB of data. If you only want to use the library, download via PyPI [TODO].

Repository Data Quickstart

This repository contains raw data from PCAWG, TCGA, and TRACERx, as well as genomic locations. The data must be processed before it can be used.

Requirements

  • Git LFS
  • Python 3.8+
  • Pip 21.3+
  • (Optional) Conda for environment creation

Processing

  1. Clone the repository: git clone https://bitbucket.org/schwarzlab/cnsistent
  2. Install dependencies (pip install -r requirements.txt) or create a Conda environment (conda env create -f cnsistent.yml).
  3. Install the package in place: pip install -e . The -e (editable) flag makes sure that the data files can be accessed under the cns package.
  4. [Optional] Process data: bash ./scripts/data_process.sh creates the imputed data and sample statistics.
  5. [Optional] Aggregate data: bash ./scripts/data_aggregate.sh aggregates the imputed data using 15 different segmentation strategies.
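
The steps above can also be written down as a small dry-run plan. The following is only a sketch: the STEPS list and the plan helper are hypothetical names, only the pip route from step 2 is included, and the commands are printed rather than executed. Everything after the clone is meant to be run from inside the cloned directory.

```python
import shlex

# Commands copied from the numbered steps above; the flag marks optional steps.
STEPS = [
    ("git clone https://bitbucket.org/schwarzlab/cnsistent", False),
    ("pip install -r requirements.txt", False),
    ("pip install -e .", False),
    ("bash ./scripts/data_process.sh", True),
    ("bash ./scripts/data_aggregate.sh", True),
]


def plan(include_optional=False):
    """Return the commands to run, each split into an argument list."""
    return [
        shlex.split(command)
        for command, optional in STEPS
        if include_optional or not optional
    ]


# Dry run: print the full plan instead of executing it.
for args in plan(include_optional=True):
    print(" ".join(args))
```

Each argument list could be passed to subprocess.run once the plan looks right.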

Usage

To load the data use:

    from cns.data_utils import main_load
    samples_df, cns_df = main_load("imp")

This will load the imputed and filtered data for all datasets.

The samples_df and cns_df are Pandas dataframes. The former contains information about each sample as well as its statistics (e.g. ane_both_all for homozygous aneuploidy across all chromosomes). The latter contains the copy number segments for each sample in the form of sample_id, chrom, start, end, major_cn, minor_cn, name, where name identifies each segment. For example, to load CNs for the COSMIC genes, you can use the same function:

    samples_df, cns_df = main_load("COSMIC")
    cns_df.head()

This would produce:

      sample_id chrom start    end     major_cn  minor_cn name
    0 SP101724  chr1  2160133  2241558 2         2        SKI
    1 SP101724  chr1  2487077  2496821 2         2        TNFRSF14
    2 SP101724  chr1  2985731  3355185 2         2        PRDM16
    3 SP101724  chr1  6241328  6269449 2         2        RPL22
    4 SP101724  chr1  6845383  7829766 2         2        CAMTA1
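
Because cns_df is a regular Pandas dataframe, the usual Pandas operations apply to it. The sketch below rebuilds the five rows shown above by hand, so that it runs without the package, and derives two extra columns. The names total_cn and span are assumptions made for illustration, not columns produced by CNSistent, and span is simply end minus start, since the text does not state whether the coordinates are inclusive.

```python
import pandas as pd

# The five rows printed above, rebuilt by hand so the sketch runs standalone.
cns_df = pd.DataFrame(
    {
        "sample_id": ["SP101724"] * 5,
        "chrom": ["chr1"] * 5,
        "start": [2160133, 2487077, 2985731, 6241328, 6845383],
        "end": [2241558, 2496821, 3355185, 6269449, 7829766],
        "major_cn": [2] * 5,
        "minor_cn": [2] * 5,
        "name": ["SKI", "TNFRSF14", "PRDM16", "RPL22", "CAMTA1"],
    }
)

# Derived columns (hypothetical names, not part of the package output).
cns_df["total_cn"] = cns_df["major_cn"] + cns_df["minor_cn"]
cns_df["span"] = cns_df["end"] - cns_df["start"]

# Look up a single gene by its segment name.
ski = cns_df.loc[cns_df["name"] == "SKI"].iloc[0]
print(ski["total_cn"], ski["span"])
```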

Alternatively, you can call:

  • main_load to load only the samples,
  • main_load("raw") to load the raw data,
  • main_load("imp") to load the imputed data,
  • main_load(agg_type) to load the aggregated bins, provided the aggregation has been run, where agg_type is one of: ["1MB", "2MB", "3MB", "5MB", "10MB", "250KB", "500KB", "whole", "arms", "bands", "COSMIC", "ENSEMBL"].
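
Since the argument is a plain string, a typo only surfaces when the load fails. A minimal sketch of checking the name first is given below; check_load_type is a hypothetical helper that is not part of the package, and the two sets only restate the values listed above.

```python
# Load types listed above; "raw" and "imp" are the unaggregated variants.
BASE_TYPES = {"raw", "imp"}
AGG_TYPES = {
    "1MB", "2MB", "3MB", "5MB", "10MB", "250KB", "500KB",
    "whole", "arms", "bands", "COSMIC", "ENSEMBL",
}


def check_load_type(name):
    """Hypothetical helper: return name if it is a documented load type."""
    known = BASE_TYPES | AGG_TYPES
    if name not in known:
        valid = ", ".join(sorted(known))
        raise ValueError(f"unknown load type {name!r}; expected one of: {valid}")
    return name


# Usage (requires the repository install, so it is commented out here):
# from cns.data_utils import main_load
# samples_df, cns_df = main_load(check_load_type("COSMIC"))
```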

Notes

  • By default, 16 threads are used. If that causes problems (crashes), reduce the number of threads in the data_process.sh and data_aggregate.sh scripts.
  • The example_API.py is split into cells that can be run individually in an IDE.
  • You can also install the package with pip install ., but the utility functions for loading data in cns.data_utils.py will then not be accessible.
  • Conda is optional; you can also install the required packages manually using pip, based on the list in cnsistent.yml.
  • Five of the PCAWG medulloblastoma samples are labeled as female in the source but contain CN calls for chromosome Y; we have therefore relabeled them as male.
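
The kind of consistency check behind the last note can be sketched in a few lines of Pandas. This is an illustration on toy data, not the procedure used for the repository: the sex column name, the sample IDs, and the rule "any segment on chrY counts as a CN call" are all assumptions.

```python
import pandas as pd

# Toy data; the "sex" column name and all values are assumptions.
samples = pd.DataFrame(
    {"sample_id": ["S1", "S2", "S3"], "sex": ["female", "female", "male"]}
).set_index("sample_id")
segments = pd.DataFrame(
    {
        "sample_id": ["S1", "S1", "S2", "S3"],
        "chrom": ["chr1", "chrY", "chr1", "chrY"],
    }
)

# Samples that have at least one segment on chromosome Y.
has_chr_y = set(segments.loc[segments["chrom"] == "chrY", "sample_id"])

# Labeled female but carrying chrY calls: candidates for relabeling.
mask = (samples["sex"] == "female") & samples.index.isin(has_chr_y)
mismatch = samples[mask].index.tolist()
samples.loc[mask, "sex"] = "male"
print(mismatch)
```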

Repository Structure

.

  • cnsistent.yml: Conda environment file for the CNSistent package, references requirements.txt.
  • requirements.txt: Packages required to run the CNSistent package.
  • example_API.py: Example code for using the CNSistent package.
  • example_CLI.sh: Example code for using the CNSistent package from the command line.
  • pyproject.toml: Configuration for packaging tools.

cns/

Contains the main code for the CNSistent package.

data/

Contains the raw data from PCAWG, TCGA, and TRACERx, as well as genomic locations, and a notebook used to obtain the data or merge source files.

docs/

Contains the documentation for the CNSistent package. The documentation is built using Sphinx, with the source in the ./docs/source folder. The documentation can be built using the make html command in the ./docs folder, provided the requirements in ./docs/requirements.txt are met.

notebooks/

Contains notebooks used for data processing and analysis:

  • analyze_break_clusters.ipynb: A notebook used to analyze the breakpoint clustering, based on the distance between merged breakpoints.
  • analyze_CN_clipping.ipynb: Evaluates the result of clipping the CN segment values, in particular the effects on the distribution and the proportion that is clipped off.
  • analyze_coverage.ipynb: Calculates the proportion of the genome that is covered by segments and locations where it applies.
  • analyze_features.ipynb: Calculates and plots features across datasets.
  • analyze_lung.ipynb: Plots the lung cancer data across datasets and cancer types, in particular for chromosome 3 and genes that have been established as important by the IG method.
  • analyze_SOX2_overlay.ipynb: Plots the SOX2 gene overlay on the lung cancer data.
  • analyze_types.ipynb: Plots the distribution of cancer types and overall CN across datasets.
  • data_obtain.ipynb: A notebook that has been used to obtain the raw data and potentially merge files where needed.
  • docs_illustrations.ipynb: A notebook used to create illustrations for the documentation.
  • docs_knee_detection.ipynb: A demo of the kneepoint detection algorithm.
  • docs_runtime.ipynb: Calculates the runtime of the data processing across 1-32 threads (log scale).
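
As an illustration of the quantity that analyze_coverage.ipynb reports, the proportion of the genome covered by segments can be sketched as follows. This is not the notebook's code: coverage_fraction is a hypothetical helper, the chromosome length is a toy value, and half-open, non-overlapping segments are assumed.

```python
import pandas as pd


def coverage_fraction(segments, chrom_lengths):
    """Fraction of the genome covered by each sample's segments.

    Assumes half-open [start, end) coordinates and non-overlapping segments.
    """
    lengths = segments["end"] - segments["start"]
    covered = lengths.groupby(segments["sample_id"]).sum()
    return covered / sum(chrom_lengths.values())


# Toy example: a single chromosome of 1,000 bp (hypothetical length).
toy = pd.DataFrame(
    {
        "sample_id": ["S1", "S1", "S2"],
        "chrom": ["chr1", "chr1", "chr1"],
        "start": [0, 500, 100],
        "end": [250, 1000, 600],
    }
)
coverage = coverage_fraction(toy, {"chr1": 1000})
print(coverage)
```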

scripts/

  • data_process.sh: Fills and imputes the raw data. Also calculates the data stats, in particular coverage and aneuploidy.
  • data_aggregate.sh: Creates various segmentations, and aggregates the preprocessed data based on these segmentations. Depends on data_cluster.py for breakpoint clustering.
  • data_time.sh: Run time tests for the data processing across 1-32 threads (log scale).
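
To make the idea of aggregating segments onto a segmentation concrete, the sketch below computes a length-weighted mean copy number per fixed-size bin on one chromosome. It is a generic technique shown on toy data and not necessarily what data_aggregate.sh does; aggregate_to_bins, the total_cn column, and the half-open, non-overlapping coordinates are assumptions.

```python
import pandas as pd


def aggregate_to_bins(segments, bin_size, chrom_length, column="total_cn"):
    """Length-weighted mean of `column` per fixed-size bin on one chromosome.

    Assumes half-open [start, end) coordinates and non-overlapping segments.
    Bins without any covered bases get None.
    """
    rows = []
    for bin_start in range(0, chrom_length, bin_size):
        bin_end = min(bin_start + bin_size, chrom_length)
        # Overlap of every segment with the current bin, floored at zero.
        overlap = (
            segments["end"].clip(upper=bin_end)
            - segments["start"].clip(lower=bin_start)
        ).clip(lower=0)
        covered = overlap.sum()
        mean = (overlap * segments[column]).sum() / covered if covered else None
        rows.append({"start": bin_start, "end": bin_end, column: mean})
    return pd.DataFrame(rows)


# Toy chromosome of 1,000 bp with two segments, aggregated into 500 bp bins.
toy = pd.DataFrame({"start": [0, 600], "end": [600, 1000], "total_cn": [2, 4]})
bins = aggregate_to_bins(toy, bin_size=500, chrom_length=1000)
print(bins)
```

The second bin mixes 100 bp at copy number 2 with 400 bp at copy number 4, so its weighted mean lies closer to 4.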

tests/

  • in and out: Contains the input and output data for the tests. Output is generated using example_CLI.sh.
  • test_cli.sh: Executes the tests and outputs to ./tests/temp.
  • test_*: unittest based tests of the public API.

Reference

Data

The contents of the data folder were obtained by processing the following sources, accessed in December 2023.

TCGA data obtained from ASCATv3 at: https://github.com/VanLoo-lab/ascat/tree/master/ReleasedData
Cite: https://www.pnas.org/doi/full/10.1073/pnas.1009843107
The results published here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

PCAWG data obtained from: https://dcc.icgc.org/releases/PCAWG/consensus_cnv
Cite: https://www.nature.com/articles/s41587-019-0055-9

TRACERx data obtained from: https://zenodo.org/records/7649257
Cite: https://www.nature.com/articles/s41586-023-05729-x

COSMIC cancer set obtained from: https://cancer.sanger.ac.uk/census
Cite: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6450507

Human genome gene set obtained using PyENSEMBL (2023).
Cite: https://academic.oup.com/nar/article/51/D1/D933/6786199

Cytoband, Gap data obtained from: https://genome.ucsc.edu
Cite: https://www.nature.com/articles/35057062

Please cite

TBD, pre-print expected 2024.

The MIT License

Copyright © 2023 Dr. Adam Streck, adam.streck@gmail.com

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
