Tools for imputation, segmentation, analysis, and plotting of Copy Number Segments (CNS).

~ READ THE DOCS HERE ~

CNSistent is a Python tool for processing and analyzing copy number data from a variety of sources. It is designed to be easy to use and to provide a comprehensive set of analyses and visualizations.

This repository also contains data from the PCAWG, TCGA, and TRACERx datasets, as well as gene sets from COSMIC and Ensembl, so git clone will download around 1GB of data. If you only want to use the library, download via PyPI [TODO].

Repository Data Quickstart

This repository contains raw data from PCAWG, TCGA, and TRACERx, as well as genomic locations. The data must be processed before it can be used.

Requirements

Git LFS
Python 3.8+
Pip 21.3+
(Optional) Conda for environment creation
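
The requirements above can be checked before cloning. The following is a minimal sketch, not part of CNSistent: the helper names version_tuple and check_requirements are made up for illustration, and it assumes that the Git LFS executable is on the PATH under the name git-lfs.

```python
import re
import shutil
import sys

try:
    from importlib import metadata  # standard library from Python 3.8
except ImportError:
    metadata = None


def version_tuple(text):
    """Turn '21.3.1' into (21, 3, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements():
    """Hypothetical helper: map each listed requirement to True or False."""
    pip_version = ()
    if metadata is not None:
        try:
            pip_version = version_tuple(metadata.version("pip"))
        except metadata.PackageNotFoundError:
            pass  # pip is not installed in this environment
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "pip>=21.3": pip_version >= (21, 3),
        # Assumption: Git LFS is installed as the "git-lfs" executable.
        "git-lfs": shutil.which("git-lfs") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Conda is not checked because it is optional.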

Processing

Clone the repository: git clone https://bitbucket.org/schwarzlab/cnsistent
Install dependencies (pip install -r requirements.txt) or create a Conda environment (conda env create -f cnsistent.yml).
Install the package from its location: pip install -e . (the -e flag makes sure that the data files can be accessed under the cns package).
[Optional] Process data: bash ./scripts/data_process.sh - will create imputed data and sample statistics.
[Optional] Aggregate data: bash ./scripts/data_aggregate.sh - will aggregate the imputed data using 15 different segmentation strategies.

Usage

To load the data use:

    from cns.data_utils import main_load
    samples_df, cns_df = main_load("imp")

This will load the imputed and filtered data for all datasets.

The samples_df and cns_df are Pandas dataframes. The former contains information about each sample as well as its statistics (e.g. ane_both_ all for homozygous aneuploidy across all chromosomes). The latter contains the copy number segments for each sample in the form of sample_id, chrom, start, end, major_cn, minor_cn, name, where name identifies each segment. For example, to load CNs for the COSMIC genes, you can use the same function:

    samples_df, cns_df = main_load("COSMIC")
    cns_df.head()

would produce

      sample_id chrom start    end     major_cn  minor_cn name
    0 SP101724  chr1  2160133  2241558 2         2        SKI
    1 SP101724  chr1  2487077  2496821 2         2        TNFRSF14
    2 SP101724  chr1  2985731  3355185 2         2        PRDM16
    3 SP101724  chr1  6241328  6269449 2         2        RPL22
    4 SP101724  chr1  6845383  7829766 2         2        CAMTA1
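
Because cns_df is a regular Pandas dataframe, the usual dataframe operations apply. The sketch below rebuilds the five rows shown above by hand so that it runs without the repository data. The derived columns total_cn and length are introduced here for illustration only, and computing length as end - start assumes half-open coordinates, which the source does not state.

```python
import pandas as pd

# The five rows printed above, rebuilt by hand so the sketch runs without the data.
cns_df = pd.DataFrame(
    {
        "sample_id": ["SP101724"] * 5,
        "chrom": ["chr1"] * 5,
        "start": [2160133, 2487077, 2985731, 6241328, 6845383],
        "end": [2241558, 2496821, 3355185, 6269449, 7829766],
        "major_cn": [2] * 5,
        "minor_cn": [2] * 5,
        "name": ["SKI", "TNFRSF14", "PRDM16", "RPL22", "CAMTA1"],
    }
)

# Derived columns (illustrative names, not part of the package output).
cns_df["total_cn"] = cns_df["major_cn"] + cns_df["minor_cn"]
cns_df["length"] = cns_df["end"] - cns_df["start"]  # assumes half-open intervals

# One row per sample, one column per gene, holding the total copy number.
gene_matrix = cns_df.pivot(index="sample_id", columns="name", values="total_cn")
print(gene_matrix)
```

A sample-by-gene matrix of this kind is a convenient input for per-gene comparisons across samples.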

Alternatively, you can call:

main_load() to load only the samples,
main_load("raw") to load the raw data,
main_load("imp") to load the imputed data,
main_load(agg_type) to load the aggregated bins (if the aggregation has been done), where agg_type is one of: ["1MB", "2MB", "3MB", "5MB", "10MB", "250KB", "500KB", "whole", "arms", "bands", "COSMIC", "ENSEMBL"].
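
A mistyped option name can be caught before any data is read. The following is a small sketch under the assumption that main_load accepts exactly the strings listed above; check_data_type and load_checked are hypothetical helpers, not part of the cns package.

```python
# Option names copied from the list above.
AGG_TYPES = ["1MB", "2MB", "3MB", "5MB", "10MB", "250KB", "500KB",
             "whole", "arms", "bands", "COSMIC", "ENSEMBL"]
BASE_TYPES = ["raw", "imp"]


def check_data_type(data_type):
    """Return data_type unchanged if it is a documented option, else raise ValueError."""
    valid = BASE_TYPES + AGG_TYPES
    if data_type not in valid:
        raise ValueError(f"Unknown data type {data_type!r}; expected one of {valid}")
    return data_type


def load_checked(data_type):
    """Hypothetical wrapper: validate the name first, then defer to main_load."""
    checked = check_data_type(data_type)
    # Imported lazily so the check above works without the package installed.
    from cns.data_utils import main_load
    return main_load(checked)
```

For example, load_checked("arms") would forward to main_load("arms"), while load_checked("4MB") raises a ValueError without touching the data files.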

Notes

By default, 16 threads are used. If that causes problems (crashes), reduce the number of threads in the data_process.sh and data_aggregate.sh scripts.
The example_API.py is split into cells that can be run individually in an IDE.
You can also install the package with pip install ., but the utility functions for loading data in cns.data_utils.py will then not be accessible.
Conda is optional; you can also install the required packages manually using pip, based on the list in cnsistent.yml.
Additionally, 5 of the PCAWG medulloblastoma samples were labeled as female in the source but contained CN calls for chromosome Y, so we have re-labelled them as male.
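
The re-labelling described in the last note can be illustrated with Pandas. This is only a sketch of the idea on toy data, not the code used to prepare the repository: the sex column, its values, and the helper name relabel_by_chr_y are assumptions, as is sample_id being a regular column of samples_df.

```python
import pandas as pd

# Toy stand-ins; the "sex" column name and its values are assumptions.
samples_df = pd.DataFrame(
    {"sample_id": ["S1", "S2", "S3"], "sex": ["female", "female", "male"]}
)
cns_df = pd.DataFrame(
    {
        "sample_id": ["S1", "S1", "S2", "S3"],
        "chrom": ["chr1", "chrY", "chr1", "chrY"],
    }
)


def relabel_by_chr_y(samples, segments):
    """Hypothetical helper: mark samples with chrY segments as male."""
    # Samples that have at least one segment on chromosome Y.
    has_y = set(segments.loc[segments["chrom"] == "chrY", "sample_id"])
    fixed = samples.copy()
    mismatch = (fixed["sex"] == "female") & fixed["sample_id"].isin(has_y)
    changed = list(fixed.loc[mismatch, "sample_id"])
    fixed.loc[mismatch, "sex"] = "male"
    return fixed, changed


fixed_df, changed_ids = relabel_by_chr_y(samples_df, cns_df)
print(changed_ids)
```

Returning the list of changed sample IDs alongside the corrected dataframe keeps the correction auditable.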

Repository Structure

.

cnsistent.yml: Conda environment file for the CNSistent package, references requirements.txt.
requirements.txt: Packages required to run the CNSistent package.
example_API.py: Example code for using the CNSistent package.
example_CLI.sh: Example code for using the CNSistent package from the command line.
pyproject.toml: Configuration for packaging tools.

cns/

Contains the main code for the CNSistent package.

data/

Contains the raw data from PCAWG, TCGA, and TRACERx, as well as genomic locations, and a notebook used to obtain them or merge source files.

docs/

Contains the documentation for the CNSistent package. The documentation is built using Sphinx, with the source in the ./docs/source folder. The documentation can be built using the make html command in the ./docs folder, provided the requirements in ./docs/requirements.txt are met.

notebooks/

Contains notebooks used for data processing and analysis:

analyze_break_clusters.ipynb: A notebook used to analyze the breakpoint clustering, based on the distance between merged breakpoints.
analyze_CN_clipping.ipynb: Evaluates the result of clipping the CN segment values, in particular the effects on the distribution and the proportion that is clipped off.
analyze_coverage.ipynb: Calculates the proportion of the genome that is covered by segments and locations where it applies.
analyze_features.ipynb: Calculates and plots features across datasets.
analyze_lung.ipynb: Plots the lung cancer data across datasets and cancer types, in particular for chromosome 3 and genes that have been established as important by the IG method.
analyze_SOX2_overlay.ipynb: Plots the SOX2 gene overlay on the lung cancer data.
analyze_types.ipynb: Plots the distribution of cancer types and overall CN across datasets.
data_obtain.ipynb: A notebook that has been used to obtain the raw data and potentially merge files where needed.
docs_illustrations.ipynb: A notebook used to create illustrations for the documentation.
docs_knee_detection.ipynb: A demo of the kneepoint detection algorithm.
docs_runtime.ipynb: Calculates the runtime of the data processing across 1-32 threads (log scale).

scripts/

data_process.sh: Fills and imputes the raw data. Also calculates the data stats, in particular coverage and aneuploidy.
data_aggregate.sh: Creates various segmentations, and aggregates the preprocessed data based on these segmentations. Depends on data_cluster.py for breakpoint clustering.
data_time.sh: Run time tests for the data processing across 1-32 threads (log scale).
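
To give an intuition for what aggregating segments onto a segmentation means, the sketch below computes a length-weighted mean copy number per fixed-size bin. It is a simplified illustration and not the algorithm used by data_aggregate.sh: the function name aggregate_to_bins, the half-open coordinates, and the weighting scheme are all assumptions.

```python
import pandas as pd


def aggregate_to_bins(segments, bin_size, value_col="major_cn"):
    """Length-weighted mean of value_col per fixed-size bin (half-open coordinates)."""
    rows = []
    for seg in segments.itertuples(index=False):
        first_bin = seg.start // bin_size
        last_bin = (seg.end - 1) // bin_size
        for b in range(first_bin, last_bin + 1):
            bin_start, bin_end = b * bin_size, (b + 1) * bin_size
            # Number of bases of this segment that fall into bin b.
            overlap = min(seg.end, bin_end) - max(seg.start, bin_start)
            rows.append((seg.sample_id, seg.chrom, bin_start, bin_end,
                         overlap, overlap * getattr(seg, value_col)))
    pieces = pd.DataFrame(
        rows, columns=["sample_id", "chrom", "start", "end", "covered", "weighted"]
    )
    keys = ["sample_id", "chrom", "start", "end"]
    bins = pieces.groupby(keys, as_index=False)[["covered", "weighted"]].sum()
    bins[value_col] = bins["weighted"] / bins["covered"]
    return bins.drop(columns="weighted")


# Toy input: two adjacent segments on one chromosome.
toy = pd.DataFrame(
    {
        "sample_id": ["S1", "S1"],
        "chrom": ["chr1", "chr1"],
        "start": [0, 150],
        "end": [150, 200],
        "major_cn": [2, 4],
    }
)
binned = aggregate_to_bins(toy, bin_size=100)
print(binned)
```

In the toy input, the first bin is covered entirely by the CN 2 segment, while the second bin is split evenly between CN 2 and CN 4 and therefore averages to 3. Bins are not truncated at chromosome ends in this sketch.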

tests/

in and out: Contain the input and output data for the tests. Output is generated using example_CLI.sh.
test_cli.sh: Executes the tests and outputs to ./tests/temp.
test_*: unittest based tests of the public API.

Reference

Data

The contents of the data folder were obtained by processing the following sources, accessed in December 2023.

TCGA data obtained from ASCATv3 at: https://github.com/VanLoo-lab/ascat/tree/master/ReleasedData
Cite: https://www.pnas.org/doi/full/10.1073/pnas.1009843107
The results published here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

PCAWG data obtained from: https://dcc.icgc.org/releases/PCAWG/consensus_cnv
Cite: https://www.nature.com/articles/s41587-019-0055-9

TRACERx data obtained from: https://zenodo.org/records/7649257
Cite: https://www.nature.com/articles/s41586-023-05729-x

COSMIC cancer set obtained from: https://cancer.sanger.ac.uk/census
Cite: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6450507

Human genome gene set obtained using PyENSEMBL (2023).
Cite: https://academic.oup.com/nar/article/51/D1/D933/6786199

Cytoband, Gap data obtained from: https://genome.ucsc.edu
Cite: https://www.nature.com/articles/35057062

Please cite

TBD, pre-print expected 2024.

The MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Release history

1.0.0: May 13, 2026
0.9.0: Sep 16, 2025
0.8.0: Jun 9, 2025
0.7.3: Jan 28, 2025
0.7.1: Dec 16, 2024
0.7.0: Dec 16, 2024
0.6.4: Dec 12, 2024
0.6.3 (this version): Dec 12, 2024

Download files

Source Distribution: cnsistent-0.6.3.tar.gz (29.1 MB), uploaded Dec 12, 2024
Built Distribution: cnsistent-0.6.3-py3-none-any.whl (74.5 kB, Python 3), uploaded Dec 12, 2024

File details

Details for the file cnsistent-0.6.3.tar.gz.

File metadata

Download URL: cnsistent-0.6.3.tar.gz
Upload date: Dec 12, 2024
Size: 29.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for cnsistent-0.6.3.tar.gz
Algorithm	Hash digest
SHA256	`362c716fce80fdc02929d66e7550fb32d8c38c4802584dfc6b02e15d50e414a5`
MD5	`b213135542ab5eec328133f0384525c8`
BLAKE2b-256	`6982a597eabcbbc88231cc8b2aca3bddb467272bb9fab68a666179100a295d48`

File details

Details for the file cnsistent-0.6.3-py3-none-any.whl.

File metadata

Download URL: cnsistent-0.6.3-py3-none-any.whl
Upload date: Dec 12, 2024
Size: 74.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for cnsistent-0.6.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b8ce8998b9ed8ed708a017a7c39aa786db1559c4059eb8f9ab8323682d52fbcf`
MD5	`09b038e07c1b7dda9065c52186a4ba0a`
BLAKE2b-256	`cf9ed5ef906023b39db23b4573341bac5b0f8acec0d719f61082f62b753e9458`
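
A downloaded file can be checked against the SHA256 digests listed above with the Python standard library. This is a minimal sketch: the helper names sha256_of and verify are illustrative, and the file is assumed to have been saved locally under a path of your choosing.

```python
import hashlib

# SHA256 digests copied from the tables above.
EXPECTED_SHA256 = {
    "cnsistent-0.6.3.tar.gz":
        "362c716fce80fdc02929d66e7550fb32d8c38c4802584dfc6b02e15d50e414a5",
    "cnsistent-0.6.3-py3-none-any.whl":
        "b8ce8998b9ed8ed708a017a7c39aa786db1559c4059eb8f9ab8323682d52fbcf",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, filename):
    """Return True if the file at path matches the listed digest for filename."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, verify("downloads/cnsistent-0.6.3.tar.gz", "cnsistent-0.6.3.tar.gz") would return True only for an unmodified copy of the source distribution; the downloads/ directory here is a placeholder.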
