Tools to facilitate the parsing of SSM data from the International Cancer Genome Consortium data releases, in particular, the simple somatic mutation aggregates.

Project description

Scripts to automate parsing of data from the International Cancer Genome Consortium data releases, in particular, the simple somatic mutation aggregates.

Download and installation

The core module is on PyPI. You can install it using:

pip install ICGC-data-parser

The example notebooks and the helper VCF manipulation scripts, however, live in the GitHub repository. To download everything, go to the repository webpage and click the download button, or type the following in a Unix terminal:

git clone https://github.com/Ad115/ICGC-data-parser.git

Data download

The main subject of our inquiries is the ICGC’s aggregated simple somatic mutation data, which can be downloaded using:

wget 'https://dcc.icgc.org/api/v1/download?fn=/current/Summary/simple_somatic_mutation.aggregated.vcf.gz'

To learn more about this file, please read About the ICGC’s simple somatic mutations file.
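
As a rough illustration of what the downloaded file looks like to a program, here is a minimal sketch that reads a VCF stream using only the Python standard library. It does not use this package's own reader; the names `read_vcf` and `open_vcf` are hypothetical helpers, and the sample record is invented for illustration rather than taken from ICGC data.

```python
import gzip
import io

# The eight fixed VCF columns, used if the file has no "#CHROM" header line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def read_vcf(stream):
    """Yield one dict per data line of a VCF text stream."""
    columns = VCF_COLUMNS
    for line in stream:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#"):
            columns = line[1:].split("\t")  # the "#CHROM ..." header line
            continue
        yield dict(zip(columns, line.split("\t")))


def open_vcf(path):
    """Open a plain or gzip-compressed VCF file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


# Invented sample, for illustration only (not real ICGC data).
SAMPLE = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\tMU0000001\tG\tA\t.\t.\taffected_donors=2\n"
)

records = list(read_vcf(io.StringIO(SAMPLE)))
print(records[0]["CHROM"], records[0]["POS"], records[0]["ID"])
```

For the real file, something like `with open_vcf(path) as f: ...` should work, where `path` is wherever the download was saved. Since the generator yields records one at a time, the whole file never has to fit in memory.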

Usage

The whole package contains example scripts that do the following:

  • Mutation recurrence count: Analyzes the data and plots the mutation recurrence distribution. This distribution answers questions such as: How many mutations appear in more than one patient? How many mutations are repeated among patients? This is further documented in The mutation recurrence workflow.

  • Mutation density plot: Plots the mutation density per chromosome along the whole chromosomal length, which makes it possible to visually assess the randomness of the mutation positions.

  • Distribution of the mutations in the genes: Automates the extraction of the distribution of mutations in the genes. It answers the question: how many genes contain x mutations in a given gene or project? TODO: This is further documented in The mutations distribution workflow.
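
The first two analyses above can be sketched in a few lines of plain Python. This is not the code of the example scripts, only an illustration of the idea. It assumes that each record's INFO field carries an `affected_donors=N` entry giving the number of donors with that mutation; the function names and the sample values are hypothetical.

```python
from collections import Counter


def parse_info(info):
    """Turn a VCF INFO string such as 'a=1;b=2;flag' into a dict."""
    result = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True  # bare keys are flags
    return result


def recurrence_distribution(info_fields, key="affected_donors"):
    """Map 'number of donors' to 'number of mutations seen in that many donors'.

    Assumes every INFO string contains `key` with an integer value.
    """
    distribution = Counter()
    for info in info_fields:
        distribution[int(parse_info(info)[key])] += 1
    return distribution


def mutation_density(positions, bin_size=1_000_000):
    """Count mutations per fixed-size bin, keyed by (chromosome, bin index).

    `positions` is an iterable of (chromosome, 1-based position) pairs.
    """
    density = Counter()
    for chrom, pos in positions:
        density[(chrom, (int(pos) - 1) // bin_size)] += 1
    return density


# Invented values, for illustration only.
infos = ["affected_donors=1;project_count=1",
         "affected_donors=1",
         "affected_donors=3;project_count=2"]
dist = recurrence_distribution(infos)
recurrent = sum(n for donors, n in dist.items() if donors > 1)

density = mutation_density([("1", 10), ("1", 999_999), ("1", 1_000_001), ("X", 5)])
```

Here `dist` says how many mutations were found in exactly one donor, two donors, and so on, and `recurrent` is the number of mutations shared by more than one donor. The `density` counts could then be handed to any plotting library, one curve per chromosome.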

It also contains helper scripts to manipulate VCF files, the format of the ICGC’s simple somatic mutations file. These scripts are the following:

  • vcf_map_assembly.py: With the help of the Ensembl REST API, maps the coordinates to the GRCh38 assembly. Assumes the data in the original VCF contains positions in the GRCh37 assembly, as is the case in the data releases until April 2018.

  • vcf_sample.py: Takes a random sample of the input VCF file. The output is also a valid VCF file.

  • vcf_split.py: Splits the input VCF file into several valid VCFs.
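
The point of the last two scripts is that every output must still be a valid VCF, which in practice means each output file repeats the header lines. The following is a minimal in-memory sketch of that idea, not the actual implementation of vcf_sample.py or vcf_split.py; the function names and the sample lines are hypothetical.

```python
import random


def split_header(lines):
    """Separate header lines (starting with '#') from record lines."""
    header, records = [], []
    for line in lines:
        (header if line.startswith("#") else records).append(line)
    return header, records


def sample_vcf(lines, k, seed=None):
    """Return the header plus k randomly chosen records, in original order."""
    header, records = split_header(lines)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(records)), min(k, len(records))))
    return header + [records[i] for i in chosen]


def split_vcf(lines, n_parts):
    """Split the records into at most n_parts chunks, each with the full header."""
    header, records = split_header(lines)
    if not records:
        return [header]
    size = -(-len(records) // n_parts)  # ceiling division
    return [header + records[i:i + size] for i in range(0, len(records), size)]


# Invented sample, for illustration only.
header = ["##fileformat=VCFv4.1", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
lines = header + [f"1\t{pos}\t.\tG\tA\t.\t.\t." for pos in range(100, 105)]

sample = sample_vcf(lines, 2, seed=0)
parts = split_vcf(lines, 2)
```

This version keeps all lines in memory, which is fine for a small file. For the full ICGC file a streaming approach would probably be preferable.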

Download files

Source Distribution

ICGC-data-parser-0.1.1.tar.gz (259.2 kB)

Hashes for ICGC-data-parser-0.1.1.tar.gz

Algorithm    Hash digest
SHA256       ae6dbfdc6e38094c6883716479b9a3caa1d81bb0211d06458a99ab2511bf8b3b
MD5          e85d57f855911dc7ba260b95283ef7f2
BLAKE2b-256  31961de7dacf24f68076f5fba1966b0d61d7f0245efd582dbbff52eda8943ca4
