Tools to facilitate the parsing of SSM data from the International Cancer Genome Consortium data releases, in particular, the simple somatic mutation aggregates.

Project description

Scripts to automate parsing of data from the International Cancer Genome Consortium data releases, in particular, the simple somatic mutation aggregates.

Download and installation

The core module is on PyPI. You can install it using:

pip install ICGC_data_parser

The example notebooks and the helper VCF manipulation scripts, however, are in the GitHub repository. To download everything, open the repository webpage and click the download button, or type the following in a Unix terminal:

git clone

Data download

The main subject of our inquiries is the ICGC’s aggregated file of simple somatic mutation data, which can be downloaded using:


To learn more about this file, please read About the ICGC’s simple somatic mutations file.


The whole package contains example scripts that do the following:

  • Mutation recurrence count: Analyzes the data and plots the mutation recurrence distribution. This distribution answers questions such as: How many mutations appear in more than one patient? How many mutations are repeated among patients? This is further documented in The mutation recurrence workflow.
  • Mutation density plot: Plots the mutation density per chromosome along the whole chromosomal length, which makes it possible to assess visually how randomly the mutation positions are distributed.
  • Distribution of the mutations in the genes: Automates the extraction of the distribution of mutations in the genes. It answers the question: how many genes contain x mutations in a given project? TODO: This is further documented in The mutations distribution workflow.
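The first two analyses above can be sketched with the standard library alone. This is a minimal illustration, not the package's own implementation: the function names are hypothetical, and it assumes that each record's INFO column carries an `affected_donors=N` entry giving the number of donors in which the mutation was observed.

```python
from collections import Counter


def parse_info(info_field):
    """Turn a VCF INFO string such as 'a=1;b=2;FLAG' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def recurrence_distribution(vcf_lines, key="affected_donors"):
    """Map 'number of donors' to 'number of mutations seen in that many donors'.

    Assumption: the INFO column (8th field) contains `key` as an integer.
    """
    distribution = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])
        if key in info:
            distribution[int(info[key])] += 1
    return distribution


def density_bins(vcf_lines, bin_size=1_000_000):
    """Count mutations per (chromosome, bin index) for a density plot."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t")[:2]
        counts[(chrom, int(pos) // bin_size)] += 1
    return counts


# Tiny made-up example; the records are illustrative, not real ICGC data.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\tMU1\tG\tA\t.\t.\taffected_donors=1;project_count=1\n",
    "1\t2500000\tMU2\tC\tT\t.\t.\taffected_donors=3;project_count=2\n",
    "2\t300\tMU3\tT\tC\t.\t.\taffected_donors=1;project_count=1\n",
]

recurrence = recurrence_distribution(example)
density = density_bins(example)
```

Both functions accept any iterable of lines, so an open file handle can be passed directly and the file is read in a single streaming pass.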

It also contains helper scripts to manipulate VCF files, the format of the ICGC’s simple somatic mutations file. These scripts are the following:

  • Map the coordinates to the GRCh38 assembly with the help of the Ensembl REST API. This assumes that the original VCF contains positions in the GRCh37 assembly, as is the case in the data releases until April 2018.
  • Take a random sample of the input VCF file. The output is also a valid VCF file.
  • Split the input VCF file into several valid VCFs.
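The sampling and splitting steps can be illustrated as follows. This is a hedged sketch rather than the code of the repository's scripts: the function names are hypothetical, and it assumes that all header lines (those starting with `#`) come before the records. Sampling uses reservoir sampling so that the input only needs to be read once.

```python
import random


def sample_vcf(vcf_lines, k, seed=None):
    """Return the header plus k randomly chosen records, in original order."""
    rng = random.Random(seed)
    header, reservoir = [], []
    seen = 0
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)
            continue
        seen += 1
        if len(reservoir) < k:
            reservoir.append((seen, line))
        else:
            # Replace a stored record with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = (seen, line)
    reservoir.sort()  # restore the original record order
    return header + [line for _, line in reservoir]


def split_vcf(vcf_lines, records_per_chunk):
    """Split into chunks, each a valid VCF with the full header repeated."""
    header, chunks, current = [], [], []
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)
            continue
        current.append(line)
        if len(current) == records_per_chunk:
            chunks.append(header + current)
            current = []
    if current:  # leftover records form a final, shorter chunk
        chunks.append(header + current)
    return chunks


# Tiny made-up input: two header lines followed by five records.
vcf_header = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
vcf_records = [f"1\t{100 * i}\tMU{i}\tG\tA\t.\t.\t.\n" for i in range(1, 6)]
vcf_example = vcf_header + vcf_records

sampled = sample_vcf(vcf_example, k=2, seed=0)
chunks = split_vcf(vcf_example, records_per_chunk=2)
```

Repeating the header in every output is what keeps each sample or chunk a valid VCF file on its own.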

Download files

Files for ICGC-data-parser, version 0.1.0:

  • ICGC-data-parser-0.1.0.tar.gz (259.8 kB), source distribution
