ICGC-data-parser

Tools to facilitate the parsing of SSM data from the International Cancer Genome Consortium data releases, in particular, the simple somatic mutation aggregates.

Project description

What is the ICGC-data-parser?

A library to ease the parsing of data from the International Cancer Genome Consortium data releases, in particular, the simple somatic mutation aggregates.

Tutorial

Installation

Install via PyPI:

$ pip install ICGC_data_parser

Data download

The base data for the scripts is the ICGC’s aggregated simple somatic mutation file, which can be downloaded using:

$ wget 'https://dcc.icgc.org/api/v1/download?fn=/current/Summary/simple_somatic_mutation.aggregated.vcf.gz'

To learn more about this file, please read "About the ICGC’s simple somatic mutations file".

WARNING: The current release of the data contains a malformed header that causes the library to crash with a ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/vcf/parser.py in _parse_info(self, info_str)
    389                 try:
...
...
...
362     def _parse_info(self, info_str):

ValueError: could not convert string to float: 'PCAWG'

This is caused by a bad type specification in the header of the VCF file. To work around it, run the following line after creating the SSM_Reader object (assuming the reader is stored in the reader variable):

# Fix weird bug due to malformed description headers
reader.infos['studies'] = reader.infos['studies']._replace(type='String')
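
The workaround relies on _replace, the namedtuple method that builds a modified copy of a record. The following is only a sketch of that mechanism: the Info type below is a stand-in, and its field names and the declared 'Integer' type are assumptions for illustration, not values taken from the library or from the ICGC file.

```python
from collections import namedtuple

# Stand-in for a header entry; the field names are assumed for illustration.
Info = namedtuple('Info', ['id', 'num', 'type', 'desc'])

# A header entry that declares a numeric type (the exact type is an assumption).
studies = Info(id='studies', num=None, type='Integer', desc='hypothetical description')

# Converting a textual value to a number fails, which mirrors
# the error message shown in the traceback above.
try:
    float('PCAWG')
except ValueError as error:
    message = str(error)

# _replace returns a new tuple; the original entry is left untouched.
fixed = studies._replace(type='String')
print(message)
print(studies.type, '->', fixed.type)
```

Because namedtuples are immutable, the result of _replace has to be assigned back, which is why the workaround reassigns reader.infos['studies'].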

In the future this will be solved in a more elegant way, but for now this is what we’ve got.
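
To check which type the header actually declares for a field, one option is to read the header lines directly with the standard library. This is a minimal sketch, independent of the library; the helper name info_header_lines is hypothetical.

```python
import gzip


def info_header_lines(path):
    """Return the ##INFO header lines of a gzip-compressed VCF file."""
    lines = []
    with gzip.open(path, 'rt') as vcf:
        for line in vcf:
            if not line.startswith('##'):
                break  # the meta-information header is over
            if line.startswith('##INFO='):
                lines.append(line.rstrip('\n'))
    return lines
```

Filtering the returned lines for 'ID=studies' would then show the declared type of the field involved in the error above.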

Usage

The main class in the project is the SSM_Reader. It makes it easy to read the ICGC mutations file:

>>> from ICGC_data_parser import SSM_Reader

# Also reads compressed files!
>>> reader = SSM_Reader(open('data/simple_somatic_mutations.aggregated.vcf.gz'))

# or...
>>> reader = SSM_Reader(filename='data/simple_somatic_mutations.aggregated.vcf.gz')
#                       ^^^^^^^^
# The filename keyword argument is important; without it we get an IndexError

The SSM_Reader.parse method iterates through the records of the file and gives access to the parts of each record. You can also pass regular expressions to keep only the lines you want:

# Print only the mutations that are in the
# European Union Breast Cancer project (BRCA-EU).

>>> for record in reader.parse(filters=['BRCA-EU']):
...    print(record.ID, record.CHROM, record.POS)

MU66865518 1 100141201
MU65487875 1 100160548
MU66281118 1 100638179
MU66254120 1 101352655
...
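
The idea of filtering by regular expression can be sketched without the library. The function below is an illustration only: it assumes the filters are matched against raw text lines and that a line must match every pattern, which may differ from what SSM_Reader.parse does internally. The input lines are toy records, not real ICGC data.

```python
import re


def filter_lines(lines, filters=()):
    """Yield the lines that match every regular expression in `filters`."""
    patterns = [re.compile(f) for f in filters]
    for line in lines:
        if all(p.search(line) for p in patterns):
            yield line


# Toy tab-separated lines (hypothetical, for illustration only).
lines = [
    '1\t100\tMU1\tT\tA\t.\t.\tOCCURRENCE=BRCA-EU|1|569|0.00176',
    '1\t200\tMU2\tG\tC\t.\t.\tOCCURRENCE=ESAD-UK|1|301|0.00332',
]
matching = list(filter_lines(lines, filters=['BRCA-EU']))
print(len(matching))
```

Filtering on the raw text before parsing avoids building record objects for lines that will be discarded.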

The INFO field is special in the sense that it contains several subfields, and those subfields may be list-like entries with more subfields themselves (in particular the CONSEQUENCE and OCCURRENCE subfields):

# The subfields of the INFO field:
>>> next(reader).INFO

{'CONSEQUENCE': [
    '||||||intergenic_region||',
    'CD1A|ENSG00000158477|+|CD1A-001|ENST00000289429||upstream_gene_variant||'
    ],
 'OCCURRENCE': [
     'ESAD-UK|1|301|0.00332',
     'EOPC-DE|1|202|0.00495',
     'BRCA-EU|1|569|0.00176'
    ],
 'affected_donors': 3,
 'mutation': 'T>A',
 'project_count': 3,
 'studies': None,
 'tested_donors': 12068}

# The description of the CONSEQUENCE subfield
>>> print(reader.infos['CONSEQUENCE'].desc)

Mutation consequence predictions annotated by SnpEff
(subfields: gene_symbol|gene_affected|gene_strand|transcript_name|transcript_affected|protein_affected|consequence_type|cds_mutation|aa_mutation)

# The description of the OCCURRENCE subfield
>>> print(reader.infos['OCCURRENCE'].desc)

Mutation occurrence counts broken down by project
(subfields: project_code|affected_donors|tested_donors|frequency)
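
As an illustration of how such entries are structured, the OCCURRENCE strings shown above can be split on the | character by hand. This is a sketch using only the standard library; the Occurrence type and parse_occurrence helper are hypothetical names, not part of the library.

```python
from collections import namedtuple

# Field names taken from the OCCURRENCE description above.
Occurrence = namedtuple(
    'Occurrence',
    ['project_code', 'affected_donors', 'tested_donors', 'frequency'],
)


def parse_occurrence(entry):
    """Split one OCCURRENCE entry and convert its numeric subfields."""
    project, affected, tested, frequency = entry.split('|')
    return Occurrence(project, int(affected), int(tested), float(frequency))


entries = [
    'ESAD-UK|1|301|0.00332',
    'EOPC-DE|1|202|0.00495',
    'BRCA-EU|1|569|0.00176',
]
occurrences = [parse_occurrence(e) for e in entries]
total_affected = sum(o.affected_donors for o in occurrences)
print(total_affected)
```

In these three entries the frequency is consistent with affected_donors divided by tested_donors, and the per-project counts add up to the record's affected_donors value.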

Sometimes we also want to parse the information in those subfields. For this purpose, the SSM_Reader.subfield_parser factory method is useful: it creates a parser for the specified subfield that allows easy access to the data:

# Create the subfield parser for the CONSEQUENCE subfield
>>> consequences = reader.subfield_parser('CONSEQUENCE')


>>> for record in reader.parse():
...    # Which genes are affected?
...    genes_affected = {c.gene_symbol
...                          for c in consequences(record)
...                          if c.gene_affected}
...
...    print(f'Mutation: {record.ID}')
...    print('\t', ", ".join(genes_affected))

Mutation: MU93246178
     TPM3
Mutation: MU66962994
     RP11-350G8.9, SHE
Mutation: MU93246498
     DCST1, ADAM15, RP11-307C12.11
Mutation: MU66377106
     EFNA3, ADAM15, EFNA4
...
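
One way such a factory could work is to read the subfield names from the header description and build a namedtuple from them. The sketch below is an assumption about the general technique, not the library's implementation: make_subfield_parser is a hypothetical name, and unlike the library's parser it takes a list of entry strings rather than a record.

```python
import re
from collections import namedtuple


def make_subfield_parser(name, description):
    """Build a parser from a description ending in '(subfields: a|b|c)'."""
    match = re.search(r'\(subfields: ([^)]*)\)', description)
    fields = match.group(1).split('|')
    Record = namedtuple(name.capitalize(), fields)

    def parse(entries):
        # Each entry is a '|'-separated string with one value per field.
        return [Record(*entry.split('|')) for entry in entries]

    return parse


description = (
    'Mutation consequence predictions annotated by SnpEff '
    '(subfields: gene_symbol|gene_affected|gene_strand|transcript_name|'
    'transcript_affected|protein_affected|consequence_type|cds_mutation|'
    'aa_mutation)'
)
parse_consequences = make_subfield_parser('CONSEQUENCE', description)
consequences = parse_consequences([
    '||||||intergenic_region||',
    'CD1A|ENSG00000158477|+|CD1A-001|ENST00000289429||upstream_gene_variant||',
])
genes = {c.gene_symbol for c in consequences if c.gene_affected}
print(genes)
```

Deriving the field names from the description keeps the parser in step with whatever subfields a given data release declares.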

The library also contains some helper scripts to manipulate VCF files (like the ICGC mutations file):

vcf_map_assembly.py: Creates a new VCF with the positions mapped to another genome assembly. This is useful because the positions reported by ICGC currently use the human genome assembly GRCh37, while the most recent (and the one the rest of the world uses) is GRCh38.
vcf_sample.py: Creates a new VCF with a fraction of the mutations in the original. The mutations are randomly sampled but keep the order they had in the original file. This is useful for running small test analyses whose results should still be representative of all the mutations.
vcf_split.py: Splits the input VCF into several files, each of them a valid VCF. This is useful for dividing an analysis among processes that receive one file each.
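
The sampling and splitting behaviors described above can be sketched on a list of text lines. These functions are illustrations under stated assumptions, not the scripts' actual code: header lines are assumed to start with '#', sampling keeps each record independently with the given probability, and every split part repeats the header so that it remains a valid VCF.

```python
import random


def split_header(lines):
    """Separate header lines (starting with '#') from record lines."""
    header = [line for line in lines if line.startswith('#')]
    records = [line for line in lines if not line.startswith('#')]
    return header, records


def sample_vcf(lines, fraction, seed=None):
    """Keep each record with probability `fraction`, preserving order."""
    rng = random.Random(seed)
    header, records = split_header(lines)
    return header + [r for r in records if rng.random() < fraction]


def split_vcf(lines, parts):
    """Split records into at most `parts` chunks, each with the header."""
    header, records = split_header(lines)
    size = max(1, -(-len(records) // parts))  # ceiling division
    return [header + records[i:i + size]
            for i in range(0, len(records), size)]
```

Passing a seed to sample_vcf makes a sample reproducible, which helps when comparing test runs.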

The specific documentation of the scripts can be obtained by executing:

$ python3 <script name>.py --help

The library also ships with some Jupyter Notebooks that elaborate on the examples. The notebooks also demonstrate ways to handle common parsing errors caused by malformed input files.

Contributing

Check for open issues or open a fresh issue to start a discussion around a feature idea or a bug.
Fork the repository on GitHub to start making your changes to a feature branch, derived from the master branch.
Write a test which shows that the bug was fixed or that the feature works as expected.
Send a pull request and bug the maintainer until it gets merged and published.

Release history

0.2.2 (this version): Sep 3, 2018
0.2.1: Sep 3, 2018
0.2.0: Sep 3, 2018
0.1.1: Apr 20, 2018
0.1.0: Apr 20, 2018

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

ICGC_data_parser-0.2.2-py3-none-any.whl (10.9 kB)

Uploaded Sep 3, 2018 Python 3

File details

Details for the file ICGC_data_parser-0.2.2-py3-none-any.whl.

File metadata

Download URL: ICGC_data_parser-0.2.2-py3-none-any.whl
Upload date: Sep 3, 2018
Size: 10.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.9.1 pkginfo/1.4.1 requests/2.19.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.19.6 CPython/3.6.1

File hashes

Hashes for ICGC_data_parser-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c32ccc793eff1ef249cd3f2ffc0863aa5b7f5653e21e78e7677cf84d26948ef6`
MD5	`7e3b0b000cf886e851f08d1f66218b9c`
BLAKE2b-256	`94978169f98b0f61e40809dc519893831d37772d313a598f137bee8328dce869`
