Skip to main content

Variant annotation in Python

Project description

Varcode is a library for working with genomic variant data in Python and predicting the impact of those variants on protein sequences.

Installation

You can install varcode using pip:

pip install varcode

Optionally, you can pre-populate metadata caches through PyEnsembl as follows:

# Downloads and installs the Ensembl releases (75 and 76)
pyensembl install --release 75 76

This will eliminate a potential delay of several minutes required to install the relevant data when using the varcode for the first time.

Example

import varcode

# Load TCGA MAF containing variants from their
variants = varcode.load_maf("tcga-ovarian-cancer-variants.maf")

print(variants)
### <VariantCollection from 'tcga-ovarian-cancer-variants.maf' with 6428 elements>
###  -- Variant(contig=1, start=69538, ref=G, alt=A, genome=GRCh37)
###  -- Variant(contig=1, start=881892, ref=T, alt=G, genome=GRCh37)
###  -- Variant(contig=1, start=3389714, ref=G, alt=A, genome=GRCh37)
###  -- Variant(contig=1, start=3624325, ref=G, alt=T, genome=GRCh37)
###  ...

# you can index into a VariantCollection and get back a Variant object
variant = variants[0]

# groupby_gene_name returns a dictionary whose keys are gene names
# and whose values are themselves VariantCollections
gene_groups = variants.groupby_gene_name()

# get variants which affect the TP53 gene
TP53_variants = gene_groups["TP53"]

# predict protein coding effect of every TP53 variant on
# each transcript of the TP53 gene
TP53_effects = TP53_variants.effects()

print(TP53_effects)
### <EffectCollection with 789 elements>
### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R342*)
### -- ThreePrimeUTR(variant=chr17 g.7574003G>A, transcript_name=TP53-005, transcript_id=ENST00000420246)
### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-002, transcript_id=ENST00000445888, effect_description=p.R342*)
### -- FrameShift(variant=chr17 g.7574030_7574030delG, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R333fs)
### ...

premature_stop_effect = TP53_effects[0]

print(str(premature_stop_effect.mutant_protein_sequence))
### 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMF'

print(premature_stop_effect.aa_mutation_start_offset)
### 341

print(premature_stop_effect.transcript)
### Transcript(id=ENST00000269305, name=TP53-001, gene_name=TP53, biotype=protein_coding, location=17:7571720-7590856)

print(premature_stop_effect.gene.name)
### 'TP53'

Effect Types

Effect type

Description

AlternateStartCodon

Replace annotated start codon with alternative start codon (e.g. ATG>CAG).

ComplexSubstitution

Insertion and deletion of multiple amino acids.

Deletion

Coding mutation which causes deletion of amino acid(s).

ExonLoss

Deletion of entire exon, significantly disrupts protein.

ExonicSpliceSite

Mutation at the beginning or end of an exon, may affect splicing.

FivePrimeUTR

Variant affects 5’ untranslated region before start codon.

FrameShiftTruncation

A frameshift which leads immediately to a stop codon (no novel amino acids created).

FrameShift

Out-of-frame insertion or deletion of nucleotides, causes novel protein sequence and often premature stop codon.

IncompleteTranscript

Can’t determine effect since transcript annotation is incomplete (often missing either the start or stop codon).

Insertion

Coding mutation which causes insertion of amino acid(s).

Intergenic

Occurs outside of any annotated gene.

Intragenic

Within the annotated boundaries of a gene but not in a region that’s transcribed into pre-mRNA.

IntronicSpliceSite

Mutation near the beginning or end of an intron but less likely to affect splicing than donor/acceptor mutations.

Intronic

Variant occurs between exons and is unlikely to affect splicing.

NoncodingTranscript

Transcript doesn’t code for a protein.

PrematureStop

Insertion of stop codon, truncates protein.

Silent

Mutation in coding sequence which does not change the amino acid sequence of the translated protein.

SpliceAcceptor

Mutation in the last two nucleotides of an intron, likely to affect splicing.

SpliceDonor

Mutation in the first two nucleotides of an intron, likely to affect splicing.

StartLoss

Mutation causes loss of start codon, likely result is that an alternate start codon will be used down-stream (possibly in a different frame).

StopLoss

Loss of stop codon, causes extension of protein by translation of nucleotides from 3’ UTR.

Substitution

Coding mutation which causes simple substitution of one amino acid for another.

ThreePrimeUTR

Variant affects 3’ untranslated region after stop codon of mRNA.

Coordinate System

Varcode currently uses a “base counted, one start” genomic coordinate system, to match the Ensembl annotation database. We are planning to switch over to “space counted, zero start” (interbase) coordinates, since that system allows for more uniform logic (no special cases for insertions). To learn more about genomic coordinate systems, read this blog post.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

varcode-0.3.15.tar.gz (67.0 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page