Variant annotation in Python

Varcode

Varcode is a library for working with genomic variant data in Python and predicting the impact of those variants on protein sequences.

Installation

You can install varcode using pip:

pip install varcode

You can install required reference genome data through PyEnsembl as follows:

# Downloads and installs the Ensembl releases (75 and 76)
pyensembl install --release 75 76

Example

import varcode

# Load a TCGA MAF file containing ovarian cancer variants
variants = varcode.load_maf("tcga-ovarian-cancer-variants.maf")

print(variants)
### <VariantCollection from 'tcga-ovarian-cancer-variants.maf' with 6428 elements>
###  -- Variant(contig=1, start=69538, ref=G, alt=A, genome=GRCh37)
###  -- Variant(contig=1, start=881892, ref=T, alt=G, genome=GRCh37)
###  -- Variant(contig=1, start=3389714, ref=G, alt=A, genome=GRCh37)
###  -- Variant(contig=1, start=3624325, ref=G, alt=T, genome=GRCh37)
###  ...

# you can index into a VariantCollection and get back a Variant object
variant = variants[0]

# groupby_gene_name returns a dictionary whose keys are gene names
# and whose values are themselves VariantCollections
gene_groups = variants.groupby_gene_name()

# get variants which affect the TP53 gene
TP53_variants = gene_groups["TP53"]

# predict protein coding effect of every TP53 variant on
# each transcript of the TP53 gene
TP53_effects = TP53_variants.effects()

print(TP53_effects)
### <EffectCollection with 789 elements>
### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R342*)
### -- ThreePrimeUTR(variant=chr17 g.7574003G>A, transcript_name=TP53-005, transcript_id=ENST00000420246)
### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-002, transcript_id=ENST00000445888, effect_description=p.R342*)
### -- FrameShift(variant=chr17 g.7574030_7574030delG, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R333fs)
### ...

premature_stop_effect = TP53_effects[0]

print(str(premature_stop_effect.mutant_protein_sequence))
### 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMF'

print(premature_stop_effect.aa_mutation_start_offset)
### 341

print(premature_stop_effect.transcript)
### Transcript(id=ENST00000269305, name=TP53-001, gene_name=TP53, biotype=protein_coding, location=17:7571720-7590856)

print(premature_stop_effect.gene.name)
### 'TP53'
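
The output above pairs an aa_mutation_start_offset of 341 with the description p.R342*, which is consistent with a zero-based offset against one-based protein numbering. A minimal sketch of that relationship follows; hgvs_premature_stop is a hypothetical helper written for illustration, not part of varcode.

```python
def hgvs_premature_stop(ref_aa: str, zero_based_offset: int) -> str:
    """Format a premature-stop description from a zero-based residue offset.

    Hypothetical helper: protein positions in the description are one-based,
    so the offset is shifted by one.
    """
    return f"p.{ref_aa}{zero_based_offset + 1}*"


print(hgvs_premature_stop("R", 341))
```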

If you are looking for a quick start guide, you can check out this IPython notebook that demonstrates simple use cases of Varcode.

Further reading

Feature guides live in docs/ and on the docs site.

See CHANGELOG.md for the release history.

Effect Types

This section lists every concrete MutationEffect subclass that varcode may emit, grouped by biological context. Each row links to the class definition in varcode/effects/effect_classes.py through a browser text-fragment URL, so the links survive line-number drift as the source file evolves. Severity ordering across types is set by effect_priority(). The abstract bases (MutationEffect, TranscriptMutationEffect, CodingMutation, NonsilentCodingMutation, SpliceMechanismEffect, and StructuralVariantEffect) define the shared interface. MutationEffect and NonsilentCodingMutation have dedicated entries in the API reference, and MultiOutcomeEffect is described in its own section below.

Effects that carry multiple possibilities

Several effects don't have a single deterministic protein-level outcome:

  • Splice-signal disruption can resolve as normal splicing, exon skipping, intron retention, or cryptic-site use.
  • An exon-edge variant might be a routine coding effect or a splice disruption.
  • A structural variant might affect multiple transcripts or have multiple plausible breakpoint resolutions.
  • Two or more cis variants on one transcript can compose into a joint mutant protein.
  • When phase is unknown between somatic and germline variants sharing a window, the somatic effect depends on which haplotype it landed on.

Varcode wraps these as MultiOutcomeEffect instances. Every multi-outcome effect exposes the same surface:

  • .candidates: tuple[EffectCandidate, ...]. Each EffectCandidate wraps an inner effect (.effect), a producer tag (.source, e.g. "varcode", "rna_evidence"), and free-form .evidence.
  • .effects: tuple[MutationEffect, ...]. A convenience that unwraps .candidates to inner effects when provenance isn't needed.
  • .most_likely_candidate / .most_likely_effect — producer-ordered top pick.
  • .highest_priority_candidate / .highest_priority_effect — most severe by effect_priority().

The MultiOutcomeEffect containers appear in the sub-tables below where they're emitted: SpliceOutcomeSet, ExonicSpliceSite, the StructuralVariantEffect sub-hierarchy, HaplotypeEffect, and PhaseCandidateSet.
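
The shared surface described above can be sketched with plain dataclasses. This is a toy model, not varcode's implementation: ToyCandidate, ToyMultiOutcome, and the TOY_SEVERITY ranks are hypothetical names and values, and the sketch assumes the producer lists candidates most likely first.

```python
from dataclasses import dataclass
from typing import Any, Tuple

# Hypothetical severity ranks for illustration only; varcode derives its
# ordering from effect_priority().
TOY_SEVERITY = {"NormalSplicing": 0, "ExonSkipping": 1, "IntronRetention": 2}


@dataclass(frozen=True)
class ToyCandidate:
    effect: str              # stand-in for an inner MutationEffect
    source: str = "varcode"  # producer tag
    evidence: Any = None     # free-form evidence


@dataclass(frozen=True)
class ToyMultiOutcome:
    # Assumption: the producer lists candidates most likely first.
    candidates: Tuple[ToyCandidate, ...]

    @property
    def effects(self) -> Tuple[str, ...]:
        # Unwrap candidates to their inner effects, dropping provenance.
        return tuple(c.effect for c in self.candidates)

    @property
    def most_likely_candidate(self) -> ToyCandidate:
        return self.candidates[0]

    @property
    def highest_priority_candidate(self) -> ToyCandidate:
        return max(self.candidates, key=lambda c: TOY_SEVERITY[c.effect])


outcome = ToyMultiOutcome((
    ToyCandidate("ExonSkipping", source="rna_evidence"),
    ToyCandidate("IntronRetention"),
    ToyCandidate("NormalSplicing"),
))
print(outcome.effects)
print(outcome.most_likely_candidate.effect)
print(outcome.highest_priority_candidate.effect)
```

The two accessors can disagree, which is the point of exposing both: the producer's top pick here is exon skipping, while the most severe candidate under the toy ranking is intron retention.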

Coding region — in-frame changes

Effect type | Description
Substitution | Coding mutation which causes simple substitution of one amino acid for another.
Insertion | Coding mutation which causes insertion of amino acid(s).
Deletion | Coding mutation which causes deletion of amino acid(s).
ComplexSubstitution | Insertion and deletion of multiple amino acids.
Silent | Mutation in coding sequence which does not change the amino acid sequence of the translated protein.
AlternateStartCodon | Replace annotated start codon with alternative start codon (e.g. "ATG>CAG"); a Silent subclass since the initiator tRNA still loads Met.
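
One way to see how the in-frame categories above relate is to compare a reference and a mutant protein after trimming their shared prefix and suffix. The sketch below is an illustration under that assumption; classify_protein_change is a hypothetical helper, and varcode's own classification rules may differ in edge cases.

```python
def classify_protein_change(ref: str, mut: str) -> str:
    """Name the in-frame category for a reference/mutant protein pair."""
    if ref == mut:
        return "Silent"
    # Trim the shared prefix.
    prefix = 0
    while prefix < min(len(ref), len(mut)) and ref[prefix] == mut[prefix]:
        prefix += 1
    ref, mut = ref[prefix:], mut[prefix:]
    # Trim the shared suffix of what remains.
    suffix = 0
    while suffix < min(len(ref), len(mut)) and ref[-1 - suffix] == mut[-1 - suffix]:
        suffix += 1
    if suffix:
        ref, mut = ref[:-suffix], mut[:-suffix]
    if not ref:
        return "Insertion"
    if not mut:
        return "Deletion"
    if len(ref) == 1 and len(mut) == 1:
        return "Substitution"
    return "ComplexSubstitution"


for pair in [("MKT", "MKT"), ("MKT", "MRT"), ("MKT", "MKAT"),
             ("MKAT", "MKT"), ("MKAT", "MRGGT")]:
    print(pair, classify_protein_change(*pair))
```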

Coding region — frame-disrupting / truncating

Effect type | Description
FrameShift | Out-of-frame insertion or deletion of nucleotides; causes novel protein sequence and often a premature stop codon.
FrameShiftTruncation | A frameshift which leads immediately to a stop codon (no novel amino acids created).
PrematureStop | Insertion of a stop codon; truncates the protein.
StartLoss | Mutation causes loss of the start codon; the likely result is that an alternate start codon will be used downstream (possibly in a different frame).
StopLoss | Loss of the stop codon; extends the protein by translating nucleotides from the 3' UTR.
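
Whether a coding indel is out-of-frame comes down to the net length change modulo three. A minimal sketch of that arithmetic, assuming ref and alt are the trimmed nucleotide alleles of a variant that falls entirely within coding sequence; is_frameshift is a hypothetical helper, not a varcode function.

```python
def is_frameshift(ref: str, alt: str) -> bool:
    """True when the net nucleotide length change is not a multiple of three."""
    return (len(alt) - len(ref)) % 3 != 0


print(is_frameshift("G", ""))      # single-base deletion
print(is_frameshift("", "ACG"))    # three-base insertion
print(is_frameshift("G", "A"))     # single-nucleotide substitution
```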

Splice-site disruption — where the signal was hit

DNA-level locations: these effects say a variant landed on or near a splice signal, but don't themselves carry a protein consequence — they say nothing about how the spliceosome responds (see the next table for that). All four share the SpliceSite base, so from varcode import SpliceSite; isinstance(effect, SpliceSite) matches any of them. (The four leaf classes are exported from the package root too.)

Effect type | Description
SpliceDonor | Mutation in the first two nucleotides of an intron, likely to affect splicing.
SpliceAcceptor | Mutation in the last two nucleotides of an intron, likely to affect splicing.
IntronicSpliceSite | Mutation near the beginning or end of an intron but less likely to affect splicing than donor/acceptor mutations.
ExonicSpliceSite | Mutation at the beginning or end of an exon, may affect splicing; itself a MultiOutcomeEffect wrapping the alternate exonic coding effect alongside the splice candidates.
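
The intronic rows of the table above can be read as a rule on the variant's offset within the intron. The sketch below encodes that reading for illustration only: classify_intronic_offset is a hypothetical helper, and the near_window value of 6 is an assumption, since the table does not state how close to the intron edge counts as "near".

```python
def classify_intronic_offset(offset: int, intron_length: int,
                             near_window: int = 6) -> str:
    """Classify a zero-based offset from the intron's 5' end.

    near_window is an assumed value, not varcode's actual threshold.
    """
    if not 0 <= offset < intron_length:
        raise ValueError("offset lies outside the intron")
    if offset < 2:
        return "SpliceDonor"          # first two nucleotides of the intron
    if offset >= intron_length - 2:
        return "SpliceAcceptor"       # last two nucleotides of the intron
    if offset < near_window or offset >= intron_length - near_window:
        return "IntronicSpliceSite"   # near an edge, outside the core dinucleotide
    return "Intronic"


for offset in (0, 4, 50, 99):
    print(offset, classify_intronic_offset(offset, intron_length=100))
```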

Splice mechanism — what the spliceosome does in response

These are the splice effects that carry a protein consequence. The protein-level outcome of a splice-signal hit is not deterministic from DNA alone, so (when you opt in with splice_outcomes=True) varcode emits these as candidates inside a SpliceOutcomeSet (a MultiOutcomeEffect). Each mechanism carries the originating disruption on its .splice_signal attribute (a SpliceSite instance), so you can always recover where the hit was off any mechanism. The set also records the disruption's class on .disrupted_signal_class (the SpliceSite subclass, e.g. SpliceDonor — a type, not an instance) for priority lookup.

Effect type | Description
NormalSplicing | Splice signal hit but splicing proceeds normally; protein consequence (if any) is whatever the underlying nucleotide change would produce.
ExonSkipping | Affected exon excluded from the mature transcript; in-frame skip deletes amino acids, out-of-frame skip propagates a frameshift.
IntronRetention | Intron stays in the mature transcript; translation usually hits a premature stop inside the retained intron.
CrypticDonor | Disrupted canonical donor replaced by a nearby cryptic GT donor; exon extended or truncated.
CrypticAcceptor | Disrupted canonical acceptor replaced by a nearby cryptic AG acceptor; exon extended or truncated.
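
The in-frame versus out-of-frame distinction in the ExonSkipping row depends on whether the skipped exon's coding length is a multiple of three. A rough sketch of that check, assuming the exon is fully coding and ignoring codons split across exon boundaries; exon_skip_consequence is a hypothetical helper.

```python
def exon_skip_consequence(exon_length: int) -> str:
    """Describe the likely consequence of skipping a fully coding exon."""
    if exon_length <= 0:
        raise ValueError("exon_length must be positive")
    if exon_length % 3 == 0:
        return f"in-frame deletion of {exon_length // 3} amino acids"
    return "frameshift downstream of the skipped exon"


print(exon_skip_consequence(90))
print(exon_skip_consequence(100))
```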

Non-coding regions and unclassifiable contexts

Effect type | Description
FivePrimeUTR | Variant affects 5' untranslated region before start codon.
ThreePrimeUTR | Variant affects 3' untranslated region after stop codon of mRNA.
Intronic | Variant occurs between exons and is unlikely to affect splicing.
NoncodingTranscript | Transcript doesn't code for a protein.
IncompleteTranscript | Can't determine effect since transcript annotation is incomplete (often missing either the start or stop codon).
Intergenic | Occurs outside of any annotated gene.
Intragenic | Within the annotated boundaries of a gene but not in a region that's transcribed into pre-mRNA.
Failure | Placeholder effect emitted when annotation failed but a non-empty effect list is required (raise_on_error=False).

Exon-level and structural-variant effects

ExonLoss is a plain Exonic effect. The structural-variant effects below (LargeDeletion through TranslocationToIntergenic) are MultiOutcomeEffects — their .candidates may include cryptic-exon outcomes, RNA-evidence-ranked alternatives, and so on. CrypticExonCandidate typically appears as a candidate inside those SV effects rather than standalone.

Effect type | Description
ExonLoss | Deletion of an entire exon, significantly disrupts protein.
LargeDeletion | Structural deletion (<DEL> / <CN0>) removing one or more exons or an entire gene.
LargeDuplication | Tandem duplication (<DUP>) overlapping exons; may yield copy-number increase or a fused reading frame.
Inversion | Inversion (<INV>) flipping a stretch of a transcript; consequence depends on whether breakpoints fall in exons or introns.
GeneFusion | Breakend (<BND>) whose mate lies in another protein-coding gene; the canonical fusion shape.
TranslocationToIntergenic | Breakend whose mate lies in intergenic space; consequence depends on cryptic splice / ORF signals downstream.
CrypticExonCandidate | An SV brings novel sequence into range of a transcript and motif scoring flags a plausible new splice acceptor / donor pair; attached as additional candidates on SV effects.

Multi-variant / phase-dependent effects

Both are MultiOutcomeEffects, emitted alongside per-variant effects (additive, not a replacement) when a phase resolver groups cis variants together or when phase between somatic and germline variants is unknown.

Effect type | Description
HaplotypeEffect | Joint effect of two or more cis variants on the same transcript; the combined mutant cDNA is built and translated as one unit.
PhaseCandidateSet | Possibility set across phase hypotheses when a somatic variant and one or more germline variants share a window on a transcript and phase between them is unknown.
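
To illustrate why cis variants are translated as one unit, the toy example below applies two single-nucleotide changes to the same codon, first separately and then together. It is a self-contained sketch using the standard genetic code; translate, apply_snvs, and the nine-base cDNA are made up for illustration and do not use varcode's API.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order for each codon position.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cdna: str) -> str:
    """Translate from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        amino_acid = CODON_TABLE[cdna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def apply_snvs(cdna: str, snvs) -> str:
    """Apply (zero-based offset, ref, alt) substitutions to a cDNA string."""
    bases = list(cdna)
    for offset, ref, alt in snvs:
        if cdna[offset] != ref:
            raise ValueError(f"reference mismatch at offset {offset}")
        bases[offset] = alt
    return "".join(bases)


reference = "ATGCAGTGA"   # toy cDNA: Met, Gln, stop
snv_a = (3, "C", "T")     # alone: CAG becomes TAG, a stop codon
snv_b = (5, "G", "C")     # alone: CAG becomes CAC, histidine

print(translate(reference))
print(translate(apply_snvs(reference, [snv_a])))
print(translate(apply_snvs(reference, [snv_b])))
print(translate(apply_snvs(reference, [snv_a, snv_b])))
```

In this toy case the first variant alone looks like a premature stop, yet the two variants together yield a tyrosine codon (TAC), so the per-variant and joint predictions differ.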

Coordinate System

Varcode currently uses a "base counted, one start" genomic coordinate system, to match the Ensembl annotation database. We are planning to switch over to "space counted, zero start" (interbase) coordinates, since that system allows for more uniform logic (no special cases for insertions). To learn more about genomic coordinate systems, read this blog post.
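
The difference between the two systems can be sketched as a pair of conversions. This is a generic illustration, not varcode code: to_interbase and to_one_based are hypothetical helpers, and the sketch shows why an insertion point, which is a zero-length interbase interval, has no direct one-based inclusive equivalent.

```python
def to_interbase(start: int, end: int) -> tuple:
    """Convert a one-based inclusive interval to zero-based interbase."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return (start - 1, end)


def to_one_based(ib_start: int, ib_end: int) -> tuple:
    """Convert an interbase interval to one-based inclusive coordinates."""
    if ib_start == ib_end:
        # A zero-length interval is a point between two bases (an insertion
        # site); one-based inclusive coordinates need a special case for it.
        raise ValueError("zero-length interval has no one-based inclusive form")
    return (ib_start + 1, ib_end)


snv = to_interbase(69538, 69538)
print(snv, "length:", snv[1] - snv[0])
print(to_one_based(*snv))
```

In interbase coordinates the length of any interval is simply end minus start, including zero for insertions, which is the uniformity the paragraph above refers to.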
