Variant annotation in Python

Varcode

Varcode is a library for working with genomic variant data in Python and predicting the impact of those variants on protein sequences.

Installation

You can install varcode using pip:

pip install varcode

You can install required reference genome data through PyEnsembl as follows:

# Downloads and installs the Ensembl releases (75 and 76)
pyensembl install --release 75 76

Example

import varcode

# Load a TCGA MAF file containing variants
variants = varcode.load_maf("tcga-ovarian-cancer-variants.maf")

print(variants)
### <VariantCollection from 'tcga-ovarian-cancer-variants.maf' with 6428 elements>
###  -- Variant(contig=1, start=69538, ref=G, alt=A, genome=GRCh37)
###  -- Variant(contig=1, start=881892, ref=T, alt=G, genome=GRCh37)
###  -- Variant(contig=1, start=3389714, ref=G, alt=A, genome=GRCh37)
###  -- Variant(contig=1, start=3624325, ref=G, alt=T, genome=GRCh37)
###  ...

# you can index into a VariantCollection and get back a Variant object
variant = variants[0]

# groupby_gene_name returns a dictionary whose keys are gene names
# and whose values are themselves VariantCollections
gene_groups = variants.groupby_gene_name()

# get variants which affect the TP53 gene
TP53_variants = gene_groups["TP53"]

# predict protein coding effect of every TP53 variant on
# each transcript of the TP53 gene
TP53_effects = TP53_variants.effects()

print(TP53_effects)
### <EffectCollection with 789 elements>
### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R342*)
### -- ThreePrimeUTR(variant=chr17 g.7574003G>A, transcript_name=TP53-005, transcript_id=ENST00000420246)
### -- PrematureStop(variant=chr17 g.7574003G>A, transcript_name=TP53-002, transcript_id=ENST00000445888, effect_description=p.R342*)
### -- FrameShift(variant=chr17 g.7574030_7574030delG, transcript_name=TP53-001, transcript_id=ENST00000269305, effect_description=p.R333fs)
### ...

premature_stop_effect = TP53_effects[0]

print(str(premature_stop_effect.mutant_protein_sequence))
### 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMF'

print(premature_stop_effect.aa_mutation_start_offset)
### 341

print(premature_stop_effect.transcript)
### Transcript(id=ENST00000269305, name=TP53-001, gene_name=TP53, biotype=protein_coding, location=17:7571720-7590856)

print(premature_stop_effect.gene.name)
### 'TP53'
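
The offset printed above appears to be zero-based: aa_mutation_start_offset is 341, the effect description reads p.R342*, and the mutant protein sequence shown has 341 residues. The following is a minimal sketch of that relationship in plain Python. It does not use Varcode; the helper name describe_stop_gain and the toy protein sequence are hypothetical and exist only for illustration.

```python
def describe_stop_gain(reference_protein, offset):
    """Return (description, truncated_protein) for a stop codon introduced
    at a zero-based amino acid offset. Hypothetical helper, not Varcode API."""
    if not 0 <= offset < len(reference_protein):
        raise ValueError("offset lies outside of the protein")
    ref_aa = reference_protein[offset]
    # Protein nomenclature counts residues from 1, so add one to the offset.
    description = f"p.{ref_aa}{offset + 1}*"
    # Everything before the new stop codon is still translated.
    truncated = reference_protein[:offset]
    return description, truncated


# Toy sequence (invented): the arginine at zero-based offset 7 becomes a stop.
description, truncated = describe_stop_gain("MEEPQSDR", offset=7)
print(description)     # p.R8*
print(len(truncated))  # 7
```

Under this reading, the length of the truncated protein equals the zero-based offset of the new stop codon, which matches the 341 reported in the example.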

If you are looking for a quick start guide, see this IPython notebook, which demonstrates simple use cases of Varcode.

Further reading

Feature guides live in docs/:

  • Genotypes and sample-aware queries: per-sample zygosity on multi-sample VCFs (Genotype, Zygosity, VariantCollection.for_sample, .heterozygous_in, .homozygous_alt_in). New in 2.3.
  • CSV round-trip and metadata headers: to_csv / from_csv on both collection types, with #-prefixed provenance headers. New in 2.1, refined in 2.2.
  • Error handling: ReferenceMismatchError, SampleNotFoundError, and the raise_on_error=False escape hatch.
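
To make the idea of #-prefixed provenance headers concrete, here is a generic sketch that uses only the Python standard library. It is not Varcode's to_csv / from_csv implementation, and the "# key=value" layout, the function names, and the column names are all assumptions chosen for illustration; see the guide in docs/ for the actual format.

```python
import csv
import io


def write_with_header(rows, fieldnames, metadata):
    """Serialize rows as CSV, preceded by '# key=value' metadata lines.
    The header layout is an assumption, not Varcode's format."""
    buf = io.StringIO()
    for key, value in metadata.items():
        buf.write(f"# {key}={value}\n")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_with_header(text):
    """Split '#'-prefixed metadata lines from the CSV body and parse both."""
    metadata = {}
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            key, _, value = line.lstrip("# ").partition("=")
            metadata[key] = value
        else:
            data_lines.append(line)
    rows = list(csv.DictReader(data_lines))
    return metadata, rows


original_rows = [{"contig": "1", "start": "69538", "ref": "G", "alt": "A"}]
text = write_with_header(
    original_rows,
    fieldnames=["contig", "start", "ref", "alt"],
    metadata={"genome": "GRCh37"},
)
metadata, rows = read_with_header(text)
```

Keeping the metadata in comment lines means the data rows remain an ordinary CSV table, while a reader that knows the convention can recover context such as the reference genome.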

See CHANGELOG.md for the release history.

Effect Types

| Effect type | Description |
| --- | --- |
| AlternateStartCodon | Replace annotated start codon with alternative start codon (e.g. "ATG>CAG"). |
| ComplexSubstitution | Insertion and deletion of multiple amino acids. |
| Deletion | Coding mutation which causes deletion of amino acid(s). |
| ExonLoss | Deletion of entire exon, significantly disrupts protein. |
| ExonicSpliceSite | Mutation at the beginning or end of an exon, may affect splicing. |
| FivePrimeUTR | Variant affects 5' untranslated region before start codon. |
| FrameShiftTruncation | A frameshift which leads immediately to a stop codon (no novel amino acids created). |
| FrameShift | Out-of-frame insertion or deletion of nucleotides, causes novel protein sequence and often premature stop codon. |
| IncompleteTranscript | Can't determine effect since transcript annotation is incomplete (often missing either the start or stop codon). |
| Insertion | Coding mutation which causes insertion of amino acid(s). |
| Intergenic | Occurs outside of any annotated gene. |
| Intragenic | Within the annotated boundaries of a gene but not in a region that's transcribed into pre-mRNA. |
| IntronicSpliceSite | Mutation near the beginning or end of an intron but less likely to affect splicing than donor/acceptor mutations. |
| Intronic | Variant occurs between exons and is unlikely to affect splicing. |
| NoncodingTranscript | Transcript doesn't code for a protein. |
| PrematureStop | Insertion of stop codon, truncates protein. |
| Silent | Mutation in coding sequence which does not change the amino acid sequence of the translated protein. |
| SpliceAcceptor | Mutation in the last two nucleotides of an intron, likely to affect splicing. |
| SpliceDonor | Mutation in the first two nucleotides of an intron, likely to affect splicing. |
| StartLoss | Mutation causes loss of start codon, likely result is that an alternate start codon will be used down-stream (possibly in a different frame). |
| StopLoss | Loss of stop codon, causes extension of protein by translation of nucleotides from 3' UTR. |
| Substitution | Coding mutation which causes simple substitution of one amino acid for another. |
| ThreePrimeUTR | Variant affects 3' untranslated region after stop codon of mRNA. |
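
As a rough illustration of how four of the coding effect types above differ, the sketch below classifies a single-codon change using the standard genetic code. It is a deliberately simplified stand-in, not Varcode's prediction logic: the function name classify_codon_change is hypothetical, and it ignores start codons, frameshifts, splicing, and everything else a real annotator must handle.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Name the effect of replacing one in-frame codon with another.
    Hypothetical helper covering only four of the effect types listed."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "Silent"          # translated protein is unchanged
    if alt_aa == "*":
        return "PrematureStop"   # new stop codon truncates the protein
    if ref_aa == "*":
        return "StopLoss"        # translation continues past the old stop
    return "Substitution"        # one amino acid replaced by another


print(classify_codon_change("CGA", "CGG"))  # Silent (Arg -> Arg)
print(classify_codon_change("CGA", "TGA"))  # PrematureStop (Arg -> stop)
print(classify_codon_change("TGA", "TGG"))  # StopLoss (stop -> Trp)
print(classify_codon_change("CGA", "CAA"))  # Substitution (Arg -> Gln)
```

The arginine-to-stop case has the same shape as the p.R342* effect in the example output earlier in this document.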

Coordinate System

Varcode currently uses a "base counted, one start" genomic coordinate system, to match the Ensembl annotation database. We are planning to switch over to "space counted, zero start" (interbase) coordinates, since that system allows for more uniform logic (no special cases for insertions). To learn more about genomic coordinate systems, read this blog post.
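
The difference between the two systems can be sketched in a few lines of plain Python. The function names below are hypothetical and are not part of Varcode; the sketch assumes that a base-counted range is inclusive on both ends and that an interbase range is half-open, which is the usual convention.

```python
def base_to_interbase(start, end):
    """Convert a one-based, inclusive base range to a zero-based,
    half-open interbase range."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def interbase_to_base(start, end):
    """Convert an interbase range back to base-counted coordinates.
    A zero-width range covers no base, so it has no direct equivalent."""
    if end <= start:
        raise ValueError("zero-width interval covers no bases")
    return start + 1, end


def insertion_point_after_base(position):
    """Interbase range for an insertion right after one-based `position`."""
    return position, position


sequence = "ACGT"
# Bases 2 through 3 (one-based, inclusive) are "CG".
ib_start, ib_end = base_to_interbase(2, 3)
print(sequence[ib_start:ib_end])        # CG (Python slices are interbase)
print(base_to_interbase(69538, 69538))  # (69537, 69538)
print(insertion_point_after_base(2))    # (2, 2)
```

In interbase coordinates an insertion is simply a zero-width interval, whereas a base-counted system has to describe it relative to a neighboring base, which is the special case the paragraph above refers to.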


