Skip to main content

HGVS variant name parsing and generation with SV (structural variant) support

Project description

HGVS variant name parsing and generation

The Human Genome Variation Society (HGVS) promotes the discovery and sharing of genetic variation in the human population. As part of facilitating variant sharing, the society has produced a series of recommendations for how to name and refer to variants within research publications and clinical settings. A compilation of these recommendations is available on their website.

This library provides a simple Python API for parsing, formatting, and normalizing HGVS names. Surprisingly, there are many non-trivial steps necessary in handling HGVS names and therefore there is a need for well tested libraries that encapsulate these steps.

HGVS name example

In most next-generation sequencing applications, variants are first discovered and described in terms of their genomic coordinates such as chromosome 7, position 117,199,563 with reference allele G and alternative allele T. According to the HGVS standard, we can describe this variant as NC_000007.13:g.117199563G>T. The first part of the name is a RefSeq ID NC_000007.13 for chromosome 7 version 13. The g. denotes that this is a variant described in genomic (i.e. chromosomal) coordinates. Lastly, the chromosomal position, reference allele, and alternative allele are indicated. For simple single nucleotide changes the > character is used.

More commonly, a variant will be described using a cDNA or protein style HGVS name. In the example above, the variant in cDNA style is named NM_000492.3:c.1438G>T. Here again, the first part of the name refers to a RefSeq sequence, this time mRNA transcript NM_000492 version 3. Optionally, the gene name can also be given as NM_000492.3(CFTR). The c. indicates that this is a cDNA name, and the coordinate indicates that this mutation occurs at position 1438 along the coding portion of the spliced transcript (i.e. position 1 is the first base of ATG translation start codon). Briefly, the protein style of the variant name is NP_000483.3:p.Gly480Cys which indicates the change in amino-acid coordinates (480) along an amino-acid sequence (NP_000483.3) and gives the reference and alternative amino-acid alleles (Gly and Cys, respectively).

The standard also specifies custom name formats for many mutation categories such as insertions (NM_000492.3:c.1438_1439insA), deletions (NM_000492.3:c.1438_1440delGGT), duplications (NM_000492.3:c.1438_1440dupGGT), and several other more complex genomic rearrangements.

While many of these names appear to be simple to parse or generate, there are many corner cases, especially with cDNA HGVS names. For example, variants before the start codon should have negative cDNA coordinates (NM_000492.3:c.-4G>C), and variants after the stop codon also have their own format (NM_000492.3:c.*33C>T). Variants within introns are indicated by the closest exonic base with an additional genomic offset such as NM_000492.3:4243-20A>G (the variant is 20 bases in the 5' direction of the cDNA coordinate 4243). Lastly, all coordinates and alleles are specified on the strand of the transcript. This library properly handles all logic necessary to convert genomic coordinates to and from HGVS cDNA coordinates.

Another important consideration of any library that handles HGVS names is variant normalization. The HGVS standard aims to provide "uniform and unequivocal" description of variants. Namely, two people discovering a variant should be able to arrive at the same name for it. Such a property is very useful for checking whether a variant has been seen before and connecting all known relevant information. For SNPs, this property is fairly easy to achieve. However, for insertions and deletions (indels) near repetitive regions, many indels are equivalent (e.g. it doesn't matter which AT in a run of ATATATAT was deleted). The VCF file format has chosen to uniquely specify such indels by using the most left-aligned genomic coordinate. Therefore, compliant variant callers that output VCF will have applied this normalization. The HGVS standard also specifies a normalization for such indels. However, it states that indels should use the most 3' position in a transcript. For genes on the positive strand, this is the opposite direction specified by VCF. This library properly implements both kinds of variant normalization and allows easy conversion between HGVS and VCF style variants. It also handles many other cases of normalization (e.g. the HGVS standard recommends indicating an insertion with the dup notation instead of ins if it can be represented as a tandem duplication).

Example usage

Below is a minimal example of parsing and formatting HGVS names. In addition to the name itself, two other pieces of information are needed: the genome sequence (needed for normalization), and the transcript model or a callback for fetching the transcript model (needed for transcript coordinate calculations). This library makes as few assumptions as possible about how this external data is stored. In this example, the genome sequence is read using the pyfaidx library and transcripts are read from a RefSeqGenes flat-file using methods provided by hgvsv.

import pyhgvsv as hgvsv
import pyhgvsv.utils as hgvsv_utils
from pyfaidx import Fasta

# Read genome sequence using pyfaidx.
# !!! ALL BELOW EXAMPLES FOR 'NM_000352.3:c.215A>G' USE hg19. RESULTS WILL VARY. !!!
genome = Fasta('/tmp/hg38.fa')

# Read RefSeq transcripts into a python dict.
with open('hgvs/data/genes.refGene') as infile:
    transcripts = hgvs_utils.read_transcripts(infile)

# Provide a callback for fetching a transcript by its name.
def get_transcript(name):
    return transcripts.get(name)

# Parse the HGVS name into genomic coordinates and alleles.
chrom, offset, ref, alt = hgvs.parse_hgvs_name(
    'NM_000352.3:c.215A>G', genome, get_transcript=get_transcript)
# Returns variant in VCF style: ('chr11', 17496508, 'T', 'C')
# Notice that since the transcript is on the negative strand, the alleles
# are reverse complemented during conversion.

# Format an HGVS name.
chrom, offset, ref, alt = ('chr11', 17496508, 'T', 'C')
transcript = get_transcript('NM_000352.3')
hgvs_name = hgvs.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript)
# Returns 'NM_000352.3(ABCC8):c.215A>G'

# Format an HGVS name for a structural variant (deletion).
chrom, offset, ref, alt, sv_length = ('chrY', 24861625, '', '', -4780)
transcript = get_transcript('NM_001388484.1')
hgvs_name = hgvsv.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript, sv_length)
# Returns 'NM_001388484.1(DAZ4):c.1210-436_1354-437del4780'

# Format an HGVS name for a structural variant (insertion).
chrom, offset, ref, alt, sv_length = ('chr17', 8141778, '', 'TTCTCCCCCCTTGAACTTGAGCTCAATTC', 29)
transcript = get_transcript('NM_002616.3')
hgvs_name = hgvsv.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript, sv_length)
# Returns 'NM_002616.3(PER1):c.3600+26_3600+27ins29'

The hgvsv library can also perform just the parsing step and provide a parse tree of the HGVS name.

import pyhgvs as hgvs

hgvs_name = hgvs.HGVSName('NM_000352.3:c.215-10A>G')

# fields of the HGVS name are available as attributes:
#
# hgvs_name.transcript = 'NM_000352.3'
# hgvs_name.kind = 'c'
# hgvs_name.mutation_type = '>'
# hgvs_name.cdna_start = hgvs.CDNACoord(215, -10)
# hgvs_name.cdna_end = hgvs.CDNACoord(215, -10)
# hgvs_name.ref_allele = 'A'
# hgvs_name.alt_allele = 'G'

Install

hgvsv can be installed via pip:

pip install pyhgvsv

Or the library can be installed using the setup.py file as follows:

python setup.py install

Tests

Test cases can be run by running

python setup.py nosetests

Requirements

This library requires at least Python 2.6, but otherwise has no external dependencies.

The library's use often requires refSeq data, which can be found at genome.ucsc.edu

The library does assume that genome sequence is available through a pyfaidx compatible Fasta object. For an example of writing a wrapper for a different genome sequence back-end, see hgvs.tests.genome.MockGenome.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyhgvsv-0.1.post1.tar.gz (24.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyhgvsv-0.1.post1-py3-none-any.whl (22.5 kB view details)

Uploaded Python 3

File details

Details for the file pyhgvsv-0.1.post1.tar.gz.

File metadata

  • Download URL: pyhgvsv-0.1.post1.tar.gz
  • Upload date:
  • Size: 24.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for pyhgvsv-0.1.post1.tar.gz
Algorithm Hash digest
SHA256 3ddb5162067a66bbcaf5f6e54e44a41782f79595efa8de09b2a7fa373c967f89
MD5 811556e0c22f3a628963adb8d2db8ec9
BLAKE2b-256 bff52f2c702674c631b0e75543d1b668b2a0ce7b7b2cb22efc98402a6d8b1392

See more details on using hashes here.

File details

Details for the file pyhgvsv-0.1.post1-py3-none-any.whl.

File metadata

  • Download URL: pyhgvsv-0.1.post1-py3-none-any.whl
  • Upload date:
  • Size: 22.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for pyhgvsv-0.1.post1-py3-none-any.whl
Algorithm Hash digest
SHA256 f904770121b2c12a5e0299852bf1e145dc0452485f9bd00b13e0be58294d825b
MD5 68a8e9ed4683592e8c4ef867497f2269
BLAKE2b-256 a254777eada038a6f067da5f868cf35e0654d5110185f757cad37bc4404a56c8

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page