HGVS variant name parsing and generation with SV (structural variant) support

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

HGVS variant name parsing and generation

The Human Genome Variation Society (HGVS) promotes the discovery and sharing of genetic variation in the human population. As part of facilitating variant sharing, the society has produced a series of recommendations for how to name and refer to variants within research publications and clinical settings. A compilation of these recommendations is available on their website.

This library provides a simple Python API for parsing, formatting, and normalizing HGVS names. Handling HGVS names involves many non-trivial steps, so there is a need for well-tested libraries that encapsulate them.

HGVS name example

In most next-generation sequencing applications, variants are first discovered and described in terms of their genomic coordinates such as chromosome 7, position 117,199,563 with reference allele G and alternative allele T. According to the HGVS standard, we can describe this variant as NC_000007.13:g.117199563G>T. The first part of the name is a RefSeq ID NC_000007.13 for chromosome 7 version 13. The g. denotes that this is a variant described in genomic (i.e. chromosomal) coordinates. Lastly, the chromosomal position, reference allele, and alternative allele are indicated. For simple single nucleotide changes the > character is used.
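As a minimal sketch of the naming scheme just described (not this library's implementation), the genomic name for a single nucleotide change can be assembled from its parts. The function name and the lookup table are hypothetical; the table holds only the accession quoted above.

```python
# Hypothetical lookup table. Only the accession quoted in the text is listed;
# a real table would cover every chromosome of the chosen assembly.
REFSEQ_BY_CHROMOSOME = {"chr7": "NC_000007.13"}


def format_genomic_substitution(chrom, position, ref, alt):
    """Build a 'g.' style name for a single nucleotide change."""
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("this sketch only handles single nucleotide changes")
    accession = REFSEQ_BY_CHROMOSOME[chrom]
    # Layout: <accession>:g.<position><ref>><alt>
    return "%s:g.%d%s>%s" % (accession, position, ref, alt)


print(format_genomic_substitution("chr7", 117199563, "G", "T"))
# prints NC_000007.13:g.117199563G>T
```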

More commonly, a variant will be described using a cDNA or protein style HGVS name. In the example above, the variant in cDNA style is named NM_000492.3:c.1438G>T. Here again, the first part of the name refers to a RefSeq sequence, this time mRNA transcript NM_000492 version 3. Optionally, the gene name can also be given as NM_000492.3(CFTR). The c. indicates that this is a cDNA name, and the coordinate indicates that this mutation occurs at position 1438 along the coding portion of the spliced transcript (i.e. position 1 is the first base of ATG translation start codon). Briefly, the protein style of the variant name is NP_000483.3:p.Gly480Cys which indicates the change in amino-acid coordinates (480) along an amino-acid sequence (NP_000483.3) and gives the reference and alternative amino-acid alleles (Gly and Cys, respectively).
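The shared layout of these names (accession, optional gene name, kind, description) can be illustrated with a small regular expression. This is a hedged sketch for the examples in this section only; the pattern and function name are hypothetical and much simpler than a full HGVS parser.

```python
import re

# Hypothetical pattern: accession, optional gene in parentheses,
# kind (c, g or p), then the rest of the description.
HGVS_PREFIX = re.compile(
    r"^(?P<accession>[^:()]+)(?:\((?P<gene>[^()]+)\))?:"
    r"(?P<kind>[cgp])\.(?P<description>.+)$"
)


def split_hgvs_name(name):
    """Return (accession, gene or None, kind, description)."""
    match = HGVS_PREFIX.match(name)
    if match is None:
        raise ValueError("unrecognized HGVS name: %r" % name)
    return (
        match.group("accession"),
        match.group("gene"),
        match.group("kind"),
        match.group("description"),
    )


print(split_hgvs_name("NM_000492.3(CFTR):c.1438G>T"))
print(split_hgvs_name("NP_000483.3:p.Gly480Cys"))
```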

The standard also specifies custom name formats for many mutation categories such as insertions (NM_000492.3:c.1438_1439insA), deletions (NM_000492.3:c.1438_1440delGGT), duplications (NM_000492.3:c.1438_1440dupGGT), and several other more complex genomic rearrangements.
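To make the categories concrete, the sketch below labels a name by the edit keyword at the end of its description. It is an illustration that covers only the four forms shown above; the names EDIT_PATTERNS and classify_edit are hypothetical, and more complex forms are reported as unknown.

```python
import re

# Hypothetical patterns for the four edit types shown in the text.
EDIT_PATTERNS = [
    ("substitution", re.compile(r"[ACGT]>[ACGT]$")),
    ("insertion", re.compile(r"ins[ACGT]+$")),
    ("deletion", re.compile(r"del[ACGT]*$")),
    ("duplication", re.compile(r"dup[ACGT]*$")),
]


def classify_edit(name):
    """Return a label for the edit type of a simple HGVS name."""
    # Everything after the last ':' is the variant description.
    description = name.rpartition(":")[2]
    for label, pattern in EDIT_PATTERNS:
        if pattern.search(description):
            return label
    return "unknown"


for example in (
    "NM_000492.3:c.1438G>T",
    "NM_000492.3:c.1438_1439insA",
    "NM_000492.3:c.1438_1440delGGT",
    "NM_000492.3:c.1438_1440dupGGT",
):
    print(example, classify_edit(example))
```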

While many of these names appear to be simple to parse or generate, there are many corner cases, especially with cDNA HGVS names. For example, variants before the start codon should have negative cDNA coordinates (NM_000492.3:c.-4G>C), and variants after the stop codon also have their own format (NM_000492.3:c.*33C>T). Variants within introns are indicated by the closest exonic base with an additional genomic offset such as NM_000492.3:c.4243-20A>G (the variant is 20 bases in the 5' direction of the cDNA coordinate 4243). Lastly, all coordinates and alleles are specified on the strand of the transcript. This library properly handles all logic necessary to convert genomic coordinates to and from HGVS cDNA coordinates.
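The coordinate forms above can be read apart with a short parser. This is a sketch under stated assumptions: the CDNAPosition tuple and its field names are hypothetical and are not the library's CDNACoord class, and converting the result to a genomic position would additionally require the transcript model.

```python
import re
from collections import namedtuple

# Hypothetical container; the field names are illustrative only.
CDNAPosition = namedtuple("CDNAPosition", ["coord", "offset", "landmark"])

# Optional '-' or '*' prefix, an exonic coordinate, optional intronic offset.
CDNA_COORD = re.compile(
    r"^(?P<prefix>[-*]?)(?P<coord>\d+)(?P<offset>[-+]\d+)?$"
)


def parse_cdna_position(text):
    """Parse coordinates such as '-4', '*33', '4243-20' or '3600+26'."""
    match = CDNA_COORD.match(text)
    if match is None:
        raise ValueError("unrecognized cDNA coordinate: %r" % text)
    coord = int(match.group("coord"))
    offset = int(match.group("offset") or 0)
    prefix = match.group("prefix")
    if prefix == "*":
        # Counted from the stop codon into the 3' UTR.
        return CDNAPosition(coord, offset, "stop_codon")
    if prefix == "-":
        # Counted backwards from the start codon into the 5' UTR.
        return CDNAPosition(-coord, offset, "start_codon")
    return CDNAPosition(coord, offset, "start_codon")


for text in ("-4", "*33", "4243-20", "3600+26"):
    print(text, parse_cdna_position(text))
```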

Another important consideration for any library that handles HGVS names is variant normalization. The HGVS standard aims to provide a "uniform and unequivocal" description of variants. Namely, two people discovering a variant should be able to arrive at the same name for it. Such a property is very useful for checking whether a variant has been seen before and for connecting all known relevant information. For SNPs, this property is fairly easy to achieve. However, for insertions and deletions (indels) near repetitive regions, many indels are equivalent (e.g. it doesn't matter which AT in a run of ATATATAT was deleted). The VCF file format has chosen to uniquely specify such indels by using the most left-aligned genomic coordinate. Therefore, compliant variant callers that output VCF will have applied this normalization. The HGVS standard also specifies a normalization for such indels. However, it states that indels should use the most 3' position in a transcript. For genes on the positive strand, this is the opposite of the direction specified by VCF. This library properly implements both kinds of variant normalization and allows easy conversion between HGVS and VCF style variants. It also handles many other cases of normalization (e.g. the HGVS standard recommends indicating an insertion with the dup notation instead of ins if it can be represented as a tandem duplication).
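The two conventions can be illustrated on the ATATATAT example. The sketch below is a simplified illustration rather than the library's normalization code: it uses 0-based positions on a plain string, handles deletions only, and its function names are hypothetical. For a transcript on the positive strand, shifting right corresponds to the most 3' position.

```python
def shift_deletion_left(sequence, start, length):
    """Return the leftmost start of an equivalent deletion (VCF style)."""
    while start > 0 and sequence[start - 1] == sequence[start + length - 1]:
        start -= 1
    return start


def shift_deletion_right(sequence, start, length):
    """Return the rightmost start of an equivalent deletion."""
    while (start + length < len(sequence)
           and sequence[start] == sequence[start + length]):
        start += 1
    return start


def apply_deletion(sequence, start, length):
    """Return the sequence with 'length' bases removed at 'start'."""
    return sequence[:start] + sequence[start + length:]


sequence = "GGATATATATCC"
start, length = 4, 2  # delete one 'AT' from the middle of the repeat

left = shift_deletion_left(sequence, start, length)
right = shift_deletion_right(sequence, start, length)
print(left, right)

# Both placements describe the same resulting sequence.
print(apply_deletion(sequence, left, length))
print(apply_deletion(sequence, right, length))
```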

Example usage

Below is a minimal example of parsing and formatting HGVS names. In addition to the name itself, two other pieces of information are needed: the genome sequence (needed for normalization), and the transcript model or a callback for fetching the transcript model (needed for transcript coordinate calculations). This library makes as few assumptions as possible about how this external data is stored. In this example, the genome sequence is read using the pyfaidx library and transcripts are read from a RefSeqGenes flat-file using methods provided by hgvsv.

import pyhgvsv as hgvsv
import pyhgvsv.utils as hgvsv_utils
from pyfaidx import Fasta

# Read genome sequence using pyfaidx.
# NOTE: The expected results shown below for 'NM_000352.3:c.215A>G' were
# computed against hg19. Results will differ for other assemblies, including
# the hg38 file loaded here.
genome = Fasta('/tmp/hg38.fa')

# Read RefSeq transcripts into a python dict.
with open('hgvs/data/genes.refGene') as infile:
    transcripts = hgvsv_utils.read_transcripts(infile)

# Provide a callback for fetching a transcript by its name.
def get_transcript(name):
    return transcripts.get(name)

# Parse the HGVS name into genomic coordinates and alleles.
chrom, offset, ref, alt = hgvsv.parse_hgvs_name(
    'NM_000352.3:c.215A>G', genome, get_transcript=get_transcript)
# Returns variant in VCF style: ('chr11', 17496508, 'T', 'C')
# Notice that since the transcript is on the negative strand, the alleles
# are reverse complemented during conversion.

# Format an HGVS name.
chrom, offset, ref, alt = ('chr11', 17496508, 'T', 'C')
transcript = get_transcript('NM_000352.3')
hgvs_name = hgvsv.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript)
# Returns 'NM_000352.3(ABCC8):c.215A>G'

# Format an HGVS name for a structural variant (deletion).
chrom, offset, ref, alt, sv_length = ('chrY', 24861625, '', '', -4780)
transcript = get_transcript('NM_001388484.1')
hgvs_name = hgvsv.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript, sv_length)
# Returns 'NM_001388484.1(DAZ4):c.1210-436_1354-437del4780'

# Format an HGVS name for a structural variant (insertion).
chrom, offset, ref, alt, sv_length = ('chr17', 8141778, '', 'TTCTCCCCCCTTGAACTTGAGCTCAATTC', 29)
transcript = get_transcript('NM_002616.3')
hgvs_name = hgvsv.format_hgvs_name(
    chrom, offset, ref, alt, genome, transcript, sv_length)
# Returns 'NM_002616.3(PER1):c.3600+26_3600+27ins29'
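The strand handling noted in the comments above (A>G on the transcript becoming T>C in genomic coordinates) can be sketched in isolation. The helper names below are hypothetical and are not part of the library's API.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


def transcript_to_genomic_alleles(ref, alt, strand):
    """Convert transcript-strand alleles to forward genomic strand alleles."""
    if strand == "-":
        return reverse_complement(ref), reverse_complement(alt)
    return ref, alt


# c.215A>G on a negative-strand transcript corresponds to T>C genomically.
print(transcript_to_genomic_alleles("A", "G", "-"))
```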

The hgvsv library can also perform just the parsing step and provide a parse tree of the HGVS name.

import pyhgvsv as hgvsv

hgvs_name = hgvsv.HGVSName('NM_000352.3:c.215-10A>G')

# Fields of the HGVS name are available as attributes:
#
# hgvs_name.transcript = 'NM_000352.3'
# hgvs_name.kind = 'c'
# hgvs_name.mutation_type = '>'
# hgvs_name.cdna_start = hgvsv.CDNACoord(215, -10)
# hgvs_name.cdna_end = hgvsv.CDNACoord(215, -10)
# hgvs_name.ref_allele = 'A'
# hgvs_name.alt_allele = 'G'

Install

hgvsv can be installed via pip:

pip install pyhgvsv

Or the library can be installed using the setup.py file as follows:

python setup.py install

Tests

Run the test cases with:

python setup.py nosetests

Requirements

This library requires at least Python 2.6, but otherwise has no external dependencies.

Using the library often requires RefSeq data, which can be found at genome.ucsc.edu.

The library does assume that genome sequence is available through a pyfaidx compatible Fasta object. For an example of writing a wrapper for a different genome sequence back-end, see hgvs.tests.genome.MockGenome.
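A wrapper for another sequence back-end might look like the following minimal sketch. It assumes the library only needs genome[chrom][start:end] with 0-based, end-exclusive slices whose result converts to text via str(), as with pyfaidx. The class names are hypothetical, and the interface actually required should be checked against hgvs.tests.genome.MockGenome.

```python
class InMemoryChromosome(object):
    """Hypothetical chromosome wrapper supporting slice access."""

    def __init__(self, name, sequence):
        self.name = name
        self._sequence = sequence

    def __getitem__(self, key):
        # Support only the slice form chromosome[start:end].
        if not isinstance(key, slice):
            raise TypeError("only slice access is supported in this sketch")
        return self._sequence[key]


class InMemoryGenome(object):
    """Hypothetical genome back-end holding sequences in a dict."""

    def __init__(self, sequences):
        self._chromosomes = dict(
            (name, InMemoryChromosome(name, sequence))
            for name, sequence in sequences.items()
        )

    def __getitem__(self, chrom):
        return self._chromosomes[chrom]


genome = InMemoryGenome({"chr1": "ACGTACGT"})
print(str(genome["chr1"][2:5]))
```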

Release history

0.2 (Apr 30, 2024)
0.1.1 (Apr 24, 2024)
0.1.post1 (Apr 24, 2024), this version
0.1 (Apr 24, 2024)

