Abstract PyHGVS and Biocommons HGVS libraries

hgvs_shim

hgvs_shim is a small compatibility layer that simplifies migrating code from the Counsyl pyhgvs library to Biocommons HGVS.

Background

There are two Python HGVS libraries:

  • pyhgvs — simpler, but appears abandoned (no activity for years)
  • Biocommons HGVS — actively developed, more features (alignment gaps, inversions, uncertain coordinates)

Motivation

In the VariantGrid project, we started with pyhgvs. We eventually decided Biocommons HGVS was worth the added complexity and wrote cdot to provide transcripts to both libraries.

Migrating was risky: pyhgvs classes were spread throughout our codebase, and we needed to validate that Biocommons would produce identical results before switching. The transition steps were:

  1. Abstract away pyhgvs classes and methods behind our own interfaces
  2. Write implementations for both pyhgvs and Biocommons
  3. Run both in parallel during testing, raising errors if they disagreed
  4. Switch to Biocommons-only once confident
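The steps above can be sketched with toy backends. This is a minimal illustration of the pattern, not the hgvs_shim implementation: the names Converter, DictBackend and AgreementChecker, the transcript 'TX1' and the coordinates are all made up for the example.

```python
from typing import List, Protocol, Sequence, Tuple

Coordinate = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


class Converter(Protocol):
    """Step 1: the interface the rest of the codebase depends on."""

    def hgvs_to_variant_coordinate(self, hgvs_string: str) -> Coordinate: ...


class DictBackend:
    """Step 2: a toy backend that answers from a lookup table.

    It stands in for a real pyhgvs or Biocommons implementation.
    """

    def __init__(self, table: dict):
        self.table = table

    def hgvs_to_variant_coordinate(self, hgvs_string: str) -> Coordinate:
        return self.table[hgvs_string]


class AgreementChecker:
    """Step 3: call every backend and compare the answers."""

    def __init__(self, backends: Sequence[Converter], die_on_error: bool = True):
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = list(backends)
        self.die_on_error = die_on_error
        self.disagreements: List[str] = []

    def hgvs_to_variant_coordinate(self, hgvs_string: str) -> Coordinate:
        results = [b.hgvs_to_variant_coordinate(hgvs_string) for b in self.backends]
        if any(r != results[0] for r in results[1:]):
            message = f"Backends disagree on {hgvs_string!r}: {results}"
            if self.die_on_error:
                raise ValueError(message)
            self.disagreements.append(message)  # record and carry on
        return results[0]  # the first backend is treated as the reference


# Placeholder data: 'TX1' and 'chrDemo' are not real identifiers.
HGVS = "TX1:c.1A>G"
old_backend = DictBackend({HGVS: ("chrDemo", 100, "A", "G")})
new_backend = DictBackend({HGVS: ("chrDemo", 100, "A", "G")})
drifted_backend = DictBackend({HGVS: ("chrDemo", 101, "A", "G")})

checker = AgreementChecker([old_backend, new_backend])
print(checker.hgvs_to_variant_coordinate(HGVS))  # ('chrDemo', 100, 'A', 'G')
```

Step 4 then amounts to passing a single backend, or replacing the checker with the new backend directly, since both satisfy the same interface.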

hgvs_shim is the cleaned-up library extracted from this work.

Installation

# Biocommons only (pair with any HGVS data provider: UTA, cdot, etc.)
pip install hgvs_shim

# With pyhgvs support (for running both backends side-by-side during migration)
pip install 'hgvs_shim[pyhgvs]'

Usage

Setting up the converter

BioCommonsHGVSConverter takes any biocommons data provider (hgvs.dataproviders.interface.Interface). The simplest option is the default biocommons UTA database connection:

import hgvs.dataproviders.uta as uta
from hgvs_shim import BioCommonsHGVSConverter

hdp = uta.connect()  # connects to biocommons' public UTA PostgreSQL instance
converter = BioCommonsHGVSConverter('GRCh38', hdp)

For faster or offline transcript data, cdot provides JSON.gz files and a REST API (pip install 'hgvs_shim[cdot]'):

from cdot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider
from hgvs_shim import BioCommonsHGVSConverter

# Local JSON.gz file (fast: 500-1000 tx/sec)
hdp = JSONDataProvider(['/path/to/cdot-0.2.x.refseq.grch37.json.gz'])
converter = BioCommonsHGVSConverter('GRCh37', hdp)

# cdot.cc REST API (no local file needed, ~10 tx/sec)
hdp = RESTDataProvider()
converter = BioCommonsHGVSConverter('GRCh37', hdp)

HGVS string → variant coordinate

chrom, pos, ref, alt = converter.hgvs_to_variant_coordinate('NM_000492.3:c.1521_1523delCTT')
# ('chr7', 117548628, 'ACTT', 'A')

Variant coordinate → c.HGVS

from hgvs_shim import TranscriptInfo

transcript_info = TranscriptInfo(
    accession='NM_000492.3',
    strand='+',
    is_coding=True,
    gene_symbol='CFTR',  # optional
)

variant = converter.variant_coordinate_to_c_hgvs('chr7', 117548628, 'ACTT', 'A', transcript_info)
print(variant.format())
# NM_000492.3:c.1521_1523del

Normalize a variant

variant = converter.create_hgvs_variant('NM_000492.3:c.1521_1523delCTT')
normalized = converter.normalize(variant)
print(normalized.format())

Running both backends in parallel (migration validation)

from cdot.hgvs.dataproviders import JSONDataProvider
from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory
from pysam import FastaFile
from hgvs_shim import BioCommonsHGVSConverter, PyHGVSConverter, ComboCheckerHGVSConverter

hdp = JSONDataProvider(['/path/to/cdot.json.gz'])
biocommons = BioCommonsHGVSConverter('GRCh37', hdp)

factory = JSONPyHGVSTranscriptFactory(['/path/to/cdot.json.gz'])
pyhgvs_conv = PyHGVSConverter(FastaFile('/path/to/GRCh37.fa'), factory.get_transcript_grch37)

# Raises ValueError if the two converters return different results
combo = ComboCheckerHGVSConverter([biocommons, pyhgvs_conv], die_on_error=True)
result = combo.hgvs_to_variant_coordinate('NM_000352.3:c.215A>G')

Exception handling

from hgvs_shim import HGVSNomenclatureException, HGVSImplementationException

# `converter` is any converter built as in "Setting up the converter"
hgvs_string = 'NM_000492.3:c.1521_1523delCTT'

try:
    result = converter.hgvs_to_variant_coordinate(hgvs_string)
except HGVSNomenclatureException:
    # Bad HGVS string: user error, fixable
    ...
except HGVSImplementationException:
    # Library failure: not user-fixable
    ...
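When converting many strings at once, the same two-way split can be used to sort inputs into those worth returning to the user and those that need investigation. The sketch below is an assumption-laden illustration: triage_hgvs is a hypothetical helper, not part of hgvs_shim, and the demo exception classes and converter are stand-ins so the example runs without any HGVS library installed.

```python
from typing import Callable, Dict, Iterable, Tuple, Type


def triage_hgvs(
    hgvs_strings: Iterable[str],
    convert: Callable[[str], tuple],
    user_error: Type[Exception],
    library_error: Type[Exception],
) -> Tuple[Dict[str, tuple], Dict[str, str], Dict[str, str]]:
    """Convert each string and sort the outcome into one of three buckets."""
    resolved: Dict[str, tuple] = {}
    rejected: Dict[str, str] = {}  # bad input: report back to the user
    failed: Dict[str, str] = {}    # library failure: log for investigation
    for hgvs_string in hgvs_strings:
        try:
            resolved[hgvs_string] = convert(hgvs_string)
        except user_error as exc:
            rejected[hgvs_string] = str(exc)
        except library_error as exc:
            failed[hgvs_string] = str(exc)
    return resolved, rejected, failed


# Stand-ins for the demo only; they are not the hgvs_shim classes.
class DemoUserError(Exception):
    pass


class DemoLibraryError(Exception):
    pass


def demo_convert(hgvs_string: str) -> tuple:
    if ":" not in hgvs_string:
        raise DemoUserError(f"missing ':' in {hgvs_string!r}")
    if hgvs_string.startswith("BROKEN"):
        raise DemoLibraryError("backend failure")
    return ("chrDemo", 1, "A", "G")  # placeholder coordinate


resolved, rejected, failed = triage_hgvs(
    ["TX1:c.1A>G", "not-hgvs", "BROKEN:c.1A>G"],
    demo_convert,
    DemoUserError,
    DemoLibraryError,
)
print(sorted(resolved), sorted(rejected), sorted(failed))
```

With hgvs_shim installed, the same helper would presumably be called with converter.hgvs_to_variant_coordinate, HGVSNomenclatureException and HGVSImplementationException in place of the stand-ins.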

Format differences vs pyhgvs

Biocommons follows the modern HGVS specification, which differs from pyhgvs in two ways:

del/dup notation — biocommons omits the deleted/duplicated sequence:

pyhgvs:     NM_000492.3:c.442delA
biocommons: NM_000492.3:c.442del

delins VCF representation — biocommons uses minimal representation (no anchor base):

pyhgvs:     ('chr7', 117171119, 'CA', 'C')   # anchor base included
biocommons: ('chr7', 117171120, 'A',  'C')   # no anchor base
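When comparing HGVS strings produced by the two backends, the del/dup notation difference can be smoothed over by dropping the trailing sequence before comparing. The helper below is a hypothetical sketch, not part of hgvs_shim; it only handles a plain del or dup at the end of the string and leaves delins untouched. The delins input in the second call is made up for illustration.

```python
import re

# "del" or "dup" followed by explicit bases at the very end of the string.
# "delins..." does not match, because "i" is not a base letter.
_TRAILING_BASES = re.compile(r"(del|dup)[ACGTUN]+$")


def strip_del_dup_sequence(hgvs_string: str) -> str:
    """Rewrite 'c.442delA' style strings to the modern 'c.442del' form."""
    return _TRAILING_BASES.sub(r"\1", hgvs_string)


print(strip_del_dup_sequence("NM_000492.3:c.442delA"))
# NM_000492.3:c.442del
print(strip_del_dup_sequence("NM_000492.3:c.442_443delinsAT"))
# NM_000492.3:c.442_443delinsAT
```

Applying this to both sides before an equality check avoids flagging a disagreement that is purely a formatting difference.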
