Abstract PyHGVS and Biocommons HGVS libraries
hgvs_shim
hgvs_shim is a small compatibility layer that simplifies migrating code from the Counsyl pyhgvs library to Biocommons HGVS.
Background
There are two Python HGVS libraries:
- pyhgvs — simpler, but appears abandoned (no activity for years)
- Biocommons HGVS — actively developed, more features (alignment gaps, inversions, uncertain coordinates)
Motivation
In the VariantGrid project, we started with pyhgvs. We eventually decided Biocommons HGVS was worth the added complexity, and wrote cdot to provide transcript data to both libraries.
Migrating was risky: pyhgvs classes were spread throughout our codebase, and we needed to validate that Biocommons would produce identical results before switching. The transition steps were:
- Abstract away pyhgvs classes and methods behind our own interfaces
- Write implementations for both pyhgvs and Biocommons
- Run both in parallel during testing, raising errors if they disagreed
- Switch to Biocommons-only once confident
hgvs_shim is a cleaned-up library extracted from this work.
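The transition steps above can be sketched roughly as follows. This is a minimal illustration of the pattern only: the names `Converter`, `LookupConverter`, and `ParallelChecker` are hypothetical stand-ins rather than the hgvs_shim API, and the HGVS string and coordinates are made up.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Tuple

VariantCoordinate = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


class Converter(ABC):
    """Hypothetical interface that hides which HGVS library is in use."""

    name: str

    @abstractmethod
    def hgvs_to_variant_coordinate(self, hgvs_string: str) -> VariantCoordinate:
        ...


class LookupConverter(Converter):
    """Stand-in backend that answers from a fixed table (made-up data)."""

    def __init__(self, name: str, table: Dict[str, VariantCoordinate]):
        self.name = name
        self._table = table

    def hgvs_to_variant_coordinate(self, hgvs_string: str) -> VariantCoordinate:
        return self._table[hgvs_string]


class ParallelChecker(Converter):
    """Runs every backend and compares the results against the first one."""

    def __init__(self, converters: Sequence[Converter], die_on_error: bool = True):
        if not converters:
            raise ValueError("at least one converter is required")
        self.name = "checker"
        self._converters = list(converters)
        self._die_on_error = die_on_error
        self.disagreements: List[str] = []

    def hgvs_to_variant_coordinate(self, hgvs_string: str) -> VariantCoordinate:
        results = [(c.name, c.hgvs_to_variant_coordinate(hgvs_string))
                   for c in self._converters]
        first = results[0][1]
        if any(result != first for _, result in results[1:]):
            message = f"Backends disagree on {hgvs_string}: {results}"
            if self._die_on_error:
                raise ValueError(message)
            self.disagreements.append(message)  # record and carry on
        return first


# Made-up example data: both backends agree on this variant.
old_backend = LookupConverter("old", {"TX1:c.1A>G": ("chr1", 100, "A", "G")})
new_backend = LookupConverter("new", {"TX1:c.1A>G": ("chr1", 100, "A", "G")})
checker = ParallelChecker([old_backend, new_backend])
coordinate = checker.hgvs_to_variant_coordinate("TX1:c.1A>G")
```

Because calling code depends only on the interface, switching to a single backend at the end of the migration means swapping the checker for one converter, with no changes at the call sites.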
Installation
# Biocommons only (pair with any HGVS data provider: UTA, cdot, etc.)
pip install hgvs_shim
# With pyhgvs support (for running both backends side-by-side during migration)
pip install 'hgvs_shim[pyhgvs]'
Usage
Setting up the converter
BioCommonsHGVSConverter takes any biocommons data provider (hgvs.dataproviders.interface.Interface). The simplest option is the default biocommons UTA database connection:
import hgvs.dataproviders.uta as uta
from hgvs_shim import BioCommonsHGVSConverter
hdp = uta.connect() # connects to biocommons' public UTA PostgreSQL instance
converter = BioCommonsHGVSConverter('GRCh38', hdp)
For faster or offline transcript data, cdot provides JSON.gz files and a REST API (pip install 'hgvs_shim[cdot]'):
from cdot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider
from hgvs_shim import BioCommonsHGVSConverter
# Local JSON.gz file (fast: 500-1000 tx/sec)
hdp = JSONDataProvider(['/path/to/cdot-0.2.x.refseq.grch37.json.gz'])
converter = BioCommonsHGVSConverter('GRCh37', hdp)
# cdot.cc REST API (no local file needed, ~10 tx/sec)
hdp = RESTDataProvider()
converter = BioCommonsHGVSConverter('GRCh37', hdp)
HGVS string → variant coordinate
chrom, pos, ref, alt = converter.hgvs_to_variant_coordinate('NM_000492.3:c.1521_1523delCTT')
# ('chr7', 117548628, 'ACTT', 'A')
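The tuple above reads as a VCF-style deletion: the ref allele `ACTT` is the deleted `CTT` preceded by one anchor base, and the alt allele `A` is the anchor base alone. The toy function below illustrates that convention on a short made-up sequence. It is a sketch under stated assumptions: `deletion_to_vcf` is a hypothetical helper, not part of hgvs_shim, and it ignores the transcript-to-genome mapping and normalization that a real converter performs.

```python
from typing import Tuple


def deletion_to_vcf(reference: str, first: int, last: int) -> Tuple[int, str, str]:
    """Return (pos, ref, alt) for deleting 1-based positions first..last inclusive.

    The base immediately before the deletion is used as the anchor base, so the
    deletion cannot start at position 1 in this simplified version.
    """
    if first < 2 or last < first or last > len(reference):
        raise ValueError("deletion must lie within the sequence and start after position 1")
    anchor_pos = first - 1                  # 1-based position of the anchor base
    anchor = reference[anchor_pos - 1]      # convert to a 0-based index
    deleted = reference[first - 1:last]     # the bases being removed
    return anchor_pos, anchor + deleted, anchor


# Made-up sequence: deleting "CTT" at positions 4-6 of "GGACTTGG".
toy_pos, toy_ref, toy_alt = deletion_to_vcf("GGACTTGG", 4, 6)
```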
Variant coordinate → c.HGVS
from hgvs_shim import TranscriptInfo
transcript_info = TranscriptInfo(
    accession='NM_000492.3',
    strand='+',
    is_coding=True,
    gene_symbol='CFTR',  # optional
)
variant = converter.variant_coordinate_to_c_hgvs('chr7', 117548628, 'ACTT', 'A', transcript_info)
print(variant.format())
# NM_000492.3:c.1521_1523del
Normalize a variant
variant = converter.create_hgvs_variant('NM_000492.3:c.1521_1523delCTT')
normalized = converter.normalize(variant)
print(normalized.format())
Running both backends in parallel (migration validation)
from cdot.hgvs.dataproviders import JSONDataProvider
from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory
from pysam import FastaFile
from hgvs_shim import BioCommonsHGVSConverter, PyHGVSConverter, ComboCheckerHGVSConverter
hdp = JSONDataProvider(['/path/to/cdot.json.gz'])
biocommons = BioCommonsHGVSConverter('GRCh37', hdp)
factory = JSONPyHGVSTranscriptFactory(['/path/to/cdot.json.gz'])
pyhgvs_conv = PyHGVSConverter(FastaFile('/path/to/GRCh37.fa'), factory.get_transcript_grch37)
# Raises ValueError if the two converters return different results
combo = ComboCheckerHGVSConverter([biocommons, pyhgvs_conv], die_on_error=True)
result = combo.hgvs_to_variant_coordinate('NM_000352.3:c.215A>G')
Exception handling
from hgvs_shim import HGVSNomenclatureException, HGVSImplementationException
try:
    result = converter.hgvs_to_variant_coordinate(hgvs_string)
except HGVSNomenclatureException:
    # Bad HGVS string: user error, fixable
    ...
except HGVSImplementationException:
    # Library failure: not user-fixable
    ...
Format differences vs pyhgvs
Biocommons follows the modern HGVS specification, which differs from pyhgvs in two ways:
del/dup notation — biocommons omits the deleted/duplicated sequence:
pyhgvs: NM_000492.3:c.442delA
biocommons: NM_000492.3:c.442del
delins VCF representation — biocommons uses minimal representation (no anchor base):
pyhgvs: ('chr7', 117171119, 'CA', 'C') # anchor base included
biocommons: ('chr7', 117171120, 'A', 'C') # no anchor base
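When comparing HGVS strings produced by the two backends, the del/dup difference can be masked by stripping the trailing bases from the older notation first. The helper below is a minimal sketch of that idea: `strip_del_dup_sequence` is a hypothetical name, not part of hgvs_shim, and it only handles bases at the very end of the string.

```python
import re

# Bases that directly follow "del" or "dup" at the end of the string,
# e.g. "delA" or "dupCTT". "delins..." is left alone because "ins" is not a base.
_TRAILING_BASES = re.compile(r"(del|dup)[ACGTUN]+$")


def strip_del_dup_sequence(hgvs_string: str) -> str:
    """Drop the explicit deleted/duplicated sequence from an HGVS string."""
    return _TRAILING_BASES.sub(r"\1", hgvs_string)


print(strip_del_dup_sequence("NM_000492.3:c.442delA"))
# NM_000492.3:c.442del
```

Strings that already use the newer notation, substitutions, and delins variants pass through unchanged, so the helper can be applied to both sides of a comparison.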