
HGVS variant simulator & toolkit: simulate SNVs and frameshift variants for MANE transcripts, plus parse, validate, normalize, backtranslate, convert, extract, liftover, transcribe, and translate HGVS descriptions.


VarSim — HGVS Variant Simulator & Toolkit

Note on Naming: This tool, VarSim, is a sequence variant simulator for generating HGVS nomenclature. It is not affiliated with, and should not be confused with, the "VarSim" read simulator (PMID: 25524895).

VarSim is a toolkit for HGVS variant nomenclature. It simulates all possible SNV and frameshift variants for MANE transcripts, and it also supports parsing, validation, normalization, backtranslation, format conversion, extraction, liftover, transcription, and translation. All of these features are powered by NCBI Entrez.

Installation

pip install varsim

Configuration

Set two environment variables to query NCBI Entrez:

  • EMAIL — (Required) A valid email so NCBI can contact you about query issues.
  • API_KEY — (Recommended) An NCBI API key for higher query rates. Obtain one from your NCBI account settings.

Linux/macOS:

export EMAIL="your.email@example.com"
export API_KEY="your_api_key_here"

Windows (PowerShell):

$env:EMAIL="your.email@example.com"
$env:API_KEY="your_api_key_here"
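
A quick way to confirm both variables are visible to Python before running any queries is sketched below. The helper name `entrez_settings` is hypothetical and not part of VarSim; how VarSim itself reads these variables is not shown here.

```python
import os


def entrez_settings(env=None):
    """Collect the Entrez-related settings from an environment mapping.

    Hypothetical helper for illustration; not part of the VarSim API.
    """
    env = os.environ if env is None else env
    email = env.get("EMAIL", "").strip()
    api_key = env.get("API_KEY", "").strip()
    if not email:
        # EMAIL is the required variable; API_KEY is only recommended.
        raise RuntimeError("EMAIL is not set; NCBI Entrez queries need a contact email")
    return {"email": email, "api_key": api_key or None}


# Example with an explicit mapping instead of the real environment.
print(entrez_settings({"EMAIL": "your.email@example.com"}))
```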

Usage

import varsim

1. Variant Simulation

Generate all possible single-nucleotide variants, plus 1-bp frameshift insertions and deletions, for a gene's MANE transcript. Results include nucleotide and protein HGVS where applicable.

| Function | Description |
| --- | --- |
| cds(gene) | All CDS SNVs → (c.HGVS, 1-letter p.HGVS, 3-letter p.HGVS) |
| utr5(gene) / utr3(gene) | All 5′UTR / 3′UTR SNVs → list of c.HGVS |
| splice_site(gene) | Canonical splice-site SNVs (±1, ±2) |
| aa_sub(gene) | All amino acid substitutions → (1-letter p.HGVS, 3-letter p.HGVS) |
| codon_sub(gene) | All codon-level substitutions → list of c.HGVS |
| missense(gene) | Codon variants with protein effect (missense / silent) |
| frameshift(gene) | All 1-bp deletion and insertion frameshift variants |

>>> varsim.cds("INS")
[('NM_000207.3:c.1A>G', 'NP_000198.1:p.(M1?)', 'NP_000198.1:p.(Met1?)'), ...]

>>> varsim.splice_site("INS")
['NC_000011.10(NM_000207.3):c.187+1G>A', 'NC_000011.10(NM_000207.3):c.187+1G>T', ...]
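
The core idea behind exhaustive SNV simulation can be sketched without any network access: walk the coding sequence and, at each position, emit the three possible alternative bases. The function name `enumerate_snvs` and the toy sequence are illustrative assumptions, not VarSim internals.

```python
BASES = "ACGT"


def enumerate_snvs(cds):
    """Return every single-nucleotide substitution of `cds` as a c. description."""
    variants = []
    # HGVS positions are 1-based, so start counting at 1.
    for pos, ref in enumerate(cds.upper(), start=1):
        for alt in BASES:
            if alt != ref:
                variants.append(f"c.{pos}{ref}>{alt}")
    return variants


snvs = enumerate_snvs("ATG")
print(len(snvs))  # 9 (three alternatives at each of three positions)
print(snvs[:3])   # ['c.1A>C', 'c.1A>G', 'c.1A>T']
```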

2. HGVS Parsing & Validation

Parse HGVS strings into structured objects or validate syntax and semantics.

| Function | Description |
| --- | --- |
| parse(hgvs) | Parse HGVS → HGVSTag (.acc, .prefix, .variant_type, .ref, .alt, .start_pos, …) |
| is_valid(hgvs, ref_seq=None) | Return True if syntax (and optionally semantics) passes |
| validate(hgvs) | Detailed validation → list of {"severity", "message"} dicts |

>>> tag = varsim.parse("NM_000207.3:c.1A>G")
>>> tag.variant_type, tag.ref, tag.alt
('substitution', 'A', 'G')
>>> varsim.validate("NM_000207.3:c.1A>G")
[]
>>> varsim.is_valid("NM_000207.3:c.1A>G", ref_seq="ATGCGTACG...")
True
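
To make the parsing step concrete, here is a minimal regex-based sketch that handles only simple substitutions at plain integer positions (no intronic offsets such as `187+1`). The `SimpleSub` type and `parse_simple_substitution` function are hypothetical; the field names mirror those listed for `HGVSTag`, but the real object may store them differently.

```python
import re
from typing import NamedTuple, Optional


class SimpleSub(NamedTuple):
    """Illustrative stand-in for a parsed substitution (not VarSim's HGVSTag)."""
    acc: Optional[str]
    prefix: str
    start_pos: int
    ref: str
    alt: str


_SUB_RE = re.compile(
    r"^(?:(?P<acc>[A-Z]{2}_\d+\.\d+):)?"          # optional accession, e.g. NM_000207.3
    r"(?P<prefix>[cgn])\."                         # coordinate system
    r"(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_simple_substitution(hgvs):
    """Return a SimpleSub, or None when the string is not a simple substitution."""
    m = _SUB_RE.match(hgvs)
    if m is None:
        return None
    return SimpleSub(m["acc"], m["prefix"] + ".", int(m["pos"]), m["ref"], m["alt"])


print(parse_simple_substitution("NM_000207.3:c.1A>G"))
```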

3. HGVS Normalization

Normalize variants to canonical form per HGVS recommendations.

| Function | Description |
| --- | --- |
| normalize(hgvs, ref_seq=None) | Full pipeline: 3′ shift, ins→dup, allele minimization, range normalization |
| normalize_3prime_shift(hgvs, ref_seq) | Shift variant as far 3′ as possible |
| ins_to_dup(hgvs, ref_seq) | Convert insertion to duplication when applicable |

>>> varsim.normalize("c.4A>G", ref_seq="AAGC")
'c.2A>G'

>>> varsim.ins_to_dup("NM_000207.3:c.4_5insA", ref_seq="TAAA")
'NM_000207.3:c.3dup'
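
The 3′ shift rule can be illustrated for deletions with a few lines of plain Python: a deleted interval can move one base to the right whenever its first base equals the base immediately after it. This is a sketch under that assumption only; the function names are hypothetical and say nothing about how `normalize_3prime_shift` is implemented.

```python
def shift_deletion_3prime(seq, start, end):
    """Shift the deletion seq[start..end] (1-based, inclusive) as far 3' as possible."""
    # Moving the window right by one gives the same result sequence
    # exactly when the base leaving the window equals the base entering it.
    while end < len(seq) and seq[start - 1] == seq[end]:
        start += 1
        end += 1
    return start, end


def describe_deletion(start, end, prefix="c."):
    """Format a deletion as a position or a range."""
    span = str(start) if start == end else f"{start}_{end}"
    return f"{prefix}{span}del"


# Deleting the first A of "AAGC" is indistinguishable from deleting the second.
print(describe_deletion(*shift_deletion_3prime("AAGC", 1, 1)))     # c.2del
# An AT repeat: the deletion slides to the last repeat unit.
print(describe_deletion(*shift_deletion_3prime("ATATATG", 1, 2)))  # c.5_6del
```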

4. Backtranslation

Determine which nucleotide changes could produce a given protein variant.

| Function | Description |
| --- | --- |
| backtranslate(gene, p_hgvs) | Protein → nucleotide backtranslation using the real MANE CDS |
| backtranslate_protein(p_hgvs) | Pure codon-table backtranslation (no gene fetch) |

>>> varsim.backtranslate_protein("p.(V42G)")
['c.125T>G']
>>> varsim.backtranslate("G6PD", "p.(V42G)")  # validates against real CDS
['NM_001360016.2:c.125T>G']
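
The codon-table reasoning can be sketched as follows: for a given reference codon, try every single-base change and keep those whose new codon encodes the target amino acid. Every valine codon (GTN) becomes a glycine codon (GGN) through T>G at its second base, which is why V42G maps to position (42 − 1) × 3 + 2 = 125. The function name and the choice of `GTG` as the reference codon are assumptions for illustration.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def backtranslate_snvs(ref_codon, aa_pos, alt_aa):
    """List single-base c. changes in `ref_codon` that yield `alt_aa` (1-letter)."""
    first = (aa_pos - 1) * 3 + 1  # c. position of the codon's first base
    hits = []
    for offset, ref_base in enumerate(ref_codon):
        for alt_base in "ACGT":
            if alt_base == ref_base:
                continue
            alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
            if CODON_TABLE[alt_codon] == alt_aa:
                hits.append(f"c.{first + offset}{ref_base}>{alt_base}")
    return hits


print(backtranslate_snvs("GTG", 42, "G"))  # ['c.125T>G']
```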

5. Format Conversion

Convert between HGVS, VCF, and SPDI formats.

| Function | Description |
| --- | --- |
| hgvs_to_vcf(hgvs, chrom=None) | HGVS g./c. → VCF dict {CHROM, POS, ID, REF, ALT} |
| vcf_to_hgvs(chrom, pos, ref, alt, acc=None) | VCF record → HGVS string |
| hgvs_to_spdi(hgvs) | HGVS → SPDI string |
| spdi_to_hgvs(spdi, prefix="g.") | SPDI → HGVS string |
| c_to_p(c_hgvs, gene) | Coding HGVS → protein HGVS |

>>> varsim.hgvs_to_vcf("NC_000023.11:g.123456A>G")
{'CHROM': 'NC_000023.11', 'POS': 123456, 'ID': '.', 'REF': 'A', 'ALT': 'G'}
>>> varsim.vcf_to_hgvs("X", 123456, "A", "G", acc="NC_000023.11")
'NC_000023.11:g.123456A>G'
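
For simple genomic substitutions the three formats carry the same information, differing mainly in layout and coordinate base. The sketch below assumes the standard SPDI convention of a 0-based position in `sequence:position:deletion:insertion` order; the helper names are hypothetical, and the exact string returned by `hgvs_to_spdi` is not shown in this document.

```python
import re

_G_SUB = re.compile(
    r"^(?P<acc>[^:]+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def g_sub_to_vcf(hgvs):
    """Convert a g. substitution into a VCF-style dict (1-based POS)."""
    m = _G_SUB.match(hgvs)
    if m is None:
        raise ValueError(f"not a simple g. substitution: {hgvs!r}")
    return {"CHROM": m["acc"], "POS": int(m["pos"]), "ID": ".",
            "REF": m["ref"], "ALT": m["alt"]}


def g_sub_to_spdi(hgvs):
    """Convert a g. substitution into an SPDI string (0-based position)."""
    rec = g_sub_to_vcf(hgvs)
    return f"{rec['CHROM']}:{rec['POS'] - 1}:{rec['REF']}:{rec['ALT']}"


print(g_sub_to_spdi("NC_000023.11:g.123456A>G"))  # NC_000023.11:123455:A:G
```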

6. Variant Extraction

Diff two sequences and produce the minimal HGVS description.

| Function | Description |
| --- | --- |
| extract(ref_seq, obs_seq, acc="NM_000207.3", prefix="c.") | Align & diff → HGVS string |

>>> varsim.extract("ATGC", "ATTC", prefix="c.")
'NM_000207.3:c.3G>T'
>>> varsim.extract("ATGC", "ATC", prefix="c.")
'NM_000207.3:c.3del'
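
One simple way to obtain a minimal description is to trim the common prefix and then the common suffix of the two sequences and describe what remains. Trimming the prefix first places deletions at their 3′-most position. This is a sketch of that idea only, not VarSim's alignment; `describe_difference` is a hypothetical name, and insertions at the very start of the sequence and ins→dup conversion are not handled.

```python
def describe_difference(ref_seq, obs_seq, prefix="c."):
    """Describe the single difference between two sequences in HGVS-like form."""
    if ref_seq == obs_seq:
        return f"{prefix}="
    # Trim the shared prefix.
    start = 0
    while (start < len(ref_seq) and start < len(obs_seq)
           and ref_seq[start] == obs_seq[start]):
        start += 1
    # Trim the shared suffix without crossing the prefix boundary.
    ref_end, obs_end = len(ref_seq), len(obs_seq)
    while (ref_end > start and obs_end > start
           and ref_seq[ref_end - 1] == obs_seq[obs_end - 1]):
        ref_end -= 1
        obs_end -= 1
    deleted, inserted = ref_seq[start:ref_end], obs_seq[start:obs_end]
    first, last = start + 1, ref_end  # 1-based span of the changed reference bases
    span = str(first) if first == last else f"{first}_{last}"
    if len(deleted) == 1 and len(inserted) == 1:
        return f"{prefix}{first}{deleted}>{inserted}"
    if not inserted:
        return f"{prefix}{span}del"
    if not deleted:
        # Insertion between the last prefix base and the next base.
        return f"{prefix}{start}_{start + 1}ins{inserted}"
    return f"{prefix}{span}delins{inserted}"


print(describe_difference("ATGC", "ATTC"))  # c.3G>T
print(describe_difference("ATGC", "ATC"))   # c.3del
```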

7. Liftover

Remap genomic variants between assemblies via the NCBI Remap API.

| Function | Description |
| --- | --- |
| liftover_g_to_assembly(hgvs, target_assembly="GRCh38") | Lift g.HGVS between assemblies |
| liftover_transcript(gene, c_hgvs, target_assembly="GRCh38") | Transcript → genomic → liftover pipeline |

>>> varsim.liftover_g_to_assembly("NC_000001.10:g.12345A>G", "GRCh38")
'NC_000001.11:g.12345A>G'
>>> varsim.liftover_transcript("G6PD", "c.1A>G", "GRCh38")
'NC_000023.11:g.153760607T>C'

8. Transcription

Convert between coding and genomic coordinate systems using exon structure.

| Function | Description |
| --- | --- |
| c_to_g(c_hgvs, gene) | Coding (c.) → genomic (g.) coordinates |
| g_to_c(g_hgvs, gene) | Genomic (g.) → coding (c.) coordinates |
| get_cds_exon_map(gene) | Exon structure mapping (cDNA + genomic coordinates) |

>>> varsim.c_to_g("NM_001360016.2:c.1A>G", "G6PD")
'NC_000023.11:g.153760607T>C'
>>> varsim.get_cds_exon_map("G6PD")
[{'exon': 1, 'cds_start': 0, 'cds_end': 138, 'genomic_start': ..., 'strand': -1}, ...]
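
The coordinate conversion can be sketched with a toy exon map. The sketch assumes that `cds_start`/`cds_end` are 0-based, half-open CDS intervals and that `genomic_start` is the genomic coordinate of the exon's first CDS base in transcript order; both are assumptions about the map's semantics, and the coordinates below are invented. On the minus strand, genomic positions decrease along the transcript and the bases are complemented, because g. descriptions refer to the forward strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def cds_to_genomic(cds_pos, exon_map):
    """Map a 1-based CDS position to a genomic coordinate."""
    index = cds_pos - 1  # convert to the map's assumed 0-based indexing
    for exon in exon_map:
        if exon["cds_start"] <= index < exon["cds_end"]:
            offset = index - exon["cds_start"]
            # strand is +1 or -1, so this walks up or down the chromosome.
            return exon["genomic_start"] + exon["strand"] * offset
    raise ValueError(f"CDS position {cds_pos} is outside the exon map")


def c_sub_to_g(cds_pos, ref, alt, exon_map):
    """Convert a coding substitution to a g. description (no accession)."""
    g_pos = cds_to_genomic(cds_pos, exon_map)
    if exon_map[0]["strand"] == -1:
        ref, alt = ref.translate(COMPLEMENT), alt.translate(COMPLEMENT)
    return f"g.{g_pos}{ref}>{alt}"


# Invented two-exon gene on the minus strand.
toy_map = [
    {"exon": 1, "cds_start": 0, "cds_end": 3, "genomic_start": 1000, "strand": -1},
    {"exon": 2, "cds_start": 3, "cds_end": 6, "genomic_start": 900, "strand": -1},
]
print(c_sub_to_g(1, "A", "G", toy_map))  # g.1000T>C
print(c_sub_to_g(4, "G", "A", toy_map))  # g.900C>T
```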

9. Translation

Translate coding variants to their protein consequences.

| Function | Description |
| --- | --- |
| translate_variant(c_hgvs, gene) | Coding → protein HGVS string |
| translate_variants(c_hgvs_list, gene) | Batch translation for multiple c.HGVS strings |
| get_protein_effect(c_hgvs, gene) | Effect dict: effect_type, position, ref_aa, alt_aa, 1-letter + 3-letter p.HGVS |

>>> varsim.translate_variant("NM_000207.3:c.1A>G", "INS")
'NP_000198.1:p.(M1?)'
>>> eff = varsim.get_protein_effect("NM_000207.3:c.4A>G", "INS")
>>> eff["effect_type"]
'missense'
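
The classification of a coding substitution can be sketched by translating the affected codon before and after the change. The dict keys follow those listed for `get_protein_effect`, but the function name `protein_effect`, the toy sequences, and the labels `start_lost` and `nonsense` are assumptions; a stop codon changing to an amino acid is not treated specially here.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def protein_effect(cds, pos, alt):
    """Classify the substitution of base `pos` (1-based) in `cds` with `alt`."""
    index = pos - 1
    frame = index % 3                     # position within the codon (0, 1, 2)
    codon_start = index - frame
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:frame] + alt + ref_codon[frame + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    aa_pos = codon_start // 3 + 1
    if alt_aa == ref_aa:
        effect = "silent"
    elif aa_pos == 1:
        effect = "start_lost"             # assumed label for a changed initiator codon
    elif alt_aa == "*":
        effect = "nonsense"
    else:
        effect = "missense"
    return {"effect_type": effect, "position": aa_pos,
            "ref_aa": ref_aa, "alt_aa": alt_aa}


# Toy CDS encoding Met-Ala-stop.
print(protein_effect("ATGGCCTGA", 4, "A")["effect_type"])  # missense
```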

API Reference

| Category | Function | Brief |
| --- | --- | --- |
| Simulation | cds(gene) / utr5(gene) / utr3(gene) | SNVs for CDS, 5′UTR, 3′UTR → c.HGVS (plus p.HGVS for CDS) |
| Simulation | splice_site(gene) / aa_sub(gene) / codon_sub(gene) | Splice-site SNVs / amino acid substitutions / codon substitutions |
| Simulation | missense(gene) / frameshift(gene) | Codon variants with protein effect / frameshift indels |
| Parsing | parse(hgvs) / validate(hgvs) / is_valid(hgvs, ref_seq?) | Parse → HGVSTag / detailed issues / bool check |
| Normalization | normalize(hgvs, ref_seq?) / normalize_3prime_shift(...) / ins_to_dup(...) | Full normalization / 3′-shift / ins→dup |
| Backtranslation | backtranslate(gene, p_hgvs) / backtranslate_protein(p_hgvs) | Protein → nucleotide via CDS / codon table |
| Conversion | hgvs_to_vcf(...) / vcf_to_hgvs(...) / hgvs_to_spdi(...) / spdi_to_hgvs(...) | HGVS ↔ VCF ↔ SPDI |
| Conversion | c_to_p(c_hgvs, gene) | Coding HGVS → protein HGVS |
| Extraction | extract(ref, obs, acc?, prefix?) | Diff two sequences → HGVS |
| Liftover | liftover_g_to_assembly(hgvs, target?) / liftover_transcript(gene, c_hgvs, target?) | Assembly liftover / transcript→genomic→liftover |
| Transcription | c_to_g(c_hgvs, gene) / g_to_c(g_hgvs, gene) / get_cds_exon_map(gene) | Coding ↔ genomic / exon structure |
| Translation | translate_variant(c_hgvs, gene) / get_protein_effect(c_hgvs, gene) | c.HGVS → p.HGVS / detailed effect dict |

License

MIT License

