Universal Variant ID - compact 128-bit identifiers for human genetic variation

uvid

Compact 128-bit Universal Variant IDs for human genetic variation.

Encode any human genomic variant -- SNP, indel, MNV -- into a deterministic 128-bit identifier that sorts in natural genomic order. No central authority, no database round-trips, no coordination required.

from uvid import UVID

uvid = UVID.encode("chr1", 100, "A", "G", "GRCh38")
print(uvid)  # 00000064-40000001-00000000-00000006

Read the docs for the full API reference, the normalization guide, and bit-layout details.


Why UVID?

Genomic variant databases typically assign arbitrary integer or string IDs to variants, requiring a round-trip to the database to discover whether a variant already exists before inserting it. UVID eliminates that problem: the ID is computed deterministically from the variant itself, so identical variants always receive identical IDs regardless of where or when they are encoded.
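To make the round-trip point concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not part of uvid: the table name `variants` and the helper `to_blob` are hypothetical, and the IDs are simply the hex strings that appear elsewhere in this README. Because the key is known before the database is touched, duplicates are dropped by the primary-key constraint in one pass.

```python
import sqlite3


def to_blob(uvid_hex: str) -> bytes:
    """Convert a hyphenated 128-bit hex ID into 16 big-endian bytes (hypothetical helper)."""
    raw = bytes.fromhex(uvid_hex.replace("-", ""))
    if len(raw) != 16:
        raise ValueError(f"expected 128 bits, got {len(raw) * 8}")
    return raw


# IDs as they might arrive from two overlapping VCFs (one is a duplicate).
incoming = [
    "00003039-40000001-00000000-00000006",
    "00000064-40000001-00000000-00000006",
    "00003039-40000001-00000000-00000006",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variants (id BLOB PRIMARY KEY)")

# No SELECT-before-INSERT: the primary key silently rejects repeats.
conn.executemany(
    "INSERT OR IGNORE INTO variants (id) VALUES (?)",
    [(to_blob(h),) for h in incoming],
)

# SQLite compares BLOBs byte by byte, so big-endian keys come back in integer order.
stored = [row[0].hex() for row in conn.execute("SELECT id FROM variants ORDER BY id")]
print(stored)
```

The same pattern should carry over to any store with a unique 16-byte key; only the upsert syntax changes.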

Property             Detail
-------------------  ------------------------------------------------------------
Deterministic        Same variant = same ID, anywhere, without coordination
Compact              16 bytes per variant (fits in a UUID column)
Sortable             Natural genomic order when compared as unsigned 128-bit integers
Streaming-friendly   ID is known before database interaction -- bulk upsert in a single pass
Shard-friendly       ID-driven partitioning for distributed variant stores
UUIDv5 compatible    Deterministic SHA-1 mapping for systems that expect standard UUIDs
Sequence-searchable  Alleles up to 20 bp are stored exactly; longer alleles keep length + a 17-bit Rabin fingerprint
HGVS support         Bidirectional conversion between HGVS genomic notation (g./m.) and UVIDs
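The "Compact" and "Sortable" properties can be tried with the standard library alone. The sketch below does not call uvid; it reuses two ID strings from this README and Python's `uuid` module. The namespace passed to `uuid5` is a placeholder chosen for illustration, since the namespace uvid actually uses is not stated here.

```python
import uuid

ids = [
    "00003039-40000001-00000000-00000006",
    "00000064-40000001-00000000-00000006",
]

# uuid.UUID ignores hyphen placement, so the 8-8-8-8 form parses directly
# into a 128-bit value that fits any UUID column.
as_uuids = [uuid.UUID(h) for h in ids]

# UUID objects compare by their 128-bit integer value.
ordered = sorted(as_uuids)
print([str(u) for u in ordered])

# Fixed-width lowercase hex strings sort the same way as the integers.
assert sorted(ids) == sorted(ids, key=lambda h: int(h.replace("-", ""), 16))

# Placeholder namespace (assumption): uvid's real namespace is not documented here.
EXAMPLE_NAMESPACE = uuid.NAMESPACE_OID
v5 = uuid.uuid5(EXAMPLE_NAMESPACE, ids[0])
print(v5, v5.version)
```

Reinterpreting the raw 128 bits as a UUID does not set the standard version and variant bits, which is presumably why a separate UUIDv5 mapping exists for systems that validate them.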

128-bit layout (MSB to LSB)

 127       96 95 94  93  92         47  46  45          0
+------------+-----+----+-------------+----+-------------+
|  position  | as  | rm |     REF     | am |     ALT     |
|    (32)    | (2) |(1) |    (46)     |(1) |    (46)     |
+------------+-----+----+-------------+----+-------------+

Bits    Width  Field
------  -----  ------------------------------------
127-96  32     Linearized genome position
95-94   2      Assembly (0 = GRCh37, 1 = GRCh38)
93      1      REF mode (0 = string, 1 = length)
92-47   46     REF payload
46      1      ALT mode (0 = string, 1 = length)
45-0    46     ALT payload
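The field boundaries in the table translate directly into shifts and masks. The following sketch is independent of the uvid package: `Fields`, `unpack`, and `pack` are hypothetical names, and the payloads are left as opaque integers because only the outer layout is being illustrated.

```python
from typing import NamedTuple

MASK46 = (1 << 46) - 1


class Fields(NamedTuple):
    position: int     # bits 127-96
    assembly: int     # bits 95-94
    ref_mode: int     # bit 93
    ref_payload: int  # bits 92-47
    alt_mode: int     # bit 46
    alt_payload: int  # bits 45-0


def unpack(uvid_hex: str) -> Fields:
    """Split a hyphenated 128-bit hex ID into the documented fields."""
    value = int(uvid_hex.replace("-", ""), 16)
    if value >> 128:
        raise ValueError("value does not fit in 128 bits")
    return Fields(
        position=value >> 96,
        assembly=(value >> 94) & 0b11,
        ref_mode=(value >> 93) & 1,
        ref_payload=(value >> 47) & MASK46,
        alt_mode=(value >> 46) & 1,
        alt_payload=value & MASK46,
    )


def pack(f: Fields) -> int:
    """Inverse of unpack: reassemble the 128-bit integer."""
    return (
        (f.position << 96)
        | (f.assembly << 94)
        | (f.ref_mode << 93)
        | (f.ref_payload << 47)
        | (f.alt_mode << 46)
        | f.alt_payload
    )


fields = unpack("00000064-40000001-00000000-00000006")
print(fields.position, fields.assembly)  # 100 1
```

For the example ID from the top of this README, the leading word decodes to position 100 and the assembly bits to 1 (GRCh38), matching the encoded variant.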

Each allele payload is independently encoded in one of two modes:

  • String mode (mode=0): a 5-bit length plus 40 bits of DNA at 2 bits per base. Stores up to 20 bases exactly.
  • Length mode (mode=1): a 28-bit length plus a 17-bit Rabin fingerprint. Used for sequences longer than 20 bases or containing non-ACGT characters.
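As a rough illustration of string mode, the sketch below packs an allele into a 5-bit length and 40 bits of 2-bit bases, and decides when an allele would have to fall back to length mode. Several details are assumptions made for the example and are not taken from uvid: the base codes (A=0, C=1, G=2, T=3), the placement of the length above the DNA bits, and left-aligning the bases. It should not be expected to reproduce the library's actual payload values, and the Rabin fingerprint is not sketched.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed codes, not confirmed by uvid
BITS_TO_BASE = "ACGT"
MAX_STRING_LEN = 20  # 40 bits / 2 bits per base


def fits_string_mode(allele: str) -> bool:
    """True if the allele is at most 20 bases and contains only A, C, G, T."""
    return len(allele) <= MAX_STRING_LEN and all(b in BASE_TO_BITS for b in allele)


def pack_string_mode(allele: str) -> int:
    """Pack an allele as [5-bit length][40-bit DNA] (assumed field order)."""
    if not fits_string_mode(allele):
        raise ValueError("allele needs length mode")
    dna = 0
    for base in allele:
        dna = (dna << 2) | BASE_TO_BITS[base]
    # Left-align the bases inside the 40-bit field (assumption).
    dna <<= 2 * (MAX_STRING_LEN - len(allele))
    return (len(allele) << 40) | dna


def unpack_string_mode(payload: int) -> str:
    """Recover the exact sequence from a payload built by pack_string_mode."""
    length = (payload >> 40) & 0x1F
    dna = payload & ((1 << 40) - 1)
    return "".join(
        BITS_TO_BASE[(dna >> (2 * (MAX_STRING_LEN - 1 - i))) & 0b11]
        for i in range(length)
    )


payload = pack_string_mode("GATTACA")
print(unpack_string_mode(payload))  # GATTACA
```

A 5-bit length can count up to 31, but only 20 bases fit in 40 bits, which is where the 20-base limit comes from.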

Limitations

  • Two assemblies supported: GRCh37/hg19 and GRCh38/hg38 (2 reserved slots remain).
  • Alleles longer than 20 bases cannot be decoded from the ID alone, because only their length and fingerprint are stored; the original sequence remains recoverable from the reference genome or the source VCF.
  • Focused on human genomics.

Installation

Requires Python 3.10+. Building from source also requires a Rust toolchain.

# From PyPI
uv pip install uvid

# From source
uv pip install .

# As a CLI tool
uv tool install uvid

Quick Start

CLI

# Encode a variant
uvid encode chr1 100 A G

# Decode a UVID
uvid decode 00000064-40000001-00000000-00000006

# HGVS encode -- convert HGVS notation to UVID
uvid hgvs-encode "NC_000001.11:g.12345A>G"

# HGVS decode -- convert UVID back to HGVS notation
uvid hgvs-decode 00003039-40000001-00000000-00000006

# Annotate a VCF with UVIDs in the ID column
uvid vcf input.vcf output.vcf -a GRCh38

# Add a VCF to a .uvid collection
uvid add collection.uvid sample.vcf

# Search by region
uvid search collection.uvid --sample sample__NA12878 --chr chr1 --start 10000 --end 20000

# Collection info
uvid info collection.uvid
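For readers unfamiliar with the notation passed to `uvid hgvs-encode`, the sketch below pulls the pieces out of a simple genomic substitution. It is a stand-alone illustration, not uvid's parser: `parse_hgvs_substitution` and the `ACCESSIONS` table are hypothetical, only chromosome 1 is listed, and insertions, deletions, and m. coordinates are deliberately left out.

```python
import re

# RefSeq accession -> (chromosome, assembly). Only chromosome 1 is listed;
# a real table would cover every chromosome in both assemblies.
ACCESSIONS = {
    "NC_000001.10": ("chr1", "GRCh37"),
    "NC_000001.11": ("chr1", "GRCh38"),
}

# Matches e.g. "NC_000001.11:g.12345A>G" (single-base substitution only).
SUBSTITUTION = re.compile(
    r"^(?P<acc>[A-Z]{2}_\d+\.\d+):g\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_hgvs_substitution(hgvs: str) -> tuple[str, int, str, str, str]:
    """Return (chromosome, position, ref, alt, assembly) for a g. substitution."""
    match = SUBSTITUTION.match(hgvs)
    if match is None:
        raise ValueError(f"not a simple g. substitution: {hgvs!r}")
    try:
        chrom, assembly = ACCESSIONS[match["acc"]]
    except KeyError:
        raise ValueError(f"unknown accession: {match['acc']}") from None
    return chrom, int(match["pos"]), match["ref"], match["alt"], assembly


print(parse_hgvs_substitution("NC_000001.11:g.12345A>G"))
# ('chr1', 12345, 'A', 'G', 'GRCh38')
```

The accession version carries the assembly (NC_000001.10 is GRCh37, NC_000001.11 is GRCh38), so the HGVS string alone supplies every input that `uvid encode` takes.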

Python

from uvid import UVID, Collection, hgvs_to_uvid, uvid_to_hgvs, vcf_passthrough

# Encode / decode
uvid = UVID.encode("chr1", 100, "A", "G", "GRCh38")
fields = uvid.decode()
# {'chr': '1', 'pos': 100, 'ref': 'A', 'alt': 'G', 'assembly': 'GRCh38', ...}

# HGVS conversion
uvid = hgvs_to_uvid("NC_000001.11:g.12345A>G")
hgvs_str, warnings = uvid_to_hgvs(uvid.to_hex())
# hgvs_str = "NC_000001.11:g.12345A>G"

# UUIDv5 conversion
print(uvid.uuid5())  # deterministic UUID

# Range queries
lower, upper = UVID.range("chr1", 10000, 20000, "GRCh38")

# .uvid collections (DuckDB-backed)
store = Collection("my_variants.uvid")
store.add_vcf("sample.vcf", "GRCh38")
results = store.search_region("sample__NA12878", "chr1", 10000, 20000)

# VCF passthrough -- stamp UVIDs into the ID column
count = vcf_passthrough("input.vcf", "output.vcf", assembly="GRCh38")
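Range queries follow from the layout: the position occupies the top 32 bits, so every ID within a genomic interval falls between two 128-bit bounds. The sketch below derives such bounds from the documented layout only. It is an assumption about how `UVID.range` might work, not its implementation; `linearize` and `range_bounds` are hypothetical, and the offset scheme (chr1 starts at 0, 1-based positions added directly) is inferred from the README example where chr1:100 yields a leading word of 0x00000064.

```python
# First three GRCh38 chromosome lengths; a full table would list every contig.
GRCH38_LENGTHS = {
    "chr1": 248_956_422,
    "chr2": 242_193_529,
    "chr3": 198_295_559,
}


def chromosome_offsets(lengths: dict[str, int]) -> dict[str, int]:
    """Cumulative start offset of each chromosome in the linearized genome."""
    offsets, running = {}, 0
    for chrom, length in lengths.items():
        offsets[chrom] = running
        running += length
    return offsets


OFFSETS = chromosome_offsets(GRCH38_LENGTHS)


def linearize(chrom: str, pos: int) -> int:
    """Map (chromosome, 1-based position) to a single genome-wide coordinate."""
    return OFFSETS[chrom] + pos


def range_bounds(chrom: str, start: int, end: int) -> tuple[int, int]:
    """Inclusive 128-bit bounds covering every ID with start <= pos <= end."""
    lower = linearize(chrom, start) << 96                   # all lower bits 0
    upper = (linearize(chrom, end) << 96) | ((1 << 96) - 1)  # all lower bits 1
    return lower, upper


lower, upper = range_bounds("chr1", 10000, 20000)
hit = int("00003039-40000001-00000000-00000006".replace("-", ""), 16)   # pos 12345
miss = int("00000064-40000001-00000000-00000006".replace("-", ""), 16)  # pos 100
print(lower <= hit <= upper, lower <= miss <= upper)  # True False
```

Under this layout the assembly bits sit below the position, so a pure integer range would match both assemblies; a store holding both would need an additional filter on the assembly field.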

Variant Normalization

uvid includes a built-in normalizer based on Tan et al. 2015 (the same algorithm used by bcftools and vt), so that different representations of the same variant receive the same ID:

# Normalize and encode in one pass
uvid vcf input.vcf output.vcf -a GRCh38 --normalize

See the normalization guide for details on reference genome setup.
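The idea behind the Tan et al. 2015 procedure can be shown in a few lines of Python. This is a simplified sketch for a single biallelic variant against an in-memory reference string, not uvid's Rust implementation; the function name `normalize` and the toy reference are illustrative only.

```python
def normalize(pos: int, ref: str, alt: str, reference: str) -> tuple[int, str, str]:
    """Left-align and trim one biallelic variant; pos is 1-based."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    changed = True
    while changed:
        changed = False
        # Step 1a: drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # Step 1b: if an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Step 2: drop shared leading bases while both alleles keep at least one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


REFERENCE = "GGGCACACACAGGG"  # toy reference with a CA repeat

# The same CA deletion written two different ways converges on one form.
print(normalize(8, "CAC", "C", REFERENCE))  # (3, 'GCA', 'G')
print(normalize(3, "GCA", "G", REFERENCE))  # (3, 'GCA', 'G')
```

Because both spellings of the deletion reduce to the same (position, REF, ALT) triple, they would be encoded from identical inputs after normalization.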


Architecture

                  Python (Typer CLI / library API)
                                 |
                              PyO3 FFI
                                 |
+----------+-----------+----------+------------+---------+---------+
| uvid128  | assembly  | vcf      | normalize  | store   | hgvs    |
| (encode/ | (chr      | (noodles | (Tan et    | (DuckDB | (HGVS   |
|  decode) |  offsets) |  parser) |  al. 2015) |  I/O)   |  g./m.) |
+----------+-----------+----------+------------+---------+---------+
                             Rust core

  • Rust core: UVID encoding/decoding, VCF parsing via noodles, variant normalization, DuckDB bulk I/O, HGVS notation support
  • Python bindings: PyO3 + maturin
  • CLI: Typer wrapping the native library

License

Apache-2.0
