VRS Generator for mapping variants between different Variant Database Tables

VRS ID Generator

A Python package for generating and parsing VRS (Variant Representation Specification) identifiers for genomic variants.

This package is inspired by the GA4GH Variant Representation Specification (VRS) and provides a similar way to identify genomic variants uniquely and precisely. It is not a direct implementation of VRS, however, and the IDs it generates may differ from those produced by the official specification.

Installation

# Install from PyPI
pip install aiva-vrs

# Install from local directory in development mode
pip install -e .

Features

Generate consistent VRS identifiers for genomic variants
Parse VRS identifiers to extract components
Validate VRS identifiers
Build database queries for variant lookup
Normalize chromosome representations
Modeled on the GA4GH VRS standard (generated IDs are not guaranteed to match official VRS IDs)

Why Use VRS IDs?

The Problem with Traditional Variant Representation

Traditionally, genomic variants are represented using a combination of:

Chromosome (e.g., "chr7" or "7")
Position (e.g., 55174772)
Reference allele (e.g., "G")
Alternate allele (e.g., "A")

This approach has several significant challenges:

Inconsistent Representation: Different tools may represent the same variant differently:
- Chromosome format inconsistencies (e.g., "chr7" vs "7")
- Different normalization of complex variants
- Representation of insertions/deletions varies between tools
Database Querying Complexity:
- Every lookup has to match on four separate fields
- Requires composite indexes and multi-column join conditions
- Difficult to maintain consistency across different data sources
Cross-Reference Challenges:
- Matching variants between datasets is error-prone
- No single identifier to track a variant across systems
- Difficult to integrate data from multiple sources
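
The matching problem can be seen with plain tuples. The following is a minimal sketch, not part of aiva-vrs: `normalize_chromosome` and `normalized` are hypothetical helpers that assume a simple "strip the chr prefix and upper-case" rule.

```python
def normalize_chromosome(chrom: str) -> str:
    """Hypothetical helper: drop a leading 'chr' and upper-case the rest."""
    chrom = chrom.strip()
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return chrom.upper()


def normalized(variant):
    """Return the variant tuple with every field in one canonical form."""
    chrom, pos, ref, alt = variant
    return (normalize_chromosome(chrom), int(pos), ref.upper(), alt.upper())


# The same variant as reported by two hypothetical tools.
tool_a = ("chr7", 55174772, "G", "A")
tool_b = ("7", 55174772, "G", "A")

print(tool_a == tool_b)                          # False: raw tuples differ
print(normalized(tool_a) == normalized(tool_b))  # True
```

Every consumer of the four-field form has to apply the same normalization before comparing, which is the step a single identifier is meant to remove.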

The VRS ID Solution

VRS IDs solve these problems by:

Single Consistent Identifier: Each variant gets a unique, stable identifier
Deterministic Generation: The same variant always gets the same ID
Self-Contained Information: The chromosome is encoded in the ID
Efficient Database Operations: Query by a single field instead of four
Simplified Data Integration: Easily match variants across different datasets
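
Deterministic generation can be illustrated with a content hash. The sketch below is not the algorithm aiva-vrs uses: `sketch_variant_id` and its payload layout are hypothetical. It borrows only the ID shape shown in this README and a GA4GH-style digest (SHA-512 truncated to 24 bytes, then base64url-encoded).

```python
import base64
import hashlib


def sketch_variant_id(chrom, pos, ref, alt, assembly="GRCh38"):
    """Illustrative only: NOT the algorithm used by aiva-vrs or GA4GH VRS."""
    chrom = chrom[3:] if chrom.lower().startswith("chr") else chrom
    chrom = chrom.upper()
    # Serialize the fields in a fixed order so equal variants hash equally.
    payload = f"{assembly}:{chrom}:{int(pos)}:{ref.upper()}:{alt.upper()}"
    # 24 bytes of SHA-512, base64url-encoded: 32 characters, no padding.
    digest = hashlib.sha512(payload.encode("ascii")).digest()[:24]
    token = base64.urlsafe_b64encode(digest).decode("ascii")
    return f"ga4gh:VA:{chrom}:{token}"


id_a = sketch_variant_id("chr7", 55174772, "G", "A")
id_b = sketch_variant_id("7", "55174772", "g", "a")
print(id_a == id_b)  # True: both spellings normalize to the same payload
```

Because the ID is a pure function of the normalized fields, two systems that never exchange data can still arrive at the same key.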

Basic Usage

from aiva_vrs import generate_vrs_id, parse_vrs_id, is_valid_vrs_id

# Generate a VRS ID
vrs_id = generate_vrs_id("chr7", 55174772, "GGAATTAAGAGAAGC", "", assembly="GRCh38")
print(vrs_id)  # ga4gh:VA:7:v9TQXvNOQeG1vNRVJCWlD_a1tRf_m2AP

# Validate a VRS ID
is_valid = is_valid_vrs_id(vrs_id)
print(is_valid)  # True

# Parse a VRS ID
components = parse_vrs_id(vrs_id)
print(components)  # {'chromosome': '7', 'digest': 'v9TQXvNOQeG1vNRVJCWlD_a1tRf_m2AP', 'type': 'VA'}

Database Structure and Integration

Why Chromosome-Based Tables?

Genomic variant databases often store variants in one table per chromosome. This design provides several benefits:

Performance: A query for variants on a specific chromosome touches only that chromosome's table instead of the full variant set
Scalability: Allows for parallel processing and sharding of data
Maintenance: Easier to manage and update data for specific chromosomes
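
One way to take advantage of per-chromosome tables is to group a batch of IDs by chromosome before querying, so each table is visited once. This is a rough sketch that assumes IDs shaped like `ga4gh:VA:<chromosome>:<digest>`, as in Basic Usage; `group_ids_by_chromosome` is a hypothetical helper, not an aiva-vrs function.

```python
from collections import defaultdict


def group_ids_by_chromosome(ids):
    """Group IDs shaped like 'ga4gh:VA:<chromosome>:<digest>' by chromosome."""
    groups = defaultdict(list)
    for variant_id in ids:
        parts = variant_id.split(":")
        if len(parts) != 4:
            raise ValueError(f"Unexpected ID format: {variant_id!r}")
        groups[parts[2]].append(variant_id)
    return dict(groups)


# Placeholder digests, for illustration only.
batch = [
    "ga4gh:VA:7:aaaa",
    "ga4gh:VA:X:bbbb",
    "ga4gh:VA:7:cccc",
]
for chromosome, ids in group_ids_by_chromosome(batch).items():
    print(chromosome, len(ids))
```

Each group can then be sent to its own table, and the groups are independent enough to be processed in parallel.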

Database Schema

A typical genomic database might include these key tables:

-- One table per chromosome for variants
CREATE TABLE public.variants_chr1 (
    id TEXT PRIMARY KEY,           -- VRS ID (ga4gh:VA:...)
    chromosome TEXT NOT NULL,      -- Normalized chromosome (e.g., "1")
    position INTEGER NOT NULL,     -- Genomic position
    reference_allele TEXT NOT NULL, -- Reference allele
    alternate_allele TEXT NOT NULL, -- Alternate allele
    -- Additional fields...
);

-- Similar tables for other chromosomes (variants_chr2, variants_chr3, etc.)

-- Transcript consequences
CREATE TABLE public.transcript_consequences (
    id TEXT PRIMARY KEY,
    variant_id TEXT NOT NULL,      -- References a VRS ID
    transcript_id TEXT NOT NULL,
    -- Additional fields...
);

-- Sample variants (associations between samples and variants)
CREATE TABLE public.sample_variants (
    sample_id TEXT NOT NULL,
    variant_id TEXT NOT NULL,      -- References a VRS ID
    genotype TEXT,
    allele_frequency FLOAT,
    -- Additional fields...
    PRIMARY KEY (sample_id, variant_id)
);

-- Samples information
CREATE TABLE public.samples (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    variant_count INTEGER DEFAULT 0,
    -- Additional fields...
);

How VRS IDs Connect the Database

The VRS ID serves as the primary identifier for variants across all tables:

Each variant has a unique VRS ID that includes the chromosome in its structure
The VRS generator creates consistent IDs for the same variant, even from different sources
The chromosome component of the VRS ID determines which table to query
Sample-variant associations use the VRS ID to link samples to their variants

This design enables efficient queries and ensures data consistency across the system.
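
The linkage can be tried out without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module with a trimmed copy of the schema above; the ID value is a placeholder, and SQLite stands in for the real database only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variants_chr7 (
        id TEXT PRIMARY KEY,
        chromosome TEXT NOT NULL,
        position INTEGER NOT NULL,
        reference_allele TEXT NOT NULL,
        alternate_allele TEXT NOT NULL
    );
    CREATE TABLE sample_variants (
        sample_id TEXT NOT NULL,
        variant_id TEXT NOT NULL,
        genotype TEXT,
        PRIMARY KEY (sample_id, variant_id)
    );
""")

variant_id = "ga4gh:VA:7:example-digest"  # placeholder, not a real ID
conn.execute(
    "INSERT INTO variants_chr7 VALUES (?, ?, ?, ?, ?)",
    (variant_id, "7", 55174772, "G", "A"),
)
conn.executemany(
    "INSERT INTO sample_variants VALUES (?, ?, ?)",
    [("sample-1", variant_id, "0/1"), ("sample-2", variant_id, "1/1")],
)

# One key joins the sample associations to the per-chromosome variant table.
rows = conn.execute(
    """
    SELECT sv.sample_id, sv.genotype, v.position
    FROM sample_variants sv
    JOIN variants_chr7 v ON v.id = sv.variant_id
    WHERE sv.variant_id = ?
    ORDER BY sv.sample_id
    """,
    (variant_id,),
).fetchall()
print(rows)
conn.close()
```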

Database Integration Examples

Example 1: Fetch a variant from the database

import psycopg2
from aiva_vrs import build_variant_query

# Connect to the database
conn = psycopg2.connect("dbname=genomics_db user=postgres")
cursor = conn.cursor()

# VRS ID to look up
vrs_id = "ga4gh:VA:7:v9TQXvNOQeG1vNRVJCWlD_a1tRf_m2AP"

# Build the query
query, params = build_variant_query(vrs_id)

# Execute the query
cursor.execute(query, params)
variant = cursor.fetchone()

# Process the result
if variant:
    print(f"Found variant: {variant}")
else:
    print(f"Variant not found: {vrs_id}")

# Close the connection
cursor.close()
conn.close()

Example 2: Using with SQLAlchemy

from sqlalchemy import create_engine, text
from aiva_vrs import get_chromosome_from_vrs_id, get_sql_table_for_variant

# Create engine
engine = create_engine("postgresql://postgres:password@localhost/genomics_db")

# VRS ID to look up
vrs_id = "ga4gh:VA:7:v9TQXvNOQeG1vNRVJCWlD_a1tRf_m2AP"

# Get table name and chromosome
table_name = get_sql_table_for_variant(vrs_id)
chromosome = get_chromosome_from_vrs_id(vrs_id)

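# Build and execute query (the table name is interpolated into the SQL text
# because identifiers cannot be passed as bound parameters)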
with engine.connect() as connection:
    query = text(f"""
        SELECT v.*, tc.* 
        FROM public.{table_name} v
        LEFT JOIN public.transcript_consequences tc ON v.id = tc.variant_id
        WHERE v.id = :vrs_id
        AND v.chromosome = :chromosome
    """)
    
    result = connection.execute(query, {"vrs_id": vrs_id, "chromosome": chromosome})
    variants = result.fetchall()
    
    for variant in variants:
        print(f"Variant: {variant}")
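
Because Example 2 places the table name directly into the SQL text, it may be worth checking the chromosome against an allow-list before building the query. The sketch below is one possible guard; `safe_table_name` and `ALLOWED_CHROMOSOMES` are hypothetical names, and the lower-cased `variants_chr<chromosome>` naming is an assumption taken from the schema and Example 3, not a documented aiva-vrs contract.

```python
# Assumed set of chromosome labels; adjust to the labels your tables use.
ALLOWED_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def safe_table_name(variant_id: str) -> str:
    """Hypothetical helper: map an ID to 'variants_chr<chromosome>' or raise."""
    parts = variant_id.split(":")
    if len(parts) != 4 or parts[0] != "ga4gh" or parts[1] != "VA":
        raise ValueError(f"Not a recognized variant ID: {variant_id!r}")
    chromosome = parts[2].upper()
    if chromosome not in ALLOWED_CHROMOSOMES:
        raise ValueError(f"Unexpected chromosome: {parts[2]!r}")
    return f"variants_chr{chromosome.lower()}"


print(safe_table_name("ga4gh:VA:7:v9TQXvNOQeG1vNRVJCWlD_a1tRf_m2AP"))  # variants_chr7
```

Anything that is not on the allow-list raises `ValueError` before it can reach the query string.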

Example 3: Cloud Function Integration

import functions_framework
from google.cloud import bigquery
from aiva_vrs import parse_vrs_id

@functions_framework.http
def lookup_variant(request):
    """HTTP Cloud Function to look up a variant by VRS ID."""
    # Get VRS ID from request (get_json returns None when the body is not valid JSON)
    request_json = request.get_json(silent=True) or {}
    vrs_id = request_json.get('vrs_id')
    
    if not vrs_id:
        return {'error': 'No VRS ID provided'}, 400
    
    try:
        # Parse the VRS ID
        components = parse_vrs_id(vrs_id)
        chromosome = components['chromosome']
        
        # Set up BigQuery client
        client = bigquery.Client()
        
        # Query for the variant
        query = f"""
            SELECT *
            FROM `project.dataset.variants_chr{chromosome.lower()}`
            WHERE id = @vrs_id
        """
        
        job_config = bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("vrs_id", "STRING", vrs_id)
            ]
        )
        
        query_job = client.query(query, job_config=job_config)
        results = query_job.result()
        
        # Format results
        variants = [dict(row) for row in results]
        
        if not variants:
            return {'message': f'No variant found for VRS ID: {vrs_id}'}, 404
        
        return {'variants': variants}, 200
        
    except ValueError as e:
        return {'error': str(e)}, 400
    except Exception as e:
        return {'error': f'Internal error: {str(e)}'}, 500

Processing Variants from an OpenCRAVAT CSV

import csv
import gzip
import os

from aiva_vrs import generate_vrs_id

def process_opencravat_csv(csv_path, output_dir, assembly='GRCh38', compress=True):
    """Process an OpenCRAVAT CSV file and generate CSVs for database import."""
    # Choose the output path and opener based on the compression setting
    suffix = '.csv.gz' if compress else '.csv'
    variants_file = os.path.join(output_dir, f'variants{suffix}')
    open_out = gzip.open if compress else open

    # The with block closes both files even if a row fails;
    # newline='' lets the csv module control line endings
    with open(csv_path, 'r', newline='') as f, \
            open_out(variants_file, 'wt', newline='') as variants_out:
        reader = csv.DictReader(f, delimiter=',')

        # Write headers
        variants_writer = csv.writer(variants_out)
        variants_writer.writerow(['id', 'chromosome', 'position', 'reference_allele', 'alternate_allele'])

        # Process each row
        for row in reader:
            # Extract variant information
            chrom = row.get('Chrom', '')
            # Position as an integer, as in Basic Usage;
            # raises ValueError if the value is missing or non-numeric
            pos = int(row.get('Pos', ''))
            ref = row.get('Reference allele', '')
            alt = row.get('Alternate allele', '')

            # Generate VRS ID
            vrs_id = generate_vrs_id(chrom, pos, ref, alt, assembly=assembly)

            # Write variant data
            variants_writer.writerow([vrs_id, chrom, pos, ref, alt])
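
Real exports sometimes contain rows with a missing or malformed position. The sketch below shows one way to screen rows before generating IDs. It reuses the column names from the function above, while `clean_row` is a hypothetical helper and the "skip the row" policy is an assumption; raising an error may suit some pipelines better.

```python
import csv
import io


def clean_row(row):
    """Return (chrom, pos, ref, alt), or None if the row cannot be used."""
    chrom = (row.get("Chrom") or "").strip()
    pos_text = (row.get("Pos") or "").strip()
    ref = (row.get("Reference allele") or "").strip()
    alt = (row.get("Alternate allele") or "").strip()
    # Require a chromosome and a position made up only of decimal digits.
    if not chrom or not pos_text.isdecimal():
        return None
    return chrom, int(pos_text), ref, alt


# In-memory stand-in for an OpenCRAVAT export.
sample = io.StringIO(
    "Chrom,Pos,Reference allele,Alternate allele\n"
    "chr7,55174772,G,A\n"
    "chr7,,G,A\n"
    "chr1,not-a-number,C,T\n"
)
cleaned = [clean_row(r) for r in csv.DictReader(sample)]
usable = [c for c in cleaned if c is not None]
print(usable)  # [('chr7', 55174772, 'G', 'A')]
```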

Development

Running Tests

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

Contributing

Fork the repository
Create a feature branch
Make your changes
Run tests
Submit a pull request

License

MIT License

Version 0.1.0, released Apr 2, 2025
