Extract Rett Syndrome mutations from genetic diagnosis reports

RettX Mutation Analysis Library

Python 3.8+ | License: MIT

A Python library that extracts and validates genetic mutations from clinical reports using an AI-powered agentic pipeline. It covers six Rett Syndrome-related genes, accepts reports in multiple languages, and returns fully normalized HGVS nomenclature with genomic coordinates on both the GRCh37 and GRCh38 assemblies.

🚀 Quick Start

Installation

pip install rettxmutation

Basic Usage

import asyncio
from rettxmutation import RettxServices, DefaultConfig

async def extract():
    config = DefaultConfig()  # loads from .env / environment variables

    with RettxServices(config) as services:
        result = await services.agent_extraction_service.extract_mutations(
            "The patient carries the mutation NM_004992.4:c.916C>T (p.Arg306Cys) in MECP2."
        )

        for key, mutation in result.mutations.items():
            pt = mutation.primary_transcript
            print(f"Gene:       {pt.gene_id}")
            print(f"Transcript: {pt.hgvs_transcript_variant}")
            print(f"Protein:    {pt.protein_consequence_tlr}")
            print(f"Type:       {mutation.variant_type}")
            for assembly, coord in mutation.genomic_coordinates.items():
                print(f"{assembly}:    {coord.hgvs}  (pos {coord.start:,}–{coord.end:,})")

asyncio.run(extract())

CLI Example

python examples/extract_from_file.py path/to/genetic_report.txt --verbose

✨ Key Features

  • 🧬 Multi-Gene Support: MECP2, FOXG1, SLC6A1, CDKL5, EIF2B2, MEF2C, each with curated RefSeq transcripts
  • 🌍 Multilingual: Processes reports in English, Spanish, Greek, Turkish, and more
  • 🤖 Agentic Extraction: Azure OpenAI-powered agent with tool-calling (gene registry lookup, variant validation, complex variant handling)
  • ✅ HGVS Validation: Every mutation is validated via VariantValidator with automatic coordinate liftover (GRCh37 ↔ GRCh38)
  • 🔒 PHI Redaction: Automatic removal of personal health information before LLM processing
  • ⚡ Production Ready: Type-safe Pydantic v2 models, exponential backoff, connection pooling
  • 🔄 Dual Assembly Output: Every mutation includes genomic coordinates on both GRCh37 and GRCh38
  • 🏗️ Modular Architecture: Lazy-initialized services with dependency injection and context manager support

🧬 Supported Genes

| Gene   | Chromosome | Primary Transcript             | Condition                     |
|--------|------------|--------------------------------|-------------------------------|
| MECP2  | Xq28       | NM_004992.4 (+NM_001110792.2)  | Classic Rett Syndrome         |
| FOXG1  | 14q12      | NM_005249.5                    | Congenital variant Rett       |
| SLC6A1 | 3p25.3     | NM_003042.4                    | Myoclonic-atonic epilepsy     |
| CDKL5  | Xp22.13    | NM_001323289.2                 | CDKL5 deficiency disorder     |
| EIF2B2 | 14q24.3    | NM_014239.4                    | Vanishing white matter disease |
| MEF2C  | 5q14.3     | NM_002397.5 (+NM_001131005.2)  | MEF2C haploinsufficiency      |

📊 Output Structure

The ExtractionResult contains:

ExtractionResult
├── mutations: Dict[str, GeneMutation]    ← keyed by GRCh38 genomic HGVS
├── genes_detected: List[str]             ← e.g. ["MECP2"]
├── extraction_log: List[str]             ← agent reasoning trace
└── tool_calls_count: int                 ← total tool invocations

Each GeneMutation provides:

GeneMutation
├── genomic_coordinates:
│   ├── GRCh38: { assembly, hgvs, start, end, size }
│   └── GRCh37: { assembly, hgvs, start, end, size }
├── variant_type: "SNV" | "deletion" | "duplication" | "insertion" | "indel"
├── primary_transcript:
│   ├── gene_id, transcript_id
│   ├── hgvs_transcript_variant      ← e.g. NM_004992.4:c.916C>T
│   ├── protein_consequence_tlr      ← e.g. NP_004983.1:p.(Arg306Cys)
│   └── protein_consequence_slr      ← e.g. NP_004983.1:p.(R306C)
└── secondary_transcript (optional)
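
To make the shape above concrete, here is a minimal sketch that mirrors the two trees with plain dataclasses. It is not the library's own model code (this page says the real models are Pydantic v2); the class names GenomicCoordinate and TranscriptDetail are assumptions, as is the size value of 1 for an SNV. Only the field names and the example HGVS strings come from this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GenomicCoordinate:  # hypothetical class name
    assembly: str
    hgvs: str
    start: int
    end: int
    size: int


@dataclass
class TranscriptDetail:  # hypothetical class name
    gene_id: str
    transcript_id: str
    hgvs_transcript_variant: str
    protein_consequence_tlr: Optional[str] = None
    protein_consequence_slr: Optional[str] = None


@dataclass
class GeneMutation:
    genomic_coordinates: Dict[str, GenomicCoordinate]
    variant_type: str
    primary_transcript: TranscriptDetail
    secondary_transcript: Optional[TranscriptDetail] = None


@dataclass
class ExtractionResult:
    mutations: Dict[str, GeneMutation] = field(default_factory=dict)
    genes_detected: List[str] = field(default_factory=list)
    extraction_log: List[str] = field(default_factory=list)
    tool_calls_count: int = 0


# Example values reuse the c.916C>T variant shown on this page (GRCh38 only).
# size=1 is an assumed value for a single-nucleotide variant.
coord = GenomicCoordinate(
    "GRCh38", "NC_000023.11:g.154030912G>A", 154030912, 154030912, 1
)
mutation = GeneMutation(
    genomic_coordinates={"GRCh38": coord},
    variant_type="SNV",
    primary_transcript=TranscriptDetail(
        gene_id="MECP2",
        transcript_id="NM_004992.4",
        hgvs_transcript_variant="NM_004992.4:c.916C>T",
        protein_consequence_tlr="NP_004983.1:p.(Arg306Cys)",
        protein_consequence_slr="NP_004983.1:p.(R306C)",
    ),
)
# The mutations dict is keyed by the GRCh38 genomic HGVS string.
result = ExtractionResult(mutations={coord.hgvs: mutation}, genes_detected=["MECP2"])
```

Keying the dictionary by the GRCh38 genomic description means that two spellings of the same variant in a report would collapse to a single entry.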

🛠️ Requirements

Python Version

  • Python 3.8 or higher

Azure Services

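| Service                  | Required?   | Purpose                               |
|--------------------------|-------------|---------------------------------------|
| Azure OpenAI             | ✅ Required | Agentic mutation extraction           |
| Azure AI Search          | Optional    | Semantic search for keyword detection |
| Azure Cognitive Services | Optional    | Text analytics enrichment             |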

Environment Variables

# Required: Azure OpenAI
RETTX_OPENAI_ENDPOINT=https://your-openai.openai.azure.com/
RETTX_OPENAI_KEY=your-openai-key
RETTX_OPENAI_MODEL_NAME=gpt-4o           # deployment name

# Optional: Agent model (defaults to RETTX_OPENAI_MODEL_NAME if not set)
RETTX_OPENAI_AGENT_DEPLOYMENT=gpt-4o     # agent-specific deployment
RETTX_OPENAI_AGENT_MODEL_VERSION=2024-11-20

# Optional: Embeddings
RETTX_EMBEDDING_DEPLOYMENT=text-embedding-ada-002

# Optional: Azure AI Search
RETTX_AI_SEARCH_SERVICE=your-search-service
RETTX_AI_SEARCH_API_KEY=your-search-key
RETTX_AI_SEARCH_INDEX_NAME=your-index-name

# Optional: Azure Cognitive Services
RETTX_COGNITIVE_SERVICES_ENDPOINT=https://your-cognitive-services.cognitiveservices.azure.com/
RETTX_COGNITIVE_SERVICES_KEY=your-cognitive-services-key
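
Before starting an extraction it can help to check that the required settings are present. The following is a small sketch that uses only the standard library; the helper names missing_required and agent_deployment are hypothetical and not part of rettxmutation. The fallback rule for the agent deployment follows the comment in the listing above.

```python
import os
from typing import List, Mapping

# The three variables marked "Required" in the listing above.
REQUIRED_VARS = (
    "RETTX_OPENAI_ENDPOINT",
    "RETTX_OPENAI_KEY",
    "RETTX_OPENAI_MODEL_NAME",
)


def missing_required(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def agent_deployment(env: Mapping[str, str]) -> str:
    """Agent deployment name, falling back to the main model name if unset."""
    return env.get("RETTX_OPENAI_AGENT_DEPLOYMENT") or env["RETTX_OPENAI_MODEL_NAME"]


# Report (rather than raise) so the check is safe to run anywhere.
missing = missing_required(os.environ)
if missing:
    print("Missing required settings:", ", ".join(missing))
```

Passing the environment in as a mapping keeps both helpers easy to test with a plain dict.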

📋 Processing Pipeline

The agentic extraction pipeline works as follows:

Input Text (any language)
    │
    ▼
┌──────────────────────┐
│  PHI Redaction       │  Remove patient names, DOBs, IDs
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│  AI Agent (OpenAI)   │  Reads text, identifies mutations
│                      │
│  Tools available:    │
│  • lookup_gene_registry  → gene info + RefSeq transcripts
│  • validate_variant      → HGVS validation + coordinates
│  • validate_complex      → CNV / genomic coordinate validation
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│  Structured Output   │  ExtractionResult with validated
│                      │  GeneMutation objects
└──────────────────────┘
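
The first stage in the diagram, PHI redaction, can be illustrated with a deliberately simple regex sketch. This is not how rettxmutation performs redaction (this page does not describe the mechanism); the patterns, the placeholder tokens, and the function name redact are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns: numeric dates and a few labelled identity fields.
# Real reports need locale-aware and much broader rules than these.
DATE_RE = re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b")
LABELLED_RE = re.compile(
    r"(?im)^([ \t]*(?:patient(?: name)?|name|mrn|id)[ \t]*:[ \t]*).+$"
)


def redact(text: str) -> str:
    """Mask labelled identity fields and dates, leaving variant text intact."""
    # Keep the label (group 1) and replace only the value after the colon.
    text = LABELLED_RE.sub(lambda m: m.group(1) + "[REDACTED]", text)
    return DATE_RE.sub("[DATE]", text)


report = (
    "Patient name: Jane Doe\n"
    "DOB: 03/04/2015\n"
    "Finding: NM_004992.4:c.916C>T (p.Arg306Cys) in MECP2."
)
clean = redact(report)
```

The point of the sketch is the ordering: identifying details are masked first, and only the cleaned text would be handed to the language model, while the HGVS description passes through unchanged.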

Key capabilities:

  • Ensembl → RefSeq remapping: Handles reports that cite Ensembl transcripts (ENST*) by remapping them through genomic coordinates
  • Minus-strand awareness: Correctly complements alleles for genes on the reverse strand
  • Old nomenclature: Normalizes legacy formats (e.g., 502C->T, R168X) to current HGVS
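
Two of these capabilities are easy to illustrate in isolation. The sketch below is not the library's implementation; the function names reverse_complement and normalize_legacy are hypothetical, and the legacy converter handles only the two spellings quoted above. It does not attach a reference sequence or check numbering, so its output would still need validation.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# One-letter to three-letter amino acid codes; X and * are stop symbols.
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "X": "Ter", "*": "Ter",
}


def reverse_complement(seq: str) -> str:
    """Express an allele on the opposite strand (reverse, then complement)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def normalize_legacy(token: str) -> str:
    """Rewrite '502C->T' or 'R168X' style tokens in HGVS-like form."""
    dna = re.fullmatch(r"(\d+)([ACGT])->([ACGT])", token)
    if dna:
        pos, ref, alt = dna.groups()
        return f"c.{pos}{ref}>{alt}"
    protein = re.fullmatch(r"([A-Z])(\d+)([A-Z*])", token)
    if protein and protein.group(1) in AA3 and protein.group(3) in AA3:
        ref, pos, alt = protein.groups()
        return f"p.{AA3[ref]}{pos}{AA3[alt]}"
    raise ValueError(f"unrecognized legacy token: {token!r}")


# A C>T change on a minus-strand transcript reads as G>A on the genome,
# as in the c.916C>T / g.154030912G>A pair used elsewhere on this page.
genomic_ref, genomic_alt = reverse_complement("C"), reverse_complement("T")
```

For single bases the reversal is a no-op, but it matters for multi-base alleles such as insertions and indels.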

💻 Available Services

All services are lazily initialized via RettxServices:

with RettxServices(config) as services:
    # Core extraction
    services.agent_extraction_service    # AI-powered mutation extraction
    
    # Validation & analysis
    services.variant_validator_service   # HGVS validation + coordinate liftover
    services.mutation_tokenizator        # Mutation string tokenization
    
    # Search & embeddings
    services.embedding_service           # Azure OpenAI embeddings
    services.ai_search_service           # Azure AI Search integration
    services.keyword_detector_service    # Multi-layer keyword detection
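
Lazy initialization combined with a context manager can be sketched in a few lines. The class below is a hypothetical stand-in and not RettxServices itself; it only shows the pattern of building a service on first access, reusing it afterwards, and cleaning up on exit.

```python
from functools import cached_property


class LazyServices:  # hypothetical stand-in for illustration
    def __init__(self, config):
        self.config = config
        self.created = []  # records which services were actually built

    @cached_property
    def variant_validator_service(self):
        # Runs once, on first attribute access; the result is then cached.
        self.created.append("variant_validator_service")
        return object()  # placeholder for a real client

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # A real container would close HTTP sessions and clients here.
        self.created.clear()
        return False  # do not swallow exceptions


with LazyServices(config=None) as services:
    built_before = list(services.created)
    first = services.variant_validator_service
    second = services.variant_validator_service
    built_after = list(services.created)
```

With this pattern, a caller who only needs variant validation never pays the cost of constructing search or embedding clients.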

Direct Variant Validation

from rettxmutation import RettxServices, DefaultConfig

config = DefaultConfig()  # loads from .env / environment variables

with RettxServices(config) as services:
    vvs = services.variant_validator_service

    # Validate a transcript-level variant
    result = vvs.get_gene_mutation_from_transcript("NM_004992.4:c.916C>T")
    print(result.primary_transcript.protein_consequence_tlr)
    # → NP_004983.1:p.(Arg306Cys)

    # Validate a complex / genomic variant
    result = vvs.create_gene_mutation_from_complex_variant(
        assembly_build="GRCh38",
        assembly_refseq="NC_000023.11",
        variant_description="NC_000023.11:g.154030912G>A",
        gene_symbol="MECP2"
    )

Custom Configuration

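import asyncio
from rettxmutation import RettxServices

class MyConfig:
    """Custom configuration for production (e.g., from Key Vault)."""
    RETTX_OPENAI_ENDPOINT = "https://my-openai.openai.azure.com/"
    RETTX_OPENAI_KEY = get_secret("openai-key")  # get_secret: your own secret-retrieval helper
    RETTX_OPENAI_MODEL_NAME = "gpt-4o"
    # Only set fields needed for the services you use

async def extract(text):
    # await is only valid inside a coroutine, so wrap the call
    with RettxServices(MyConfig()) as services:
        return await services.agent_extraction_service.extract_mutations(text)

result = asyncio.run(extract(text))  # text: the report contents as a string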

🧪 Golden Test Suite

The library includes a comprehensive golden test suite with 11 real-world genetic reports:

| Gene   | Variant Type                         | Language       | Key Feature          |
|--------|--------------------------------------|----------------|----------------------|
| MECP2  | SNV, splicing, deletion, duplication | EN, ES, EL, TR | Multiple transcripts |
| FOXG1  | Frameshift deletion                  | TR             | Non-MECP2 gene       |
| SLC6A1 | Whole-gene CNV (~20 kb)              | ES             | Copy number variant  |

Run golden tests:

# Mock mode (no API calls, uses recorded responses)
python -m pytest tests/golden/ --golden-mode=mock -v

# Live mode (calls real APIs)
python -m pytest tests/golden/ --golden-mode=live -v
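
A custom flag such as --golden-mode is normally registered through pytest's pytest_addoption hook in a conftest.py file. The sketch below shows one plausible way to do that; it is not taken from the project's test suite, and the default of "mock" and the helper name golden_mode are assumptions.

```python
# conftest.py (hypothetical sketch; the project's real conftest may differ)


def pytest_addoption(parser):
    """Register the --golden-mode command-line flag with pytest."""
    parser.addoption(
        "--golden-mode",
        action="store",
        default="mock",  # assumed default
        choices=("mock", "live"),
        help="mock: replay recorded responses; live: call the real APIs",
    )


def golden_mode(config) -> str:
    """Read the selected mode from a pytest Config object."""
    return config.getoption("--golden-mode")
```

A fixture could then call golden_mode(request.config) and return either a recorded-response client or a real one.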

🎯 Use Cases

  • 🏥 Clinical Genetics: Extract mutations from diagnostic reports in any language
  • 🔬 Research: Analyze genetic data across Rett Syndrome and related conditions
  • 📊 Patient Registries: Populate genetic databases with normalized HGVS nomenclature
  • 🤖 Bioinformatics Pipelines: Integrate as a library or via the CLI example
  • 📱 Clinical Applications: Build tools with structured mutation data (dual-assembly coordinates)

🔧 Reliability

  • Exponential Backoff: Automatic retry for VariantValidator and OpenAI API calls
  • Graceful Degradation: Core extraction keeps working when optional services (AI Search, Cognitive Services) are not configured
  • PHI Redaction: Patient data is stripped before any LLM processing
  • Type Safety: Pydantic v2 models with runtime validation
  • Context Manager: Automatic resource cleanup via the with statement
  • Comprehensive Logging: Structured extraction logs with tool call traces
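
Exponential backoff, the first item in this list, follows a simple pattern: wait, double the wait after each failure, and give up after a fixed number of attempts. The helper below is a generic sketch under assumed parameters (retry count, delays, which exceptions count as transient); it is not the retry code used inside rettxmutation.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retries: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying transient errors with doubling delays plus jitter."""
    for attempt in range(retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # 0.5s, 1s, 2s, ... capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")
```

The sleep function is injectable so that the behaviour can be tested without real waiting, for example by passing a list's append method to record the delays.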

🤝 Contributing

We welcome contributions! Please see our GitHub repository for:

  • Issue reporting
  • Feature requests
  • Pull request guidelines
  • Development setup instructions

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔮 Roadmap

  • Additional Genes: Expand the gene registry beyond the current 6 genes
  • Batch Processing: Process multiple reports in parallel with rate limiting
  • Confidence Scoring: Per-mutation confidence metrics based on report quality
  • Structured Report Parsing: Native support for VCF, JSON, and HL7 FHIR formats
  • Cloud Deployment: Docker containers and Azure deployment templates
