Extract Rett Syndrome mutations from genetic diagnosis report
RettX Mutation Analysis Library
A Python library for extracting and validating genetic mutations from clinical reports using an AI-powered agentic pipeline. Supports 6 Rett Syndrome-related genes across multiple languages, returning fully normalized HGVS nomenclature with genomic coordinates on both GRCh37 and GRCh38 assemblies.
Quick Start
Installation
pip install rettxmutation
Basic Usage
import asyncio
from rettxmutation import RettxServices, DefaultConfig

async def extract():
    config = DefaultConfig()  # loads from .env / environment variables
    with RettxServices(config) as services:
        result = await services.agent_extraction_service.extract_mutations(
            "The patient carries the mutation NM_004992.4:c.916C>T (p.Arg306Cys) in MECP2."
        )
        for key, mutation in result.mutations.items():
            pt = mutation.primary_transcript
            print(f"Gene: {pt.gene_id}")
            print(f"Transcript: {pt.hgvs_transcript_variant}")
            print(f"Protein: {pt.protein_consequence_tlr}")
            print(f"Type: {mutation.variant_type}")
            for assembly, coord in mutation.genomic_coordinates.items():
                print(f"{assembly}: {coord.hgvs} (pos {coord.start:,}-{coord.end:,})")

asyncio.run(extract())
CLI Example
python examples/extract_from_file.py path/to/genetic_report.txt --verbose
Key Features
- Multi-Gene Support: MECP2, FOXG1, SLC6A1, CDKL5, EIF2B2, MEF2C, each with curated RefSeq transcripts
- Multilingual: Processes reports in English, Spanish, Greek, Turkish, and more
- Agentic Extraction: Azure OpenAI-powered agent with tool-calling (gene registry lookup, variant validation, complex variant handling)
- HGVS Validation: Every mutation is validated via VariantValidator with automatic coordinate liftover between GRCh37 and GRCh38
- PHI Redaction: Automatic removal of personal health information before LLM processing
- Production Ready: Type-safe Pydantic v2 models, exponential backoff, connection pooling
- Dual Assembly Output: Every mutation includes genomic coordinates on both GRCh37 and GRCh38
- Modular Architecture: Lazy-initialized services with dependency injection and context manager support
Supported Genes
| Gene | Chromosome | Primary Transcript | Condition |
|---|---|---|---|
| MECP2 | Xq28 | NM_004992.4 (+NM_001110792.2) | Classic Rett Syndrome |
| FOXG1 | 14q12 | NM_005249.5 | Congenital variant Rett |
| SLC6A1 | 3p25.3 | NM_003042.4 | Myoclonic-atonic epilepsy |
| CDKL5 | Xp22.13 | NM_001323289.2 | CDKL5 deficiency disorder |
| EIF2B2 | 14q24.3 | NM_014239.4 | Vanishing white matter disease |
| MEF2C | 5q14.3 | NM_002397.5 (+NM_001131005.2) | MEF2C haploinsufficiency |
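The table above maps naturally onto a small lookup structure. The following is only an illustrative sketch: `GeneEntry`, `GENE_REGISTRY`, and `lookup_gene` are hypothetical names, not the library's registry API, and the values are copied from the table.

```python
from typing import Dict, List, NamedTuple, Optional


class GeneEntry(NamedTuple):
    chromosome: str
    transcripts: List[str]  # primary transcript first, then any secondary one


# Hypothetical in-memory registry; values copied from the table above.
GENE_REGISTRY: Dict[str, GeneEntry] = {
    "MECP2": GeneEntry("Xq28", ["NM_004992.4", "NM_001110792.2"]),
    "FOXG1": GeneEntry("14q12", ["NM_005249.5"]),
    "SLC6A1": GeneEntry("3p25.3", ["NM_003042.4"]),
    "CDKL5": GeneEntry("Xp22.13", ["NM_001323289.2"]),
    "EIF2B2": GeneEntry("14q24.3", ["NM_014239.4"]),
    "MEF2C": GeneEntry("5q14.3", ["NM_002397.5", "NM_001131005.2"]),
}


def lookup_gene(symbol: str) -> Optional[GeneEntry]:
    """Case-insensitive lookup; returns None for genes outside the registry."""
    return GENE_REGISTRY.get(symbol.strip().upper())
```

A caller could use `lookup_gene("mecp2").transcripts[0]` to pick the primary transcript before building an HGVS string.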
Output Structure
The ExtractionResult contains:
ExtractionResult
├── mutations: Dict[str, GeneMutation]   # keyed by GRCh38 genomic HGVS
├── genes_detected: List[str]            # e.g. ["MECP2"]
├── extraction_log: List[str]            # agent reasoning trace
└── tool_calls_count: int                # total tool invocations
Each GeneMutation provides:
GeneMutation
├── genomic_coordinates:
│   ├── GRCh38: { assembly, hgvs, start, end, size }
│   └── GRCh37: { assembly, hgvs, start, end, size }
├── variant_type: "SNV" | "deletion" | "duplication" | "insertion" | "indel"
├── primary_transcript:
│   ├── gene_id, transcript_id
│   ├── hgvs_transcript_variant   # e.g. NM_004992.4:c.916C>T
│   ├── protein_consequence_tlr   # e.g. NP_004983.1:p.(Arg306Cys)
│   └── protein_consequence_slr   # e.g. NP_004983.1:p.(R306C)
└── secondary_transcript (optional)
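To make the shape of a `GeneMutation` concrete, here is a rough stand-in built from standard-library dataclasses. The library itself uses Pydantic v2 models, so this is not its implementation. `GenomicCoordinate` and `TranscriptMutation` are hypothetical class names, and `size=1` for an SNV is an assumption about how that field is defined.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GenomicCoordinate:  # hypothetical name
    assembly: str
    hgvs: str
    start: int
    end: int
    size: int


@dataclass
class TranscriptMutation:  # hypothetical name
    gene_id: str
    transcript_id: str
    hgvs_transcript_variant: str
    protein_consequence_tlr: Optional[str] = None
    protein_consequence_slr: Optional[str] = None


@dataclass
class GeneMutation:
    genomic_coordinates: Dict[str, GenomicCoordinate]  # keyed by assembly name
    variant_type: str
    primary_transcript: TranscriptMutation
    secondary_transcript: Optional[TranscriptMutation] = None


# Illustrative instance using the MECP2 example strings from this README.
example = GeneMutation(
    genomic_coordinates={
        "GRCh38": GenomicCoordinate(
            assembly="GRCh38",
            hgvs="NC_000023.11:g.154030912G>A",
            start=154030912,
            end=154030912,
            size=1,  # assumption: an SNV spans a single base
        ),
    },
    variant_type="SNV",
    primary_transcript=TranscriptMutation(
        gene_id="MECP2",
        transcript_id="NM_004992.4",
        hgvs_transcript_variant="NM_004992.4:c.916C>T",
        protein_consequence_tlr="NP_004983.1:p.(Arg306Cys)",
        protein_consequence_slr="NP_004983.1:p.(R306C)",
    ),
)
```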
Requirements
Python Version
- Python 3.8 or higher
Azure Services
| Service | Required? | Purpose |
|---|---|---|
| Azure OpenAI | Required | Agentic mutation extraction |
| Azure AI Search | Optional | Semantic search for keyword detection |
| Azure Cognitive Services | Optional | Text analytics enrichment |
Environment Variables
# Required: Azure OpenAI
RETTX_OPENAI_ENDPOINT=https://your-openai.openai.azure.com/
RETTX_OPENAI_KEY=your-openai-key
RETTX_OPENAI_MODEL_NAME=gpt-4o # deployment name
# Optional: Agent model (defaults to RETTX_OPENAI_MODEL_NAME if not set)
RETTX_OPENAI_AGENT_DEPLOYMENT=gpt-4o # agent-specific deployment
RETTX_OPENAI_AGENT_MODEL_VERSION=2024-11-20
# Optional: Embeddings
RETTX_EMBEDDING_DEPLOYMENT=text-embedding-ada-002
# Optional: Azure AI Search
RETTX_AI_SEARCH_SERVICE=your-search-service
RETTX_AI_SEARCH_API_KEY=your-search-key
RETTX_AI_SEARCH_INDEX_NAME=your-index-name
# Optional: Azure Cognitive Services
RETTX_COGNITIVE_SERVICES_ENDPOINT=https://your-cognitive-services.cognitiveservices.azure.com/
RETTX_COGNITIVE_SERVICES_KEY=your-cognitive-services-key
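The required/optional split and the agent-deployment fallback described in the comments above can be sketched as follows. This does not show how `DefaultConfig` works internally; `load_openai_settings` is a hypothetical helper that only illustrates the documented rule that `RETTX_OPENAI_AGENT_DEPLOYMENT` defaults to `RETTX_OPENAI_MODEL_NAME`.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED = ("RETTX_OPENAI_ENDPOINT", "RETTX_OPENAI_KEY", "RETTX_OPENAI_MODEL_NAME")


def load_openai_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the Azure OpenAI variables and apply the agent-deployment fallback."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"Missing required variables: {', '.join(missing)}")
    settings = {name: env[name] for name in REQUIRED}
    # The agent deployment falls back to the general model name when unset.
    settings["RETTX_OPENAI_AGENT_DEPLOYMENT"] = (
        env.get("RETTX_OPENAI_AGENT_DEPLOYMENT") or env["RETTX_OPENAI_MODEL_NAME"]
    )
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the fallback easy to check in a unit test.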
Processing Pipeline
The agentic extraction pipeline works as follows:
Input Text (any language)
           │
           ▼
┌──────────────────────┐
│    PHI Redaction     │  Remove patient names, DOBs, IDs
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│  AI Agent (OpenAI)   │  Reads text, identifies mutations
│                      │
│  Tools available:    │
│  • lookup_gene_registry  → gene info + RefSeq transcripts
│  • validate_variant      → HGVS validation + coordinates
│  • validate_complex      → CNV / genomic coordinate validation
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│  Structured Output   │  ExtractionResult with validated
│                      │  GeneMutation objects
└──────────────────────┘
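The first stage of the diagram, redaction before any text reaches the model, can be illustrated with a deliberately simple regex pass. This README does not describe how the library actually detects PHI, so the patterns and the `redact_phi` function below are hypothetical and far narrower than real PHI detection would need to be.

```python
import re

# Hypothetical patterns for illustration only.
_PHI_PATTERNS = [
    # Dates such as 03/04/2015 or 3.4.15
    (re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b"), "[DATE]"),
    # Record identifiers introduced by a label
    (re.compile(r"\b(?:MRN|Patient ID)[:\s]*\w+", re.IGNORECASE), "[ID]"),
    # Capitalized names introduced by a label
    (re.compile(r"\b(?:Patient|Name)[:\s]+[A-Z][a-z]+(?:\s[A-Z][a-z]+)+"), "[NAME]"),
]


def redact_phi(text: str) -> str:
    """Replace label-introduced names, dates, and IDs with placeholders."""
    for pattern, placeholder in _PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

The patterns are anchored on labels and date separators so that HGVS strings such as `NM_004992.4:c.916C>T` pass through unchanged.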
Key capabilities:
- Ensembl → RefSeq remapping: Handles reports using Ensembl transcripts (ENST*) by using genomic coordinates
- Minus-strand awareness: Correctly complements alleles for genes on the reverse strand
- Old nomenclature: Normalizes legacy formats (e.g., 502C->T, R168X) to current HGVS
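Two of the capabilities above lend themselves to a short illustration. The sketch below is not the library's code: `to_plus_strand` and `normalize_legacy` are hypothetical helpers, and their output is an unvalidated string that would still need a transcript reference and a VariantValidator check.

```python
import re
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# One-letter to three-letter amino acid codes; X and * both denote a stop.
_AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "X": "Ter", "*": "Ter",
}


def to_plus_strand(ref: str, alt: str, strand: int) -> Tuple[str, str]:
    """Express coding-strand alleles on the genomic plus strand."""
    if strand == -1:
        # Reverse-complement each allele for minus-strand genes.
        return ref.translate(_COMPLEMENT)[::-1], alt.translate(_COMPLEMENT)[::-1]
    return ref, alt


def normalize_legacy(token: str) -> str:
    """Rewrite '502C->T' as 'c.502C>T' and 'R168X' as 'p.Arg168Ter'."""
    token = token.strip()
    dna = re.fullmatch(r"(\d+)([ACGT])\s*-?>\s*([ACGT])", token)
    if dna:
        pos, ref, alt = dna.groups()
        return f"c.{pos}{ref}>{alt}"
    protein = re.fullmatch(r"([A-Z])(\d+)([A-Z*])", token)
    if protein and protein.group(1) in _AA3 and protein.group(3) in _AA3:
        ref, pos, alt = protein.groups()
        return f"p.{_AA3[ref]}{pos}{_AA3[alt]}"
    return token  # unrecognized input is returned unchanged
```

With MECP2 on the minus strand, `to_plus_strand("C", "T", -1)` gives `("G", "A")`, which matches the pairing of `c.916C>T` with a genomic `G>A` change used elsewhere in this README.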
Available Services
All services are lazily initialized via RettxServices:
with RettxServices(config) as services:
    # Core extraction
    services.agent_extraction_service   # AI-powered mutation extraction
    # Validation & analysis
    services.variant_validator_service  # HGVS validation + coordinate liftover
    services.mutation_tokenizator       # Mutation string tokenization
    # Search & embeddings
    services.embedding_service          # Azure OpenAI embeddings
    services.ai_search_service          # Azure AI Search integration
    services.keyword_detector_service   # Multi-layer keyword detection
Direct Variant Validation
with RettxServices(config) as services:
    vvs = services.variant_validator_service
    # Validate a transcript-level variant
    result = vvs.get_gene_mutation_from_transcript("NM_004992.4:c.916C>T")
    print(result.primary_transcript.protein_consequence_tlr)
    # → NP_004983.1:p.(Arg306Cys)
    # Validate a complex / genomic variant
    result = vvs.create_gene_mutation_from_complex_variant(
        assembly_build="GRCh38",
        assembly_refseq="NC_000023.11",
        variant_description="NC_000023.11:g.154030912G>A",
        gene_symbol="MECP2"
    )
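The genomic call above passes both `assembly_refseq` and a `variant_description` that begins with the same accession. As a rough illustration of how such a string breaks down, the sketch below parses a simple genomic substitution. `GenomicSNV` and `parse_genomic_snv` are hypothetical helpers, not part of the library, and the regex covers only single-base substitutions on `NC_` accessions.

```python
import re
from typing import NamedTuple


class GenomicSNV(NamedTuple):
    accession: str
    position: int
    ref: str
    alt: str


_SNV_RE = re.compile(r"^(NC_\d+\.\d+):g\.(\d+)([ACGT])>([ACGT])$")


def parse_genomic_snv(hgvs: str) -> GenomicSNV:
    """Split 'NC_000023.11:g.154030912G>A' into accession, position, ref, alt."""
    match = _SNV_RE.match(hgvs.strip())
    if match is None:
        raise ValueError(f"Not a simple genomic substitution: {hgvs!r}")
    accession, position, ref, alt = match.groups()
    return GenomicSNV(accession, int(position), ref, alt)
```

A caller could use the parsed `accession` to fill `assembly_refseq` instead of repeating it by hand.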
Custom Configuration
class MyConfig:
    """Custom configuration for production (e.g., from Key Vault)."""
    RETTX_OPENAI_ENDPOINT = "https://my-openai.openai.azure.com/"
    RETTX_OPENAI_KEY = get_secret("openai-key")  # get_secret: your own secret-store helper
    RETTX_OPENAI_MODEL_NAME = "gpt-4o"
    # Only set fields needed for the services you use

async def extract(text):
    # extract_mutations is awaited, so the call must live inside a coroutine
    with RettxServices(MyConfig()) as services:
        return await services.agent_extraction_service.extract_mutations(text)
Golden Test Suite
The library includes a comprehensive golden test suite with 11 real-world genetic reports:
| Gene | Variant Type | Language | Key Feature |
|---|---|---|---|
| MECP2 | SNV, splicing, deletion, duplication | EN, ES, EL, TR | Multiple transcripts |
| FOXG1 | Frameshift deletion | TR | Non-MECP2 gene |
| SLC6A1 | Whole-gene CNV (~20kb) | ES | Copy number variant |
Run golden tests:
# Mock mode (no API calls, uses recorded responses)
python -m pytest tests/golden/ --golden-mode=mock -v
# Live mode (calls real APIs)
python -m pytest tests/golden/ --golden-mode=live -v
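A custom flag such as `--golden-mode` is normally registered in a `conftest.py` through pytest's `pytest_addoption` hook. The sketch below shows one way that could look. It is an assumption about the setup, not a copy of this repository's test configuration; in particular, the `mock` default and the `golden_mode` helper are hypothetical.

```python
# tests/golden/conftest.py (sketch)

def pytest_addoption(parser):
    """Register the --golden-mode command-line flag with pytest."""
    parser.addoption(
        "--golden-mode",
        action="store",
        default="mock",            # assumption: recorded responses by default
        choices=("mock", "live"),
        help="mock: replay recorded responses; live: call the real APIs",
    )


def golden_mode(config) -> str:
    """Return the selected mode from a pytest Config object."""
    return config.getoption("--golden-mode")
```

A fixture can then call `golden_mode(request.config)` to decide whether to replay recorded responses or make real requests.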
Use Cases
- Clinical Genetics: Extract mutations from diagnostic reports in any language
- Research: Analyze genetic data across Rett Syndrome and related conditions
- Patient Registries: Populate genetic databases with normalized HGVS nomenclature
- Bioinformatics Pipelines: Integrate as a library or via the CLI example
- Clinical Applications: Build tools with structured mutation data (dual-assembly coordinates)
Reliability
- Exponential Backoff: Automatic retry for VariantValidator and OpenAI API calls
- Graceful Degradation: Optional services (AI Search, Cognitive Services) degrade gracefully
- PHI Redaction: Patient data is stripped before any LLM processing
- Type Safety: Pydantic v2 models with runtime validation
- Context Manager: Automatic resource cleanup via with statement
- Comprehensive Logging: Structured extraction logs with tool call traces
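The exponential backoff mentioned in the list above follows a common pattern: wait, double the wait after each failure, and give up after a fixed number of attempts. The sketch below is a generic version of that pattern, not the library's retry code; `retry_with_backoff` and all of its parameters and defaults are hypothetical.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient errors with doubling delays plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter keeps concurrent clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The injectable `sleep` argument lets tests record delays instead of actually waiting.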
Contributing
We welcome contributions! Please see our GitHub repository for:
- Issue reporting
- Feature requests
- Pull request guidelines
- Development setup instructions
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
- Issues: GitHub Issues
- Documentation: API Documentation
- Contact: procha@rettsyndrome.eu
Roadmap
- Additional Genes: Expand the gene registry beyond the current 6 genes
- Batch Processing: Process multiple reports in parallel with rate limiting
- Confidence Scoring: Per-mutation confidence metrics based on report quality
- Structured Report Parsing: Native support for VCF, JSON, and HL7 FHIR formats
- Cloud Deployment: Docker containers and Azure deployment templates