Extract Rett Syndrome mutations from genetic diagnosis reports
rettxmutation - RettX Mutation Analysis Library
Purpose
- Analyze genetic documents systematically to:
  - Extract and identify MECP2 mutations.
  - Normalize mutation data for downstream applications.
  - Output structured results with confidence scores for decision-making.
Features
1. Flexible Workflow
The library supports several ways of running an analysis:
- Batch Processing: Process multiple files in a single run.
- Single File Analysis: Handle individual files, triggered by:
  - File uploads.
  - Scheduled tasks.
  - API calls.
- Input Types:
  - Images (preprocessed to optimize OCR results).
  - PDF documents (direct text extraction).
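The routing between the image path (preprocessing plus OCR) and the PDF path (direct text extraction) can be sketched as follows. This is only an illustration: `classify_input`, `plan_batch`, and the suffix sets are hypothetical names and values, not part of the rettxmutation API, and the formats the library actually accepts are not listed in this README.

```python
from pathlib import Path

# Assumed suffix sets, chosen for illustration only.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
PDF_SUFFIXES = {".pdf"}


def classify_input(path):
    """Return 'image', 'pdf', or 'unsupported' based on the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "image"  # would go through preprocessing and OCR
    if suffix in PDF_SUFFIXES:
        return "pdf"  # would go through direct text extraction
    return "unsupported"


def plan_batch(paths):
    """Group a batch of paths by the route each file would take."""
    plan = {"image": [], "pdf": [], "unsupported": []}
    for p in paths:
        plan[classify_input(p)].append(str(p))
    return plan
```

A single-file trigger (upload, scheduled task, or API call) would call `classify_input` once, while a batch run would call `plan_batch` on the whole set of files.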
2. Systematic Workflow
- Preprocessing (for images):
  - Binarization, sharpening, and contrast adjustment.
  - Enhances image quality for better OCR accuracy.
- Text Extraction:
  - OCR applied to extract raw text.
  - Text cleaned to remove artifacts and standardize formatting.
- Keyword Detection:
  - Identify MECP2-related terms and gene variants.
  - Assign confidence scores to detected keywords.
- Summarization and Correction:
  - Generate concise summaries using OpenAI.
  - Validate and correct summaries with Azure Cognitive Services (Text Analytics for Health).
- Mutation Extraction:
  - Extract potential mutations and assign confidence scores.
  - Filter mutations based on user-defined thresholds.
- Data Enrichment:
  - Query Ensembl.org for detailed mutation information.
  - Map mutations to transcripts and protein variants.
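To make the text cleaning, keyword detection, and mutation extraction steps more concrete, here is a minimal sketch using only regular expressions. It is not how the library works internally (the README describes OpenAI and Text Analytics for Health for that); the patterns, the function names, and the scoring heuristic are all assumptions made for illustration, and the patterns cover only simple cDNA substitutions such as `c.916C>T`.

```python
import re

# Illustrative patterns: simple cDNA substitutions and RefSeq transcript IDs.
CDNA_RE = re.compile(r"c\.\d+[ACGT]>[ACGT]")
TRANSCRIPT_RE = re.compile(r"NM_\d+(?:\.\d+)?")


def clean_text(raw):
    """Collapse the whitespace runs that OCR output often contains."""
    return re.sub(r"\s+", " ", raw).strip()


def find_candidates(raw_text):
    """Return candidate variants with a made-up heuristic confidence score."""
    text = clean_text(raw_text)
    has_gene = "MECP2" in text.upper()
    transcript = TRANSCRIPT_RE.search(text)
    candidates = []
    for match in CDNA_RE.finditer(text):
        # Hypothetical scoring: base 0.5, +0.3 if the gene is named,
        # +0.2 if a transcript ID is present in the same document.
        score = 0.5 + (0.3 if has_gene else 0.0) + (0.2 if transcript else 0.0)
        candidates.append({
            "variant": match.group(0),
            "transcript": transcript.group(0) if transcript else None,
            "confidence": round(score, 2),
        })
    return candidates
```

For example, `find_candidates("Gene: MECP2\n NM_004992.4   c.916C>T")` yields one candidate carrying the variant, the transcript ID, and a score that reflects how much supporting context was found.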
3. Integration-Ready Outputs
- Models: Built with Pydantic v2 for seamless data validation.
- Output Formats:
  - JSON (structured data).
  - Objects ready for database storage (e.g., CosmosDB).
- Confidence Scores:
  - Provided as-is for users to interpret and filter based on needs.
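Since confidence scores are provided as-is, filtering is left to the caller. The sketch below shows one way to apply a threshold and serialize the result to JSON. It uses a standard-library dataclass as a stand-in for the library's Pydantic v2 models, and `MutationRecord`, its fields, `filter_by_confidence`, and `to_json` are hypothetical names rather than the library's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MutationRecord:
    """Hypothetical record; stands in for the library's Pydantic models."""
    gene: str
    variant: str
    confidence: float

    def __post_init__(self):
        # Minimal validation, similar in spirit to a bounded numeric field.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def filter_by_confidence(records, threshold):
    """Keep only records at or above the caller-chosen threshold."""
    return [r for r in records if r.confidence >= threshold]


def to_json(records):
    """Serialize records to a JSON array suitable for storage or transport."""
    return json.dumps([asdict(r) for r in records], sort_keys=True)
```

The threshold is a policy decision for each application: a patient registry may want a strict cutoff plus manual review, while an exploratory research tool may keep lower-scoring candidates.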
Limitations
- Basic Retry Mechanisms:
  - The library includes a retry policy for specific external calls:
    - Ensembl: Retries API requests for fetching variations when encountering:
      - HTTP errors.
      - Connection issues.
      - Timeout errors.
    - OpenAI: Similar retry logic ensures stability in mutation summarization and extraction tasks.
  - Retries are implemented using exponential backoff (up to 5 attempts).
- Error Handling Beyond Retries:
  - If all retry attempts fail, the library does not provide fallback mechanisms.
  - Invalid results or unhandled errors must be managed by the caller.
- MECP2 Priority:
  - The current version focuses exclusively on MECP2 mutations.
  - Extension to other genes or conditions is possible but not yet implemented.
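The retry behavior described above (exponential backoff, up to 5 attempts, then the error reaches the caller) can be sketched with the standard library alone. This is not the library's implementation; `call_with_retries`, the one-second base delay, and the default exception types are assumptions for illustration.

```python
import time


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call func(), retrying on the given exceptions with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between attempts. After the final
    failed attempt the exception is re-raised, so the caller must handle it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # no fallback: the last error propagates to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Because no fallback exists once the attempts are exhausted, calling code should wrap the analysis in its own `try`/`except` and decide whether to skip the document, queue it for a later run, or flag it for manual review.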
Workflow Summary
- Input:
  - Accept image or PDF files.
- Preprocessing:
  - Enhance image quality if the input is an image.
- Text Analysis:
  - Extract, clean, and summarize text (using OpenAI and Text Analytics for Health).
- Mutation Detection:
  - Identify potential mutations with confidence scores.
- Enrichment:
  - Fetch detailed data for detected mutations from Ensembl.org.
- Output:
  - Provide structured results for integration with databases or other systems.
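As an illustration of the enrichment step, the sketch below builds a request to the public Ensembl REST API for an HGVS-style variant. Which Ensembl endpoint the library actually calls is not stated in this README; the Variant Recoder endpoint is used here as an assumption, and `variant_recoder_url` and `fetch_variant` are hypothetical helper names.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def variant_recoder_url(hgvs, species="human"):
    """Build a Variant Recoder URL; the HGVS string is percent-encoded."""
    encoded = quote(hgvs, safe="")
    return (f"{ENSEMBL_REST}/variant_recoder/{species}/{encoded}"
            "?content-type=application/json")


def fetch_variant(hgvs, timeout=10.0):
    """Fetch and decode the JSON response (performs a network request)."""
    with urlopen(variant_recoder_url(hgvs), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_variant("NM_004992.4:c.916C>T")` would query Ensembl for that MECP2 variant. In practice such a call would be wrapped in retry logic and error handling, since network and HTTP failures are expected.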
Use Cases
- Patient Registries:
  - Populate genetic information for research or clinical databases.
- Research Tools:
  - Provide insights for studies on Rett Syndrome and related conditions.
- Custom Applications:
  - Integrate with applications using flexible workflows and output formats.
Design Highlights
- High Flexibility:
  - Modular design supports various workflows (batch, single-file, triggered).
- Separation of Concerns:
  - Focused on analysis; storage is left to external systems.
- Pydantic Models:
  - Facilitate easy integration with databases like CosmosDB.
Future Enhancements
- Add support for fallback mechanisms to handle errors gracefully.
- Extend functionality to detect mutations in other genes or conditions.
- Implement additional preprocessing for specialized input types (e.g., handwritten documents).
- Enable multilingual text analysis for broader applicability (pending validation with an extended dataset).
File details
Details for the file rettxmutation-0.0.20.tar.gz.
File metadata
- Download URL: rettxmutation-0.0.20.tar.gz
- Upload date:
- Size: 36.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8f7d2052dbc1746f4bb53a0d36dd37abf14ea35fb4af4e04db8aead2487812c2 |
| MD5 | 6957fd5bce409b8d373eb48e47177f42 |
| BLAKE2b-256 | 5c77f211f8020dc21bc741e89c8710785cc5087f7b277aa68e4061ae749b06df |
Provenance
The following attestation bundles were made for rettxmutation-0.0.20.tar.gz:
Publisher: publish_pypi.yml on rett-europe/rettxmutation
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: rettxmutation-0.0.20.tar.gz
  - Subject digest: 8f7d2052dbc1746f4bb53a0d36dd37abf14ea35fb4af4e04db8aead2487812c2
  - Sigstore transparency entry: 162275130
  - Sigstore integration time:
  - Permalink: rett-europe/rettxmutation@62d6baefa41444e962ca0bd5f3d0b1711f7c21c1
  - Branch / Tag: refs/tags/v0.0.20
  - Owner: https://github.com/rett-europe
  - Access: private
  - Token Issuer: https://token.actions.githubusercontent.com
  - Runner Environment: github-hosted
  - Publication workflow: publish_pypi.yml@62d6baefa41444e962ca0bd5f3d0b1711f7c21c1
  - Trigger Event: push
File details
Details for the file rettxmutation-0.0.20-py3-none-any.whl.
File metadata
- Download URL: rettxmutation-0.0.20-py3-none-any.whl
- Upload date:
- Size: 28.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f6184d1a85c21a207133d280f62e4fd549df40a85c6b4fa419ddf79e2e479b13 |
| MD5 | f802ced831c4d52144f02350fd61037f |
| BLAKE2b-256 | ce260a0923011da1f9685c231bfbded692adf10fa9f5692726e5657ce8124cf5 |
Provenance
The following attestation bundles were made for rettxmutation-0.0.20-py3-none-any.whl:
Publisher: publish_pypi.yml on rett-europe/rettxmutation
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: rettxmutation-0.0.20-py3-none-any.whl
  - Subject digest: f6184d1a85c21a207133d280f62e4fd549df40a85c6b4fa419ddf79e2e479b13
  - Sigstore transparency entry: 162275142
  - Sigstore integration time:
  - Permalink: rett-europe/rettxmutation@62d6baefa41444e962ca0bd5f3d0b1711f7c21c1
  - Branch / Tag: refs/tags/v0.0.20
  - Owner: https://github.com/rett-europe
  - Access: private
  - Token Issuer: https://token.actions.githubusercontent.com
  - Runner Environment: github-hosted
  - Publication workflow: publish_pypi.yml@62d6baefa41444e962ca0bd5f3d0b1711f7c21c1
  - Trigger Event: push