Medical cOmputational Suite for Advanced Intelligent eXtraction
Project description
The Foundation of Healthcare Data Transformation
MOSAICX represents the first intentional step toward building a living, breathing ecosystem for healthcare data intelligence. At DIGIT-X Lab, we believe that meaningful clinical insights begin with one fundamental challenge: structuring the unstructured.
Healthcare data is notoriously fragmented, locked in silos, and resistant to systematic analysis. MOSAICX addresses this by transforming unstructured medical documents into validated, interoperable data structures that become the foundation for connected healthcare intelligence.
From this:
"Pat.-Nr.: 111111111, geb. 13.03.1940, Mรผller, Jane
Transthorakale Echokardiographie vom 06.10.2020 10:45
Befund: Mitralklappe physiologische Insuffizienz..."
To this:
{
"patient_id": "111111111",
"age": 80,
"sex": "Female",
"mitral_valve_grade": "Normal",
"tricuspid_valve_grade": "Mild"
}
This is not just data extraction. This is infrastructureโthe systematic foundation that enables healthcare data to move, connect, and speak in a common language that both humans and AI can understand.
๐ **Quick Start
Prerequisites
โ ๏ธ IMPORTANT: You need Ollama running locally before using MOSAICX
MOSAICX uses local LLMs through Ollama for privacy and control. Install Ollama first:
๐ฅ๏ธ Install Ollama:
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or visit https://ollama.com/download for Windows/GUI installers
๐ค Install a compatible model:
# Recommended: Fast and reliable for medical text
ollama pull mistral
# Alternative: Larger model for complex cases
ollama pull llama3.1
# Verify Ollama is running
ollama list
Installation
Option 1: Standard Installation
pip install mosaicx
Option 2: With UV (Faster & Better)
uv add mosaicx
Verification
# Check if everything works
mosaicx --help
# Test with your installed model
mosaicx generate --desc "Patient demographics" --model mistral
Complete Workflow Guide
๐ The MOSAICX Process: From Description to Data
# Step 1: Generate your schema from natural language
mosaicx generate --desc "Echocardiography report with patient ID, age, sex, and all cardiac valve conditions including severity grades" --model mistral
# Step 2: See what schemas you have
mosaicx schemas
# Step 3: Extract data from your PDF
mosaicx extract --pdf echo_report.pdf --schema EchocardiographyReport --model mistral --save results.json
# Step 4: Repeat for more files
mosaicx extract --pdf another_report.pdf --schema EchocardiographyReport --model mistral
๐ก Pro Tips for Better Results:
๐ Writing Effective Descriptions
โ DO:
# Be specific about data types and medical domains
mosaicx generate --desc "Patient demographics (ID, age, sex, birth_date) plus cardiac valve assessment from echocardiography with severity grades (Normal, Mild, Moderate, Severe) for mitral, aortic, tricuspid, and pulmonary valves"
# Include expected value formats
mosaicx generate --desc "Lab results with numeric values: hemoglobin (g/dL), white blood cell count (cells/ยตL), platelet count (K/ยตL), and categorical blood type (A, B, AB, O with +/- Rh factor)"
# Mention the document source type
mosaicx generate --desc "Radiology report findings from MRI brain scans including lesion locations, sizes in mm, enhancement patterns, and radiologist impression"
โ DON'T:
# Too vague - will generate generic fields
mosaicx generate --desc "patient data"
# Missing context - unclear what values to expect
mosaicx generate --desc "heart information from reports"
# No data types - will guess incorrectly
mosaicx generate --desc "some medical stuff from PDFs"
๐ฏ Schema Management Workflow
# List all your generated schemas
mosaicx schemas
# Extract using different schema formats:
mosaicx extract --pdf report.pdf --schema EchoReport_20250919_143022 # Schema ID
mosaicx extract --pdf report.pdf --schema echoreport_patient_cardiac.py # Filename
mosaicx extract --pdf report.pdf --schema /path/to/schema/echoreport.py # Full path
๐ Real Example: Complete Workflow
Let's say you have echocardiography reports and want to extract structured data:
# 1. Generate schema with detailed description
mosaicx generate \
--desc "Echocardiography report containing patient_id (string), age (integer), sex (Male/Female), exam_date (YYYY-MM-DD format), mitral_valve_condition (Normal/Mild/Moderate/Severe), aortic_valve_condition (Normal/Mild/Moderate/Severe), ejection_fraction (percentage as integer), and clinical_impression (free text)" \
--model mistral
# Output: Created schema 'EchocardiographyReport_20250919_143022'
# 2. Verify your schema was created
mosaicx schemas
# Shows: EchocardiographyReport_20250919_143022 | Echocardiography report containing patient demographics and valve conditions
# 3. Extract from your first report
mosaicx extract \
--pdf patient_001_echo.pdf \
--schema EchocardiographyReport_20250919_143022 \
--model mistral \
--save patient_001_results.json
# 4. Extract from multiple reports (same schema)
mosaicx extract --pdf patient_002_echo.pdf --schema EchocardiographyReport_20250919_143022 --model mistral
mosaicx extract --pdf patient_003_echo.pdf --schema EchocardiographyReport_20250919_143022 --model mistral
# 5. Results saved as structured JSON ready for analysis!
๐ฏ Pro Tip: If you see "connection refused" errors, make sure Ollama is running: ollama serve (usually runs automatically after installation)
Building the Foundation for Connected Healthcare Intelligence
The Challenge We Address
Healthcare generates vast amounts of data, yet most of it remains trapped in unstructured formats that resist systematic analysis. At DIGIT-X Lab, we recognized that before we can build intelligent systems that truly serve clinical care, we must solve the fundamental problem of data structure.
This is where meaningful healthcare AI beginsโnot with complex models, but with the infrastructure that allows medical data to be:
- Structured systematically and consistently
- Connected across clinical silos and departments
- Validated against clinical standards and requirements
- Transformed into knowledge that supports decision-making
Our Vision: A Living Ecosystem
MOSAICX is the first building block in our vision of a living, breathing ecosystem for healthcare data intelligence. We believe that when medical data is properly structured, insights should emerge as "free lunches"โnatural byproducts of well-designed systems rather than labor-intensive manual processes.
The ecosystem we're building:
- ๏ฟฝ๏ธ Infrastructure-First: Reliable, reproducible data transformation that scales
- ๐ Interconnected: Systems that speak to each other in meaningful ways
- ๐ฉบ Clinically Grounded: Tools built from real clinical needs, not theoretical possibilities
- ๏ฟฝ Transparent: AI that can explain its reasoning in natural language
- ๐ฑ Evolving: Adaptive systems that learn and improve with use
Why Structure Comes First
In radiologyโand healthcare broadlyโthe most sophisticated AI model is only as good as the data it can understand. Unstructured medical reports, imaging findings, and clinical notes represent decades of medical knowledge locked away from systematic analysis.
By solving the structuring problem first, we create the foundation for:
- Longitudinal patient analysis across multiple reports and timepoints
- Cross-modal integration linking imaging, lab results, and clinical notes
- Population-level insights drawn from properly standardized clinical data
- Reproducible research built on validated, structured datasets
MOSAICX transforms the labor-intensive process of manual data extraction into systematic, validated infrastructure that serves as the backbone for more sophisticated healthcare intelligence systems.
Infrastructure for Medical Data Transformation
Systematic Data Pipeline
๐ Clinical Description โ ๐ค Local LLM โ ๐๏ธ Validated Schema โ ๐ Medical Document โ ๐ Structured Extraction โ โจ Interoperable Data
MOSAICX implements a systematic approach to medical data transformation that prioritizes reproducibility, clinical validity, and infrastructure scalability.
Schema-Driven Architecture
1. Intelligent Schema Generation
mosaicx generate --desc "Echocardiography report with comprehensive cardiac valve assessment" --model mistral
Our approach begins with clinical intent: you describe your data requirements in natural language, and MOSAICX generates validated Pydantic schemas that encode both structure and clinical semantics.
Generated Infrastructure:
class EchocardiographyReport(BaseModel):
patient_id: str = Field(description="Patient identifier")
age: int = Field(description="Patient age in years", ge=0, le=150)
mitral_valve_condition: Literal["Normal", "Mild", "Moderate", "Severe"] = Field(
description="Mitral valve condition severity grade"
)
ejection_fraction: Optional[int] = Field(
description="Left ventricular ejection fraction percentage",
ge=15, le=80
)
This is not just code generationโit's clinical data infrastructure with built-in validation, semantic clarity, and interoperability standards.
2. Systematic Data Extraction
mosaicx extract --pdf echo_report.pdf --schema EchocardiographyReport --model mistral --save structured_data.json
The extraction process prioritizes:
- Clinical accuracy through schema-guided parsing
- Data validation with automatic constraint checking
- Reproducibility via consistent local processing
- Transparency in how clinical concepts map to structured fields
Structured Output:
{
"patient_id": "0022768653",
"age": 78,
"mitral_valve_condition": "Mild",
"ejection_fraction": 55,
"extraction_metadata": {
"schema_version": "EchocardiographyReport_20250919_143022",
"processing_timestamp": "2025-09-19T14:30:22Z",
"validation_status": "PASSED"
}
}
3. Infrastructure for Scale
mosaicx schemas # Manage your data infrastructure assets
MOSAICX treats schemas as infrastructure assetsโversioned, reusable, and designed for systematic deployment across clinical workflows.
Key Infrastructure Principles:
- Local Processing: All data transformation happens within your clinical environment
- Schema Versioning: Systematic tracking of data structure evolution
- Validation Pipeline: Built-in quality control for extracted data
- Interoperability: Standards-compliant output ready for downstream systems
๐ Privacy Note: All LLM processing happens locally through Ollama. Your patient data never touches the internet.
๐ Advanced Usage & Best Practices
๐ Expert Tips for Schema Descriptions
๐ Template for Medical Reports:
mosaicx generate --desc "[Document Type] containing [Demographics] (specify fields and types), [Clinical Data] (with units and value ranges), [Categorical Fields] (list expected values), and [Free Text Fields] (impression, notes, etc.)"
๐ฉบ Real-World Examples by Medical Domain:
Radiology Reports:
mosaicx generate --desc "Brain MRI report with patient_id (string), scan_date (YYYY-MM-DD), sequence_types (T1, T2, FLAIR, DWI), lesion_count (integer), lesion_locations (frontal, parietal, temporal, occipital), largest_lesion_size_mm (float), enhancement_pattern (none, ring, nodular), and radiologist_impression (free text)"
Laboratory Results:
mosaicx generate --desc "Complete blood count results with patient_id (string), test_date (YYYY-MM-DD), hemoglobin_g_per_dl (float 10-18), hematocrit_percentage (float 30-50), white_blood_cells_per_ul (integer 4000-11000), platelet_count_k_per_ul (integer 150-450), and reference_ranges_exceeded (list of abnormal values)"
Pathology Reports:
mosaicx generate --desc "Surgical pathology report including patient_id (string), specimen_type (biopsy, resection, cytology), organ_system (breast, lung, colon, etc.), diagnosis_primary (free text), diagnosis_secondary (free text), tumor_size_cm (float), grade (1-4 or well/moderate/poor), stage_tnm (TNM format), margins_status (positive, negative, close), and pathologist_name (string)"
โก Performance Optimization
Model Selection Strategy:
# Fast processing for simple extractions
mosaicx extract --pdf simple_report.pdf --schema SimpleSchema --model mistral
# Better accuracy for complex medical terminology
mosaicx extract --pdf complex_report.pdf --schema ComplexSchema --model llama3.1
# Batch processing with consistent model
for pdf in reports/*.pdf; do
mosaicx extract --pdf "$pdf" --schema MedicalReport --model mistral --save "results/$(basename "$pdf" .pdf).json"
done
๐ Quality Control & Validation
Iterative Schema Refinement:
# Start with basic schema
mosaicx generate --desc "Basic patient demographics and primary diagnosis"
# Test on a few files, then refine
mosaicx generate --desc "Patient demographics (patient_id as 8-digit number, age 0-120, sex as M/F only) plus primary_diagnosis (ICD-10 format preferred), secondary_diagnoses (list), and admission_date (MM/DD/YYYY format)"
Result Verification:
# Save results for manual review
mosaicx extract --pdf report.pdf --schema MySchema --model mistral --save results_to_review.json
# Use debug mode for problematic extractions
mosaicx extract --pdf problematic.pdf --schema MySchema --model mistral --debug
Clinical-Grade Features for Healthcare Data Infrastructure
๐ฅ Medical Domain Intelligence
- Clinical terminology mapping: Systematic translation of medical abbreviations and multilingual terms
- Standardized value sets: Automatic mapping to clinical standards (ICD-10, SNOMED CT concepts)
- Contextual understanding: Distinction between normal findings, pathology, and uncertainty expressions
๏ฟฝ Privacy-First Architecture
- Local processing: All data transformation occurs within your clinical environment
- No cloud dependencies: Patient data never leaves your institutional network
- GDPR compliance: Built-in privacy protection by design for European healthcare environments
โ๏ธ Clinical Workflow Integration
๐ Structured Extraction Results โ Clinical Dashboard
โโโโโโโโโโโโโโโโโโโโโโโโโโโโณโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Clinical Parameter โ Extracted Value โ
โกโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฉ
โ Patient ID โ 0022768653 โ
โ Cardiac Function โ Preserved (EF: 55%) โ
โ Valve Assessment โ Mild mitral insufficiency โ
โ Clinical Significance โ Routine follow-up recommended โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๏ฟฝ Infrastructure Reliability
- Schema versioning: Systematic tracking of data structure evolution for longitudinal studies
- Validation pipeline: Multi-layer quality control ensuring clinical data integrity
- Error recovery: Robust handling of edge cases common in real-world medical documents
- Audit trails: Complete tracking of data transformation processes for clinical research compliance
๐ง Troubleshooting
Common Issues & Solutions
โ "Connection refused" or "Model not found" errors:
# Check if Ollama is running
ollama list
# If not installed, install it:
curl -fsSL https://ollama.com/install.sh | sh
# If installed but not running:
ollama serve
# Check available models:
ollama list
โ "No module named 'mosaicx'" error:
# Make sure you're in the right environment
pip show mosaicx
# If not installed:
pip install mosaicx
โ PDF extraction fails:
# Install additional dependencies if needed
pip install "mosaicx[pdf]" # (if this exists)
# Check if PDF is readable:
file your_pdf.pdf
โ Schema generation produces weird results:
# Try a different model
mosaicx generate --desc "your description" --model llama3.1
# Use more specific descriptions
mosaicx generate --desc "Patient demographics including age, sex, ID, and cardiac valve conditions from echocardiography reports"
๐ Still stuck? Open an issue with:
- Your OS and Python version
- Ollama version (
ollama --version) - The exact error message
- A minimal example (anonymized data!)
The Larger Vision
DIGIT-X Lab: Building Connected Healthcare Intelligence
MOSAICX represents the foundational layer of our vision for healthcare data transformation. At DIGIT-X Lab in the Department of Radiology at LMU Klinikum, we're working systematically toward healthcare systems that are:
- Interconnected: Where imaging, laboratory, and clinical data speak a common structured language
- Intelligent: Where AI operates transparently on validated, structured clinical information
- Infrastructure-Focused: Where data transformation is reliable, reproducible, and clinically grounded
- Clinically Useful: Where technology serves real healthcare needs rather than abstract possibilities
What comes next:
- Cross-modal integration: Linking structured imaging findings with clinical notes and laboratory results
- Longitudinal patient models: Systematic tracking of clinical changes over time using structured data
- Clinical decision infrastructure: AI systems that can explain their reasoning using structured clinical evidence
- Interoperable ecosystems: Healthcare data that moves seamlessly between institutions and systems
Research Collaboration
DIGIT-X Lab actively collaborates with clinical researchers, healthcare informaticists, and AI developers who share our infrastructure-first approach to healthcare intelligence.
Contact for research collaboration:
- Email: lalith.shiyam@med.uni-muenchen.de
- Institution: DIGIT-X Lab, Department of Radiology, LMU Klinikum, Munich
Citation
If MOSAICX contributes to your research or clinical work, please cite:
@software{mosaicx2025,
title={MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
author={Shiyam Sundar, Lalith Kumar and DIGIT-X Lab Team},
year={2025},
url={https://github.com/LalithShiyam/MOSAICX},
institution={DIGIT-X Lab, Department of Radiology, LMU Klinikum},
note={Infrastructure for systematic healthcare data transformation}
}
Acknowledgments
Core Infrastructure:
- Ollama: Local LLM deployment enabling privacy-preserving clinical data processing
- Docling: Advanced document parsing specifically designed for complex medical document formats
- Pydantic: Data validation ensuring clinical data integrity and interoperability
- Rich: Clear, accessible interfaces for clinical workflow integration
Clinical Collaboration:
- LMU Klinikum clinical departments: Providing real-world validation environments and use cases
- DIGIT-X Lab team: Systematic development and validation of healthcare data infrastructure
- International collaborators: Multi-site validation and cross-institutional deployment
Building healthcare data infrastructure, one systematic transformation at a time.
DIGIT-X Lab
Digital Transformation in Radiology
Department of Radiology
LMU Klinikum, Munich
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mosaicx-1.0.5.tar.gz.
File metadata
- Download URL: mosaicx-1.0.5.tar.gz
- Upload date:
- Size: 491.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
87838f0899ada9c3e3f370080f3c4f6e01b601c2cf3e1499c89046848e31dbe5
|
|
| MD5 |
27814c102054ff47872997907d2a0dbc
|
|
| BLAKE2b-256 |
8c9d47db29fb47f92bcc71dc62561fb8b2999a13b34d9596f9248475347f9c63
|
File details
Details for the file mosaicx-1.0.5-py3-none-any.whl.
File metadata
- Download URL: mosaicx-1.0.5-py3-none-any.whl
- Upload date:
- Size: 41.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
04e9306b68019c83e46107526355255dc8be207a5c71de32bb6b734162c1d992
|
|
| MD5 |
edaad78f75e304c4b04993afb21c262d
|
|
| BLAKE2b-256 |
3704ee97247b35e2ea49ccf0b310e5364fb3bda1c359aceb53bd5f6aaaca3cb5
|