Skip to main content

Medical cOmputational Suite for Advanced Intelligent eXtraction

Project description

MOSAICX Logo

PyPI Python License Downloads Pydantic v2 Ollama Compatible

MOSAICX: The Foundation of Healthcare Data Transformation

MOSAICX represents the first intentional step toward building a living, breathing ecosystem for healthcare data intelligence. At DIGIT-X Lab, we believe that meaningful clinical insights begin with one fundamental challenge: structuring the unstructured.

Healthcare data is notoriously fragmented, locked in silos, and resistant to systematic analysis. MOSAICX addresses this by transforming unstructured medical documents into validated, interoperable data structures that become the foundation for connected healthcare intelligence.

From this:

"Pat.-Nr.: 111111111, geb. 13.03.1940, Mรผller, Jane
Transthorakale Echokardiographie vom 06.10.2020 10:45
Befund: Mitralklappe physiologische Insuffizienz..."

To this:

{
  "patient_id": "111111111",
  "age": 80,
  "sex": "Female", 
  "mitral_valve_grade": "Normal",
  "tricuspid_valve_grade": "Mild"
}

This is not just data extraction. This is infrastructureโ€”the systematic foundation that enables healthcare data to move, connect, and speak in a common language that both humans and AI can understand.

๐Ÿš€ **Quick Start

Prerequisites

โš ๏ธ IMPORTANT: You need Ollama running locally before using MOSAICX

MOSAICX uses local LLMs through Ollama for privacy and control. Install Ollama first:

๐Ÿ–ฅ๏ธ Install Ollama:

# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or visit https://ollama.com/download for Windows/GUI installers

๐Ÿค– Install a compatible model:

# Recommended: Fast and reliable for medical text
ollama pull mistral

# Alternative: Larger model for complex cases  
ollama pull llama3.1

# Verify Ollama is running
ollama list

Installation

Option 1: Standard Installation

pip install mosaicx

Option 2: With UV (Faster & Better)

uv add mosaicx

Verification

# Check if everything works
mosaicx --help

# Test with your installed model
mosaicx generate --desc "Patient demographics" --model mistral

Complete Workflow Guide

๐Ÿ”„ The MOSAICX Process: From Description to Data

# Step 1: Generate your schema from natural language
mosaicx generate --desc "Echocardiography report with patient ID, age, sex, and all cardiac valve conditions including severity grades" --model mistral

# Step 2: See what schemas you have
mosaicx schemas

# Step 3: Extract data from your PDF
mosaicx extract --pdf echo_report.pdf --schema EchocardiographyReport --model mistral --save results.json

# Step 4: Repeat for more files
mosaicx extract --pdf another_report.pdf --schema EchocardiographyReport --model mistral

๐Ÿ’ก Pro Tips for Better Results:

๐Ÿ“ Writing Effective Descriptions

โœ… DO:

# Be specific about data types and medical domains
mosaicx generate --desc "Patient demographics (ID, age, sex, birth_date) plus cardiac valve assessment from echocardiography with severity grades (Normal, Mild, Moderate, Severe) for mitral, aortic, tricuspid, and pulmonary valves"

# Include expected value formats  
mosaicx generate --desc "Lab results with numeric values: hemoglobin (g/dL), white blood cell count (cells/ยตL), platelet count (K/ยตL), and categorical blood type (A, B, AB, O with +/- Rh factor)"

# Mention the document source type
mosaicx generate --desc "Radiology report findings from MRI brain scans including lesion locations, sizes in mm, enhancement patterns, and radiologist impression"

โŒ DON'T:

# Too vague - will generate generic fields
mosaicx generate --desc "patient data"

# Missing context - unclear what values to expect  
mosaicx generate --desc "heart information from reports"

# No data types - will guess incorrectly
mosaicx generate --desc "some medical stuff from PDFs"

๐ŸŽฏ Schema Management Workflow

# List all your generated schemas
mosaicx schemas

# Extract using different schema formats:
mosaicx extract --pdf report.pdf --schema EchoReport_20250919_143022        # Schema ID
mosaicx extract --pdf report.pdf --schema echoreport_patient_cardiac.py     # Filename  
mosaicx extract --pdf report.pdf --schema /path/to/schema/echoreport.py     # Full path

๐Ÿ” Real Example: Complete Workflow

Let's say you have echocardiography reports and want to extract structured data:

# 1. Generate schema with detailed description
mosaicx generate \
  --desc "Echocardiography report containing patient_id (string), age (integer), sex (Male/Female), exam_date (YYYY-MM-DD format), mitral_valve_condition (Normal/Mild/Moderate/Severe), aortic_valve_condition (Normal/Mild/Moderate/Severe), ejection_fraction (percentage as integer), and clinical_impression (free text)" \
  --model mistral

# Output: Created schema 'EchocardiographyReport_20250919_143022'

# 2. Verify your schema was created
mosaicx schemas
# Shows: EchocardiographyReport_20250919_143022 | Echocardiography report containing patient demographics and valve conditions

# 3. Extract from your first report
mosaicx extract \
  --pdf patient_001_echo.pdf \
  --schema EchocardiographyReport_20250919_143022 \
  --model mistral \
  --save patient_001_results.json

# 4. Extract from multiple reports (same schema)
mosaicx extract --pdf patient_002_echo.pdf --schema EchocardiographyReport_20250919_143022 --model mistral
mosaicx extract --pdf patient_003_echo.pdf --schema EchocardiographyReport_20250919_143022 --model mistral

# 5. Results saved as structured JSON ready for analysis!

๐ŸŽฏ Pro Tip: If you see "connection refused" errors, make sure Ollama is running: ollama serve (usually runs automatically after installation)


Building the Foundation for Connected Healthcare Intelligence

The Challenge We Address

Healthcare generates vast amounts of data, yet most of it remains trapped in unstructured formats that resist systematic analysis. At DIGIT-X Lab, we recognized that before we can build intelligent systems that truly serve clinical care, we must solve the fundamental problem of data structure.

This is where meaningful healthcare AI beginsโ€”not with complex models, but with the infrastructure that allows medical data to be:

  • Structured systematically and consistently
  • Connected across clinical silos and departments
  • Validated against clinical standards and requirements
  • Transformed into knowledge that supports decision-making

Our Vision: A Living Ecosystem

MOSAICX is the first building block in our vision of a living, breathing ecosystem for healthcare data intelligence. We believe that when medical data is properly structured, insights should emerge as "free lunches"โ€”natural byproducts of well-designed systems rather than labor-intensive manual processes.

The ecosystem we're building:

  • ๏ฟฝ๏ธ Infrastructure-First: Reliable, reproducible data transformation that scales
  • ๐Ÿ”— Interconnected: Systems that speak to each other in meaningful ways
  • ๐Ÿฉบ Clinically Grounded: Tools built from real clinical needs, not theoretical possibilities
  • ๏ฟฝ Transparent: AI that can explain its reasoning in natural language
  • ๐ŸŒฑ Evolving: Adaptive systems that learn and improve with use

Why Structure Comes First

In radiologyโ€”and healthcare broadlyโ€”the most sophisticated AI model is only as good as the data it can understand. Unstructured medical reports, imaging findings, and clinical notes represent decades of medical knowledge locked away from systematic analysis.

By solving the structuring problem first, we create the foundation for:

  • Longitudinal patient analysis across multiple reports and timepoints
  • Cross-modal integration linking imaging, lab results, and clinical notes
  • Population-level insights drawn from properly standardized clinical data
  • Reproducible research built on validated, structured datasets

MOSAICX transforms the labor-intensive process of manual data extraction into systematic, validated infrastructure that serves as the backbone for more sophisticated healthcare intelligence systems.


Infrastructure for Medical Data Transformation

Systematic Data Pipeline

๐Ÿ“ Clinical Description โ†’ ๐Ÿค– Local LLM โ†’ ๐Ÿ—๏ธ Validated Schema โ†’ ๐Ÿ“„ Medical Document โ†’ ๐Ÿ” Structured Extraction โ†’ โœจ Interoperable Data

MOSAICX implements a systematic approach to medical data transformation that prioritizes reproducibility, clinical validity, and infrastructure scalability.

Schema-Driven Architecture

1. Intelligent Schema Generation

mosaicx generate --desc "Echocardiography report with comprehensive cardiac valve assessment" --model mistral

Our approach begins with clinical intent: you describe your data requirements in natural language, and MOSAICX generates validated Pydantic schemas that encode both structure and clinical semantics.

Generated Infrastructure:

class EchocardiographyReport(BaseModel):
    patient_id: str = Field(description="Patient identifier")
    age: int = Field(description="Patient age in years", ge=0, le=150)
    mitral_valve_condition: Literal["Normal", "Mild", "Moderate", "Severe"] = Field(
        description="Mitral valve condition severity grade"
    )
    ejection_fraction: Optional[int] = Field(
        description="Left ventricular ejection fraction percentage",
        ge=15, le=80
    )

This is not just code generationโ€”it's clinical data infrastructure with built-in validation, semantic clarity, and interoperability standards.

2. Systematic Data Extraction

mosaicx extract --pdf echo_report.pdf --schema EchocardiographyReport --model mistral --save structured_data.json

The extraction process prioritizes:

  • Clinical accuracy through schema-guided parsing
  • Data validation with automatic constraint checking
  • Reproducibility via consistent local processing
  • Transparency in how clinical concepts map to structured fields

Structured Output:

{
  "patient_id": "0022768653",
  "age": 78,
  "mitral_valve_condition": "Mild", 
  "ejection_fraction": 55,
  "extraction_metadata": {
    "schema_version": "EchocardiographyReport_20250919_143022",
    "processing_timestamp": "2025-09-19T14:30:22Z",
    "validation_status": "PASSED"
  }
}

3. Infrastructure for Scale

mosaicx schemas  # Manage your data infrastructure assets

MOSAICX treats schemas as infrastructure assetsโ€”versioned, reusable, and designed for systematic deployment across clinical workflows.

Key Infrastructure Principles:

  • Local Processing: All data transformation happens within your clinical environment
  • Schema Versioning: Systematic tracking of data structure evolution
  • Validation Pipeline: Built-in quality control for extracted data
  • Interoperability: Standards-compliant output ready for downstream systems

๐Ÿ”’ Privacy Note: All LLM processing happens locally through Ollama. Your patient data never touches the internet.


๐ŸŽ“ Advanced Usage & Best Practices

๐Ÿ† Expert Tips for Schema Descriptions

๐Ÿ“‹ Template for Medical Reports:

mosaicx generate --desc "[Document Type] containing [Demographics] (specify fields and types), [Clinical Data] (with units and value ranges), [Categorical Fields] (list expected values), and [Free Text Fields] (impression, notes, etc.)"

๐Ÿฉบ Real-World Examples by Medical Domain:

Radiology Reports:

mosaicx generate --desc "Brain MRI report with patient_id (string), scan_date (YYYY-MM-DD), sequence_types (T1, T2, FLAIR, DWI), lesion_count (integer), lesion_locations (frontal, parietal, temporal, occipital), largest_lesion_size_mm (float), enhancement_pattern (none, ring, nodular), and radiologist_impression (free text)"

Laboratory Results:

mosaicx generate --desc "Complete blood count results with patient_id (string), test_date (YYYY-MM-DD), hemoglobin_g_per_dl (float 10-18), hematocrit_percentage (float 30-50), white_blood_cells_per_ul (integer 4000-11000), platelet_count_k_per_ul (integer 150-450), and reference_ranges_exceeded (list of abnormal values)"

Pathology Reports:

mosaicx generate --desc "Surgical pathology report including patient_id (string), specimen_type (biopsy, resection, cytology), organ_system (breast, lung, colon, etc.), diagnosis_primary (free text), diagnosis_secondary (free text), tumor_size_cm (float), grade (1-4 or well/moderate/poor), stage_tnm (TNM format), margins_status (positive, negative, close), and pathologist_name (string)"

โšก Performance Optimization

Model Selection Strategy:

# Fast processing for simple extractions
mosaicx extract --pdf simple_report.pdf --schema SimpleSchema --model mistral

# Better accuracy for complex medical terminology  
mosaicx extract --pdf complex_report.pdf --schema ComplexSchema --model llama3.1

# Batch processing with consistent model
for pdf in reports/*.pdf; do
    mosaicx extract --pdf "$pdf" --schema MedicalReport --model mistral --save "results/$(basename "$pdf" .pdf).json"
done

๐Ÿ” Quality Control & Validation

Iterative Schema Refinement:

# Start with basic schema
mosaicx generate --desc "Basic patient demographics and primary diagnosis"

# Test on a few files, then refine
mosaicx generate --desc "Patient demographics (patient_id as 8-digit number, age 0-120, sex as M/F only) plus primary_diagnosis (ICD-10 format preferred), secondary_diagnoses (list), and admission_date (MM/DD/YYYY format)"

Result Verification:

# Save results for manual review
mosaicx extract --pdf report.pdf --schema MySchema --model mistral --save results_to_review.json

# Use debug mode for problematic extractions
mosaicx extract --pdf problematic.pdf --schema MySchema --model mistral --debug

Clinical-Grade Features for Healthcare Data Infrastructure

๐Ÿฅ Medical Domain Intelligence

  • Clinical terminology mapping: Systematic translation of medical abbreviations and multilingual terms
  • Standardized value sets: Automatic mapping to clinical standards (ICD-10, SNOMED CT concepts)
  • Contextual understanding: Distinction between normal findings, pathology, and uncertainty expressions

๏ฟฝ Privacy-First Architecture

  • Local processing: All data transformation occurs within your clinical environment
  • No cloud dependencies: Patient data never leaves your institutional network
  • GDPR compliance: Built-in privacy protection by design for European healthcare environments

โš•๏ธ Clinical Workflow Integration

๐Ÿ“Š Structured Extraction Results โ†’ Clinical Dashboard

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Clinical Parameter       โ”ƒ Extracted Value                 โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ Patient ID               โ”‚ 0022768653                      โ”‚
โ”‚ Cardiac Function         โ”‚ Preserved (EF: 55%)             โ”‚
โ”‚ Valve Assessment         โ”‚ Mild mitral insufficiency       โ”‚
โ”‚ Clinical Significance    โ”‚ Routine follow-up recommended   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๏ฟฝ Infrastructure Reliability

  • Schema versioning: Systematic tracking of data structure evolution for longitudinal studies
  • Validation pipeline: Multi-layer quality control ensuring clinical data integrity
  • Error recovery: Robust handling of edge cases common in real-world medical documents
  • Audit trails: Complete tracking of data transformation processes for clinical research compliance

๐Ÿ”ง Troubleshooting

Common Issues & Solutions

โŒ "Connection refused" or "Model not found" errors:

# Check if Ollama is running
ollama list

# If not installed, install it:
curl -fsSL https://ollama.com/install.sh | sh

# If installed but not running:
ollama serve

# Check available models:
ollama list

โŒ "No module named 'mosaicx'" error:

# Make sure you're in the right environment
pip show mosaicx

# If not installed:
pip install mosaicx

โŒ PDF extraction fails:

# Install additional dependencies if needed
pip install "mosaicx[pdf]"  # (if this exists)

# Check if PDF is readable:
file your_pdf.pdf

โŒ Schema generation produces weird results:

# Try a different model
mosaicx generate --desc "your description" --model llama3.1

# Use more specific descriptions
mosaicx generate --desc "Patient demographics including age, sex, ID, and cardiac valve conditions from echocardiography reports"

๐Ÿ†˜ Still stuck? Open an issue with:

  • Your OS and Python version
  • Ollama version (ollama --version)
  • The exact error message
  • A minimal example (anonymized data!)

The Larger Vision

DIGIT-X Lab: Building Connected Healthcare Intelligence

MOSAICX represents the foundational layer of our vision for healthcare data transformation. At DIGIT-X Lab in the Department of Radiology at LMU Klinikum, we're working systematically toward healthcare systems that are:

  • Interconnected: Where imaging, laboratory, and clinical data speak a common structured language
  • Intelligent: Where AI operates transparently on validated, structured clinical information
  • Infrastructure-Focused: Where data transformation is reliable, reproducible, and clinically grounded
  • Clinically Useful: Where technology serves real healthcare needs rather than abstract possibilities

What comes next:

  • Cross-modal integration: Linking structured imaging findings with clinical notes and laboratory results
  • Longitudinal patient models: Systematic tracking of clinical changes over time using structured data
  • Clinical decision infrastructure: AI systems that can explain their reasoning using structured clinical evidence
  • Interoperable ecosystems: Healthcare data that moves seamlessly between institutions and systems

Research Collaboration

DIGIT-X Lab actively collaborates with clinical researchers, healthcare informaticists, and AI developers who share our infrastructure-first approach to healthcare intelligence.

Contact for research collaboration:

Citation

If MOSAICX contributes to your research or clinical work, please cite:

@software{mosaicx2025,
  title={MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author={Shiyam Sundar, Lalith Kumar and DIGIT-X Lab Team},
  year={2025},
  url={https://github.com/LalithShiyam/MOSAICX},
  institution={DIGIT-X Lab, Department of Radiology, LMU Klinikum},
  note={Infrastructure for systematic healthcare data transformation}
}

Acknowledgments

Core Infrastructure:

  • Ollama: Local LLM deployment enabling privacy-preserving clinical data processing
  • Docling: Advanced document parsing specifically designed for complex medical document formats
  • Pydantic: Data validation ensuring clinical data integrity and interoperability
  • Rich: Clear, accessible interfaces for clinical workflow integration

Clinical Collaboration:

  • LMU Klinikum clinical departments: Providing real-world validation environments and use cases
  • DIGIT-X Lab team: Systematic development and validation of healthcare data infrastructure
  • International collaborators: Multi-site validation and cross-institutional deployment

Building healthcare data infrastructure, one systematic transformation at a time.

DIGIT-X Lab
Digital Transformation in Radiology
Department of Radiology
LMU Klinikum, Munich

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mosaicx-1.0.7.tar.gz (495.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mosaicx-1.0.7-py3-none-any.whl (44.2 kB view details)

Uploaded Python 3

File details

Details for the file mosaicx-1.0.7.tar.gz.

File metadata

  • Download URL: mosaicx-1.0.7.tar.gz
  • Upload date:
  • Size: 495.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.18

File hashes

Hashes for mosaicx-1.0.7.tar.gz
Algorithm Hash digest
SHA256 965ca24fd4c489c4f6ea51ecbe5db486db200abecfe9ecb985ad5f88729f42ca
MD5 cdab879d60504015292ba23aa30a0c4a
BLAKE2b-256 a20ebc7a249f27ee4362f304c319caa2dd5877c46316e4348636a9cb0449bba5

See more details on using hashes here.

File details

Details for the file mosaicx-1.0.7-py3-none-any.whl.

File metadata

  • Download URL: mosaicx-1.0.7-py3-none-any.whl
  • Upload date:
  • Size: 44.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.18

File hashes

Hashes for mosaicx-1.0.7-py3-none-any.whl
Algorithm Hash digest
SHA256 42094033ec778fad3f8e29163b21598849f1fbc54e14cf07f5dd92047dcbe0d1
MD5 1bb485bc96cdd5deeda7092ff6611e5f
BLAKE2b-256 86993e8b33e2c88dd67d2de98a423f82f803d50a2a0352af045aaa08f6a2a22d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page