MOSAICX 🏥🤖

Medical cOmputational Suite for Advanced Intelligent eXtraction

"We built this because manually extracting data from thousands of medical reports was slowly killing our souls."
— The DIGIT-X Team, after another late night of copy-pasting patient data

🎯 What MOSAICX Actually Does

MOSAICX turns this nightmare:

"Pat.-Nr.: 111111111, geb. 13.03.1940, Müller, Jane
Transthorakale Echokardiographie vom 06.10.2020 10:45
Befund: Mitralklappe physiologische Insuffizienz..."

Into this blessing:

{
  "patient_id": "111111111",
  "age": 80,
  "sex": "Female", 
  "mitral_valve_grade": "Normal",
  "tricuspid_valve_grade": "Mild"
}
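The `age` value does not appear anywhere in the report text; it has to be derived. Here is a minimal sketch, assuming age means completed years between the birth date and the examination date. MOSAICX may well obtain it differently (for example by letting the LLM infer it), and the helper names below are hypothetical.

```python
from datetime import date, datetime


def parse_german_date(text: str) -> date:
    """Parse a DD.MM.YYYY date as written in German reports."""
    return datetime.strptime(text, "%d.%m.%Y").date()


def age_at(birth: date, exam: date) -> int:
    """Return completed years between the birth date and the exam date."""
    # Subtract one year if the birthday has not yet occurred in the exam year.
    before_birthday = (exam.month, exam.day) < (birth.month, birth.day)
    return exam.year - birth.year - int(before_birthday)


birth = parse_german_date("13.03.1940")  # "geb. 13.03.1940"
exam = parse_german_date("06.10.2020")   # examination date from the report
print(age_at(birth, exam))  # 80
```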

The honest truth: This tool was born out of pure desperation at DIGIT-X Lab when we realized we had 50,000+ radiology reports to process and our research budget couldn't afford a small army of medical students with Red Bull addictions.

🚀 Quick Start (Because Time is Money)

Installation

Option 1: Standard Installation

pip install mosaicx

Option 2: With UV (Faster & Better)

uv add mosaicx

Basic Usage

# 1. Generate a schema from natural language
mosaicx generate --desc "Patient demographics with valve conditions"

# 2. Extract data from PDF reports  
mosaicx extract --pdf report.pdf --schema PatientValveReport

# 3. Profit (literally, in research publications)

That's it. Seriously. We spent months making this as simple as possible because we're researchers, not software engineers, and we have better things to do than debug YAML files.
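If you have more than one report, you can drive the same command from a small script. The sketch below only uses the `--pdf`, `--schema`, and `--model` flags shown in this README; the helper name and the file names are hypothetical, and the script prints the commands instead of running them.

```python
import shlex
from pathlib import Path
from typing import List, Optional


def build_extract_command(pdf: Path, schema: str, model: Optional[str] = None) -> List[str]:
    """Build the argument list for one `mosaicx extract` call."""
    cmd = ["mosaicx", "extract", "--pdf", str(pdf), "--schema", schema]
    if model is not None:
        cmd += ["--model", model]
    return cmd


# Hypothetical file names, for illustration only.
for name in ["echo_001.pdf", "echo_002.pdf"]:
    cmd = build_extract_command(Path(name), "PatientValveReport", model="mistral")
    print(shlex.join(cmd))
```

To actually execute each command, pass the list to `subprocess.run(cmd, check=True)`.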

🏥 Why We Built This (The Real Story)

The Problem

At DIGIT-X Lab (LMU University Hospital), we had:

📄 50,000+ medical reports in PDF format
🧠 Brilliant researchers who shouldn't be doing data entry
⏰ Deadlines that don't care about your manual extraction process
💰 Limited budgets (welcome to academic research)

Existing Solutions Were...

💸 Too expensive (enterprise NLP solutions cost more than our coffee budget)
🎯 Too generic (built for business documents, not medical reports)
🔒 Too cloud-dependent (patient data doesn't leave our servers, period)
🤖 Too rigid (required predefined schemas that never match reality)

Our Approach

We said "screw it" and built something that actually works for medical researchers:

🏠 Runs locally (your patient data stays in your building)
🧠 Uses local LLMs (Ollama + your own models)
📝 Generates schemas from plain English (describe what you want, get code)
🔧 Actually handles real medical text (German medical terms, inconsistent formats, coffee stains)
🎨 Pretty terminal output (because we're human beings who appreciate beauty)

🛠 How It Actually Works

The Magic Pipeline

📄 PDF → 📝 Text (Docling) → 🤖 LLM + Schema → ✨ Structured Data
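To make the stages concrete, here is a rough sketch of how such a pipeline could be wired together. It is not MOSAICX's actual code: the function names are hypothetical, and the PDF converter and the LLM are passed in as plain callables, so the stand-ins below can be swapped for Docling and an Ollama model.

```python
import json
from typing import Any, Callable, Dict


def run_pipeline(
    pdf_path: str,
    schema_fields: Dict[str, str],
    pdf_to_text: Callable[[str], str],
    ask_llm: Callable[[str], str],
) -> Dict[str, Any]:
    """PDF -> text -> LLM prompt -> dict restricted to the schema fields."""
    text = pdf_to_text(pdf_path)
    field_lines = "\n".join(f"- {name}: {desc}" for name, desc in schema_fields.items())
    prompt = (
        "Extract the following fields from the report and answer with JSON only.\n"
        f"{field_lines}\n\nReport:\n{text}"
    )
    data = json.loads(ask_llm(prompt))
    # Keep only the fields the schema asks for; missing ones become None.
    return {name: data.get(name) for name in schema_fields}


# Stand-ins so the sketch runs without Docling or an LLM.
def fake_pdf_to_text(path: str) -> str:
    return "Pat.-Nr.: 111111111, geb. 13.03.1940 ..."


def fake_llm(prompt: str) -> str:
    return '{"patient_id": "111111111", "unexpected": "ignored"}'


fields = {"patient_id": "Patient number", "age": "Age in years"}
result = run_pipeline("report.pdf", fields, fake_pdf_to_text, fake_llm)
print(result)
```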

Schema Generation

mosaicx generate --desc "Echocardiography report with valve assessments"

Uses local LLMs to understand your requirements
Generates proper Pydantic models with validation
Saves both Python classes and JSON schemas
No more manually writing data models!
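For a feel of what a generated model might look like, here is a hand-written sketch based on the JSON example earlier in this README. The real generator output will differ; the grade values beyond "Normal" and "Mild", the age bounds, and the field descriptions are assumptions.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError

# Assumed grading scale; only "Normal" and "Mild" appear in this README.
ValveGrade = Literal["Normal", "Mild", "Moderate", "Severe"]


class PatientValveReport(BaseModel):
    patient_id: str = Field(..., description="Patient number as printed in the report")
    age: Optional[int] = Field(None, ge=0, le=130, description="Age in years at examination")
    sex: Optional[Literal["Male", "Female", "Other"]] = None
    mitral_valve_grade: Optional[ValveGrade] = None
    tricuspid_valve_grade: Optional[ValveGrade] = None


report = PatientValveReport(
    patient_id="111111111",
    age=80,
    sex="Female",
    mitral_valve_grade="Normal",
    tricuspid_valve_grade="Mild",
)
print(report.mitral_valve_grade)

# Values outside the schema are rejected instead of silently stored.
try:
    PatientValveReport(patient_id="111111111", mitral_valve_grade="physiologisch")
except ValidationError as err:
    print("rejected:", len(err.errors()), "error(s)")
```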

Data Extraction

mosaicx extract --pdf echo_report.pdf --schema PatientValveReport --model mistral

Robust PDF text extraction (handles scanned docs, tables, weird formatting)
Schema-driven extraction with validation
Falls back gracefully when models get creative
Silent error handling (no more spam in your terminal)

🎨 Features We're Actually Proud Of

🧠 Smart Schema Coercion

Handles German medical terms → English schema values
"physiologische Insuffizienz" → "Normal" (because we live in Germany)
Case-insensitive matching (because doctors don't follow style guides)
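One simple way to get this behavior is a normalized lookup table. The sketch below is an assumption about the general idea, not MOSAICX's implementation: only the "physiologische Insuffizienz" mapping comes from this README, and the other entries and the function name are illustrative.

```python
from typing import Optional

ALLOWED_GRADES = ("Normal", "Mild", "Moderate", "Severe")

# Hypothetical synonym table; only the first entry is taken from the README.
SYNONYMS = {
    "physiologische insuffizienz": "Normal",
    "leichtgradig": "Mild",
    "mittelgradig": "Moderate",
    "hochgradig": "Severe",
}


def coerce_grade(raw: str) -> Optional[str]:
    """Map a raw model answer onto an allowed schema value, or None."""
    # Collapse whitespace and ignore case before comparing.
    key = " ".join(raw.split()).casefold()
    for allowed in ALLOWED_GRADES:
        if key == allowed.casefold():
            return allowed
    return SYNONYMS.get(key)


print(coerce_grade("Physiologische  Insuffizienz"))
print(coerce_grade("MILD"))
print(coerce_grade("unklar"))
```

Returning `None` for unknown terms keeps bad values out of the results and leaves the decision about what to do next to the caller.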

🛡️ Bulletproof Error Handling

Multiple fallback strategies when models fail
JSON repair attempts (because LLMs sometimes get creative)
Graceful degradation (something is better than nothing)
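A fallback chain for model output could look roughly like this. It is a sketch of the general technique under our own assumptions (the function name and the specific repair steps are hypothetical), not a copy of what MOSAICX does internally.

```python
import json
import re
from typing import Any, Dict


def parse_llm_json(raw: str) -> Dict[str, Any]:
    """Try increasingly forgiving ways to get a dict out of a model reply."""
    # Strategy 1: the reply is already valid JSON.
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass

    # Strategy 2: cut out the outermost {...} span and drop trailing commas.
    # (Naive: a comma followed by a brace inside a string value would also be changed.)
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidate = re.sub(r",\s*([}\]])", r"\1", raw[start:end + 1])
        try:
            parsed = json.loads(candidate)
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass

    # Strategy 3: give up gracefully with an empty result.
    return {}


reply = 'Sure! Here is the data: {"patient_id": "111111111", "age": 80,} Hope this helps.'
print(parse_llm_json(reply))
print(parse_llm_json("no json here"))
```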

🎭 Clean Terminal Experience

✨ Schema Model: PatientValveReport ✨

📋 Extraction Results: PatientValveReport
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Field                    ┃ Extracted Value                 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ patient_id               │ 0022768653                      │
│ valve_condition          │ Mild insufficiency              │
└──────────────────────────┴─────────────────────────────────┘

🔐 Privacy-First Architecture

All processing happens on your hardware
No cloud APIs (your data never leaves your network)
GDPR compliant by design (because we're in Europe)
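In practice, "local" means the LLM request goes to a server on your own machine. Here is a minimal sketch that talks to Ollama's default local endpoint with the standard library only. How MOSAICX actually calls Ollama is not documented in this README, so treat the function names and the prompt as assumptions.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing here points outside this machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "mistral") -> urllib.request.Request:
    """Prepare a non-streaming generate request that asks for JSON output."""
    payload = {"model": model, "prompt": prompt, "stream": False, "format": "json"}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "mistral") -> str:
    """Send the request; requires a running Ollama server with the model pulled."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]


request = build_request("Extract patient_id as JSON from: Pat.-Nr.: 111111111")
print(request.full_url)
```

Calling `generate(...)` only works when Ollama is running locally; building the request has no side effects.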

📊 Real-World Performance

What we've tested it on:

✅ German echocardiography reports (our bread and butter)
✅ Mixed-language medical documents (German/English clinical notes)
✅ Scanned PDFs (with OCR quality ranging from "perfect" to "help me")
✅ 50,000+ reports (and counting)

Models that work well:

🥇 Mistral (fast, reliable, good with medical terminology)
🥈 DeepSeek R1 70B (slower but handles complex cases)
🥉 Llama 3 (solid baseline performance)

Honest accuracy rates:

📊 ~85-90% field extraction accuracy on clean reports
📊 ~70-80% on challenging scanned documents
📊 ~95% when you fine-tune the schema descriptions

(These numbers are from actual usage, not cherry-picked benchmarks)
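If you want to measure this on your own reports, one straightforward definition is exact-match accuracy per field against a hand-labeled reference set. The sketch below assumes that definition; this README does not state how the figures above were computed, and the toy data is for illustration only.

```python
from typing import Any, Dict, List


def field_accuracy(
    predictions: List[Dict[str, Any]],
    references: List[Dict[str, Any]],
) -> float:
    """Share of reference fields whose extracted value matches exactly."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        for field, expected in ref.items():
            total += 1
            correct += int(pred.get(field) == expected)
    return correct / total if total else 0.0


# Toy data for illustration only; these are not MOSAICX results.
references = [
    {"patient_id": "A", "mitral_valve_grade": "Normal"},
    {"patient_id": "B", "mitral_valve_grade": "Mild"},
]
predictions = [
    {"patient_id": "A", "mitral_valve_grade": "Normal"},
    {"patient_id": "B", "mitral_valve_grade": "Moderate"},
]
print(f"{field_accuracy(predictions, references):.0%}")  # 75%
```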

🤝 Contributing (We Need Your Help)

What We'd Love Help With:

🌍 More language support (French medical terms, anyone?)
🏥 New medical domains (pathology, radiology, lab reports)
🐛 Bug reports (especially weird edge cases we haven't seen)
📚 Documentation (making this more accessible to non-programmers)

How to Contribute:

Fork the repository
Create a feature branch: git checkout -b feature/amazing-medical-nlp
Test on real medical data (anonymized, please!)
Submit a pull request with examples

We're academics, so we appreciate proper citations and detailed explanations of your improvements.

📜 License & Citation

License

AGPL-3.0 (GNU Affero General Public License v3.0)

Translation: You can use it, modify it, and distribute it freely. If you distribute a modified version, or let other people use your modified version over a network, you need to make your source code available under the same license. Fair's fair.

Citation

If MOSAICX helps with your research, we'd appreciate a citation:

@software{mosaicx2025,
  title={MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author={Shiyam Sundar, Lalith Kumar and {DIGIT-X Lab Team}},
  year={2025},
  url={https://github.com/LalithShiyam/MOSAICX},
  institution={DIGIT-X Lab, LMU Radiology, LMU University Hospital}
}

👥 The Team Behind This

DIGIT-X Lab @ LMU University Hospital

🧠 Lalith Kumar Shiyam Sundar, PhD - Lead Developer & Chief Coffee Consumer
👥 DIGIT-X Lab Team - The people who actually test this stuff

Contact: lalith.shiyam@med.uni-muenchen.de
Lab: https://www.digit-x-lab.com
Location: Munich, Germany 🇩🇪

🙏 Acknowledgments

Thanks to:

☕ Coffee (the real MVP of this project)
🦙 Ollama team (for making local LLMs actually usable)
📄 Docling team (for solving PDF extraction so we didn't have to)
🐍 Pydantic team (for making data validation not terrible)
🎨 Rich library (for making our terminals beautiful)
🏥 Our clinical collaborators (for providing endless edge cases)
🎓 LMU University Hospital (for letting us build cool stuff)

🔮 What's Next?

Roadmap:

🌐 Web interface (for the point-and-click crowd)
📊 Batch processing tools (because one PDF at a time is for amateurs)
🤖 Fine-tuned medical models (when we get more GPU budget)
🔌 API endpoints (for the developers among us)
📱 Mobile app (just kidding, we're not monsters)

Help Us Prioritize:

Open an issue with your use case. We build what people actually need, not what sounds cool in academic papers.

💡 Final Thoughts

MOSAICX isn't perfect. It's not going to solve all your medical data problems overnight. But it's honest, it's practical, and it was built by people who actually use it every day.

We built this tool because we needed it, and we're sharing it because we think you might need it too. If it saves you even half the time it's saved us, we've done our job.

Happy extracting! 🚀

Built with ❤️, ☕, and occasional frustration at DIGIT-X Lab, Munich
