Medical cOmputational Suite for Advanced Intelligent eXtraction
MOSAICX
"We built this because manually extracting data from thousands of medical reports was slowly killing our souls."
- The DIGIT-X Team, after another late night of copy-pasting patient data
What MOSAICX Actually Does
MOSAICX turns this nightmare:
"Pat.-Nr.: 111111111, geb. 13.03.1940, Müller, Jane
Transthorakale Echokardiographie vom 06.10.2020 10:45
Befund: Mitralklappe physiologische Insuffizienz..."
Into this blessing:
{
"patient_id": "111111111",
"age": 80,
"sex": "Female",
"mitral_valve_grade": "Normal",
"tricuspid_valve_grade": "Mild"
}
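Note that the `age` value above does not appear verbatim in the report; it can be derived from the birth date and the exam date. A minimal sketch of that arithmetic, assuming German `DD.MM.YYYY` dates (the function name `age_at_exam` is illustrative and not part of MOSAICX):

```python
from datetime import datetime


def age_at_exam(birth: str, exam: str, fmt: str = "%d.%m.%Y") -> int:
    """Return the age in completed years on the exam date."""
    born = datetime.strptime(birth, fmt).date()
    seen = datetime.strptime(exam, fmt).date()
    # Subtract one year if the birthday has not yet occurred in the exam year.
    before_birthday = (seen.month, seen.day) < (born.month, born.day)
    return seen.year - born.year - int(before_birthday)


print(age_at_exam("13.03.1940", "06.10.2020"))  # 80
```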
The honest truth: This tool was born out of pure desperation at DIGIT-X Lab when we realized we had 50,000+ radiology reports to process and our research budget couldn't afford a small army of medical students with Red Bull addictions.
Quick Start (Because Time is Money)
Installation
Option 1: Standard Installation
pip install mosaicx
Option 2: With UV (Faster & Better)
uv add mosaicx
Basic Usage
# 1. Generate a schema from natural language
mosaicx generate --desc "Patient demographics with valve conditions"
# 2. Extract data from PDF reports
mosaicx extract --pdf report.pdf --schema PatientValveReport
# 3. Profit (literally, in research publications)
That's it. Seriously. We spent months making this as simple as possible because we're researchers, not software engineers, and we have better things to do than debug YAML files.
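If you have a whole folder of reports, the `extract` command can be scripted. The sketch below only builds the command lines, using the `--pdf`, `--schema`, and `--model` flags shown in this README; the helper name `build_extract_commands` and the folder layout are assumptions, not part of MOSAICX.

```python
from pathlib import Path
from typing import List, Optional


def build_extract_commands(
    pdf_dir: str, schema: str, model: Optional[str] = None
) -> List[List[str]]:
    """Build one `mosaicx extract` command per PDF found in pdf_dir."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        cmd = ["mosaicx", "extract", "--pdf", str(pdf), "--schema", schema]
        if model:
            # --model is optional; it appears in the Data Extraction example.
            cmd += ["--model", model]
        commands.append(cmd)
    return commands
```

Each returned list can be handed to `subprocess.run(cmd)` to run the extraction for that file.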
Why We Built This (The Real Story)
The Problem
At DIGIT-X Lab (LMU University Hospital), we had:
- 50,000+ medical reports in PDF format
- Brilliant researchers who shouldn't be doing data entry
- Deadlines that don't care about your manual extraction process
- Limited budgets (welcome to academic research)
Existing Solutions Were...
- Too expensive (enterprise NLP solutions cost more than our coffee budget)
- Too generic (built for business documents, not medical reports)
- Too cloud-dependent (patient data doesn't leave our servers, period)
- Too rigid (required predefined schemas that never match reality)
Our Approach
We said "screw it" and built something that actually works for medical researchers:
- Runs locally (your patient data stays in your building)
- Uses local LLMs (Ollama + your own models)
- Generates schemas from plain English (describe what you want, get code)
- Actually handles real medical text (German medical terms, inconsistent formats, coffee stains)
- Pretty terminal output (because we're human beings who appreciate beauty)
How It Actually Works
The Magic Pipeline
PDF → Text (Docling) → LLM + Schema → Structured Data
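The pipeline can be read as three function calls chained together. The sketch below passes each stage in as a plain callable so the flow is visible without Docling or Ollama installed; `run_pipeline`, the prompt wording, and the stage signatures are hypothetical, not MOSAICX's internal API.

```python
import json
from typing import Any, Callable, Dict


def run_pipeline(
    pdf_path: str,
    pdf_to_text: Callable[[str], str],                     # stage 1: PDF -> text
    ask_llm: Callable[[str], str],                         # stage 2: prompt -> raw reply
    validate: Callable[[Dict[str, Any]], Dict[str, Any]],  # stage 3: schema check
    schema: Dict[str, Any],
) -> Dict[str, Any]:
    text = pdf_to_text(pdf_path)
    prompt = (
        "Extract the fields described by this JSON schema and answer with JSON only.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Report:\n{text}"
    )
    raw = ask_llm(prompt)
    return validate(json.loads(raw))


# Stand-in stages so the example runs on its own.
result = run_pipeline(
    "report.pdf",
    pdf_to_text=lambda path: "Pat.-Nr.: 111111111 ...",
    ask_llm=lambda prompt: '{"patient_id": "111111111"}',
    validate=lambda data: data,
    schema={"type": "object", "properties": {"patient_id": {"type": "string"}}},
)
print(result)  # {'patient_id': '111111111'}
```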
Schema Generation
mosaicx generate --desc "Echocardiography report with valve assessments"
- Uses local LLMs to understand your requirements
- Generates proper Pydantic models with validation
- Saves both Python classes and JSON schemas
- No more manually writing data models!
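To give a feel for the kind of model such a description could produce, here is a hand-written sketch. It uses a standard-library dataclass as a stand-in for a Pydantic model so that it runs without dependencies. The class name matches the `PatientValveReport` used elsewhere in this README, but the fields and the allowed grades are assumptions based on the example output at the top, not real generator output.

```python
from dataclasses import asdict, dataclass
from typing import Optional

VALVE_GRADES = ("Normal", "Mild", "Moderate", "Severe")  # assumed value set


@dataclass
class PatientValveReport:
    patient_id: str
    age: Optional[int] = None
    sex: Optional[str] = None
    mitral_valve_grade: Optional[str] = None
    tricuspid_valve_grade: Optional[str] = None

    def __post_init__(self) -> None:
        # Minimal validation, roughly what a Pydantic validator would enforce.
        if self.age is not None and not 0 <= self.age <= 130:
            raise ValueError(f"implausible age: {self.age}")
        for name in ("mitral_valve_grade", "tricuspid_valve_grade"):
            value = getattr(self, name)
            if value is not None and value not in VALVE_GRADES:
                raise ValueError(f"{name} must be one of {VALVE_GRADES}, got {value!r}")


report = PatientValveReport(
    patient_id="111111111",
    age=80,
    sex="Female",
    mitral_valve_grade="Normal",
    tricuspid_valve_grade="Mild",
)
print(asdict(report))
```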
Data Extraction
mosaicx extract --pdf echo_report.pdf --schema PatientValveReport --model mistral
- Robust PDF text extraction (handles scanned docs, tables, weird formatting)
- Schema-driven extraction with validation
- Falls back gracefully when models get creative
- Silent error handling (no more spam in your terminal)
Features We're Actually Proud Of
Smart Schema Coercion
- Handles German medical terms → English schema values
- "physiologische Insuffizienz" → "Normal" (because we live in Germany)
- Case-insensitive matching (because doctors don't follow style guides)
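A minimal sketch of this kind of coercion, assuming a small hand-maintained synonym table. Only the "physiologische Insuffizienz" to "Normal" mapping comes from this README; the other entries, the allowed values, and the function name `coerce_grade` are illustrative.

```python
from typing import Optional

ALLOWED = ("Normal", "Mild", "Moderate", "Severe")  # assumed schema values

# German phrase (lowercase) -> English schema value.
SYNONYMS = {
    "physiologische insuffizienz": "Normal",  # from the README
    "leichtgradig": "Mild",                   # illustrative
    "mittelgradig": "Moderate",               # illustrative
    "hochgradig": "Severe",                   # illustrative
}


def coerce_grade(raw: str) -> Optional[str]:
    """Map a free-text grade onto an allowed schema value, or None if unknown."""
    key = " ".join(raw.split()).casefold()  # collapse whitespace, ignore case
    for allowed in ALLOWED:
        if key == allowed.casefold():
            return allowed
    return SYNONYMS.get(key)


print(coerce_grade("Physiologische  Insuffizienz"))  # Normal
print(coerce_grade("MILD"))                          # Mild
print(coerce_grade("unklar"))                        # None
```

Returning `None` for unknown terms, rather than guessing, leaves the decision to the caller.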
Bulletproof Error Handling
- Multiple fallback strategies when models fail
- JSON repair attempts (because GPT sometimes gets creative)
- Graceful degradation (something is better than nothing)
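The fallback idea can be sketched as a chain of increasingly forgiving parse attempts. This is an illustration of the general technique under simple assumptions (one JSON object per reply, trailing commas as the main defect), not MOSAICX's actual repair logic; `parse_llm_json` is a hypothetical name.

```python
import json
import re
from typing import Any, Optional


def parse_llm_json(reply: str) -> Optional[Any]:
    """Try progressively more forgiving ways to read JSON from an LLM reply."""
    candidates = [reply]

    # Fallback 1: keep only the outermost {...} span (drops surrounding chatter).
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        body = reply[start : end + 1]
        candidates.append(body)
        # Fallback 2: remove trailing commas before a closing brace or bracket.
        candidates.append(re.sub(r",\s*([}\]])", r"\1", body))

    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None  # graceful degradation: the caller decides what to do next


messy = 'Sure, here it is: {"patient_id": "111111111", "age": 80,} Hope that helps!'
print(parse_llm_json(messy))           # {'patient_id': '111111111', 'age': 80}
print(parse_llm_json("no json here"))  # None
```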
Clean Terminal Experience
✨ Schema Model: PatientValveReport ✨
Extraction Results: PatientValveReport
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Field           ┃ Extracted Value    ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ patient_id      │ 0022768653         │
│ valve_condition │ Mild insufficiency │
└─────────────────┴────────────────────┘
Privacy-First Architecture
- All processing happens on your hardware
- No cloud APIs (your data never leaves your network)
- GDPR compliant by design (because we're in Europe)
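One way to see what "local only" means in practice: a request to Ollama targets a server on your own machine. The sketch below builds (but does not send) such a request for Ollama's `/api/generate` endpoint on its default port 11434 and refuses any non-local host. Whether MOSAICX talks to this exact endpoint internally is an assumption, and `build_local_request` is a hypothetical helper.

```python
import json
from urllib.parse import urlparse
from urllib.request import Request

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def build_local_request(
    model: str, prompt: str, base_url: str = "http://localhost:11434"
) -> Request:
    """Build a POST request for a local Ollama server, rejecting remote hosts."""
    host = urlparse(base_url).hostname
    if host not in LOCAL_HOSTS:
        raise ValueError(f"refusing to send patient data to {host!r}")
    payload = {"model": model, "prompt": prompt, "stream": False, "format": "json"}
    return Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_local_request("mistral", "Extract patient_id from: ...")
print(req.full_url)  # http://localhost:11434/api/generate
```

Sending it would be a call to `urllib.request.urlopen(req)` with an Ollama server running on the same machine.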
Real-World Performance
What we've tested it on:
- German echocardiography reports (our bread and butter)
- Mixed-language medical documents (German/English clinical notes)
- Scanned PDFs (with OCR quality ranging from "perfect" to "help me")
- 50,000+ reports (and counting)
Models that work well:
- Mistral (fast, reliable, good with medical terminology)
- DeepSeek R1 70B (slower but handles complex cases)
- Llama 3 (solid baseline performance)
Honest accuracy rates:
- ~85-90% field extraction accuracy on clean reports
- ~70-80% on challenging scanned documents
- ~95% when you fine-tune the schema descriptions
(These numbers are from actual usage, not cherry-picked benchmarks)
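For reference, field extraction accuracy can be read as the share of fields whose extracted value matches a manually checked reference value. How the numbers above were computed is not specified in this README, so the exact-match rule and the `field_accuracy` helper below are assumptions, and the records are made-up examples.

```python
from typing import Any, Dict, List


def field_accuracy(
    predicted: List[Dict[str, Any]], reference: List[Dict[str, Any]]
) -> float:
    """Fraction of reference fields reproduced exactly, across all reports."""
    correct = total = 0
    for pred, ref in zip(predicted, reference):
        for field, expected in ref.items():
            total += 1
            correct += int(pred.get(field) == expected)
    return correct / total if total else 0.0


reference = [
    {"patient_id": "111111111", "mitral_valve_grade": "Normal"},
    {"patient_id": "222222222", "mitral_valve_grade": "Mild"},
]
predicted = [
    {"patient_id": "111111111", "mitral_valve_grade": "Normal"},
    {"patient_id": "222222222", "mitral_valve_grade": "Moderate"},
]
print(field_accuracy(predicted, reference))  # 0.75
```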
Contributing (We Need Your Help)
What We'd Love Help With:
- More language support (French medical terms, anyone?)
- New medical domains (pathology, radiology, lab reports)
- Bug reports (especially weird edge cases we haven't seen)
- Documentation (making this more accessible to non-programmers)
How to Contribute:
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-medical-nlp
- Test on real medical data (anonymized, please!)
- Submit a pull request with examples
We're academics, so we appreciate proper citations and detailed explanations of your improvements.
License & Citation
License
AGPL-3.0 (GNU Affero General Public License v3.0)
Translation: You can use it, modify it, and distribute it freely. If you distribute a modified version, or run one as a network service for other people, you need to share your source code under the same license. Fair's fair.
Citation
If MOSAICX helps with your research, we'd appreciate a citation:
@software{mosaicx2025,
title={MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
author={Shiyam Sundar, Lalith Kumar and {DIGIT-X Lab Team}},
year={2025},
url={https://github.com/LalithShiyam/MOSAICX},
institution={DIGIT-X Lab, LMU Radiology, LMU University Hospital}
}
The Team Behind This
DIGIT-X Lab @ LMU University Hospital
- Lalith Kumar Shiyam Sundar, PhD - Lead Developer & Chief Coffee Consumer
- DIGIT-X Lab Team - The people who actually test this stuff
Contact: lalith.shiyam@med.uni-muenchen.de
Lab: https://www.digit-x-lab.com
Location: Munich, Germany
Acknowledgments
Thanks to:
- Coffee (the real MVP of this project)
- Ollama team (for making local LLMs actually usable)
- Docling team (for solving PDF extraction so we didn't have to)
- Pydantic team (for making data validation not terrible)
- Rich library (for making our terminals beautiful)
- Our clinical collaborators (for providing endless edge cases)
- LMU University Hospital (for letting us build cool stuff)
What's Next?
Roadmap:
- Web interface (for the point-and-click crowd)
- Batch processing tools (because one PDF at a time is for amateurs)
- Fine-tuned medical models (when we get more GPU budget)
- API endpoints (for the developers among us)
- Mobile app (just kidding, we're not monsters)
Help Us Prioritize:
Open an issue with your use case. We build what people actually need, not what sounds cool in academic papers.
Final Thoughts
MOSAICX isn't perfect. It's not going to solve all your medical data problems overnight. But it's honest, it's practical, and it was built by people who actually use it every day.
We built this tool because we needed it, and we're sharing it because we think you might need it too. If it saves you even half the time it's saved us, we've done our job.
Happy extracting!
Built with ❤️, ☕, and occasional frustration at DIGIT-X Lab, Munich