Medical cOmputational Suite for Advanced Intelligent eXtraction

MOSAICX 🏥🤖

PyPI version | Python 3.11+ | License: AGPL-3.0

"We built this because manually extracting data from thousands of medical reports was slowly killing our souls."
- The DIGIT-X Team, after another late night of copy-pasting patient data


🎯 What MOSAICX Actually Does

MOSAICX turns this nightmare:

"Pat.-Nr.: 111111111, geb. 13.03.1940, Müller, Jane
Transthorakale Echokardiographie vom 06.10.2020 10:45
Befund: Mitralklappe physiologische Insuffizienz..."

Into this blessing:

{
  "patient_id": "111111111",
  "age": 80,
  "sex": "Female", 
  "mitral_valve_grade": "Normal",
  "tricuspid_valve_grade": "Mild"
}
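
To make the transformation a bit more concrete: the deterministic part of it (patient ID and age at the time of the exam) can be sketched with a regular expression and some date arithmetic. This is only an illustration. The names `parse_header` and `age_on` and the regex are assumptions made up for this example, not MOSAICX internals; the tool itself hands the report to an LLM guided by a schema.

```python
import re
from datetime import date, datetime

HEADER = (
    "Pat.-Nr.: 111111111, geb. 13.03.1940, Müller, Jane\n"
    "Transthorakale Echokardiographie vom 06.10.2020 10:45"
)


def parse_german_date(text: str) -> date:
    """Parse a DD.MM.YYYY date as used in German reports."""
    return datetime.strptime(text, "%d.%m.%Y").date()


def age_on(born: date, on: date) -> int:
    """Age in full years on a given day."""
    # Subtract one year if the birthday has not happened yet in that year.
    return on.year - born.year - ((on.month, on.day) < (born.month, born.day))


def parse_header(text: str) -> dict:
    """Hypothetical helper: pull patient ID and age from the report header."""
    patient = re.search(r"Pat\.-Nr\.:\s*(\d+),\s*geb\.\s*(\d{2}\.\d{2}\.\d{4})", text)
    exam = re.search(r"vom\s+(\d{2}\.\d{2}\.\d{4})", text)
    if patient is None or exam is None:
        raise ValueError("header does not match the assumed layout")
    born = parse_german_date(patient.group(2))
    examined = parse_german_date(exam.group(1))
    return {"patient_id": patient.group(1), "age": age_on(born, examined)}


print(parse_header(HEADER))  # {'patient_id': '111111111', 'age': 80}
```

The valve grades are the hard part, which is why they go through the LLM rather than a regex.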

The honest truth: This tool was born out of pure desperation at DIGIT-X Lab when we realized we had 50,000+ radiology reports to process and our research budget couldn't afford a small army of medical students with Red Bull addictions.


🚀 Quick Start (Because Time is Money)

Installation

Option 1: Standard Installation

pip install mosaicx

Option 2: With uv (faster; adds MOSAICX to your current uv project)

uv add mosaicx

Basic Usage

# 1. Generate a schema from natural language
mosaicx generate --desc "Patient demographics with valve conditions"

# 2. Extract data from PDF reports  
mosaicx extract --pdf report.pdf --schema PatientValveReport

# 3. Profit (literally, in research publications)

That's it. Seriously. We spent months making this as simple as possible because we're researchers, not software engineers, and we have better things to do than debug YAML files.
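
If you have a whole folder of reports, a thin wrapper around the CLI is one way to go. The sketch below only builds the command lines, using the `--pdf`, `--schema`, and `--model` flags shown in this README; the helper names `build_extract_command` and `build_batch` are hypothetical, and nothing is executed unless you pass `dry_run=False`.

```python
import shlex
import subprocess
from pathlib import Path
from typing import Iterable, List, Optional


def build_extract_command(pdf: Path, schema: str, model: Optional[str] = None) -> List[str]:
    """Build one `mosaicx extract` call as an argument list."""
    cmd = ["mosaicx", "extract", "--pdf", str(pdf), "--schema", schema]
    if model is not None:
        cmd += ["--model", model]
    return cmd


def build_batch(pdfs: Iterable[Path], schema: str, model: Optional[str] = None) -> List[List[str]]:
    """One command per PDF, in a stable (sorted) order."""
    return [build_extract_command(pdf, schema, model) for pdf in sorted(pdfs)]


def run_batch(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; run it only when dry_run is False (needs mosaicx on PATH)."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_batch(build_batch([Path("report.pdf")], "PatientValveReport"))
```

Keeping the command as a list (rather than one string) avoids shell-quoting problems with odd file names.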


๐Ÿฅ Why We Built This (The Real Story)

The Problem

At DIGIT-X Lab (LMU University Hospital), we had:

  • 📄 50,000+ medical reports in PDF format
  • 🧠 Brilliant researchers who shouldn't be doing data entry
  • ⏰ Deadlines that don't care about your manual extraction process
  • 💰 Limited budgets (welcome to academic research)

Existing Solutions Were...

  • Too expensive (enterprise NLP solutions cost more than our coffee budget)
  • 🎯 Too generic (built for business documents, not medical reports)
  • 🔒 Too cloud-dependent (patient data doesn't leave our servers, period)
  • 🤖 Too rigid (required predefined schemas that never match reality)

Our Approach

We said "screw it" and built something that actually works for medical researchers:

  • 🏠 Runs locally (your patient data stays in your building)
  • 🧠 Uses local LLMs (Ollama + your own models)
  • 📝 Generates schemas from plain English (describe what you want, get code)
  • 🔧 Actually handles real medical text (German medical terms, inconsistent formats, coffee stains)
  • 🎨 Pretty terminal output (because we're human beings who appreciate beauty)

🛠 How It Actually Works

The Magic Pipeline

📄 PDF → 📝 Text (Docling) → 🤖 LLM + Schema → ✨ Structured Data
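
In code, the shape of that pipeline looks roughly like the sketch below. Everything here is a stand-in: `run_pipeline`, `fake_pdf_to_text`, and `fake_llm` are hypothetical names, and the two stubs replace Docling and the local LLM so the example runs without either. It shows the data flow, not the actual MOSAICX implementation.

```python
import json
from typing import Callable, Dict, Sequence


def run_pipeline(
    pdf_path: str,
    pdf_to_text: Callable[[str], str],
    ask_llm: Callable[[str], str],
    field_names: Sequence[str],
) -> Dict[str, object]:
    """PDF -> text -> LLM + schema fields -> structured dict."""
    text = pdf_to_text(pdf_path)
    prompt = (
        "Return a JSON object with exactly these keys: "
        + ", ".join(field_names)
        + "\n\nReport:\n"
        + text
    )
    answer = json.loads(ask_llm(prompt))
    if not isinstance(answer, dict):
        raise ValueError("model did not return a JSON object")
    # Keep only the requested fields; anything the model skipped becomes None.
    return {name: answer.get(name) for name in field_names}


def fake_pdf_to_text(path: str) -> str:
    # Stub standing in for the PDF-to-text step.
    return "Pat.-Nr.: 111111111. Befund: Mitralklappe physiologische Insuffizienz"


def fake_llm(prompt: str) -> str:
    # Stub standing in for a local model call.
    return '{"patient_id": "111111111", "mitral_valve_grade": "Normal", "note": "ignored"}'


result = run_pipeline(
    "report.pdf",
    fake_pdf_to_text,
    fake_llm,
    ["patient_id", "mitral_valve_grade", "tricuspid_valve_grade"],
)
print(result)
```

Passing the two steps in as callables keeps the sketch testable without a PDF or a model at hand.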

Schema Generation

mosaicx generate --desc "Echocardiography report with valve assessments"
  • Uses local LLMs to understand your requirements
  • Generates proper Pydantic models with validation
  • Saves both Python classes and JSON schemas
  • No more manually writing data models!
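
To give a feel for what a generated schema contains, here is a dependency-free sketch. MOSAICX generates Pydantic models; this example deliberately uses a standard-library dataclass as a stand-in so it runs anywhere. The field names come from the JSON example in this README, while the allowed grade values and the age range are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Assumed value set; a real generated schema may use different grades.
VALVE_GRADES = ("Normal", "Mild", "Moderate", "Severe")


@dataclass
class PatientValveReport:
    patient_id: str
    age: Optional[int] = None
    sex: Optional[str] = None
    mitral_valve_grade: Optional[str] = None
    tricuspid_valve_grade: Optional[str] = None

    def __post_init__(self) -> None:
        # Validation that a Pydantic model would express declaratively.
        if not self.patient_id:
            raise ValueError("patient_id must not be empty")
        if self.age is not None and not 0 <= self.age <= 130:
            raise ValueError(f"age out of range: {self.age}")
        for name in ("mitral_valve_grade", "tricuspid_valve_grade"):
            value = getattr(self, name)
            if value is not None and value not in VALVE_GRADES:
                raise ValueError(f"{name} must be one of {VALVE_GRADES}, got {value!r}")


report = PatientValveReport(patient_id="111111111", age=80, mitral_valve_grade="Normal")
print(json.dumps(asdict(report), indent=2))
```

Fields other than `patient_id` are optional here because real reports often leave things out.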

Data Extraction

mosaicx extract --pdf echo_report.pdf --schema PatientValveReport --model mistral
  • Robust PDF text extraction (handles scanned docs, tables, weird formatting)
  • Schema-driven extraction with validation
  • Falls back gracefully when models get creative
  • Silent error handling (no more spam in your terminal)

🎨 Features We're Actually Proud Of

🧠 Smart Schema Coercion

  • Handles German medical terms → English schema values
  • "physiologische Insuffizienz" → "Normal" (because we live in Germany)
  • Case-insensitive matching (because doctors don't follow style guides)
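
A minimal sketch of this kind of coercion, assuming a simple lookup table. Only the "physiologische Insuffizienz" → "Normal" mapping comes from this README; the other German terms, the grade set, and the function name `coerce_grade` are illustrative assumptions, not the mapping MOSAICX ships.

```python
from typing import Optional

ALLOWED = ("Normal", "Mild", "Moderate", "Severe")  # assumed schema values

# Lower-case German fragments mapped to schema values.
# Only the first entry is taken from the README; the rest are illustrative.
GERMAN_TO_GRADE = {
    "physiologisch": "Normal",
    "leichtgradig": "Mild",
    "mittelgradig": "Moderate",
    "hochgradig": "Severe",
}


def coerce_grade(raw: str) -> Optional[str]:
    """Map free text to an allowed schema value, or None if nothing matches."""
    key = " ".join(raw.split()).casefold()  # normalise whitespace and case
    # 1. Already a valid schema value, in any casing.
    for value in ALLOWED:
        if key == value.casefold():
            return value
    # 2. Contains a known German term.
    for fragment, value in GERMAN_TO_GRADE.items():
        if fragment in key:
            return value
    return None


print(coerce_grade("Physiologische  INSUFFIZIENZ"))  # Normal
print(coerce_grade("mild"))                          # Mild
print(coerce_grade("unklar"))                        # None
```

Returning `None` instead of guessing keeps unmatched values visible for manual review.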

🛡️ Bulletproof Error Handling

  • Multiple fallback strategies when models fail
  • JSON repair attempts (because LLMs sometimes get creative)
  • Graceful degradation (something is better than nothing)
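
As a rough idea of what a fallback chain for model output can look like, the sketch below tries progressively more forgiving parses. The function name `parse_llm_json` and the specific repair steps are assumptions for this example; MOSAICX's own strategies may differ.

```python
import json
import re
from typing import Optional


def parse_llm_json(raw: str) -> Optional[dict]:
    """Try progressively more forgiving ways to get a JSON object out of raw text."""
    candidates = [raw]
    # Fallback 1: keep only the outermost {...} span (drops chatter around it).
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        snippet = raw[start : end + 1]
        candidates.append(snippet)
        # Fallback 2: remove trailing commas before a closing brace or bracket.
        # Note: this naive regex would also touch such commas inside strings.
        candidates.append(re.sub(r",\s*([}\]])", r"\1", snippet))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None  # graceful degradation: the caller decides what to do next


messy = 'Sure! Here is the result: {"patient_id": "111111111", "age": 80,} Hope that helps.'
print(parse_llm_json(messy))  # {'patient_id': '111111111', 'age': 80}
```

Returning `None` rather than raising lets the caller fall back to a retry or a partially filled record.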

🎭 Clean Terminal Experience

✨ Schema Model: PatientValveReport ✨

📋 Extraction Results: PatientValveReport
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Field                    ┃ Extracted Value                 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ patient_id               │ 0022768653                      │
│ valve_condition          │ Mild insufficiency              │
└──────────────────────────┴─────────────────────────────────┘

🔐 Privacy-First Architecture

  • All processing happens on your hardware
  • No cloud APIs (your data never leaves your network)
  • GDPR compliant by design (because we're in Europe)
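
If you want a cheap sanity check that your model endpoint really is on the same machine, something like the following works. It is a sketch under assumptions: `is_loopback_endpoint` is a hypothetical helper, and the example URL uses Ollama's default port 11434. It only recognises loopback addresses, so an LLM server elsewhere on your hospital network would need its own allow-list.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_endpoint(url: str) -> bool:
    """True if the URL points at this machine (localhost or a loopback IP)."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name: treat as non-local rather than resolve it.
        return False


print(is_loopback_endpoint("http://localhost:11434"))      # True
print(is_loopback_endpoint("http://127.0.0.1:11434"))      # True
print(is_loopback_endpoint("https://api.example.com/v1"))  # False
```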

📊 Real-World Performance

What we've tested it on:

  • ✅ German echocardiography reports (our bread and butter)
  • ✅ Mixed-language medical documents (German/English clinical notes)
  • ✅ Scanned PDFs (with OCR quality ranging from "perfect" to "help me")
  • ✅ 50,000+ reports (and counting)

Models that work well:

  • 🥇 Mistral (fast, reliable, good with medical terminology)
  • 🥈 DeepSeek R1 70B (slower but handles complex cases)
  • 🥉 Llama 3 (solid baseline performance)

Honest accuracy rates:

  • 📊 ~85-90% field extraction accuracy on clean reports
  • 📊 ~70-80% on challenging scanned documents
  • 📊 ~95% when you fine-tune the schema descriptions

(These numbers are from actual usage, not cherry-picked benchmarks)
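
If you want to measure this on your own reports, one common definition of field-level accuracy is "correct fields divided by all fields in a hand-checked gold set". The sketch below implements that definition; it is an assumption about how such a number can be computed, not a description of how the figures above were obtained, and the sample records are made up.

```python
from typing import Dict, Sequence


def field_accuracy(
    predicted: Sequence[Dict[str, object]],
    gold: Sequence[Dict[str, object]],
) -> float:
    """Share of gold fields whose extracted value matches exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    total = correct = 0
    for pred, truth in zip(predicted, gold):
        for name, expected in truth.items():
            total += 1
            if pred.get(name) == expected:  # a missing field counts as wrong
                correct += 1
    return correct / total if total else 0.0


# Made-up records for illustration only.
gold = [
    {"patient_id": "A1", "mitral_valve_grade": "Normal", "tricuspid_valve_grade": "Mild"},
    {"patient_id": "B2", "mitral_valve_grade": "Moderate", "tricuspid_valve_grade": "Normal"},
]
predicted = [
    {"patient_id": "A1", "mitral_valve_grade": "Normal", "tricuspid_valve_grade": "Mild"},
    {"patient_id": "B2", "mitral_valve_grade": "Mild"},  # one wrong, one missing
]

print(f"{field_accuracy(predicted, gold):.1%}")  # 66.7%
```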


๐Ÿค Contributing (We Need Your Help)

What We'd Love Help With:

  • ๐ŸŒ More language support (French medical terms, anyone?)
  • ๐Ÿฅ New medical domains (pathology, radiology, lab reports)
  • ๐Ÿ› Bug reports (especially weird edge cases we haven't seen)
  • ๐Ÿ“š Documentation (making this more accessible to non-programmers)

How to Contribute:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-medical-nlp
  3. Test on real medical data (anonymized, please!)
  4. Submit a pull request with examples

We're academics, so we appreciate proper citations and detailed explanations of your improvements.


📜 License & Citation

License

AGPL-3.0 (GNU Affero General Public License v3.0)

Translation: You can use it, modify it, and distribute it freely. If you distribute a modified version, or let people use one over a network, you need to share your source code under the same license. Fair's fair.

Citation

If MOSAICX helps with your research, we'd appreciate a citation:

@software{mosaicx2025,
  title={MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author={Shiyam Sundar, Lalith Kumar and {DIGIT-X Lab Team}},
  year={2025},
  url={https://github.com/LalithShiyam/MOSAICX},
  institution={DIGIT-X Lab, LMU Radiology, LMU University Hospital}
}

👥 The Team Behind This

DIGIT-X Lab @ LMU University Hospital

  • 🧠 Lalith Kumar Shiyam Sundar, PhD - Lead Developer & Chief Coffee Consumer
  • 👥 DIGIT-X Lab Team - The people who actually test this stuff

Contact: lalith.shiyam@med.uni-muenchen.de
Lab: https://www.digit-x-lab.com
Location: Munich, Germany 🇩🇪


🙏 Acknowledgments

Thanks to:

  • ☕ Coffee (the real MVP of this project)
  • 🦙 Ollama team (for making local LLMs actually usable)
  • 📄 Docling team (for solving PDF extraction so we didn't have to)
  • 🐍 Pydantic team (for making data validation not terrible)
  • 🎨 Rich library (for making our terminals beautiful)
  • 🏥 Our clinical collaborators (for providing endless edge cases)
  • 🎓 LMU University Hospital (for letting us build cool stuff)

🔮 What's Next?

Roadmap:

  • 🌐 Web interface (for the point-and-click crowd)
  • 📊 Batch processing tools (because one PDF at a time is for amateurs)
  • 🤖 Fine-tuned medical models (when we get more GPU budget)
  • 🔌 API endpoints (for the developers among us)
  • Mobile app (just kidding, we're not monsters)

Help Us Prioritize:

Open an issue with your use case. We build what people actually need, not what sounds cool in academic papers.


💡 Final Thoughts

MOSAICX isn't perfect. It's not going to solve all your medical data problems overnight. But it's honest, it's practical, and it was built by people who actually use it every day.

We built this tool because we needed it, and we're sharing it because we think you might need it too. If it saves you even half the time it's saved us, we've done our job.

Happy extracting! 🚀


Built with ❤️, ☕, and occasional frustration at DIGIT-X Lab, Munich
