Medical cOmputational Suite for Advanced Intelligent eXtraction
DIGIT-X Lab · LMU Munich
Turn unstructured medical documents into validated, machine-readable JSON.
Runs locally: no PHI leaves your machine.
How It Works
```mermaid
flowchart LR
    A["PDF / Image / Text"] --> B["Dual-Engine OCR"]
    B --> C["DSPy Pipeline"]
    C --> D["Validated JSON"]
    style A fill:#B5A89A,stroke:#8a7e72,color:#fff
    style B fill:#E87461,stroke:#c25a49,color:#fff
    style C fill:#E87461,stroke:#c25a49,color:#fff
    style D fill:#B5A89A,stroke:#8a7e72,color:#fff
```
MOSAICX ships with specialized pipelines for radiology and pathology reports, a generic extraction mode that adapts to any document, plus de-identification and patient timeline summarization. Every pipeline is a DSPy module -- meaning it can be optimized with labeled data for your specific use case.
Why MOSAICX? It is fully local (no PHI leaves your machine), schema-driven (you define exactly what to extract), backed by dual-engine OCR (handles scans and handwriting), and DSPy-optimizable (accuracy improves with your own labeled data). One CLI covers radiology, pathology, de-identification, and summarization.
Quick Start
```bash
# Install MOSAICX
pip install mosaicx   # or: uv add mosaicx / pipx install mosaicx

# Start a local LLM (Apple Silicon via vLLM-MLX)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git
vllm-mlx serve mlx-community/gpt-oss-20b-MXFP4-Q8 --port 8000

# Point MOSAICX at it
export MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8
export MOSAICX_API_BASE=http://localhost:8000/v1

# Extract structured data from a report
mosaicx extract --document report.pdf --mode radiology
```
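The same steps can be driven from a Python script or notebook by setting the two environment variables and shelling out to the CLI. The sketch below assumes only the variables and flags shown above; the helper names `mosaicx_env`, `extract_command`, and `run_extract` are hypothetical and not part of MOSAICX, and nothing is assumed about how the CLI reports its results.

```python
import os
import subprocess


def mosaicx_env(model, api_base, base_env=None):
    """Return a copy of the environment with the two MOSAICX variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MOSAICX_LM"] = model
    env["MOSAICX_API_BASE"] = api_base
    return env


def extract_command(document, mode):
    """Build the argv for `mosaicx extract` using only the flags shown above."""
    return ["mosaicx", "extract", "--document", document, "--mode", mode]


def run_extract(document, mode, env):
    """Invoke the CLI; requires mosaicx on PATH and a reachable local server."""
    return subprocess.run(extract_command(document, mode), env=env, check=True)
```

Passing the environment explicitly, rather than mutating `os.environ`, keeps the settings scoped to the one subprocess call.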
> [!TIP]
> Not on Apple Silicon? Use Ollama, vLLM, or any OpenAI-compatible server. See the Getting Started guide for all backend options.
What You Can Do
| Capability | Commands | Guide |
|---|---|---|
| Extract structured data from clinical documents | mosaicx extract, mosaicx batch | Pipelines |
| Create and manage schemas for custom extraction targets | mosaicx schema generate / list / refine | Schemas & Templates |
| De-identify reports (LLM + regex belt-and-suspenders) | mosaicx deidentify | CLI Reference |
| Summarize patient timelines across multiple reports | mosaicx summarize | CLI Reference |
| Optimize pipelines with labeled data (DSPy) | mosaicx optimize, mosaicx eval | Optimization |
| Extend with custom pipelines, MCP server, Python SDK | mosaicx pipeline new, mosaicx mcp serve | Developer Guide |
Run any command with --help for full options. Complete reference: docs/cli-reference.md
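The "LLM + regex" pairing in the de-identification row can be pictured as two passes, where the regex pass catches rigidly formatted identifiers that a model might overlook. The sketch below shows only such a regex pass. The patterns and the `regex_scrub` name are illustrative assumptions, not MOSAICX's actual rules, and real PHI coverage needs far more than three patterns.

```python
import re

# Illustrative patterns only; real rules are broader and locale-specific.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}


def regex_scrub(text):
    """Replace every pattern match with a bracketed placeholder label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(regex_scrub("MRN: 12345678 seen 2024-03-01, contact a.b@example.org"))
# prints: [MRN] seen [DATE], contact [EMAIL]
```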
Recipes
```bash
# Radiology report -> structured JSON
mosaicx extract --document ct_chest.pdf --mode radiology

# Schema-driven extraction (define your own fields)
mosaicx schema generate --description "echo report with LVEF, valve grades, impression"
mosaicx extract --document echo.pdf --schema EchoReport

# Batch-process a folder of reports
mosaicx batch --input-dir ./reports --output-dir ./structured --mode radiology --format jsonl

# De-identify a clinical note
mosaicx deidentify --document note.txt

# Patient timeline from multiple reports
mosaicx summarize --dir ./patient_001/ --patient P001
```
See the full CLI Reference for every flag and option.
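Schema-driven extraction implies that every output record should carry the declared fields with the expected types. A downstream sanity check for the echo recipe might look like the sketch below. The field names (`lvef`, `valve_grades`, `impression`) and their types are guesses derived from the schema description above; the actual structure MOSAICX generates for `EchoReport` may differ.

```python
import json

# Hypothetical field spec for the echo example: name -> accepted Python types.
ECHO_FIELDS = {
    "lvef": (int, float),
    "valve_grades": (dict,),
    "impression": (str,),
}


def validate_record(raw, fields=ECHO_FIELDS):
    """Return a list of problems; an empty list means the record passes."""
    record = json.loads(raw)
    problems = []
    for name, types in fields.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int, so reject it explicitly.
        elif isinstance(record[name], bool) or not isinstance(record[name], types):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log every defect in a batch run.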
Privacy
> [!IMPORTANT]
> Data stays on your machine. MOSAICX runs against a local inference server by default -- no external API calls, no cloud uploads. For HIPAA/GDPR compliance guidance and cloud backend caveats, see Configuration.
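Because the endpoint is configurable, the local-only guarantee holds only while `MOSAICX_API_BASE` points at this machine. A pre-flight check along the following lines could guard against a misconfigured URL. This is a sketch using the Python standard library; `is_loopback_endpoint` is a hypothetical helper, not a MOSAICX function.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_endpoint(api_base):
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(api_base).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name is treated as non-local.
        return False
```

A loopback address that is forwarded through an SSH tunnel still sends data to the remote host, so this check is necessary but not sufficient.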
LLM Backends
MOSAICX talks to any OpenAI-compatible endpoint via DSPy + litellm. Pick the backend that fits your hardware -- override with env vars.
| Backend | Port | Example |
|---|---|---|
| Ollama | 11434 | Works out-of-the-box, no config needed |
| llama.cpp | 8080 | llama-server -m model.gguf --port 8080 |
| vLLM | 8000 | vllm serve gpt-oss:120b |
| SGLang | 30000 | python -m sglang.launch_server --model-path gpt-oss:120b |
| vLLM-MLX | 8000 | vllm-mlx serve mlx-community/gpt-oss-20b-MXFP4-Q8 (Apple Silicon) |
```bash
export MOSAICX_LM=openai/gpt-oss:120b
export MOSAICX_API_BASE=http://localhost:8000/v1   # point at your server
export MOSAICX_API_KEY=dummy                       # or your real key for cloud APIs
```
SSH tunneling, vLLM-MLX setup, batch tuning, and benchmarking: docs/configuration.md
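When switching between backends, the only part of `MOSAICX_API_BASE` that usually changes is the port. The sketch below derives the URL from the default ports in the table above; the `api_base` helper is hypothetical, and it assumes each server exposes its OpenAI-compatible routes under `/v1`, as in the examples.

```python
# Default ports copied from the backend table above.
DEFAULT_PORTS = {
    "ollama": 11434,
    "llama.cpp": 8080,
    "vllm": 8000,
    "sglang": 30000,
    "vllm-mlx": 8000,
}


def api_base(backend, host="localhost"):
    """Build an OpenAI-compatible base URL for a backend on its default port."""
    port = DEFAULT_PORTS[backend.lower()]
    return f"http://{host}:{port}/v1"


print(api_base("SGLang"))
# prints: http://localhost:30000/v1
```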
OCR Engines
| Engine | Approach | Best for |
|---|---|---|
| Surya | Layout detection + recognition | Clean printed text, fast |
| Chandra | Vision-Language Model (Qwen3-VL 9B) | Handwriting, complex layouts, tables |
By default, both engines run in parallel on every page; each result is scored and the better one is kept. To force a single engine, set MOSAICX_OCR_ENGINE=surya or MOSAICX_OCR_ENGINE=chandra.
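The run-both-and-keep-the-best pattern can be sketched with a thread pool. Everything below is a stand-in: the engine functions are fakes, and the length-based score is a placeholder, since the scoring heuristic MOSAICX applies is not described here.

```python
from concurrent.futures import ThreadPoolExecutor


def pick_best_page(page, engines, score):
    """Run every engine on one page concurrently and keep the top-scoring text."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, page) for name, fn in engines.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    best = max(results, key=lambda name: score(results[name]))
    return best, results[best]


# Fake engines and a placeholder score, for illustration only.
def fake_surya(page):
    return page.upper()


def fake_chandra(page):
    return page.upper() + " (table recovered)"


def length_score(text):
    return len(text)


engines = {"surya": fake_surya, "chandra": fake_chandra}
print(pick_best_page("lvef 55%", engines, length_score))
# prints: ('chandra', 'LVEF 55% (table recovered)')
```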
Configuration
```bash
# Essential vars -- point at your local server
export MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8   # model name
export MOSAICX_API_BASE=http://localhost:8000/v1              # server URL
export MOSAICX_API_KEY=dummy                                  # or real key for cloud

# View active config
mosaicx config show
```
Full variable reference, .env file setup, and backend scenarios: docs/configuration.md
Documentation
| Guide | Description |
|---|---|
| Getting Started | Install, first extraction, basics |
| CLI Reference | Every command, every flag, examples |
| Pipelines | Pipeline inputs/outputs, JSONL formats |
| Schemas & Templates | Create and manage extraction schemas |
| Optimization | Improve accuracy with DSPy optimizers |
| Configuration | Env vars, backends, OCR, export formats |
| MCP Server | AI agent integration via MCP |
| Developer Guide | Custom pipelines, Python SDK |
| Architecture | System design, key decisions |
Development
```bash
git clone https://github.com/DIGIT-X-Lab/MOSAICX.git
cd MOSAICX
pip install -e ".[dev]"   # or: uv sync --group dev
pytest tests/ -q
```
See Developer Guide for custom pipelines and the Python SDK.
Citation
```bibtex
@software{mosaicx2025,
  title  = {MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author = {Sundar, Lalith Kumar Shiyam and DIGIT-X Lab},
  year   = {2025},
  url    = {https://github.com/DIGIT-X-Lab/MOSAICX},
  doi    = {10.5281/zenodo.17601890}
}
```
License
Apache 2.0 -- see LICENSE.
Contact
Research: lalith.shiyam@med.uni-muenchen.de | Commercial: lalith@zenta.solutions | Issues: github.com/DIGIT-X-Lab/MOSAICX/issues