Medical cOmputational Suite for Advanced Intelligent eXtraction
DIGIT-X Lab · LMU Munich
Turn unstructured medical documents into validated, machine-readable JSON.
Runs locally — no PHI leaves your machine.
How It Works
flowchart LR
A["PDF / Image / Text"] --> B["Dual-Engine OCR"]
B --> C["DSPy Pipeline"]
C --> D["Validated JSON"]
style A fill:#B5A89A,stroke:#8a7e72,color:#fff
style B fill:#E87461,stroke:#c25a49,color:#fff
style C fill:#E87461,stroke:#c25a49,color:#fff
style D fill:#B5A89A,stroke:#8a7e72,color:#fff
MOSAICX ships with specialized pipelines for radiology and pathology reports, a generic extraction mode that adapts to any document, plus de-identification and patient timeline summarization. Every pipeline is a DSPy module -- meaning it can be optimized with labeled data for your specific use case.
Why MOSAICX? -- Fully local (no PHI leaves your machine), schema-driven (define exactly what to extract), dual-engine OCR (handles scans and handwriting), and DSPy-optimizable (improve accuracy with your own labeled data). One CLI for radiology, pathology, de-identification, and summarization.
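To make "schema-driven" and "validated JSON" concrete, here is a minimal standard-library sketch of checking an extracted record against a declared set of fields and types. The field names (`lvef_percent`, `impression`) and the `validate` helper are hypothetical and are not taken from MOSAICX; the tool's own validation may work quite differently.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
# These fields are illustrative only, not a MOSAICX template.
ECHO_SCHEMA = {
    "lvef_percent": (int, float),
    "impression": str,
}


def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


raw = '{"lvef_percent": 55, "impression": "Normal LV systolic function."}'
print(validate(json.loads(raw), ECHO_SCHEMA))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a record in a single pass.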
Quick Start
One-line install (Mac or Linux):
curl -fsSL https://raw.githubusercontent.com/DIGIT-X-Lab/MOSAICX/master/scripts/setup.sh | bash
Or install manually and let the setup wizard configure everything:
pip install mosaicx
mosaicx setup
Then extract structured data from a report:
mosaicx extract --document report.pdf --mode radiology
Check health anytime with mosaicx doctor. See the full Quickstart guide for details.
Install Extras
pip install 'mosaicx[mcp]' # + MCP server for AI agents
pip install 'mosaicx[query]' # + fast tabular query stack (duckdb + polars)
pip install 'mosaicx[all]' # everything
Developer Fast Loop (Mac + vLLM-MLX, 120B)
# 1) serve your local model (already downloaded)
vllm-mlx serve mlx-community/gpt-oss-120b-4bit --port 8000
# 2) point MOSAICX to that server
export MOSAICX_LM=openai/mlx-community/gpt-oss-120b-4bit
export MOSAICX_API_BASE=http://127.0.0.1:8000/v1
export MOSAICX_API_KEY=dummy
# 3) verify the endpoint
curl -sS --max-time 5 http://127.0.0.1:8000/v1/models
# 4) run extraction + claim verify + query
mosaicx extract --document report.pdf --mode radiology -o output.json
mosaicx verify --document report.pdf --claim "patient BP is 128/82" --level thorough
mosaicx query --document report.pdf --chat --trace
[!TIP] Not on Apple Silicon? Use Ollama, vLLM, or any OpenAI-compatible server. See the Getting Started guide for all backend options.
[!TIP] Want the fastest first success? Follow docs/quickstart.md to run extract, verify, and query end-to-end in ~10 minutes.
What You Can Do
| Capability | Commands | Guide |
|---|---|---|
| Extract structured data from clinical documents | mosaicx extract | Pipelines |
| Create and manage templates for custom extraction targets | mosaicx template create / list / refine | Schemas & Templates |
| Verify claims and outputs against source evidence | mosaicx verify | CLI Reference |
| Query sources conversationally with citations | mosaicx query | CLI Reference |
| De-identify reports (LLM + regex belt-and-suspenders) | mosaicx deidentify | CLI Reference |
| Summarize patient timelines across multiple reports | mosaicx summarize | CLI Reference |
| Optimize pipelines with labeled data (DSPy) | mosaicx optimize, mosaicx eval | Optimization |
| Extend with custom pipelines, MCP server, Python SDK | mosaicx pipeline new, mosaicx mcp serve | Developer Guide |
Run any command with --help for full options. Complete reference: docs/cli-reference.md
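The de-identification row above describes an LLM pass backed by a regex pass. As a rough picture of what a regex layer does, here is a small sketch. The three patterns and the `scrub` helper are assumptions made for illustration; they are not MOSAICX's actual rules and cover only a tiny slice of real PHI.

```python
import re

# Illustrative patterns only; real PHI coverage needs far more than this.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}


def scrub(text: str) -> str:
    """Replace every pattern match with a bracketed placeholder such as [DATE]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(scrub("Seen on 2024-03-15, MRN: 00123456, callback 555-123-4567."))
# Seen on [DATE], [MRN], callback [PHONE].
```

A deterministic pass like this catches rigidly formatted identifiers that a model might occasionally miss, which is the usual motivation for combining the two approaches.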
Recipes
# Radiology report -> structured JSON
mosaicx extract --document ct_chest.pdf --mode radiology
# Template-driven extraction (define your own fields)
mosaicx template create --describe "echo report with LVEF, valve grades, impression"
mosaicx extract --document echo.pdf --template EchoReport
# Batch-process a folder of reports
mosaicx extract --dir ./reports --output-dir ./structured --mode radiology --format jsonl
# De-identify a clinical note
mosaicx deidentify --document note.txt
# Patient timeline from multiple reports
mosaicx summarize --dir ./patient_001/ --patient P001
See the full CLI Reference for every flag and option.
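The batch recipe above writes JSONL, a format with one JSON object per line. A minimal reader might look like the sketch below. The `read_jsonl` helper and the sample keys are hypothetical, and nothing here assumes which fields MOSAICX actually emits; see the Pipelines guide for the real record layout.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Parse a JSONL file: one JSON object per non-empty line."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return records


# Demo with a throwaway file; the keys are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text('{"id": 1}\n\n{"id": 2}\n', encoding="utf-8")
    print(len(read_jsonl(sample)))  # prints 2
```

Reporting the file and line number on a parse failure helps locate a single bad record in a large batch.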
Privacy
[!IMPORTANT] Data stays on your machine. MOSAICX runs against a local inference server by default -- no external API calls, no cloud uploads. For HIPAA/GDPR compliance guidance and cloud backend caveats, see Configuration.
LLM Backends
MOSAICX talks to any OpenAI-compatible endpoint via DSPy + litellm. Pick the backend that fits your hardware -- override with env vars.
| Backend | Port | Example |
|---|---|---|
| Ollama | 11434 | Works out-of-the-box, no config needed |
| llama.cpp | 8080 | llama-server -m model.gguf --port 8080 |
| vLLM | 8000 | vllm serve gpt-oss:120b |
| SGLang | 30000 | python -m sglang.launch_server --model-path gpt-oss:120b |
| vLLM-MLX | 8000 | vllm-mlx serve mlx-community/gpt-oss-20b-MXFP4-Q8 (Apple Silicon) |
export MOSAICX_LM=openai/gpt-oss:120b
export MOSAICX_API_BASE=http://localhost:8000/v1 # point at your server
export MOSAICX_API_KEY=dummy # or your real key for cloud APIs
SSH tunneling, vLLM-MLX setup, batch tuning, and benchmarking: docs/configuration.md
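Before running a pipeline it can help to confirm that the server behind `MOSAICX_API_BASE` is reachable and serves the expected model. OpenAI-compatible servers list their models at `GET <base>/models`, returning a JSON object whose `data` array holds entries with an `id`. The sketch below uses only the standard library; the helper names are hypothetical and not part of MOSAICX.

```python
import json
import urllib.request


def models_url(api_base: str) -> str:
    """Build the model-listing URL from a base such as http://localhost:8000/v1."""
    return api_base.rstrip("/") + "/models"


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style {"data": [{"id": ...}]} response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(api_base: str, api_key: str, timeout: float = 5.0) -> list[str]:
    """Query the server; raises urllib.error.URLError if it is unreachable."""
    request = urllib.request.Request(
        models_url(api_base),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return model_ids(json.load(response))


# Offline demonstration with a canned response (no server required).
sample = {"object": "list", "data": [{"id": "gpt-oss:120b", "object": "model"}]}
print(models_url("http://localhost:8000/v1/"))  # http://localhost:8000/v1/models
print(model_ids(sample))                        # ['gpt-oss:120b']
```

With a server running, `list_served_models("http://localhost:8000/v1", "dummy")` performs the same check as the `curl` call in the Developer Fast Loop.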
OCR Engines
| Engine | Approach | Best for |
|---|---|---|
| Surya | Layout detection + recognition | Clean printed text, fast |
| Chandra | Vision-Language Model (Qwen3-VL 9B) | Handwriting, complex layouts, tables |
By default both engines run in parallel, score each page, and pick the best result. Override with MOSAICX_OCR_ENGINE=surya or chandra.
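The per-page selection described above can be pictured with the sketch below. How MOSAICX scores a page is not stated here, so the `(text, score)` return shape, the `best_per_page` helper, and the stand-in engines with fixed scores are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# An "engine" here is any function mapping a page to (text, score).
# Both the signature and the score are assumptions made for this sketch.
Engine = Callable[[str], tuple[str, float]]


def best_per_page(pages: list[str], engines: dict[str, Engine]) -> list[tuple[str, str]]:
    """For each page, run all engines concurrently and keep the top-scoring text."""
    results = []
    with ThreadPoolExecutor() as pool:
        for page in pages:
            futures = {name: pool.submit(fn, page) for name, fn in engines.items()}
            scored = {name: future.result() for name, future in futures.items()}
            winner = max(scored, key=lambda name: scored[name][1])
            results.append((winner, scored[winner][0]))
    return results


# Stand-in engines with fixed scores, purely for demonstration.
def fake_surya(page: str) -> tuple[str, float]:
    return f"surya:{page}", 0.9 if "printed" in page else 0.4


def fake_chandra(page: str) -> tuple[str, float]:
    return f"chandra:{page}", 0.7


print(best_per_page(["printed page", "handwritten page"],
                    {"surya": fake_surya, "chandra": fake_chandra}))
# [('surya', 'surya:printed page'), ('chandra', 'chandra:handwritten page')]
```

Choosing per page rather than per document lets a mixed file, such as a printed report with a handwritten addendum, use a different engine for each part.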
Configuration
# Essential vars -- point at your local server
export MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8 # model name
export MOSAICX_API_BASE=http://localhost:8000/v1 # server URL
export MOSAICX_API_KEY=dummy # or real key for cloud
# View active config
mosaicx config show
Full variable reference, .env file setup, and backend scenarios: docs/configuration.md
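The `MOSAICX_LM` values shown above appear to follow the `provider/model` naming used by litellm, where the text before the first slash selects the provider and the remainder is the model name (which may itself contain slashes). The sketch below illustrates that reading; `split_model` and `load_settings` are hypothetical helpers, not MOSAICX functions.

```python
import os


def split_model(lm: str) -> tuple[str, str]:
    """Split 'provider/model' on the first slash only; the model part may contain slashes."""
    provider, sep, model = lm.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {lm!r}")
    return provider, model


def load_settings(env=os.environ) -> dict:
    """Collect the three variables shown above; missing ones come back as None."""
    names = ("MOSAICX_LM", "MOSAICX_API_BASE", "MOSAICX_API_KEY")
    return {name: env.get(name) for name in names}


print(split_model("openai/mlx-community/gpt-oss-20b-MXFP4-Q8"))
# ('openai', 'mlx-community/gpt-oss-20b-MXFP4-Q8')
```

Under this reading, the `openai/` prefix selects the OpenAI-compatible protocol rather than OpenAI's hosted service, which is why it pairs with a local `MOSAICX_API_BASE`.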
Documentation
| Guide | Description |
|---|---|
| Quickstart | Fast setup and first successful run in ~10 minutes |
| Getting Started | Install, first extraction, basics |
| Verify Guide | Truth/adjudication workflows for claims and extraction output |
| Query Guide | Grounded multi-turn querying with evidence and confidence |
| Troubleshooting | Debug slow query, wrong stats, fallback, and runtime issues |
| Production Checklist | Deploy with reproducibility, gating, and auditability controls |
| CLI Reference | Every command, every flag, examples |
| Pipelines | Pipeline inputs/outputs, JSONL formats |
| Schemas & Templates | Create and manage extraction schemas |
| Optimization | Improve accuracy with DSPy optimizers |
| Configuration | Env vars, backends, OCR, export formats |
| MCP Server | AI agent integration via MCP |
| Developer Guide | Custom pipelines, Python SDK |
| Architecture | System design, key decisions |
Development
git clone https://github.com/DIGIT-X-Lab/MOSAICX.git
cd MOSAICX
pip install -e ".[dev]" # or: uv sync --group dev
pytest tests/ -q
See Developer Guide for custom pipelines and the Python SDK.
Citation
@software{mosaicx2025,
title = {MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
author = {Sundar, Lalith Kumar Shiyam and DIGIT-X Lab},
year = {2025},
url = {https://github.com/DIGIT-X-Lab/MOSAICX},
doi = {10.5281/zenodo.17601890}
}
License
Apache 2.0 -- see LICENSE.
Contact
Research: lalith.shiyam@med.uni-muenchen.de | Commercial: lalith@zenta.solutions | Issues: github.com/DIGIT-X-Lab/MOSAICX/issues
File details
Details for the file mosaicx-2.0.0.tar.gz.
File metadata
- Download URL: mosaicx-2.0.0.tar.gz
- Upload date:
- Size: 3.0 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 15ad079becdef6d5e0b848cf3b68beec69961a37efce56e790a74fddfe79f795 |
| MD5 | 5ec1daf665c72e4a86336cb1e2048e69 |
| BLAKE2b-256 | 8c958d260e0a58df8d9813cfc2ac3faf19a3097dd5b55ae04f9bf855e0a23c40 |
File details
Details for the file mosaicx-2.0.0-py3-none-any.whl.
File metadata
- Download URL: mosaicx-2.0.0-py3-none-any.whl
- Upload date:
- Size: 285.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab675220456d48483aba92d1e9e687e6b8596a703461e633e482f181ecdb7849 |
| MD5 | e0ca49c937f9dc1d79618dfa04a8da62 |
| BLAKE2b-256 | 2b7b235fe8fa9374f6f50883e2246f08213313061c2121638fdd9db81d9705a7 |