Medical cOmputational Suite for Advanced Intelligent eXtraction

MOSAICX

DIGIT-X Lab · LMU Munich
Turn unstructured medical documents into validated, machine-readable JSON.
Runs locally — no PHI leaves your machine.


How It Works

flowchart LR
    A["PDF / Image / Text"] --> B["Dual-Engine OCR"]
    B --> C["DSPy Pipeline"]
    C --> D["Validated JSON"]

    style A fill:#B5A89A,stroke:#8a7e72,color:#fff
    style B fill:#E87461,stroke:#c25a49,color:#fff
    style C fill:#E87461,stroke:#c25a49,color:#fff
    style D fill:#B5A89A,stroke:#8a7e72,color:#fff

MOSAICX ships with specialized pipelines for radiology and pathology reports, a generic extraction mode that adapts to any document, plus de-identification and patient timeline summarization. Every pipeline is a DSPy module -- meaning it can be optimized with labeled data for your specific use case.

Why MOSAICX? -- Fully local (no PHI leaves your machine), schema-driven (define exactly what to extract), dual-engine OCR (handles scans and handwriting), and DSPy-optimizable (improve accuracy with your own labeled data). One CLI for radiology, pathology, de-identification, and summarization.

Quick Start

# Install MOSAICX
pip install mosaicx               # or: uv add mosaicx / pipx install mosaicx

# Start a local LLM (Apple Silicon via vLLM-MLX)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git
vllm-mlx serve mlx-community/gpt-oss-20b-MXFP4-Q8 --port 8000

# Point MOSAICX at it
export MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8
export MOSAICX_API_BASE=http://localhost:8000/v1

# Extract structured data from a report
mosaicx extract --document report.pdf --mode radiology

[!TIP] Not on Apple Silicon? Use Ollama, vLLM, or any OpenAI-compatible server. See the Getting Started guide for all backend options.

What You Can Do

| Capability | Commands | Guide |
| --- | --- | --- |
| Extract structured data from clinical documents | mosaicx extract, mosaicx batch | Pipelines |
| Create and manage schemas for custom extraction targets | mosaicx schema generate / list / refine | Schemas & Templates |
| De-identify reports (LLM + regex belt-and-suspenders) | mosaicx deidentify | CLI Reference |
| Summarize patient timelines across multiple reports | mosaicx summarize | CLI Reference |
| Optimize pipelines with labeled data (DSPy) | mosaicx optimize, mosaicx eval | Optimization |
| Extend with custom pipelines, MCP server, Python SDK | mosaicx pipeline new, mosaicx mcp serve | Developer Guide |

Run any command with --help for full options. Complete reference: docs/cli-reference.md

Recipes

# Radiology report -> structured JSON
mosaicx extract --document ct_chest.pdf --mode radiology

# Schema-driven extraction (define your own fields)
mosaicx schema generate --description "echo report with LVEF, valve grades, impression"
mosaicx extract --document echo.pdf --schema EchoReport

# Batch-process a folder of reports
mosaicx batch --input-dir ./reports --output-dir ./structured --mode radiology --format jsonl

# De-identify a clinical note
mosaicx deidentify --document note.txt

# Patient timeline from multiple reports
mosaicx summarize --dir ./patient_001/ --patient P001

See the full CLI Reference for every flag and option.

Privacy

[!IMPORTANT] Data stays on your machine. MOSAICX runs against a local inference server by default -- no external API calls, no cloud uploads. For HIPAA/GDPR compliance guidance and cloud backend caveats, see Configuration.
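
Because the privacy guarantee depends on the endpoint being local, it can help to check `MOSAICX_API_BASE` before processing real reports. The sketch below is one way to do that in plain POSIX shell. `is_local_endpoint` is a hypothetical helper, not part of MOSAICX, and it only inspects the URL's host.

```shell
# Hypothetical helper (not part of MOSAICX): succeed only if the URL's host
# is a loopback address, i.e. requests are sent to this machine.
is_local_endpoint() {
    url=$1
    hostport=${url#*://}          # drop the scheme:  "localhost:8000/v1"
    hostport=${hostport%%/*}      # drop the path:    "localhost:8000"
    hostport=${hostport##*@}      # drop any userinfo before the host
    case $hostport in
        \[*) host=${hostport%%\]*}; host=${host#\[} ;;  # bracketed IPv6, e.g. [::1]:8000
        *)   host=${hostport%%:*} ;;                    # hostname or IPv4
    esac
    case $host in
        localhost|127.0.0.1|::1) return 0 ;;
        *) return 1 ;;
    esac
}

if [ -z "${MOSAICX_API_BASE:-}" ]; then
    echo "MOSAICX_API_BASE is not set; nothing to check"
elif is_local_endpoint "$MOSAICX_API_BASE"; then
    echo "endpoint is local: $MOSAICX_API_BASE"
else
    echo "WARNING: $MOSAICX_API_BASE is not a loopback address" >&2
fi
```

A loopback URL does not by itself prove that data stays local: a port forwarded over an SSH tunnel also appears as `localhost`, so treat this as a quick sanity check rather than a compliance control.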

LLM Backends

MOSAICX talks to any OpenAI-compatible endpoint via DSPy + litellm. Pick the backend that fits your hardware and point MOSAICX at it with the environment variables below.

| Backend | Port | Example |
| --- | --- | --- |
| Ollama | 11434 | Works out-of-the-box, no config needed |
| llama.cpp | 8080 | llama-server -m model.gguf --port 8080 |
| vLLM | 8000 | vllm serve gpt-oss:120b |
| SGLang | 30000 | python -m sglang.launch_server --model-path gpt-oss:120b |
| vLLM-MLX | 8000 | vllm-mlx serve mlx-community/gpt-oss-20b-MXFP4-Q8 (Apple Silicon) |

export MOSAICX_LM=openai/gpt-oss:120b
export MOSAICX_API_BASE=http://localhost:8000/v1   # point at your server
export MOSAICX_API_KEY=dummy                       # or your real key for cloud APIs

SSH tunneling, vLLM-MLX setup, batch tuning, and benchmarking: docs/configuration.md
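
When switching between backends, the base URL is the part that changes most often. As a rough sketch, the ports listed above can be wrapped in a small shell function. `backend_url` is a hypothetical helper, and it assumes the server runs on localhost and serves its OpenAI-compatible API under `/v1`.

```shell
# Hypothetical helper: map a backend name to a base URL using the ports
# listed in this section. Assumes localhost and an API served under /v1.
backend_url() {
    case $1 in
        ollama)         port=11434 ;;
        llama.cpp)      port=8080 ;;
        vllm|vllm-mlx)  port=8000 ;;
        sglang)         port=30000 ;;
        *) echo "unknown backend: $1" >&2; return 1 ;;
    esac
    echo "http://localhost:${port}/v1"
}

# Example: point MOSAICX at a local SGLang server.
export MOSAICX_API_BASE="$(backend_url sglang)"
echo "$MOSAICX_API_BASE"
```

If a server was started on a non-default port, set `MOSAICX_API_BASE` by hand instead.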

OCR Engines

| Engine | Approach | Best for |
| --- | --- | --- |
| Surya | Layout detection + recognition | Clean printed text, fast |
| Chandra | Vision-Language Model (Qwen3-VL 9B) | Handwriting, complex layouts, tables |

By default, both engines run in parallel; each page is scored and the better result is kept. To force a single engine, set MOSAICX_OCR_ENGINE=surya or MOSAICX_OCR_ENGINE=chandra.
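
A typo in the engine name is easy to miss, so a thin wrapper that validates the value before exporting it may be useful. This is only a sketch: `use_ocr_engine` is a hypothetical helper, and `auto` is this sketch's own label for unsetting the variable. It assumes that leaving `MOSAICX_OCR_ENGINE` unset restores the default dual-engine behavior.

```shell
# Hypothetical wrapper: validate the engine name before exporting it.
# "auto" is this sketch's own label, not a documented MOSAICX value; it
# unsets the variable on the assumption that unset means "run both engines".
use_ocr_engine() {
    case $1 in
        surya|chandra) export MOSAICX_OCR_ENGINE="$1" ;;
        auto)          unset MOSAICX_OCR_ENGINE ;;
        *) echo "unknown OCR engine: $1 (expected surya, chandra, or auto)" >&2
           return 1 ;;
    esac
}

use_ocr_engine surya      # e.g. clean printed text, where Surya is listed as fast
echo "OCR engine: ${MOSAICX_OCR_ENGINE:-both (default)}"
```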

Configuration

# Essential vars -- point at your local server
export MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8   # model name
export MOSAICX_API_BASE=http://localhost:8000/v1                # server URL
export MOSAICX_API_KEY=dummy                                    # or real key for cloud

# View active config
mosaicx config show

Full variable reference, .env file setup, and backend scenarios: docs/configuration.md
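
For a setup that survives new terminal sessions, the same three variables can live in a file and be loaded into the shell. The sketch below uses only standard shell features and the example values from this section. The temporary file location is for illustration, and how MOSAICX itself reads `.env` files is covered in docs/configuration.md, not here.

```shell
# Write the three essential variables to a .env-style file. The values are
# the examples from this section; the temp directory is for illustration only.
envfile="$(mktemp -d)/.env"
cat > "$envfile" <<'EOF'
MOSAICX_LM=openai/mlx-community/gpt-oss-20b-MXFP4-Q8
MOSAICX_API_BASE=http://localhost:8000/v1
MOSAICX_API_KEY=dummy
EOF

# Load the file into the current shell. `set -a` marks every variable
# assigned while it is active for export, so plain KEY=value lines become
# environment variables visible to child processes.
set -a
. "$envfile"
set +a

echo "$MOSAICX_LM"
```

Running `mosaicx config show` afterwards is a quick way to confirm the values were picked up.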

Documentation

| Guide | Description |
| --- | --- |
| Getting Started | Install, first extraction, basics |
| CLI Reference | Every command, every flag, examples |
| Pipelines | Pipeline inputs/outputs, JSONL formats |
| Schemas & Templates | Create and manage extraction schemas |
| Optimization | Improve accuracy with DSPy optimizers |
| Configuration | Env vars, backends, OCR, export formats |
| MCP Server | AI agent integration via MCP |
| Developer Guide | Custom pipelines, Python SDK |
| Architecture | System design, key decisions |

Development

git clone https://github.com/DIGIT-X-Lab/MOSAICX.git
cd MOSAICX
pip install -e ".[dev]"          # or: uv sync --group dev
pytest tests/ -q

See Developer Guide for custom pipelines and the Python SDK.

Citation

@software{mosaicx2025,
  title   = {MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author  = {Sundar, Lalith Kumar Shiyam and DIGIT-X Lab},
  year    = {2025},
  url     = {https://github.com/DIGIT-X-Lab/MOSAICX},
  doi     = {10.5281/zenodo.17601890}
}

License

Apache 2.0 -- see LICENSE.

Contact

Research: lalith.shiyam@med.uni-muenchen.de | Commercial: lalith@zenta.solutions | Issues: github.com/DIGIT-X-Lab/MOSAICX/issues
