Infrastructure for AI-assisted clinical research with EHR datasets

M4: A Toolbox for LLMs on Clinical Data

Query clinical datasets with natural language through Claude, Cursor, or any MCP client

M4 is an infrastructure layer for multimodal EHR data that provides LLM agents with a unified toolbox for querying clinical datasets. It supports tabular data and clinical notes, dynamically selecting tools by modality to query MIMIC-IV, eICU, and custom datasets through a single natural-language interface.

M4 is a fork of the M3 project and would not be possible without it 🫶 Please cite their work when using M4!

Quickstart (3 steps)

1. Install uv

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows (PowerShell):

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

2. Initialize M4

mkdir my-research && cd my-research
uv init && uv add m4-infra
uv run m4 init mimic-iv-demo

This downloads the free MIMIC-IV demo dataset (~16MB) and sets up a local DuckDB database.

3. Connect your AI client

Claude Desktop:

uv run m4 config claude --quick

Other clients (Cursor, LibreChat, etc.):

uv run m4 config --quick

Copy the generated JSON into your client's MCP settings, restart, and start asking questions!

Alternative setup options
  • Without uv: install the package directly with pip install m4-infra.

  • With Docker: see docs/DEVELOPMENT.md.

Code Execution

For complex analysis that goes beyond simple queries, M4 provides a Python API that returns Python data types instead of formatted strings (e.g. pd.DataFrame for SQL queries). This transforms M4 from a query tool into a complete clinical data analysis environment.

from m4 import set_dataset, execute_query, get_schema

set_dataset("mimic-iv")

# Get schema as a dict
schema = get_schema()
print(schema['tables'])  # ['admissions', 'diagnoses_icd', ...]

# Query returns a pandas DataFrame.
# diagnoses_icd stores diagnoses in the icd_code column.
df = execute_query("""
    SELECT icd_code, COUNT(*) AS n
    FROM diagnoses_icd
    GROUP BY icd_code
    ORDER BY n DESC
    LIMIT 10
""")

# Use full pandas power: filter, join, compute statistics
# (plotting requires matplotlib)
df[df['n'] > 100].plot(kind='bar', x='icd_code', y='n')

The API uses the same tools as the MCP server, so behavior is consistent. But instead of parsing text, you get DataFrames you can immediately analyze, visualize, or feed into downstream pipelines.

When to use code execution:

  • Multi-step analyses where each query informs the next
  • Large result sets (thousands of rows) that shouldn't flood your context
  • Statistical computations, survival analysis, cohort characterization
  • Building reproducible analysis notebooks

See Code Execution Guide for the full API reference.
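
As a rough illustration of a multi-step analysis where each query informs the next, the sketch below runs two dependent queries and prints only a one-line summary. It does not use M4 at all: an in-memory SQLite table with made-up rows stands in for the dataset, and the helper names most_common_code and cohort_size are hypothetical. With M4, the same pattern would presumably go through execute_query and DataFrames instead.

```python
import sqlite3


def most_common_code(conn):
    """Step 1: find the most frequent diagnosis code and its row count."""
    return conn.execute(
        "SELECT icd_code, COUNT(*) AS n FROM diagnoses_icd "
        "GROUP BY icd_code ORDER BY n DESC, icd_code LIMIT 1"
    ).fetchone()


def cohort_size(conn, code):
    """Step 2: use the code from step 1 to count distinct patients and admissions."""
    return conn.execute(
        "SELECT COUNT(DISTINCT subject_id), COUNT(DISTINCT hadm_id) "
        "FROM diagnoses_icd WHERE icd_code = ?",
        (code,),
    ).fetchone()


# Toy stand-in for the real dataset (assumption): synthetic rows only.
rows = [
    (1, 101, "I10"),
    (2, 201, "I10"),
    (3, 301, "I10"),
    (2, 201, "E119"),
    (3, 302, "E119"),
]
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE diagnoses_icd (subject_id INTEGER, hadm_id INTEGER, icd_code TEXT)"
)
conn.executemany("INSERT INTO diagnoses_icd VALUES (?, ?, ?)", rows)

top_code, n = most_common_code(conn)
patients, admissions = cohort_size(conn, top_code)

# Report a compact summary instead of dumping every row.
print(f"{top_code}: {n} rows, {patients} patients, {admissions} admissions")
```

Keeping intermediate results in variables and printing only the summary is what keeps large result sets out of the conversation context.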

Agent Skills

M4 ships with skills that teach AI coding assistants how to use the Python API effectively. Skills are contextual prompts that activate when relevant: when you ask about clinical data analysis, the assistant automatically knows how to use M4's API.

Supported tools: Claude Code, Cursor, Cline, Codex CLI, Gemini CLI, GitHub Copilot

m4 skills                        # Interactive tool selection
m4 skills --tools claude,cursor  # Install for specific tools
m4 skills --list                 # Show installed skills
m4 config claude --skills        # Install during Claude Desktop setup

See Skills Guide for details on the available skills and how to create custom ones.

Example Questions

Once connected, try asking:

Tabular data (mimic-iv, eicu):

  • "What tables are available in the database?"
  • "Show me the race distribution in hospital admissions"
  • "Find all ICU stays longer than 7 days"
  • "What are the most common lab tests?"

Clinical notes (mimic-iv-note):

  • "Search for notes mentioning diabetes"
  • "List all notes for patient 10000032"
  • "Get the full discharge summary for this patient"

Supported Datasets

| Dataset | Modality | Size | Access | Local | BigQuery |
| --- | --- | --- | --- | --- | --- |
| mimic-iv-demo | Tabular | 100 patients | Free | Yes | No |
| mimic-iv | Tabular | 365k patients | PhysioNet credentialed | Yes | Yes |
| mimic-iv-note | Notes | 331k notes | PhysioNet credentialed | Yes | Yes |
| eicu | Tabular | 200k+ patients | PhysioNet credentialed | Yes | Yes |

These datasets are supported out of the box. You can add other datasets by following the Custom Datasets guide.

Switch datasets anytime:

m4 use mimic-iv     # Switch to full MIMIC-IV
m4 status           # Show active dataset details
m4 status --all     # List all available datasets

Setting up MIMIC-IV or eICU (credentialed datasets)

  1. Get PhysioNet credentials: Complete the credentialing process and sign the data use agreement for the dataset.

  2. Download the data:

    # For MIMIC-IV
    wget -r -N -c -np --user YOUR_USERNAME --ask-password \
      https://physionet.org/files/mimiciv/3.1/ \
      -P m4_data/raw_files/mimic-iv
    
    # For eICU
    wget -r -N -c -np --user YOUR_USERNAME --ask-password \
      https://physionet.org/files/eicu-crd/2.0/ \
      -P m4_data/raw_files/eicu
    

    Keep the downloaded files under m4_data/raw_files, with the m4_data directory ideally located inside your project directory. Name each dataset's subdirectory after the dataset: mimic-iv or eicu.

  3. Initialize:

    m4 init mimic-iv   # or: m4 init eicu
    

This converts the CSV files to Parquet format and creates a local DuckDB database.
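
Before running m4 init on a credentialed dataset, it can help to confirm that the download actually landed where expected. The sketch below is only a preflight check under assumptions: it recursively counts .csv and .csv.gz files under the raw-files path used in the wget commands above. The function name find_raw_tables is hypothetical, and the exact layout M4 expects is not verified here.

```python
from pathlib import Path


def find_raw_tables(raw_dir: str) -> list[Path]:
    """Return all CSV files (plain or gzipped) found anywhere under raw_dir."""
    root = Path(raw_dir)
    if not root.is_dir():
        return []
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.name.endswith((".csv", ".csv.gz"))
    )


# Path taken from the download commands above; adjust for eicu.
files = find_raw_tables("m4_data/raw_files/mimic-iv")
print(f"{len(files)} CSV files found")
for path in files[:5]:
    print(" ", path)
```

An empty result usually means the path is wrong or the download did not complete.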

Available Tools

M4 exposes these tools to your AI client. Tools are filtered based on the active dataset's modality.

Dataset Management:

| Tool | Description |
| --- | --- |
| list_datasets | List available datasets and their status |
| set_dataset | Switch the active dataset |

Tabular Data Tools (mimic-iv, mimic-iv-demo, eicu):

| Tool | Description |
| --- | --- |
| get_database_schema | List all available tables |
| get_table_info | Get column details and sample data |
| execute_query | Run SQL SELECT queries |

Clinical Notes Tools (mimic-iv-note):

| Tool | Description |
| --- | --- |
| search_notes | Full-text search with snippets |
| get_note | Retrieve a single note by ID |
| list_patient_notes | List notes for a patient (metadata only) |
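
To make the modality-based filtering concrete, here is a minimal sketch of how such a filter could work. It is not M4's implementation: the two dictionaries and the tools_for function are hypothetical, built only from the dataset and tool names listed above. It also assumes that the dataset-management tools stay available regardless of modality.

```python
# Hypothetical registry: modality of each supported dataset.
DATASET_MODALITY = {
    "mimic-iv-demo": "tabular",
    "mimic-iv": "tabular",
    "eicu": "tabular",
    "mimic-iv-note": "notes",
}

# Modality each tool requires; None marks tools assumed to be always available.
TOOL_MODALITY = {
    "list_datasets": None,
    "set_dataset": None,
    "get_database_schema": "tabular",
    "get_table_info": "tabular",
    "execute_query": "tabular",
    "search_notes": "notes",
    "get_note": "notes",
    "list_patient_notes": "notes",
}


def tools_for(dataset: str) -> list[str]:
    """Return the tool names that apply to the given dataset's modality."""
    modality = DATASET_MODALITY[dataset]
    return [
        name
        for name, required in TOOL_MODALITY.items()
        if required is None or required == modality
    ]


print(tools_for("mimic-iv-note"))
print(tools_for("eicu"))
```

With this kind of filter, an agent working on clinical notes never sees SQL tools, which keeps the tool list short and relevant.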

More Documentation

| Guide | Description |
| --- | --- |
| Code Execution | Python API for programmatic access |
| Skills | Claude Code skills for contextual assistance |
| Tools Reference | MCP tool documentation |
| BigQuery Setup | Google Cloud for full datasets |
| Custom Datasets | Add your own PhysioNet datasets |
| Development | Contributing, testing, architecture |
| OAuth2 Authentication | Enterprise security setup |

Roadmap

M4 is designed as a growing toolbox for LLM agents working with EHR data. Planned and ongoing directions include:

  • More Tools

    • Implement tools for current modalities (e.g. statistical reports, RAG)
    • Add tools for new modalities (images, waveforms)
  • Better context handling

    • Concise, dataset-aware context for LLM agents
  • Dataset expansion

    • Out-of-the-box support for additional PhysioNet datasets
    • Improved support for institutional/custom EHR schemas
  • Evaluation & reproducibility

    • Session export and replay
    • Evaluation with the latest LLMs and smaller expert models

The roadmap reflects current development goals and may evolve as the project matures.

Troubleshooting

"Parquet not found" error:

m4 init mimic-iv-demo --force

MCP client won't connect: Check client logs (Claude Desktop: Help → View Logs) and ensure the config JSON is valid.
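
One quick way to rule out a malformed config is to parse it with Python's json module, which reports the line and column of the first syntax error. The sketch below is an assumption-laden helper, not part of M4: the function name check_mcp_config is hypothetical, and the check for a top-level "mcpServers" key reflects the convention used by Claude Desktop and may not apply to every client. The server entry in the sample is a placeholder, not the config that M4 generates.

```python
import json


def check_mcp_config(text: str) -> str:
    """Return "ok" or a short description of what looks wrong."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
    if not isinstance(config, dict):
        return "top level must be a JSON object"
    if "mcpServers" not in config:
        return 'no "mcpServers" key found'
    return "ok"


# Placeholder config for illustration only.
sample = '{"mcpServers": {"example-server": {"command": "example-command", "args": []}}}'
print(check_mcp_config(sample))

# A trailing comma is a common copy-paste mistake that strict JSON rejects.
print(check_mcp_config('{"mcpServers": {},}'))
```

To check a real file, read its contents and pass the text to the function; the config file location depends on the client.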

Need to reconfigure:

m4 config claude --quick   # Regenerate Claude Desktop config
m4 config --quick          # Regenerate generic config

Citation

M4 builds on the M3 project. Please cite:

@article{attrach2025conversational,
  title={Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis},
  author={Attrach, Rafi Al and Moreira, Pedro and Fani, Rajna and Umeton, Renato and Celi, Leo Anthony},
  journal={arXiv preprint arXiv:2507.01053},
  year={2025}
}

Report an Issue · Contribute
