Infrastructure for AI-assisted clinical research with EHR datasets

M4: Infrastructure for AI-Assisted Clinical Research

Give your AI agents clinical intelligence & access to MIMIC-IV, eICU, and more

M4 is infrastructure for AI-assisted clinical research. Initialize MIMIC-IV, eICU, or custom datasets as fast local databases (with optional BigQuery for cloud access). Your AI agents get specialized tools (MCP, Python API) and clinical knowledge (agent skills) to query and analyze them.

M4 builds on the M3 project. Please cite their work when using M4!

Why M4?

Clinical research shouldn't require mastering database schemas. Whether you're screening a hypothesis, characterizing a cohort, or running a multi-step survival analysis—you should be able to describe what you want and get clinically meaningful results.

M4 makes this possible by giving AI agents deep clinical knowledge:

Understand clinical semantics. LLMs can write SQL, but have a harder time with (dataset-specific) clinical semantics. M4's comprehensive agent skills encode validated clinical concepts—so "find sepsis patients" produces clinically correct queries on any supported dataset.

Work across modalities. Clinical research with M4 spans structured data, clinical notes, and (soon) waveforms and imaging. M4 dynamically selects tools based on what each dataset contains—query labs in MIMIC-IV, search discharge summaries in MIMIC-IV-Note, all through the same interface.

Go beyond chat. Data exploration and simple research questions work great via MCP. But real research requires iteration: explore a cohort, compute statistics, visualize distributions, refine criteria. M4's Python API returns DataFrames that integrate with pandas, scipy, and matplotlib—turning your AI assistant into a research partner that can execute complete analysis workflows.

Cross-dataset research. Ask for multi-dataset queries or cross-dataset comparisons directly. The AI can switch between your initialized datasets on its own, so it can carry out cross-dataset tasks for you.

Quickstart (3 steps)

1. Install uv

macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows (PowerShell):

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

2. Initialize M4

mkdir my-research && cd my-research
uv init && uv add m4-infra
source .venv/bin/activate  # Windows: .venv\Scripts\activate
m4 init mimic-iv-demo

This downloads the free MIMIC-IV demo dataset (~16MB) and sets up a local DuckDB database.

3. Connect your AI client

Claude Desktop:

m4 config claude --quick

Other clients (Cursor, LibreChat, etc.):

m4 config --quick

Copy the generated JSON into your client's MCP settings, restart, and start asking questions!

Different setup options

If you prefer not to use uv, run pip install m4-infra instead.
For a Docker-based setup, see docs/DEVELOPMENT.md.

Code Execution

For complex analysis that goes beyond simple queries, M4 provides a Python API that returns Python data types instead of formatted strings (e.g. pd.DataFrame for SQL queries). This transforms M4 from a query tool into a complete clinical data analysis environment.

from m4 import set_dataset, execute_query, get_schema

set_dataset("mimic-iv")

# Get schema as a dict
schema = get_schema()
print(schema['tables'])  # ['mimiciv_hosp.admissions', 'mimiciv_hosp.diagnoses_icd', ...]

# Query returns a pandas DataFrame
df = execute_query("""
    SELECT icd_code, COUNT(*) as n
    FROM mimiciv_hosp.diagnoses_icd
    GROUP BY icd_code
    ORDER BY n DESC
    LIMIT 10
""")

# Filter with pandas, then plot the counts per ICD code
df[df['n'] > 100].plot(kind='bar', x='icd_code', y='n')

The API uses the same tools as the MCP server, so behavior is consistent. But instead of parsing text, you get DataFrames you can immediately analyze, visualize, or feed into downstream pipelines.

When to use code execution:

Multi-step analyses where each query informs the next
Large result sets (thousands of rows) that shouldn't flood your context
Statistical computations, survival analysis, cohort characterization
Building reproducible analysis notebooks
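
A multi-step analysis, where the second query is built from the result of the first, might look like the sketch below. The helper names `as_records` and `patients_with_top_codes` are hypothetical and not part of M4. The query function is passed in as a parameter so the control flow runs without a database; with M4 installed you would pass `m4.execute_query`. The sketch assumes only the `icd_code` and `subject_id` columns of `mimiciv_hosp.diagnoses_icd`.

```python
from typing import Any, Callable


def as_records(result: Any) -> list[dict]:
    """Accept a pandas DataFrame or a plain list of dicts."""
    if hasattr(result, "to_dict"):
        return result.to_dict("records")
    return list(result)


def patients_with_top_codes(
    execute_query: Callable[[str], Any], top_n: int = 3
) -> list[dict]:
    # Step 1: find the most frequent ICD codes.
    top = as_records(execute_query(
        "SELECT icd_code, COUNT(*) AS n "
        "FROM mimiciv_hosp.diagnoses_icd "
        f"GROUP BY icd_code ORDER BY n DESC LIMIT {int(top_n)}"
    ))
    codes = [str(row["icd_code"]) for row in top]
    if not codes:
        return []
    # ICD codes are alphanumeric; refuse anything else before building SQL.
    if not all(code.isalnum() for code in codes):
        raise ValueError(f"unexpected ICD code in {codes!r}")
    in_list = ", ".join(f"'{code}'" for code in codes)

    # Step 2: use the step-1 result to count distinct patients per code.
    return as_records(execute_query(
        "SELECT icd_code, COUNT(DISTINCT subject_id) AS patients "
        "FROM mimiciv_hosp.diagnoses_icd "
        f"WHERE icd_code IN ({in_list}) GROUP BY icd_code"
    ))
```

Only the small aggregated result of each step is returned, so the full diagnosis table is never loaded into the conversation.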

See the Code Execution guide for the full API reference and an example session walkthrough.

Agent Skills

M4 ships with 17 skills that teach AI coding assistants clinical research patterns. Skills activate automatically when relevant—ask about "SOFA scores" or "sepsis cohorts" and Claude uses validated SQL from MIT-LCP repositories.

Included skills:

API: m4-api for Python API usage
Severity Scores: SOFA, APACHE III, SAPS-II, OASIS, LODS, SIRS
Sepsis: Sepsis-3 cohort identification, suspected infection
Organ Failure: KDIGO AKI staging
Measurements: GCS calculation, baseline creatinine, vasopressor equivalents
Cohort Selection: First ICU stay identification
Data Quality: Table relationships, MIMIC-eICU mapping, research pitfalls
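
To give a sense of what a skill has to encode, here is a minimal sketch of the standard SIRS criteria in Python. It is an illustration only, not the SQL that M4 ships: the `Vitals` class, its field names, and `sirs_score` are hypothetical, and treating missing measurements as normal is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vitals:
    temperature_c: Optional[float] = None   # degrees Celsius
    heart_rate: Optional[float] = None      # beats per minute
    resp_rate: Optional[float] = None       # breaths per minute
    paco2_mmhg: Optional[float] = None      # arterial CO2, mmHg
    wbc_k_per_ul: Optional[float] = None    # white cells, 10^3 per microliter
    bands_pct: Optional[float] = None       # immature neutrophils, percent


def sirs_score(v: Vitals) -> int:
    """Count how many of the four SIRS criteria are met (0 to 4)."""
    temp = v.temperature_c is not None and (
        v.temperature_c < 36.0 or v.temperature_c > 38.0
    )
    heart = v.heart_rate is not None and v.heart_rate > 90
    resp = (v.resp_rate is not None and v.resp_rate > 20) or (
        v.paco2_mmhg is not None and v.paco2_mmhg < 32
    )
    wbc = (
        v.wbc_k_per_ul is not None
        and (v.wbc_k_per_ul < 4 or v.wbc_k_per_ul > 12)
    ) or (v.bands_pct is not None and v.bands_pct > 10)
    return sum([temp, heart, resp, wbc])


# Fever, tachycardia, and leukocytosis, with a normal respiratory rate.
print(sirs_score(Vitals(temperature_c=38.5, heart_rate=110,
                        resp_rate=18, wbc_k_per_ul=13.0)))  # prints 3
```

The thresholds are the easy part. On a real dataset, most of the work is finding the right item IDs, units, and time windows, which is the knowledge the skills carry.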

Supported tools: Claude Code, Cursor, Cline, Codex CLI, Gemini CLI, GitHub Copilot

m4 skills                        # Interactive tool selection
m4 skills --tools claude,cursor  # Install for specific tools
m4 skills --list                 # Show installed skills

See Skills Guide for the full list and how to create custom skills.

Example Questions

Once connected, try asking:

Tabular data (mimic-iv, eicu):

"What tables are available in the database?"
"Show me the race distribution in hospital admissions"
"Find all ICU stays longer than 7 days"
"What are the most common lab tests?"

Derived concept tables (mimic-iv, after m4 init-derived):

"What are the average SOFA scores for patients with sepsis?"
"Show KDIGO AKI staging distribution across ICU stays"
"Find patients on norepinephrine with SOFA > 10"
"What is the 30-day mortality for patients with Charlson index > 5?"

Clinical notes (mimic-iv-note):

"Search for notes mentioning diabetes"
"List all notes for patient 10000032"
"Get the full discharge summary for this patient"

Supported Datasets

Dataset	Modality	Size	Access	Local	BigQuery	Derived Tables
mimic-iv-demo	Tabular	100 patients	Free	Yes	No	No
mimic-iv	Tabular	365k patients	PhysioNet credentialed	Yes	Yes	Yes (63 tables)
mimic-iv-note	Notes	331k notes	PhysioNet credentialed	Yes	Yes	No
eicu	Tabular	200k+ patients	PhysioNet credentialed	Yes	Yes	No

These datasets are supported out of the box. To add any other dataset, follow the Custom Datasets guide.

Switch datasets or backends anytime:

m4 use mimic-iv     # Switch to full MIMIC-IV
m4 backend bigquery # Switch to BigQuery (or duckdb)
m4 status           # Show active dataset and backend
m4 status --all     # List all available datasets
m4 status --derived # Show per-table derived materialization status

Derived concept tables (MIMIC-IV only):

m4 init-derived mimic-iv         # Materialize ~63 derived tables (SOFA, sepsis3, KDIGO, etc.)
m4 init-derived mimic-iv --list  # List available derived tables without materializing

After running m4 init mimic-iv, you are asked whether to materialize derived tables. You can also run m4 init-derived separately at any time. Derived tables are created in the mimiciv_derived schema (e.g., mimiciv_derived.sofa) and are immediately queryable. The SQL is vendored from the mimic-code repository; it is production-tested and DuckDB-compatible. BigQuery users already have these tables available via physionet-data.mimiciv_derived and do not need to run init-derived.

Setting up MIMIC-IV or eICU (credentialed datasets)

Get PhysioNet credentials: Complete the credentialing process and sign the data use agreement for the dataset.

Download the data:

# For MIMIC-IV
wget -r -N -c -np --cut-dirs=2 -nH --user YOUR_USERNAME --ask-password \
  https://physionet.org/files/mimiciv/3.1/ \
  -P m4_data/raw_files/mimic-iv

# For eICU
wget -r -N -c -np --cut-dirs=2 -nH --user YOUR_USERNAME --ask-password \
  https://physionet.org/files/eicu-crd/2.0/ \
  -P m4_data/raw_files/eicu

The -nH flag drops the physionet.org host directory and --cut-dirs=2 drops the first two path components of the URL, so the files are saved under m4_data/raw_files/mimic-iv/ rather than in a nested physionet.org/files/... structure.
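
To see where a given file ends up, the path mapping that wget applies with these flags can be reproduced in a few lines. This is a sketch of wget's documented behavior for -P, -nH, and --cut-dirs, not part of M4; the function name `wget_local_path` is hypothetical, and it handles file URLs only.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def wget_local_path(url: str, prefix: str, cut_dirs: int = 0,
                    no_host: bool = False) -> str:
    """Return the local path wget -r would use for a file URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    *dirs, filename = segments          # the last segment is the file name
    kept_dirs = dirs[cut_dirs:]         # --cut-dirs drops leading directories
    root = PurePosixPath(prefix)        # -P sets the download prefix
    if not no_host:                     # -nH omits the host directory
        root = root / parts.netloc
    return str(root.joinpath(*kept_dirs, filename))


url = "https://physionet.org/files/mimiciv/3.1/hosp/admissions.csv.gz"
print(wget_local_path(url, "m4_data/raw_files/mimic-iv",
                      cut_dirs=2, no_host=True))
# prints m4_data/raw_files/mimic-iv/3.1/hosp/admissions.csv.gz
```

Directory components after the first two, here the version directory, are kept as they are.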

Initialize:
```
m4 init mimic-iv   # or: m4 init eicu
```

This converts the CSV files to Parquet format and creates a local DuckDB database.

Available Tools

M4 exposes these tools to your AI client. Tools are filtered based on the active dataset's modality.

Dataset Management:

Tool	Description
`list_datasets`	List available datasets and their status
`set_dataset`	Switch the active dataset

Tabular Data Tools (mimic-iv, mimic-iv-demo, eicu):

Tool	Description
`get_database_schema`	List all available tables
`get_table_info`	Get column details and sample data
`execute_query`	Run SQL SELECT queries

Clinical Notes Tools (mimic-iv-note):

Tool	Description
`search_notes`	Full-text search with snippets
`get_note`	Retrieve a single note by ID
`list_patient_notes`	List notes for a patient (metadata only)
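
The filtering described above can be pictured as a simple lookup. The sketch below only restates the tables in this section and the dataset modalities from Supported Datasets; it is not M4's implementation. The function name `available_tools` is hypothetical, and it assumes the dataset management tools are always exposed.

```python
# Modality of each built-in dataset, from the Supported Datasets table.
DATASET_MODALITY = {
    "mimic-iv-demo": "tabular",
    "mimic-iv": "tabular",
    "mimic-iv-note": "notes",
    "eicu": "tabular",
}

# Assumed to be available regardless of the active dataset.
MANAGEMENT_TOOLS = ["list_datasets", "set_dataset"]

TOOLS_BY_MODALITY = {
    "tabular": ["get_database_schema", "get_table_info", "execute_query"],
    "notes": ["search_notes", "get_note", "list_patient_notes"],
}


def available_tools(dataset: str) -> list[str]:
    """Return the tool names an AI client would see for a dataset."""
    modality = DATASET_MODALITY[dataset]
    return MANAGEMENT_TOOLS + TOOLS_BY_MODALITY[modality]


print(available_tools("mimic-iv-note"))
# prints ['list_datasets', 'set_dataset', 'search_notes', 'get_note', 'list_patient_notes']
```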

More Documentation

Guide	Description
Architecture	Design philosophy, system overview, clinical semantics
Code Execution	Python API for programmatic access
Skills	17 clinical research skills and custom skill creation
Tools Reference	MCP tool documentation
BigQuery Setup	Google Cloud for full datasets
Custom Datasets	Add your own PhysioNet datasets
Development	Contributing, testing, code style
OAuth2 Authentication	Enterprise security setup

Roadmap

M4 is infrastructure for AI-assisted clinical research. Current priorities:

Clinical Semantics
- More concept mappings (comorbidity indices, medication classes)
- Semantic search over clinical notes (beyond keyword matching)
- More agent skills that provide meaningful clinical knowledge

New Modalities
- Waveforms (ECG, arterial blood pressure)
- Imaging (chest X-rays)

Clinical Research Agents
- Skills and guardrails that enforce scientific integrity and best practices (documentation, etc.)
- Query logging and session export
- Result fingerprints for audit trails

Troubleshooting

"Parquet not found" error:

m4 init mimic-iv-demo --force

MCP client won't connect: Check client logs (Claude Desktop: Help → View Logs) and ensure the config JSON is valid.
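
A quick way to rule out a malformed config is to parse it before restarting the client. The sketch below assumes the `mcpServers` layout used by Claude Desktop, where each server entry has a `command`; other clients may use a different shape. The function name `check_mcp_config` and the server name `m4` in the sample are hypothetical, so compare against what m4 config actually generated.

```python
import json


def check_mcp_config(text: str) -> list[str]:
    """Return a list of problems found in an MCP client config string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    servers = config.get("mcpServers")
    if not isinstance(servers, dict) or not servers:
        return ['missing or empty "mcpServers" object']

    problems = []
    for name, entry in servers.items():
        if not isinstance(entry, dict) or "command" not in entry:
            problems.append(f'server "{name}" has no "command"')
    return problems


sample = '{"mcpServers": {"m4": {"command": "m4", "args": []}}}'
print(check_mcp_config(sample))  # prints []
```

A trailing comma or a stray quote left over from copy and paste is reported with its line and column.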

m4 command opens GNU M4 instead of the CLI: On macOS/Linux, m4 is a built-in system utility. Make sure your virtual environment is activated (source .venv/bin/activate) so that the correct m4 binary is found first. Alternatively, use uv run m4 [command] to run within the project environment without activating it.
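
To check which m4 your shell would run, you can ask Python from inside the project environment. This is a small diagnostic sketch using only the standard library; the function names `is_inside` and `diagnose_m4` are hypothetical.

```python
import shutil
import sys
from pathlib import Path


def is_inside(path: str, prefix: str) -> bool:
    """True if the directory containing path lies under prefix."""
    try:
        Path(path).parent.resolve().relative_to(Path(prefix).resolve())
        return True
    except ValueError:
        return False


def diagnose_m4() -> str:
    """Report which m4 executable is first on PATH."""
    found = shutil.which("m4")
    if found is None:
        return "no m4 executable on PATH"
    if is_inside(found, sys.prefix):
        return f"{found} (inside the active environment)"
    return f"{found} (outside the active environment, probably GNU M4)"


print(diagnose_m4())
```

Run it with the interpreter of the environment you expect to be active, since `sys.prefix` points at the environment that interpreter belongs to.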

Need to reconfigure:

m4 config claude --quick   # Regenerate Claude Desktop config
m4 config --quick          # Regenerate generic config

Citation

M4 builds on the M3 project. Please cite:

@article{attrach2025conversational,
  title={Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis},
  author={Attrach, Rafi Al and Moreira, Pedro and Fani, Rajna and Umeton, Renato and Celi, Leo Anthony},
  journal={arXiv preprint arXiv:2507.01053},
  year={2025}
}

Report an Issue · Contribute

Release history

Version	Released
0.5.3	May 30, 2026
0.5.2	May 30, 2026
0.5.1	May 30, 2026
0.5.0	May 30, 2026
0.4.5	May 29, 2026
0.4.4	May 28, 2026
0.4.3	Feb 19, 2026
0.4.2 (this version)	Jan 30, 2026
0.4.1	Jan 28, 2026
0.4.0	Jan 28, 2026
0.0.0.dev0 (pre-release)	Jan 12, 2026
