
A tiny EHR dataset for learning, prototyping, and building, consisting of de-identified data for 100 patients.


TinyEHR: A Tiny Electronic Health Records Dataset for Learning, Prototyping, and Building

TinyEHR is a small, open, reproducible clinical dataset of 100 patients, available in two formats: MIMIC and OMOP. It is derived from the MIMIC-IV Clinical Database Demo v2.2, the publicly available subset of MIMIC-IV published by the MIT Laboratory for Computational Physiology.

TinyEHR is openly available: no credentialing or data use agreement is required. Install the package and start exploring clinical data in seconds.

Website: tinyehr.org
GitHub: github.com/vidulpanickan/TinyEHR
HuggingFace: datasets/vidulpanickan/TinyEHR
PyPI: pip install tinyehr

Install

pip install tinyehr

Python API

import tinyehr

# Quick reference of all functions
tinyehr.help()

# Overview of all tables with row counts
tinyehr.info()
tinyehr.info(format="tinyehr_omop_format")

# List table names
tinyehr.list_tables()
tinyehr.list_tables(format="tinyehr_omop_format")

# Column names, types, and sample rows for a table
tinyehr.describe_table("patients")
tinyehr.describe_table("person", format="tinyehr_omop_format")

# Find tables by keyword in table and column names
tinyehr.search_tables("lab")
tinyehr.search_tables("drug")

# Load a table as a pandas DataFrame
patients = tinyehr.load_table("patients")
person = tinyehr.load_table("person", format="tinyehr_omop_format")

# All data for one patient across all tables
data = tinyehr.get_patient(10000032)
data["admissions"]    # DataFrame of this patient's admissions
data["labevents"]     # DataFrame of this patient's labs
data["noteevents"]    # DataFrame of this patient's notes

# Build a local SQLite database (one per format)
mimic_db_path = tinyehr.build_sqlite(format="tinyehr_mimic_format")
omop_db_path = tinyehr.build_sqlite(format="tinyehr_omop_format")

# Query the MIMIC-format database (admissions is a MIMIC table)
import sqlite3
conn = sqlite3.connect(mimic_db_path)
conn.execute("SELECT * FROM admissions LIMIT 5").fetchall()
conn.close()
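
Once a database is built, ordinary SQL joins work across its tables. The sketch below shows the idea on an in-memory SQLite database with two toy tables standing in for patients and admissions. The column names (subject_id, hadm_id, admittime, gender, anchor_age) follow the MIMIC-IV schema, but the rows are invented for illustration and are not TinyEHR data, and admissions_per_patient is a hypothetical helper, not part of the tinyehr package. With a real build, connect to the path returned by build_sqlite instead of ":memory:".

```python
import sqlite3

# In-memory stand-in for the file produced by tinyehr.build_sqlite().
# The rows below are invented for illustration, not taken from TinyEHR.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patients (
        subject_id INTEGER PRIMARY KEY, gender TEXT, anchor_age INTEGER
    );
    CREATE TABLE admissions (
        hadm_id INTEGER PRIMARY KEY, subject_id INTEGER, admittime TEXT
    );
    INSERT INTO patients VALUES (1, 'F', 52), (2, 'M', 67);
    INSERT INTO admissions VALUES
        (101, 1, '2015-03-01 08:00:00'),
        (102, 1, '2016-07-19 14:30:00'),
        (103, 2, '2018-11-05 22:15:00');
    """
)


def admissions_per_patient(connection):
    """Return (subject_id, gender, n_admissions) rows ordered by subject_id."""
    query = """
        SELECT p.subject_id, p.gender, COUNT(a.hadm_id) AS n_admissions
        FROM patients AS p
        LEFT JOIN admissions AS a ON a.subject_id = p.subject_id
        GROUP BY p.subject_id, p.gender
        ORDER BY p.subject_id
    """
    return connection.execute(query).fetchall()


rows = admissions_per_patient(conn)
print(rows)
conn.close()
```

The LEFT JOIN keeps patients with no admissions in the result with a count of zero, which is usually what you want when summarizing a cohort.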

Direct from HuggingFace

import pandas as pd

patients = pd.read_parquet(
    "hf://datasets/vidulpanickan/tinyehr/tinyehr_mimic_format/patients.parquet"
)

No dependencies beyond pandas and pyarrow.

Trouble downloading?

You can download the raw CSV files directly from GitHub:

  1. Go to github.com/vidulpanickan/TinyEHR
  2. Click the green Code button
  3. Select Download ZIP

Or clone via terminal:

git clone https://github.com/vidulpanickan/TinyEHR.git

Formats

TinyEHR ships in two formats from the same underlying patient cohort:

MIMIC format follows the original MIMIC-IV schema, with dates shifted to realistic years, ICD codes reformatted with decimal points, and 4,580 clinical notes generated by an LLM from patient visit profiles.

OMOP format follows the OHDSI CDM v5.3.1 schema, with hashed person IDs, dates shifted to realistic years, and clinical codes mapped to standardized medical vocabularies. ICD codes in source_value fields are stored without decimal points, following the OMOP billing/claims convention.
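
Because the two formats store ICD codes differently, comparing MIMIC-format codes with OMOP source_value fields means converting between the dotted and undotted forms. The following is a minimal sketch under the common ICD conventions (decimal after the third character for diagnoses, after the fourth for ICD-9 E-codes, after the second digit for ICD-9 procedures, none for ICD-10-PCS). The function names are hypothetical and not part of the tinyehr package, and the package's own rules may differ in edge cases.

```python
def add_icd_decimal(code: str, version: int, kind: str) -> str:
    """Insert the conventional decimal point into an undotted ICD code.

    version is 9 or 10; kind is "diagnosis" or "procedure".
    Hypothetical helper for illustration only.
    """
    code = code.strip().upper()
    if version == 10 and kind == "procedure":
        return code  # ICD-10-PCS codes carry no decimal point
    if version == 9 and kind == "procedure":
        split_at = 2  # ICD-9 procedures: two digits before the point
    elif version == 9 and code.startswith("E"):
        split_at = 4  # ICD-9 E-codes: four characters before the point
    else:
        split_at = 3  # ICD-9 and ICD-10 diagnoses: three characters
    if len(code) <= split_at:
        return code  # nothing follows the category, so no point is added
    return f"{code[:split_at]}.{code[split_at:]}"


def strip_icd_decimal(code: str) -> str:
    """Return the undotted (billing/claims style) form of an ICD code."""
    return code.strip().replace(".", "")


dotted = add_icd_decimal("E119", version=10, kind="diagnosis")
undotted = strip_icd_decimal(dotted)
print(dotted, undotted)
```

Stripping the decimal from a MIMIC-format code gives a value in the same undotted style that OMOP source_value fields use, so the two can be compared directly.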

For full dataset structure, schema documentation, and table details, visit About The Data.

Differences from MIMIC-IV Demo

TinyEHR applies four targeted transformations to the original MIMIC-IV Demo data. All clinical values, patient demographics, table structures, referential integrity, and row counts are unchanged.

| Transformation | What changed | Why |
|---|---|---|
| Date shifting | All dates shifted from the synthetic 2100+ range to realistic 2010s-2020s using per-patient offsets derived from anchor_year_group. Affects 21 MIMIC tables and 15 OMOP tables. Offsets saved in metadata/date_offsets.csv. | Realistic dates for prototyping. |
| ICD code formatting (MIMIC only) | Decimal points inserted into ICD codes (E119 → E11.9, 3961 → 39.61). ICD-10-PCS codes left unchanged. OMOP source_value fields are not modified (preserves billing/claims format). | Matches real-world clinical code formatting. |
| Clinical notes | 4,580 notes across 14 types, generated from each patient's profile during their hospital visit, including demographics, diagnoses, and admission data. Added as noteevents (MIMIC) and note (OMOP). | The original Demo has no clinical notes. |
| OMOP note concepts | 19 note-related concepts added to 2b_concept.csv (10 Note Type, 7 LOINC Document Ontology, 2 utility). Row count: 3,885 → 3,904. | Required for OMOP note table concept references. |
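
To illustrate the date-shifting idea, here is a minimal sketch that applies one fixed offset per patient to every timestamp belonging to that patient. The offsets, patient IDs, and events are invented for the example. The real offsets are stored in metadata/date_offsets.csv, whose column layout is not assumed here, and this is not necessarily how TinyEHR computes or applies them.

```python
from datetime import datetime, timedelta

# Hypothetical per-patient offsets in days (invented values).
offsets_days = {1: -36500, 2: -32850}


def shift_timestamp(subject_id, timestamp, offsets):
    """Shift a timestamp by the fixed offset assigned to its patient."""
    return timestamp + timedelta(days=offsets[subject_id])


# Invented events in a synthetic 2100+ range: (subject_id, timestamp).
events = [
    (1, datetime(2115, 3, 1, 8, 0)),
    (1, datetime(2115, 3, 4, 16, 30)),
    (2, datetime(2108, 11, 5, 22, 15)),
]

shifted = [(sid, shift_timestamp(sid, ts, offsets_days)) for sid, ts in events]

# One offset per patient leaves the gaps between that patient's events intact.
original_gap = events[1][1] - events[0][1]
shifted_gap = shifted[1][1] - shifted[0][1]
print(original_gap == shifted_gap)
```

Using a single offset per patient, rather than one per row or per table, is what keeps lengths of stay and the ordering of events unchanged within each patient's record.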

What's New (v0.2.0)

  • Parquet type enforcement: all column types now match the official MIMIC-IV and OMOP CDM DDL schemas exactly (integers, floats, timestamps, strings)
  • CSV type enforcement: when loading CSVs without pyarrow, types are applied from bundled DDL schema files instead of relying on pandas auto-inference
  • OMOP source values: no longer formatted with decimal points, preserving the billing/claims convention
  • ICD-9 procedure codes: decimal point now correctly placed after 2nd digit (3961 → 39.61)
  • Clinical notes: regenerated from patient profiles with correct admission dates
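
The CSV type enforcement item above matters because type inference can silently change values, for example by reading an ICD code such as 0389 as the integer 389. The sketch below shows the general technique of applying an explicit schema while reading a CSV, using only the standard library. SCHEMA and load_typed_csv are hypothetical names, the column names follow the MIMIC-IV diagnoses_icd table, and the sample rows are invented. It does not reproduce the package's bundled DDL schema files or its loader.

```python
import csv
import io

# Hypothetical schema: column name -> converter applied to each raw string.
SCHEMA = {
    "subject_id": int,
    "hadm_id": int,
    "seq_num": int,
    "icd_code": str,  # kept as text so leading zeros survive
    "icd_version": int,
}


def load_typed_csv(handle, schema):
    """Read CSV rows as dicts, converting each field with its schema type.

    Empty strings become None instead of being passed to the converter.
    """
    typed_rows = []
    for raw in csv.DictReader(handle):
        typed_rows.append(
            {
                column: (schema[column](value) if value != "" else None)
                for column, value in raw.items()
            }
        )
    return typed_rows


# Invented sample rows for illustration.
sample = io.StringIO(
    "subject_id,hadm_id,seq_num,icd_code,icd_version\n"
    "1,101,1,0389,9\n"
    "1,101,2,E119,10\n"
)

typed = load_typed_csv(sample, SCHEMA)
print(typed[0])
```

With pandas, the same effect is usually achieved by passing an explicit dtype mapping to read_csv rather than relying on inference.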

License

  • Code (this Python package): MIT License
  • Data (the TinyEHR dataset): ODbL-1.0
