A tiny EHR dataset for learning, prototyping, and building, consisting of de-identified data for 100 patients
TinyEHR: A Tiny Electronic Health Records Dataset for Learning, Prototyping, and Building
TinyEHR is a small, open, reproducible clinical dataset of 100 patients, available in two formats: MIMIC and OMOP. It is derived from the MIMIC-IV Clinical Database Demo v2.2, the publicly available subset of MIMIC-IV published by the MIT Laboratory for Computational Physiology.
TinyEHR is openly available: no credentialing or data use agreement is required. Install it and start exploring clinical data in seconds.
| Website | tinyehr.org |
| GitHub | github.com/vidulpanickan/TinyEHR |
| HuggingFace | datasets/vidulpanickan/TinyEHR |
| PyPI | pip install tinyehr |
Install
pip install tinyehr
Python API
import tinyehr
# Quick reference of all functions
tinyehr.help()
# Overview of all tables with row counts
tinyehr.info()
tinyehr.info(format="tinyehr_omop_format")
# List table names
tinyehr.list_tables()
tinyehr.list_tables(format="tinyehr_omop_format")
# Column names, types, and sample rows for a table
tinyehr.describe_table("patients")
tinyehr.describe_table("person", format="tinyehr_omop_format")
# Find tables by keyword in table and column names
tinyehr.search_tables("lab")
tinyehr.search_tables("drug")
# Load a table as a pandas DataFrame
patients = tinyehr.load_table("patients")
person = tinyehr.load_table("person", format="tinyehr_omop_format")
# All data for one patient across all tables
data = tinyehr.get_patient(10000032)
data["admissions"] # DataFrame of this patient's admissions
data["labevents"] # DataFrame of this patient's labs
data["noteevents"] # DataFrame of this patient's notes
# Build a local SQLite database for each format
mimic_db = tinyehr.build_sqlite(format="tinyehr_mimic_format")
omop_db = tinyehr.build_sqlite(format="tinyehr_omop_format")
# Query the MIMIC-format database (admissions is a MIMIC table, not an OMOP one)
import sqlite3
conn = sqlite3.connect(mimic_db)
conn.execute("SELECT * FROM admissions LIMIT 5").fetchall()
conn.close()
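The SQLite query above can be extended with a few standard `sqlite3` patterns: listing the tables in the database, returning rows by column name, and passing values as parameters. The sketch below runs against an in-memory stand-in with made-up rows so that it works without TinyEHR installed; only the `admissions` table name and its `subject_id`, `hadm_id`, and `admission_type` columns come from the MIMIC schema, and the row values are hypothetical.

```python
import sqlite3
from contextlib import closing

# Stand-in database with made-up rows. With TinyEHR installed, connect to
# the path returned by tinyehr.build_sqlite(...) instead of ":memory:".
with closing(sqlite3.connect(":memory:")) as conn:
    conn.row_factory = sqlite3.Row  # access columns by name
    conn.execute(
        "CREATE TABLE admissions ("
        "subject_id INTEGER, hadm_id INTEGER, admission_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO admissions VALUES (?, ?, ?)",
        [
            (10000032, 1, "URGENT"),      # hypothetical rows
            (10000032, 2, "EW EMER."),
            (10000099, 3, "ELECTIVE"),
        ],
    )

    # List the tables present in the database.
    tables = [
        row["name"]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]

    # Parameterized query: all admissions for one patient.
    rows = conn.execute(
        "SELECT hadm_id, admission_type FROM admissions "
        "WHERE subject_id = ? ORDER BY hadm_id",
        (10000032,),
    ).fetchall()
    result = [dict(row) for row in rows]

print(tables)
print(result)
```

Using `?` placeholders rather than string formatting keeps patient identifiers out of the SQL text, and `closing(...)` releases the database file when the block ends.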
Direct from HuggingFace
import pandas as pd
patients = pd.read_parquet(
"hf://datasets/vidulpanickan/tinyehr/tinyehr_mimic_format/patients.parquet"
)
Apart from pandas and pyarrow, this needs only the huggingface_hub package, which pandas uses (via fsspec) to resolve hf:// paths.
Trouble downloading?
You can download the raw CSV files directly from GitHub:
- Go to github.com/vidulpanickan/TinyEHR
- Click the green Code button
- Select Download ZIP
Or clone via terminal:
git clone https://github.com/vidulpanickan/TinyEHR.git
Formats
TinyEHR ships in two formats from the same underlying patient cohort:
MIMIC format follows the original MIMIC-IV schema, with dates shifted to realistic years, ICD codes reformatted with decimal points, and 4,580 clinical notes generated by an LLM from patient visit profiles.
OMOP format follows the OHDSI CDM v5.3.1 schema with hashed person IDs, dates shifted to realistic years, and clinical codes mapped to standardized medical vocabularies. ICD codes in source_value fields are stored without decimal points, following the OMOP billing/claims convention.
For full dataset structure, schema documentation, and table details, visit About The Data.
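The two ICD conventions mentioned above (decimal points in the MIMIC format, none in OMOP source_value fields) can be converted into each other with a few lines of string handling. The following is a minimal sketch based on the standard ICD placement rules, not TinyEHR's actual implementation; the function names and the `version`/`kind` parameters are hypothetical.

```python
def add_icd_decimal(code: str, version: int, kind: str) -> str:
    """Insert the conventional decimal point into a compact ICD code.

    version is 9 or 10; kind is "diagnosis" or "procedure".
    (Hypothetical helper, not part of the tinyehr API.)
    """
    code = code.strip().upper()
    if version == 10 and kind == "procedure":
        return code  # ICD-10-PCS codes carry no decimal point
    if version == 9 and kind == "procedure":
        split = 2    # ICD-9 procedures: two digits before the decimal
    elif version == 9 and code.startswith("E"):
        split = 4    # ICD-9 external-cause codes: "E" plus three digits
    else:
        split = 3    # ICD-9 and ICD-10 diagnoses: three characters
    if len(code) <= split:
        return code  # category-level code, nothing follows the decimal
    return f"{code[:split]}.{code[split:]}"


def strip_icd_decimal(code: str) -> str:
    """Return the compact billing/claims form of a dotted ICD code."""
    return code.replace(".", "")


print(add_icd_decimal("E119", 10, "diagnosis"))
print(add_icd_decimal("3961", 9, "procedure"))
print(strip_icd_decimal("E11.9"))
```

Applying `strip_icd_decimal` to a MIMIC-format code gives a value comparable with the OMOP source_value form.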
Differences from MIMIC-IV Demo
TinyEHR applies four targeted transformations to the original MIMIC-IV Demo data. All clinical values, patient demographics, table structures, referential integrity, and row counts are unchanged.
| Transformation | What changed | Why |
|---|---|---|
| Date shifting | All dates shifted from the synthetic 2100+ range to realistic 2010s-2020s using per-patient offsets derived from anchor_year_group. Affects 21 MIMIC tables and 15 OMOP tables. Offsets saved in metadata/date_offsets.csv. | Realistic dates for prototyping. |
| ICD code formatting (MIMIC only) | Decimal points inserted into ICD codes (E119 → E11.9, 3961 → 39.61). ICD-10-PCS codes left unchanged. OMOP source_value fields are not modified (preserves billing/claims format). | Matches real-world clinical code formatting. |
| Clinical notes | 4,580 notes across 14 types, generated from each patient's visit profile, including demographics, diagnoses, and admission data. Added as noteevents (MIMIC) and note (OMOP). | The original Demo has no clinical notes. |
| OMOP note concepts | 19 note-related concepts added to 2b_concept.csv (10 Note Type, 7 LOINC Document Ontology, 2 utility). Row count: 3,885 → 3,904. | Required for OMOP note table concept references. |
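The key property of per-patient date shifting is that every timestamp for a given patient moves by the same amount, so intervals such as length of stay are preserved. The sketch below illustrates this with a made-up offset; the `OFFSET_DAYS` mapping, the `shift` helper, and the example timestamps are assumptions for illustration, and the actual offsets and their units are those recorded in metadata/date_offsets.csv.

```python
from datetime import datetime, timedelta

# Hypothetical per-patient offset in days. The real offsets are stored in
# metadata/date_offsets.csv; its exact columns are not assumed here.
OFFSET_DAYS = {10000032: -60000}


def shift(ts: datetime, subject_id: int) -> datetime:
    """Move a timestamp by the patient's fixed offset."""
    return ts + timedelta(days=OFFSET_DAYS[subject_id])


# Made-up admission and discharge times in the synthetic 2100+ range.
admit = datetime(2180, 5, 6, 22, 23)
discharge = datetime(2180, 5, 7, 17, 15)

shifted_admit = shift(admit, 10000032)
shifted_discharge = shift(discharge, 10000032)

# The same offset is applied to both timestamps, so the stay length is unchanged.
original_stay = discharge - admit
shifted_stay = shifted_discharge - shifted_admit
print(shifted_admit, shifted_discharge, shifted_stay == original_stay)
```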
What's New (v0.2.0)
- Parquet type enforcement: all column types now match the official MIMIC-IV and OMOP CDM DDL schemas exactly (integers, floats, timestamps, strings)
- CSV type enforcement: when loading CSVs without pyarrow, types are applied from bundled DDL schema files instead of relying on pandas auto-inference
- OMOP source values: no longer formatted with decimal points, preserving the billing/claims convention
- ICD-9 procedure codes: decimal point now correctly placed after 2nd digit (3961 → 39.61)
- Clinical notes: regenerated from patient profiles with correct admission dates
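To see why explicit type enforcement matters when reading CSVs, consider an ICD code such as 0389: automatic inference would read it as the integer 389 and lose the leading zero. The sketch below applies a small hand-written schema while parsing, using only the standard library. The `SCHEMA` mapping, the `read_typed` helper, and the sample row are hypothetical and do not reflect how tinyehr reads its bundled DDL files; with pandas, the `dtype` and `parse_dates` arguments of `read_csv` serve the same purpose.

```python
import csv
import io
from datetime import datetime

# Hypothetical schema: column name -> converter. TinyEHR derives its types
# from bundled DDL schema files; this dict is only a stand-in.
SCHEMA = {
    "subject_id": int,
    "icd_code": str,  # keep as text so leading zeros survive
    "admittime": datetime.fromisoformat,
}

# Made-up CSV content with one row.
RAW = "subject_id,icd_code,admittime\n10000032,0389,2015-03-01 08:30:00\n"


def read_typed(text: str, schema: dict) -> list:
    """Parse CSV text and convert each field with the schema's converter."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                col: (schema[col](value) if value != "" else None)
                for col, value in record.items()
            }
        )
    return rows


rows = read_typed(RAW, SCHEMA)
print(rows[0])
```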
License
- Code (this Python package): MIT License
- Data (the TinyEHR dataset): ODbL-1.0