A tiny EHR dataset for learning, prototyping, and building — 100 patients in MIMIC and OMOP formats.
TinyEHR: A Tiny Electronic Health Records Dataset for Learning, Prototyping, and Building
TinyEHR is a small, open, reproducible clinical dataset of 100 patients, available in two formats: MIMIC and OMOP. It is derived from the MIMIC-IV Clinical Database Demo v2.2, the publicly available subset of MIMIC-IV published by the MIT Laboratory for Computational Physiology.
Open and ready to use — no credentialing and no data use agreements. Install and start exploring clinical data in seconds.
| Resource | Location |
|---|---|
| Website | tinyehr.org |
| GitHub | github.com/vidulpanickan/TinyEHR |
| HuggingFace | datasets/vidulpanickan/TinyEHR |
| PyPI | pip install tinyehr |
Install
pip install tinyehr
Python API
import tinyehr
# Quick reference of all functions
tinyehr.help()
# Overview of all tables with row counts
tinyehr.info()
tinyehr.info(format="tinyehr_omop_format")
# List table names
tinyehr.list_tables()
tinyehr.list_tables(format="tinyehr_omop_format")
# Column names, types, and sample rows for a table
tinyehr.describe_table("patients")
tinyehr.describe_table("person", format="tinyehr_omop_format")
# Find tables by keyword in table and column names
tinyehr.search_tables("lab")
tinyehr.search_tables("drug")
# Load a table as a pandas DataFrame
patients = tinyehr.load_table("patients")
person = tinyehr.load_table("person", format="tinyehr_omop_format")
# All data for one patient across all tables
data = tinyehr.get_patient(10000032)
data["admissions"] # DataFrame of this patient's admissions
data["labevents"] # DataFrame of this patient's labs
data["noteevents"] # DataFrame of this patient's notes
# Build a local SQLite database for each format
mimic_db_path = tinyehr.build_sqlite(format="tinyehr_mimic_format")
omop_db_path = tinyehr.build_sqlite(format="tinyehr_omop_format")
# Query the MIMIC-format database (admissions is a MIMIC table, not an OMOP one)
import sqlite3
conn = sqlite3.connect(mimic_db_path)
conn.execute("SELECT * FROM admissions LIMIT 5").fetchall()
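For trying out query patterns before building the real database, here is a minimal sketch against an in-memory SQLite database. The toy `admissions` table, its three columns (named after the MIMIC-IV schema), and all values are made up for illustration, and `admissions_for` is a hypothetical helper, not part of the tinyehr API.

```python
import sqlite3

# In-memory stand-in for the database that build_sqlite() would create.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name

conn.execute(
    "CREATE TABLE admissions (subject_id INTEGER, hadm_id INTEGER, admittime TEXT)"
)
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?)",
    [
        (1, 101, "2015-05-07 22:23:00"),  # illustrative rows, not real data
        (1, 102, "2015-08-01 09:10:00"),
        (2, 201, "2016-01-15 13:00:00"),
    ],
)


def admissions_for(conn, subject_id):
    """Return one patient's admissions, oldest first, as plain dicts."""
    cur = conn.execute(
        "SELECT hadm_id, admittime FROM admissions "
        "WHERE subject_id = ? ORDER BY admittime",
        (subject_id,),  # parameter binding instead of string formatting
    )
    return [dict(row) for row in cur.fetchall()]


rows = admissions_for(conn, 1)
```

The same pattern should carry over to a connection opened on the path returned by `build_sqlite`, assuming the table and column names match the MIMIC-format schema.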
Direct from HuggingFace
import pandas as pd
patients = pd.read_parquet(
"hf://datasets/vidulpanickan/tinyehr/tinyehr_mimic_format/patients.parquet"
)
No dependencies beyond pandas and pyarrow.
Trouble downloading?
You can download the raw CSV files directly from GitHub:
- Go to github.com/vidulpanickan/TinyEHR
- Click the green Code button
- Select Download ZIP
Or clone via terminal:
git clone https://github.com/vidulpanickan/TinyEHR.git
Formats
TinyEHR ships in two formats from the same underlying patient cohort:
MIMIC format follows the original MIMIC-IV schema with dates shifted to realistic years, ICD codes reformatted with decimal points, and 4,580 synthetic clinical notes added.
OMOP format follows the OHDSI CDM v5.3.1 schema with hashed person IDs, dates shifted to realistic years, ICD codes reformatted with decimal points, and clinical codes mapped to SNOMED, LOINC, and RxNorm via a custom MIMIC-specific concept vocabulary.
For full dataset structure, schema documentation, and table details, visit tinyehr.org.
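The decimal-point formatting mentioned above can be sketched as follows. This is not TinyEHR's actual implementation: it applies the standard ICD placement conventions (after the third character for diagnoses, after the fourth for ICD-9 external-cause E codes, after the second digit for ICD-9 procedures), and it assumes a version number is available alongside each code, as in the `icd_version` column of the MIMIC-IV schema. Both function names are hypothetical.

```python
def format_icd_diagnosis(code: str, version: int) -> str:
    """Insert the conventional decimal point into a compact ICD diagnosis code."""
    code = code.strip().upper()
    # ICD-9 external-cause codes (E800-E999) keep four characters before the point;
    # every other ICD-9 and ICD-10 diagnosis code keeps three.
    split_at = 4 if (version == 9 and code.startswith("E")) else 3
    if len(code) <= split_at:
        return code  # category-level code, nothing to split
    return f"{code[:split_at]}.{code[split_at:]}"


def format_icd_procedure(code: str, version: int) -> str:
    """Format an ICD procedure code; ICD-10-PCS codes have no decimal point."""
    code = code.strip().upper()
    if version == 10 or len(code) <= 2:
        return code
    # ICD-9 procedure codes place the point after the second digit.
    return f"{code[:2]}.{code[2:]}"


examples = [
    format_icd_diagnosis("E119", 10),     # ICD-10 diagnosis
    format_icd_diagnosis("V707", 9),      # ICD-9 V code
    format_icd_procedure("0DB64Z3", 10),  # ICD-10-PCS, left unchanged
]
```

Passing the version explicitly matters because a code starting with `E` is split differently in ICD-9 and ICD-10.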
Differences from MIMIC-IV Demo
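TinyEHR applies four targeted transformations to the original MIMIC-IV Demo data. Clinical values, patient demographics, table structures, and referential integrity are unchanged, and row counts are unchanged apart from the additions listed in the table (the new notes tables and the 19 added OMOP concepts).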
| Transformation | What changed | Why |
|---|---|---|
| Date shifting | All dates shifted from the synthetic 2100+ range to realistic 2010s-2020s using per-patient offsets derived from anchor_year_group. Affects 21 MIMIC tables and 15 OMOP tables. Offsets saved in metadata/date_offsets.csv. | Realistic dates for teaching and prototyping. |
| ICD code formatting | Decimal points inserted into ICD codes (E119 → E11.9, V707 → V70.7). ICD-10-PCS codes left unchanged. Affects diagnoses_icd, d_icd_diagnoses, procedures_icd, d_icd_procedures (MIMIC) and condition_source_value, procedure_source_value (OMOP). | Matches real-world clinical code formatting. |
| Synthetic clinical notes | 4,580 notes across 14 types added (not present in the original Demo). Generated using a large language model, grounded in each patient's demographics, diagnoses, and admission data. Added as noteevents (MIMIC) and note (OMOP) with proper concept mappings. | The original Demo has no clinical notes. |
| OMOP note concepts | 19 note-related concepts added to 2b_concept.csv (10 Note Type, 7 LOINC Document Ontology, 2 utility). Row count: 3,885 → 3,904. | Required for OMOP note table concept references. |
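The per-patient date shift in the first row can be illustrated with a short sketch. It is not the project's actual procedure: it assumes the offset maps January 1 of the patient's synthetic `anchor_year` onto January 1 of the middle year of `anchor_year_group`, and that the group is a string such as `"2014 - 2016"`. The function name and sample values are illustrative only; the offsets actually used are the ones recorded in metadata/date_offsets.csv.

```python
from datetime import datetime, timedelta


def patient_offset(anchor_year: int, anchor_year_group: str) -> timedelta:
    """Offset that moves a patient's synthetic timeline into their real-world year group."""
    start, end = (int(part) for part in anchor_year_group.split("-"))
    target_year = (start + end) // 2  # assumption: aim for the middle of the group
    return datetime(target_year, 1, 1) - datetime(anchor_year, 1, 1)


# Illustrative values for a single patient.
offset = patient_offset(2180, "2014 - 2016")
admit = datetime(2180, 5, 6, 22, 23)
discharge = datetime(2180, 5, 7, 17, 15)

shifted_admit = admit + offset
shifted_discharge = discharge + offset
```

Because one fixed `timedelta` is applied to every timestamp of a patient, intervals such as length of stay are preserved exactly, while calendar dates may land a day away from the original month and day when leap years differ.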
License
- Code (this Python package): MIT License
- Data (the TinyEHR dataset): ODbL-1.0