# ECGBench

Reproducible ECG benchmark datasets with standardised splits, validation, and Croissant metadata.

ECGBench provides a curated catalogue of 64 publicly available ECG datasets, a config-driven pipeline for generating validated fold splits, and a unified PyTorch `Dataset` class for loading any supported dataset.

Website: vlbthambawita.github.io/ECGBench
## Installation

Base install (config, catalogue, validation, splitting):

```bash
pip install ecgbench
```

With PyTorch support:

```bash
pip install ecgbench[torch]
```

With everything:

```bash
pip install ecgbench[all]
```

From source (development):

```bash
git clone https://github.com/vlbthambawita/ECGBench.git
cd ECGBench
uv pip install -e ".[dev]"
```
## Quick Start

```python
from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

# Load PTB-XL training data (downloads fold CSVs from HuggingFace Hub)
train_ds = ECGDataset("ptbxl", split="train", data_path="/path/to/ptb-xl/1.0.3/")
loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)

for batch in loader:
    signals = batch["signal"]    # (B, 12, 5000) float32 tensor
    ecg_ids = batch["record_id"]
    break
```
## Dataset Catalogue

Query the curated index of 64 ECG datasets:

```python
import ecgbench

# List all datasets
datasets = ecgbench.list_datasets()
print(f"{len(datasets)} datasets available")

# Search by name, origin, format, or paper
ecgbench.search("PTB-XL")

# Filter by category and access type
ecgbench.search(category="12-Lead (PhysioNet)", access="Open")

# Look up a single dataset
ecgbench.get_dataset("MIMIC-IV-ECG")

# List categories
ecgbench.categories()

# Get the catalogue as a pandas DataFrame
df = ecgbench.to_dataframe()
```
## Loading ECG Data

### Standard train/val/test splits

```python
from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

train_ds = ECGDataset("ptbxl", split="train", data_path="/data/ptb-xl/1.0.3/")
val_ds = ECGDataset("ptbxl", split="val", data_path="/data/ptb-xl/1.0.3/")
test_ds = ECGDataset("ptbxl", split="test", data_path="/data/ptb-xl/1.0.3/")

loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)
```
### K-fold cross-validation

```python
# Rotate the validation and test folds across all 10 folds
for k in range(1, 11):
    val_ds = ECGDataset("ptbxl", split="val", fold_numbers=[k], data_path="...")
    test_fold = k % 10 + 1
    test_ds = ECGDataset("ptbxl", split="test", fold_numbers=[test_fold], data_path="...")
    train_folds = [f for f in range(1, 11) if f != k and f != test_fold]
    train_ds = ECGDataset("ptbxl", split="train", fold_numbers=train_folds, data_path="...")
```
### `ECGDataset` parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset` | `str \| DatasetConfig` | required | Dataset slug or config object |
| `split` | `str` | `"train"` | `"train"`, `"val"`, or `"test"` |
| `version` | `str` | `"clean"` | `"clean"` or `"original"` |
| `data_path` | `Path \| str \| None` | `None` | Path to signal files; auto-downloads if `None` |
| `sampling_rate` | `int \| None` | `None` | Sampling rate (default: the dataset's default) |
| `fold_numbers` | `list[int] \| None` | `None` | Specific folds to load; `None` = all |
| `transform` | `Callable \| None` | `None` | Transform applied to the signal tensor |
| `metadata_source` | `str` | `"hf"` | `"hf"` (HuggingFace) or `"local"` |
## Output format

Each sample is a dict:

- `signal` -- float32 tensor of shape `(leads, samples)`
- `record_id` -- record identifier
- `split`, `fold` -- split name and fold number
- all other CSV columns, as tensors (numeric) or raw values (str/dict)
## Data Versions

- `clean` (default): only records that pass all quality checks
- `original`: all records, with `is_valid` and `quality_issues` columns

Both versions share identical fold assignments. Use `original` when you need all records or want to filter manually; use `clean` for standard benchmarking.
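With the `original` version, manual filtering reduces to a boolean selection on `is_valid`. A minimal sketch with plain Python records (the row values below are invented for illustration; real fold CSVs carry the dataset's actual columns):

```python
# Hypothetical rows as they might appear in an "original" fold CSV
records = [
    {"record_id": "r001", "is_valid": True,  "quality_issues": []},
    {"record_id": "r002", "is_valid": False, "quality_issues": ["flat_line"]},
    {"record_id": "r003", "is_valid": True,  "quality_issues": []},
]

# Keep only records that passed every quality check --
# this reproduces what the "clean" version ships by default
clean_records = [r for r in records if r["is_valid"]]
```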
## Validation

ECGBench validates every signal file before splitting:

- `missing_leads` -- lead entirely NaN or all-zero
- `nan_values` -- any NaN in the signal
- `truncated_signal` -- fewer samples than expected
- `flat_line` -- lead with near-zero variance
- `corrupt_header` -- unreadable signal file
- `amplitude_outlier` -- samples outside the physiological range

Results are saved in `validation_report.json` with per-record details.
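To make the checks concrete, a `flat_line` detector can be sketched as a per-lead variance threshold. This is our own illustration; the threshold value and function name are assumptions, not ECGBench's actual implementation.

```python
import numpy as np

def has_flat_line(signal: np.ndarray, var_threshold: float = 1e-6) -> bool:
    """Return True if any lead has near-zero variance over the time axis."""
    return bool((signal.var(axis=-1) < var_threshold).any())

t = np.linspace(0, 10, 100)
flat = np.vstack([np.sin(t), np.zeros(100)])      # second lead is dead
healthy = np.vstack([np.sin(t), np.cos(t)])        # both leads vary
```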
## Croissant Metadata

Both the `clean/` and `original/` versions include MLCommons Croissant 1.1 JSON-LD metadata (`croissant.json`) with SHA-256 hashes for reproducibility. The full pipeline generates both automatically. For standalone generation:

```bash
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/clean/ --version clean
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/original/ --version original
```
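The recorded SHA-256 hashes let consumers check downloaded split files for corruption against `croissant.json`. A minimal, library-agnostic sketch (the streaming helper below is our own, not part of ECGBench):

```python
import hashlib
import os
import tempfile

def sha256_of(path: str, chunk_size: int = 8192) -> str:
    """Stream a file from disk and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Self-check: the file digest must match an in-memory digest of the same bytes
payload = b"fold,record_id\n1,r001\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
digest = sha256_of(tmp.name)
os.unlink(tmp.name)
```

Comparing `digest` against the hash recorded in the Croissant distribution entry confirms the file is byte-identical to what was published.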
## Adding a New Dataset

1. Copy `ecgbench/data/configs/_template.yaml` to `<slug>.yaml` and fill in the fields
2. Run `python scripts/generate_splits.py --dataset <slug> --data-path /path/to/data/`
3. Check `validation_report.json` and review excluded records
4. If custom logic is needed, create `ecgbench/splitting/strategies/<slug>.py` with `@register("<slug>")`
5. Run `pytest`
6. Upload: `python scripts/upload_to_huggingface.py --data-dir output/ --datasets <slug>`
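For step 1, a config stub might look like the following. Every field name here is illustrative only, so defer to `_template.yaml` for the authoritative schema:

```yaml
# Illustrative only -- field names are guesses, see _template.yaml
slug: mydataset
name: My ECG Dataset
sampling_rate: 500
num_leads: 12
```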
## CLI Commands

```bash
# Full pipeline: validate + split + generate Croissant metadata
python scripts/generate_splits.py --dataset ptbxl --data-path /path/to/ptb-xl/1.0.3/

# Standalone Croissant generation (per version)
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/clean/ --version clean
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/original/ --version original

# Upload to HuggingFace Hub
python scripts/upload_to_huggingface.py --data-dir output/ --datasets ptbxl
```
## API Reference

### Config

- `load_config(slug)` -- load a `DatasetConfig` from YAML
- `list_available_configs()` -- list dataset slugs with configs

### Catalogue

- `list_datasets()` -- all 64 datasets as `CatalogueEntry` objects
- `search(query, category, access)` -- filter datasets
- `get_dataset(name)` -- look up by name
- `categories()` -- unique categories
- `to_dataframe()` -- catalogue as a pandas DataFrame

### Dataset

- `ECGDataset(dataset, split, ...)` -- unified PyTorch Dataset
- `ecg_collate_fn(batch)` -- custom collate for DataLoader

### Validation

- `validate_dataset(data_path, config)` -- run quality checks
- `generate_report(result, config)` -- generate report dict
- `save_report(result, config, path)` -- save report JSON

### Splitting

- `split_dataset(df, labels, config)` -- generate folds
- `export_splits(split_result, val_result, output_dir, config)` -- write CSVs
- `get_splitter(slug)` -- get a dataset-specific splitter

### Croissant

- `generate_croissant(config, splits_dir)` -- generate JSON-LD
- `save_croissant(config, splits_dir)` -- save to file
- `validate_croissant(path)` -- validate JSON-LD

### Download

- `download_dataset(config)` -- download from source
- `resolve_data_path(path, config)` -- resolve or download
## Development

```bash
uv pip install -e ".[dev]"   # install with dev dependencies
ruff check ecgbench/         # lint
black ecgbench/              # format
pytest                       # run tests
```
## Citation

If you use ECGBench in your research, please cite:

```bibtex
@software{ecgbench,
  author = {Thambawita, Vajira},
  title = {ECGBench: Reproducible ECG Benchmark Datasets},
  url = {https://github.com/vlbthambawita/ECGBench}
}
```
## License

MIT License -- see `LICENSE` for details.