
ECGBench

Reproducible ECG benchmark datasets with standardised splits, validation, and Croissant metadata.

ECGBench provides a curated catalogue of 64 publicly available ECG datasets, a config-driven pipeline for generating validated fold splits, and a unified PyTorch Dataset class for loading any supported dataset.

Website: vlbthambawita.github.io/ECGBench

Installation

Base (config, catalogue, validation, splitting)

pip install ecgbench

With PyTorch support

pip install ecgbench[torch]

With everything

pip install ecgbench[all]

From source (development)

git clone https://github.com/vlbthambawita/ECGBench.git
cd ECGBench
uv pip install -e ".[dev]"

Quick Start

from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

# Load PTB-XL training data (downloads fold CSVs from HuggingFace Hub)
train_ds = ECGDataset("ptbxl", split="train", data_path="/path/to/ptb-xl/1.0.3/")
loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)

for batch in loader:
    signals = batch["signal"]   # (B, 12, 5000) float32 tensor
    ecg_ids = batch["record_id"]
    break

Dataset Catalogue

Query the curated index of 64 ECG datasets:

import ecgbench

# List all datasets
datasets = ecgbench.list_datasets()
print(f"{len(datasets)} datasets available")

# Search by name, origin, format, or paper
ecgbench.search("PTB-XL")

# Filter by category and access type
ecgbench.search(category="12-Lead (PhysioNet)", access="Open")

# Look up a single dataset
ecgbench.get_dataset("MIMIC-IV-ECG")

# List categories
ecgbench.categories()

# Get as pandas DataFrame
df = ecgbench.to_dataframe()

Loading ECG Data

Standard train/val/test splits

from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

train_ds = ECGDataset("ptbxl", split="train", data_path="/data/ptb-xl/1.0.3/")
val_ds = ECGDataset("ptbxl", split="val", data_path="/data/ptb-xl/1.0.3/")
test_ds = ECGDataset("ptbxl", split="test", data_path="/data/ptb-xl/1.0.3/")

loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)

K-fold cross-validation

# 10-fold rotation: fold k is validation, the next fold (wrapping) is test,
# and the remaining eight folds are training
for k in range(1, 11):
    val_ds = ECGDataset("ptbxl", split="val", fold_numbers=[k], data_path="...")
    test_fold = k % 10 + 1  # next fold, wrapping 10 -> 1
    test_ds = ECGDataset("ptbxl", split="test", fold_numbers=[test_fold], data_path="...")
    train_folds = [f for f in range(1, 11) if f != k and f != test_fold]
    train_ds = ECGDataset("ptbxl", split="train", fold_numbers=train_folds, data_path="...")
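The rotation above can be sanity-checked in plain Python: across the ten iterations, every fold serves exactly once as validation and once as test, and the three roles never overlap within an iteration.

```python
# Check the 10-fold rotation: fold k is validation, the next fold
# (wrapping around) is test, the remaining eight folds are training.
val_seen, test_seen = set(), set()
for k in range(1, 11):
    test_fold = k % 10 + 1
    train_folds = [f for f in range(1, 11) if f != k and f != test_fold]
    assert len(train_folds) == 8          # eight training folds per iteration
    assert k != test_fold                 # val and test never coincide
    val_seen.add(k)
    test_seen.add(test_fold)

assert val_seen == test_seen == set(range(1, 11))  # each fold plays both roles
print("rotation OK")
```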

ECGDataset parameters

  • dataset (str | DatasetConfig, required) -- dataset slug or config object
  • split (str, default "train") -- "train", "val", or "test"
  • version (str, default "clean") -- "clean" or "original"
  • data_path (Path | str | None, default None) -- path to signal files; auto-downloads if None
  • sampling_rate (int | None, default None) -- target sampling rate; None uses the dataset's default
  • fold_numbers (list[int] | None, default None) -- specific folds to load; None loads all
  • transform (Callable | None, default None) -- transform applied to the signal tensor
  • metadata_source (str, default "hf") -- "hf" (HuggingFace Hub) or "local"

Output format

Each sample is a dict:

  • signal -- float32 tensor (leads, samples)
  • record_id -- record identifier
  • split, fold -- split name and fold number
  • All other CSV columns as tensors (numeric) or raw values (str/dict)
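To illustrate, here is a simplified, torch-free sketch of what a collate function does with sample dicts of this shape. The real ecg_collate_fn stacks the numeric fields into batched tensors; this stand-in just groups values key-wise.

```python
def collate_sketch(batch):
    """Group a list of sample dicts into one dict of lists, key-wise.

    Simplified stand-in for ecg_collate_fn: the real function stacks
    numeric fields into batched tensors rather than plain lists.
    """
    return {key: [sample[key] for sample in batch] for key in batch[0]}

samples = [
    {"signal": [0.1, 0.2], "record_id": "rec001", "fold": 1},
    {"signal": [0.3, 0.4], "record_id": "rec002", "fold": 1},
]
batched = collate_sketch(samples)
print(batched["record_id"])   # -> ['rec001', 'rec002']
```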

Data Versions

  • clean (default): only records that pass all quality checks
  • original: all records with is_valid and quality_issues columns

Both versions share identical fold assignments. Use original when you need all records or want to filter manually; use clean for standard benchmarking.
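With the original version, manual filtering might look like the sketch below. The CSV excerpt and its value formats are hypothetical; only the is_valid and quality_issues column names come from the description above.

```python
import csv
import io

# Hypothetical excerpt of an `original` metadata CSV with the
# is_valid / quality_issues columns described above.
csv_text = """record_id,fold,is_valid,quality_issues
rec001,1,True,
rec002,1,False,flat_line
rec003,2,True,
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Keep only records that passed all quality checks -- conceptually,
# this recovers what the `clean` version ships by default.
clean_rows = [r for r in rows if r["is_valid"] == "True"]
print([r["record_id"] for r in clean_rows])   # -> ['rec001', 'rec003']
```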

Validation

ECGBench validates every signal file before splitting:

  • missing_leads -- lead entirely NaN or all-zero
  • nan_values -- any NaN in signal
  • truncated_signal -- fewer samples than expected
  • flat_line -- lead with near-zero variance
  • corrupt_header -- unreadable signal file
  • amplitude_outlier -- samples outside physiological range

Results are saved in validation_report.json with per-record details.
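As an illustration of how such checks work, here is a minimal standard-library sketch of three of them. The thresholds are illustrative assumptions, not the library's actual implementation.

```python
import math
from statistics import pvariance

FLAT_LINE_VAR = 1e-8   # illustrative near-zero variance threshold
AMPLITUDE_MV = 10.0    # illustrative physiological amplitude bound (mV)

def check_lead(samples):
    """Return the names of failed checks for one lead, [] if it passes."""
    issues = []
    if any(math.isnan(x) for x in samples):
        issues.append("nan_values")
    elif pvariance(samples) < FLAT_LINE_VAR:
        issues.append("flat_line")
    if any(abs(x) > AMPLITUDE_MV for x in samples if not math.isnan(x)):
        issues.append("amplitude_outlier")
    return issues

print(check_lead([0.0, 0.0, 0.0]))           # -> ['flat_line']
print(check_lead([0.1, float("nan"), 0.2]))  # -> ['nan_values']
```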

Croissant Metadata

Both clean/ and original/ versions include MLCommons Croissant 1.1 JSON-LD metadata (croissant.json) with SHA-256 hashes for reproducibility. The full pipeline generates both automatically. For standalone generation:

python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/clean/ --version clean
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/original/ --version original
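The recorded hashes can be re-checked with Python's standard library; a sketch (the file path and expected value below are placeholders, and the exact JSON-LD field layout follows the Croissant spec):

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the sha256 value recorded in croissant.json
# (path and expected digest are placeholders):
# assert sha256_of_file("output/ptbxl/clean/<split-csv>") == expected_sha256
```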

Adding a New Dataset

  1. Copy ecgbench/data/configs/_template.yaml to <slug>.yaml, fill in fields
  2. Run python scripts/generate_splits.py --dataset <slug> --data-path /path/to/data/
  3. Check validation_report.json -- review excluded records
  4. If custom logic needed, create ecgbench/splitting/strategies/<slug>.py with @register("<slug>")
  5. Run pytest
  6. Upload: python scripts/upload_to_huggingface.py --data-dir output/ --datasets <slug>
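The @register decorator in step 4 presumably follows the usual registry pattern; here is a self-contained sketch of how such a mechanism works (the names are illustrative, not the library's actual internals):

```python
# Minimal registry-decorator sketch: splitter classes register themselves
# under a slug, and a lookup function retrieves them later.
_SPLITTERS = {}

def register(slug):
    def decorator(cls):
        _SPLITTERS[slug] = cls
        return cls
    return decorator

def get_splitter(slug):
    return _SPLITTERS[slug]

@register("mydataset")
class MyDatasetSplitter:
    def split(self, records):
        # custom fold-assignment logic would go here
        return records

print(get_splitter("mydataset").__name__)   # -> MyDatasetSplitter
```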

CLI Commands

# Full pipeline: validate + split + Croissant
python scripts/generate_splits.py --dataset ptbxl --data-path /path/to/ptb-xl/1.0.3/

# Standalone Croissant generation (per version)
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/clean/ --version clean
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/original/ --version original

# Upload to HuggingFace Hub
python scripts/upload_to_huggingface.py --data-dir output/ --datasets ptbxl

API Reference

Config

  • load_config(slug) -- load DatasetConfig from YAML
  • list_available_configs() -- list dataset slugs with configs

Catalogue

  • list_datasets() -- all 64 datasets as CatalogueEntry objects
  • search(query, category, access) -- filter datasets
  • get_dataset(name) -- look up by name
  • categories() -- unique categories
  • to_dataframe() -- as pandas DataFrame

Dataset

  • ECGDataset(dataset, split, ...) -- unified PyTorch Dataset
  • ecg_collate_fn(batch) -- custom collate for DataLoader

Validation

  • validate_dataset(data_path, config) -- run quality checks
  • generate_report(result, config) -- generate report dict
  • save_report(result, config, path) -- save report JSON

Splitting

  • split_dataset(df, labels, config) -- generate folds
  • export_splits(split_result, val_result, output_dir, config) -- write CSVs
  • get_splitter(slug) -- get dataset-specific splitter

Croissant

  • generate_croissant(config, splits_dir) -- generate JSON-LD
  • save_croissant(config, splits_dir) -- save to file
  • validate_croissant(path) -- validate JSON-LD

Download

  • download_dataset(config) -- download from source
  • resolve_data_path(path, config) -- resolve or download

Development

uv pip install -e ".[dev]"
ruff check ecgbench/
black ecgbench/
pytest

Citation

If you use ECGBench in your research, please cite:

@software{ecgbench,
  author = {Thambawita, Vajira},
  title = {ECGBench: Reproducible ECG Benchmark Datasets},
  url = {https://github.com/vlbthambawita/ECGBench}
}

License

MIT License -- see LICENSE for details.
