# ECGBench

Reproducible ECG benchmark datasets with standardised splits, validation, and Croissant metadata.

ECGBench provides a curated catalogue of 64 publicly available ECG datasets, a config-driven pipeline for generating validated fold splits, and a unified PyTorch `Dataset` class for loading any supported dataset.

Website: vlbthambawita.github.io/ECGBench
## Installation

Base install (config, catalogue, validation, splitting):

```bash
pip install ecgbench
```

With PyTorch support:

```bash
pip install ecgbench[torch]
```

With everything:

```bash
pip install ecgbench[all]
```

From source (development):

```bash
git clone https://github.com/vlbthambawita/ECGBench.git
cd ECGBench
uv pip install -e ".[dev]"
```
## Quick Start

```python
from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

# Load PTB-XL training data (downloads fold CSVs from HuggingFace Hub)
train_ds = ECGDataset("ptbxl", split="train", data_path="/path/to/ptb-xl/1.0.3/")
loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)

for batch in loader:
    signals = batch["signal"]    # (B, 12, 5000) float32 tensor
    ecg_ids = batch["record_id"]
    break
```
## Dataset Catalogue

Query the curated index of 64 ECG datasets:

```python
import ecgbench

# List all datasets
datasets = ecgbench.list_datasets()
print(f"{len(datasets)} datasets available")

# Search by name, origin, format, or paper
ecgbench.search("PTB-XL")

# Filter by category and access type
ecgbench.search(category="12-Lead (PhysioNet)", access="Open")

# Look up a single dataset
ecgbench.get_dataset("MIMIC-IV-ECG")

# List categories
ecgbench.categories()

# Get the catalogue as a pandas DataFrame
df = ecgbench.to_dataframe()
```
## Loading ECG Data

### Standard train/val/test splits

```python
from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

train_ds = ECGDataset("ptbxl", split="train", data_path="/data/ptb-xl/1.0.3/")
val_ds = ECGDataset("ptbxl", split="val", data_path="/data/ptb-xl/1.0.3/")
test_ds = ECGDataset("ptbxl", split="test", data_path="/data/ptb-xl/1.0.3/")

loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)
```
### K-fold cross-validation

```python
# Rotate the validation and test folds across all 10 folds
for k in range(1, 11):
    val_ds = ECGDataset("ptbxl", split="val", fold_numbers=[k], data_path="...")
    test_fold = k % 10 + 1
    test_ds = ECGDataset("ptbxl", split="test", fold_numbers=[test_fold], data_path="...")
    train_folds = [f for f in range(1, 11) if f != k and f != test_fold]
    train_ds = ECGDataset("ptbxl", split="train", fold_numbers=train_folds, data_path="...")
```
### `ECGDataset` parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset` | `str \| DatasetConfig` | required | Dataset slug or config object |
| `split` | `str` | `"train"` | `"train"`, `"val"`, or `"test"` |
| `version` | `str` | `"clean"` | `"clean"` or `"original"` |
| `data_path` | `Path \| str \| None` | `None` | Path to signal files; auto-downloads if `None` |
| `sampling_rate` | `int \| None` | `None` | Sampling rate (default: the dataset's default) |
| `fold_numbers` | `list[int] \| None` | `None` | Specific folds to load; `None` = all |
| `transform` | `Callable \| None` | `None` | Transform applied to the signal tensor |
| `metadata_source` | `str` | `"hf"` | `"hf"` (HuggingFace) or `"local"` |
## Output format

Each sample is a dict:

- `signal` -- float32 tensor of shape `(leads, samples)`
- `record_id` -- record identifier
- `split`, `fold` -- split name and fold number
- all other CSV columns, as tensors (numeric) or raw values (str/dict)
## Data Versions

- `clean` (default): only records that pass all quality checks
- `original`: all records, with `is_valid` and `quality_issues` columns

Both versions share identical fold assignments. Use `original` when you need all records or want to filter manually; use `clean` for standard benchmarking.
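With the `original` version, manual filtering reduces to a boolean selection on `is_valid`. A minimal sketch with plain Python records (the row values below are invented for illustration; real fold CSVs carry the dataset's actual columns):

```python
# Hypothetical rows as they might appear in an "original" fold CSV
records = [
    {"record_id": "r001", "is_valid": True,  "quality_issues": []},
    {"record_id": "r002", "is_valid": False, "quality_issues": ["flat_line"]},
    {"record_id": "r003", "is_valid": True,  "quality_issues": []},
]

# Keep only records that passed every quality check --
# this reproduces what the "clean" version ships by default
clean_records = [r for r in records if r["is_valid"]]
```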
## Validation

ECGBench validates every signal file before splitting:

- `missing_leads` -- lead entirely NaN or all-zero
- `nan_values` -- any NaN in the signal
- `truncated_signal` -- fewer samples than expected
- `flat_line` -- lead with near-zero variance
- `corrupt_header` -- unreadable signal file
- `amplitude_outlier` -- samples outside the physiological range

Results are saved in `validation_report.json` with per-record details.
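To make the checks concrete, a `flat_line` detector can be sketched as a per-lead variance threshold. This is our own illustration; the threshold value and function name are assumptions, not ECGBench's actual implementation.

```python
import numpy as np

def has_flat_line(signal: np.ndarray, var_threshold: float = 1e-6) -> bool:
    """Return True if any lead has near-zero variance over the time axis."""
    return bool((signal.var(axis=-1) < var_threshold).any())

t = np.linspace(0, 10, 100)
flat = np.vstack([np.sin(t), np.zeros(100)])      # second lead is dead
healthy = np.vstack([np.sin(t), np.cos(t)])        # both leads vary
```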
## Croissant Metadata

Both the `clean/` and `original/` versions include MLCommons Croissant 1.1 JSON-LD metadata (`croissant.json`) with SHA-256 hashes for reproducibility. The full pipeline generates both automatically. For standalone generation:

```bash
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/clean/ --version clean
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/original/ --version original
```
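The recorded SHA-256 hashes let consumers check downloaded split files for corruption against `croissant.json`. A minimal, library-agnostic sketch (the streaming helper below is our own, not part of ECGBench):

```python
import hashlib
import os
import tempfile

def sha256_of(path: str, chunk_size: int = 8192) -> str:
    """Stream a file from disk and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Self-check: the file digest must match an in-memory digest of the same bytes
payload = b"fold,record_id\n1,r001\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
digest = sha256_of(tmp.name)
os.unlink(tmp.name)
```

Comparing `digest` against the hash recorded in the Croissant distribution entry confirms the file is byte-identical to what was published.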
## Adding a New Dataset

1. Copy `ecgbench/data/configs/_template.yaml` to `<slug>.yaml` and fill in the fields
2. Run `python scripts/generate_splits.py --dataset <slug> --data-path /path/to/data/`
3. Check `validation_report.json` and review excluded records
4. If custom logic is needed, create `ecgbench/splitting/strategies/<slug>.py` with `@register("<slug>")`
5. Run `pytest`
6. Upload: `python scripts/upload_to_huggingface.py --data-dir output/ --datasets <slug>`
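For step 1, a config stub might look like the following. Every field name here is illustrative only, so defer to `_template.yaml` for the authoritative schema:

```yaml
# Illustrative only -- field names are guesses, see _template.yaml
slug: mydataset
name: My ECG Dataset
sampling_rate: 500
num_leads: 12
```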
## CLI Commands

```bash
# Full pipeline: validate + split + generate Croissant metadata
python scripts/generate_splits.py --dataset ptbxl --data-path /path/to/ptb-xl/1.0.3/

# Standalone Croissant generation (per version)
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/clean/ --version clean
python scripts/generate_croissant.py --dataset ptbxl --splits-dir output/ptbxl/original/ --version original

# Upload to HuggingFace Hub
python scripts/upload_to_huggingface.py --data-dir output/ --datasets ptbxl
```
## API Reference

### Config

- `load_config(slug)` -- load a `DatasetConfig` from YAML
- `list_available_configs()` -- list dataset slugs with configs

### Catalogue

- `list_datasets()` -- all 64 datasets as `CatalogueEntry` objects
- `search(query, category, access)` -- filter datasets
- `get_dataset(name)` -- look up by name
- `categories()` -- unique categories
- `to_dataframe()` -- catalogue as a pandas DataFrame

### Dataset

- `ECGDataset(dataset, split, ...)` -- unified PyTorch Dataset
- `ecg_collate_fn(batch)` -- custom collate for DataLoader

### Validation

- `validate_dataset(data_path, config)` -- run quality checks
- `generate_report(result, config)` -- generate report dict
- `save_report(result, config, path)` -- save report JSON

### Splitting

- `split_dataset(df, labels, config)` -- generate folds
- `export_splits(split_result, val_result, output_dir, config)` -- write CSVs
- `get_splitter(slug)` -- get a dataset-specific splitter

### Croissant

- `generate_croissant(config, splits_dir)` -- generate JSON-LD
- `save_croissant(config, splits_dir)` -- save to file
- `validate_croissant(path)` -- validate JSON-LD

### Download

- `download_dataset(config)` -- download from source
- `resolve_data_path(path, config)` -- resolve or download
## Development

```bash
uv pip install -e ".[dev]"   # install with dev dependencies
ruff check ecgbench/         # lint
black ecgbench/              # format
pytest                       # run tests
```
## Citation

If you use ECGBench in your research, please cite:

```bibtex
@software{ecgbench,
  author = {Thambawita, Vajira},
  title = {ECGBench: Reproducible ECG Benchmark Datasets},
  url = {https://github.com/vlbthambawita/ECGBench}
}
```
## License

MIT License -- see `LICENSE` for details.