ECGBench

Reproducible ECG benchmark datasets with standardised splits, validation, and Croissant metadata.

ECGBench provides a curated catalogue of 64 publicly available ECG datasets, a config-driven pipeline for generating validated fold splits, and a unified PyTorch Dataset class for loading any supported dataset.

Website: vlbthambawita.github.io/ECGBench

Installation

Base (config, catalogue, validation, splitting)

pip install ecgbench

With PyTorch support

pip install "ecgbench[torch]"

With everything

pip install "ecgbench[all]"

From source (development)

git clone https://github.com/vlbthambawita/ECGBench.git
cd ECGBench
uv pip install -e ".[dev]"

Quick Start

from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

# Load PTB-XL training data (downloads fold CSVs from HuggingFace Hub)
train_ds = ECGDataset("ptbxl", split="train", data_path="/path/to/ptb-xl/1.0.3/")
loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)

for batch in loader:
    signals = batch["signal"]   # (B, 12, 5000) float32 tensor
    ecg_ids = batch["record_id"]
    break

Dataset Catalogue

Query the curated index of 64 ECG datasets:

import ecgbench

# List all datasets
datasets = ecgbench.list_datasets()
print(f"{len(datasets)} datasets available")

# Search by name, origin, format, or paper
ecgbench.search("PTB-XL")

# Filter by category and access type
ecgbench.search(category="12-Lead (PhysioNet)", access="Open")

# Look up a single dataset
ecgbench.get_dataset("MIMIC-IV-ECG")

# List categories
ecgbench.categories()

# Get as pandas DataFrame
df = ecgbench.to_dataframe()

Loading ECG Data

Standard train/val/test splits

from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

train_ds = ECGDataset("ptbxl", split="train", data_path="/data/ptb-xl/1.0.3/")
val_ds = ECGDataset("ptbxl", split="val", data_path="/data/ptb-xl/1.0.3/")
test_ds = ECGDataset("ptbxl", split="test", data_path="/data/ptb-xl/1.0.3/")

loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)

K-fold cross-validation

# Rotate through the 10 folds: fold k validates, the next fold tests
# (wrapping from 10 back to 1), and the remaining eight folds train.
for k in range(1, 11):
    val_ds = ECGDataset("ptbxl", split="val", fold_numbers=[k], data_path="...")
    test_fold = k % 10 + 1
    test_ds = ECGDataset("ptbxl", split="test", fold_numbers=[test_fold], data_path="...")
    train_folds = [f for f in range(1, 11) if f != k and f != test_fold]
    train_ds = ECGDataset("ptbxl", split="train", fold_numbers=train_folds, data_path="...")
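
The rotation above is easier to check with the dataset calls stripped out. The following is a minimal sketch in plain Python; fold_rotation is a hypothetical helper written for this illustration, not part of ecgbench.

```python
def fold_rotation(n_folds: int = 10) -> list[dict]:
    """Enumerate the val/test/train fold numbers for each rotation."""
    rotations = []
    for k in range(1, n_folds + 1):
        test_fold = k % n_folds + 1  # next fold, wrapping n_folds -> 1
        train_folds = [f for f in range(1, n_folds + 1) if f not in (k, test_fold)]
        rotations.append({"val": k, "test": test_fold, "train": train_folds})
    return rotations


rotations = fold_rotation(10)
# Print the first two rotations and the last one (where the test fold wraps to 1).
for rotation in rotations[:2] + rotations[-1:]:
    print(rotation)
```

Every fold serves as the validation fold once and as the test fold once, and the three groups never overlap within a rotation.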

ECGDataset parameters

Parameter        Type                  Default   Description
dataset          str | DatasetConfig   required  Dataset slug or config object
split            str                   "train"   "train", "val", or "test"
version          str                   "clean"   "clean" or "original"
data_path        Path | str | None     None      Path to signal files; auto-downloads if None
sampling_rate    int | None            None      Sampling rate (default: dataset's default)
fold_numbers     list[int] | None      None      Specific folds to load; None = all
transform        Callable | None       None      Transform applied to signal tensor
metadata_source  str                   "hf"      "hf" (HuggingFace) or "local"

Output format

Each sample is a dict:

  • signal -- float32 tensor (leads, samples)
  • record_id -- record identifier
  • split, fold -- split name and fold number
  • All other CSV columns as tensors (numeric) or raw values (str/dict)

Data Versions

  • clean (default): only records that pass all quality checks
  • original: all records with is_valid and quality_issues columns

Both versions share identical fold assignments. Use original when you need all records or want to filter manually; use clean for standard benchmarking.
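
Manual filtering of the original version might look like the sketch below. The is_valid and quality_issues column names come from the description above; the CSV excerpt, the "True"/"False" encoding, the semicolon-separated issue list, and the keep_record helper are all assumptions made for illustration, so check the real fold CSVs before relying on them.

```python
import csv
import io

# Hypothetical excerpt of an "original" fold CSV (value encodings are assumed).
ORIGINAL_CSV = """record_id,fold,is_valid,quality_issues
1,1,True,
2,1,False,flat_line
3,2,False,nan_values;amplitude_outlier
4,2,True,
5,3,False,amplitude_outlier
"""


def keep_record(row, tolerated=frozenset({"amplitude_outlier"})):
    """Keep valid records, plus invalid ones whose issues are all tolerated."""
    if row["is_valid"] == "True":
        return True
    issues = {issue for issue in row["quality_issues"].split(";") if issue}
    return bool(issues) and issues <= tolerated


rows = list(csv.DictReader(io.StringIO(ORIGINAL_CSV)))
clean_ids = [r["record_id"] for r in rows if r["is_valid"] == "True"]
kept_ids = [r["record_id"] for r in rows if keep_record(r)]
print("clean-equivalent:", clean_ids)
print("custom filter:   ", kept_ids)
```

Because fold assignments are shared between versions, a custom filter like this changes which records are used without moving any record to a different fold.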

Validation

ECGBench validates every signal file before splitting:

  • missing_leads -- lead entirely NaN or all-zero
  • nan_values -- any NaN in signal
  • truncated_signal -- fewer samples than expected
  • flat_line -- lead with near-zero variance
  • corrupt_header -- unreadable signal file
  • amplitude_outlier -- samples outside physiological range

Results are saved in validation_report.json with per-record details.
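
To make the check names concrete, here is a rough sketch of how five of them could be computed for one record. This is not ecgbench's implementation: check_signal, the variance threshold, and the 10 mV amplitude limit are assumptions chosen for illustration, and corrupt_header is omitted because it depends on the file reader.

```python
import math
from statistics import pvariance


def check_signal(leads, expected_samples, flat_var=1e-8, max_abs_mv=10.0):
    """Return the sorted names of quality issues found in one record.

    leads is a list of per-lead sample lists, assumed to be in millivolts.
    flat_var and max_abs_mv are illustrative thresholds, not ecgbench defaults.
    """
    issues = set()
    for lead in leads:
        finite = [x for x in lead if not math.isnan(x)]
        if len(lead) < expected_samples:
            issues.add("truncated_signal")
        if len(finite) < len(lead):
            issues.add("nan_values")
        if not finite or all(x == 0.0 for x in finite):
            issues.add("missing_leads")  # entirely NaN or all-zero
            continue
        if pvariance(finite) < flat_var:
            issues.add("flat_line")
        if max(abs(x) for x in finite) > max_abs_mv:
            issues.add("amplitude_outlier")
    return sorted(issues)


good_lead = [0.1 * math.sin(i / 10) for i in range(100)]
print(check_signal([good_lead, good_lead], expected_samples=100))
print(check_signal([[0.5] * 100], expected_samples=100))
print(check_signal([[float("nan")] * 100], expected_samples=100))
```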

Croissant Metadata

Both clean/ and original/ versions include MLCommons Croissant 1.1 JSON-LD metadata (croissant.json) with SHA-256 hashes for reproducibility. The full pipeline generates both automatically. For standalone generation:

ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/clean/ --version clean
ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/original/ --version original
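
The SHA-256 hashes recorded in croissant.json can be checked independently with the standard library. The sketch below assumes the hashes cover the CSV files in the version directory; the file name fold_1.csv and the helpers sha256_of_file and hash_csvs are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large CSVs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_csvs(splits_dir):
    """Map each CSV file name in splits_dir to its SHA-256 hex digest."""
    return {p.name: sha256_of_file(p) for p in sorted(Path(splits_dir).glob("*.csv"))}


# Demo on a throwaway directory; point hash_csvs at a real splits directory instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "fold_1.csv").write_bytes(b"record_id,fold\n1,1\n")
    hashes = hash_csvs(tmp)

print(hashes)
```

Comparing these digests against the values stored in croissant.json is one way to confirm that a downloaded split matches the published one.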

Adding a New Dataset

  1. Copy ecgbench/data/configs/_template.yaml to <slug>.yaml and fill in the fields
  2. Run ecgbench splits --dataset <slug> --data-path /path/to/data/
  3. Check validation_report.json -- review excluded records
  4. If custom logic is needed, create ecgbench/splitting/strategies/<slug>.py with @register("<slug>")
  5. Run pytest
  6. Upload: ecgbench upload --data-dir output/ --datasets <slug>
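
Step 4 relies on a decorator-based registry. The sketch below shows the general pattern only: register and lookup_splitter here are local stand-ins defined in the snippet, and the strategy signature (record IDs in, fold numbers out) is an assumption, not ecgbench's actual interface.

```python
from typing import Callable

# Local stand-in registry mapping dataset slugs to splitting functions.
_SPLITTERS: dict[str, Callable] = {}


def register(slug: str):
    """Decorator that files a splitting function under a dataset slug."""
    def decorator(fn: Callable) -> Callable:
        if slug in _SPLITTERS:
            raise ValueError(f"splitter already registered for {slug!r}")
        _SPLITTERS[slug] = fn
        return fn
    return decorator


def lookup_splitter(slug: str) -> Callable:
    """Return the function registered for slug, or raise KeyError."""
    return _SPLITTERS[slug]


@register("my_dataset")
def split_my_dataset(record_ids, n_folds=10):
    """Hypothetical strategy: round-robin assignment over sorted record IDs."""
    return {rid: i % n_folds + 1 for i, rid in enumerate(sorted(record_ids))}


folds = lookup_splitter("my_dataset")(["rec_b", "rec_a", "rec_c"], n_folds=2)
print(folds)
```

The appeal of this pattern is that adding a dataset only requires adding one file; nothing in the pipeline has to be edited to discover the new strategy.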

CLI

Installing ecgbench adds a single ecgbench console command with three subcommands:

ecgbench --help               # top-level help
ecgbench <command> --help     # per-subcommand flags
ecgbench --version            # package version

Subcommand  Purpose
splits      Full pipeline -- validate signals, generate 10-fold splits, export CSVs, and write Croissant metadata
croissant   Generate Croissant 1.1 JSON-LD for an already-split dataset directory
upload      Upload fold CSVs and metadata to HuggingFace Hub (requires ecgbench[hf])

Every subcommand has an equivalent Python function (run_splits, run_croissant, run_upload) with the same arguments, so the same workflow can be driven from a notebook or downstream code.

ecgbench splits

Runs the full pipeline: validate -> split -> export -> Croissant. Writes output/<dataset>/{original,clean}/ by default.

ecgbench splits --dataset ptbxl --data-path /path/to/ptb-xl/1.0.3/
ecgbench splits --dataset ptbxl                        # auto-download
ecgbench splits --dataset chapman_shaoxing \
    --data-path /data/chapman/ \
    --output-dir /data/outputs/chapman/ \
    --n-folds 10 --max-workers 8

Flag               Type  Default            Description
--dataset          str   required           Dataset slug (e.g. ptbxl, chapman_shaoxing)
--data-path        path  auto-download      Path to the dataset root directory
--output-dir       path  output/<dataset>/  Output directory for fold CSVs + metadata
--sampling-rate    int   config default     Sampling rate to validate against
--n-folds          int   10                 Number of cross-validation folds
--max-workers      int   4                  Parallel workers for signal validation
--skip-validation  flag  off                Skip signal validation (faster; no quality flags)
--skip-croissant   flag  off                Skip Croissant metadata generation

Python equivalent:

import ecgbench

result = ecgbench.run_splits(
    dataset="ptbxl",
    data_path="/path/to/ptb-xl/1.0.3/",
    output_dir=None,          # -> output/ptbxl/
    sampling_rate=None,       # -> config default_sampling_rate
    n_folds=10,
    max_workers=4,
    skip_validation=False,
    skip_croissant=False,
)
# result is a dict with: dataset, dataset_name, output_dir,
# original={total,train,val,test}, clean={total,train,val,test}, excluded

ecgbench croissant

Standalone Croissant 1.1 JSON-LD generator for an existing splits directory. Run once per version (clean and original).

ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/clean/    --version clean
ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/original/ --version original
ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/clean/ --validate

Flag          Type            Default                      Description
--dataset     str             required                     Dataset slug
--splits-dir  path            required                     Version directory to scan (e.g. output/ptbxl/clean/)
--output      path            <splits-dir>/croissant.json  Where to write the JSON-LD
--version     clean|original  clean                        Version label to record in the Croissant file
--validate    flag            off                          Validate the file after writing (non-zero exit if invalid)

Python equivalent:

from pathlib import Path
import ecgbench

saved_path: Path = ecgbench.run_croissant(
    dataset="ptbxl",
    splits_dir="output/ptbxl/clean/",
    output=None,              # -> splits_dir/croissant.json
    version="clean",
    validate=True,            # raises RuntimeError if the file does not validate
)

Requires the croissant extra (pip install ecgbench[croissant]).

ecgbench upload

Uploads each dataset's original/ and clean/ CSV folds, plus validation_report.json and croissant.json if present, to a HuggingFace Hub dataset repository. One or more dataset slugs can be uploaded in a single call.

ecgbench upload --data-dir output/ --datasets ptbxl
ecgbench upload --data-dir output/ --datasets ptbxl chapman_shaoxing
ecgbench upload --data-dir output/ --datasets ptbxl --dry-run
ecgbench upload --data-dir output/ --datasets ptbxl \
    --hf-repo-id your-org/ECGBench

Flag          Type  Default                 Description
--data-dir    path  required                Root directory containing per-dataset subdirectories
--datasets    list  required                One or more dataset slugs to upload
--hf-repo-id  str   vlbthambawita/ECGBench  Target HuggingFace dataset repo ID
--dry-run     flag  off                     Print the files that would be uploaded, without uploading

Authentication resolves in this order: token= argument (Python API only) -> HF_TOKEN env var -> HUGGINGFACE_HUB_TOKEN env var -> .env file in the current working directory. Run with --dry-run first to review the file list.

Python equivalent:

import ecgbench

counts: dict[str, int] = ecgbench.run_upload(
    data_dir="output/",
    datasets=["ptbxl", "chapman_shaoxing"],
    hf_repo_id="vlbthambawita/ECGBench",
    dry_run=False,
    token=None,               # falls back to env / .env
)
# counts: {"ptbxl": 42, "chapman_shaoxing": 42}

Requires the hf extra (pip install ecgbench[hf]).
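
The authentication order described above (explicit token, then HF_TOKEN, then HUGGINGFACE_HUB_TOKEN, then .env) can be sketched as a small fallback chain. resolve_token and read_dotenv are hypothetical helpers, and the assumption that the .env file uses the same two variable names is mine, not something the README states.

```python
import os
import tempfile
from pathlib import Path

TOKEN_VARS = ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN")


def read_dotenv(path):
    """Parse simple KEY=VALUE lines; returns {} if the file does not exist."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_token(token=None, environ=None, dotenv_path=".env"):
    """Return the first token found: argument, env vars, then the .env file."""
    environ = os.environ if environ is None else environ
    if token:
        return token
    for name in TOKEN_VARS:
        if environ.get(name):
            return environ[name]
    dotenv = read_dotenv(dotenv_path)
    for name in TOKEN_VARS:  # assumed: .env uses the same variable names
        if dotenv.get(name):
            return dotenv[name]
    return None


# Demo: with no argument and an empty environment, the .env file is used.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("# local secrets\nHF_TOKEN='from-dotenv'\n")
    from_file = resolve_token(None, environ={}, dotenv_path=env_file)

print(from_file)
```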

API Reference

Config

  • load_config(slug) -- load DatasetConfig from YAML
  • list_available_configs() -- list dataset slugs with configs

Catalogue

  • list_datasets() -- all 64 datasets as CatalogueEntry objects
  • search(query, category, access) -- filter datasets
  • get_dataset(name) -- look up by name
  • categories() -- unique categories
  • to_dataframe() -- as pandas DataFrame

Dataset

  • ECGDataset(dataset, split, ...) -- unified PyTorch Dataset
  • ecg_collate_fn(batch) -- custom collate for DataLoader

Validation

  • validate_dataset(data_path, config) -- run quality checks
  • generate_report(result, config) -- generate report dict
  • save_report(result, config, path) -- save report JSON

Splitting

  • split_dataset(df, labels, config) -- generate folds
  • export_splits(split_result, val_result, output_dir, config) -- write CSVs
  • get_splitter(slug) -- get dataset-specific splitter

Croissant

  • generate_croissant(config, splits_dir) -- generate JSON-LD
  • save_croissant(config, splits_dir) -- save to file
  • validate_croissant(path) -- validate JSON-LD

Download

  • download_dataset(config) -- download from source
  • resolve_data_path(path, config) -- resolve or download

Pipelines (CLI + Python API)

  • run_splits(dataset, ...) -- full validate + split + Croissant pipeline (same as ecgbench splits)
  • run_croissant(dataset, splits_dir, ...) -- standalone Croissant generation (same as ecgbench croissant)
  • run_upload(data_dir, datasets, ...) -- HuggingFace Hub upload (same as ecgbench upload)

Development

uv pip install -e ".[dev]"
ruff check ecgbench/
black ecgbench/
pytest

Citation

If you use ECGBench in your research, please cite:

@software{ecgbench,
  author = {Thambawita, Vajira},
  title = {ECGBench: Reproducible ECG Benchmark Datasets},
  url = {https://github.com/vlbthambawita/ECGBench}
}

License

MIT License -- see LICENSE for details.
