ECGBench

Reproducible ECG benchmark datasets with standardised splits, validation, and Croissant metadata.

ECGBench provides a curated catalogue of 64 publicly available ECG datasets, a config-driven pipeline for generating validated fold splits, and a unified PyTorch Dataset class for loading any supported dataset.

Website: vlbthambawita.github.io/ECGBench

Installation

Base (config, catalogue, validation, splitting)

pip install ecgbench

With PyTorch support

pip install "ecgbench[torch]"

With everything

pip install "ecgbench[all]"

From source (development)

git clone https://github.com/vlbthambawita/ECGBench.git
cd ECGBench
uv pip install -e ".[dev]"

Quick Start

from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

# Load PTB-XL training data (downloads fold CSVs from HuggingFace Hub)
train_ds = ECGDataset("ptbxl", split="train", data_path="/path/to/ptb-xl/1.0.3/")
loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)

for batch in loader:
    signals = batch["signal"]   # (B, 12, 5000) float32 tensor
    ecg_ids = batch["record_id"]
    break

Dataset Catalogue

Query the curated index of 64 ECG datasets:

import ecgbench

# List all datasets
datasets = ecgbench.list_datasets()
print(f"{len(datasets)} datasets available")

# Search by name, origin, format, or paper
ecgbench.search("PTB-XL")

# Filter by category and access type
ecgbench.search(category="12-Lead (PhysioNet)", access="Open")

# Look up a single dataset
ecgbench.get_dataset("MIMIC-IV-ECG")

# List categories
ecgbench.categories()

# Get as pandas DataFrame
df = ecgbench.to_dataframe()

Loading ECG Data

Standard train/val/test splits

from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

train_ds = ECGDataset("ptbxl", split="train", data_path="/data/ptb-xl/1.0.3/")
val_ds = ECGDataset("ptbxl", split="val", data_path="/data/ptb-xl/1.0.3/")
test_ds = ECGDataset("ptbxl", split="test", data_path="/data/ptb-xl/1.0.3/")

loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)

K-fold cross-validation

for k in range(1, 11):
    val_ds = ECGDataset("ptbxl", split="val", fold_numbers=[k], data_path="...")
    test_fold = k % 10 + 1
    test_ds = ECGDataset("ptbxl", split="test", fold_numbers=[test_fold], data_path="...")
    train_folds = [f for f in range(1, 11) if f != k and f != test_fold]
    train_ds = ECGDataset("ptbxl", split="train", fold_numbers=train_folds, data_path="...")
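
To make the rotation above easier to check, here is a minimal pure-Python sketch that only computes which folds land in each split for every iteration. The helper name rotate_folds is hypothetical and is not part of ECGBench; it mirrors the loop above without loading any data.

```python
def rotate_folds(n_folds: int = 10) -> list[dict]:
    """Return one train/val/test fold assignment per rotation (folds are 1-indexed)."""
    rotations = []
    for k in range(1, n_folds + 1):
        test_fold = k % n_folds + 1  # the next fold, wrapping n_folds -> 1
        train_folds = [f for f in range(1, n_folds + 1) if f not in (k, test_fold)]
        rotations.append({"val": [k], "test": [test_fold], "train": train_folds})
    return rotations


rotations = rotate_folds(10)
# Show the first two rotations and the last one (where the test fold wraps to 1).
for rotation in rotations[:2] + rotations[-1:]:
    print(rotation)
```

Each fold serves as the validation fold once and as the test fold once, and the remaining eight folds form the training set.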

ECGDataset parameters

Parameter        Type                 Default   Description
dataset          str | DatasetConfig  required  Dataset slug or config object
split            str                  "train"   "train", "val", or "test"
version          str                  "clean"   "clean" or "original"
data_path        Path | str | None    None      Path to signal files; auto-downloads if None
sampling_rate    int | None           None      Sampling rate (default: dataset's default)
fold_numbers     list[int] | None     None      Specific folds to load; None = all
transform        Callable | None      None      Transform applied to signal tensor
metadata_source  str                  "hf"      "hf" (HuggingFace) or "local"

Output format

Each sample is a dict:

  • signal -- float32 tensor (leads, samples)
  • record_id -- record identifier
  • split, fold -- split name and fold number
  • All other CSV columns as tensors (numeric) or raw values (str/dict)

Data Versions

  • clean (default): only records that pass all quality checks
  • original: all records, including those that fail checks, with added is_valid and quality_issues columns

Both versions share identical fold assignments. Use original when you need all records or want to filter manually; use clean for standard benchmarking.
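
As an illustration of manual filtering on the original version, the sketch below works on a small in-memory CSV. Only the is_valid and quality_issues column names come from the description above; the other columns, the "True"/"False" encoding, and the ";"-separated issue list are assumptions made for this example and may not match the real files.

```python
import csv
import io

# Hypothetical stand-in for a fold CSV from the "original" version.
ORIGINAL_CSV = """record_id,fold,is_valid,quality_issues
rec_001,1,True,
rec_002,1,False,nan_values;flat_line
rec_003,2,False,amplitude_outlier
rec_004,2,True,
"""


def load_rows(text):
    """Parse CSV text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def keep_record(row, tolerated=frozenset()):
    """Keep valid records, plus invalid ones whose issues are all tolerated."""
    if row["is_valid"] == "True":
        return True
    issues = {issue for issue in row["quality_issues"].split(";") if issue}
    return bool(issues) and issues <= tolerated


rows = load_rows(ORIGINAL_CSV)
# Strict filter: behaves like the clean version.
clean_like = [r["record_id"] for r in rows if keep_record(r)]
# Relaxed filter: additionally accept records flagged only as amplitude outliers.
relaxed = [r["record_id"] for r in rows if keep_record(r, frozenset({"amplitude_outlier"}))]
print(clean_like)
print(relaxed)
```

Because both versions share fold assignments, a relaxed filter like this keeps every retained record in the same fold it would have in clean.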

Validation

ECGBench validates every signal file before splitting:

  • missing_leads -- lead entirely NaN or all-zero
  • nan_values -- any NaN in signal
  • truncated_signal -- fewer samples than expected
  • flat_line -- lead with near-zero variance
  • corrupt_header -- unreadable signal file
  • amplitude_outlier -- samples outside physiological range

Results are saved in validation_report.json with per-record details.
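
The signal-level checks listed above can be sketched in plain Python as follows. This is an illustration of the idea, not ECGBench's implementation: the function name check_record and the thresholds flat_eps and max_abs_mv are assumptions, and corrupt_header is omitted because it concerns file reading rather than sample values.

```python
import math
from statistics import pvariance


def check_record(leads, expected_samples, flat_eps=1e-6, max_abs_mv=10.0):
    """Return the sorted issue names found in one record.

    leads is a list of per-lead sample lists. flat_eps and max_abs_mv are
    illustrative thresholds, not the values ECGBench uses.
    """
    issues = set()
    for lead in leads:
        finite = [x for x in lead if not math.isnan(x)]
        if len(finite) < len(lead):
            issues.add("nan_values")
        if len(lead) < expected_samples:
            issues.add("truncated_signal")
        if not finite or all(x == 0.0 for x in finite):
            issues.add("missing_leads")  # entirely NaN or all-zero
            continue
        if pvariance(finite) < flat_eps:
            issues.add("flat_line")
        if any(abs(x) > max_abs_mv for x in finite):
            issues.add("amplitude_outlier")
    return sorted(issues)


good = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.4, -0.6, 0.2]]
bad = [[float("nan")] * 4, [1.0, 1.0, 1.0], [0.0, 50.0, 0.1, 0.2]]
good_issues = check_record(good, expected_samples=4)
bad_issues = check_record(bad, expected_samples=4)
print(good_issues)
print(bad_issues)
```

A record with an empty issue list would be kept in clean; any non-empty list would correspond to an excluded record under the scheme described above.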

Croissant Metadata

Both clean/ and original/ versions include MLCommons Croissant 1.1 JSON-LD metadata (croissant.json) with SHA-256 hashes for reproducibility. The full pipeline generates both automatically. For standalone generation:

ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/clean/ --version clean
ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/original/ --version original
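
For readers unfamiliar with how a SHA-256 hash supports reproducibility, the sketch below hashes a file with the standard library and shows that changing one byte changes the digest. The helper sha256_of_file and the file name train.csv are hypothetical; how ECGBench computes and stores its hashes internally is not shown here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large CSVs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "train.csv"  # hypothetical file name
    csv_path.write_bytes(b"record_id,fold\nrec_001,1\n")
    file_hash = sha256_of_file(csv_path)
    # Changing a single fold assignment yields a different digest.
    csv_path.write_bytes(b"record_id,fold\nrec_001,2\n")
    changed_hash = sha256_of_file(csv_path)

print(file_hash)
print(file_hash != changed_hash)
```

Comparing a locally computed digest with the one recorded in croissant.json is one way to confirm that a downloaded split file is byte-identical to the published one.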

Adding a New Dataset

  1. Copy ecgbench/data/configs/_template.yaml to <slug>.yaml, fill in fields
  2. Run ecgbench splits --dataset <slug> --data-path /path/to/data/
  3. Check validation_report.json -- review excluded records
  4. If custom logic is needed, create ecgbench/splitting/strategies/<slug>.py with @register("<slug>")
  5. Run pytest
  6. Upload: ecgbench upload --data-dir output/ --datasets <slug>
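
Step 4 relies on a decorator-based registry. The sketch below shows how such a pattern generally works; it is a generic illustration, not ECGBench's code. The registry dict, the splitter signature, the fallback default_splitter, and the "my_dataset" slug are all assumptions, and only the names register and get_splitter are taken from this document.

```python
from typing import Callable

# Hypothetical registry; ECGBench's internal data structure may differ.
_SPLITTERS: dict[str, Callable] = {}


def register(slug: str):
    """Decorator that records a splitter function under a dataset slug."""
    def decorator(func):
        if slug in _SPLITTERS:
            raise ValueError(f"splitter already registered for {slug!r}")
        _SPLITTERS[slug] = func
        return func
    return decorator


def default_splitter(record_ids, n_folds=10):
    """Round-robin fold assignment (1-indexed), used when nothing is registered."""
    return {rid: i % n_folds + 1 for i, rid in enumerate(record_ids)}


def get_splitter(slug: str):
    """Return the registered splitter for slug, or the default one."""
    return _SPLITTERS.get(slug, default_splitter)


@register("my_dataset")
def my_dataset_splitter(record_ids, n_folds=10):
    """Hypothetical custom logic: keep all records of a patient in one fold."""
    patients = sorted({rid.split("_")[0] for rid in record_ids})
    fold_of = {p: i % n_folds + 1 for i, p in enumerate(patients)}
    return {rid: fold_of[rid.split("_")[0]] for rid in record_ids}


ids = ["p1_a", "p1_b", "p2_a", "p3_a"]
custom = get_splitter("my_dataset")(ids, n_folds=2)
fallback = get_splitter("unknown")(ids, n_folds=2)
print(custom)
print(fallback)
```

The benefit of this design is that adding a dataset only requires a new module with a decorated function; the pipeline looks the function up by slug and needs no changes.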

CLI

Installing ecgbench adds a single ecgbench console command with three subcommands:

ecgbench --help               # top-level help
ecgbench <command> --help     # per-subcommand flags
ecgbench --version            # package version

Subcommand  Purpose
splits      Full pipeline -- validate signals, generate 10-fold splits, export CSVs, and write Croissant metadata
croissant   Generate Croissant 1.1 JSON-LD for an already-split dataset directory
upload      Upload fold CSVs and metadata to HuggingFace Hub (requires ecgbench[hf])

Every subcommand has an equivalent Python function (run_splits, run_croissant, run_upload) with the same arguments, so the same workflow can be driven from a notebook or downstream code.

ecgbench splits

Runs the full pipeline: validate -> split -> export -> Croissant. Writes output/<dataset>/{original,clean}/ by default.

ecgbench splits --dataset ptbxl --data-path /path/to/ptb-xl/1.0.3/
ecgbench splits --dataset ptbxl                        # auto-download
ecgbench splits --dataset chapman_shaoxing \
    --data-path /data/chapman/ \
    --output-dir /data/outputs/chapman/ \
    --n-folds 10 --max-workers 8

Flag               Type  Default            Description
--dataset          str   required           Dataset slug (e.g. ptbxl, chapman_shaoxing)
--data-path        path  auto-download      Path to the dataset root directory
--output-dir       path  output/<dataset>/  Output directory for fold CSVs + metadata
--sampling-rate    int   config default     Sampling rate to validate against
--n-folds          int   10                 Number of cross-validation folds
--max-workers      int   4                  Parallel workers for signal validation
--skip-validation  flag  off                Skip signal validation (faster; no quality flags)
--skip-croissant   flag  off                Skip Croissant metadata generation

Python equivalent:

import ecgbench

result = ecgbench.run_splits(
    dataset="ptbxl",
    data_path="/path/to/ptb-xl/1.0.3/",
    output_dir=None,          # -> output/ptbxl/
    sampling_rate=None,       # -> config default_sampling_rate
    n_folds=10,
    max_workers=4,
    skip_validation=False,
    skip_croissant=False,
)
# result is a dict with: dataset, dataset_name, output_dir,
# original={total,train,val,test}, clean={total,train,val,test}, excluded
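
The returned dict can be turned into a short report. The sketch below assumes the key layout described in the comment above and uses a hand-written stand-in dict so it runs without any data; the helper summarise is hypothetical and the counts are made up, not real PTB-XL figures.

```python
def summarise(result):
    """Format per-version split counts and their share of the version total."""
    lines = []
    for version in ("original", "clean"):
        counts = result[version]
        total = counts["total"]
        parts = []
        for name in ("train", "val", "test"):
            share = counts[name] / total if total else 0.0
            parts.append(f"{name}={counts[name]} ({share:.1%})")
        lines.append(f"{result['dataset']} [{version}] total={total}: " + ", ".join(parts))
    return lines


# Hypothetical values for illustration only.
example_result = {
    "dataset": "demo",
    "original": {"total": 1000, "train": 800, "val": 100, "test": 100},
    "clean": {"total": 950, "train": 760, "val": 95, "test": 95},
}

summary = summarise(example_result)
for line in summary:
    print(line)
```

Comparing the original and clean totals gives a quick sense of how many records the quality checks removed.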

ecgbench croissant

Standalone Croissant 1.1 JSON-LD generator for an existing splits directory. Run once per version (clean and original).

ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/clean/    --version clean
ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/original/ --version original
ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/clean/ --validate

Flag          Type            Default                      Description
--dataset     str             required                     Dataset slug
--splits-dir  path            required                     Version directory to scan (e.g. output/ptbxl/clean/)
--output      path            <splits-dir>/croissant.json  Where to write the JSON-LD
--version     clean|original  clean                        Version label to record in the Croissant file
--validate    flag            off                          Validate the file after writing (non-zero exit if invalid)

Python equivalent:

from pathlib import Path
import ecgbench

saved_path: Path = ecgbench.run_croissant(
    dataset="ptbxl",
    splits_dir="output/ptbxl/clean/",
    output=None,              # -> splits_dir/croissant.json
    version="clean",
    validate=True,            # raises RuntimeError if the file does not validate
)

Requires the croissant extra (pip install ecgbench[croissant]).

ecgbench upload

Uploads each dataset's original/ and clean/ CSV folds, plus validation_report.json and croissant.json if present, to a HuggingFace Hub dataset repository. One or more dataset slugs can be uploaded in a single call.

ecgbench upload --data-dir output/ --datasets ptbxl
ecgbench upload --data-dir output/ --datasets ptbxl chapman_shaoxing
ecgbench upload --data-dir output/ --datasets ptbxl --dry-run
ecgbench upload --data-dir output/ --datasets ptbxl \
    --hf-repo-id your-org/ECGBench

Flag          Type  Default                 Description
--data-dir    path  required                Root directory containing per-dataset subdirectories
--datasets    list  required                One or more dataset slugs to upload
--hf-repo-id  str   vlbthambawita/ECGBench  Target HuggingFace dataset repo ID
--dry-run     flag  off                     Print the files that would be uploaded, without uploading

Authentication resolves in this order: token= argument (Python API only) -> HF_TOKEN env var -> HUGGINGFACE_HUB_TOKEN env var -> .env file in the current working directory. Run with --dry-run first to review the file list.

Python equivalent:

import ecgbench

counts: dict[str, int] = ecgbench.run_upload(
    data_dir="output/",
    datasets=["ptbxl", "chapman_shaoxing"],
    hf_repo_id="vlbthambawita/ECGBench",
    dry_run=False,
    token=None,               # falls back to env / .env
)
# counts: {"ptbxl": 42, "chapman_shaoxing": 42}

Requires the hf extra (pip install ecgbench[hf]).
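
The authentication order described above (explicit token, then HF_TOKEN, then HUGGINGFACE_HUB_TOKEN, then a .env file) can be sketched as a small resolver. This is an illustration only: resolve_token and read_dotenv are hypothetical helpers, and the assumption that the .env file uses the same two variable names is not confirmed by this document.

```python
import os
import tempfile
from pathlib import Path

ENV_VARS = ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN")


def read_dotenv(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are ignored."""
    values = {}
    dotenv = Path(path)
    if not dotenv.is_file():
        return values
    for line in dotenv.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_token(token=None, environ=None, dotenv_path=".env"):
    """Return the first token found in the documented order, or None."""
    environ = os.environ if environ is None else environ
    if token:
        return token
    for name in ENV_VARS:
        if environ.get(name):
            return environ[name]
    file_values = read_dotenv(dotenv_path)
    for name in ENV_VARS:
        if file_values.get(name):
            return file_values[name]
    return None


# Demonstrate each step with explicit inputs instead of the real environment.
from_arg = resolve_token("explicit", {"HF_TOKEN": "env-a"})
from_env = resolve_token(None, {"HF_TOKEN": "env-a", "HUGGINGFACE_HUB_TOKEN": "env-b"})
from_legacy_env = resolve_token(None, {"HUGGINGFACE_HUB_TOKEN": "env-b"})
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text('# local secrets\nHF_TOKEN="file-c"\n')
    from_file = resolve_token(None, {}, env_file)
    missing = resolve_token(None, {}, Path(tmp) / "absent.env")

print(from_arg, from_env, from_legacy_env, from_file, missing)
```

Passing the environment as an argument keeps the resolver easy to test and avoids reading real credentials during a dry run.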

API Reference

Config

  • load_config(slug) -- load DatasetConfig from YAML
  • list_available_configs() -- list dataset slugs with configs

Catalogue

  • list_datasets() -- all 64 datasets as CatalogueEntry objects
  • search(query, category, access) -- filter datasets
  • get_dataset(name) -- look up by name
  • categories() -- unique categories
  • to_dataframe() -- as pandas DataFrame

Dataset

  • ECGDataset(dataset, split, ...) -- unified PyTorch Dataset
  • ecg_collate_fn(batch) -- custom collate for DataLoader

Validation

  • validate_dataset(data_path, config) -- run quality checks
  • generate_report(result, config) -- generate report dict
  • save_report(result, config, path) -- save report JSON

Splitting

  • split_dataset(df, labels, config) -- generate folds
  • export_splits(split_result, val_result, output_dir, config) -- write CSVs
  • get_splitter(slug) -- get dataset-specific splitter

Croissant

  • generate_croissant(config, splits_dir) -- generate JSON-LD
  • save_croissant(config, splits_dir) -- save to file
  • validate_croissant(path) -- validate JSON-LD

Download

  • download_dataset(config) -- download from source
  • resolve_data_path(path, config) -- resolve or download

Pipelines (CLI + Python API)

  • run_splits(dataset, ...) -- full validate + split + Croissant pipeline (same as ecgbench splits)
  • run_croissant(dataset, splits_dir, ...) -- standalone Croissant generation (same as ecgbench croissant)
  • run_upload(data_dir, datasets, ...) -- HuggingFace Hub upload (same as ecgbench upload)

Development

uv pip install -e ".[dev]"
ruff check ecgbench/
black ecgbench/
pytest

Citation

If you use ECGBench in your research, please cite:

@software{ecgbench,
  author = {Thambawita, Vajira},
  title = {ECGBench: Reproducible ECG Benchmark Datasets},
  url = {https://github.com/vlbthambawita/ECGBench}
}

License

MIT License -- see LICENSE for details.
