Reproducible ECG benchmark datasets with standardised splits, validation, and Croissant metadata

Maintainers

vlbthambawita

ECGBench

ECGBench provides a curated catalogue of 64 publicly available ECG datasets, a config-driven pipeline for generating validated fold splits, and a unified PyTorch Dataset class for loading any supported dataset.

Website: vlbthambawita.github.io/ECGBench

Installation

Base (config, catalogue, validation, splitting)

pip install ecgbench

With PyTorch support

pip install "ecgbench[torch]"

With everything

pip install "ecgbench[all]"

From source (development)

git clone https://github.com/vlbthambawita/ECGBench.git
cd ECGBench
uv pip install -e ".[dev]"

Quick Start

from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

# Load PTB-XL training data (downloads fold CSVs from HuggingFace Hub)
train_ds = ECGDataset("ptbxl", split="train", data_path="/path/to/ptb-xl/1.0.3/")
loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)

for batch in loader:
    signals = batch["signal"]   # (B, 12, 5000) float32 tensor
    ecg_ids = batch["record_id"]
    break

Dataset Catalogue

Query the curated index of 64 ECG datasets:

import ecgbench

# List all datasets
datasets = ecgbench.list_datasets()
print(f"{len(datasets)} datasets available")

# Search by name, origin, format, or paper
ecgbench.search("PTB-XL")

# Filter by category and access type
ecgbench.search(category="12-Lead (PhysioNet)", access="Open")

# Look up a single dataset
ecgbench.get_dataset("MIMIC-IV-ECG")

# List categories
ecgbench.categories()

# Get as pandas DataFrame
df = ecgbench.to_dataframe()

Loading ECG Data

Standard train/val/test splits

from ecgbench import ECGDataset, ecg_collate_fn
from torch.utils.data import DataLoader

train_ds = ECGDataset("ptbxl", split="train", data_path="/data/ptb-xl/1.0.3/")
val_ds = ECGDataset("ptbxl", split="val", data_path="/data/ptb-xl/1.0.3/")
test_ds = ECGDataset("ptbxl", split="test", data_path="/data/ptb-xl/1.0.3/")

loader = DataLoader(train_ds, batch_size=32, collate_fn=ecg_collate_fn)

K-fold cross-validation

from ecgbench import ECGDataset

# Rotate over the 10 folds: fold k validates, the next fold (wrapping 10 -> 1)
# tests, and the remaining eight folds train.
for k in range(1, 11):
    val_ds = ECGDataset("ptbxl", split="val", fold_numbers=[k], data_path="...")
    test_fold = k % 10 + 1
    test_ds = ECGDataset("ptbxl", split="test", fold_numbers=[test_fold], data_path="...")
    train_folds = [f for f in range(1, 11) if f != k and f != test_fold]
    train_ds = ECGDataset("ptbxl", split="train", fold_numbers=train_folds, data_path="...")
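
The rotation can be checked without loading any signals. The following is a minimal sketch in plain Python of which folds land in each role at every step; the helper name `fold_roles` is hypothetical and not part of ecgbench.

```python
def fold_roles(k: int, n_folds: int = 10) -> dict[str, list[int]]:
    """Return the train/val/test fold numbers for rotation step k (1-based)."""
    val_fold = k
    test_fold = k % n_folds + 1  # next fold, wrapping n_folds -> 1
    train_folds = [
        f for f in range(1, n_folds + 1) if f not in (val_fold, test_fold)
    ]
    return {"train": train_folds, "val": [val_fold], "test": [test_fold]}


# Show the first and last rotation steps, including the wrap-around.
for k in (1, 10):
    print(k, fold_roles(k))
```

Over the ten steps every fold serves exactly once as the validation fold and exactly once as the test fold, and the three roles never overlap within a step.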

ECGDataset parameters

Parameter	Type	Default	Description
`dataset`	`str | DatasetConfig`	required	Dataset slug or config object
`split`	`str`	`"train"`	`"train"`, `"val"`, or `"test"`
`version`	`str`	`"clean"`	`"clean"` or `"original"`
`data_path`	`Path | str | None`	`None`	Path to signal files; auto-downloads if None
`sampling_rate`	`int | None`	`None`	Sampling rate in Hz; None uses the dataset's default
`fold_numbers`	`list[int] | None`	`None`	Specific folds to load; None = all
`transform`	`Callable | None`	`None`	Transform applied to signal tensor
`metadata_source`	`str`	`"hf"`	`"hf"` (HuggingFace) or `"local"`

Output format

Each sample is a dict:

signal -- float32 tensor (leads, samples)
record_id -- record identifier
split, fold -- split name and fold number
All other CSV columns as tensors (numeric) or raw values (str/dict)

Data Versions

clean (default): only records that pass all quality checks
original: all records with is_valid and quality_issues columns

Both versions share identical fold assignments. Use original when you need all records or want to filter manually; use clean for standard benchmarking.
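
Manual filtering of the original version might look like the sketch below. It works on an in-memory stand-in for a fold CSV using only the standard library. The `is_valid` and `quality_issues` column names come from the description above; everything else is an assumption made for illustration: the sample rows, the `True`/`False` text encoding, the semicolon separator inside `quality_issues`, and the helper names.

```python
import csv
import io

# Stand-in for an "original" fold CSV. All values here are made up.
ORIGINAL_CSV = """record_id,fold,is_valid,quality_issues
rec_001,1,True,
rec_002,1,False,nan_values
rec_003,2,False,flat_line;amplitude_outlier
rec_004,2,True,
"""


def load_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per record."""
    return list(csv.DictReader(io.StringIO(text)))


def keep_valid(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep only records flagged valid (assumes 'True'/'False' text)."""
    return [r for r in rows if r["is_valid"].strip().lower() == "true"]


def drop_issue(rows: list[dict[str, str]], issue: str) -> list[dict[str, str]]:
    """Drop records carrying one specific issue; keep everything else."""
    return [r for r in rows if issue not in r["quality_issues"].split(";")]


rows = load_rows(ORIGINAL_CSV)
clean_like = keep_valid(rows)          # mimics the clean version
no_nan = drop_issue(rows, "nan_values")  # a looser, custom filter

print([r["record_id"] for r in clean_like])
print([r["record_id"] for r in no_nan])
```

Because both versions share fold assignments, a custom filter like `drop_issue` changes which records are kept but not which fold each surviving record belongs to.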

Validation

ECGBench validates every signal file before splitting:

missing_leads -- lead entirely NaN or all-zero
nan_values -- any NaN in signal
truncated_signal -- fewer samples than expected
flat_line -- lead with near-zero variance
corrupt_header -- unreadable signal file
amplitude_outlier -- samples outside physiological range

Results are saved in validation_report.json with per-record details.
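
To make the check names concrete, here is a rough sketch of how five of them could be expressed over plain Python lists. It is not ecgbench's implementation: the function name, the variance threshold, and the amplitude bound (with samples assumed to be in mV) are all assumptions chosen for illustration. `corrupt_header` is left out because it concerns reading the file rather than inspecting samples.

```python
import math
from statistics import pvariance


def check_signal(signal, expected_samples, flat_var=1e-8, max_abs_mv=10.0):
    """Return the sorted issue names found in one record.

    `signal` is a list of leads, each a list of float samples.
    Thresholds are illustrative assumptions, not ecgbench defaults.
    """
    issues = set()
    for lead in signal:
        if len(lead) < expected_samples:
            issues.add("truncated_signal")
        if any(math.isnan(x) for x in lead):
            issues.add("nan_values")
        finite = [x for x in lead if not math.isnan(x)]
        # A lead that is entirely NaN or entirely zero counts as missing.
        if not finite or all(x == 0.0 for x in finite):
            issues.add("missing_leads")
            continue
        if pvariance(finite) < flat_var:
            issues.add("flat_line")
        if any(abs(x) > max_abs_mv for x in finite):
            issues.add("amplitude_outlier")
    return sorted(issues)


good_lead = [math.sin(i / 10) for i in range(100)]
print(check_signal([good_lead, good_lead], expected_samples=100))
print(check_signal([[0.5] * 100, good_lead], expected_samples=100))
print(check_signal([[0.0] * 100], expected_samples=100))
```

A record with an empty issue list would go into the clean version; any non-empty list corresponds to what the original version records in `quality_issues`.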

Croissant Metadata

Both clean/ and original/ versions include MLCommons Croissant 1.1 JSON-LD metadata (croissant.json) with SHA-256 hashes for reproducibility. The full pipeline generates both automatically. For standalone generation:

ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/clean/ --version clean
ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/original/ --version original
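
The SHA-256 hashes make it possible to confirm that a local fold CSV matches the one the metadata describes. A minimal sketch using only `hashlib` follows; the helper name and the demo file are hypothetical, and where exactly the hash sits inside croissant.json is not assumed here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large CSVs need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file standing in for a fold CSV.
payload = b"record_id,fold\nrec_001,1\n"
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "fold_1.csv"
    csv_path.write_bytes(payload)
    file_digest = sha256_of_file(csv_path)

# The chunked hash agrees with hashing the bytes in one call.
print(file_digest == hashlib.sha256(payload).hexdigest())
```

Comparing `file_digest` with the value recorded in the metadata is then a plain string comparison.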

Adding a New Dataset

1. Copy ecgbench/data/configs/_template.yaml to <slug>.yaml and fill in the fields.
2. Run ecgbench splits --dataset <slug> --data-path /path/to/data/
3. Check validation_report.json and review the excluded records.
4. If custom logic is needed, create ecgbench/splitting/strategies/<slug>.py with @register("<slug>").
5. Run pytest.
6. Upload: ecgbench upload --data-dir output/ --datasets <slug>
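
The `@register("<slug>")` step follows the common decorator-registry pattern. The sketch below illustrates that pattern in isolation and is not ecgbench's code: the registry dict, the `lookup` helper, the round-robin default, and the strategy signature `(record_ids, n_folds)` are all assumptions, and the real strategy interface may differ.

```python
from typing import Callable, Dict, List

Splitter = Callable[[List[str], int], Dict[str, int]]

# Hypothetical registry mapping dataset slugs to strategy functions.
_STRATEGIES: Dict[str, Splitter] = {}


def register(slug: str) -> Callable[[Splitter], Splitter]:
    """Decorator that files a strategy function under a dataset slug."""
    def decorator(fn: Splitter) -> Splitter:
        if slug in _STRATEGIES:
            raise ValueError(f"strategy already registered for {slug!r}")
        _STRATEGIES[slug] = fn
        return fn
    return decorator


def round_robin(record_ids: List[str], n_folds: int) -> Dict[str, int]:
    """Default rule: deal sorted ids into folds 1..n_folds in turn."""
    return {rid: i % n_folds + 1 for i, rid in enumerate(sorted(record_ids))}


def lookup(slug: str) -> Splitter:
    """Return the registered strategy, or the default if none exists."""
    return _STRATEGIES.get(slug, round_robin)


@register("my_dataset")
def split_my_dataset(record_ids: List[str], n_folds: int) -> Dict[str, int]:
    """Made-up custom rule: ids ending in 'x' always go to fold 1."""
    folds = round_robin(record_ids, n_folds)
    return {rid: 1 if rid.endswith("x") else f for rid, f in folds.items()}


print(lookup("my_dataset")(["a", "bx", "c"], 3))
print(lookup("other_dataset")(["a", "bx", "c"], 3))
```

The appeal of this design is that adding a dataset only requires a new module; nothing in the pipeline's dispatch code has to change.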

CLI

Installing ecgbench adds a single ecgbench console command with three subcommands:

ecgbench --help               # top-level help
ecgbench <command> --help     # per-subcommand flags
ecgbench --version            # package version

Subcommand	Purpose
`splits`	Full pipeline -- validate signals, generate 10-fold splits, export CSVs, and write Croissant metadata
`croissant`	Generate Croissant 1.1 JSON-LD for an already-split dataset directory
`upload`	Upload fold CSVs and metadata to HuggingFace Hub (requires `ecgbench[hf]`)

Every subcommand has an equivalent Python function (run_splits, run_croissant, run_upload) with the same arguments, so the same workflow can be driven from a notebook or downstream code.

`ecgbench splits`

Runs the full pipeline: validate -> split -> export -> Croissant. Writes output/<dataset>/{original,clean}/ by default.

ecgbench splits --dataset ptbxl --data-path /path/to/ptb-xl/1.0.3/
ecgbench splits --dataset ptbxl                        # auto-download
ecgbench splits --dataset chapman_shaoxing \
    --data-path /data/chapman/ \
    --output-dir /data/outputs/chapman/ \
    --n-folds 10 --max-workers 8

Flag	Type	Default	Description
`--dataset`	str	required	Dataset slug (e.g. `ptbxl`, `chapman_shaoxing`)
`--data-path`	path	auto-download	Path to the dataset root directory
`--output-dir`	path	`output/<dataset>/`	Output directory for fold CSVs + metadata
`--sampling-rate`	int	config default	Sampling rate to validate against
`--n-folds`	int	`10`	Number of cross-validation folds
`--max-workers`	int	`4`	Parallel workers for signal validation
`--skip-validation`	flag	off	Skip signal validation (faster; no quality flags)
`--skip-croissant`	flag	off	Skip Croissant metadata generation

Python equivalent:

import ecgbench

result = ecgbench.run_splits(
    dataset="ptbxl",
    data_path="/path/to/ptb-xl/1.0.3/",
    output_dir=None,          # -> output/ptbxl/
    sampling_rate=None,       # -> config default_sampling_rate
    n_folds=10,
    max_workers=4,
    skip_validation=False,
    skip_croissant=False,
)
# result is a dict with: dataset, dataset_name, output_dir,
# original={total,train,val,test}, clean={total,train,val,test}, excluded
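
The `max_workers` setting controls how many records are validated at once. As a rough illustration of that idea, and not of how ecgbench does it internally (it may use processes rather than threads), the sketch below fans a toy per-record check out over a thread pool. The function names and the record layout are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def validate_record(record):
    """Toy per-record check: flag any NaN sample."""
    record_id, samples = record
    issues = ["nan_values"] if any(math.isnan(x) for x in samples) else []
    return record_id, issues


def validate_all(records, max_workers=4):
    """Validate records concurrently; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(validate_record, records))


records = [
    ("rec_001", [0.1, 0.2]),
    ("rec_002", [0.1, float("nan")]),
    ("rec_003", [0.0, 0.3]),
]
report = validate_all(records, max_workers=2)
excluded = [rid for rid, issues in report.items() if issues]
print(excluded)
```

Validation is dominated by reading signal files, so a modest worker count is often enough before disk throughput becomes the limit.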

`ecgbench croissant`

Standalone Croissant 1.1 JSON-LD generator for an existing splits directory. Run once per version (clean and original).

ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/clean/    --version clean
ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/original/ --version original
ecgbench croissant --dataset ptbxl --splits-dir output/ptbxl/clean/ --validate

Flag	Type	Default	Description
`--dataset`	str	required	Dataset slug
`--splits-dir`	path	required	Version directory to scan (e.g. `output/ptbxl/clean/`)
`--output`	path	`<splits-dir>/croissant.json`	Where to write the JSON-LD
`--version`	`clean` | `original`	`clean`	Version label to record in the Croissant file
`--validate`	flag	off	Validate the file after writing (non-zero exit if invalid)

Python equivalent:

from pathlib import Path
import ecgbench

saved_path: Path = ecgbench.run_croissant(
    dataset="ptbxl",
    splits_dir="output/ptbxl/clean/",
    output=None,              # -> splits_dir/croissant.json
    version="clean",
    validate=True,            # raises RuntimeError if the file does not validate
)

Requires the croissant extra (pip install ecgbench[croissant]).

`ecgbench upload`

Uploads each dataset's original/ and clean/ CSV folds, plus validation_report.json and croissant.json if present, to a HuggingFace Hub dataset repository. One or more dataset slugs can be uploaded in a single call.

ecgbench upload --data-dir output/ --datasets ptbxl
ecgbench upload --data-dir output/ --datasets ptbxl chapman_shaoxing
ecgbench upload --data-dir output/ --datasets ptbxl --dry-run
ecgbench upload --data-dir output/ --datasets ptbxl \
    --hf-repo-id your-org/ECGBench

Flag	Type	Default	Description
`--data-dir`	path	required	Root directory containing per-dataset subdirectories
`--datasets`	list	required	One or more dataset slugs to upload
`--hf-repo-id`	str	`vlbthambawita/ECGBench`	Target HuggingFace dataset repo ID
`--dry-run`	flag	off	Print the files that would be uploaded, without uploading

Authentication resolves in this order: token= argument (Python API only) -> HF_TOKEN env var -> HUGGINGFACE_HUB_TOKEN env var -> .env file in the current working directory. Run with --dry-run first to review the file list.
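
The resolution order can be pictured as a simple fallback chain. The sketch below is an illustration under assumptions, not the package's actual code: the helper names are hypothetical, the `.env` parser is deliberately minimal, and it assumes the `.env` file uses the same two variable names as the environment.

```python
import os
from pathlib import Path

TOKEN_VARS = ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN")


def read_dotenv(path):
    """Parse simple KEY=VALUE lines; returns {} if the file is absent."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for line in p.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_token(token=None, env=None, dotenv_path=".env"):
    """Return the first token found: argument, env vars, then .env file."""
    env = os.environ if env is None else env
    if token:
        return token
    for name in TOKEN_VARS:
        if env.get(name):
            return env[name]
    file_values = read_dotenv(dotenv_path)
    for name in TOKEN_VARS:
        if file_values.get(name):
            return file_values[name]
    return None


# An explicit argument wins over anything in the environment.
print(resolve_token("explicit", env={"HF_TOKEN": "from-env"}))
```

Passing `env` as a plain dict keeps the function easy to test without touching the real process environment.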

Python equivalent:

import ecgbench

counts: dict[str, int] = ecgbench.run_upload(
    data_dir="output/",
    datasets=["ptbxl", "chapman_shaoxing"],
    hf_repo_id="vlbthambawita/ECGBench",
    dry_run=False,
    token=None,               # falls back to env / .env
)
# counts: {"ptbxl": 42, "chapman_shaoxing": 42}

Requires the hf extra (pip install ecgbench[hf]).

API Reference

Config

load_config(slug) -- load DatasetConfig from YAML
list_available_configs() -- list dataset slugs with configs

Catalogue

list_datasets() -- all 64 datasets as CatalogueEntry objects
search(query, category, access) -- filter datasets
get_dataset(name) -- look up by name
categories() -- unique categories
to_dataframe() -- as pandas DataFrame

Dataset

ECGDataset(dataset, split, ...) -- unified PyTorch Dataset
ecg_collate_fn(batch) -- custom collate for DataLoader

Validation

validate_dataset(data_path, config) -- run quality checks
generate_report(result, config) -- generate report dict
save_report(result, config, path) -- save report JSON

Splitting

split_dataset(df, labels, config) -- generate folds
export_splits(split_result, val_result, output_dir, config) -- write CSVs
get_splitter(slug) -- get dataset-specific splitter

Croissant

generate_croissant(config, splits_dir) -- generate JSON-LD
save_croissant(config, splits_dir) -- save to file
validate_croissant(path) -- validate JSON-LD

Download

download_dataset(config) -- download from source
resolve_data_path(path, config) -- resolve or download

Pipelines (CLI + Python API)

run_splits(dataset, ...) -- full validate + split + Croissant pipeline (same as ecgbench splits)
run_croissant(dataset, splits_dir, ...) -- standalone Croissant generation (same as ecgbench croissant)
run_upload(data_dir, datasets, ...) -- HuggingFace Hub upload (same as ecgbench upload)

Development

uv pip install -e ".[dev]"
ruff check ecgbench/
black ecgbench/
pytest

Citation

If you use ECGBench in your research, please cite:

@software{ecgbench,
  author = {Thambawita, Vajira},
  title = {ECGBench: Reproducible ECG Benchmark Datasets},
  url = {https://github.com/vlbthambawita/ECGBench}
}

License

MIT License -- see LICENSE for details.

Release history

Version	Released
0.12.0 (this version)	Apr 24, 2026
0.11.0	Apr 24, 2026
0.10.0	Apr 23, 2026
0.9.2	Apr 16, 2026
0.9.1	Apr 16, 2026
0.3.2	Apr 14, 2026
0.3.0	Apr 14, 2026
0.1.0	Apr 14, 2026

Download files

Source Distribution

ecgbench-0.12.0.tar.gz (51.0 kB)

Uploaded Apr 24, 2026 Source

Built Distribution

ecgbench-0.12.0-py3-none-any.whl (93.5 kB)

Uploaded Apr 24, 2026 Python 3

File details

Details for the file ecgbench-0.12.0.tar.gz.

File metadata

Download URL: ecgbench-0.12.0.tar.gz
Upload date: Apr 24, 2026
Size: 51.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ecgbench-0.12.0.tar.gz
Algorithm	Hash digest
SHA256	`4b12261213674dda8a528e1db7f9544aeb7da854050b14486a06c144635cc1dc`
MD5	`28e3cc4f7411d1cd5c2ad42be1f68620`
BLAKE2b-256	`2d858914ca3197543e277a7208dcd3fa9ae4a1bbacb88fd43a067ee80225d76c`

Provenance

The following attestation bundles were made for ecgbench-0.12.0.tar.gz:

Publisher: publish-pypi.yml on vlbthambawita/ECGBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ecgbench-0.12.0.tar.gz
- Subject digest: 4b12261213674dda8a528e1db7f9544aeb7da854050b14486a06c144635cc1dc
- Sigstore transparency entry: 1370006783
- Sigstore integration time: Apr 24, 2026
Source repository:
- Permalink: vlbthambawita/ECGBench@cd292734cdecdfa104b8705225ebde844be7497e
- Branch / Tag: refs/tags/v0.12.0
- Owner: https://github.com/vlbthambawita
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@cd292734cdecdfa104b8705225ebde844be7497e
- Trigger Event: push

File details

Details for the file ecgbench-0.12.0-py3-none-any.whl.

File metadata

Download URL: ecgbench-0.12.0-py3-none-any.whl
Upload date: Apr 24, 2026
Size: 93.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ecgbench-0.12.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`daabe7de2723f56ad27c8607e6844c092d1a60e4a75b5c8bb27737b5e4ea0cfa`
MD5	`2abae3fff80d1a763f8940576bd5445e`
BLAKE2b-256	`ba98047d7d693fba4e4d9b7a2431cb4c919fd7e348616d5f5dc44412258575d0`

Provenance

The following attestation bundles were made for ecgbench-0.12.0-py3-none-any.whl:

Publisher: publish-pypi.yml on vlbthambawita/ECGBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ecgbench-0.12.0-py3-none-any.whl
- Subject digest: daabe7de2723f56ad27c8607e6844c092d1a60e4a75b5c8bb27737b5e4ea0cfa
- Sigstore transparency entry: 1370006930
- Sigstore integration time: Apr 24, 2026
Source repository:
- Permalink: vlbthambawita/ECGBench@cd292734cdecdfa104b8705225ebde844be7497e
- Branch / Tag: refs/tags/v0.12.0
- Owner: https://github.com/vlbthambawita
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@cd292734cdecdfa104b8705225ebde844be7497e
- Trigger Event: push
