raeh-data

Shared data-access library for RAEH biomedical signal datasets.

With one install, every RAEH project (algorithm validation, SQI audits, RR/BP estimation, foundation-model pretraining, …) reads from s3://raeh-datasets/ the same way. Calls return plain pandas.DataFrame / numpy.ndarray objects, so there is no framework lock-in.

Status

Layer 1 (data access) and Layer 2 (signal-processing ops) implemented; canonical metadata populated on S3 for all datasets (see Datasets Reference).

Install

pip install raeh-data

That's it — no SSH key, no GitHub access, no git required. Python ≥ 3.11.

Pin a version for reproducibility:

pip install raeh-data==0.1.0

Or as a dependency in a consumer project's requirements.txt / pyproject.toml:

raeh-data>=0.1

The package is public on PyPI for install convenience, but it is an access client for RAEH's private datasets. Installing it does not grant data access — you also need RAEH-issued AWS credentials (below) and must be covered by the relevant data-use agreements. See LICENSE.

For contributors

git clone git@github.com:<org>/raeh-data.git
cd raeh-data
pip install -e ".[dev]"     # editable install with test/lint/build deps

AWS credentials

The datasets live in a private bucket, so data access needs AWS credentials in addition to the package. raeh-data authenticates through boto3's default credential provider chain, so any standard AWS credential source works; use whichever your team has set up. In rough order of preference:

1. AWS SSO / IAM Identity Center (recommended — short-lived, nothing to leak):

aws sso login --profile raeh        # once per session
export AWS_PROFILE=raeh             # or set profile in your shell rc

First-time setup (aws configure sso) and the admin-side org configuration are in AWS SSO setup.

2. A named profile in ~/.aws/credentials:

export AWS_PROFILE=raeh

3. On AWS compute (EC2 / ECS / Lambda): nothing to do — the instance/task role is picked up automatically.

4. Long-lived keys via environment variables or a .env file at your project root (simplest, but avoid on shared machines, since these keys don't expire):

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=ap-south-1
S3_BUCKET_NAME=raeh-datasets

The bucket (raeh-datasets) and region (ap-south-1) have sensible defaults; override via .env, env vars, or raeh_data.configure(...) only if needed.

Without any working credentials you'll get StorageUnavailable: HTTP 403 Forbidden on the first data call.
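
Before the first data call, it can help to check which bucket and region would be in effect and whether long-lived keys are present at all. The sketch below uses only the standard library and does not call raeh-data. The helper names (`parse_dotenv`, `effective_settings`, `has_static_keys`) are hypothetical, and the precedence shown (environment variable, then .env, then default) is an assumption, not documented library behaviour.

```python
import os
from collections.abc import Mapping
from pathlib import Path

# Defaults as stated in the text above.
DEFAULTS = {"S3_BUCKET_NAME": "raeh-datasets", "AWS_DEFAULT_REGION": "ap-south-1"}


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def effective_settings(env: Mapping[str, str], dotenv_text: str = "") -> dict[str, str]:
    """Resolve bucket and region.

    ASSUMED precedence: environment variable, then .env, then default.
    """
    dotenv = parse_dotenv(dotenv_text)
    return {
        key: env.get(key) or dotenv.get(key) or default
        for key, default in DEFAULTS.items()
    }


def has_static_keys(env: Mapping[str, str], dotenv_text: str = "") -> bool:
    """Return True if both long-lived key variables are set in env or .env."""
    merged = {**parse_dotenv(dotenv_text), **env}
    return all(merged.get(k) for k in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"))


if __name__ == "__main__":
    dotenv_path = Path(".env")
    text = dotenv_path.read_text() if dotenv_path.exists() else ""
    # Prints only bucket and region, never the secret values.
    print(effective_settings(os.environ, text))
    print("static keys present:", has_static_keys(os.environ, text))
```

A missing pair of static keys is not an error by itself, since SSO, a named profile, or an instance role may still supply credentials.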

Quick example

from raeh_data import datasets, ops

# Browse what's available
print(datasets.list())

# Load one subject's PPG + ground truth
sig = datasets.load("ppg_dalia", "S01", signal="ppg")
gt = datasets.ground_truth("ppg_dalia", "S01")

# Apply a signal-processing pipeline
sig = ops.bandpass(sig, 0.5, 8.0, fs=64)
sig = ops.zscore(sig)

# Iterate windows for a reproducible benchmark
for sig_df, gt_df, meta in datasets.iter_benchmark("ppg_dalia", "ppg"):
    # meta.subject_id, meta.window_idx, meta.sample_rate
    # ... predict, compare to gt_df ...
    pass
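
To make the idea of windowed iteration concrete, here is a standalone sketch of fixed-length windowing over a sample sequence. It is not the implementation behind `datasets.iter_benchmark`. `WindowMeta` and `iter_windows` are hypothetical names, the metadata fields mirror the comment in the example above, and the 8 s window with a 2 s step is an assumed default chosen for illustration.

```python
from collections.abc import Iterator, Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMeta:
    """Hypothetical per-window metadata; field names follow the example comment."""
    subject_id: str
    window_idx: int
    sample_rate: float


def iter_windows(
    samples: Sequence[float],
    subject_id: str,
    sample_rate: float,
    window_s: float = 8.0,  # assumed window length in seconds
    step_s: float = 2.0,    # assumed step between window starts in seconds
) -> Iterator[tuple[Sequence[float], WindowMeta]]:
    """Yield (window, meta) pairs in order; a trailing partial window is dropped."""
    size = int(round(window_s * sample_rate))
    step = int(round(step_s * sample_rate))
    if size <= 0 or step <= 0:
        raise ValueError("window_s and step_s must give at least one sample")
    starts = range(0, len(samples) - size + 1, step)
    for idx, start in enumerate(starts):
        yield samples[start:start + size], WindowMeta(subject_id, idx, sample_rate)


if __name__ == "__main__":
    signal = [0.0] * (64 * 12)  # 12 s of placeholder samples at 64 Hz
    for window, meta in iter_windows(signal, "S01", 64):
        print(meta.window_idx, len(window))
```

Because the window boundaries depend only on the sample rate, window length, and step, the same input always yields the same windows in the same order, which is what keeps a benchmark comparable across runs.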

Documentation

Run the demo

PYTHONPATH=src python examples/demo_ppg_dalia.py

End-to-end walkthrough on the PPG-DALIA dataset — catalog, load, ops chain, windowed iteration, benchmark mode.

Run the tests

pytest                    # unit tests (default; integration skipped)
pytest -m integration     # live-S3 integration tests (requires creds)

Project layout

raeh-data/
├── pyproject.toml
├── docs/                  ← documentation (you're here)
├── examples/              ← runnable demo scripts
├── scripts/               ← admin scripts (e.g., metadata rewriter)
├── src/raeh_data/
│   ├── datasets.py        ← Layer 1 — public data-access API
│   ├── ops/               ← Layer 2 — signal-processing ops
│   ├── cache.py           ← local Parquet cache
│   ├── _core.py           ← internal: DataStore (S3 + DuckDB)
│   ├── _config.py         ← env var loading + configure()
│   ├── _schemas.py        ← DatasetMetadata, YieldMetadata
│   └── exceptions.py      ← public exception hierarchy
└── tests/                 ← unit tests + live-S3 integration tests
