raeh-data
Shared data-access library for RAEH biomedical signal datasets.
One install, every RAEH project (algorithm validation, SQI audits, RR/BP estimation, foundation-model pretraining, …) reads from s3://raeh-datasets/ the same way. Returns plain pandas.DataFrame / numpy.ndarray — no framework lock-in.
Status
Layer 1 (data access) and Layer 2 (signal-processing ops) implemented; canonical metadata populated on S3 for all datasets (see Datasets Reference).
Install
pip install raeh-data
That's it — no SSH key, no GitHub access, no git required. Python ≥ 3.11.
Pin a version for reproducibility:
pip install raeh-data==0.1.0
Or as a dependency in a consumer project's requirements.txt / pyproject.toml:
raeh-data>=0.1
The package is public on PyPI for install convenience, but it is an access client for RAEH's private datasets. Installing it does not grant data access: you also need RAEH-issued AWS credentials (below) and must be covered by the relevant data-use agreements. See LICENSE.
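To confirm an environment matches the install notes above, a small preflight check along these lines may help. It is only a sketch using the standard library; the helper names `installed_version` and `python_ok` are illustrative and not part of raeh-data.

```python
import sys
from importlib import metadata
from typing import Optional, Sequence

MIN_PYTHON = (3, 11)  # minimum interpreter version stated in the install notes


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_ok(version_info: Sequence[int] = sys.version_info) -> bool:
    """True when the (major, minor) pair meets the minimum Python version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    print("python ok:", python_ok())
    print("raeh-data:", installed_version("raeh-data") or "not installed")
```

Running this in CI before a benchmark job is one way to catch a missing or unpinned install early.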
For contributors
git clone git@github.com:<org>/raeh-data.git
cd raeh-data
pip install -e ".[dev]" # editable install with test/lint/build deps
AWS credentials
Installing the package doesn't grant data access — the datasets live in a
private bucket. raeh-data authenticates with any standard AWS credential
source (boto3's default provider chain), so use whichever your team has set
up. In rough order of preference:
1. AWS SSO / IAM Identity Center (recommended — short-lived, nothing to leak):
aws sso login --profile raeh # once per session
export AWS_PROFILE=raeh # or set profile in your shell rc
First-time setup (aws configure sso) and the admin-side org configuration are
in AWS SSO setup.
2. A named profile in ~/.aws/credentials:
export AWS_PROFILE=raeh
3. On AWS compute (EC2 / ECS / Lambda): nothing to do — the instance/task role is picked up automatically.
4. Long-lived keys via env vars or a .env at your project root (simplest, but
avoid on shared machines, since these keys don't expire):
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=ap-south-1
S3_BUCKET_NAME=raeh-datasets
The bucket (raeh-datasets) and region (ap-south-1) have sensible defaults;
override via .env, env vars, or raeh_data.configure(...) only if needed.
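The default-plus-override behaviour can be pictured with the sketch below. It is not the library's implementation: `resolve_settings` is a hypothetical helper, and the precedence shown (explicit override, then environment variable, then default) is an assumption, since the actual order used by raeh_data.configure(...) is not stated here.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults and env var names taken from the text above.
DEFAULT_BUCKET = "raeh-datasets"
DEFAULT_REGION = "ap-south-1"


def resolve_settings(
    env: Optional[Mapping[str, str]] = None,
    bucket: Optional[str] = None,
    region: Optional[str] = None,
) -> Dict[str, str]:
    """Pick bucket and region: explicit argument, else env var, else default.

    The precedence is an assumption made for illustration only.
    """
    env = os.environ if env is None else env
    return {
        "bucket": bucket or env.get("S3_BUCKET_NAME") or DEFAULT_BUCKET,
        "region": region or env.get("AWS_DEFAULT_REGION") or DEFAULT_REGION,
    }


if __name__ == "__main__":
    print(resolve_settings())
```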
Without any working credentials you'll get StorageUnavailable: HTTP 403 Forbidden on the first data call.
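When that error appears, it can be useful to see which of the credential sources listed above is visible in the current shell. The sketch below only inspects environment variables and the usual ~/.aws files. It does not reproduce boto3's actual resolution order, and `credential_hints` is a hypothetical helper, not a raeh-data function.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def credential_hints(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> List[str]:
    """List locally visible credential sources (a hint, not a guarantee)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    hints: List[str] = []
    if env.get("AWS_PROFILE"):
        hints.append("profile:" + env["AWS_PROFILE"])
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        hints.append("env-keys")
    aws_dir = home / ".aws"
    if (aws_dir / "credentials").exists() or (aws_dir / "config").exists():
        hints.append("shared-files")
    return hints


if __name__ == "__main__":
    found = credential_hints()
    print(found or "no local credential source found")
```

An empty result does not always mean access will fail: on AWS compute the instance or task role is picked up without any of these hints being present.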
Quick example
from raeh_data import datasets, ops
# Browse what's available
print(datasets.list())
# Load one subject's PPG + ground truth
sig = datasets.load("ppg_dalia", "S01", signal="ppg")
gt = datasets.ground_truth("ppg_dalia", "S01")
# Apply a signal-processing pipeline
sig = ops.bandpass(sig, 0.5, 8.0, fs=64)
sig = ops.zscore(sig)
# Iterate windows for a reproducible benchmark
for sig_df, gt_df, meta in datasets.iter_benchmark("ppg_dalia", "ppg"):
    # meta.subject_id, meta.window_idx, meta.sample_rate
    # ... predict, compare to gt_df ...
    pass
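For readers new to the pattern, the z-score and windowed-iteration steps in the example above can be pictured with plain NumPy. This is a standalone illustration, not how raeh-data implements `ops.zscore` or `datasets.iter_benchmark`: the `WindowMeta` class, the `iter_windows` helper, and the 8 s window / 2 s stride values are all assumptions chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple

import numpy as np


@dataclass(frozen=True)
class WindowMeta:
    """Hypothetical stand-in for the per-window metadata in the example."""
    subject_id: str
    window_idx: int
    sample_rate: float


def zscore(x: np.ndarray) -> np.ndarray:
    """Center to zero mean and scale to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros_like(x)  # constant signal: avoid division by zero
    return (x - x.mean()) / std


def iter_windows(
    x: np.ndarray,
    fs: float,
    subject_id: str,
    window_s: float = 8.0,  # assumed window length in seconds
    stride_s: float = 2.0,  # assumed stride in seconds
) -> Iterator[Tuple[np.ndarray, WindowMeta]]:
    """Yield fixed-length windows with metadata; a trailing partial window is dropped."""
    win = int(round(window_s * fs))
    step = int(round(stride_s * fs))
    if win <= 0 or step <= 0:
        raise ValueError("window_s and stride_s must be positive")
    for idx, start in enumerate(range(0, len(x) - win + 1, step)):
        yield x[start:start + win], WindowMeta(subject_id, idx, fs)


if __name__ == "__main__":
    fs = 64
    t = np.arange(10 * fs) / fs            # 10 s of samples at 64 Hz
    sig = zscore(np.sin(2 * np.pi * 1.2 * t))
    for window, meta in iter_windows(sig, fs, "S01"):
        print(meta.window_idx, window.shape)
```

Fixing the window length and stride up front is what makes a benchmark reproducible: every consumer sees the same windows in the same order.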
Documentation
- Usage Guide — concepts, recipes, common patterns.
- API Reference — every public function and class.
- Datasets Reference — per-dataset info, sample rates, benchmark protocols.
- AWS SSO setup — access via IAM Identity Center without long-lived keys (admin + user onboarding).
- Troubleshooting — common errors and fixes.
- Design Doc — internal architecture and design decisions (for contributors).
Run the demo
PYTHONPATH=src python examples/demo_ppg_dalia.py
End-to-end walkthrough on the PPG-DALIA dataset — catalog, load, ops chain, windowed iteration, benchmark mode.
Run the tests
pytest # unit tests (default; integration skipped)
pytest -m integration # live-S3 integration tests (requires creds)
Project layout
raeh-data/
├── pyproject.toml
├── docs/ ← documentation (you're here)
├── examples/ ← runnable demo scripts
├── scripts/ ← admin scripts (e.g., metadata rewriter)
├── src/raeh_data/
│ ├── datasets.py ← Layer 1 — public data-access API
│ ├── ops/ ← Layer 2 — signal-processing ops
│ ├── cache.py ← local Parquet cache
│ ├── _core.py ← internal: DataStore (S3 + DuckDB)
│ ├── _config.py ← env var loading + configure()
│ ├── _schemas.py ← DatasetMetadata, YieldMetadata
│ └── exceptions.py ← public exception hierarchy
└── tests/ ← unit tests + live-S3 integration tests