A citable, reproducible bank of raw NIRS reference datasets — multi-source/multi-target, tier-governed, with on-demand checksum-verified access from each dataset's origin.
nirs4all-datasets
A citable, reproducible bank of raw NIRS (Near-Infrared Spectroscopy) reference datasets — for benchmarking, exploring, and comparing models on a common, version-pinned, provenance-rich footing.
A dataset here is raw measured reality, not a benchmark task: one or more spectral sources (instruments), any number of variables (every target and metadata column — nothing is invented, and nothing is thrown away), the native splits if the source defined them, and full provenance back to the origin that published the data. The task — which Y, which split, which metric — is a choice the consumer makes; it is never baked into the dataset.
Three deliverables:
- a git-tracked catalog: one hand-checkable descriptor + a machine-generated identity card (stats, per-source/per-variable dataviz, MLCommons Croissant, a Datasheet) per dataset. The heavy bytes never enter git.
- a Python plugin: get("name") downloads a dataset on demand from its origin, verifies its SHA-256, caches it, and returns a NirsDataset.
- a static site: a browsable, qualified catalog with whole-bank dataviz and per-dataset id-cards.
It reuses nirs4all for qualification and nirs4all-io /
nirs4all-formats for reading instrument files (OPUS, JCAMP-DX, SPC, ASD, …).
It never re-implements NIRS/IO logic.
Status: alpha (0.x), pre-1.0 — the on-disk and API contracts may still change.
The dataset model
- Sources (X): 1..n, kept separate. Multi-instrument / multi-block datasets keep each block as its own source. Sources may even carry different numbers of spectra (asymmetric repetitions): they are aligned by sample identity (sample_id), never by row position.
- Variables (Y + metadata): 0..n. There is no intrinsic Y/metadata distinction: every column is a potential target. A dataset may declare no target at all (X-only / metadata-only is valid). Declared targets are flagged; everything else is kept as metadata, with full per-variable dataviz either way.
- Splits: documented, never auto-applied. Native train/test/fold partitions are recorded so you can reproduce a paper's split, but get() never silently applies one.
- Tiers: how a dataset is shown and exported. public (everything shown, openly fetchable from the origin), private (everything shown; export needs a token), anonymized (variable names masked + targets normalized; export needs a token). Bytes are never served from git or the site: the catalog points at the origin DOI/URL; a personal Dataverse is only a future fallback for protected datasets.
- Versions: two axes. A content version (bumps when the dataset bytes change) and a metric-protocol version (lets the cards be re-qualified under a new protocol without rebuilding the data).
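To make "aligned by sample identity, never by row position" concrete, here is a minimal sketch in plain Python. The helper name align_by_sample_id and the (sample_id, spectrum) row layout are assumptions made for illustration; they are not the package's API or on-disk format.

```python
from collections import defaultdict


def align_by_sample_id(sources):
    """Group spectra per sample across sources.

    `sources` maps a source id to a list of (sample_id, spectrum) rows
    (a hypothetical layout). Returns {sample_id: {source_id: [spectrum, ...]}}
    restricted to the samples that every source measured at least once.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for source_id, rows in sources.items():
        for sample_id, spectrum in rows:
            # Repetitions of the same sample accumulate in a list,
            # so sources may contribute different numbers of spectra.
            grouped[sample_id][source_id].append(spectrum)
    return {
        sample_id: {src: list(spectra) for src, spectra in by_source.items()}
        for sample_id, by_source in grouped.items()
        if len(by_source) == len(sources)
    }


# Two sources with different row order and an extra repetition of "s1" in X2.
sources = {
    "X1": [("s1", [0.1, 0.2]), ("s2", [0.3, 0.4])],
    "X2": [("s2", [1.0, 1.1]), ("s1", [1.2, 1.3]), ("s1", [1.4, 1.5])],
}
aligned = align_by_sample_id(sources)
```

Row 0 of X1 and row 0 of X2 belong to different samples here, which is why a positional join would silently pair the wrong spectra.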
Install (development)
uv venv && uv pip install -e ".[dev]" # maturin: builds the native acquisition core into the package
# (uses local editable nirs4all via [tool.uv.sources]; needs a Rust toolchain)
Native acquisition core & language bindings
The download of a dataset — version-pinned DOI resolution, redirect-safe Dataverse / Zenodo /
figshare fetch, streaming SHA-256 verification and the pooch-style cache — lives in a small Rust
core (crates/nirs4all-datasets-core) behind a stable C ABI (n4ds_), and is published like the
rest of the ecosystem (nirs4all-io is the template). The scientific analysis layer (cards,
qualify, site, health) stays in pure Python. The cross-language contract is one distributable
catalog/index.json; the n4ds CLI is the parity oracle. Bindings (all over the same C ABI):
| Binding | Package | Status |
|---|---|---|
| Python | embedded in nirs4all-datasets (nirs4all_datasets._n4ds, pyo3) | built + tested |
| Rust | nirs4all-datasets-core / -capi (crates.io) | built + tested |
| WASM/JS | @nirs4all/datasets-wasm (npm): metadata + small public datasets | built + tested |
| R | nirs4alldatasets (C shim, r-universe / Release) | built + tested |
| Octave/MATLAB | MEX (GitHub Release zip) | built + tested |
See bindings/SPEC.md (the binding contract) and
docs/dev/release_process.md.
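The streaming SHA-256 verification mentioned for the acquisition core can be illustrated in a few lines. The real implementation lives in the Rust core; the Python sketch below only shows the general technique (hash the bytes chunk by chunk while reading, then compare against the pinned digest), and the function name verify_sha256_stream is hypothetical.

```python
import hashlib
import io


def verify_sha256_stream(stream, expected_hex, chunk_size=65536):
    """Hash a binary stream chunk by chunk and compare with the expected digest.

    Raises ValueError on mismatch and returns the hex digest on success.
    The whole payload is never held in memory at once.
    """
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_hex.lower():
        raise ValueError(f"checksum mismatch: expected {expected_hex}, got {actual}")
    return actual


# Illustrative payload; a real caller would pass the download stream
# and the digest recorded in the catalog.
payload = b"example bytes"
expected = hashlib.sha256(payload).hexdigest()
verified = verify_sha256_stream(io.BytesIO(payload), expected)
```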
Quickstart
import nirs4all_datasets as n4ad
n4ad.list() # the catalog index
n4ad.card("corn_eigenvector_nir") # the identity card (dict): sources, variables, stats, provenance
ds = n4ad.get("corn_eigenvector_nir") # -> NirsDataset (fetched from origin, checksum-verified, cached)
ds.sources() # ['X1', 'X2', 'X3'] — the same corn measured on three NIR instruments
ds.x("X1") # one source's spectra as a 2D numpy array
ds.x(concat=False) # {source_id: array} for every source (sample-aligned, not row-aligned)
ds.y() # all declared targets, per sample
ds.metadata() # the metadata columns (each a potential target)
ds.split("original") # the native split labels, if the source defined one
ds.to_nirs4all() # hand off to nirs4all for modelling
Private / anonymized datasets need a Dataverse token: n4ad.get("name", token=...).
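Because get() never applies a split, the consumer applies the recorded labels explicitly. A minimal sketch, assuming the native split comes back as one label per sample; the label values ("train", "test") and the helper name partition_by_split are hypothetical, not part of the package.

```python
def partition_by_split(sample_ids, split_labels):
    """Group sample ids by their split label, preserving the original order."""
    if len(sample_ids) != len(split_labels):
        raise ValueError("one split label is required per sample")
    parts = {}
    for sample_id, label in zip(sample_ids, split_labels):
        parts.setdefault(label, []).append(sample_id)
    return parts


# Hypothetical labels standing in for what a native split might record.
parts = partition_by_split(
    ["s1", "s2", "s3", "s4"],
    ["train", "test", "train", "train"],
)
```

Partitioning by sample id rather than by row index keeps the split valid across sources that hold different numbers of spectra per sample.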
CLI (n4a-datasets)
bootstrap <tree> author schema-2.0 descriptors from <tree>/v2.0/* (--prune to re-base)
build-all --source-tree <tree> organize + qualify every dataset in parallel (--protocol-refresh, --site)
add <raw_source> <id> one raw source -> canonical + card + index
qualify <id> (re)build a dataset's card (--anonymize -> card.anon.json)
health-check probe each dataset's open origins -> catalog/health.json
catalog | list | card | get regenerate the index / inspect / load a dataset
publish | grant | revoke | restrict personal-Dataverse governance for protected data (future)
n4a-datasets <command> --help documents every flag.
What lives where (3-tier storage)
- git (small, tracked): catalog/datasets/<id>.yaml (descriptor), catalog/datasets.yaml (index + whole-bank summary), and per-dataset card.json / card.md / croissant.json / manifest.json.
- the origin (Zenodo, a data Dataverse, a vendor archive, …): the raw + canonical bytes, fetched on demand and never re-hosted by this project.
- local cache (downloaded on demand): the verified canonical Parquet under pooch.os_cache.
API token — where to put it
A Dataverse API token is only needed to fetch private/anonymized datasets or to publish to a personal Dataverse; public datasets need none. Resolution order:
- Environment variable NIRS4ALL_DATAVERSE_TOKEN (recommended; required in CI).
- ~/.config/nirs4all-datasets/config.toml (chmod 600):

      [dataverse]
      instance = "https://entrepot.recherche.data.gouv.fr"
      token = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

- A project .env (gitignored); see .env.example.
The token travels only in the X-Dataverse-key header, is never logged, and is never sent on a redirect
to signed object storage. Never commit it (.env, config.toml, *.token are gitignored).
Contributing
Full walkthrough in CONTRIBUTING.md; the design is in docs/DESIGN.md. The green gate (run before every commit) mirrors CI:
ruff check . && mypy --config-file pyproject.toml src
python catalog/scripts/validate.py # every descriptor is schema-valid
pytest -q
License
Code: MIT (see LICENSE). Each dataset carries its own SPDX license in its descriptor, and
is only ever linked to its origin — open data is never re-hosted under a different license.