Python port of DataManifest.jl — declare and manage data dependencies for scientific projects
datamanifest
Keep track of datasets used in a scientific project.
datamanifest provides a simple way to declare data dependencies — URLs, git repositories, checksums, formats — in a datasets.toml file, and handles download, verification, extraction, and loading. It is a Python port of DataManifest.jl (same author), with the same manifest format and feature surface.
How it compares to Pooch
If you know Pooch, think "Pooch, but with a richer manifest that also loads the data and works across languages." Pooch is the established, widely-used tool for the fetch-verify-extract layer (it backs SciPy, scikit-image, and many others), and datamanifest covers that same ground — HTTP/Zenodo downloads, SHA-256 verification, unzip/untar. Pooch already has a registry file (flat lines of filename sha256 [url]); the three things datamanifest adds on top:
- A structured manifest that fetches and loads. Beyond filename+hash, one datasets.toml carries format, extraction, per-language hooks, and how to turn each dataset into a pandas/xarray object (the loader ladder), where Pooch deliberately stops at "here's the verified path."
- A dependency graph. requires= resolves datasets in topological order, so derived datasets can be built from others.
- A cross-language manifest. This is the core differentiator: datamanifest is one member of a multi-language DataManifest family built on a shared TOML schema. The same datasets.toml is consumed by sibling implementations in other languages (today DataManifest.jl for Julia) via the _LANG namespace, so projects in different languages share one declaration without stepping on each other. None of the Python tools below target this.
If you only need download-and-checksum in pure Python, Pooch is the more mature choice. datamanifest is aimed at multi-dataset, multi-language scientific projects that want the whole dependency declaration in one file.
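To make the dependency-graph point concrete, here is a minimal sketch of resolving requires= relations in topological order. It uses only the standard library (graphlib, Python 3.9+) and is not datamanifest's own code; the dataset names and the plain-dict representation of the manifest are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical manifest fragment: dataset name -> names listed in its requires= field.
requires = {
    "raw_obs": [],
    "raw_model": [],
    "regridded_model": ["raw_model"],
    "bias_table": ["raw_obs", "regridded_model"],
}


def resolution_order(requires):
    """Return dataset names so that every dataset comes after its requirements."""
    sorter = TopologicalSorter(requires)  # maps each node to its predecessors
    return list(sorter.static_order())    # raises graphlib.CycleError on a cycle


order = resolution_order(requires)
print(order)
```

Any order that places raw_model before regridded_model, and both inputs before bias_table, is a valid result.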
Installation
pip install datamanifestpy
With optional loader backends:
pip install "datamanifestpy[csv]" # pandas CSV
pip install "datamanifestpy[parquet]" # pandas + pyarrow
pip install "datamanifestpy[nc]" # xarray + netcdf4
pip install "datamanifestpy[yaml]" # pyyaml
pip install "datamanifestpy[all]" # all of the above
API quickstart
import datamanifest
# Add a dataset (registers + downloads + auto-fills sha256)
datamanifest.add(
    "https://github.com/jesstierney/lgmDA/archive/refs/tags/v2.1.zip",
    name="jesstierney/lgmDA",
    extract=True,
)
# Resolve the on-disk path
path = datamanifest.get_dataset_path("jesstierney/lgmDA")
# Download and load in one step
ds = datamanifest.load_dataset("my_nc_entry") # returns xarray.Dataset for nc format
# Explicit database (no pyproject.toml / env-var lookup)
db = datamanifest.Database("datasets.toml", "my-data-folder")
datamanifest.add(db, "https://zenodo.org/record/.../file.csv")
path = datamanifest.get_dataset_path(db, "file")
The module-level functions (add, download_dataset, load_dataset, get_dataset_path, …) look up a process-wide default Database via pyproject.toml discovery, the DATAMANIFEST_TOML / DATASETS_TOML environment variables, or a datasets.toml / datamanifest.toml file in the working tree. Pass an explicit db as the first argument to bypass auto-discovery.
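The lookup described above can be pictured with the following sketch. It is not the library's implementation: the precedence (environment variables first, then the nearest manifest file found while walking up from a start directory) is an assumption, pyproject.toml discovery is left out because its details are not given here, and find_manifest is a hypothetical helper name.

```python
import os
from pathlib import Path

ENV_VARS = ("DATAMANIFEST_TOML", "DATASETS_TOML")
FILE_NAMES = ("datasets.toml", "datamanifest.toml")


def find_manifest(start, environ=None):
    """Return the manifest path chosen by the assumed lookup order, or None."""
    environ = os.environ if environ is None else environ
    # 1. An environment variable, if set, wins (assumed precedence).
    for var in ENV_VARS:
        if environ.get(var):
            return Path(environ[var])
    # 2. Otherwise walk from `start` up to the filesystem root (assumed behavior).
    start = Path(start).resolve()
    for folder in (start, *start.parents):
        for name in FILE_NAMES:
            candidate = folder / name
            if candidate.is_file():
                return candidate
    return None
```

Calling find_manifest(Path.cwd()) from a project subdirectory would then return the closest manifest file above it, unless one of the environment variables points elsewhere.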
CLI usage
datamanifest COMMAND [OPTIONS]
| Command | Description |
|---|---|
| list [--present\|--missing\|--all] | List datasets; default shows present first, then missing |
| download [NAME ...] [--all] [--overwrite] | Download specific datasets or all of them |
| path NAME | Print the resolved on-disk path (composable in shell) |
| add URI [--name N] [--no-download] [--extract] | Register and (by default) download a dataset |
| remove NAME [--keep-cache] | Delete an entry, optionally preserving cached files |
| show NAME | Print full entry detail in TOML style |
| verify [NAME ...] | Re-check sha256 checksums; exits nonzero on any mismatch |
| init [--folder PATH] [--force] | Create a fresh datasets.toml in the current directory |
| where | Print active datasets_toml and datasets_folder paths |
| migrate FILE | Rewrite a v0 manifest to schema v1 (_LANG form) in-place |
Examples:
# Set up a new project
datamanifest init
# Add and download a dataset
datamanifest add "https://zenodo.org/record/.../file.zip" --extract
# Use the path in a shell pipeline
python analysis.py --data "$(datamanifest path file)"
# Verify all checksums before a paper submission
datamanifest verify
# Where is the active manifest?
datamanifest where
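The same commands can be driven from Python through subprocess, which is handy when a script should not import the package directly. The sketch below relies only on the documented behavior of path (prints the resolved path) and verify (exits nonzero on a mismatch); the helper names and the injectable run parameter are hypothetical conveniences for testing.

```python
import subprocess


def dataset_path(name, run=subprocess.run):
    """Return the path printed by `datamanifest path NAME`."""
    result = run(
        ["datamanifest", "path", name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def checksums_ok(names=(), run=subprocess.run):
    """Run `datamanifest verify [NAME ...]`; True when the exit status is zero."""
    result = run(["datamanifest", "verify", *names])
    return result.returncode == 0
```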
Features
| Feature | Supported |
|---|---|
| HTTP / HTTPS download with progress | yes |
| Partial-download resume (Range header) | yes |
| git clone (git://, ssh+git://, *.git) | yes |
| SSH / rsync (ssh://, sshfs://, rsync://) | yes |
| Local file copy (file://) | yes |
| Multi-URI batch entries (uris=) | yes |
| SHA-256 checksum verification + auto-fill | yes |
| ZIP / tar / tar.gz extraction | yes |
| requires= dependency graph (topological order) | yes |
| Shell template hook (shell=) | yes |
| Python entry-point hook (python=) | yes |
| Named + default loaders (csv, parquet, nc, json, yaml, toml, zip, tar) | yes |
| TOML manifest round-trip (read tomllib, write tomli_w) | yes |
| Project-root auto-discovery (pyproject.toml walk, env vars) | yes |
| CLI (datamanifest list/download/path/add/remove/show/verify/init/where/migrate) | yes |
| Schema v1 _LANG namespace (read + write) | yes |
| Fetch ladder: own Python fetcher → shell template → URI | yes |
| Load ladder: own Python loader → manifest default → built-in | yes |
| Lossless round-trip of foreign _LANG.* subtrees | yes |
| v0 → v1 migration (datamanifest migrate) | yes |
| Portable storage model (store field + [_STORAGE] + platformdirs roots) | yes |
| Parameterized bindings ({ ref, args, kwargs } + $var substitution) | yes |
| Safe concurrent materialization (.tmp → atomic publish → .complete marker) | yes |
| Verify-once integrity (checksum only at fetch; .complete entry skips re-hash) | yes |
| Recursive canonical key ordering / byte-identity (normative reference) | yes |
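The materialization and verify-once rows in the table can be illustrated with a small standard-library sketch. It is an approximation under stated assumptions rather than the package's code: the marker is taken to be a sibling file named <dest>.complete, fetch is a caller-supplied function, and a real implementation would likely use a unique temporary name or a lock to be safe across processes.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def materialize(dest, fetch, expected_sha256=None):
    """Fetch into <dest>.tmp, verify, publish atomically, then write a marker."""
    dest = Path(dest)
    marker = dest.with_name(dest.name + ".complete")  # assumed marker location
    if marker.exists() and dest.exists():
        return dest  # verify-once: trust the marker and skip re-hashing
    tmp = dest.with_name(dest.name + ".tmp")
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(tmp)  # caller-supplied function that writes the payload to tmp
    actual = sha256_of(tmp)
    if expected_sha256 is not None and actual != expected_sha256:
        tmp.unlink()
        raise ValueError(f"checksum mismatch for {dest.name}: got {actual}")
    os.replace(tmp, dest)      # atomic rename when tmp and dest share a filesystem
    marker.write_text(actual)  # written last, so readers never see a partial file
    return dest
```

Writing the marker only after the rename means an interrupted download leaves at most a stray .tmp file, never a half-written dataset that looks complete.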
Storage model (spec-v1.1)
Behavior change from earlier releases. Prior releases stored all datasets under $XDG_CACHE_HOME/Datasets (typically ~/.cache/Datasets). As of spec-v1.1, the default data store resolves to platformdirs.user_data_dir("datamanifest")/Datasets (typically ~/.local/share/datamanifest/Datasets on Linux), and the cache store to platformdirs.user_cache_dir("datamanifest")/Datasets. If you have existing datasets at the old location, move them or pass an explicit datasets_folder to Database.
Each dataset entry carries an optional store field (default: data).
A [_STORAGE] table in the manifest lets you override the root directories per
store, per host (glob), or per profile:
[_STORAGE]
data = "~/data/Datasets"
cache = "~/.cache/Datasets"
repo = "datasets" # relative → <project_root>/datasets
[_STORAGE._HOST."login*.hpc.edu"]
data = "/scratch/$USER/Datasets" # $VAR and ~ are expanded
[_STORAGE._PROFILE.cluster]
data = "/work/proj/Datasets" # activated by DATAMANIFEST_PROFILE=cluster
[bigsim] # default store = "data" (persistent)
uri = "https://example.com/bigsim.nc"
[scratch_run]
store = "cache" # disposable, re-fetchable
uri = "https://example.com/scratch.nc"
[derived_table]
store = "repo" # lives under <project_root>/datasets
format = "csv"
Per-store precedence (highest first):
- DATAMANIFEST_<STORE>_DIR environment variable.
- [_STORAGE._PROFILE.<name>].<store>, when DATAMANIFEST_PROFILE is set.
- First [_STORAGE._HOST.<glob>].<store> where the glob matches the hostname.
- [_STORAGE].<store> base value.
- platformdirs default (data/cache) or <project_root>/datasets (repo).
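As a rough illustration of that precedence, the sketch below resolves one store root from a parsed [_STORAGE] table. Several details are assumptions: the store name is upper-cased to build the environment variable, host globs are matched with fnmatch in declaration order and skipped if they do not define the store, the platformdirs default is passed in as a plain string, and relative repo paths are not resolved against the project root. resolve_store_root is a hypothetical name.

```python
import fnmatch
import os


def resolve_store_root(store, storage, environ, hostname, default):
    """Return the root directory for `store`, highest-precedence source first."""
    # 1. DATAMANIFEST_<STORE>_DIR environment variable.
    value = environ.get(f"DATAMANIFEST_{store.upper()}_DIR")
    # 2. Active profile, when DATAMANIFEST_PROFILE is set.
    if value is None:
        profile = environ.get("DATAMANIFEST_PROFILE")
        if profile:
            value = storage.get("_PROFILE", {}).get(profile, {}).get(store)
    # 3. First host glob that matches the hostname and defines this store.
    if value is None:
        for pattern, table in storage.get("_HOST", {}).items():
            if fnmatch.fnmatch(hostname, pattern) and store in table:
                value = table[store]
                break
    # 4. Base value in [_STORAGE].
    if value is None:
        value = storage.get(store)
    # 5. Built-in default (stand-in for the platformdirs root).
    if value is None:
        value = default
    # $VAR and ~ are expanded, as in the manifest example.
    return os.path.expanduser(os.path.expandvars(value))
```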
Read resolution searches repo → data → cache and returns the first root
where <root>/<key> exists and has been successfully materialized (.complete
marker present). If no root qualifies, it falls back to the write path (the entry's selected store).
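Read resolution can be sketched as a short loop over the store roots. This is an illustration only: the roots are passed in as a plain dict, the .complete marker is assumed to be a sibling file named <key>.complete, and the function names are hypothetical.

```python
from pathlib import Path

READ_ORDER = ("repo", "data", "cache")


def is_materialized(path):
    """True when the path exists and its (assumed) sibling marker is present."""
    return path.exists() and path.with_name(path.name + ".complete").exists()


def resolve_read_path(key, roots, store="data"):
    """Return the first materialized <root>/<key>, else the write path."""
    for name in READ_ORDER:
        root = roots.get(name)
        if root is None:
            continue
        candidate = Path(root) / key
        if is_materialized(candidate):
            return candidate
    # Nothing materialized yet: fall back to where the entry would be written.
    return Path(roots[store]) / key
```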
Schema v1 — _LANG namespace
Schema v1 separates language-specific bindings into a dedicated _LANG namespace so that a single manifest can serve multiple language implementations without conflicts.
[_META]
schema = 1
[mydata._LANG.python]
fetcher = "mypkg.fetch:download_mydata" # entry-point ref; resolved via importlib
loader = "mypkg.load:load_mydata"
[_LANG.python.loaders]
csv = "mypkg.loaders:load_csv" # per-format default for this manifest
[mydata._LANG.julia]
fetcher = "MyPkg.fetch_mydata" # preserved verbatim; Python never touches it
Fetch ladder (per dataset, in order):
- Own _LANG.python.fetcher entry-point
- Own _LANG.shell.fetcher template
- Plain uri download
- Error: no source available

Load ladder (per dataset, in order):
- Own _LANG.python.loader entry-point
- Manifest [_LANG.python.loaders][format] default
- Built-in format default (csv, parquet, nc, …)
- Error
Delegation to peer CLIs is not yet implemented — the ladder stops at built-ins.
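The load ladder and the "pkg.mod:func" reference form can be sketched as follows. The code handles only plain-string refs, uses a one-entry stand-in for the built-in loader table, and the names resolve_ref and pick_loader are hypothetical; the package's actual resolution may differ in detail.

```python
import importlib


def resolve_ref(ref):
    """Turn an entry-point reference such as "pkg.mod:func" into a callable."""
    module_name, sep, attr = ref.partition(":")
    if not sep or not attr:
        raise ValueError(f"expected 'module:attribute', got {ref!r}")
    obj = importlib.import_module(module_name)
    for part in attr.split("."):
        obj = getattr(obj, part)
    return obj


# Stand-in for the built-in format defaults (illustrative, not the real table).
BUILTIN_LOADERS = {"json": "json:load"}


def pick_loader(entry, manifest_loaders, builtins=BUILTIN_LOADERS):
    """Walk the ladder: own loader, then manifest default, then built-in."""
    own = entry.get("_LANG", {}).get("python", {}).get("loader")
    if own:
        return resolve_ref(own)
    fmt = entry.get("format")
    for table in (manifest_loaders, builtins):
        if fmt in table:
            return resolve_ref(table[fmt])
    raise LookupError(f"no loader available for format {fmt!r}")
```

Because references are resolved through importlib, a manifest only ever names importable functions and never carries inline code.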
Parameterized bindings (spec-v1.1)
Python fetcher/loader values may be a { ref, args, kwargs } table instead
of a plain string, allowing the same entry-point to be reused across datasets
that differ only in arguments:
[esm_5x5._LANG.python.loader]
ref = "mypkg.load:esm"
kwargs = { grid = "5x5" }
[esm_10x10._LANG.python.loader]
ref = "mypkg.load:esm"
kwargs = { grid = "10x10" }
String values in args and kwargs undergo $var substitution before the
call. Available variables: $download_path (fetcher), $path (loader),
$key, $version, $doi, $format, $branch, $uri, $project_root.
A bare string fetcher/loader keeps the conventional keyword-argument call
and requires no capability upgrade.
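A minimal sketch of the substitution step, assuming string.Template semantics for $var (the package may implement substitution differently) and substituting only top-level string values. call_binding is a hypothetical helper that takes an already-resolved callable plus the table form of a binding.

```python
from string import Template


def substitute(value, variables):
    """Apply $var substitution to strings; leave other values untouched."""
    if isinstance(value, str):
        # safe_substitute leaves unknown $names in place instead of raising.
        return Template(value).safe_substitute(variables)
    return value


def call_binding(func, binding, variables):
    """Call `func` with the args/kwargs of a { ref, args, kwargs } binding."""
    args = [substitute(a, variables) for a in binding.get("args", [])]
    kwargs = {k: substitute(v, variables) for k, v in binding.get("kwargs", {}).items()}
    return func(*args, **kwargs)
```

For example, a binding with args = ["$path"] and kwargs = { grid = "5x5" } would call the loader with the resolved dataset path as its first argument and grid="5x5".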
Foreign _LANG.<other> subtrees (e.g. _LANG.julia) are preserved verbatim on every read→write cycle; Python never modifies them. Unknown structural tables (any _* key that Python does not recognise) are similarly passed through.
v0 → v1 migration
datamanifest migrate datasets.toml
Rewrites a v0 flat manifest in-place: moves per-dataset python=/callable=/loader= into [<ds>._LANG.python], moves [_LOADERS] into [_LANG.python.loaders], and adds [_META] schema = 1. Foreign keys are left verbatim. Reading a v0 file without migrating still works (legacy forms are accepted silently), but a one-time deprecation warning is logged.
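The shape of the rewrite can be shown on a parsed manifest (a plain dict, as tomllib would return). This sketch is not the migrate command's code: mapping python=/callable= to the fetcher key and copying loader= unchanged are assumptions, and python_includes= is left alone because its target is not specified here.

```python
import copy


def migrate_v0(manifest):
    """Return a schema-v1 copy of a parsed v0 manifest (illustrative only)."""
    out = copy.deepcopy(manifest)
    for name, entry in out.items():
        if name.startswith("_") or not isinstance(entry, dict):
            continue  # structural tables and foreign keys are left verbatim
        moved = {}
        for legacy in ("python", "callable"):
            if legacy in entry:
                moved["fetcher"] = entry.pop(legacy)  # assumed target key
        if "loader" in entry:
            moved["loader"] = entry.pop("loader")
        if moved:
            entry.setdefault("_LANG", {}).setdefault("python", {}).update(moved)
    if "_LOADERS" in out:
        loaders = out.pop("_LOADERS")
        out.setdefault("_LANG", {}).setdefault("python", {})["loaders"] = loaders
    out.setdefault("_META", {})["schema"] = 1
    return out
```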
Python adaptations
The Python port uses the same manifest format as DataManifest.jl. Schema v1 is the preferred form; schema v0 (flat fields) is still accepted for backwards compatibility.
v0 / legacy fields (still accepted on read):
- python= (or callable=): entry-point reference ("pkg.mod:func") resolved via importlib. The callable receives keyword arguments (download_path, project_root, entry, uri, key, version, doi, format, branch, requires_paths). No inline code execution (exec/eval) anywhere.
- loader=: format→ref mapping for the dataset's loader.
- python_includes=: list of directory paths prepended to sys.path during ref resolution.
- [_LOADERS]: manifest-wide format→ref loader defaults.
In schema v1 all of the above move into _LANG.python / _LANG.python.loaders. The datamanifest migrate command performs the conversion.
A single datasets.toml can be consumed by both datamanifest and DataManifest.jl: each reads the common fields and ignores the other's extension keys. The shared schema is documented at perrette/datamanifest.toml.
Conformance
This release targets spec-v1.1 of the shared datamanifest.toml schema.
Implemented capabilities:
| Capability | Status |
|---|---|
| lang-read: parse _LANG namespace on read | yes |
| lang-write: regenerate _LANG.python, preserve foreign _LANG.* verbatim | yes |
| shell-fetch: _LANG.shell.fetcher template in the fetch ladder | yes |
| storage: store field, [_STORAGE] block, platformdirs roots, read-order resolution | yes |
| binding-args: { ref, args, kwargs } table form with $var substitution | yes |
| byte-identity: recursive canonical key ordering (normative reference) | yes |
| delegation: peer-CLI runtime (delegate fetch/load to another tool) | not yet |
The conformance test suite (tests/test_conformance.py) downloads the pinned spec-v1.1 fixture tarball, verifies every file against a recorded per-file SHA-256 hash (tests/conformance_pin.toml), and runs only the fixtures whose capabilities are a subset of the above set, skipping the rest with a reason.
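The fixture-selection rule (run a fixture only when its capabilities are a subset of the implemented set) reduces to a set difference. In the sketch below the capability names come from the table above, while skip_reason and the way a fixture declares its required capabilities are hypothetical.

```python
IMPLEMENTED = {
    "lang-read", "lang-write", "shell-fetch",
    "storage", "binding-args", "byte-identity",
}


def skip_reason(required, implemented=IMPLEMENTED):
    """Return None if the fixture can run, else a human-readable skip reason."""
    missing = set(required) - set(implemented)
    if missing:
        return "missing capabilities: " + ", ".join(sorted(missing))
    return None
```

A fixture that needs delegation would therefore be skipped with a reason naming that capability, while one that needs only lang-read and storage would run.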
Related projects
The DataManifest family (one manifest, many languages):
- perrette/datamanifest.toml: the shared TOML schema spec; the common contract every implementation reads.
- awi-esc/DataManifest.jl: the Julia implementation this port is based on, sharing the same datasets.toml via the _LANG namespace.
Python alternatives (single-language; closest established tools for parts of what datamanifest does):
- fatiando/pooch: the closest established tool; covers the download / SHA-256 verification / unzip layer in pure Python (see How it compares to Pooch). datamanifest adds a load layer, a requires= dependency graph, and the cross-language manifest above.
- intake: catalog of data sources with drivers that load into pandas/xarray/dask; overlaps with the loader half of datamanifest.
- cthoyt/pystow: lightweight reproducible download + cached storage with an OS-appropriate data dir; code-driven rather than manifest-driven.
Acknowledgments
datamanifest is a Python port of awi-esc/DataManifest.jl, written by the same author (Mahé Perrette). The Python port was implemented with assistance from Anthropic's Claude.