Python port of DataManifest.jl — declare and manage data dependencies for scientific projects
datamanifest
Keep track of datasets used in a scientific project.
datamanifest provides a simple way to declare data dependencies — URLs, git repositories, checksums, formats — in a datasets.toml file, and handles download, verification, extraction, and loading. It is a Python port of DataManifest.jl (same author), with the same manifest format and feature surface.
Installation
pip install datamanifest
With optional loader backends:
pip install "datamanifest[csv]" # pandas CSV
pip install "datamanifest[parquet]" # pandas + pyarrow
pip install "datamanifest[nc]" # xarray + netcdf4
pip install "datamanifest[yaml]" # pyyaml
pip install "datamanifest[all]" # all of the above
API quickstart
import datamanifest
# Add a dataset (registers + downloads + auto-fills sha256)
datamanifest.add(
"https://github.com/jesstierney/lgmDA/archive/refs/tags/v2.1.zip",
name="jesstierney/lgmDA",
extract=True,
)
# Resolve the on-disk path
path = datamanifest.get_dataset_path("jesstierney/lgmDA")
# Download and load in one step
ds = datamanifest.load_dataset("my_nc_entry") # returns xarray.Dataset for nc format
# Explicit database (no pyproject.toml / env-var lookup)
db = datamanifest.Database("datasets.toml", "my-data-folder")
datamanifest.add(db, "https://zenodo.org/record/.../file.csv")
path = datamanifest.get_dataset_path(db, "file")
The module-level functions (add, download_dataset, load_dataset, get_dataset_path, …) look up a process-wide default Database via pyproject.toml discovery, the DATAMANIFEST_TOML / DATASETS_TOML environment variables, or a datasets.toml / datamanifest.toml file in the working tree. Pass an explicit db as the first argument to bypass auto-discovery.
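To make the lookup described above concrete, here is a minimal sketch of how such auto-discovery could work. It is not the library's implementation: the helper name `find_manifest`, the precedence order (environment variables first, then an upward directory walk), and the use of `pyproject.toml` as a project-root marker that stops the walk are all assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Optional

# Names taken from the README; the precedence between them is an assumption.
ENV_VARS = ("DATAMANIFEST_TOML", "DATASETS_TOML")
MANIFEST_NAMES = ("datasets.toml", "datamanifest.toml")


def find_manifest(start: Optional[Path] = None) -> Optional[Path]:
    """Hypothetical helper: locate a manifest file for the current project."""
    # 1. An environment variable, if set, is used as-is.
    for var in ENV_VARS:
        value = os.environ.get(var)
        if value:
            return Path(value)

    # 2. Otherwise walk from the start directory up to the filesystem root.
    here = (start or Path.cwd()).resolve()
    for folder in (here, *here.parents):
        for name in MANIFEST_NAMES:
            candidate = folder / name
            if candidate.is_file():
                return candidate
        # Assumption: pyproject.toml marks the project root, so stop there.
        if (folder / "pyproject.toml").is_file():
            return None
    return None
```

Passing an explicit Database, as in the quickstart, sidesteps this kind of search entirely, which is usually preferable in tests and batch jobs.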
CLI usage
datamanifest COMMAND [OPTIONS]
| Command | Description |
|---|---|
| list [--present\|--missing\|--all] | List datasets; default shows present first, then missing |
| download [NAME ...] [--all] [--overwrite] | Download specific datasets or all of them |
| path NAME | Print the resolved on-disk path (composable in shell) |
| add URI [--name N] [--no-download] [--extract] | Register and (by default) download a dataset |
| remove NAME [--keep-cache] | Delete an entry, optionally preserving cached files |
| show NAME | Print full entry detail in TOML style |
| verify [NAME ...] | Re-check sha256 checksums; exits nonzero on any mismatch |
| init [--folder PATH] [--force] | Create a fresh datasets.toml in the current directory |
| where | Print active datasets_toml and datasets_folder paths |
Examples:
# Set up a new project
datamanifest init
# Add and download a dataset
datamanifest add "https://zenodo.org/record/.../file.zip" --extract
# Use the path in a shell pipeline
python analysis.py --data "$(datamanifest path file)"
# Verify all checksums before a paper submission
datamanifest verify
# Where is the active manifest?
datamanifest where
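For readers wondering what a checksum re-check like `datamanifest verify` involves, the following is a minimal sketch using only the standard library. The function names `sha256_of` and `verify` are hypothetical and are not part of the datamanifest API; the sketch only illustrates the general technique of hashing a file in chunks and comparing the result with a recorded digest.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str) -> bool:
    """Hypothetical check: compare a file's digest with a recorded one."""
    return sha256_of(path) == expected.strip().lower()
```

Reading in chunks keeps memory use flat for large archives, which matters for multi-gigabyte scientific datasets.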
Features
| Feature | Supported |
|---|---|
| HTTP / HTTPS download with progress | yes |
| Partial-download resume (Range header) | yes |
| git clone (git://, ssh+git://, *.git) | yes |
| SSH / rsync (ssh://, sshfs://, rsync://) | yes |
| Local file copy (file://) | yes |
| Multi-URI batch entries (uris=) | yes |
| SHA-256 checksum verification + auto-fill | yes |
| ZIP / tar / tar.gz extraction | yes |
| requires= dependency graph (topological order) | yes |
| Shell template hook (shell=) | yes |
| Python entry-point hook (python=) | yes |
| Named + default loaders (csv, parquet, nc, json, yaml, toml, zip, tar) | yes |
| TOML manifest round-trip (read tomllib, write tomli_w) | yes |
| Project-root auto-discovery (pyproject.toml walk, env vars) | yes |
| CLI (datamanifest list/download/path/add/remove/show/verify/init/where) | yes |
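The requires= row mentions a dependency graph processed in topological order. As a rough illustration of that idea (not the library's code), the standard-library `graphlib` module can order entries so that each dataset comes after everything it requires. The dataset names and the `download_order` helper below are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical entries: dataset name -> names listed under its requires= key.
requires = {
    "raw": [],
    "grid": [],
    "regridded": ["raw", "grid"],
    "summary": ["regridded"],
}


def download_order(requires):
    """Return names so that each dataset follows everything it requires."""
    # TopologicalSorter expects a mapping of node -> predecessors, which is
    # exactly the shape of a requires= table. It raises CycleError on cycles.
    return list(TopologicalSorter(requires).static_order())


if __name__ == "__main__":
    try:
        print(download_order(requires))
    except CycleError as exc:
        print("circular requires= chain:", exc.args[1])
```

`graphlib` is available from Python 3.9 onward; the relative order of independent entries such as "raw" and "grid" is not guaranteed.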
Python adaptations
The Python port uses the same datasets.toml format as DataManifest.jl. Two fields differ:
- python= replaces julia=: an entry-point reference ("pkg.mod:func") resolved via importlib. The callable receives the keyword arguments download_path, project_root, entry, uri, key, version, doi, format, branch, and requires_paths. No inline code execution (exec/eval) anywhere. callable= is an alias for python=, accepted on read and normalized to python= on write. It is intended for single-language projects that want a language-agnostic key.
- python_includes= is a list of directory paths prepended to sys.path during loader resolution (replaces julia_modules).
A single datasets.toml can be consumed by both tools: each reads the common fields and ignores the other's extension keys. The shared schema is documented at perrette/datamanifest.toml.
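To illustrate the python= entry-point mechanism described above, here is a minimal sketch of resolving a "pkg.mod:func" string with importlib, plus a hook that accepts the documented keyword arguments. Both `resolve_entry_point` and `example_hook` are hypothetical names for illustration; the actual resolver in datamanifest may differ in details such as error handling and dotted attribute support.

```python
import importlib
from typing import Any, Callable


def resolve_entry_point(ref: str) -> Callable[..., Any]:
    """Hypothetical resolver for a "pkg.mod:func" reference."""
    module_name, sep, attr_path = ref.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'pkg.mod:func', got {ref!r}")
    obj = importlib.import_module(module_name)
    # Assumption: dotted attributes such as "pkg.mod:Class.method" are allowed.
    for attr in attr_path.split("."):
        obj = getattr(obj, attr)
    if not callable(obj):
        raise TypeError(f"{ref!r} does not refer to a callable")
    return obj


def example_hook(*, download_path, project_root, **context):
    """Hypothetical hook: uses two documented keywords, ignores the rest."""
    # The remaining keywords (entry, uri, key, version, ...) land in context.
    return f"would populate {download_path} for project {project_root}"
```

Accepting `**context` in a hook is a defensive choice: the function keeps working if it is called with more keyword arguments than it uses. Because only importable names are resolved, nothing in the manifest is ever passed to exec or eval.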
Related projects
- awi-esc/DataManifest.jl: the Julia implementation this port is based on.
- perrette/datamanifest.toml: the shared TOML schema spec consumed by both implementations.
Acknowledgments
datamanifest is a Python port of awi-esc/DataManifest.jl, written by the same author (Mahé Perrette). The Python port was implemented with assistance from Anthropic's Claude.