Python port of DataManifest.jl — declare and manage data dependencies for scientific projects

datamanifest

Keep track of datasets used in a scientific project.

datamanifest provides a simple way to declare data dependencies — URLs, git repositories, checksums, formats — in a datasets.toml file, and handles download, verification, extraction, and loading. It is a Python port of DataManifest.jl (same author), with the same manifest format and feature surface.

Installation

pip install datamanifestpy

With optional loader backends:

pip install "datamanifestpy[csv]"       # pandas CSV
pip install "datamanifestpy[parquet]"   # pandas + pyarrow
pip install "datamanifestpy[nc]"        # xarray + netcdf4
pip install "datamanifestpy[yaml]"      # pyyaml
pip install "datamanifestpy[all]"       # all of the above

API quickstart

import datamanifest

# Add a dataset (registers + downloads + auto-fills sha256)
datamanifest.add(
    "https://github.com/jesstierney/lgmDA/archive/refs/tags/v2.1.zip",
    name="jesstierney/lgmDA",
    extract=True,
)

# Resolve the on-disk path
path = datamanifest.get_dataset_path("jesstierney/lgmDA")

# Download and load in one step
ds = datamanifest.load_dataset("my_nc_entry")  # returns xarray.Dataset for nc format

# Explicit database (no pyproject.toml / env-var lookup)
db = datamanifest.Database("datasets.toml", "my-data-folder")
datamanifest.add(db, "https://zenodo.org/record/.../file.csv")
path = datamanifest.get_dataset_path(db, "file")

The module-level functions (add, download_dataset, load_dataset, get_dataset_path, …) look up a process-wide default Database via pyproject.toml discovery, the DATAMANIFEST_TOML / DATASETS_TOML environment variables, or a datasets.toml / datamanifest.toml file in the working tree. Pass an explicit db as the first argument to bypass auto-discovery.
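
The lookup described above can be pictured with a short standard-library sketch. It is not datamanifest's actual implementation: the helper name `find_manifest`, the precedence (environment variables first, then an upward directory walk), and the use of pyproject.toml purely as a project-root marker are all assumptions made for illustration.

```python
import os
from pathlib import Path

ENV_VARS = ("DATAMANIFEST_TOML", "DATASETS_TOML")
MANIFEST_NAMES = ("datasets.toml", "datamanifest.toml")


def find_manifest(start, environ=None):
    """Return a manifest path, or None if nothing is found (hypothetical helper)."""
    environ = os.environ if environ is None else environ

    # 1. An environment variable, if set, wins (assumed precedence).
    for var in ENV_VARS:
        value = environ.get(var)
        if value:
            return Path(value)

    # 2. Otherwise walk from `start` up towards the filesystem root.
    start = Path(start).resolve()
    for folder in (start, *start.parents):
        for name in MANIFEST_NAMES:
            candidate = folder / name
            if candidate.is_file():
                return candidate
        # Assumption: a pyproject.toml marks the project root, so stop climbing.
        if (folder / "pyproject.toml").is_file():
            return None
    return None
```

Passing an explicit Database, as in the quickstart, avoids any dependence on the current working directory or environment, which can matter in batch jobs and tests.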

CLI usage

datamanifest COMMAND [OPTIONS]
Command                                         Description
list [--present|--missing|--all]                List datasets; default shows present first, then missing
download [NAME ...] [--all] [--overwrite]       Download specific datasets or all of them
path NAME                                       Print the resolved on-disk path (composable in shell)
add URI [--name N] [--no-download] [--extract]  Register and (by default) download a dataset
remove NAME [--keep-cache]                      Delete an entry, optionally preserving cached files
show NAME                                       Print full entry detail in TOML style
verify [NAME ...]                               Re-check sha256 checksums; exits nonzero on any mismatch
init [--folder PATH] [--force]                  Create a fresh datasets.toml in the current directory
where                                           Print active datasets_toml and datasets_folder paths

Examples:

# Set up a new project
datamanifest init

# Add and download a dataset
datamanifest add "https://zenodo.org/record/.../file.zip" --extract

# Use the path in a shell pipeline
python analysis.py --data "$(datamanifest path file)"

# Verify all checksums before a paper submission
datamanifest verify

# Where is the active manifest?
datamanifest where
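
For readers curious what a sha256 re-check involves, here is a minimal sketch using only hashlib. The helper names `sha256_of` and `matches` are hypothetical, and nothing here is taken from datamanifest's source; it only illustrates the general technique of comparing a file's digest against a recorded value.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True when the file's digest equals the recorded one (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A script that wraps such a check would typically exit with a nonzero status on the first mismatch, which is the behavior the verify command documents.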

Features

Feature Supported
HTTP / HTTPS download with progress yes
Partial-download resume (Range header) yes
git clone (git://, ssh+git://, *.git) yes
SSH / rsync (ssh://, sshfs://, rsync://) yes
Local file copy (file://) yes
Multi-URI batch entries (uris=) yes
SHA-256 checksum verification + auto-fill yes
ZIP / tar / tar.gz extraction yes
requires= dependency graph (topological order) yes
Shell template hook (shell=) yes
Python entry-point hook (python=) yes
Named + default loaders (csv, parquet, nc, json, yaml, toml, zip, tar) yes
TOML manifest round-trip (read tomllib, write tomli_w) yes
Project-root auto-discovery (pyproject.toml walk, env vars) yes
CLI (datamanifest list/download/path/add/remove/show/verify/init/where) yes
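
The requires= row refers to processing datasets in topological order, so that each entry is handled after the entries it depends on. The sketch below shows that ordering with the standard library's graphlib. The dataset names and the `download_order` helper are hypothetical, and datamanifest may implement the ordering differently.

```python
from graphlib import TopologicalSorter

# Hypothetical entries: dataset name -> names it lists under requires=
requires = {
    "raw_obs": [],
    "grid": [],
    "regridded": ["raw_obs", "grid"],
    "summary": ["regridded"],
}


def download_order(requires):
    """Return names so that every dataset comes after its requirements."""
    # TopologicalSorter treats each value as the predecessors of its key
    # and raises graphlib.CycleError if the requirements form a cycle.
    return list(TopologicalSorter(requires).static_order())


order = download_order(requires)
```

In this example "summary" always comes last and "regridded" always follows both "raw_obs" and "grid"; the relative order of the two independent entries is not guaranteed.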

Python adaptations

The Python port uses the same datasets.toml format as DataManifest.jl. Three keys are specific to the Python port:

  • python= replaces julia=: an entry-point reference ("pkg.mod:func") resolved via importlib. The callable receives keyword arguments (download_path, project_root, entry, uri, key, version, doi, format, branch, requires_paths). The port never executes inline code: there is no exec or eval anywhere.
  • callable= is an alias for python= accepted on read and normalized to python= on write. Intended for single-language projects that want a language-agnostic key.
  • python_includes= is a list of directory paths prepended to sys.path during loader resolution (replaces julia_modules).
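
Resolving a "pkg.mod:func" reference with importlib, as the python= bullet describes, can be sketched as follows. The names `resolve_entry_point` and `example_hook` are hypothetical, and the real resolver may handle errors or dotted attributes differently.

```python
import importlib


def resolve_entry_point(reference):
    """Turn "pkg.mod:func" into the callable it names (illustrative helper)."""
    module_name, sep, attr_path = reference.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'pkg.mod:func', got {reference!r}")
    target = importlib.import_module(module_name)
    # Allow dotted attributes such as "pkg.mod:Class.method".
    for part in attr_path.split("."):
        target = getattr(target, part)
    if not callable(target):
        raise TypeError(f"{reference!r} does not name a callable")
    return target


def example_hook(download_path, uri, **others):
    """Hypothetical hook: uses two keyword arguments and ignores the rest."""
    return f"would fetch {uri} into {download_path}"
```

Because the callable is invoked with keyword arguments, a hook that accepts `**others` keeps working if it is only interested in a few of them.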

A single datasets.toml can be consumed by both tools: each reads the common fields and ignores the other's extension keys. The shared schema is documented at perrette/datamanifest.toml.

Acknowledgments

datamanifest is a Python port of awi-esc/DataManifest.jl, written by the same author (Mahé Perrette). The Python port was implemented with assistance from Anthropic's Claude.
