Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Project description

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). Do not include the https://doi.org/ prefix.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
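Because resolve expects a bare DOI, input that may arrive as a full URL has to be normalized first. The helper below is a minimal sketch of that step. `strip_doi_prefix` is a hypothetical name and not part of datahugger, and the set of prefixes it handles is an assumption.

```python
# Hypothetical helper (not part of datahugger): reduce a DOI URL to a bare DOI.
# The prefixes listed here are an assumption about what callers may pass in.
_DOI_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")


def strip_doi_prefix(value: str) -> str:
    """Return the bare DOI, dropping a leading resolver URL or 'doi:' scheme."""
    value = value.strip()
    lowered = value.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            return value[len(prefix):]
    # No known prefix: assume the value is already a bare DOI.
    return value


print(strip_doi_prefix("https://doi.org/10.34894/0B7ZLK"))
```

The returned string can then be passed to resolve or resolve_many as shown in the example above.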

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
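To illustrate how a list of (algorithm, value) pairs can be checked against a file on disk, here is a minimal sketch using only the standard library. `verify_checksums` is a hypothetical helper, not part of datahugger, and it assumes the algorithm names are ones that `hashlib.new` accepts (such as "sha256"); the names the library actually reports may differ.

```python
import hashlib
import pathlib
import tempfile


def verify_checksums(path: pathlib.Path, checksums: list[tuple[str, str]]) -> bool:
    """Return True if every (algorithm, value) pair matches the file's digest.

    Hypothetical helper; an empty checksum list is treated as a pass.
    """
    data = path.read_bytes()  # fine for a sketch; stream in chunks for large files
    for algorithm, expected in checksums:
        digest = hashlib.new(algorithm, data).hexdigest()
        if digest != expected.lower():
            return False
    return True


# Demo on a temporary file whose digest is computed locally.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    expected = hashlib.sha256(b"hello").hexdigest()
    print(verify_checksums(sample, [("sha256", expected)]))
```

This check is independent of download_with_validation, which performs its own validation.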

ZipEntry

Represents a ZIP archive entry in the dataset. A ZipEntry is a container object that describes a downloadable archive file and the files contained within it.

@dataclass
class ZipEntry(Entry):
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    version: str | None
    creation_date: str | None
    last_modification_date: str | None
    files: list[FileInZipEntry]

Fields

  • download_url URL from which the ZIP archive can be downloaded.

  • size Size of the ZIP archive in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")) used to verify archive integrity.

  • version Optional version identifier of the archive.

  • creation_date Optional creation timestamp of the archive.

  • last_modification_date Optional last modification timestamp of the archive.

  • files List of files contained inside the ZIP archive. Each entry describes a file within the archive (path, size, checksum, and optional metadata such as mimetype).

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
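As an illustration of the protocol, the following pure-Python sketch shows one object that satisfies both the synchronous and the asynchronous iterator interfaces. `DualIterator` and `collect` are hypothetical names for this example only; the object returned by datahugger is backed by Rust and is not implemented this way.

```python
import asyncio
from typing import Generic, Iterable, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Toy iterator usable with both `for` and `async for` (illustration only)."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it = iter(items)

    # Synchronous iterator protocol
    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)

    # Asynchronous iterator protocol
    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration ends with StopAsyncIteration, not StopIteration.
            raise StopAsyncIteration from None


async def collect(it):
    """Drain an async iterator into a list."""
    return [item async for item in it]


print(list(DualIterator([1, 2, 3])))
print(asyncio.run(collect(DualIterator([1, 2, 3]))))
```

Both forms consume the same underlying stream, so a given instance should be iterated in one style only.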

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry | ZipEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0, includes=None, excludes=None,
    ) -> int: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry | ZipEntry]

Returns a stream of dataset entries, each of which is a FileEntry, DirEntry, or ZipEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# async for must run inside a coroutine (an async def function)
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as FileEntry, DirEntry, or ZipEntry objects.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0, includes=None, excludes=None,
) -> int

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
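The "0 means no limit" convention can be sketched in plain Python as follows. `apply_limit` is a hypothetical helper for illustration and says nothing about how the Rust implementation applies the limit internally.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def apply_limit(items: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Yield at most `limit` items; a limit of 0 yields everything."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # islice treats a stop value of None as "no upper bound".
    return islice(items, limit if limit > 0 else None)


print(list(apply_limit(["a.csv", "b.csv", "c.csv"], limit=2)))
print(list(apply_limit(["a.csv", "b.csv", "c.csv"])))
```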

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

from datahugger import DirEntry, FileEntry, ZipEntry, resolve

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)
    elif isinstance(entry, ZipEntry):
        print("Zip:", entry)

Crawl a dataset asynchronously

import asyncio

dataset = resolve("https://example.com/dataset")

async def main():
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())

Download a dataset

import pathlib

from datahugger import resolve

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • datahugger_ng-0.6.4-cp310-abi3-win_amd64.whl (3.7 MB): CPython 3.10+, Windows x86-64
  • datahugger_ng-0.6.4-cp310-abi3-musllinux_1_2_x86_64.whl (7.2 MB): CPython 3.10+, musllinux (musl 1.2+) x86-64
  • datahugger_ng-0.6.4-cp310-abi3-musllinux_1_2_i686.whl (6.7 MB): CPython 3.10+, musllinux (musl 1.2+) i686
  • datahugger_ng-0.6.4-cp310-abi3-musllinux_1_2_armv7l.whl (6.1 MB): CPython 3.10+, musllinux (musl 1.2+) ARMv7l
  • datahugger_ng-0.6.4-cp310-abi3-musllinux_1_2_aarch64.whl (7.3 MB): CPython 3.10+, musllinux (musl 1.2+) ARM64
  • datahugger_ng-0.6.4-cp310-abi3-manylinux_2_28_x86_64.whl (6.4 MB): CPython 3.10+, manylinux (glibc 2.28+) x86-64
  • datahugger_ng-0.6.4-cp310-abi3-manylinux_2_28_ppc64le.whl (7.1 MB): CPython 3.10+, manylinux (glibc 2.28+) ppc64le
  • datahugger_ng-0.6.4-cp310-abi3-manylinux_2_28_i686.whl (6.2 MB): CPython 3.10+, manylinux (glibc 2.28+) i686
  • datahugger_ng-0.6.4-cp310-abi3-manylinux_2_28_armv7l.whl (5.9 MB): CPython 3.10+, manylinux (glibc 2.28+) ARMv7l
  • datahugger_ng-0.6.4-cp310-abi3-manylinux_2_28_aarch64.whl (7.0 MB): CPython 3.10+, manylinux (glibc 2.28+) ARM64
  • datahugger_ng-0.6.4-cp310-abi3-macosx_11_0_arm64.whl (4.2 MB): CPython 3.10+, macOS 11.0+ ARM64

File details

File hashes

Hashes for datahugger_ng-0.6.4-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 5ee6107b9f991ac640bf9daa4a9b05ee9377fb389da84654bacefc0eb2423935
MD5 de803eca866635ea8b1e3f2b8547ac56
BLAKE2b-256 c0f731ad3d78fa5443168d51c9cacae554a929aba3e564fc4e8a2775f034cb4c

Hashes for datahugger_ng-0.6.4-cp310-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 d55cf0ba1e069155703dc1c46d2ffca6cabb2f143683d4fb4a025448da1fc1fc
MD5 ff15194d894592b6b9a6b670b0357b42
BLAKE2b-256 20a852765aa1215f4c4c2ed8e21fd86bb0670f4dd196c29c24c36d264f33b67f

Hashes for datahugger_ng-0.6.4-cp310-abi3-musllinux_1_2_i686.whl
Algorithm Hash digest
SHA256 cf49e01b0960e7d1afa4b8c2142fa6db665a9e7f6f4db3cfe58a911c8ddbf082
MD5 543ac07417682e25fb3ee12899bd5701
BLAKE2b-256 700c1c77b668f14faff398ba452724d50942768c81a3991fa669140443ed805e

Hashes for datahugger_ng-0.6.4-cp310-abi3-musllinux_1_2_armv7l.whl
Algorithm Hash digest
SHA256 b198316ee7c8d43bd43891adbf4e1dbf3a629c212b254a18e81a1aebc373e867
MD5 e4b811ddd352e48811009ff4d2fdb2b0
BLAKE2b-256 3c1309830f959e2711d0ad5181b1f83e41b1e9b4424be6fc0fdba0aa9f83d7b4

Hashes for datahugger_ng-0.6.4-cp310-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 b4c79364e9f3a6b3ad9fb33479bf0b6538d91c79e70e9bf7a50dd039dc0259ad
MD5 f18a1b9cad57a3d16100545937dc5ca0
BLAKE2b-256 01209c69bb71c323b3ecd259fd7b6edf39d8f8cf2e6976180a08662dc1f2b494

Hashes for datahugger_ng-0.6.4-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 bdbda302da6b56d860cb24e5d6c859f7e3dfdf02f6b661813abf29ed235092e3
MD5 d34037ad75c1a7d377fc51c52a12a7c8
BLAKE2b-256 f0820cdfd9b8001bf598addcf99e4f4b4405498a8288cf42a427e94cab20e93b

Hashes for datahugger_ng-0.6.4-cp310-abi3-manylinux_2_28_ppc64le.whl
Algorithm Hash digest
SHA256 c4d644d560a171f669de1bddd17f9010c7c3e55be325821054964c59eb8b546b
MD5 618e914beab32cf7edb645a27220d791
BLAKE2b-256 e913e7b436985ca8339c69f77f75e2f760105d849ce2bebef1f6a14985f4775b

Hashes for datahugger_ng-0.6.4-cp310-abi3-manylinux_2_28_i686.whl
Algorithm Hash digest
SHA256 cd944bfd2a5f441db77fc5a9b6ec3378fcb2a134b2fe3be2185d8a9e26f4f018
MD5 9a20eb3b2a95ba003361c2d5d31a4aaa
BLAKE2b-256 e957466af137e8800c0a2625bfc014e9e44f8344e8c2fceddec7e8b3a0238838

Hashes for datahugger_ng-0.6.4-cp310-abi3-manylinux_2_28_armv7l.whl
Algorithm Hash digest
SHA256 e4d6c1979ab9e231691edb6dd7bd78feaca8465a2b7db3bdc8a6306f2a2af715
MD5 e9cb831fe282a9c791f545fa67b4e51f
BLAKE2b-256 ff4b2bf808d766ad0548044141dee7f634d599003c5effd3e535f84120ec805d

Hashes for datahugger_ng-0.6.4-cp310-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 c056291185efa131ae74009cbbfd9e8dd23f98d6b32c2efe7661f7e92691c3b0
MD5 d4a1a3afd65c9df0ee878a0204a82c1c
BLAKE2b-256 a3eac02b14c931cbdc6d050bc0001e22fee567758d345cb71d0cbc5778ed019c

Hashes for datahugger_ng-0.6.4-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 1633b0e63fdb64b789643e73150c409e3d7f2d9134c9445a416160c32362a579
MD5 b7ef05998e9a1e4faddb73b13a8b08fa
BLAKE2b-256 ae0b25d575196d995dafd33eb9b486b2cd2d9ba46e6aea9c8ed3dfa0a1f15ec5
