
Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.


Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# Or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). Do not include the https://doi.org/ prefix.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
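
Since the DOI must be passed without the https://doi.org/ prefix, user-supplied input may need cleaning first. The following is a minimal sketch of such a step; normalize_doi and its list of prefixes are hypothetical helpers for illustration and are not part of datahugger.

```python
def normalize_doi(raw: str) -> str:
    """Strip a leading resolver prefix so that only the bare DOI remains."""
    doi = raw.strip()
    # Hypothetical set of prefixes a user might paste; extend as needed.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI name starts with the directory indicator "10.".
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {raw!r}")
    return doi
```

The cleaned string can then be handed to resolve or resolve_many as shown in the example above.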

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
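
To illustrate how a list of (algorithm, value) pairs can be checked against file contents, here is a minimal sketch built on Python's hashlib. It assumes the algorithm names are ones hashlib recognizes (such as "sha256") and that values are hex digests; verify_checksums is a hypothetical helper, not the validation routine datahugger uses internally.

```python
import hashlib


def verify_checksums(data: bytes, checksum: list[tuple[str, str]]) -> bool:
    """Return True if every (algorithm, value) pair matches the given bytes.

    An empty list yields True because there is nothing to contradict.
    hashlib.new raises ValueError for algorithm names it does not know.
    """
    for algorithm, expected in checksum:
        digest = hashlib.new(algorithm.lower(), data).hexdigest()
        if digest != expected.lower():
            return False
    return True
```

For large files, feeding the hash object in chunks with update() would avoid loading the whole file into memory.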

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
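
As a rough illustration of the protocol, the toy class below wraps an ordinary iterable and satisfies both iteration styles. DualIterator and collect_async are hypothetical names; the real iterator is implemented in Rust and may behave differently in detail.

```python
import asyncio
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Toy iterator usable with both `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it: Iterator[T] = iter(items)

    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        # Propagates StopIteration when the underlying iterator is exhausted.
        return next(self._it)

    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration must signal the end with StopAsyncIteration.
            raise StopAsyncIteration from None


async def collect_async(source: DualIterator[T]) -> list[T]:
    """Drain the iterator with `async for` and return the items."""
    return [item async for item in source]
```

Calling list(DualIterator([1, 2, 3])) exercises the synchronous path, while asyncio.run(collect_async(DualIterator([1, 2, 3]))) exercises the asynchronous one.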

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry is a union type: either a DirEntry or a FileEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
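
Because the call blocks, code that already runs inside an asyncio event loop may prefer to push it onto a worker thread. The sketch below shows that pattern with asyncio.to_thread. The function blocking_download is a hypothetical stand-in, and whether the real download_with_validation is safe to call from a worker thread is an assumption that the documentation above does not confirm.

```python
import asyncio
import pathlib
import time


def blocking_download(dst_dir: pathlib.Path, limit: int = 0) -> str:
    """Hypothetical stand-in for a blocking download call."""
    time.sleep(0.05)  # simulate network and disk work
    return f"downloaded into {dst_dir} (limit={limit})"


async def main() -> tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Keeps counting while the download runs, showing the loop stays free.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    # Run the blocking function in a worker thread and await its result.
    result = await asyncio.to_thread(blocking_download, pathlib.Path("./data"), 5)
    beat.cancel()
    return result, ticks
```

Running asyncio.run(main()) returns the stand-in's message together with the number of heartbeat ticks observed while the download was in progress.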

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)
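
The same isinstance dispatch can be used to summarize a crawl. The sketch below uses local stand-in dataclasses that mirror the documented fields so that it runs without network access; summarize is a hypothetical helper, and with the real library the stand-ins would be replaced by the package's own FileEntry and DirEntry.

```python
import pathlib
from dataclasses import dataclass, field
from typing import Iterable, Optional, Union


@dataclass
class FileEntry:  # stand-in mirroring the documented fields
    path_crawl_rel: pathlib.Path
    download_url: str
    size: Optional[int] = None
    checksum: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class DirEntry:  # stand-in mirroring the documented fields
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str


def summarize(entries: Iterable[Union[FileEntry, DirEntry]]) -> tuple[int, int, int]:
    """Return (file_count, dir_count, known_bytes) for a stream of entries."""
    files = dirs = known_bytes = 0
    for entry in entries:
        if isinstance(entry, FileEntry):
            files += 1
            # size may be None when the repository does not report it.
            known_bytes += entry.size or 0
        elif isinstance(entry, DirEntry):
            dirs += 1
    return files, dirs, known_bytes
```

Because summarize only iterates once, it can consume the crawl stream directly without first collecting the entries into a list.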

Crawl a dataset asynchronously

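import asyncio

dataset = resolve("https://example.com/dataset")

async def main():
    # `async for` is only valid inside a coroutine.
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())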

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))

