Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Project description

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). Do not include the https://doi.org/ prefix.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
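Because the https://doi.org/ prefix should not be passed in, callers that receive DOIs in mixed forms may want to normalize them first. The following is a minimal sketch of that idea; `normalize_doi` and `doi_url` are hypothetical helpers written for illustration and are not part of datahugger.

```python
# Hypothetical helpers, not part of datahugger.
DOI_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")


def normalize_doi(raw: str) -> str:
    """Return a bare DOI such as "10.34894/0B7ZLK"."""
    doi = raw.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]  # drop the resolver prefix
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {raw!r}")
    return doi


def doi_url(raw: str) -> str:
    """Build the doi.org URL that corresponds to a DOI."""
    return "https://doi.org/" + normalize_doi(raw)
```

The bare DOI returned by `normalize_doi` is the form the resolve and resolve_many examples above pass in.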

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
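To illustrate how a list of (algorithm, value) pairs can be used, here is a minimal sketch that checks a local file against such a list with the standard library. It assumes the algorithm names are ones `hashlib.new` accepts (such as "sha256" or "md5") and that the values are hex-encoded; neither is confirmed by this document. `verify_checksums` is a hypothetical helper, not the validation routine datahugger uses internally.

```python
import hashlib
import pathlib


def verify_checksums(path: pathlib.Path, checksums: list[tuple[str, str]]) -> bool:
    """Return True if every (algorithm, value) pair matches the file's digest.

    Assumes hex-encoded values and hashlib-compatible algorithm names.
    An empty list is treated as "nothing to check" and returns True.
    """
    data = path.read_bytes()
    for algorithm, expected in checksums:
        digest = hashlib.new(algorithm.lower(), data).hexdigest()
        if digest != expected.lower():
            return False
    return True
```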

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
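As a rough illustration of the protocol, the sketch below implements all four methods on top of an in-memory iterable. `DualIterator` and `collect` are hypothetical names; the iterator datahugger returns is backed by a Rust runtime, whereas this stand-in never actually awaits anything.

```python
import asyncio
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Hypothetical iterator usable with both `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it: Iterator[T] = iter(items)

    # Synchronous protocol
    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)  # raises StopIteration when exhausted

    # Asynchronous protocol
    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration must signal the end with StopAsyncIteration.
            raise StopAsyncIteration from None


async def collect(source: DualIterator[T]) -> list[T]:
    """Drain the iterator with `async for`."""
    return [item async for item in source]
```

The same class can then be consumed as `list(DualIterator([1, 2, 3]))` or as `asyncio.run(collect(DualIterator([1, 2, 3])))`.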

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. The element type is the union FileEntry | DirEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# must run inside a coroutine, e.g. one started with asyncio.run()
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
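The "0 means no limit" convention can be sketched with `itertools.islice`, which treats a stop value of None as unbounded. `take` is a hypothetical helper for illustration only; how datahugger applies the limit internally, and how it treats negative values, is not stated here, so the sketch simply rejects them.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take(files: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Yield at most `limit` items; a limit of 0 means no limit."""
    if limit < 0:
        raise ValueError("limit must be >= 0")
    # `limit or None` maps 0 to None, which islice reads as "no upper bound".
    return islice(files, limit or None)
```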

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)
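Since size may be None, code that aggregates over a crawl has to handle files whose size is unknown. The sketch below shows one way to do that. It uses stand-in dataclasses that mirror the documented fields so that it runs without network access; they are not the classes shipped by datahugger, and `summarize` is a hypothetical helper.

```python
import pathlib
from dataclasses import dataclass, field
from typing import Iterable, Optional, Union


@dataclass
class FileEntry:  # stand-in with the documented fields, not the real class
    path_crawl_rel: pathlib.Path
    download_url: str
    size: Optional[int] = None  # documented as `int | None`
    checksum: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class DirEntry:  # stand-in with the documented fields, not the real class
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str


def summarize(entries: Iterable[Union[FileEntry, DirEntry]]) -> tuple[int, int, int, int]:
    """Return (files, dirs, known_bytes, files_without_size)."""
    files = dirs = known_bytes = unknown = 0
    for entry in entries:
        if isinstance(entry, DirEntry):
            dirs += 1
            continue
        files += 1
        if entry.size is None:
            unknown += 1  # size was not reported for this file
        else:
            known_bytes += entry.size
    return files, dirs, known_bytes, unknown
```

With the real library, the same function could presumably be fed `dataset.crawl()` directly, since it only relies on synchronous iteration and the documented fields.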

Crawl a dataset asynchronously

import asyncio

dataset = resolve("https://example.com/dataset")

async def main():
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))
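Because download_with_validation is a blocking call, code that already runs inside an asyncio event loop may prefer to hand it to a worker thread. The sketch below shows that general pattern with `asyncio.to_thread` and a stand-in function. `blocking_download` is hypothetical, and whether the real binding releases the GIL while downloading is not stated in this document, so the actual concurrency gain is an assumption.

```python
import asyncio
import time


def blocking_download(dst_dir: str) -> str:
    """Stand-in for a blocking call such as download_with_validation."""
    time.sleep(0.05)  # pretend to do network and disk work
    return dst_dir


async def main() -> str:
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so other tasks on the event loop can keep running in the meantime.
    return await asyncio.to_thread(blocking_download, "./data")


result = asyncio.run(main())
```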

Download files

No source distribution is available for this release.

Built Distributions

  • datahugger_ng-0.5.1-cp310-abi3-win_amd64.whl (3.0 MB): CPython 3.10+, Windows x86-64
  • datahugger_ng-0.5.1-cp310-abi3-musllinux_1_2_x86_64.whl (6.4 MB): CPython 3.10+, musllinux (musl 1.2+) x86-64
  • datahugger_ng-0.5.1-cp310-abi3-musllinux_1_2_i686.whl (5.9 MB): CPython 3.10+, musllinux (musl 1.2+) i686
  • datahugger_ng-0.5.1-cp310-abi3-musllinux_1_2_armv7l.whl (5.4 MB): CPython 3.10+, musllinux (musl 1.2+) ARMv7l
  • datahugger_ng-0.5.1-cp310-abi3-musllinux_1_2_aarch64.whl (6.6 MB): CPython 3.10+, musllinux (musl 1.2+) ARM64
  • datahugger_ng-0.5.1-cp310-abi3-manylinux_2_28_x86_64.whl (5.6 MB): CPython 3.10+, manylinux (glibc 2.28+) x86-64
  • datahugger_ng-0.5.1-cp310-abi3-manylinux_2_28_ppc64le.whl (6.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ppc64le
  • datahugger_ng-0.5.1-cp310-abi3-manylinux_2_28_i686.whl (5.4 MB): CPython 3.10+, manylinux (glibc 2.28+) i686
  • datahugger_ng-0.5.1-cp310-abi3-manylinux_2_28_armv7l.whl (5.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ARMv7l
  • datahugger_ng-0.5.1-cp310-abi3-manylinux_2_28_aarch64.whl (6.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ARM64
  • datahugger_ng-0.5.1-cp310-abi3-macosx_11_0_arm64.whl (3.5 MB): CPython 3.10+, macOS 11.0+ ARM64
