Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). Do not include the https://doi.org/ prefix.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
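Since the resolver expects a bare DOI, input that arrives as a full link has to be trimmed first. The following is a minimal sketch of such a normalization step. `normalize_doi` is a hypothetical helper and not part of datahugger; it assumes only the common `doi.org` / `dx.doi.org` URL forms and the `doi:` prefix.

```python
def normalize_doi(raw: str) -> str:
    """Return the bare DOI (e.g. "10.34894/0B7ZLK") from common input forms.

    Hypothetical helper for illustration; not part of the datahugger API.
    """
    doi = raw.strip()
    lowered = doi.lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if lowered.startswith(prefix):
            # Drop the resolver prefix, keep the DOI's original casing.
            doi = doi[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {raw!r}")
    return doi
```

The result could then be passed to `doi_resolver.resolve(...)` as in the example above.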

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
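As a pure-Python illustration of the protocol, the sketch below wraps an ordinary iterable so that it works with both `for` and `async for`. `DualIterator` and `collect` are hypothetical names; the real iterator returned by datahugger is backed by Rust and is not implemented this way.

```python
import asyncio
from typing import Generic, Iterable, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Illustrative object usable with both `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it = iter(items)

    # Synchronous protocol
    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)

    # Asynchronous protocol
    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration signals exhaustion with StopAsyncIteration.
            raise StopAsyncIteration from None


async def collect(stream: DualIterator[T]) -> list[T]:
    """Drain the stream through the asynchronous protocol."""
    return [item async for item in stream]
```

Both protocols share one underlying cursor here, so a given instance should be consumed in one style only.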

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry has a union type: it is either a DirEntry or a FileEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# `async for` is only valid inside a coroutine (an `async def` function)
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
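The "0 means no limit" convention can be sketched in a few lines. `take_limited` is a hypothetical helper that applies the same rule to any iterable; how datahugger itself treats negative values is not stated here, so the sketch simply rejects them as an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_limited(items: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Return an iterator over at most `limit` items; 0 means everything."""
    if limit < 0:
        # Assumption: negative limits are treated as a caller error.
        raise ValueError("limit must be >= 0")
    return islice(items, limit) if limit else iter(items)
```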

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)

Crawl a dataset asynchronously

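import asyncio

dataset = resolve("https://example.com/dataset")

async def main():
    # `async for` is only valid inside a coroutine
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())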

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • datahugger_ng-0.5.3-cp310-abi3-win_amd64.whl (3.0 MB): CPython 3.10+, Windows x86-64
  • datahugger_ng-0.5.3-cp310-abi3-musllinux_1_2_x86_64.whl (6.4 MB): CPython 3.10+, musllinux (musl 1.2+) x86-64
  • datahugger_ng-0.5.3-cp310-abi3-musllinux_1_2_i686.whl (5.9 MB): CPython 3.10+, musllinux (musl 1.2+) i686
  • datahugger_ng-0.5.3-cp310-abi3-musllinux_1_2_armv7l.whl (5.4 MB): CPython 3.10+, musllinux (musl 1.2+) ARMv7l
  • datahugger_ng-0.5.3-cp310-abi3-musllinux_1_2_aarch64.whl (6.6 MB): CPython 3.10+, musllinux (musl 1.2+) ARM64
  • datahugger_ng-0.5.3-cp310-abi3-manylinux_2_28_x86_64.whl (5.6 MB): CPython 3.10+, manylinux (glibc 2.28+) x86-64
  • datahugger_ng-0.5.3-cp310-abi3-manylinux_2_28_ppc64le.whl (6.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ppc64le
  • datahugger_ng-0.5.3-cp310-abi3-manylinux_2_28_i686.whl (5.4 MB): CPython 3.10+, manylinux (glibc 2.28+) i686
  • datahugger_ng-0.5.3-cp310-abi3-manylinux_2_28_armv7l.whl (5.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ARMv7l
  • datahugger_ng-0.5.3-cp310-abi3-manylinux_2_28_aarch64.whl (6.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ARM64
  • datahugger_ng-0.5.3-cp310-abi3-macosx_11_0_arm64.whl (3.5 MB): CPython 3.10+, macOS 11.0+ ARM64

