Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Project description

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). Do not include the https://doi.org/ prefix.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
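
Because the resolver expects a bare DOI, input copied from a citation may need its resolver prefix removed first. The following is a minimal sketch, assuming a hypothetical helper named strip_doi_prefix (it is not part of datahugger) and assuming that only the common doi.org and dx.doi.org URL forms and a doi: scheme need handling:

```python
# Hypothetical helper, not part of datahugger: normalise user input to a bare DOI.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def strip_doi_prefix(raw: str) -> str:
    """Return the DOI without a leading resolver URL or `doi:` scheme."""
    value = raw.strip()
    lowered = value.lower()  # match prefixes case-insensitively
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            return value[len(prefix):]
    return value  # already a bare DOI (or something this sketch does not recognise)
```

The returned string can then be passed to doi_resolver.resolve(...) as in the example above.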

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
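
To illustrate how the (algorithm, value) pairs could be used, here is a rough sketch of checking a local file against such a list with the standard library. The function name verify_checksums is hypothetical, and the sketch assumes the algorithm names are ones hashlib accepts after lowercasing (e.g. "sha256", "md5"). It is not necessarily how datahugger validates downloads internally.

```python
import hashlib
import pathlib


def verify_checksums(path: pathlib.Path, checksums: list[tuple[str, str]]) -> bool:
    """Return True if every checksum with a supported algorithm matches the file.

    Assumption: algorithms hashlib does not know (and the variable-length
    shake_* family) are skipped rather than treated as failures.
    """
    data = path.read_bytes()  # fine for a sketch; read in chunks for large files
    for algorithm, expected in checksums:
        name = algorithm.lower()
        if name not in hashlib.algorithms_available or name.startswith("shake"):
            continue
        if hashlib.new(name, data).hexdigest() != expected.lower():
            return False
    return True
```

With a FileEntry named entry and a downloaded copy at some local path, the call would look like verify_checksums(local_path, entry.checksum).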

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
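
As an illustration of the protocol, the sketch below shows a pure-Python object that satisfies all four methods by wrapping an ordinary iterable. The class name DualIterator and the helper collect are hypothetical. The real iterator is provided by the Rust extension, so this only demonstrates the shape of the protocol, not the library's implementation.

```python
import asyncio
from typing import Generic, Iterable, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Hypothetical wrapper: one object usable with both `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it = iter(items)

    # Synchronous protocol
    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)  # raises StopIteration when exhausted

    # Asynchronous protocol
    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration signals exhaustion with StopAsyncIteration.
            raise StopAsyncIteration from None


async def collect(source: DualIterator[T]) -> list[T]:
    """Drain the iterator from async code."""
    return [item async for item in source]
```

The same instance type works in both contexts: list(DualIterator([1, 2, 3])) consumes it synchronously, while asyncio.run(collect(DualIterator([1, 2, 3]))) consumes it asynchronously.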

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry is a union type: either a DirEntry or a FileEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# must run inside a coroutine (e.g. one started with asyncio.run)
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
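
Since the call blocks, code that already runs inside an asyncio event loop may want to move it to a worker thread. The sketch below shows that general pattern with asyncio.to_thread, using a stand-in function blocking_download in place of the real method. Whether download_with_validation itself is safe to call from a non-main thread is an assumption this document does not confirm.

```python
import asyncio
import pathlib
import time


def blocking_download(dst_dir: pathlib.Path, limit: int = 0) -> str:
    """Stand-in for a blocking call such as Dataset.download_with_validation."""
    time.sleep(0.05)  # simulate work that would otherwise block the event loop
    return f"downloaded to {dst_dir} (limit={limit})"


async def main() -> tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Keeps counting while the download runs, showing the loop is not blocked.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    # Run the blocking function in a worker thread so the loop stays responsive.
    result = await asyncio.to_thread(blocking_download, pathlib.Path("./data"), 5)
    beat.cancel()
    return result, ticks
```

From synchronous code none of this is needed; calling the method directly is the intended use.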

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)

Crawl a dataset asynchronously

dataset = resolve("https://example.com/dataset")

# must run inside a coroutine (e.g. one started with asyncio.run)
async for entry in dataset.crawl():
    print(entry)

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))


Built Distributions

  • datahugger_ng-0.5.2-cp310-abi3-win_amd64.whl (3.0 MB): CPython 3.10+, Windows x86-64
  • datahugger_ng-0.5.2-cp310-abi3-musllinux_1_2_x86_64.whl (6.4 MB): CPython 3.10+, musllinux (musl 1.2+) x86-64
  • datahugger_ng-0.5.2-cp310-abi3-musllinux_1_2_i686.whl (5.9 MB): CPython 3.10+, musllinux (musl 1.2+) i686
  • datahugger_ng-0.5.2-cp310-abi3-musllinux_1_2_armv7l.whl (5.4 MB): CPython 3.10+, musllinux (musl 1.2+) ARMv7l
  • datahugger_ng-0.5.2-cp310-abi3-musllinux_1_2_aarch64.whl (6.6 MB): CPython 3.10+, musllinux (musl 1.2+) ARM64
  • datahugger_ng-0.5.2-cp310-abi3-manylinux_2_28_x86_64.whl (5.6 MB): CPython 3.10+, manylinux (glibc 2.28+) x86-64
  • datahugger_ng-0.5.2-cp310-abi3-manylinux_2_28_ppc64le.whl (6.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ppc64le
  • datahugger_ng-0.5.2-cp310-abi3-manylinux_2_28_i686.whl (5.4 MB): CPython 3.10+, manylinux (glibc 2.28+) i686
  • datahugger_ng-0.5.2-cp310-abi3-manylinux_2_28_armv7l.whl (5.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ARMv7l
  • datahugger_ng-0.5.2-cp310-abi3-manylinux_2_28_aarch64.whl (6.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ARM64
  • datahugger_ng-0.5.2-cp310-abi3-macosx_11_0_arm64.whl (3.5 MB): CPython 3.10+, macOS 11.0+ ARM64
