Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). Do not include the https://doi.org/ prefix.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
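
Because the resolver expects a bare DOI, input copied from a citation or a link may need normalizing first. The helper below is a minimal sketch and not part of datahugger: `normalize_doi` is a hypothetical name, and the list of prefixes it strips is an assumption about common input forms.

```python
# Hypothetical helper (not part of datahugger): reduce common DOI spellings
# to the bare form expected by the resolver, e.g. "10.34894/0B7ZLK".
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip surrounding whitespace and one leading prefix, if present."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # DOI names always begin with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {raw!r}")
    return doi
```

The normalized string can then be passed to `doi_resolver.resolve(...)` as in the example above.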

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
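
To illustrate how the size and checksum fields could be used, here is a minimal sketch that checks an already downloaded file against them. `FileEntryLike` is a stand-in that mirrors only the documented fields, `verify` is a hypothetical helper, and the assumption that algorithm names match `hashlib` names (such as "sha256") is not confirmed by the library.

```python
import hashlib
import pathlib
from dataclasses import dataclass, field
from typing import Optional

# Assumption: checksum algorithm names map directly onto these hashlib names.
SUPPORTED = {"md5", "sha1", "sha256", "sha512"}


@dataclass
class FileEntryLike:
    """Stand-in with the documented FileEntry fields used in this sketch."""
    path_crawl_rel: pathlib.Path
    size: Optional[int]
    checksum: list[tuple[str, str]] = field(default_factory=list)


def verify(entry: FileEntryLike, local_file: pathlib.Path) -> bool:
    """Return True if the size (when known) and every supported checksum match."""
    data = local_file.read_bytes()  # reads the whole file; fine for a sketch
    if entry.size is not None and entry.size != len(data):
        return False
    for algorithm, expected in entry.checksum:
        name = algorithm.lower()
        if name not in SUPPORTED:
            continue  # unrecognized algorithm: skipped rather than failed
        if hashlib.new(name, data).hexdigest() != expected.lower():
            return False
    return True
```

For large files a chunked read with `hash.update()` would be preferable to loading everything into memory.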

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
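
As a rough illustration of the protocol, the pure-Python class below satisfies all four methods by wrapping an ordinary iterable. It is a sketch only: `DualIterator` and `collect_async` are hypothetical names, and the real iterator returned by datahugger is backed by Rust and may behave differently.

```python
import asyncio
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Hypothetical object usable with both `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it: Iterator[T] = iter(items)

    # Synchronous protocol
    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)  # raises StopIteration when exhausted

    # Asynchronous protocol
    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # A coroutine must signal exhaustion with StopAsyncIteration.
            raise StopAsyncIteration from None


async def collect_async(source: DualIterator[T]) -> list[T]:
    """Drain the iterator with `async for` and return the items."""
    return [item async for item in source]
```

In this sketch both modes share one underlying iterator, so each object can be consumed only once, in either mode.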

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0, includes = None, excludes = None,
    ) -> int: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries, each of which is either a DirEntry or a FileEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# async for is only valid inside a coroutine (an async def function)
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0, includes = None, excludes = None,
) -> int

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
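
The "0 means no limit" convention can be mirrored on the caller side, for instance to preview the first few entries of a crawl before downloading. This is a sketch under assumptions: `take` is a hypothetical helper and not part of datahugger, and nothing here implies the library applies its limit in crawl order.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take(entries: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Yield at most `limit` entries; a limit of 0 yields all of them."""
    if limit < 0:
        raise ValueError("limit must be >= 0")
    # islice treats a stop value of None as "no upper bound".
    return islice(entries, limit or None)
```

Because `islice` is lazy, entries beyond the limit are never requested from the underlying stream.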

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)
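
Building on the loop above, a crawl can be reduced to a short summary before deciding whether to download. The sketch below uses stand-in classes (`DirEntryLike`, `FileEntryLike`) that mirror only the documented fields so that it runs without network access; `summarize` is a hypothetical helper, not part of datahugger.

```python
import pathlib
from dataclasses import dataclass
from typing import Iterable, Optional, Union


@dataclass
class DirEntryLike:
    """Stand-in for DirEntry (only the field used here)."""
    path_crawl_rel: pathlib.Path


@dataclass
class FileEntryLike:
    """Stand-in for FileEntry (only the fields used here)."""
    path_crawl_rel: pathlib.Path
    size: Optional[int]


def summarize(entries: Iterable[Union[DirEntryLike, FileEntryLike]]) -> dict:
    """Count directories and files, and total the file sizes that are known."""
    summary = {"dirs": 0, "files": 0, "known_bytes": 0, "unknown_size": 0}
    for entry in entries:
        if isinstance(entry, DirEntryLike):
            summary["dirs"] += 1
        else:
            summary["files"] += 1
            if entry.size is None:
                summary["unknown_size"] += 1  # size is optional in the API
            else:
                summary["known_bytes"] += entry.size
    return summary
```

With the real library, the `isinstance` check would reference datahugger's own `DirEntry`, and the input would be `dataset.crawl()`.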

Crawl a dataset asynchronously

dataset = resolve("https://example.com/dataset")

# async for is only valid inside a coroutine (an async def function)
async for entry in dataset.crawl():
    print(entry)

Download a dataset

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • datahugger_ng-0.6.0-cp310-abi3-win_amd64.whl (3.6 MB): CPython 3.10+, Windows x86-64
  • datahugger_ng-0.6.0-cp310-abi3-musllinux_1_2_x86_64.whl (7.0 MB): CPython 3.10+, musllinux (musl 1.2+), x86-64
  • datahugger_ng-0.6.0-cp310-abi3-musllinux_1_2_i686.whl (6.5 MB): CPython 3.10+, musllinux (musl 1.2+), i686
  • datahugger_ng-0.6.0-cp310-abi3-musllinux_1_2_armv7l.whl (6.0 MB): CPython 3.10+, musllinux (musl 1.2+), ARMv7l
  • datahugger_ng-0.6.0-cp310-abi3-musllinux_1_2_aarch64.whl (7.2 MB): CPython 3.10+, musllinux (musl 1.2+), ARM64
  • datahugger_ng-0.6.0-cp310-abi3-manylinux_2_28_x86_64.whl (6.2 MB): CPython 3.10+, manylinux (glibc 2.28+), x86-64
  • datahugger_ng-0.6.0-cp310-abi3-manylinux_2_28_ppc64le.whl (6.9 MB): CPython 3.10+, manylinux (glibc 2.28+), ppc64le
  • datahugger_ng-0.6.0-cp310-abi3-manylinux_2_28_i686.whl (6.1 MB): CPython 3.10+, manylinux (glibc 2.28+), i686
  • datahugger_ng-0.6.0-cp310-abi3-manylinux_2_28_armv7l.whl (5.7 MB): CPython 3.10+, manylinux (glibc 2.28+), ARMv7l
  • datahugger_ng-0.6.0-cp310-abi3-manylinux_2_28_aarch64.whl (6.8 MB): CPython 3.10+, manylinux (glibc 2.28+), ARM64
  • datahugger_ng-0.6.0-cp310-abi3-macosx_11_0_arm64.whl (4.1 MB): CPython 3.10+, macOS 11.0+ ARM64
