
Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.


Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). Do not include the https://doi.org/ prefix.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
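Because the resolver expects a bare DOI, input copied from a citation may need normalizing first. The sketch below shows one way to do that. `strip_doi_prefix` is a hypothetical helper, not part of datahugger, and the list of prefixes it handles is an assumption.

```python
# Hypothetical helper (not part of datahugger): reduce common DOI spellings
# to the bare form that DOIResolver.resolve expects.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def strip_doi_prefix(value: str) -> str:
    """Return the bare DOI, e.g. '10.1000/xyz123'."""
    doi = value.strip()
    lowered = doi.lower()  # match prefixes case-insensitively
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            return doi[len(prefix):]
    return doi  # already bare, or an unrecognized form left untouched


print(strip_doi_prefix("https://doi.org/10.34894/0B7ZLK"))  # 10.34894/0B7ZLK
```

The cleaned value can then be passed to `doi_resolver.resolve(...)` as in the example above.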

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
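To illustrate how a list of (algorithm, value) pairs could be checked against a file on disk, here is a minimal sketch using only the standard library. It assumes that the algorithm names are ones `hashlib` recognizes and that values are hex-encoded; neither is stated above. `matches_checksums` is a hypothetical helper, not the library's own validation routine.

```python
import hashlib
import pathlib
import tempfile


def matches_checksums(path: pathlib.Path, checksums: list[tuple[str, str]]) -> bool:
    """Return True if the file matches every (algorithm, hex value) pair."""
    for algorithm, expected in checksums:
        digest = hashlib.new(algorithm)  # raises ValueError for unknown names
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so large files are not loaded at once
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            return False
    return True  # also True when the list is empty (nothing to check)


# Demo on a temporary file with a checksum computed locally.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "sample.txt"
    sample.write_bytes(b"hello")
    good = [("sha256", hashlib.sha256(b"hello").hexdigest())]
    print(matches_checksums(sample, good))                     # True
    print(matches_checksums(sample, [("sha1", "0" * 40)]))     # False
```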

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
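As a rough illustration of the protocol, the toy class below implements all four methods over an in-memory list. `ListBackedIterator` and `collect` are made up for this sketch and say nothing about how datahugger implements its own iterator.

```python
import asyncio
from typing import Generic, Iterable, TypeVar

T = TypeVar("T")


class ListBackedIterator(Generic[T]):
    """Toy iterator usable with both `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it = iter(items)

    # Synchronous protocol
    def __iter__(self) -> "ListBackedIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)  # StopIteration ends a `for` loop

    # Asynchronous protocol
    def __aiter__(self) -> "ListBackedIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # `async for` expects StopAsyncIteration instead
            raise StopAsyncIteration from None


async def collect(stream: ListBackedIterator[int]) -> list[int]:
    return [item async for item in stream]


print(list(ListBackedIterator([1, 2, 3])))                  # [1, 2, 3]
print(asyncio.run(collect(ListBackedIterator([1, 2, 3]))))  # [1, 2, 3]
```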

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0, includes=None, excludes=None,
    ) -> int: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry is a union type: either a DirEntry or a FileEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# must run inside a coroutine (an async def function)
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.
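Since the stream mixes both entry types, consumers typically branch on the type. The sketch below tallies files, directories, and known bytes. To stay runnable without network access it uses local stand-in dataclasses that mirror the documented fields; `summarize` and the sample URLs are hypothetical.

```python
import pathlib
from dataclasses import dataclass, field
from typing import Iterable, Optional, Union


# Local stand-ins mirroring the documented fields; real code would consume
# the entries yielded by dataset.crawl() instead.
@dataclass
class DirEntry:
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str


@dataclass
class FileEntry:
    path_crawl_rel: pathlib.Path
    download_url: str
    size: Optional[int]
    checksum: list = field(default_factory=list)


def summarize(entries: Iterable[Union[FileEntry, DirEntry]]) -> tuple:
    """Return (file_count, dir_count, total_known_bytes)."""
    files = dirs = total = 0
    for entry in entries:
        if isinstance(entry, FileEntry):
            files += 1
            total += entry.size or 0  # size is None when unknown
        elif isinstance(entry, DirEntry):
            dirs += 1
    return files, dirs, total


sample = [
    DirEntry(pathlib.Path("raw"), "https://example.com/dataset", "https://example.com/api/raw"),
    FileEntry(pathlib.Path("raw/a.csv"), "https://example.com/dataset/raw/a.csv", 120),
    FileEntry(pathlib.Path("raw/b.csv"), "https://example.com/dataset/raw/b.csv", None),
]
print(summarize(sample))  # (2, 1, 120)
```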

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0, includes=None, excludes=None,
) -> int

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.
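If a blocking call like this has to be made from asynchronous code, one common pattern is to hand it to a worker thread with `asyncio.to_thread` so the event loop is not stalled. The sketch below uses a stand-in function, `blocking_download`, with an arbitrary return value. Whether the real method is safe to call from a worker thread is an assumption that is not confirmed here.

```python
import asyncio
import time


def blocking_download(dst_dir: str) -> int:
    """Stand-in for a blocking call such as download_with_validation."""
    time.sleep(0.05)  # simulate work
    return 3          # arbitrary placeholder return value


async def main() -> int:
    # A short task that only completes if the event loop keeps running
    ticker = asyncio.create_task(asyncio.sleep(0.01, result="loop still responsive"))
    # Run the blocking call in a worker thread and await its result
    count = await asyncio.to_thread(blocking_download, "./data")
    print(await ticker)
    return count


print(asyncio.run(main()))
```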

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
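The "0 means no limit" convention can be sketched with `itertools.islice`. `apply_limit` is a hypothetical helper written for this illustration, and rejecting negative values is an assumption of the sketch rather than documented behavior.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def apply_limit(items: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Yield at most `limit` items; 0 means no limit."""
    if limit < 0:
        raise ValueError("limit must be >= 0")  # assumption of this sketch
    return iter(items) if limit == 0 else islice(items, limit)


print(list(apply_limit(["a", "b", "c"], limit=2)))  # ['a', 'b']
print(list(apply_limit(["a", "b", "c"])))           # ['a', 'b', 'c']
```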

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)

Crawl a dataset asynchronously

import asyncio

dataset = resolve("https://example.com/dataset")

async def main():
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))
