
Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.


Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# Or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). Do not include the https://doi.org/ prefix.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
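Since the resolver expects a bare DOI, input copied from a citation or a link may need normalizing first. The following is a minimal sketch, assuming a hypothetical helper named normalize_doi that is not part of datahugger; it strips a few common prefixes and rejects strings that do not look like a DOI.

```python
def normalize_doi(raw: str) -> str:
    """Return a bare DOI such as '10.1000/xyz123'.

    Hypothetical helper for illustration; not a datahugger API.
    """
    doi = raw.strip()
    lowered = doi.lower()
    # Strip the first matching prefix, comparing case-insensitively.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10.".
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {raw!r}")
    return doi
```

The returned string can then be passed to doi_resolver.resolve(...) or collected into the list given to resolve_many(...).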

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
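To illustrate how (algorithm, value) pairs like these can be used, here is a rough sketch that checks a local file against such a list with the standard library. The function name verify_checksums is hypothetical, and the sketch assumes the algorithm names are ones hashlib.new() accepts and that values are hex-encoded; the names and encodings datahugger actually reports may differ.

```python
import hashlib
from pathlib import Path


def verify_checksums(path: Path, checksums: list[tuple[str, str]]) -> bool:
    """Check a local file against (algorithm, value) pairs.

    Hypothetical helper; assumes hashlib-compatible algorithm names
    (e.g. "sha256", "md5") and hex-encoded digests.
    An empty list passes trivially because there is nothing to check.
    """
    data = path.read_bytes()  # fine for a sketch; large files should be read in chunks
    for algorithm, expected in checksums:
        digest = hashlib.new(algorithm, data).hexdigest()
        if digest != expected.lower():
            return False
    return True
```

In practice download_with_validation() already performs validation; a helper like this is only useful when files are fetched by other means.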

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
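As an illustration of the protocol, the sketch below shows a pure-Python class that satisfies all four methods by wrapping an ordinary iterable. The class name DualIterator is hypothetical, and this is not how datahugger implements it (the real iterator is backed by Rust); in particular, the async path here never actually yields to the event loop.

```python
import asyncio
from typing import Generic, Iterable, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Hypothetical object usable with both `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it = iter(items)

    # Synchronous iterator protocol
    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)

    # Asynchronous iterator protocol
    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration must end with StopAsyncIteration instead.
            raise StopAsyncIteration from None


async def collect(it: DualIterator[T]) -> list[T]:
    return [item async for item in it]


sync_items = list(DualIterator([1, 2, 3]))
async_items = asyncio.run(collect(DualIterator([1, 2, 3])))
```

Both sync_items and async_items end up holding the same elements, which is the property the protocol is meant to guarantee.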

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry has the union type FileEntry | DirEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# Must run inside a coroutine (async def)
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
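Because the call blocks, invoking it directly from a coroutine would stall the event loop. One common pattern for blocking functions is to hand them to a worker thread with asyncio.to_thread. The sketch below uses a stand-in function, blocking_download, in place of the real method; whether download_with_validation() behaves well when called from a worker thread is an assumption the source does not confirm.

```python
import asyncio
import time


def blocking_download(dst_dir: str) -> str:
    """Stand-in for a blocking call such as dataset.download_with_validation(...)."""
    time.sleep(0.1)  # simulates time spent downloading
    return dst_dir


async def main() -> list[str]:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    download = asyncio.to_thread(blocking_download, "./data")
    # A second task that only completes if the loop keeps running meanwhile.
    heartbeat = asyncio.sleep(0.01, result="loop still running")
    return list(await asyncio.gather(download, heartbeat))


results = asyncio.run(main())
```

From plain synchronous code none of this is needed; the method can simply be called directly.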

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)
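The same isinstance dispatch can drive a small summary of a crawl. The sketch below is self-contained: FileEntry and DirEntry are local stand-ins that mirror the documented fields (they are not imported from datahugger), and summarize is a hypothetical helper. It also shows one way to handle size being None.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterable, Optional, Union


@dataclass
class FileEntry:  # stand-in mirroring the documented fields
    path_crawl_rel: Path
    download_url: str
    size: Optional[int] = None
    checksum: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class DirEntry:  # stand-in mirroring the documented fields
    path_crawl_rel: Path
    root_url: str
    api_url: str


def summarize(entries: Iterable[Union[FileEntry, DirEntry]]) -> dict[str, int]:
    """Count files and directories and total the known file sizes."""
    summary = {"files": 0, "dirs": 0, "known_bytes": 0, "unknown_size": 0}
    for entry in entries:
        if isinstance(entry, DirEntry):
            summary["dirs"] += 1
            continue
        summary["files"] += 1
        if entry.size is None:
            summary["unknown_size"] += 1  # size not reported by the source
        else:
            summary["known_bytes"] += entry.size
    return summary
```

With the real library, the argument would be the iterator returned by dataset.crawl().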

Crawl a dataset asynchronously

import asyncio

dataset = resolve("https://example.com/dataset")

async def main():
    # async for is only valid inside a coroutine
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))
