Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). The https://doi.org/ prefix should not be included.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
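
The input convention above (a bare DOI, without the https://doi.org/ prefix) can be sketched with the standard library alone. The helper doi_to_url below is hypothetical and not part of datahugger; it only shows how a bare DOI maps onto the resolver URL that is then requested, and it makes no network call.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build the doi.org URL for a bare DOI (hypothetical helper, not datahugger API)."""
    doi = doi.strip()
    if doi.lower().startswith(("http://", "https://")):
        raise ValueError("pass the bare DOI, without the https://doi.org/ prefix")
    # Keep "/" unescaped so the separator between DOI prefix and suffix survives.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.34894/0B7ZLK"))
```

Whether the first redirect target or the final landing page is returned from that URL is what follow_redirects controls.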

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
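
To illustrate how a list of (algorithm, value) pairs can be checked against a downloaded file, here is a minimal sketch using hashlib. It assumes that the algorithm names are ones hashlib recognizes (such as "sha256" or "md5") and that the values are hex digests; the source does not state either, and matches_checksums is a hypothetical helper rather than datahugger's own validation code.

```python
import hashlib
import pathlib
import tempfile


def matches_checksums(path: pathlib.Path, checksums: list[tuple[str, str]]) -> bool:
    """Return True if the file matches every (algorithm, hex digest) pair."""
    data = path.read_bytes()
    for algorithm, expected in checksums:
        # hashlib.new raises ValueError for algorithm names it does not know.
        actual = hashlib.new(algorithm, data).hexdigest()
        if actual != expected.lower():
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "sample.txt"
    sample.write_bytes(b"hello")
    good = [("sha256", hashlib.sha256(b"hello").hexdigest())]
    bad = [("sha256", "0" * 64)]
    print(matches_checksums(sample, good))
    print(matches_checksums(sample, bad))
```

Reading the whole file into memory keeps the sketch short; for large files a chunked read with the hash object's update() method would be the usual choice.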

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
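
As a rough illustration of the protocol, the toy class below implements all four methods on top of an in-memory list. DualIterator and collect are hypothetical names for this sketch; the iterator returned by datahugger is backed by Rust and is not implemented this way.

```python
import asyncio
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Toy iterator usable with both `for` and `async for` (illustration only)."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it: Iterator[T] = iter(items)

    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)

    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration must end with StopAsyncIteration, not StopIteration.
            raise StopAsyncIteration from None


async def collect(stream: DualIterator[T]) -> list[T]:
    """Drain the stream with `async for` and return the items as a list."""
    return [item async for item in stream]


print(list(DualIterator(["a", "b"])))
print(asyncio.run(collect(DualIterator(["a", "b"]))))
```

Both calls consume the same kind of object; only the calling context differs.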

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry is a union type: either a DirEntry or a FileEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# must run inside a coroutine (an `async def` function)
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
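
The "0 means no limit" convention can be sketched in a few lines. The helper take below is hypothetical and only mirrors the documented meaning of limit; how the library applies the limit internally is not stated here.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take(items: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Return an iterator over at most `limit` items; 0 means no limit."""
    if limit < 0:
        raise ValueError("limit must be >= 0")
    # islice(..., None) passes every item through unchanged.
    return islice(items, limit if limit > 0 else None)


print(list(take(["a.csv", "b.csv", "c.csv"], limit=2)))
print(list(take(["a.csv", "b.csv", "c.csv"])))
```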

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

from datahugger import DirEntry, FileEntry, resolve

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)
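
A crawl stream is often reduced to a summary before anything is downloaded. The sketch below does this with stand-in dataclasses that carry the documented fields, so it runs without network access; the stand-ins and the summarize function are assumptions made for illustration and are not part of datahugger.

```python
import pathlib
from dataclasses import dataclass, field
from typing import Iterable, Optional, Union


@dataclass
class FileEntry:
    """Stand-in carrying the documented FileEntry fields."""
    path_crawl_rel: pathlib.Path
    download_url: str
    size: Optional[int] = None
    checksum: list = field(default_factory=list)


@dataclass
class DirEntry:
    """Stand-in carrying the documented DirEntry fields."""
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str


def summarize(entries: Iterable[Union[FileEntry, DirEntry]]) -> dict:
    """Count entries and add up the file sizes that are known."""
    summary = {"files": 0, "dirs": 0, "bytes": 0, "unknown_size": 0}
    for entry in entries:
        if isinstance(entry, DirEntry):
            summary["dirs"] += 1
            continue
        summary["files"] += 1
        if entry.size is None:
            summary["unknown_size"] += 1  # size is optional, so track the gaps
        else:
            summary["bytes"] += entry.size
    return summary


root = "https://example.com/dataset"
entries = [
    DirEntry(pathlib.Path("raw"), root, root + "/api/raw"),
    FileEntry(pathlib.Path("raw/a.csv"), root + "/raw/a.csv", size=10),
    FileEntry(pathlib.Path("README.md"), root + "/README.md"),
]
print(summarize(entries))
```

Because summarize only iterates once over its argument, the same function could presumably be fed the result of dataset.crawl() from synchronous code.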

Crawl a dataset asynchronously

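import asyncio

dataset = resolve("https://example.com/dataset")

async def main():
    # `async for` is only valid inside a coroutine
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())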

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))

