Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). Do not include the https://doi.org/ prefix.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
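
Because the resolver expects a bare DOI, input that arrives in URL form may need normalizing before it is passed to resolve or resolve_many. The sketch below is one way to do that with the standard library only. The helper name strip_doi_prefix and the list of prefixes it recognizes are assumptions for illustration; neither is part of datahugger.

```python
# Hypothetical helper, not part of datahugger: normalize input to a bare DOI.
_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")


def strip_doi_prefix(value: str) -> str:
    """Return the DOI without a resolver URL or "doi:" prefix."""
    doi = value.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            # Slice the original string so the DOI itself keeps its case.
            return doi[len(prefix):]
    return doi


print(strip_doi_prefix("https://doi.org/10.34894/0B7ZLK"))
```

The result can then be handed to the resolver as in the example above.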

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
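
To illustrate how a list of (algorithm, value) pairs could be used, the following sketch checks a local file against such a list with hashlib. It assumes the algorithm names match those hashlib accepts and that the values are hex digests; the validation datahugger performs internally is implemented in Rust and may differ. verify_checksums is a hypothetical name.

```python
import hashlib
import pathlib
import tempfile


def verify_checksums(path: pathlib.Path, checksums: list[tuple[str, str]]) -> bool:
    """Return True if every (algorithm, value) pair matches the file at `path`.

    An empty list yields True, since there is nothing to contradict.
    """
    data = path.read_bytes()
    for algorithm, expected in checksums:
        # hashlib.new raises ValueError for algorithm names it does not know.
        digest = hashlib.new(algorithm, data).hexdigest()
        if digest != expected.lower():
            return False
    return True


# Stand-in file and checksum lists, shaped like FileEntry.checksum.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "sample.txt"
    sample.write_bytes(b"hello")
    good = [("sha256", hashlib.sha256(b"hello").hexdigest())]
    ok = verify_checksums(sample, good)
    bad = verify_checksums(sample, [("sha256", "0" * 64)])
```

Reading the whole file into memory keeps the sketch short; for large files a chunked read with the hash object's update method would be the usual choice.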

ZipEntry

Represents a ZIP archive entry in the dataset. A ZipEntry is a container object that describes a downloadable archive file and the files contained within it.

@dataclass
class ZipEntry(Entry):
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    version: str | None
    creation_date: str | None
    last_modification_date: str | None
    files: list[FileInZipEntry]

Fields

  • download_url URL from which the ZIP archive can be downloaded.

  • size Size of the ZIP archive in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")) used to verify archive integrity.

  • version Optional version identifier of the archive.

  • creation_date Optional creation timestamp of the archive.

  • last_modification_date Optional last modification timestamp of the archive.

  • files List of files contained inside the ZIP archive. Each entry describes a file within the archive (path, size, checksum, and optional metadata such as mimetype).

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
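
As a rough pure-Python illustration of the protocol, the class below implements all four methods over an in-memory iterable. DualIterator and collect_async are hypothetical names; the iterator returned by datahugger is backed by Rust and is not implemented this way.

```python
import asyncio
from typing import Generic, Iterable, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Illustrative stand-in: one object usable with both `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it = iter(items)

    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)  # raises StopIteration when exhausted

    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration signals exhaustion with StopAsyncIteration.
            raise StopAsyncIteration from None


async def collect_async(items: Iterable[str]) -> list[str]:
    return [item async for item in DualIterator(items)]


sync_result = list(DualIterator(["a", "b"]))
async_result = asyncio.run(collect_async(["a", "b"]))
```

Both calls consume the same kind of object; only the calling context differs.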

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry | ZipEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0, includes=None, excludes=None,
    ) -> int: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry | ZipEntry]

Returns a stream of dataset entries, each of which is a FileEntry, DirEntry, or ZipEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# inside an async function
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as FileEntry, DirEntry, or ZipEntry objects.
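
A common use of the crawl stream is to summarize a dataset before downloading it. The sketch below does this over stand-in dataclasses that mirror the documented FileEntry and DirEntry fields, so it runs without network access; with the real library the entries would come from dataset.crawl(). The summarize function and the sample URLs are illustrative assumptions.

```python
import pathlib
from dataclasses import dataclass
from typing import Optional


# Stand-ins mirroring the documented fields; the real classes come from datahugger.
@dataclass
class DirEntry:
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str


@dataclass
class FileEntry:
    path_crawl_rel: pathlib.Path
    download_url: str
    size: Optional[int]
    checksum: list[tuple[str, str]]


def summarize(entries) -> tuple[int, int, int]:
    """Return (file count, total known bytes, files with unknown size)."""
    files = total = unknown = 0
    for entry in entries:
        if isinstance(entry, FileEntry):
            files += 1
            if entry.size is None:
                unknown += 1  # size is optional, so count it separately
            else:
                total += entry.size
    return files, total, unknown


sample = [
    DirEntry(pathlib.Path("raw"), "https://example.com/dataset", "https://example.com/api/raw"),
    FileEntry(pathlib.Path("raw/a.csv"), "https://example.com/a.csv", 120, []),
    FileEntry(pathlib.Path("raw/b.csv"), "https://example.com/b.csv", None, []),
]
print(summarize(sample))  # (2, 120, 1)
```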

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0, includes=None, excludes=None,
) -> int

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
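
The "0 means no limit" convention can be mimicked when consuming a crawl stream by hand. The following sketch is an assumption about how such a cap might be applied on the Python side; apply_limit is a hypothetical helper, and how datahugger treats negative values is not documented here, so the sketch simply rejects them.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def apply_limit(entries: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Return an iterator over at most `limit` entries; 0 means no limit."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # islice(..., None) passes every element through unchanged.
    return islice(entries, limit or None)


first_two = list(apply_limit(range(5), 2))
everything = list(apply_limit(range(5)))
```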

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

from datahugger import DirEntry, FileEntry, ZipEntry, resolve

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)
    elif isinstance(entry, ZipEntry):
        print("Zip:", entry)

Crawl a dataset asynchronously

import asyncio

async def main():
    dataset = resolve("https://example.com/dataset")
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • datahugger_ng-0.6.3-cp310-abi3-win_amd64.whl (3.7 MB): CPython 3.10+, Windows x86-64
  • datahugger_ng-0.6.3-cp310-abi3-musllinux_1_2_x86_64.whl (7.2 MB): CPython 3.10+, musllinux: musl 1.2+ x86-64
  • datahugger_ng-0.6.3-cp310-abi3-musllinux_1_2_i686.whl (6.6 MB): CPython 3.10+, musllinux: musl 1.2+ i686
  • datahugger_ng-0.6.3-cp310-abi3-musllinux_1_2_armv7l.whl (6.1 MB): CPython 3.10+, musllinux: musl 1.2+ ARMv7l
  • datahugger_ng-0.6.3-cp310-abi3-musllinux_1_2_aarch64.whl (7.3 MB): CPython 3.10+, musllinux: musl 1.2+ ARM64
  • datahugger_ng-0.6.3-cp310-abi3-manylinux_2_28_x86_64.whl (6.4 MB): CPython 3.10+, manylinux: glibc 2.28+ x86-64
  • datahugger_ng-0.6.3-cp310-abi3-manylinux_2_28_ppc64le.whl (7.0 MB): CPython 3.10+, manylinux: glibc 2.28+ ppc64le
  • datahugger_ng-0.6.3-cp310-abi3-manylinux_2_28_i686.whl (6.2 MB): CPython 3.10+, manylinux: glibc 2.28+ i686
  • datahugger_ng-0.6.3-cp310-abi3-manylinux_2_28_armv7l.whl (5.9 MB): CPython 3.10+, manylinux: glibc 2.28+ ARMv7l
  • datahugger_ng-0.6.3-cp310-abi3-manylinux_2_28_aarch64.whl (6.9 MB): CPython 3.10+, manylinux: glibc 2.28+ ARM64
  • datahugger_ng-0.6.3-cp310-abi3-macosx_11_0_arm64.whl (4.2 MB): CPython 3.10+, macOS 11.0+ ARM64
