Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). The https://doi.org/ prefix should not be included.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
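Because resolve expects a bare DOI, input copied from a citation or a browser may need its prefix stripped first. The following is a minimal sketch of that clean-up step; bare_doi and its prefix list are hypothetical and not part of datahugger.

```python
# Hypothetical helper, not part of datahugger: reduce user input to a bare DOI.
_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")


def bare_doi(value: str) -> str:
    """Strip a leading resolver URL or 'doi:' scheme, if one is present."""
    value = value.strip()
    lowered = value.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            return value[len(prefix):]
    return value


print(bare_doi("https://doi.org/10.34894/0B7ZLK"))
```

The result could then be passed to doi_resolver.resolve(...) as shown above.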

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: a mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
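To illustrate how a list of (algorithm, value) pairs can be used, here is a sketch that checks a byte string against every pair with the standard hashlib module. It assumes that each algorithm name is one hashlib recognises and that each value is a hex digest; the names and encodings datahugger actually reports may differ, and matches_checksums is a hypothetical helper.

```python
import hashlib


def matches_checksums(data: bytes, checksums: list[tuple[str, str]]) -> bool:
    """Return True if `data` matches every (algorithm, value) pair.

    Assumption: algorithm names are valid hashlib names (e.g. "sha256") and
    values are hex digests. An empty list checks nothing and returns True.
    """
    for algorithm, expected in checksums:
        digest = hashlib.new(algorithm.lower(), data).hexdigest()
        if digest != expected.lower():
            return False
    return True


payload = b"hello"
pairs = [("sha256", hashlib.sha256(payload).hexdigest())]
print(matches_checksums(payload, pairs))
```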

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
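As an illustration of the protocol, the toy class below implements all four methods over an in-memory list. It is only a sketch: DualIterator and collect are hypothetical names, and the iterator datahugger returns is backed by Rust and may behave differently.

```python
import asyncio


class DualIterator:
    """Toy object usable with both `for` and `async for` (illustration only)."""

    def __init__(self, items):
        self._it = iter(items)

    # Synchronous protocol
    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)  # raises StopIteration when exhausted

    # Asynchronous protocol
    def __aiter__(self):
        return self

    async def __anext__(self):
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration signals exhaustion with its own exception type.
            raise StopAsyncIteration from None


async def collect(source):
    """Drain an async iterable into a list."""
    return [item async for item in source]


sync_items = list(DualIterator(["a", "b"]))
async_items = asyncio.run(collect(DualIterator(["a", "b"])))
print(sync_items, async_items)
```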

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0, includes = None, excludes = None,
    ) -> int: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry is either a DirEntry or a FileEntry (a union type, as the signature shows).

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0, includes = None, excludes = None,
) -> int

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
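The "0 means no limit" convention can also be applied when consuming a crawl lazily. The sketch below shows one way to express it with itertools.islice; apply_limit is a hypothetical helper and says nothing about how datahugger enforces the limit internally.

```python
from itertools import islice


def apply_limit(entries, limit=0):
    """Return an iterator over at most `limit` entries; 0 means no limit."""
    if limit < 0:
        raise ValueError("limit must be >= 0")
    return iter(entries) if limit == 0 else islice(entries, limit)


print(list(apply_limit(["a", "b", "c"], 2)))
print(list(apply_limit(["a", "b", "c"])))
```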

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)
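The same isinstance dispatch can be used to summarise a crawl before downloading. The sketch below uses stand-in dataclasses that mirror the documented fields so that it runs without network access; in real use the entries would come from dataset.crawl(). The summarize function is a hypothetical helper.

```python
import pathlib
from dataclasses import dataclass
from typing import Optional


# Stand-ins mirroring the documented fields; the real classes come from datahugger.
@dataclass
class DirEntry:
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str


@dataclass
class FileEntry:
    path_crawl_rel: pathlib.Path
    download_url: str
    size: Optional[int]
    checksum: list[tuple[str, str]]


def summarize(entries):
    """Return (file_count, dir_count, known_bytes) for a stream of entries."""
    files = dirs = known_bytes = 0
    for entry in entries:
        if isinstance(entry, FileEntry):
            files += 1
            known_bytes += entry.size or 0  # size is None when unknown
        elif isinstance(entry, DirEntry):
            dirs += 1
    return files, dirs, known_bytes


sample = [
    DirEntry(pathlib.Path("raw"), "https://example.com/dataset", "https://example.com/api/raw"),
    FileEntry(pathlib.Path("raw/a.csv"), "https://example.com/raw/a.csv", 120, []),
    FileEntry(pathlib.Path("raw/b.csv"), "https://example.com/raw/b.csv", None, []),
]
print(summarize(sample))
```

Files whose size is unknown count toward the file total but contribute nothing to the byte total.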

Crawl a dataset asynchronously

import asyncio

async def main():
    dataset = resolve("https://example.com/dataset")
    # `async for` must run inside a coroutine
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

All wheels target CPython 3.10+ (abi3).

  • datahugger_ng-0.6.2-cp310-abi3-win_amd64.whl (3.7 MB): Windows x86-64
  • datahugger_ng-0.6.2-cp310-abi3-musllinux_1_2_x86_64.whl (7.2 MB): musllinux, musl 1.2+, x86-64
  • datahugger_ng-0.6.2-cp310-abi3-musllinux_1_2_i686.whl (6.6 MB): musllinux, musl 1.2+, i686
  • datahugger_ng-0.6.2-cp310-abi3-musllinux_1_2_armv7l.whl (6.1 MB): musllinux, musl 1.2+, ARMv7l
  • datahugger_ng-0.6.2-cp310-abi3-musllinux_1_2_aarch64.whl (7.3 MB): musllinux, musl 1.2+, ARM64
  • datahugger_ng-0.6.2-cp310-abi3-manylinux_2_28_x86_64.whl (6.4 MB): manylinux, glibc 2.28+, x86-64
  • datahugger_ng-0.6.2-cp310-abi3-manylinux_2_28_ppc64le.whl (7.0 MB): manylinux, glibc 2.28+, ppc64le
  • datahugger_ng-0.6.2-cp310-abi3-manylinux_2_28_i686.whl (6.2 MB): manylinux, glibc 2.28+, i686
  • datahugger_ng-0.6.2-cp310-abi3-manylinux_2_28_armv7l.whl (5.9 MB): manylinux, glibc 2.28+, ARMv7l
  • datahugger_ng-0.6.2-cp310-abi3-manylinux_2_28_aarch64.whl (6.9 MB): manylinux, glibc 2.28+, ARM64
  • datahugger_ng-0.6.2-cp310-abi3-macosx_11_0_arm64.whl (4.2 MB): macOS 11.0+, ARM64
