Python binding of datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

DOIResolver

Resolves Digital Object Identifiers (DOIs) to their target URLs using the DOI resolution service (e.g. https://doi.org/<doi>).

from datahugger import DOIResolver

doi_resolver = DOIResolver(timeout=30)

url = doi_resolver.resolve("10.34894/0B7ZLK", False)
assert url == "https://dataverse.nl/citation?persistentId=doi:10.34894/0B7ZLK"

# or resolve several DOIs in one call
urls = doi_resolver.resolve_many(
    ["10.34894/0B7ZLK", "10.17026/DANS-2AC-ETD6", "10.17026/DANS-2BA-UAVX"], False
)

Parameters

  • doi (or a list of DOIs for resolve_many) The DOI to resolve (e.g. "10.1000/xyz123"). Do not include the https://doi.org/ prefix.

  • follow_redirects Whether HTTP redirects should be followed.

    • True: Returns the final landing page URL (default).
    • False: Returns the first redirect target.
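
Because the resolver expects a bare DOI, input copied from a citation or a browser address bar may need normalizing first. The following is a minimal sketch of such a step. `strip_doi_prefix` is a hypothetical helper, not part of datahugger, and the set of prefixes it handles is an assumption.

```python
# Hypothetical helper (not part of datahugger): reduce common DOI
# spellings to the bare form, e.g. "10.34894/0B7ZLK".
_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi.org/", "doi:")


def strip_doi_prefix(value: str) -> str:
    """Return the DOI without a resolver URL or "doi:" prefix."""
    doi = value.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            # Slice the original string so the DOI's own casing is kept.
            return doi[len(prefix):]
    return doi


print(strip_doi_prefix("https://doi.org/10.34894/0B7ZLK"))
```

The cleaned value could then be passed to doi_resolver.resolve as in the example above.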

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
    # TODO: the mimetype field will be added here.

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of checksum pairs (algorithm, value) (e.g. ("sha256", "...")).
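
To illustrate how the checksum pairs could be used, here is a minimal sketch that verifies a local file against a list of (algorithm, value) pairs with hashlib. It assumes that the algorithm names are ones accepted by hashlib.new (for example "sha256" or "md5") and that the values are hex digests; neither is confirmed by this document. `verify_checksums` is a hypothetical helper, not the library's own validation routine.

```python
import hashlib
import pathlib


def verify_checksums(path: pathlib.Path, checksums: list[tuple[str, str]]) -> bool:
    """Return True if every (algorithm, hex digest) pair matches the file.

    Hypothetical helper; assumes hashlib-compatible algorithm names.
    """
    for algorithm, expected in checksums:
        digest = hashlib.new(algorithm)
        with path.open("rb") as fh:
            # Read in chunks so large files are not loaded into memory at once.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            return False
    return True
```

Note that in this sketch an empty checksum list passes trivially, so a caller that requires at least one checksum would need to check for that separately.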

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
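
One way such an object could be built in pure Python is sketched below. `ListStream` is a hypothetical stand-in over an in-memory list, not datahugger's implementation (which is backed by Rust); it only illustrates how a single class can satisfy both iteration protocols.

```python
import asyncio
from typing import Generic, TypeVar

T = TypeVar("T")


class ListStream(Generic[T]):
    """Hypothetical dual-mode iterator over an in-memory list."""

    def __init__(self, items: list[T]) -> None:
        self._items = list(items)
        self._pos = 0

    # Synchronous protocol
    def __iter__(self) -> "ListStream[T]":
        return self

    def __next__(self) -> T:
        if self._pos >= len(self._items):
            raise StopIteration
        item = self._items[self._pos]
        self._pos += 1
        return item

    # Asynchronous protocol
    def __aiter__(self) -> "ListStream[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self)
        except StopIteration:
            # Async iteration signals exhaustion with StopAsyncIteration.
            raise StopAsyncIteration from None


async def collect(stream: "ListStream[T]") -> list[T]:
    """Drain the stream with "async for" and return the items."""
    return [item async for item in stream]


print(list(ListStream(["a", "b"])))
print(asyncio.run(collect(ListStream(["a", "b"]))))
```

Both calls print ['a', 'b']. Each ListStream is single-use: once drained in one mode it is exhausted for the other as well.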

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry is either a DirEntry or a FileEntry (a union type).

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# must run inside a coroutine (an "async def" function)
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.
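
As a small illustration of consuming the stream, the sketch below tallies files, directories, and known bytes from any iterable of entries. The `FileEntry` and `DirEntry` classes here are simplified local stand-ins so that the example runs without network access, and `summarize` is a hypothetical helper. With the real library, the result of dataset.crawl() would take the place of the hand-built list.

```python
import pathlib
from dataclasses import dataclass, field
from typing import Iterable, Optional, Union


# Simplified stand-ins mirroring the fields documented above.
@dataclass
class DirEntry:
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str


@dataclass
class FileEntry:
    path_crawl_rel: pathlib.Path
    download_url: str
    size: Optional[int] = None
    checksum: list[tuple[str, str]] = field(default_factory=list)


def summarize(entries: Iterable[Union[FileEntry, DirEntry]]) -> dict[str, int]:
    """Count files and directories and sum the sizes that are known."""
    stats = {"files": 0, "dirs": 0, "known_bytes": 0}
    for entry in entries:
        if isinstance(entry, FileEntry):
            stats["files"] += 1
            # size may be None when the repository does not report it.
            stats["known_bytes"] += entry.size or 0
        elif isinstance(entry, DirEntry):
            stats["dirs"] += 1
    return stats


sample = [
    DirEntry(pathlib.Path("raw"), "https://example.com/dataset", "https://example.com/api/raw"),
    FileEntry(pathlib.Path("raw/a.csv"), "https://example.com/files/a.csv", size=120),
    FileEntry(pathlib.Path("raw/b.csv"), "https://example.com/files/b.csv"),
]
print(summarize(sample))
```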

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
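
The meaning of limit can be pictured with a small sketch: 0 lets every entry through, and any positive value caps the count. `take_limited` is a hypothetical helper written for illustration; it says nothing about how the Rust implementation applies the limit internally, and rejecting negative values is an assumption of this sketch.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_limited(entries: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Yield at most `limit` entries; a limit of 0 means no limit."""
    if limit < 0:
        # Assumption of this sketch: negative limits are treated as errors.
        raise ValueError("limit must be >= 0")
    # islice with stop=None passes every item through.
    return islice(entries, limit if limit > 0 else None)


print(list(take_limited(["a", "b", "c"], limit=2)))
print(list(take_limited(["a", "b", "c"])))
```

The first call prints ['a', 'b'] and the second prints all three items.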

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)

Crawl a dataset asynchronously

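import asyncio

async def main():
    dataset = resolve("https://example.com/dataset")
    # "async for" is only valid inside a coroutine
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())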

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))
