
Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

Core Concepts

Entries

Datasets are composed of entries, which can be either directories (DirEntry) or files (FileEntry). All entries inherit from a common base type.

class Entry:
    """Base entry for files and directories."""

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_craw_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_craw_rel: Path of the directory relative to the dataset root.

  • root_url: Root URL of the dataset this directory belongs to.

  • api_url: API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_craw_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]

Fields

  • path_craw_rel: Path of the file relative to the dataset root.

  • download_url: URL from which the file can be downloaded.

  • size: File size in bytes, or None if unknown.

  • checksum: List of (algorithm, value) checksum pairs, e.g. ("sha256", "...").
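
As an illustration of how the checksum pairs might be consumed, the sketch below checks a byte string against a list of (algorithm, value) pairs with the standard library's hashlib. The helper verify_checksums is hypothetical and not part of datahugger. It also assumes that algorithm names are ones hashlib.new accepts (such as "sha256") and that values are hex digests, neither of which this document specifies.

```python
import hashlib


def verify_checksums(data: bytes, checksums: list[tuple[str, str]]) -> bool:
    """Return True if `data` matches every (algorithm, hex digest) pair.

    Hypothetical helper: assumes hashlib-compatible algorithm names
    and hex-encoded digest values.
    """
    for algorithm, expected in checksums:
        digest = hashlib.new(algorithm.lower(), data).hexdigest()
        if digest != expected.lower():
            return False
    # An empty list means there is nothing to check against.
    return True


payload = b"hello"
pairs = [("sha256", hashlib.sha256(payload).hexdigest())]
print(verify_checksums(payload, pairs))      # digest computed from the same bytes
print(verify_checksums(b"tampered", pairs))  # digest computed from different bytes
```

In practice the bytes would come from a downloaded file and the pairs from FileEntry.checksum; download_with_validation already performs validation, so a helper like this is only relevant when handling downloads yourself.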

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

from typing import AsyncIterator, Iterator, Protocol, TypeVar

T = TypeVar("T")

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
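
To make the protocol concrete, here is a minimal pure-Python sketch of an object that satisfies all four methods. DualIterator and collect_async are hypothetical names used only for illustration; the object returned by the library is implemented in Rust and may behave differently in detail (for example, it presumably does not block the event loop the way this in-memory wrapper trivially avoids doing).

```python
import asyncio
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Hypothetical wrapper that serves the same items to sync and async loops."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it: Iterator[T] = iter(items)

    # Synchronous iteration protocol
    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)  # raises StopIteration when exhausted

    # Asynchronous iteration protocol
    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration signals exhaustion with a different exception.
            raise StopAsyncIteration from None


async def collect_async(source: DualIterator[T]) -> list[T]:
    """Drain the source with `async for` and return the items as a list."""
    return [item async for item in source]


print(list(DualIterator([1, 2, 3])))                        # sync consumption
print(asyncio.run(collect_async(DualIterator(["a", "b"]))))  # async consumption
```

The key detail is the exception translation in __anext__: a StopIteration escaping a coroutine is turned into a RuntimeError by Python, so the async path has to raise StopAsyncIteration instead.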

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[Entry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[Entry]

Returns a stream of dataset entries (directories and files).

The returned object supports both synchronous and asynchronous iteration.

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# "async for" is only valid inside a coroutine
async def main():
    async for entry in dataset.crawl():
        print(entry)

Entries are yielded as either DirEntry or FileEntry.
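
Because the stream mixes both entry types, consumers typically branch on the type. The sketch below summarizes a stream by counting directories and files and totalling the file sizes that are known. The Entry, DirEntry and FileEntry classes here are local stand-ins that mirror the documented fields so the example runs without the package installed; summarize and the sample URLs are likewise hypothetical.

```python
import pathlib
from dataclasses import dataclass, field
from typing import Iterable, Optional


# Local stand-ins mirroring the documented fields; not the library's classes.
class Entry:
    """Base entry for files and directories."""


@dataclass
class DirEntry(Entry):
    path_craw_rel: pathlib.Path
    root_url: str
    api_url: str


@dataclass
class FileEntry(Entry):
    path_craw_rel: pathlib.Path
    download_url: str
    size: Optional[int]
    checksum: list[tuple[str, str]] = field(default_factory=list)


def summarize(entries: Iterable[Entry]) -> dict[str, int]:
    """Count directories and files and total the known file sizes."""
    summary = {"dirs": 0, "files": 0, "known_bytes": 0, "unknown_size": 0}
    for entry in entries:
        if isinstance(entry, DirEntry):
            summary["dirs"] += 1
        elif isinstance(entry, FileEntry):
            summary["files"] += 1
            if entry.size is None:
                # size is optional, so it cannot be summed blindly.
                summary["unknown_size"] += 1
            else:
                summary["known_bytes"] += entry.size
    return summary


sample = [
    DirEntry(pathlib.Path("raw"), "https://example.com/dataset", "https://example.com/api/raw"),
    FileEntry(pathlib.Path("raw/a.csv"), "https://example.com/files/a.csv", 120),
    FileEntry(pathlib.Path("raw/b.csv"), "https://example.com/files/b.csv", None),
]
print(summarize(sample))
```

With the real package, the same function should accept dataset.crawl() in place of sample, since it only relies on synchronous iteration and the documented fields.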

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir: Destination directory for downloaded files.

  • limit: Maximum number of files to download. 0 means no limit.
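
The "0 means no limit" convention for limit can be illustrated with a small standalone helper. This is only a sketch of the convention, not the library's implementation: apply_limit is a hypothetical name, and rejecting negative values is an assumption, since this document does not say how they are handled.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def apply_limit(items: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Return an iterator over at most `limit` items; 0 means all of them."""
    if limit < 0:
        # Assumption: negative values are rejected; the document does not say.
        raise ValueError("limit must be >= 0")
    if limit == 0:
        return iter(items)
    return islice(items, limit)


files = ["a.csv", "b.csv", "c.csv"]
print(list(apply_limit(files, limit=2)))  # first two items only
print(list(apply_limit(files)))           # default limit of 0: everything
```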

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_craw_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_craw_rel)

Crawl a dataset asynchronously

import asyncio

dataset = resolve("https://example.com/dataset")

async def main():
    # "async for" must run inside a coroutine
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))
