Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Project description

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of (algorithm, value) checksum pairs, e.g. ("sha256", "...").
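
As a rough sketch of how the checksum field might be consumed, the helper below checks a local file against a list of (algorithm, value) pairs. It assumes that algorithm names match Python's hashlib names (for example "sha256") and that values are hex digests; this page confirms neither. verify_checksums is a hypothetical helper and not part of datahugger.

```python
import hashlib
import pathlib

# Algorithms this sketch knows how to check; anything else is skipped.
SUPPORTED = ("md5", "sha1", "sha256", "sha512")


def verify_checksums(path: pathlib.Path, checksums: list[tuple[str, str]]) -> bool:
    """Return True if the file matches every supported checksum pair.

    Hypothetical helper: assumes hashlib-style algorithm names and hex
    digests. Returns False when no pair could be checked at all.
    """
    checked = 0
    for algorithm, expected in checksums:
        name = algorithm.lower()
        if name not in SUPPORTED:
            continue  # unknown algorithm, nothing to compare against
        digest = hashlib.new(name)
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so large files are not held in memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            return False
        checked += 1
    return checked > 0
```

With a real FileEntry this would be called as verify_checksums(local_path, entry.checksum), where local_path is wherever the file was saved.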

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
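
To make the protocol concrete, here is a minimal pure-Python sketch of an object that satisfies both halves of it. DualIterator and collect are illustrative names only; how datahugger implements its own iterator is not shown on this page.

```python
import asyncio
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Illustrative stand-in: one object usable with `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it: Iterator[T] = iter(items)

    # Synchronous iterator protocol
    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)  # raises StopIteration when exhausted

    # Asynchronous iterator protocol
    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration must signal the end with StopAsyncIteration.
            raise StopAsyncIteration from None


async def collect(source: DualIterator[T]) -> list[T]:
    """Drain the iterator from async code."""
    return [item async for item in source]


sync_items = list(DualIterator(["a", "b"]))
async_items = asyncio.run(collect(DualIterator(["a", "b"])))
```

Both calls consume a fresh iterator, since a single instance is exhausted after one pass in either mode.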

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry is a union type: either a DirEntry or a FileEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# Run inside an async def function (or an async-aware REPL)
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
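
Since the call blocks, invoking it directly from a coroutine would stall the event loop. One possible workaround, sketched below, is to hand it to a worker thread with asyncio.to_thread. This assumes the method is safe to call off the main thread, which this page does not state. Downloadable and download_in_background are hypothetical names introduced for illustration.

```python
import asyncio
import pathlib
from typing import Protocol


class Downloadable(Protocol):
    """Anything exposing the documented blocking download method."""

    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...


async def download_in_background(
    dataset: Downloadable, dst_dir: pathlib.Path, limit: int = 0
) -> None:
    # Run the blocking call in a worker thread so the event loop can keep
    # serving other tasks. Assumption: the call is thread-safe.
    await asyncio.to_thread(dataset.download_with_validation, dst_dir, limit)
```

From synchronous code none of this is needed; calling dataset.download_with_validation(dst_dir) directly is the documented path.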

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)

Crawl a dataset asynchronously

import asyncio

async def main():
    dataset = resolve("https://example.com/dataset")
    # async for is only valid inside a coroutine
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())

Download a dataset

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • datahugger_ng-0.3.0-cp310-abi3-win_amd64.whl (2.9 MB): CPython 3.10+, Windows x86-64
  • datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_x86_64.whl (6.2 MB): CPython 3.10+, musllinux (musl 1.2+) x86-64
  • datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_i686.whl (5.8 MB): CPython 3.10+, musllinux (musl 1.2+) i686
  • datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_armv7l.whl (5.2 MB): CPython 3.10+, musllinux (musl 1.2+) ARMv7l
  • datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_aarch64.whl (6.4 MB): CPython 3.10+, musllinux (musl 1.2+) ARM64
  • datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_x86_64.whl (5.5 MB): CPython 3.10+, manylinux (glibc 2.28+) x86-64
  • datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_ppc64le.whl (6.1 MB): CPython 3.10+, manylinux (glibc 2.28+) ppc64le
  • datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_i686.whl (5.3 MB): CPython 3.10+, manylinux (glibc 2.28+) i686
  • datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_armv7l.whl (5.0 MB): CPython 3.10+, manylinux (glibc 2.28+) ARMv7l
  • datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_aarch64.whl (6.1 MB): CPython 3.10+, manylinux (glibc 2.28+) ARM64
  • datahugger_ng-0.3.0-cp310-abi3-macosx_11_0_arm64.whl (3.4 MB): CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-win_amd64.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 c94a8e87fbb70715d647917fbdfa4029fb921e6d3a117b7573d998f3cbfa4a36
MD5 5a9303bc97951ceef40ac97ff909ecd0
BLAKE2b-256 c88392f583b96e81438b33b913a9791fa84f4a7119f2ddca7ed0b1e127aaf540

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_x86_64.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_x86_64.whl
Algorithm Hash digest
SHA256 728a9dfb6fb4fa5e6045888064a05a938a696695dd64186ac230aa64f4b42cae
MD5 f9c40f043a5f088a3ed56586908ef8ed
BLAKE2b-256 89c85151e38cf1c44e7f8b37075bdcc66254dd736ed887383ef38e5bd2f5a189

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_i686.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_i686.whl
Algorithm Hash digest
SHA256 231427e69ca6c96d182f53940f750fda8b0294d34ae3f0811ebbf6541c9d5a10
MD5 9ab4bfe9ccc53bef2a400478847e156d
BLAKE2b-256 7ef847b9e5bd480060d089bdf84a8c4dd76c17a205cf87fcd41a2c59a6d223e9

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_armv7l.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_armv7l.whl
Algorithm Hash digest
SHA256 d9b7456e3290606acada7769c9eaf6870517f6a4515a086239b589f0ab5a0225
MD5 72afaa37e9783eb79a30fa3f2d09b619
BLAKE2b-256 b060e4c9f5ac01cbe5e5ae704aff419157ed104b67b33216fc47216297876151

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_aarch64.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_aarch64.whl
Algorithm Hash digest
SHA256 b7da2e169f437779ba88474cca0ee701654a3e58ff6ac89e3fa53bd97717ec0e
MD5 2f6b2026421a2c489e416eaefeb1c1d6
BLAKE2b-256 c57c53ced97e07e508921406cbcf77380d597b0e3b47396f20c11d19e5d98fc1

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 595471e436d88a2f38f1a3805df705918a81c32e7f06c2b6d6f8357c8b054182
MD5 0ed5f45ddb91c977002091983b1f98c7
BLAKE2b-256 d17ea167a802e4c4fee785135d0790f8b306072b65ede9ce9c50c0d7a98de497

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_ppc64le.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_ppc64le.whl
Algorithm Hash digest
SHA256 0e0fe9eb87750c776e3e860a5ff46d4d0d89ae4a74cf793733809b5e3821183e
MD5 01520d9771dc0b6433f0aa327008822e
BLAKE2b-256 6c073b169f705bb63dd9a38e427748654e71c152b85b0ef921da3b6773bb727f

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_i686.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_i686.whl
Algorithm Hash digest
SHA256 f95200306fdfbe2dcd7399f0e91582865b1eefed62fb2ebd26e225f6cb230e48
MD5 1ccc0f0d2323379b937cc99d44fb3ef0
BLAKE2b-256 b17f9096b4233c37da053f97139dad56559a23712437ca188e0d1d0bd614af12

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_armv7l.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_armv7l.whl
Algorithm Hash digest
SHA256 74a08c90ba1f93742c8536fde88e41ec7c9d73f5abb78ee2e27dd70303c8d40e
MD5 1f28a85e873499cd6b5c29151e40ef61
BLAKE2b-256 0ec4b0c1dfdfbf15e550e4f860a9b46d8da5cd387feaf77f711c691dddc91533

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_aarch64.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 1eff0fd8a35c01736128a8c7235cd01a6248d91f22a0dc9df13361eccd48a7c7
MD5 fd9b3b49868e9f9e0ec272f8ba9d0819
BLAKE2b-256 40d3c6bdb80e8a9dfc4c70c86d3804ff724d3888c4e4d6a3c291bab40e710048

File details

Details for the file datahugger_ng-0.3.0-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for datahugger_ng-0.3.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 932c0b8bd6447acbd764a17bc3516452cb1f9bb25bcbc60da1ffb11a8143a331
MD5 62e7a0c715ca5797a6fc1e778091a297
BLAKE2b-256 0716c6851a6861dac414f421ef7309594e19027aa71a68445ec3d433b1e4d299
