Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Project description

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.
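Because path_crawl_rel is relative to the dataset root, it can be joined onto a local destination directory to mirror the remote layout. The following is a minimal sketch, not part of the library: the helper name local_path is hypothetical, and the rejection of absolute paths and ".." components is an added safety assumption rather than documented behaviour.

```python
import pathlib


def local_path(dst_dir: pathlib.Path, path_crawl_rel: pathlib.Path) -> pathlib.Path:
    """Map a crawl-relative path onto a local destination directory.

    Hypothetical helper: refuses paths that could escape dst_dir.
    """
    if path_crawl_rel.is_absolute() or ".." in path_crawl_rel.parts:
        raise ValueError(f"unsafe relative path: {path_crawl_rel}")
    return dst_dir / path_crawl_rel
```

The same helper applies to FileEntry.path_crawl_rel, since both entry types carry the field.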

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of (algorithm, value) checksum pairs, e.g. ("sha256", "...").
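To illustrate how the size and checksum fields might be used together, here is a rough sketch of checking downloaded bytes against an entry. It is not the library's validation code. The FileEntry class below is a local stand-in that mirrors the documented fields, verify_bytes is a hypothetical helper, and it assumes that algorithm names are ones hashlib recognises and that values are hex digests.

```python
import hashlib
import pathlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FileEntry:
    """Local stand-in mirroring the documented fields; not the library class."""
    path_crawl_rel: pathlib.Path
    download_url: str
    size: Optional[int]
    checksum: list[tuple[str, str]] = field(default_factory=list)


def verify_bytes(entry: FileEntry, data: bytes) -> bool:
    """Return True if data matches the known size and every listed checksum."""
    if entry.size is not None and len(data) != entry.size:
        return False
    for algorithm, expected in entry.checksum:
        # Assumption: the algorithm name is accepted by hashlib.new()
        # and the expected value is a hexadecimal digest.
        digest = hashlib.new(algorithm, data).hexdigest()
        if digest != expected.lower():
            return False
    return True
```

An entry with an empty checksum list and unknown size passes trivially in this sketch, so callers may want to treat that case separately.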

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
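As an illustration of the protocol, the sketch below shows one way a pure-Python object could satisfy both iteration interfaces. The class DualIterator and the helper collect are hypothetical names; the real iterator is provided by the Rust extension and may be implemented quite differently.

```python
import asyncio
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Hypothetical object usable with both `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it: Iterator[T] = iter(items)

    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        # Propagates StopIteration when the underlying iterator is exhausted.
        return next(self._it)

    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration must signal the end with StopAsyncIteration.
            raise StopAsyncIteration from None


async def collect(stream: DualIterator[T]) -> list[T]:
    """Drain a stream using asynchronous iteration."""
    return [item async for item in stream]
```

Note that in this sketch both modes share one underlying iterator, so an object should be consumed in one mode or the other, not both.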

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry is a union type: either a DirEntry or a FileEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.
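Since a crawl mixes both entry types and file sizes may be unknown, consumers typically branch on the type and guard against size being None. The sketch below shows that pattern on stand-in dataclasses that mirror the documented fields; summarize is a hypothetical helper, and it accepts any iterable, so a synchronous dataset.crawl() could presumably be passed in its place.

```python
import pathlib
from dataclasses import dataclass
from typing import Iterable, Optional, Union


@dataclass
class DirEntry:  # stand-in with the documented fields, not the library class
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str


@dataclass
class FileEntry:  # stand-in with the documented fields, not the library class
    path_crawl_rel: pathlib.Path
    download_url: str
    size: Optional[int]
    checksum: list


def summarize(entries: Iterable[Union[FileEntry, DirEntry]]) -> dict:
    """Count directories and files, and total the sizes that are known."""
    summary = {"dirs": 0, "files": 0, "known_bytes": 0, "unknown_size": 0}
    for entry in entries:
        if isinstance(entry, DirEntry):
            summary["dirs"] += 1
        else:
            summary["files"] += 1
            if entry.size is None:
                summary["unknown_size"] += 1
            else:
                summary["known_bytes"] += entry.size
    return summary
```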

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
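Because the call blocks, code that already runs inside an asyncio event loop may want to push it onto a worker thread. The sketch below shows that general pattern with asyncio.to_thread. FakeDataset is a hypothetical stand-in that only mimics the documented signature, and download_in_background is an illustrative name. Whether the real method is safe or efficient to call from a worker thread is not stated here, so treat this as an assumption to verify.

```python
import asyncio
import pathlib
import time


class FakeDataset:
    """Hypothetical stand-in; only mimics the documented method signature."""

    def download_with_validation(self, dst_dir: pathlib.Path, limit: int = 0) -> None:
        time.sleep(0.05)  # pretend to block on network I/O
        self.downloaded_to = dst_dir
        self.limit = limit


async def download_in_background(dataset, dst_dir: pathlib.Path, limit: int = 0) -> None:
    """Run the blocking download in a worker thread so the event loop stays free."""
    await asyncio.to_thread(dataset.download_with_validation, dst_dir, limit)
```

From synchronous code none of this is needed: calling download_with_validation directly is the documented usage.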

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)

Crawl a dataset asynchronously

import asyncio

async def main():
    dataset = resolve("https://example.com/dataset")
    # async for is only valid inside a coroutine
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))

Source Distributions

No source distribution files available for this release.

Built Distributions

  • datahugger_ng-0.4.0-cp310-abi3-win_amd64.whl (3.0 MB): CPython 3.10+, Windows x86-64
  • datahugger_ng-0.4.0-cp310-abi3-musllinux_1_2_x86_64.whl (6.3 MB): CPython 3.10+, musllinux (musl 1.2+) x86-64
  • datahugger_ng-0.4.0-cp310-abi3-musllinux_1_2_i686.whl (5.9 MB): CPython 3.10+, musllinux (musl 1.2+) i686
  • datahugger_ng-0.4.0-cp310-abi3-musllinux_1_2_armv7l.whl (5.3 MB): CPython 3.10+, musllinux (musl 1.2+) ARMv7l
  • datahugger_ng-0.4.0-cp310-abi3-musllinux_1_2_aarch64.whl (6.5 MB): CPython 3.10+, musllinux (musl 1.2+) ARM64
  • datahugger_ng-0.4.0-cp310-abi3-manylinux_2_28_x86_64.whl (5.6 MB): CPython 3.10+, manylinux (glibc 2.28+) x86-64
  • datahugger_ng-0.4.0-cp310-abi3-manylinux_2_28_ppc64le.whl (6.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ppc64le
  • datahugger_ng-0.4.0-cp310-abi3-manylinux_2_28_i686.whl (5.4 MB): CPython 3.10+, manylinux (glibc 2.28+) i686
  • datahugger_ng-0.4.0-cp310-abi3-manylinux_2_28_armv7l.whl (5.1 MB): CPython 3.10+, manylinux (glibc 2.28+) ARMv7l
  • datahugger_ng-0.4.0-cp310-abi3-manylinux_2_28_aarch64.whl (6.2 MB): CPython 3.10+, manylinux (glibc 2.28+) ARM64
  • datahugger_ng-0.4.0-cp310-abi3-macosx_11_0_arm64.whl (3.5 MB): CPython 3.10+, macOS 11.0+ ARM64
