Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.

Datahugger API doc

Python version

This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.

Overview

  • Resolve a dataset from a URL
  • Crawl its contents as a stream of entries (files or directories)
  • Download and validate dataset contents using a blocking API backed by an async runtime

Core Concepts

DirEntry

Represents a directory in the dataset.

@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str

Fields

  • path_crawl_rel Path of the directory relative to the dataset root.

  • root_url Root URL of the dataset this directory belongs to.

  • api_url API endpoint used to query the directory contents.

FileEntry

Represents a file in the dataset.

@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]

Fields

  • path_crawl_rel Path of the file relative to the dataset root.

  • download_url URL from which the file can be downloaded.

  • size File size in bytes, if known.

  • checksum List of (algorithm, value) checksum pairs, e.g. ("sha256", "...").

Iteration Model

SyncAsyncIterator[T]

A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.

class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...

This enables APIs that can be consumed in either context without duplication.
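As a rough illustration of the protocol, the sketch below implements all four methods in pure Python on top of an in-memory iterable. DualIterator and collect_async are hypothetical names used only here; the real object returned by datahugger presumably wraps a Rust stream, and its internals are not described in this document.

```python
import asyncio
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Hypothetical object usable as both a sync and an async iterator."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it: Iterator[T] = iter(items)

    # Synchronous protocol
    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        return next(self._it)  # raises StopIteration when exhausted

    # Asynchronous protocol
    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # Async iteration signals exhaustion with StopAsyncIteration.
            raise StopAsyncIteration from None


async def collect_async(stream: DualIterator[T]) -> list[T]:
    """Drain the stream with `async for` semantics."""
    return [item async for item in stream]


sync_items = list(DualIterator(["a", "b"]))
async_items = asyncio.run(collect_async(DualIterator(["a", "b"])))
```

Both calls consume a fresh DualIterator, because in this sketch the sync and async views share one underlying iterator and an exhausted stream cannot be restarted.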

Dataset

The central abstraction representing a remote dataset.

class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...

Dataset.crawl()

def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]

Returns a stream of dataset entries. Each entry is a union type: either a DirEntry or a FileEntry.

The returned object supports both:

Synchronous iteration

for entry in dataset.crawl():
    print(entry)

Asynchronous iteration

# Must run inside a coroutine (an `async def` function).
async for entry in dataset.crawl():
    print(entry)

Entries are yielded as either DirEntry or FileEntry.

Dataset.download_with_validation()

def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None

Downloads files in the dataset into the given directory and validates them using the provided checksums.

  • This is a blocking call.
  • Internally backed by a Rust async runtime.
  • Intended for use from synchronous Python code.

Parameters

  • dst_dir Destination directory for downloaded files.

  • limit Maximum number of files to download. 0 means no limit.
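The "0 means no limit" convention can be sketched with a small helper that truncates any stream of entries. apply_limit is a hypothetical function written for this example; it only illustrates the documented meaning of the parameter and says nothing about how the library applies the limit internally.

```python
import itertools
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def apply_limit(entries: Iterable[T], limit: int = 0) -> Iterator[T]:
    """Yield at most `limit` entries; a limit of 0 means no limit."""
    # islice treats a stop value of None as "run until the input is exhausted".
    stop = None if limit == 0 else limit
    return itertools.islice(entries, stop)
```

A negative limit is rejected by itertools.islice with a ValueError; how the library itself treats negative values is not stated here.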

Dataset.root_url()

def root_url(self) -> str

Returns the dataset’s root URL.

Resolving a Dataset

resolve

def resolve(url: str, /) -> Dataset

Resolves a dataset from a given URL.

Example

dataset = resolve("https://example.com/dataset")

The returned Dataset can then be crawled or downloaded.

Example Usage

Crawl a dataset synchronously

dataset = resolve("https://example.com/dataset")

for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)

Crawl a dataset asynchronously

import asyncio

async def main():
    dataset = resolve("https://example.com/dataset")
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())  # drive the coroutine from synchronous code

Download a dataset

import pathlib

dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))
