Python bindings for datahugger, a Rust tool for fetching data and metadata from a DOI or URL.
Datahugger API doc
This module provides a unified interface to resolve, crawl, and download datasets exposed over HTTP-like endpoints. A key design goal is that dataset crawling can be consumed both synchronously and asynchronously using the same API.
Overview
- Resolve a dataset from a URL
- Crawl its contents as a stream of entries (files or directories)
- Download and validate dataset contents using a blocking API backed by an async runtime
Core Concepts
DirEntry
Represents a directory in the dataset.
@dataclass
class DirEntry(Entry):
    path_crawl_rel: pathlib.Path
    root_url: str
    api_url: str
Fields
- path_crawl_rel: Path of the directory relative to the dataset root.
- root_url: Root URL of the dataset this directory belongs to.
- api_url: API endpoint used to query the directory contents.
FileEntry
Represents a file in the dataset.
@dataclass
class FileEntry(Entry):
    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None
    checksum: list[tuple[str, str]]
Fields
- path_crawl_rel: Path of the file relative to the dataset root.
- download_url: URL from which the file can be downloaded.
- size: File size in bytes, if known.
- checksum: List of checksum pairs (algorithm, value), e.g. ("sha256", "...").
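As a rough illustration of how these fields might be consumed, the sketch below defines a stand-in FileEntry dataclass that mirrors the fields listed above; it is not imported from the package. The helpers find_checksum and total_known_size are hypothetical names introduced here, and the URLs and checksum values are placeholders.

```python
from __future__ import annotations

import pathlib
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Stand-in mirroring the documented fields; not the packaged class."""

    path_crawl_rel: pathlib.Path
    download_url: str
    size: int | None = None
    checksum: list[tuple[str, str]] = field(default_factory=list)


def find_checksum(entry: FileEntry, algorithm: str) -> str | None:
    """Return the value recorded for `algorithm`, or None if it is absent."""
    for name, value in entry.checksum:
        if name.lower() == algorithm.lower():
            return value
    return None


def total_known_size(entries: list[FileEntry]) -> int:
    """Sum sizes in bytes, skipping entries whose size is unknown (None)."""
    return sum(e.size for e in entries if e.size is not None)


# Placeholder entries for illustration only.
entries = [
    FileEntry(pathlib.Path("a.csv"), "https://example.com/a.csv", 10, [("sha256", "abc")]),
    FileEntry(pathlib.Path("b.csv"), "https://example.com/b.csv", None, []),
]
print(find_checksum(entries[0], "SHA256"))
print(total_known_size(entries))
```

Because size may be None and checksum may be empty, code that consumes entries should probably treat both as optional rather than assume they are always populated.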
Iteration Model
SyncAsyncIterator[T]
A protocol that allows a single object to be used as both a synchronous and an asynchronous iterator.
class SyncAsyncIterator(Protocol[T]):
    def __aiter__(self) -> AsyncIterator[T]: ...
    async def __anext__(self) -> T: ...
    def __iter__(self) -> Iterator[T]: ...
    def __next__(self) -> T: ...
This enables APIs that can be consumed in either context without duplication.
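To make the protocol concrete, here is a minimal pure-Python sketch of an object that satisfies all four methods. DualIterator, collect_sync, and collect_async are hypothetical names used only for illustration; the package's own iterator is backed by Rust and may behave differently in detail.

```python
import asyncio
from typing import Generic, Iterable, TypeVar

T = TypeVar("T")


class DualIterator(Generic[T]):
    """Hypothetical wrapper usable with both `for` and `async for`."""

    def __init__(self, items: Iterable[T]) -> None:
        self._it = iter(items)

    def __iter__(self) -> "DualIterator[T]":
        return self

    def __next__(self) -> T:
        # Propagates StopIteration when the underlying iterator is exhausted.
        return next(self._it)

    def __aiter__(self) -> "DualIterator[T]":
        return self

    async def __anext__(self) -> T:
        try:
            return next(self._it)
        except StopIteration:
            # The async protocol signals exhaustion with StopAsyncIteration.
            raise StopAsyncIteration from None


def collect_sync(items):
    """Consume the wrapper with a plain `for` comprehension."""
    return [x for x in DualIterator(items)]


async def collect_async(items):
    """Consume the same wrapper type with `async for`."""
    return [x async for x in DualIterator(items)]


print(collect_sync([1, 2, 3]))
print(asyncio.run(collect_async([1, 2, 3])))
```

Both paths draw from the same underlying iterator, so a single instance should be consumed in one mode or the other, not both at once.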
Dataset
The central abstraction representing a remote dataset.
class Dataset:
    def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]: ...
    def crawl_file(self) -> SyncAsyncIterator[FileEntry]: ...
    def download_with_validation(
        self, dst_dir: pathlib.Path, limit: int = 0
    ) -> None: ...
    def id(self) -> str: ...
    def root_url(self) -> str: ...
Dataset.crawl()
def crawl(self) -> SyncAsyncIterator[FileEntry | DirEntry]
Returns a stream of dataset entries. Each entry is a union type: either a DirEntry or a FileEntry.
The returned object supports both:
Synchronous iteration
for entry in dataset.crawl():
    print(entry)
Asynchronous iteration
async for entry in dataset.crawl():
    print(entry)
Entries are yielded as either DirEntry or FileEntry.
Dataset.download_with_validation()
def download_with_validation(
    self, dst_dir: pathlib.Path, limit: int = 0
) -> None
Downloads files in the dataset into the given directory and validates them using the provided checksums.
- This is a blocking call.
- Internally backed by a Rust async runtime.
- Intended for use from synchronous Python code.
Parameters
- dst_dir: Destination directory for downloaded files.
- limit: Maximum number of files to download. 0 means no limit.
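The validation step happens inside the Rust implementation, and its details are not described here. Purely to illustrate what checking a file against (algorithm, value) pairs can look like, the following sketch uses Python's hashlib. The helpers file_digest and matches_checksums are hypothetical, and the sketch assumes the algorithm names are ones hashlib.new accepts (such as "sha256" or "md5").

```python
import hashlib
import pathlib
import tempfile


def file_digest(path: pathlib.Path, algorithm: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_checksums(path: pathlib.Path, checksums: list[tuple[str, str]]) -> bool:
    """True if every (algorithm, value) pair matches the file on disk.

    An empty list yields True, i.e. there is nothing to contradict.
    """
    return all(
        file_digest(path, algo).lower() == expected.lower()
        for algo, expected in checksums
    )


# Self-contained demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "sample.txt"
    sample.write_bytes(b"hello")
    expected = hashlib.sha256(b"hello").hexdigest()
    ok = matches_checksums(sample, [("sha256", expected)])
    bad = matches_checksums(sample, [("sha256", "0" * 64)])

print(ok, bad)
```

A check like this could also be run independently after a download, for example to re-verify files that were copied elsewhere.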
Dataset.root_url()
def root_url(self) -> str
Returns the dataset’s root URL.
Resolving a Dataset
resolve
def resolve(url: str, /) -> Dataset
Resolves a dataset from a given URL.
Example
dataset = resolve("https://example.com/dataset")
The returned Dataset can then be crawled or downloaded.
Example Usage
Crawl a dataset synchronously
dataset = resolve("https://example.com/dataset")
for entry in dataset.crawl():
    if isinstance(entry, FileEntry):
        print("File:", entry.path_crawl_rel)
    elif isinstance(entry, DirEntry):
        print("Dir:", entry.path_crawl_rel)
Crawl a dataset asynchronously
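import asyncio

async def main():
    dataset = resolve("https://example.com/dataset")
    # `async for` is only valid inside a coroutine.
    async for entry in dataset.crawl():
        print(entry)

asyncio.run(main())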
Download a dataset
dataset = resolve("https://example.com/dataset")
dataset.download_with_validation(dst_dir=pathlib.Path("./data"))
Download files
Built Distributions
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-win_amd64.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-win_amd64.whl
- Upload date:
- Size: 2.9 MB
- Tags: CPython 3.10+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c94a8e87fbb70715d647917fbdfa4029fb921e6d3a117b7573d998f3cbfa4a36 |
| MD5 | 5a9303bc97951ceef40ac97ff909ecd0 |
| BLAKE2b-256 | c88392f583b96e81438b33b913a9791fa84f4a7119f2ddca7ed0b1e127aaf540 |
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_x86_64.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_x86_64.whl
- Upload date:
- Size: 6.2 MB
- Tags: CPython 3.10+, musllinux: musl 1.2+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 728a9dfb6fb4fa5e6045888064a05a938a696695dd64186ac230aa64f4b42cae |
| MD5 | f9c40f043a5f088a3ed56586908ef8ed |
| BLAKE2b-256 | 89c85151e38cf1c44e7f8b37075bdcc66254dd736ed887383ef38e5bd2f5a189 |
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_i686.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_i686.whl
- Upload date:
- Size: 5.8 MB
- Tags: CPython 3.10+, musllinux: musl 1.2+ i686
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 231427e69ca6c96d182f53940f750fda8b0294d34ae3f0811ebbf6541c9d5a10 |
| MD5 | 9ab4bfe9ccc53bef2a400478847e156d |
| BLAKE2b-256 | 7ef847b9e5bd480060d089bdf84a8c4dd76c17a205cf87fcd41a2c59a6d223e9 |
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_armv7l.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_armv7l.whl
- Upload date:
- Size: 5.2 MB
- Tags: CPython 3.10+, musllinux: musl 1.2+ ARMv7l
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d9b7456e3290606acada7769c9eaf6870517f6a4515a086239b589f0ab5a0225 |
| MD5 | 72afaa37e9783eb79a30fa3f2d09b619 |
| BLAKE2b-256 | b060e4c9f5ac01cbe5e5ae704aff419157ed104b67b33216fc47216297876151 |
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_aarch64.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-musllinux_1_2_aarch64.whl
- Upload date:
- Size: 6.4 MB
- Tags: CPython 3.10+, musllinux: musl 1.2+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b7da2e169f437779ba88474cca0ee701654a3e58ff6ac89e3fa53bd97717ec0e |
| MD5 | 2f6b2026421a2c489e416eaefeb1c1d6 |
| BLAKE2b-256 | c57c53ced97e07e508921406cbcf77380d597b0e3b47396f20c11d19e5d98fc1 |
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 5.5 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 595471e436d88a2f38f1a3805df705918a81c32e7f06c2b6d6f8357c8b054182 |
| MD5 | 0ed5f45ddb91c977002091983b1f98c7 |
| BLAKE2b-256 | d17ea167a802e4c4fee785135d0790f8b306072b65ede9ce9c50c0d7a98de497 |
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_ppc64le.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_ppc64le.whl
- Upload date:
- Size: 6.1 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ ppc64le
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0e0fe9eb87750c776e3e860a5ff46d4d0d89ae4a74cf793733809b5e3821183e |
| MD5 | 01520d9771dc0b6433f0aa327008822e |
| BLAKE2b-256 | 6c073b169f705bb63dd9a38e427748654e71c152b85b0ef921da3b6773bb727f |
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_i686.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_i686.whl
- Upload date:
- Size: 5.3 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ i686
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f95200306fdfbe2dcd7399f0e91582865b1eefed62fb2ebd26e225f6cb230e48 |
| MD5 | 1ccc0f0d2323379b937cc99d44fb3ef0 |
| BLAKE2b-256 | b17f9096b4233c37da053f97139dad56559a23712437ca188e0d1d0bd614af12 |
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_armv7l.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_armv7l.whl
- Upload date:
- Size: 5.0 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ ARMv7l
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 74a08c90ba1f93742c8536fde88e41ec7c9d73f5abb78ee2e27dd70303c8d40e |
| MD5 | 1f28a85e873499cd6b5c29151e40ef61 |
| BLAKE2b-256 | 0ec4b0c1dfdfbf15e550e4f860a9b46d8da5cd387feaf77f711c691dddc91533 |
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_aarch64.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-manylinux_2_28_aarch64.whl
- Upload date:
- Size: 6.1 MB
- Tags: CPython 3.10+, manylinux: glibc 2.28+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1eff0fd8a35c01736128a8c7235cd01a6248d91f22a0dc9df13361eccd48a7c7 |
| MD5 | fd9b3b49868e9f9e0ec272f8ba9d0819 |
| BLAKE2b-256 | 40d3c6bdb80e8a9dfc4c70c86d3804ff724d3888c4e4d6a3c291bab40e710048 |
File details
Details for the file datahugger_ng-0.3.0-cp310-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: datahugger_ng-0.3.0-cp310-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 3.4 MB
- Tags: CPython 3.10+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 932c0b8bd6447acbd764a17bc3516452cb1f9bb25bcbc60da1ffb11a8143a331 |
| MD5 | 62e7a0c715ca5797a6fc1e778091a297 |
| BLAKE2b-256 | 0716c6851a6861dac414f421ef7309594e19027aa71a68445ec3d433b1e4d299 |