Download gallery and insert into h2hdb
H2HDB Downloader (h2hdb-downloader)
Automates downloading galleries from exhentai/e-hentai (via hbrowser) and
recording their state in an h2hdb database. It has no CLI or standalone
runtime of its own — it's a library consumed by another project that owns
the browser session and the overall process lifecycle.
Concepts
- Gallery: a single exhentai/e-hentai gallery, identified by a gid (numeric
  id) and represented as h2h_galleryinfo_parser.GalleryURLParser once its URL
  is known.
- Dedup: before issuing a real network download, the package checks h2hdb to
  see if the gid is already settled (downloaded, and not flagged for
  redownload). Settled gids are skipped, except periodically, at a random
  interval (1 to 19 attempts), when one is force-redownloaded anyway as an
  integrity re-check.
- Durable queue: every download attempt is logged to the h2hdb
  todownload_gids table before it starts and cleared after it finishes, so a
  process killed mid-download leaves a trace that gets retried on the next
  run instead of silently disappearing. The same table doubles as a manual
  work queue: drop a (gid, url) row into the CSV file you configure as
  csv_path and it will be picked up the next time the queue is drained.
- Deep download: download a gallery, then look at its artist/group tags and
  download sibling galleries that match a set of search conditions (e.g.
  other-language releases of the same work).
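The dedup rule with its periodic forced re-check can be pictured roughly as follows. This is a minimal sketch and not the package's code: RecheckGate and should_download are hypothetical names, and it assumes the 1-to-19 interval counts only attempts on settled gids.

```python
import random
from typing import Optional


class RecheckGate:
    """Hypothetical helper: decides whether a gid is downloaded or skipped."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self._rng = rng or random.Random()
        self._remaining = self._draw()

    def _draw(self) -> int:
        # Random interval of 1 to 19 attempts, as described above.
        return self._rng.randint(1, 19)

    def should_download(self, settled: bool) -> bool:
        if not settled:
            return True  # unsettled gids are always downloaded
        self._remaining -= 1
        if self._remaining > 0:
            return False  # settled and not yet due: skip
        self._remaining = self._draw()  # schedule the next forced re-check
        return True  # settled, but force a redownload as an integrity check


gate = RecheckGate(random.Random(0))
decisions = [gate.should_download(settled=True) for _ in range(100)]
```

Under these assumptions, a run of consecutive skips is never longer than 18, so a settled gid is re-checked at least once in every 19 settled attempts.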
API
Downloader is the sole public export. Every method either acts on a
target you explicitly pass in, or — for the two queue-reading methods below
— hands back a plain value with no further bookkeeping required from you.
There is no "run the whole thing" method: deciding when to stop, what order
to process things in, and how to report progress is the calling
application's job, not the library's.
Downloader(
    driver: ExHDriver,            # an un-entered driver; see below
    config_path: str,             # path to the h2hdb JSON config
    csv_path: str | None = None,  # path to the manual download-queue CSV
    *,
    wait4client: int,             # seconds to wait before retrying after ClientOfflineException
    retry2download: int,          # seconds to wait before retrying after InsufficientFundsException
)
csv_path only enables the optional "queue a gid/url by editing a CSV file"
feature described above — leave it as None if you don't need that; the
durable in-flight log and dedup cache work identically either way.
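To make the durable in-flight log and the CSV queue more concrete, here is a rough sketch that uses an in-memory SQLite table as a stand-in for h2hdb. Only the table name todownload_gids comes from the description above; the column layout, the header-less CSV format, and the helper names seed_from_csv, tracked_download, and pending are assumptions for illustration. A real log would live in persistent storage rather than in memory.

```python
import csv
import io
import sqlite3

# Stand-in for h2hdb: an in-memory SQLite table named after the one described above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE todownload_gids (gid INTEGER PRIMARY KEY, url TEXT)")


def seed_from_csv(text: str) -> None:
    """Copy (gid, url) rows from CSV text into the queue (assumed: no header row)."""
    for gid, url in csv.reader(io.StringIO(text)):
        db.execute("INSERT OR IGNORE INTO todownload_gids VALUES (?, ?)", (int(gid), url))
    db.commit()


def tracked_download(gid: int, url: str, download) -> None:
    """Log the attempt before it starts and clear it only after it finishes."""
    db.execute("INSERT OR IGNORE INTO todownload_gids VALUES (?, ?)", (gid, url))
    db.commit()  # the trace is written before any network work begins
    download(url)  # if this raises or the process dies, the row stays behind
    db.execute("DELETE FROM todownload_gids WHERE gid = ?", (gid,))
    db.commit()


def pending() -> list[int]:
    """Return the gids still waiting in the queue, smallest first."""
    return [row[0] for row in db.execute("SELECT gid FROM todownload_gids ORDER BY gid")]


def crash(url: str) -> None:
    raise RuntimeError("simulated kill mid-download")


seed_from_csv("123,https://exhentai.org/g/123/456/\n")  # manually queued row
tracked_download(1, "https://example.invalid/g/1/a/", lambda url: None)  # completes
try:
    tracked_download(2, "https://example.invalid/g/2/b/", crash)  # interrupted
except RuntimeError:
    pass
```

After this runs, pending() holds the interrupted gid 2 and the manually queued gid 123, which is the set a later queue drain would pick up.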
Downloader is itself an async context manager that opens and closes the
browser session for you, so driver is expected un-entered:
async with Downloader(ExHDriver(headless=False), ...) as downloader:
    ...
If you'd rather manage the driver's lifecycle yourself, pass an
already-entered driver and skip async with downloader.
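The two ownership modes can be illustrated with stand-in classes. StubDriver and StubDownloader below are hypothetical and only mimic the described behaviour (a wrapper whose context manager enters the driver it was given); they are not the real ExHDriver or Downloader.

```python
import asyncio


class StubDriver:
    """Stand-in for a browser driver; records when its session opens and closes."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def __aenter__(self) -> "StubDriver":
        self.events.append("open")
        return self

    async def __aexit__(self, *exc_info) -> None:
        self.events.append("close")


class StubDownloader:
    """Stand-in wrapper: entering it enters the driver it was given."""

    def __init__(self, driver: StubDriver) -> None:
        self.driver = driver

    async def __aenter__(self) -> "StubDownloader":
        await self.driver.__aenter__()
        return self

    async def __aexit__(self, *exc_info) -> None:
        await self.driver.__aexit__(*exc_info)


async def wrapper_owns_session() -> list[str]:
    driver = StubDriver()  # un-entered: the wrapper opens and closes it
    async with StubDownloader(driver) as downloader:
        downloader.driver.events.append("work")
    return driver.events


async def caller_owns_session() -> list[str]:
    async with StubDriver() as driver:  # entered by the caller
        downloader = StubDownloader(driver)  # used without `async with`
        downloader.driver.events.append("work")
    return driver.events


events_a = asyncio.run(wrapper_owns_session())
events_b = asyncio.run(caller_owns_session())
```

In both modes the session is opened and closed exactly once; the only difference is which side is responsible for doing it.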
Method names follow one rule throughout: no suffix means it operates
directly on a GalleryURLParser you already have; _by_gid means it
resolves a bare gid to its gallery via search first, then does the same
thing.
- await download_by_gallery(target): download one GalleryURLParser, or an
  iterable of them. Returns {gallery: downloaded} for each. Retries
  automatically on ClientOfflineException (waits wait4client seconds) and
  InsufficientFundsException (waits retry2download seconds); a wait of 0
  means "don't retry, raise immediately."
- await download_by_gid(gid): resolve a bare gid to its gallery via search,
  then download it. If the gid no longer resolves to anything, it's recorded
  as removed in h2hdb; if it resolves to a different gid (the gallery was
  merged/redirected), the original gid is flagged for deletion. Either way,
  gid is fully settled in the pending-redownload queue before this returns,
  so callers never need to do that bookkeeping themselves.
- await download_by_tag(tag, conditions): download every gallery under a
  hbrowser Tag, once per search condition in conditions (or unconditionally
  if conditions is empty).
- await deep_download_by_gallery(gallery, policy, skip_check=False):
  download gallery, then for each tag in policy.filters (e.g. "artist",
  "group") on that gallery, call download_by_tag with policy.conditions. The
  cascade only runs if the initial download actually happened, unless
  skip_check=True forces it to run regardless (useful when you already know
  the gallery is downloaded from a separate call and just want the cascade).
  policy is a TagCascadePolicy(filters, conditions); both fields always
  travel together, so they're grouped into one frozen value object rather
  than two parallel parameters.
- await deep_download_by_gid(gid, policy, skip_check=False): same
  gid-resolution as download_by_gid, but deep.
- await drain_queue(policy, skip_check=True): process everything currently
  queued right now: anything queued manually via the CSV, plus anything left
  in-flight by a previous interrupted run. Doesn't loop or wait for more;
  it's a single, bounded pass over a snapshot.
- pending_redownload_gids(): a snapshot list of gids h2hdb currently flags
  as needing a redownload. Read-only; safe to call repeatedly as you work
  through it.
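The retry rule described for download_by_gallery (wait, then try again, with a wait of 0 meaning "raise immediately") can be sketched like this. OfflineError and retry_after are hypothetical stand-ins, the sleep function is injectable so the demo does not actually pause, and retrying without an upper bound is an assumption, since the text above does not say whether attempts are capped.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class OfflineError(Exception):
    """Hypothetical stand-in for ClientOfflineException."""


async def retry_after(
    operation: Callable[[], Awaitable[T]],
    wait_seconds: float,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry `operation` on OfflineError, pausing `wait_seconds` between attempts."""
    while True:
        try:
            return await operation()
        except OfflineError:
            if wait_seconds == 0:
                raise  # a wait of 0 means "don't retry, raise immediately"
            await sleep(wait_seconds)


async def demo() -> tuple[str, list[float]]:
    attempts = 0
    pauses: list[float] = []

    async def flaky() -> str:
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise OfflineError  # fail on the first two attempts
        return "downloaded"

    async def fake_sleep(seconds: float) -> None:
        pauses.append(seconds)  # record the pause instead of waiting

    result = await retry_after(flaky, 30 * 60, sleep=fake_sleep)
    return result, pauses


result, pauses = asyncio.run(demo())
```

Here the operation succeeds on its third attempt after two recorded 30-minute pauses; calling retry_after with wait_seconds=0 would instead propagate the first OfflineError.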
Example
The calling application owns the loop. A typical one drains the queue once, then walks the pending-redownload list, deep-downloading anything that actually got (re)downloaded:
import asyncio

from h2hdb_downloader import Downloader, TagCascadePolicy
from hbrowser import ExHDriver
from h2h_galleryinfo_parser import GalleryURLParser

policy = TagCascadePolicy(
    filters=("artist", "group"),
    conditions=("language:chinese$", "language:speechless$"),
)


async def main():
    async with Downloader(
        ExHDriver(headless=True),
        config_path="h2hdb-config.json",
        csv_path="todownload_gids.csv",
        wait4client=30 * 60,
        retry2download=4 * 60 * 60,
    ) as downloader:
        gallery = GalleryURLParser("https://exhentai.org/g/123/456/")
        await downloader.download_by_gallery(gallery)
        await downloader.download_by_gid(666)
        await downloader.deep_download_by_gallery(gallery, policy)

        await downloader.drain_queue(policy, skip_check=True)
        for gid in downloader.pending_redownload_gids():
            gb = await downloader.download_by_gid(gid)
            for downloaded_gallery, downloaded in gb.items():
                if downloaded:
                    await downloader.deep_download_by_gallery(
                        downloaded_gallery, policy, skip_check=True
                    )


asyncio.run(main())
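For intuition about what the cascade in a deep download expands to, the sketch below enumerates the searches implied by a policy. CascadePolicySketch and plan_searches are hypothetical, and representing a gallery's tags as (namespace, value) pairs is an assumption; the real TagCascadePolicy and hbrowser tag types may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CascadePolicySketch:
    """Stand-in for TagCascadePolicy: two fields that always travel together."""

    filters: tuple[str, ...]
    conditions: tuple[str, ...]


def plan_searches(
    gallery_tags: list[tuple[str, str]], policy: CascadePolicySketch
) -> list[tuple[str, str]]:
    """Return the (tag, condition) pairs the cascade would search for."""
    plan = []
    for namespace, value in gallery_tags:
        if namespace not in policy.filters:
            continue  # only tags in the filtered namespaces cascade
        tag = f"{namespace}:{value}"
        # An empty conditions tuple means one unconditional search per tag.
        for condition in policy.conditions or ("",):
            plan.append((tag, condition))
    return plan


sketch_policy = CascadePolicySketch(
    filters=("artist", "group"),
    conditions=("language:chinese$", "language:speechless$"),
)
tags = [("artist", "alice"), ("language", "japanese"), ("group", "circle-x")]
plan = plan_searches(tags, sketch_policy)
```

With two matching tags and two conditions, the plan contains four searches; the language tag is ignored because its namespace is not in filters.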
License
This project is distributed under the terms of the GNU General Public License (GPL). For detailed license terms, see the LICENSE file included in this distribution.