Download gallery and insert into h2hdb

H2HDB Downloader (h2hdb-downloader)

Automates downloading galleries from exhentai/e-hentai (via hbrowser) and recording their state in an h2hdb database. It has no CLI or standalone runtime of its own — it's a library consumed by another project that owns the browser session and the overall process lifecycle.

Concepts

  • Gallery — a single exhentai/e-hentai gallery, identified by a gid (numeric id) and represented as h2h_galleryinfo_parser.GalleryURLParser once its URL is known.
  • Dedup — before issuing a real network download, the package checks h2hdb to see if the gid is already settled (downloaded, and not flagged for redownload). Settled gids are skipped — except periodically, at a random interval (1 to 19 attempts), when one is force-redownloaded anyway as an integrity re-check.
  • Durable queue — every download attempt is logged to the h2hdb todownload_gids table before it starts and cleared after it finishes, so a process killed mid-download leaves a trace that gets retried on the next run instead of silently disappearing. The same table doubles as a manual work queue: drop a (gid, url) row into the CSV file you configure as csv_path and it will be picked up the next time the queue is drained.
  • Deep download — download a gallery, then look at its artist/group tags and download sibling galleries that match a set of search conditions (e.g. other-language releases of the same work).
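
The dedup rule above can be pictured with a small sketch. It is an illustration, not the package's code: RecheckGate, should_download, and the in-memory set standing in for the h2hdb lookup are hypothetical names, and reading "1 to 19" as inclusive bounds is an assumption.

```python
import random


class RecheckGate:
    """Decide whether a gid needs a real download (illustrative only)."""

    def __init__(self, settled_gids, rng=None):
        self._settled = set(settled_gids)  # stand-in for the h2hdb lookup
        self._rng = rng or random.Random()
        self._skips_left = self._next_interval()

    def _next_interval(self):
        # Assumption: the interval is drawn uniformly from 1..19 inclusive.
        return self._rng.randint(1, 19)

    def should_download(self, gid):
        if gid not in self._settled:
            return True  # not settled: always download
        self._skips_left -= 1
        if self._skips_left <= 0:
            # Interval used up: force one re-download and draw a new interval.
            self._skips_left = self._next_interval()
            return True
        return False  # settled: skip the network call
```

With this shape, an unsettled gid never touches the counter, and a settled gid is forced through at least once in any run of 19 consecutive settled attempts.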

API

Downloader is the sole public export. Every method either acts on a target you explicitly pass in, or — for the two queue-reading methods below — hands back a plain value with no further bookkeeping required from you. There is no "run the whole thing" method: deciding when to stop, what order to process things in, and how to report progress is the calling application's job, not the library's.

Downloader(
    driver: ExHDriver,         # an un-entered driver; see below
    config_path: str,          # path to the h2hdb JSON config
    csv_path: str | None = None,  # path to the manual download-queue CSV
    *,
    wait4client: int,       # seconds to wait before retrying after ClientOfflineException
    retry2download: int,    # seconds to wait before retrying after InsufficientFundsException
)

csv_path only enables the optional "queue a gid/url by editing a CSV file" feature described above — leave it as None if you don't need that; the durable in-flight log and dedup cache work identically either way.
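
If you do use the CSV queue, adding a row could look like the sketch below. The file layout is an assumption (no header row, gid in the first column, url in the second), and queue_gallery is a hypothetical helper rather than part of the library, so check it against your own file before relying on it.

```python
import csv
from pathlib import Path


def queue_gallery(csv_path, gid, url):
    """Append one (gid, url) row to the manual queue file.

    Assumes a header-less, comma-separated file with gid first and url
    second; the real layout expected by the library may differ.
    """
    with Path(csv_path).open("a", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerow([gid, url])
```

Opening the file in append mode keeps any rows that are already waiting to be picked up.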

Downloader is itself an async context manager that opens and closes the browser session for you, so driver is expected un-entered:

async with Downloader(ExHDriver(headless=False), ...) as downloader:
    ...

If you'd rather manage the driver's lifecycle yourself, pass an already-entered driver and skip async with downloader.

Method names follow one rule throughout: the _by_gallery suffix means the method operates directly on a GalleryURLParser you already have; _by_gid means it first resolves a bare gid to its gallery via search, then does the same thing.

  • await download_by_gallery(target) — download one GalleryURLParser, or an iterable of them. Returns {gallery: downloaded} for each. Retries automatically on ClientOfflineException (waits wait4client seconds) and InsufficientFundsException (waits retry2download seconds); a wait of 0 means "don't retry, raise immediately."
  • await download_by_gid(gid) — resolve a bare gid to its gallery via search, then download it. If the gid no longer resolves to anything, it's recorded as removed in h2hdb; if it resolves to a different gid (the gallery was merged/redirected), the original gid is flagged for deletion. Either way, gid is fully settled in the pending-redownload queue before this returns — callers never need to do that bookkeeping themselves.
  • await download_by_tag(tag, conditions) — download every gallery under an hbrowser Tag, once per search condition in conditions (or unconditionally if conditions is empty).
  • await deep_download_by_gallery(gallery, policy, skip_check=False) — download gallery, then for each tag in policy.filters (e.g. "artist", "group") on that gallery, call download_by_tag with policy.conditions. The cascade only runs if the initial download actually happened, unless skip_check=True forces it to run regardless (useful when you already know the gallery is downloaded from a separate call and just want the cascade). policy is a TagCascadePolicy(filters, conditions) — both fields always travel together, so they're grouped into one frozen value object rather than two parallel parameters.
  • await deep_download_by_gid(gid, policy, skip_check=False) — same gid-resolution as download_by_gid, but deep.
  • await drain_queue(policy, skip_check=True) — process everything queued at the time of the call: anything queued manually via the CSV, plus anything left in-flight by a previous interrupted run. It doesn't loop or wait for more; it's a single, bounded pass over a snapshot.
  • pending_redownload_gids() — a snapshot list of gids h2hdb currently flags as needing a redownload. Read-only; safe to call repeatedly as you work through it.
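
The log-before, clear-after idea behind the durable queue, and the single bounded pass that drain_queue makes over it, can be sketched with a toy table. Everything here is assumed for illustration: InFlightLog, download_logged, drain_once, the two-column schema, and the use of SQLite in place of the real h2hdb backend. In this sketch an attempt that raises leaves its row behind; how the library treats handled failures is not stated above.

```python
import sqlite3


class InFlightLog:
    """Toy stand-in for the todownload_gids table (schema is assumed)."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS todownload_gids "
            "(gid INTEGER PRIMARY KEY, url TEXT)"
        )

    def add(self, gid, url):
        self._db.execute(
            "INSERT OR REPLACE INTO todownload_gids VALUES (?, ?)", (gid, url)
        )
        self._db.commit()

    def clear(self, gid):
        self._db.execute("DELETE FROM todownload_gids WHERE gid = ?", (gid,))
        self._db.commit()

    def snapshot(self):
        return self._db.execute(
            "SELECT gid, url FROM todownload_gids ORDER BY gid"
        ).fetchall()


def download_logged(log, gid, url, fetch):
    log.add(gid, url)  # recorded before any network work starts
    fetch(gid, url)    # if this raises or the process dies, the row stays
    log.clear(gid)     # only a finished attempt removes the row


def drain_once(log, fetch):
    """One bounded pass: rows added while draining wait for the next call."""
    for gid, url in log.snapshot():
        download_logged(log, gid, url, fetch)
```

Because drain_once iterates over a snapshot taken up front, it always terminates, which is why the decision to call it again belongs to the calling application.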

Example

The calling application owns the loop. A typical one drains the queue once, then walks the pending-redownload list, deep-downloading anything that actually got (re)downloaded:

import asyncio
from h2hdb_downloader import Downloader, TagCascadePolicy
from hbrowser import ExHDriver
from h2h_galleryinfo_parser import GalleryURLParser

policy = TagCascadePolicy(
    filters=("artist", "group"),
    conditions=("language:chinese$", "language:speechless$"),
)


async def main():
    async with Downloader(
        ExHDriver(headless=True),
        config_path="h2hdb-config.json",
        csv_path="todownload_gids.csv",
        wait4client=30 * 60,
        retry2download=4 * 60 * 60,
    ) as downloader:
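        # Download a single gallery whose URL is already known.
        gallery = GalleryURLParser("https://exhentai.org/g/123/456/")
        await downloader.download_by_gallery(gallery)
        # Resolve a bare gid via search, then download it.
        await downloader.download_by_gid(666)
        # Download the gallery, then cascade over its artist/group tags.
        await downloader.deep_download_by_gallery(gallery, policy)

        # One bounded pass over the CSV queue plus any interrupted downloads.
        await downloader.drain_queue(policy, skip_check=True)
        for gid in downloader.pending_redownload_gids():
            results = await downloader.download_by_gid(gid)
            for downloaded_gallery, downloaded in results.items():
                if downloaded:
                    # The download just happened, so go straight to the cascade.
                    await downloader.deep_download_by_gallery(
                        downloaded_gallery, policy, skip_check=True
                    )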


asyncio.run(main())
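
The wait4client and retry2download values in the example are plain seconds (30 * 60 is 1800). The retry rule they control, including the "0 means raise immediately" case, might look roughly like the sketch below. ClientOffline and retry_on are hypothetical stand-ins written for this illustration; they are not the library's exception class or its internal retry loop.

```python
import asyncio


class ClientOffline(Exception):
    """Hypothetical stand-in for the ClientOfflineException named above."""


async def retry_on(exc_type, wait_seconds, attempt, sleep=asyncio.sleep):
    """Call attempt() until it succeeds, pausing wait_seconds after exc_type.

    A wait of 0 disables retrying: the exception propagates at once.
    """
    while True:
        try:
            return await attempt()
        except exc_type:
            if wait_seconds == 0:
                raise
            await sleep(wait_seconds)
```

The sleep parameter exists only so the pause can be swapped out in tests; any other exception type passes through untouched.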

License

This project is distributed under the terms of the GNU General Public License (GPL). For detailed license terms, see the LICENSE file included in this distribution.


