Web gallery downloader

img-lurker

img-lurker is a gallery downloader.

img-lurker takes the URL of an HTML web page and downloads the images linked from it. If the page contains only thumbnails that link to the full-size versions of the images, img-lurker downloads the full-size versions instead. If the thumbnails link to other HTML pages that themselves contain the full-size images, img-lurker follows those links to find the larger versions.

img-lurker applies a "minimum image size" to decide whether an image is worth downloading, as opposed to UI elements such as buttons or separators. img-lurker does not follow a link unless the link tag contains an image tag (assumed to be the thumbnail).

Example

Consider a site with the following HTML:

<a href="fullimage1.jpg">
    <img src="thumbnail1.jpg" />
</a>
<a href="fullimage2.jpg">
    <img src="thumbnail2.jpg" />
</a>

img-lurker would download fullimage1.jpg and fullimage2.jpg. If instead the links point to other HTML pages containing the full size version of the images (for example fullimage1.html containing fullimage1.jpg), img-lurker would still find fullimage1.jpg by following the page links.
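
The rule described above (only links that wrap an image tag are considered) can be sketched with Python's standard html.parser module. This is a minimal illustration, not img-lurker's actual code: the class name LinkedThumbnailParser and the function linked_thumbnails are hypothetical, and the sketch ignores image sizes and does not fetch anything.

```python
from html.parser import HTMLParser


class LinkedThumbnailParser(HTMLParser):
    """Collect (link target, thumbnail source) pairs for <a> tags wrapping an <img>.

    Hypothetical helper for illustration; not part of img-lurker.
    """

    def __init__(self):
        super().__init__()
        self.current_href = None  # href of the <a> we are currently inside, if any
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.current_href = attrs.get("href")
        elif tag == "img" and self.current_href and attrs.get("src"):
            # An image inside a link: treat it as a thumbnail of the link target.
            self.pairs.append((self.current_href, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


def linked_thumbnails(html):
    """Return the (href, thumbnail src) pairs found in an HTML string."""
    parser = LinkedThumbnailParser()
    parser.feed(html)
    return parser.pairs
```

Fed the HTML from the example, linked_thumbnails returns the pairs (fullimage1.jpg, thumbnail1.jpg) and (fullimage2.jpg, thumbnail2.jpg). A link containing only text, or an image outside any link, produces no pair.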

Options

Cookies

--cookie KEY=VALUE

Inject a specific cookie, which might be required to visit some restricted-access pages. For example, some subreddits require you to pass the cookie "over18=1".

The option can be passed several times to inject multiple cookies.
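
As a rough illustration of how repeated KEY=VALUE arguments could be combined, the sketch below turns a list of such strings into a dictionary and then into a single Cookie request header. The function names parse_cookies and cookie_header are hypothetical, and how img-lurker actually handles its cookies internally is not documented here.

```python
def parse_cookies(pairs):
    """Turn ["KEY=VALUE", ...] into a dict; hypothetical helper, not img-lurker's API."""
    cookies = {}
    for pair in pairs:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        cookies[key] = value
    return cookies


def cookie_header(cookies):
    """Format a cookie dict as the value of an HTTP Cookie request header."""
    return "; ".join(f"{key}={value}" for key, value in cookies.items())
```

For example, parse_cookies(["over18=1"]) gives {"over18": "1"}, which cookie_header formats as over18=1.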

Pagination

--next-page-xpath HTML_XPATH

img-lurker can handle pagination for sites where a gallery contains so many images that it is split into numbered pages. HTML_XPATH should be an XPath expression locating the HTML link to the "next page". If this argument is given, after downloading all images of a "page", img-lurker will follow the link pointed to by HTML_XPATH and repeat on the next page.

Warning: this can generate a lot of traffic for huge galleries. Be cautious, or you might get blocked by the website.
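
To give an idea of what an HTML_XPATH expression selects, here is a small sketch using Python's standard xml.etree.ElementTree. It is an approximation under stated assumptions: ElementTree supports only a subset of XPath and requires well-formed markup, whereas the library img-lurker uses is not specified here. The function name next_page_url, the sample markup and the example.com URL are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def next_page_url(page_url, xhtml, xpath):
    """Return the absolute URL of the "next page" link, or None if there is none.

    Hypothetical helper: `xpath` plays the role of HTML_XPATH and must locate
    the <a> element of the "next page" link.
    """
    root = ET.fromstring(xhtml)
    link = root.find(xpath)
    if link is None:
        return None  # last page: nothing more to follow
    href = link.get("href")
    # Links are often relative, so resolve them against the current page URL.
    return urljoin(page_url, href) if href else None


page = (
    "<html><body>"
    "<a class='prev' href='page1.html'>previous</a>"
    "<a class='next' href='page3.html'>next</a>"
    "</body></html>"
)
following = next_page_url(
    "https://example.com/gallery/page2.html", page, ".//a[@class='next']"
)
```

Here the expression .//a[@class='next'] picks the second link, and following holds the resolved URL of page3.html. On the last page the expression matches nothing and the function returns None, which is a natural stopping point for the loop.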

Stop/resume

--history-file FILE

Record the URL of every downloaded image in this file and skip any URL already present in it. This is useful when running img-lurker multiple times on the same gallery, typically after the gallery has received fresh images. It is also useful if you use the --next-page-xpath option and kill img-lurker to avoid flooding the site: after a pause (minutes? hours? days?), restart img-lurker and the history file will help it resume where it was interrupted.

This makes the assumption that:

  • each image will always have the same URL, e.g. no varying tokens/timestamps in the URL, etc.
  • conversely, a URL will always point to the same image and will never point to a different one later, e.g. the images are NOT numbered in ascending order (else 1.jpg would point to different images over time).
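
The stop/resume behaviour can be pictured with the sketch below. It assumes a history file holding one URL per line, which is a guess at the format rather than something documented, and the names load_history and download_new are hypothetical. The download argument stands in for whatever actually fetches an image.

```python
from pathlib import Path


def load_history(path):
    """Return the set of URLs already recorded (assumed format: one URL per line)."""
    path = Path(path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def download_new(urls, history_path, download):
    """Call download(url) for each URL not yet in the history, recording it afterwards."""
    seen = load_history(history_path)
    fetched = []
    with open(history_path, "a", encoding="utf-8") as history:
        for url in urls:
            if url in seen:
                continue  # already downloaded in a previous run
            download(url)
            history.write(url + "\n")
            history.flush()  # an interrupted run keeps what it already finished
            seen.add(url)
            fetched.append(url)
    return fetched
```

Because the lookup is by URL only, the sketch depends on exactly the two assumptions listed above: a URL that changes between runs is downloaded again, and a URL that is reused for a different image is skipped.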

Tell apart thumbnails from "big images" to download only the latter

--min-thumb-size WIDTHxHEIGHT
--min-image-size WIDTHxHEIGHT

Minimum sizes for an image to be considered a thumbnail worth following (--min-thumb-size) or an image worth downloading (--min-image-size). Useful to avoid downloading navigation buttons, logos, etc. Default values are --min-thumb-size=128x128 and --min-image-size=400x400.
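
A minimal sketch of how a WIDTHxHEIGHT value might be parsed and applied follows. The names parse_size and meets_minimum are hypothetical, and requiring both dimensions to reach the minimum is an assumption about the comparison, not confirmed behaviour.

```python
def parse_size(text):
    """Parse "WIDTHxHEIGHT" (e.g. "400x400") into a (width, height) tuple of ints."""
    width, sep, height = text.lower().partition("x")
    if not sep:
        raise ValueError(f"expected WIDTHxHEIGHT, got {text!r}")
    return int(width), int(height)


def meets_minimum(size, minimum):
    """True if both dimensions reach the minimum (assumed comparison rule)."""
    return size[0] >= minimum[0] and size[1] >= minimum[1]


# Default values quoted in this section.
MIN_THUMB_SIZE = parse_size("128x128")
MIN_IMAGE_SIZE = parse_size("400x400")
```

Under these assumptions, a 150x150 image passes the thumbnail threshold but not the image threshold, while a 32x32 button passes neither.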

--max-aspect-ratio WIDTH:HEIGHT

Maximum ratio between WIDTH and HEIGHT (or between HEIGHT and WIDTH; img-lurker treats both orientations the same) for an image to be considered worth downloading.

For example, pass "16:9" and img-lurker will accept images with dimensions 1920x1080 or 1080x1920, as they are respectively 16:9 and 9:16. It will also accept 1600x1200 or 1200x1600, because they are 4:3 (and 3:4), which is lower (closer to a square) than the maximum "16:9". Portrait and landscape ratios are considered equivalent. However, passing "16:9" would discard a banner with dimensions 1200x300, because its ratio is 4:1, which is far more elongated (a very thin rectangle) than 16:9. It would also reject a banner with dimensions 300x1200, because it is 1:4, equivalent to 4:1.

A photo is rarely square but is almost never as thin as 4:1, except for panoramas, so adjust this option if you intend to download panoramas, for example. The default value is --max-aspect-ratio=4:1.
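
The orientation-independent comparison described above can be written as a short sketch: divide the longer side by the shorter side and compare the result with the maximum. The names parse_ratio and within_aspect_ratio are hypothetical, and exact fractions are used here only to avoid rounding issues at the boundary.

```python
from fractions import Fraction


def parse_ratio(text):
    """Parse "WIDTH:HEIGHT" into a ratio >= 1, so "16:9" and "9:16" are equivalent."""
    width, sep, height = text.partition(":")
    if not sep:
        raise ValueError(f"expected WIDTH:HEIGHT, got {text!r}")
    ratio = Fraction(int(width), int(height))
    return max(ratio, 1 / ratio)


def within_aspect_ratio(width, height, max_ratio):
    """True if the longer side divided by the shorter side does not exceed max_ratio."""
    ratio = Fraction(max(width, height), min(width, height))
    return ratio <= max_ratio


limit = parse_ratio("16:9")
accepted = [size for size in [(1920, 1080), (1080, 1920), (1600, 1200), (1200, 1600)]
            if within_aspect_ratio(*size, limit)]
rejected = [size for size in [(1200, 300), (300, 1200)]
            if not within_aspect_ratio(*size, limit)]
```

With a limit of 16:9, all four photo sizes from the example end up in accepted and both banners end up in rejected. With the default of 4:1, the same banners would be accepted, since their ratio equals the limit in this sketch.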

Debug

--debug

Enable debug logging.

Limitations

  • img-lurker does not interpret JavaScript, so it might not work on sites like Instagram, though it has specific hints to detect lazy-loaded images.
  • img-lurker does not open iframes, so it will fail to download some images from Reddit.
  • img-lurker does not crawl a site and does not support nested galleries; it takes a single gallery and expects it to contain the desired images.
