
image-harvest

A fast CLI tool to discover and batch download images from web pages.

Features

  • Zero config — imgx https://example.com downloads all images instantly
  • Smart discovery — finds images in <img>, data-src (lazy-load), srcset, <picture>, og:image, CSS background-image, and <a> links
  • Parallel downloads — configurable thread pool (default 4, up to 32)
  • Content deduplication — skips duplicate images by SHA-256 hash, even from different URLs
  • Resume support — automatically skips already-downloaded files
  • Size filtering — skip by minimum width/height or maximum file size
  • Type filtering — only download specific formats (jpg, png, webp, etc.)
  • Auto-excludes junk — tracking pixels (1x1), data URIs, SVGs (toggleable)
  • Recursive crawling — follow links to discover images across entire sites
  • Custom renaming — rename files with patterns like {n:04d}.{ext}
  • List-only mode — just discover and list image URLs without downloading
  • Multiple output formats — plain, JSON, JSONL for list mode
  • Polite crawling — rate limiting and robots.txt compliance
  • Proxy support — route through HTTP/HTTPS proxies
  • Minimal dependencies — only requests + beautifulsoup4
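
The content-deduplication step can be pictured as hashing each downloaded payload and skipping bytes already seen. A minimal sketch of the idea (dedupe here is a hypothetical helper, not image-harvest's internals):

```python
import hashlib

def dedupe(payloads):
    """Yield each distinct payload once, keyed by the SHA-256 of its bytes."""
    seen = set()
    for data in payloads:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # identical bytes, possibly served from a different URL
        seen.add(digest)
        yield data

unique = list(dedupe([b"cat", b"dog", b"cat"]))
# unique == [b"cat", b"dog"]
```

Hashing content rather than comparing URLs is what lets the tool skip the same image mirrored at several addresses.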

Installation

pip install image-harvest

Requires Python 3.8+.

Quick Start

Download all images from a page:

imgx https://example.com/gallery

Download only JPG and PNG, minimum 200px wide:

imgx https://example.com --type jpg,png --min-width 200 -o ./photos

Parallel download with 8 threads:

imgx https://example.com --workers 8

List images without downloading:

imgx https://example.com --list-only --list-format json

Crawl an entire site for images:

imgx https://example.com --depth 2 --max-pages 100 -v

Rename downloaded files sequentially:

imgx https://example.com --rename "{n:04d}.{ext}"
# Downloads as 0001.jpg, 0002.png, 0003.webp, ...
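
The pattern placeholders follow Python's str.format mini-language, so {n:04d} zero-pads the running counter to four digits. A sketch of how such a pattern could expand (apply_pattern is an illustrative helper, not the package's API):

```python
def apply_pattern(pattern: str, n: int, name: str, ext: str) -> str:
    # Fill the documented placeholders: {n} counter, {name} original stem, {ext} extension.
    return pattern.format(n=n, name=name, ext=ext)

names = [apply_pattern("{n:04d}.{ext}", i, "photo", ext)
         for i, ext in enumerate(["jpg", "png", "webp"], start=1)]
# names == ["0001.jpg", "0002.png", "0003.webp"]
```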

CLI Reference

imgx [OPTIONS] URL...

Output

| Flag | Default | Description |
|---|---|---|
| -o, --output DIR | ./images | Output directory |
| --list-only | off | Only list URLs, don't download |
| --list-format | plain | Format: plain, json, jsonl |

Filtering

| Flag | Default | Description |
|---|---|---|
| --min-width PX | 0 | Minimum image width (from HTML) |
| --min-height PX | 0 | Minimum image height (from HTML) |
| --type EXT | all | Only these types: jpg,png,webp |
| --exclude-type EXT | — | Exclude types: gif,bmp |
| --include-svg | off | Include SVG (excluded by default) |
| --max-size MB | 0 | Skip images larger than this |

Download

| Flag | Default | Description |
|---|---|---|
| -w, --workers N | 4 | Parallel download threads |
| --overwrite | off | Overwrite existing files |
| --no-dedupe | off | Disable content-hash deduplication |
| --rename PATTERN | — | Rename: {n:04d}.{ext}, {name}_{n}.{ext} |
| --timeout SEC | 30 | Download timeout per image |

Crawling

| Flag | Default | Description |
|---|---|---|
| --depth N | 0 | Recursion depth (0 = single page) |
| --max-pages N | 50 | Max pages to crawl |
| --rate-limit SEC | 1.0 | Seconds between page requests |
| --ignore-robots | off | Ignore robots.txt |
| --proxy URL | — | HTTP/HTTPS proxy |
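
The --rate-limit behavior amounts to enforcing a minimum interval between successive page fetches. A sketch under that assumption (RateLimiter is illustrative, not the crawler's actual class; the clock and sleep function are injectable only to make the sketch testable):

```python
import time

class RateLimiter:
    """Sleep as needed so successive wait() calls are at least `interval` seconds apart."""
    def __init__(self, interval: float):
        self.interval = interval
        self._last = None

    def wait(self, now=None, sleep=time.sleep):
        if now is None:
            now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                sleep(remaining)
                now += remaining
        self._last = now

limiter = RateLimiter(1.0)
slept = []
limiter.wait(now=0.0, sleep=slept.append)  # first call: no delay
limiter.wait(now=0.5, sleep=slept.append)  # only 0.5 s elapsed -> sleep 0.5 s
# slept == [0.5]
```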

Python API

from image_extractor import ImageExtractor
from image_extractor.downloader import ImageDownloader

# Discover images
ex = ImageExtractor(min_width=100, include_types=["jpg", "png"])
images = ex.from_url("https://example.com/gallery")

for img in images:
    print(f"{img.url} ({img.filename})")

# Download them
dl = ImageDownloader(output_dir="./photos", workers=8)
stats = dl.download(images)
print(f"Downloaded {stats['downloaded']} images")

Crawling API

from image_extractor import ImageExtractor
from image_extractor.crawler import ImageCrawler

ex = ImageExtractor()
crawler = ImageCrawler(ex, rate_limit=1.0)
images = crawler.crawl("https://example.com", max_depth=2)

Image Discovery Sources

image-harvest looks for images in all these locations:

| Source | Example |
|---|---|
| <img src> | <img src="/photo.jpg"> |
| <img data-src> | Lazy-loaded images |
| <img srcset> | Responsive images |
| <picture><source> | Art direction |
| <a href> | Links to image files |
| <meta og:image> | Open Graph images |
| CSS background-image | Inline style backgrounds |

Automatically excluded: tracking pixels (1x1), data: URIs, SVGs (unless --include-svg).
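
A toy version of the <img> part of this discovery, using only the standard library's html.parser for illustration (image-harvest itself uses beautifulsoup4, and covers the other sources above as well):

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collect candidate URLs from <img> src, data-src, and srcset attributes."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        for key in ("src", "data-src"):      # plain and lazy-loaded sources
            if a.get(key):
                self.urls.append(a[key])
        for entry in (a.get("srcset") or "").split(","):
            if entry.strip():                # srcset entries look like "url 640w"
                self.urls.append(entry.strip().split()[0])

collector = ImgCollector()
collector.feed('<img src="/a.jpg">'
               '<img data-src="/lazy.png" srcset="/b-640.webp 640w, /b-1280.webp 1280w">')
# collector.urls == ["/a.jpg", "/lazy.png", "/b-640.webp", "/b-1280.webp"]
```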

Comparison

| Feature | image-harvest | gallery-dl | google-images-download | icrawler |
|---|---|---|---|---|
| Zero config | Yes | No | Yes | No |
| Any web page | Yes | Site-specific | Google only | Search engines |
| Parallel downloads | Yes | No | No | Yes |
| Content deduplication | Yes | No | No | No |
| Resume support | Yes | Yes | No | No |
| Size/type filtering | Yes | Complex config | Basic | Basic |
| Lazy-load detection | Yes | N/A | No | No |
| srcset/picture | Yes | N/A | No | No |
| og:image + CSS bg | Yes | No | No | No |
| Recursive crawling | Yes | Per-site | No | No |
| Custom renaming | Yes | Yes | No | No |
| List-only mode | Yes | No | No | No |
| robots.txt | Yes | No | No | No |
| pip install | Yes | Yes | Yes | Yes |
| Dependencies | 2 | Many | Many | Several |

Also by Thunderbit

  • email-harvest — Extract email addresses from text, files, and web pages
  • phone-harvest — Extract and identify phone numbers from text, files, and web pages

License

MIT


Built by Thunderbit — AI-powered web scraper and data extraction tools.

