image-harvest
A fast CLI tool to discover and batch download images from web pages.
Features
- Zero config — `imgx https://example.com` downloads all images instantly
- Smart discovery — finds images in `<img>`, `data-src` (lazy-load), `srcset`, `<picture>`, `og:image`, CSS `background-image`, and `<a>` links
- Parallel downloads — configurable thread pool (default 4, up to 32)
- Content deduplication — skips duplicate images by SHA-256 hash, even from different URLs
- Resume support — automatically skips already-downloaded files
- Size filtering — skip by minimum width/height or maximum file size
- Type filtering — only download specific formats (jpg, png, webp, etc.)
- Auto-excludes junk — tracking pixels (1x1), data URIs, SVGs (toggleable)
- Recursive crawling — follow links to discover images across entire sites
- Custom renaming — rename files with patterns like `{n:04d}.{ext}`
- List-only mode — just discover and list image URLs without downloading
- Multiple output formats — plain, JSON, JSONL for list mode
- Polite crawling — rate limiting and robots.txt compliance
- Proxy support — route through HTTP/HTTPS proxies
- Minimal dependencies — only `requests` + `beautifulsoup4`
Installation
```
pip install image-harvest
```
Requires Python 3.8+.
Quick Start
Download all images from a page:

```
imgx https://example.com/gallery
```

Download only JPG and PNG, minimum 200px wide:

```
imgx https://example.com --type jpg,png --min-width 200 -o ./photos
```

Parallel download with 8 threads:

```
imgx https://example.com --workers 8
```

List images without downloading:

```
imgx https://example.com --list-only --list-format json
```

Crawl an entire site for images:

```
imgx https://example.com --depth 2 --max-pages 100 -v
```

Rename downloaded files sequentially:

```
imgx https://example.com --rename "{n:04d}.{ext}"
# Downloads as 0001.jpg, 0002.png, 0003.webp, ...
```
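In the rename pattern, `{n}` and `{ext}` are ordinary Python format fields, so `{n:04d}` zero-pads the download index to four digits. A minimal sketch of how such a pattern expands (illustrative only; the `files` list and the use of `str.format` here are assumptions, not the tool's internals):

```python
# Hypothetical expansion of a rename pattern via Python's str.format:
# {n} is the 1-based download index, {name}/{ext} come from the original file.
import os

pattern = "{n:04d}.{ext}"
files = ["photo.jpg", "banner.png", "icon.webp"]  # example originals

for n, original in enumerate(files, start=1):
    name, ext = os.path.splitext(original)
    print(pattern.format(n=n, name=name, ext=ext.lstrip(".")))
# 0001.jpg
# 0002.png
# 0003.webp
```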
CLI Reference
```
imgx [OPTIONS] URL...
```
Output
| Flag | Default | Description |
|---|---|---|
| `-o, --output DIR` | `./images` | Output directory |
| `--list-only` | off | Only list URLs, don't download |
| `--list-format` | `plain` | Format: `plain`, `json`, `jsonl` (see the sketch below) |
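Because JSONL emits one JSON object per line, list output can be streamed into other tools without loading the whole list at once. A minimal consumer sketch (the field names inside each record are an assumption; check your version's actual output):

```python
# Consume `imgx URL --list-only --list-format jsonl > images.jsonl`.
# Each line parses independently, so this streams arbitrarily long lists.
import json

with open("images.jsonl", encoding="utf-8") as fh:
    for line in fh:
        if line.strip():
            record = json.loads(line)
            print(record)  # actual keys depend on the tool's output schema
```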
Filtering
| Flag | Default | Description |
|---|---|---|
| `--min-width PX` | 0 | Minimum image width (from HTML) |
| `--min-height PX` | 0 | Minimum image height (from HTML) |
| `--type EXT` | all | Only these types: `jpg,png,webp` |
| `--exclude-type EXT` | — | Exclude types: `gif,bmp` |
| `--include-svg` | off | Include SVG (excluded by default) |
| `--max-size MB` | 0 | Skip images larger than this |
Download
| Flag | Default | Description |
|---|---|---|
| `-w, --workers N` | 4 | Parallel download threads |
| `--overwrite` | off | Overwrite existing files |
| `--no-dedupe` | off | Disable content-hash deduplication (see the sketch below) |
| `--rename PATTERN` | — | Rename: `{n:04d}.{ext}`, `{name}_{n}.{ext}` |
| `--timeout SEC` | 30 | Download timeout per image |
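Content-hash deduplication means two different URLs serving byte-identical files yield a single download. A minimal sketch of the idea, assuming the hash is taken over the response bytes (an illustration, not image-harvest's actual code):

```python
# Keep one copy per unique SHA-256 digest, regardless of source URL.
import hashlib

seen = set()

def is_duplicate(data: bytes) -> bool:
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False

assert not is_duplicate(b"image-bytes")  # first copy: keep
assert is_duplicate(b"image-bytes")      # identical bytes: skip
```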
Crawling
| Flag | Default | Description |
|---|---|---|
| `--depth N` | 0 | Recursion depth (0 = single page) |
| `--max-pages N` | 50 | Max pages to crawl |
| `--rate-limit SEC` | 1.0 | Seconds between page requests |
| `--ignore-robots` | off | Ignore robots.txt (compliance sketched below) |
| `--proxy URL` | — | HTTP/HTTPS proxy |
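Polite crawling pairs a fixed delay between page requests with a robots.txt check before each fetch. A standard-library sketch of that behavior (illustrative, not the package's implementation; the user-agent string is an assumption):

```python
# Respect robots.txt and pause between page fetches.
import time
import urllib.robotparser

rate_limit = 1.0  # seconds between page requests, as with --rate-limit

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for page in ("https://example.com/", "https://example.com/gallery"):
    if not rp.can_fetch("imgx", page):
        continue  # disallowed unless --ignore-robots is set
    # ... fetch and parse the page here ...
    time.sleep(rate_limit)
```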
Python API
```python
from image_extractor import ImageExtractor
from image_extractor.downloader import ImageDownloader

# Discover images
ex = ImageExtractor(min_width=100, include_types=["jpg", "png"])
images = ex.from_url("https://example.com/gallery")
for img in images:
    print(f"{img.url} ({img.filename})")

# Download them
dl = ImageDownloader(output_dir="./photos", workers=8)
stats = dl.download(images)
print(f"Downloaded {stats['downloaded']} images")
```
Crawling API
```python
from image_extractor import ImageExtractor
from image_extractor.crawler import ImageCrawler

ex = ImageExtractor()
crawler = ImageCrawler(ex, rate_limit=1.0)
images = crawler.crawl("https://example.com", max_depth=2)
```
Image Discovery Sources
image-harvest looks for images in all these locations:
| Source | Example |
|---|---|
| `<img src>` | `<img src="/photo.jpg">` |
| `<img data-src>` | Lazy-loaded images |
| `<img srcset>` | Responsive images |
| `<picture><source>` | Art direction |
| `<a href>` | Links to image files |
| `<meta og:image>` | Open Graph images |
| CSS `background-image` | Inline style backgrounds |
Automatically excluded: tracking pixels (1x1), `data:` URIs, SVGs (unless `--include-svg`).
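With only `requests` and `beautifulsoup4` as dependencies, discovery amounts to walking these attributes in the parsed HTML. A rough sketch covering a few of the sources above (illustrative only; the actual extractor presumably handles more cases):

```python
# Collect candidate image URLs from <img src>, data-src, srcset, og:image.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "https://example.com/gallery"
soup = BeautifulSoup(requests.get(base, timeout=30).text, "html.parser")

urls = set()
for img in soup.find_all("img"):
    for attr in ("src", "data-src"):  # plain and lazy-loaded sources
        if img.get(attr):
            urls.add(urljoin(base, img[attr]))
    for candidate in (img.get("srcset") or "").split(","):
        parts = candidate.split()  # "photo-2x.jpg 2x" -> URL, descriptor
        if parts:
            urls.add(urljoin(base, parts[0]))
for meta in soup.find_all("meta", property="og:image"):  # Open Graph
    if meta.get("content"):
        urls.add(urljoin(base, meta["content"]))
```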
Comparison
| Feature | image-harvest | gallery-dl | google-images-download | icrawler |
|---|---|---|---|---|
| Zero config | Yes | No | Yes | No |
| Any web page | Yes | Site-specific | Google only | Search engines |
| Parallel downloads | Yes | No | No | Yes |
| Content deduplication | Yes | No | No | No |
| Resume support | Yes | Yes | No | No |
| Size/type filtering | Yes | Complex config | Basic | Basic |
| Lazy-load detection | Yes | N/A | No | No |
| srcset/picture | Yes | N/A | No | No |
| og:image + CSS bg | Yes | No | No | No |
| Recursive crawling | Yes | Per-site | No | No |
| Custom renaming | Yes | Yes | No | No |
| List-only mode | Yes | No | No | No |
| robots.txt | Yes | No | No | No |
| pip install | Yes | Yes | Yes | Yes |
| Dependencies | 2 | Many | Many | Several |
Also by Thunderbit
- email-harvest — Extract email addresses from text, files, and web pages
- phone-harvest — Extract and identify phone numbers from text, files, and web pages
License
Built by Thunderbit — AI-powered web scraper and data extraction tools.