Tool for extracting and saving specific images from websites.
Project description
Webpage Image Downloader
Webpage image downloader (wid) is a python package for finding and saving images from webpages. It uses Selenium's Chrome webdriver to scrape image elements from web pages and extracting their source URLs. The images are downloaded using Python urllib.request
or requests
packages.
Instalation
You can use pip to install the package wid.
pip install webpage-image-downloader
Executables
The webpage-image-downloader
(wid
) package includes two executables wid-downloader
and wid-crawler
.
Usage: wid-downloader [OPTIONS]
Python script for extracting and saving images from websites.
Options:
-u, --url TEXT Url of the website containing desired images.
Contents of the clipboard are used if none is
provided.
-t, --target-dir TEXT Target directory used to store images.
-r, --img-regex TEXT Regex for finding specific subset of images on the
website.
-i, --img-info Option to find and print all image URLs on the
website.
-p, --page-source Option to get the source code of target website and
print it (save it).
--help Show this message and exit.
Usage: wid-crawler [OPTIONS]
Python web crawler for extracting and saving images from websites.
Options:
-u, --url TEXT Url of the starting website for the web crawler.
Contents of the clipboard are used if none is
provided.
-i, --instructions TEXT Implementation of Instructions abstract class used
by the WebCrawler.
-t, --target-dir TEXT Target directory used to store images.
--help Show this message and exit.
Examples wid-downloader
Look up all the image elements on a web page using wid-downloader
. Some of the elements might not be explicitly visible.
wid-downloader -u https://www.duckduckgo.com -i
Images found:
https://www.duckduckgo.com/assets/icons/header/twitter.svg
https://www.duckduckgo.com/assets/icons/header/reddit.svg
https://www.duckduckgo.com/assets/icons/header/blog.svg
https://www.duckduckgo.com/assets/icons/header/newsletter.svg
https://duckduckgo.com/assets/add-to-browser/cppm/laptop.svg
https://duckduckgo.com/assets/home/landing/icons/search.svg
https://duckduckgo.com/assets/add-to-browser/cppm/mobile.svg
https://duckduckgo.com/assets/onboarding/arrow.svg
https://www.duckduckgo.com/assets/onboarding/bathroomguy/1-monster-v2--pre-animation.svg
https://duckduckgo.com/assets/onboarding/bathroomguy/2-ghost-v2.svg
https://duckduckgo.com/assets/onboarding/bathroomguy/3-bathtub-v2--no-animation.svg
https://duckduckgo.com/assets/onboarding/bathroomguy/4-alpinist-v2.svg
Done.
Download a subset of desired images using wid-downloader
by specifying a regular expression to filter through the list of found image elements. The regular expression is matched against the source URL of the elements. The images are downloaded into the target directory.
wid-downloader -u https://www.duckduckgo.com -t wid-images -r '.*(header).*'
Downloading image from 'https://www.duckduckgo.com/assets/icons/header/twitter.svg'.
Downloading image from 'https://www.duckduckgo.com/assets/icons/header/reddit.svg'.
Downloading image from 'https://www.duckduckgo.com/assets/icons/header/blog.svg'.
Downloading image from 'https://www.duckduckgo.com/assets/icons/header/newsletter.svg'.
Done.
If you want to explicitely look through the web page source code without opening up a browser you can use the wid-downloader
to save the source code to a file.
wid-downloader -u https://www.duckduckgo.com -t wid-page-source -p
Examples wid-crawler
The wid-crawler
can be to navigate through a series of web pages, as well as find and select specific image elements on those web pages that are to be downloaded and saved locally.
Webcrawler Instructions
If you want to use the wid-crawler
to find and download images it's necessary for you to implement the abstract class Instructions
from wid.web.bot.instructions
. Depending on the web pages you intend to scrape, it might be needed to first implement a way for the webcrawler to bypass site's verifiaction/validation. Otherwise, it is required to provide a way to find new URLs to visit from the starting_url
and a way to find desired image elements on visited web pages.
An example of wid-crawler
instructions can be found in examples/test_instructions.py
:
class WebCrawlerInstructions(Instructions):
def __init__(self) -> None:
super().__init__()
def validate(self, webdriver: WebDriver, url: Url) -> bool:
pass
def next_step(self, webdriver: WebDriver) -> List[Url]:
try:
xpath = "//a[@class=\"btn next_page\"]"
element = webdriver.find_element_by_xpath(xpath)
return [Url(element.get_attribute('href'))]
except NoSuchElementException:
return []
def find_image_elements(self, webdriver: WebDriver) -> List[WebElement]:
try:
xpath = '//div[@class="reading-content"]/div[@class="page-break no-gaps"]/img'
elements = webdriver.find_elements_by_xpath(xpath)
return elements
except NoSuchElementException:
return []
__InstructionClass__ = WebCrawlerInstructions
Running webcrawler
wid-crawler -u https://www.mangaread.org/manga/one-punch-man-onepunchman/chapter-218-chapter-160/ -i examples/example_instructions.py
Parsing web page 'https://www.mangaread.org/manga/one-punch-man-onepunchman/chapter-218-chapter-160/'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/2.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/3.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/4.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/5.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/6.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/7.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/8.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/9.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/10.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/11.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/12.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/13.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/14.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/15.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/16.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/17.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/18.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/19.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/20.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/21.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/22.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/23.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/24.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/25.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/26.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/27.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/7b34c97392e0e7ea750b5663abed3f7a/28.jpeg'.
Parsing web page 'https://www.mangaread.org/manga/one-punch-man-onepunchman/chapter-219-chapter-161/'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/2.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/4.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/5.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/6.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/7.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/8.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/9.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/10.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/11.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/12.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/13.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/14.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/15.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/16.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/17.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/18.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/19.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/20.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/21.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/22.jpeg'.
Downloading image from 'https://www.mangaread.org/wp-content/uploads/WP-manga/data/manga_5db92303ed13e/572aee0c077fcf50c79b8ac758b19e9a/23.jpeg'.
Done.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file webpage-image-downloader-0.1.2.tar.gz
.
File metadata
- Download URL: webpage-image-downloader-0.1.2.tar.gz
- Upload date:
- Size: 12.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.3
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2cd7175c99062d18b161bcbd4ef962ac0f465331e929a597c049dea568004800 |
|
MD5 | bd30c9231c30136f9cb4549b14bbd134 |
|
BLAKE2b-256 | 1caece52bb1f9d1374a8691b15d37c2574dbf839346dbf7e5f64feaff089fbd3 |
File details
Details for the file webpage_image_downloader-0.1.2-py3-none-any.whl
.
File metadata
- Download URL: webpage_image_downloader-0.1.2-py3-none-any.whl
- Upload date:
- Size: 16.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.3
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | bbaf0108f39ebda52e125263a71d4268af644aa049635c67c15e327fdaff467b |
|
MD5 | 83a141c6c7126ee1fdafbcae0ddf266d |
|
BLAKE2b-256 | 594259baba3eb7dbf3b838fa612a55de1bb88d49d2508841df880eb4f7a27bdd |