
Package for downloading images from google


Python-Image-Fetcher (image_fetcher)

https://github.com/BenAAndrew/Python-Image-Fetcher

A simple, lightweight library to download images from Google.
It was originally based on https://github.com/hardikvasa/google-images-download by hardikvasa, but with a few major changes:

  • Speed: In the tests outlined in the 'Performance Considerations' section, this library was over 6x faster.
  • Simplification: The code has been simplified to make it easier to understand and extend.
  • Reduced download duplication: Files are named after the URL the image was downloaded from, so the same file is not downloaded again in the future. This was a significant drawback of google_images_download: whenever you downloaded images again, it would re-download ones that already existed, making it slower.
  • Multithreading: You can run multiple Google image downloads simultaneously, massively increasing throughput when downloading a large selection of images.
  • Extended browser support: Added Firefox support, with further configurations to come.
  • Progress bar: Added a tqdm progress bar to track how your download is getting on.
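
The duplicate-avoidance idea in the list above can be illustrated with a small sketch. This is not the library's actual code: the helper names filename_for_url and needs_download, and the choice of a SHA-256 hash of the URL as the file name, are assumptions made purely for illustration.

```python
import hashlib
import os
import tempfile


def filename_for_url(url, extension="jpg"):
    # Hypothetical naming scheme: hash the URL so the same URL always maps
    # to the same file name (the library's real scheme may differ).
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{digest}.{extension}"


def needs_download(url, directory, extension="jpg"):
    # An image only needs downloading if no file for its URL exists yet.
    target = os.path.join(directory, filename_for_url(url, extension))
    return not os.path.exists(target)


with tempfile.TemporaryDirectory() as folder:
    url = "https://example.com/cat.jpg"
    first_check = needs_download(url, folder)   # nothing saved yet
    # Pretend the image was downloaded by writing an empty file.
    open(os.path.join(folder, filename_for_url(url)), "wb").close()
    second_check = needs_download(url, folder)  # file exists, so skip it
```

Because the file name depends only on the URL, a second run over the same search results can skip every image that is already on disk.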

Table of Contents

Install
Multi-Thread Multi-Search example
Multi-Thread Single-Search example
Single-Thread Single-Search example
Browser
Performance Considerations
How to optimise
Other examples

Install

pip install image-fetcher

Then download the driver for your browser (and OS) of choice:

  • Chrome: https://chromedriver.chromium.org/downloads (download the correct driver for your version of Chrome)
  • Firefox: https://github.com/mozilla/geckodriver/releases

Multi-Thread Examples

Multiple search terms

Quick Start:

from image_fetcher.multithread_image_fetching import concurrent_image_search
from image_fetcher.browsers import Browser, BrowserType

concurrent_image_search(
    search_terms=['cat','dog'], 
    max_similtanous_threads=2,
    max_image_fetching_threads=20,
    image_download_timeout=5,
    total_images=200, 
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'},
    browser=Browser(BrowserType.CHROME, 'chromedriver.exe')
)

Key arguments:

Optional Arguments:

  • chromedriver_path: Path to chromedriver (default is chromedriver.exe in the current directory)
  • extensions: List of acceptable file extensions (default is jpg & png)
  • directories: Names of the folders to save images to (default is the same names as the search_terms)
  • progress_bar: Whether to display a progress bar (default is True)
  • verbose: Whether to print total downloaded & total ignored at the end (default is True)

Single search term

Quick Start:

from image_fetcher.multithread_image_fetching import concurrent_images_download
from image_fetcher.browsers import Browser, BrowserType

concurrent_images_download(
    search_term='cat', 
    max_image_fetching_threads=20,
    image_download_timeout=5,
    total_images=200, 
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'},
    browser=Browser(BrowserType.CHROME, 'chromedriver.exe')
)

All arguments are the same as above, with two exceptions: search_terms is replaced with search_term, as this function only accepts a single term, and there is no max_similtanous_threads argument, as only one Google image search is performed.

Single-Thread Examples

For the performance reasons outlined later, I would recommend using multi-threading. However, if you choose not to, this is how you would implement single-thread execution.

Quick Start:

from image_fetcher.image_fetcher import download_images
from image_fetcher.browsers import Browser, BrowserType

download_images(
    search_term='Dog',
    total_images=10,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'},
    browser=Browser(BrowserType.CHROME, 'chromedriver.exe')
)

Key arguments:

Optional Arguments:

  • chromedriver_path: Path to chromedriver (default is chromedriver.exe in the current directory)
  • extensions: List of acceptable file extensions (default is jpg & png)
  • directory: Name of the folder to save images to (default is the same name as the search_term)
  • progress_bar: Whether to display a progress bar (default is True)
  • verbose: Whether to print total downloaded & total ignored at the end (default is True)

Browser

The Browser object lets you easily connect to your browser of choice. It takes two arguments:
  • browser_type: The BrowserType you want (also imported from image_fetcher.browsers)
  • driver: The relative path to the browser driver executable
Currently two browsers are supported: Chrome & Firefox.

Chrome

Download the driver at https://chromedriver.chromium.org/downloads

browser = Browser(BrowserType.CHROME, 'chromedriver.exe')

Firefox

Download the driver at https://github.com/mozilla/geckodriver/releases

browser = Browser(BrowserType.FIREFOX, 'geckodriver.exe')

In both these cases the driver is an exe in the same directory. Change or remove the extension depending on your driver type. For a different directory, prepend the path, e.g. ../chromedriver.exe would look in the directory above.
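
As a rough sketch of handling the driver extension per OS, the snippet below builds a driver path with the Python standard library. The helper name driver_path is hypothetical and not part of image_fetcher; it assumes the Windows driver builds end in .exe while the macOS and Linux builds have no extension.

```python
import platform
from pathlib import Path


def driver_path(name, directory="."):
    # Hypothetical helper (not part of image_fetcher): Windows driver
    # binaries end in ".exe", other platforms use no extension.
    suffix = ".exe" if platform.system() == "Windows" else ""
    return str(Path(directory) / f"{name}{suffix}")


chrome_driver = driver_path("chromedriver")        # current directory
firefox_driver = driver_path("geckodriver", "..")  # directory above
```

The resulting string could then be passed as the second argument to Browser, for example Browser(BrowserType.CHROME, chrome_driver).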

Performance Considerations

Time in seconds to perform various image fetching tasks:

Task | concurrent_image_search | concurrent_images_download | download_images | google-images-download by hardikvasa
Download 200 cat pictures | 23.6 | 22.4 | 92.7 | 148.4
Download 200 cat & dog pictures | 28.7 | 47.7 | 254.2 | 330.4

All tests were run with the following config:
  • total_images=200
  • headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
  • progress_bar=False
  • verbose=True
Both concurrent_image_search and concurrent_images_download were run with:
  • max_image_fetching_threads=20
  • image_download_timeout=3
concurrent_image_search was also run with max_similtanous_threads=2.

google-images-download was run with the following config: arguments = {"keywords":"cat", "limit":200, "chromedriver": "chromedriver.exe", "format": "jpg", "print_urls":False}

Explanation:
Understandably, in all cases concurrent processing beat single-thread execution because it can download multiple images simultaneously. concurrent_image_search goes one step further with multiple search terms by running the searches simultaneously, whereas the other two must run them one after the other. Interestingly, for a single search term concurrent_image_search is slightly slower than concurrent_images_download, even though the first actually uses the second when executing. This delay is likely because concurrent_image_search must first allocate the call to a thread handler, whereas concurrent_images_download starts immediately.
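
To put the timings in perspective, the short calculation below turns them into speed-up factors relative to google-images-download. The numbers are copied from the timings reported in this section; the dictionary layout and the speedup helper are only for illustration.

```python
# Timings in seconds, copied from the figures reported in this section.
timings = {
    "Download 200 cat pictures": {
        "concurrent_image_search": 23.6,
        "concurrent_images_download": 22.4,
        "download_images": 92.7,
        "google-images-download": 148.4,
    },
    "Download 200 cat & dog pictures": {
        "concurrent_image_search": 28.7,
        "concurrent_images_download": 47.7,
        "download_images": 254.2,
        "google-images-download": 330.4,
    },
}


def speedup(task, function, baseline="google-images-download"):
    # How many times faster `function` was than the baseline for `task`.
    return timings[task][baseline] / timings[task][function]


for task, results in timings.items():
    for function in results:
        print(f"{task}: {function} is {speedup(task, function):.1f}x the baseline speed")
```

For a single search term the two concurrent functions come out at roughly 6x the baseline, and with two search terms concurrent_image_search pulls ahead to roughly 11x.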

How can you optimise performance?

Adjusting the following values will help improve your download speeds. Bear in mind, however, that pushing these values too high may put excessive strain on low-performance machines. Adjust them at your own discretion.
  • max_image_fetching_threads: How many image fetching processes can run simultaneously. Increasing this value typically improves performance, but there is a tradeoff: with too many threads, allocating them may take longer than the time saved. In my tests at 200 images, I've found 20 to be roughly ideal, but play about with it and let me know what you find.
  • image_download_timeout: How many seconds to wait on an image download before abandoning it. Decreasing this value typically improves performance because slower downloads are ignored. However, if you set it too low, too many images will be ignored and more URLs will need to be fetched, which is time consuming and slows performance. I've found most standard-quality images download within 1-2 seconds, so I typically use 3 for this value.
  • max_similtanous_threads (concurrent_image_search only): How many image search processes can run simultaneously. This is what makes this function more efficient for several searches at the same time (e.g. dogs & cats). For best performance, this value should equal the number of search terms you're using.
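
The sketch below shows how the three settings above could interact, using only the standard library. It is not the library's implementation: the latencies are made-up numbers, no real downloads happen, and the function names fetch_image, download_term and search_terms_concurrently are hypothetical. Only the three parameter names are taken from this README.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up download latencies in seconds, one entry per simulated image.
FAKE_LATENCIES = {
    "cat": [0.5, 1.2, 4.0, 0.8],
    "dog": [2.5, 0.3, 6.0],
}


def fetch_image(latency, timeout):
    # Stand-in for a real HTTP download: it succeeds only if the simulated
    # latency fits within the timeout, otherwise the image is ignored.
    return latency <= timeout


def download_term(term, max_image_fetching_threads, image_download_timeout):
    # Inner pool: several images for one search term are fetched at once.
    with ThreadPoolExecutor(max_workers=max_image_fetching_threads) as pool:
        results = list(pool.map(
            lambda latency: fetch_image(latency, image_download_timeout),
            FAKE_LATENCIES[term],
        ))
    downloaded = sum(results)
    return term, downloaded, len(results) - downloaded


def search_terms_concurrently(terms, max_similtanous_threads, **settings):
    # Outer pool: ideally one worker per search term.
    with ThreadPoolExecutor(max_workers=max_similtanous_threads) as pool:
        futures = [pool.submit(download_term, term, **settings) for term in terms]
        return [future.result() for future in futures]


# Each entry is (search term, images downloaded, images ignored).
summary = search_terms_concurrently(
    ["cat", "dog"],
    max_similtanous_threads=2,
    max_image_fetching_threads=20,
    image_download_timeout=3,
)
```

With a timeout of 3, one simulated image per term is ignored. Lowering the timeout to 1 ignores two per term, which in a real run would mean fetching more URLs to reach the requested total.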

Why would I ever use single thread over multi? Simply put, it's marginally more reliable. Executing multiple threads increases complexity and therefore slightly increases the risk of something going wrong. However, in the vast majority of my tests I've had no thread-related issues, so I wouldn't be concerned about this; just treat single thread as a backup/alternative.

Other examples

...(
        ...
        chromedriver_path='../chromedriver.exe'
    )

Would look for the chromedriver in the directory above the one in which you are executing the method.

...(
        search_term='Duck',
        ...
        extensions=['png'],
        directory='My duck photos'
    )

Would download images using your chosen function (concurrent_images_download or download_images) from the search 'Duck' to a folder called 'My duck photos', keeping only files of type 'png'.

...(
        search_term='Ninja', 
        ...
        progress_bar=False,
        verbose=False
    )

Would download images using your chosen function (concurrent_images_download or download_images) from the search 'Ninja' and hide the progress bar and summary text.

