DatasetScraper
Tool to create image datasets for machine learning problems by scraping search engines like Google, Bing and Baidu.
Features:
- Search engine support: Google, Bing, Baidu. In development: Yahoo, Yandex, DuckDuckGo
- Image format support: jpg, png, svg, gif, jpeg
- Fast, multiprocessing-enabled scraper
- Very fast multithreaded downloader
- Post-download verification that each downloaded file is a valid image
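The post-download verification listed above could be approximated by checking each file's leading bytes against known image signatures. The sketch below is an illustration only, not DatasetScraper's actual implementation: `sniff_image_type` and `remove_invalid_images` are hypothetical helper names, and the SVG check is a crude heuristic because SVG is plain XML text with no fixed signature.

```python
from pathlib import Path

# Leading byte signatures for the raster formats in the feature list.
# SVG has no fixed signature, so it is handled separately below.
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(path):
    """Return 'jpeg', 'png', 'gif', 'svg', or None if the file is not recognised."""
    with open(path, "rb") as f:
        head = f.read(512)
    for magic, kind in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return kind
    # Crude SVG check (assumption): an <svg tag appears near the start of the file.
    if b"<svg" in head.lower():
        return "svg"
    return None


def remove_invalid_images(directory):
    """Delete files in `directory` that fail the check; return the removed names."""
    removed = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and sniff_image_type(path) is None:
            path.unlink()
            removed.append(path.name)
    return removed
```

A signature check only confirms that a file starts like an image; it does not detect truncated or corrupt image data, which would require actually decoding the file.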
Installation
- Coming soon on PyPI
Usage:
- Import
from datasetscraper import Scraper
- Defaults
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic')
obj.download(urls, directory='kiniro_mosaic/')
- Specify a search engine
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google'])
obj.download(urls, directory='kiniro_mosaic/')
- Specify a list of search engines
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'])
obj.download(urls, directory='kiniro_mosaic/')
- Specify the maximum number of images per engine (default is 200)
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'], maxlist=[500, 300])
obj.download(urls, directory='kiniro_mosaic/')
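For a classification dataset you usually want one folder per class. The sketch below wraps the two calls shown above in a loop over several queries. `build_dataset` and its parameters are hypothetical names introduced for illustration, and it assumes that `maxlist` takes one limit per entry in `engine`, in the same order, as the last example suggests. It only uses `fetch_urls` and `download` with the keyword arguments that appear in this README.

```python
from pathlib import Path


def build_dataset(scraper, queries, root, engines=("google", "bing"), per_engine=200):
    """Fetch and download images for each query into root/<query_with_underscores>/.

    `scraper` is any object exposing fetch_urls() and download() with the
    keyword arguments used in the usage examples (e.g. a Scraper instance).
    """
    folders = {}
    for query in queries:
        # Turn 'kiniro mosaic' into 'kiniro_mosaic', mirroring the examples.
        folder = str(Path(root) / query.strip().replace(" ", "_")) + "/"
        urls = scraper.fetch_urls(
            query,
            engine=list(engines),
            maxlist=[per_engine] * len(engines),  # assumption: one limit per engine
        )
        scraper.download(urls, directory=folder)
        folders[query] = folder
    return folders


# Intended use (requires the datasetscraper package to be installed):
#   from datasetscraper import Scraper
#   build_dataset(Scraper(), ["kiniro mosaic", "yuru camp"], "anime")
```

Passing the scraper in as an argument keeps the helper independent of how the `Scraper` object is configured.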
FAQs
- Why aren't Yandex, Yahoo, DuckDuckGo and other search engines supported? They are hard to scrape. I am working on them and will update as soon as I can.
- I set maxlist=[500]. Why were fewer than 500 images downloaded? There can be several reasons for this:
  - Search ran out: This happens very often. Google/Bing might not have enough images for your query.
  - Slow internet: Increase the timeout (default is 60 seconds) as follows:
obj.download(urls, directory='kiniro_mosaic/', timeout=100)
- How to debug? You can change the logging level when creating the scraper object:
obj = Scraper(logger.INFO)
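To see how far a download fell short of the requested `maxlist`, as in the second question above, you can count the image files that ended up in the target directory. This is a minimal sketch, not part of DatasetScraper: `count_images` and `shortfall` are hypothetical helpers, and it assumes the downloaded files are saved with one of the supported image extensions.

```python
from pathlib import Path

# Extensions taken from the supported-format list in this README.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".svg", ".gif"}


def count_images(directory):
    """Count files in `directory` whose extension is a supported image format."""
    return sum(
        1
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def shortfall(directory, requested):
    """Return how many images are missing relative to `requested` (never negative)."""
    return max(0, requested - count_images(directory))


# Intended use after obj.download(urls, directory='kiniro_mosaic/'):
#   print(shortfall('kiniro_mosaic/', 500))
```

If the shortfall stays the same after raising the timeout, the search engines most likely ran out of results for the query.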
TODO:
- More search engines
- Better debug
- Write documentation
- Text data? Audio data?
Hashes for datasetscraper-0.0.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 4819ae12d72c5f358d6bd753b5203cb2689ba4e796799dad535e482d36f42b61
MD5 | 6e1af8303f541d2b69e19dd7be9728e6
BLAKE2b-256 | 76325b4d10c3e5fdd37fb62e18ff000ec8f857dcff91e0efd4594e0e8e0275ce