DatasetScraper

Tool to create image datasets for machine learning problems by scraping search engines like Google, Bing and Baidu.

Features:

  • Search engine support: Google, Bing, Baidu. In progress: Yahoo, Yandex, DuckDuckGo
  • Image format support: jpg, png, svg, gif, jpeg
  • Fast, multiprocessing-enabled scraper
  • Very fast multithreaded downloader
  • Post-download verification that the downloaded files are valid images
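
The post-download verification step can be pictured with a small standard-library sketch. This is not DatasetScraper's own implementation, which is not shown here; `sniff_image_format` and `find_invalid_images` are hypothetical helper names, and the check only looks at the leading bytes of each file (plus a crude `<svg` search for SVG files).

```python
from pathlib import Path

# Leading byte signatures for the raster formats listed above.
# SVG is text-based XML, so it is checked separately below.
MAGIC_BYTES = {
    "jpeg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def sniff_image_format(data: bytes):
    """Return a format name if the bytes look like a supported image, else None."""
    for name, signatures in MAGIC_BYTES.items():
        if data.startswith(signatures):
            return name
    # Crude SVG check: look for an <svg tag near the start of the file.
    if b"<svg" in data[:1024].lower():
        return "svg"
    return None


def find_invalid_images(directory):
    """Return the files in `directory` that do not look like images (hypothetical helper)."""
    invalid = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open("rb") as handle:
                header = handle.read(1024)
            if sniff_image_format(header) is None:
                invalid.append(path)
    return invalid
```

A signature check like this catches HTML error pages saved with an image extension, but it does not prove the file decodes correctly; a stricter check would open each file with an image library.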

Installation

  • pip install datasetscraper

Usage:

  • Import: from datasetscraper import Scraper

  • Defaults

obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic')
obj.download(urls, directory='kiniro_mosaic/')

  • Specify a search engine

obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google'])
obj.download(urls, directory='kiniro_mosaic/')

  • Specify a list of search engines

obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'])
obj.download(urls, directory='kiniro_mosaic/')

  • Specify max images (default is 200)

obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'], maxlist=[500, 300])
obj.download(urls, directory='kiniro_mosaic/')
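
For a dataset with several classes, the calls above can be wrapped in a loop, one folder per query. The sketch below is only an illustration: `build_dataset` is a hypothetical helper, not part of DatasetScraper, and it assumes that `maxlist` takes one limit per entry in `engine`, as the last example suggests. The scraper object is passed in, so with the real library you would call it as build_dataset(Scraper(), ['kiniro mosaic']).

```python
from pathlib import Path


def build_dataset(scraper, queries, root="dataset",
                  engines=("google", "bing"), max_per_engine=200):
    """Download one folder of images per query and return {query: folder}.

    `scraper` is any object exposing fetch_urls(...) and download(...)
    with the keyword arguments used in the usage examples above.
    """
    folders = {}
    for query in queries:
        # Turn 'kiniro mosaic' into '<root>/kiniro_mosaic/', like the examples.
        folder = str(Path(root) / query.strip().replace(" ", "_")) + "/"
        urls = scraper.fetch_urls(
            query,
            engine=list(engines),
            # Assumption: one limit per engine, in the same order.
            maxlist=[max_per_engine] * len(engines),
        )
        scraper.download(urls, directory=folder)
        folders[query] = folder
    return folders
```

Keeping each query in its own folder means the folder name can double as the class label when the images are loaded for training.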

FAQs

  • Why aren't Yandex, Yahoo, DuckDuckGo and other search engines supported? They are hard to scrape. I am working on them and will add support as soon as I can.

  • I set maxlist=[500]. Why are fewer than 500 images downloaded? There can be several reasons:

    • Search ran out: this happens very often. Google or Bing might not have enough images for your query.
    • Slow internet: increase the timeout (default is 60 seconds) as follows: obj.download(urls, directory='kiniro_mosaic/', timeout=100)

  • How do I debug? You can change the logging level when creating the scraper object: obj = Scraper(logger.INFO)
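
To see why a timeout lowers the final image count, here is a minimal sketch of a threaded downloader in which every URL that times out or fails is simply left out of the results. It is not DatasetScraper's downloader; `fetch_bytes` and `download_all` are hypothetical names, and only the standard library is used.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_bytes(url, timeout):
    """Fetch one URL; blocking network operations give up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, timeout=60, workers=8, fetch=fetch_bytes):
    """Fetch every URL in a thread pool; return (results, failures)."""
    results, failures = {}, {}

    def attempt(url):
        try:
            return url, fetch(url, timeout), None
        except Exception as error:  # timeouts, HTTP errors, dead links, ...
            return url, None, error

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields outcomes in the same order as `urls`.
        for url, data, error in pool.map(attempt, urls):
            if error is None:
                results[url] = data
            else:
                failures[url] = error
    return results, failures
```

Inspecting the failures mapping shows whether missing images come from timeouts (raise the timeout) or from dead links (nothing to fix on your side).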

TODO:

  • More search engines
  • Better debugging
  • Write documentation
  • Text data? Audio data?
