DatasetScraper
Tool to create image datasets for machine learning problems by scraping search engines like Google, Bing and Baidu.
Features:
- Search engine support: Google, Bing, Baidu (in progress: Yahoo, Yandex, DuckDuckGo)
- Image format support: jpg, png, svg, gif, jpeg
- Fast multiprocessing enabled scraper
- Very fast multithreaded downloader
- Data verification after download to ensure every file is a valid image
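The verification step can be pictured as a magic-byte check on each downloaded file. Below is a minimal, stdlib-only sketch of that idea; it is illustrative only, and DatasetScraper's actual verification may work differently (`detect_format` and `verify_directory` are hypothetical names, not part of the package's API):

```python
import os

# Magic-byte signatures for some of the supported formats.
# (Illustrative subset -- svg, for example, is text-based and needs a different check.)
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

def detect_format(path):
    """Return the detected image format, or None if the file is not a known image."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, fmt in SIGNATURES.items():
        if head.startswith(magic):
            return fmt
    return None

def verify_directory(directory):
    """Delete files that fail verification; return the names of the files kept."""
    kept = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if detect_format(path) is None:
            os.remove(path)  # drop truncated or non-image downloads
        else:
            kept.append(name)
    return kept
```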
Installation
- Coming soon on PyPI
Usage:
- Import
from datasetscraper import Scraper
- Defaults
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic')
obj.download(urls, directory='kiniro_mosaic/')
- Specify a search engine
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google'])
obj.download(urls, directory='kiniro_mosaic/')
- Specify a list of search engines
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'])
obj.download(urls, directory='kiniro_mosaic/')
- Specify the maximum number of images per engine (default is 200)
obj = Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'], maxlist=[500, 300])
obj.download(urls, directory='kiniro_mosaic/')
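The `engine` and `maxlist` arguments are parallel lists: the limit at each position applies to the engine at the same position. A hypothetical sketch of that pairing (`pair_limits` is not part of datasetscraper's API, and the 200-per-engine fallback is an assumption based on the stated default):

```python
def pair_limits(engines, maxlist, default=200):
    """Pair each engine with its image cap.

    Hypothetical helper -- illustrates the engine/maxlist pairing only.
    Engines without an explicit cap fall back to the 200-image default.
    """
    caps = list(maxlist) + [default] * (len(engines) - len(maxlist))
    return dict(zip(engines, caps))

print(pair_limits(["google", "bing"], [500, 300]))
# {'google': 500, 'bing': 300}
```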
FAQs
- Why aren't Yandex, Yahoo, DuckDuckGo, and other search engines supported? They are harder to scrape; support is in progress and will be added as soon as possible.
- I set maxlist=[500]; why were fewer than 500 images downloaded? There can be several reasons:
- Search ran out: this happens often; Google/Bing may not have enough images for your query
- Slow internet: increase the timeout (default is 60 seconds) as follows:
obj.download(urls, directory='kiniro_mosaic/', timeout=100)
- How do I debug? You can change the logging level when constructing the scraper object:
obj = Scraper(logging.INFO)
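The level passed in is presumably one of the constants from Python's standard logging module (an assumption; check the package source for the exact accepted values). Each constant is an integer, and a logger only emits records at or above its configured level:

```python
import logging

# Standard severity constants from the stdlib logging module.
print(logging.DEBUG, logging.INFO, logging.WARNING)  # 10 20 30

# A logger set to INFO drops DEBUG records but keeps INFO and above.
demo = logging.getLogger("datasetscraper-demo")
demo.setLevel(logging.INFO)
print(demo.isEnabledFor(logging.DEBUG))  # False
print(demo.isEnabledFor(logging.INFO))   # True
```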
TODO:
- More search engines
- Better debugging output
- Write documentation
- Text data? Audio data?
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file datasetscraper-0.0.4.tar.gz.
File metadata
- Download URL: datasetscraper-0.0.4.tar.gz
- Upload date:
- Size: 5.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.19.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/3.6.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2487404f8454cdef44d32309c62cc3a035b0d0d7a5fea4744881d66dbf060437 |
| MD5 | 5e2b2caeb2a770518776c0cc9683e686 |
| BLAKE2b-256 | aa3fff3744248ae93b2724e7d210bb95fa0a391c6c81b50db831a968a7a6e009 |
File details
Details for the file datasetscraper-0.0.4-py3-none-any.whl.
File metadata
- Download URL: datasetscraper-0.0.4-py3-none-any.whl
- Upload date:
- Size: 13.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.19.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/3.6.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4819ae12d72c5f358d6bd753b5203cb2689ba4e796799dad535e482d36f42b61 |
| MD5 | 6e1af8303f541d2b69e19dd7be9728e6 |
| BLAKE2b-256 | 76325b4d10c3e5fdd37fb62e18ff000ec8f857dcff91e0efd4594e0e8e0275ce |