image-caption-scraper is a Python tool for downloading images and captions from image search engines.
image-caption-scraper
About
This package downloads images and their corresponding captions from web search engines such as Google Images, Yahoo Images, and Flickr, with more to come.
The package is aimed at researchers who want to build datasets around their own concepts and use them to train machine learning models for computer vision, natural language processing, and the intersection of both (image captioning, visual relationship detection).
The pipeline is built entirely on Selenium WebDriver.
Installation
pip install image-caption-scraper
Requirements
Python 3.6 or later with all requirements.txt dependencies installed. To install run:
pip install -r requirements.txt
Also, download ChromeDriver from https://chromedriver.chromium.org/ and either add it to the system PATH or pass the full path to the executable (the .exe file on Windows) through the driver option described below.
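Before starting a scrape, it can help to confirm that the driver is actually discoverable. The following is a minimal sketch using only the Python standard library; `resolve_driver` is a hypothetical helper written for illustration and is not part of image-caption-scraper.

```python
import os
import shutil
from typing import Optional


def resolve_driver(driver: str = "chromedriver") -> Optional[str]:
    """Return a usable path for the driver, or None if it cannot be found.

    Hypothetical helper: it only checks the file system and the PATH,
    it does not verify that the file is a working ChromeDriver binary.
    """
    # An explicit path to an existing file is used as-is.
    if os.path.isfile(driver):
        return os.path.abspath(driver)
    # Otherwise look the name up on the system PATH.
    return shutil.which(driver)


if __name__ == "__main__":
    path = resolve_driver("chromedriver")
    if path is None:
        print("chromedriver not found: add it to PATH or pass its full path")
    else:
        print(f"Using driver at {path}")
```

If the helper returns a path, that value could be passed as the driver argument shown in the usage example.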
Usage
from image_caption_scraper import Image_Caption_Scraper
scraper = Image_Caption_Scraper(
    engine="all",  # or "google", "yahoo", "flickr"
    num_images=100,
    query="dog chases cat",
    out_dir="images",
    headless=True,
    driver="chromedriver",
    expand=False,
    k=3
)
scraper.scrape(save_images=True)
Options
Argument | Description | Options |
---|---|---|
engine | Search engine to scrape images from. By default it searches through all the available engines (currently supports Google, Flickr, Yahoo). | "all","google","flickr","yahoo" |
num_images | Number of images targeted by the user | Any number (int) > 0 |
query | The text query to search for | Any text query |
out_dir | Output directory in which to save the images and captions | Any text string |
headless | Whether to hide the browser while crawling the web pages. True hides it, False opens it | True or False |
expand | Whether to expand the input query. Expansion currently supports synonyms from WordNet. Translations are coming soon. | True or False |
driver | The web driver to navigate the web pages. Download the driver from https://chromedriver.chromium.org/ | Default='chromedriver' (configured in System Path). Otherwise just type the path to the .exe file |
k | If expand==True, k determines how many synonyms to fetch from WordNet for each word in the query. Words are assumed to be separated by spaces. The scraper fetches the k closest synonyms for each word, ranked by path_similarity in the WordNet graph. | Any number (int) > 0 |
save_images | Passed to scrape(). If True, the scraper saves the images and captions in the out_dir folder. Otherwise it saves only the metadata (with URLs), without the images. | True or False |
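To illustrate what ranking by path similarity means for the k option, here is a minimal sketch on a toy graph. It assumes the usual WordNet definition of path similarity, 1 / (1 + shortest path length). `TOY_GRAPH`, `top_k_related`, and `expand_query` are hypothetical names made up for this example; the toy graph links related words rather than true synonyms, and how the package itself queries WordNet is not shown in this README.

```python
from collections import deque
from typing import Dict, List, Optional

# Toy undirected "is-a" graph, a hypothetical stand-in for WordNet.
TOY_GRAPH: Dict[str, List[str]] = {
    "animal": ["canine", "feline"],
    "canine": ["animal", "dog", "wolf"],
    "feline": ["animal", "cat"],
    "dog": ["canine", "puppy"],
    "wolf": ["canine"],
    "cat": ["feline"],
    "puppy": ["dog"],
}


def shortest_path_length(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[int]:
    """Breadth-first search; returns None if either word is unknown or unreachable."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph: Dict[str, List[str]], a: str, b: str) -> float:
    """1 / (1 + shortest path length), or 0.0 when no path exists."""
    dist = shortest_path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def top_k_related(graph: Dict[str, List[str]], word: str, k: int) -> List[str]:
    """Return the k words closest to `word`, ties broken alphabetically."""
    scored = [(path_similarity(graph, word, other), other) for other in graph if other != word]
    scored = [(sim, other) for sim, other in scored if sim > 0.0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def expand_query(graph: Dict[str, List[str]], query: str, k: int) -> Dict[str, List[str]]:
    """Split the query on spaces and look up the k closest words for each token."""
    return {token: top_k_related(graph, token, k) for token in query.split(" ")}


if __name__ == "__main__":
    print(expand_query(TOY_GRAPH, "dog chases cat", k=2))
```

Words missing from the graph, such as "chases" here, simply get an empty list, so a larger k only adds candidates for words the graph knows about.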
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Email: alishibli97@hotmail.com
TODO
- Verify large dataset collections (in terms of quality and time)
- Implement parallel execution for faster data collection
- Expand queries using more methods, such as translations.