A mini framework of image crawlers

Introduction

This Python package is a mini framework for image crawlers.

Requirements

Python 2.7+ or 3.4+.

Structure

It consists of 3 main components (Feeder, Parser and Downloader) and 2 FIFO queues (url_queue and task_queue). The workflow is shown in the following figure.

  • url_queue stores the URLs of pages which may contain images.

  • task_queue stores the image URLs as well as any metadata you like. Each element in the queue is a dictionary and must contain the field img_url.

  • Feeder puts page URLs into url_queue.

  • Parser requests and parses each page, then extracts the image URLs and puts them into task_queue.

  • Downloader gets tasks from task_queue, requests the images, and saves them in the given path.

Feeder, parser and downloader are all thread managers, which means they start threads to finish corresponding tasks, so you can specify the number of threads they use.
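
The workflow above can be sketched with the standard library alone. This is a minimal illustration, not icrawler's actual implementation: the function names, the SENTINEL end-of-queue marker and the fake image URL are assumptions made for the example, and it is written for Python 3 (the module is named Queue on Python 2). No network requests are made.

```python
import queue
import threading

SENTINEL = None  # end-of-queue marker (an assumption of this sketch)


def feeder(url_queue, num_pages):
    """Put page URLs into url_queue, then signal that feeding is done."""
    for i in range(num_pages):
        url_queue.put('http://example.com/page/{}'.format(i + 1))
    url_queue.put(SENTINEL)


def parser(url_queue, task_queue):
    """Turn each page URL into an image task (page fetching is faked)."""
    while True:
        page_url = url_queue.get()
        if page_url is SENTINEL:
            task_queue.put(SENTINEL)
            break
        # A real parser would request the page and extract image URLs.
        task_queue.put({'img_url': page_url + '/image.jpg'})


def downloader(task_queue, saved):
    """Consume tasks; a real downloader would fetch and save each image."""
    while True:
        task = task_queue.get()
        if task is SENTINEL:
            break
        saved.append(task['img_url'])


def run_pipeline(num_pages=3):
    """Run one thread per component and return the 'downloaded' URLs."""
    url_queue, task_queue, saved = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=feeder, args=(url_queue, num_pages)),
        threading.Thread(target=parser, args=(url_queue, task_queue)),
        threading.Thread(target=downloader, args=(task_queue, saved)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saved
```

With more than one thread per component, a single sentinel is no longer enough to stop every worker, which is one reason a real framework needs a thread manager around each stage.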

Quick start

Installation

For quick install, just use pip.

pip install icrawler

You can also install it manually from the source directory with

python setup.py install

All the dependencies should then be installed. If anything goes wrong, you can install them manually.

pip install -r requirements.txt

This framework uses the HTTP library requests for sending requests and the parsing library beautifulsoup4 for parsing HTML pages.

Use built-in crawlers

This framework contains 5 built-in crawlers.

  • Google

  • Bing

  • Baidu

  • Flickr

  • General greedy crawl (crawl all the images from a website)

Here is an example of how to use the built-in crawlers. The search engine crawlers have similar interfaces.

from icrawler.examples import GoogleImageCrawler
from icrawler.examples import BingImageCrawler
from icrawler.examples import BaiduImageCrawler

google_crawler = GoogleImageCrawler('your_image_dir')
google_crawler.crawl(keyword='sunny', offset=0, max_num=1000,
                     date_min=None, date_max=None, feeder_thr_num=1,
                     parser_thr_num=1, downloader_thr_num=4,
                     min_size=(200,200), max_size=None)
bing_crawler = BingImageCrawler('your_image_dir')
bing_crawler.crawl(keyword='sunny', offset=0, max_num=1000,
                   feeder_thr_num=1, parser_thr_num=1, downloader_thr_num=4,
                   min_size=None, max_size=None)
baidu_crawler = BaiduImageCrawler('your_image_dir')
baidu_crawler.crawl(keyword='sunny', offset=0, max_num=1000,
                    feeder_thr_num=1, parser_thr_num=1, downloader_thr_num=4,
                    min_size=None, max_size=None)

Note: Only the Google image crawler supports the date range parameters.

The Flickr crawler is a little different.

from datetime import date
from icrawler.examples import FlickrImageCrawler

flickr_crawler = FlickrImageCrawler('your_apikey', 'your_image_dir')
flickr_crawler.crawl(max_num=1000, feeder_thr_num=1, parser_thr_num=1,
                     downloader_thr_num=1, tags='child,baby',
                     group_id='68012010@N00', min_upload_date=date(2015, 5, 1))

Supported optional search arguments are

  • user_id – The NSID of the user whose photos to search.

  • tags – A comma-delimited list of tags.

  • tag_mode – Either ‘any’ for an OR combination of tags, or ‘all’ for an AND combination.

  • text – A free text search. Photos whose title, description or tags contain the text will be returned.

  • min_upload_date – Minimum upload date. The date can be a datetime.date object, a Unix timestamp or a string.

  • max_upload_date – Maximum upload date. Same form as min_upload_date.

  • group_id – The ID of a group whose pool to search.

  • extras – A comma-delimited list of extra information to fetch for each returned record. See here for more details.

  • per_page – Number of photos to return per page.
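
To illustrate the accepted date forms, the sketch below converts a datetime.date into a UTC Unix timestamp and passes timestamps and strings through unchanged. The helper name to_unix_timestamp is hypothetical, and this is not necessarily how the crawler performs the conversion internally.

```python
import calendar
from datetime import date


def to_unix_timestamp(value):
    """Return a UTC Unix timestamp for a date; pass other values through."""
    if isinstance(value, date):
        # timegm interprets the time tuple as UTC, so the result does not
        # depend on the local timezone of the machine.
        return calendar.timegm(value.timetuple())
    return value


print(to_unix_timestamp(date(2015, 5, 1)))  # midnight UTC on 2015-05-01
print(to_unix_timestamp(1430438400))        # already a timestamp
print(to_unix_timestamp('2015-05-01'))      # strings are left as-is
```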

If you just want to crawl all the images from some website, then GreedyImageCrawler may be helpful.

from icrawler.examples import GreedyImageCrawler

greedy_crawler = GreedyImageCrawler('images/greedy')
greedy_crawler.crawl(domains='bbc.com', max_num=0,
                     parser_thr_num=1, downloader_thr_num=1,
                     min_size=None, max_size=None)

The argument domains can be either a single URL string or a list of them. Second-level domains and subpaths are supported, but the domains must not include a scheme such as ‘http’.
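
If your input may contain full URLs, you can strip the scheme before passing it on. The helper below is a hypothetical convenience written for this example, not part of icrawler; it accepts a string or a list and returns a list of scheme-less domains.

```python
def normalize_domains(domains):
    """Return a list of domains with any leading scheme removed."""
    if isinstance(domains, str):
        domains = [domains]
    cleaned = []
    for domain in domains:
        # Drop a leading scheme such as 'http://' or 'https://'.
        if '://' in domain:
            domain = domain.split('://', 1)[1]
        # A trailing slash carries no information for matching.
        cleaned.append(domain.rstrip('/'))
    return cleaned


print(normalize_domains('http://bbc.com/'))
print(normalize_domains(['news.bbc.com/sport', 'https://example.com']))
```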

You can see the complete example in test.py. To run it, use

python test.py [option]

option can be google, bing, baidu, flickr, greedy or all. If no arguments are specified, all is used by default.

Write your own crawler

The simplest way is to override some methods of the Feeder, Parser and Downloader classes.

  1. Feeder

    The method you need to override is

    feeder.feed(**kwargs)

    If you want to offer the start URLs all at once, for example from ‘http://example.com/page_url/1’ up to ‘http://example.com/page_url/10’:

    from icrawler import Feeder
    
    class MyFeeder(Feeder):
        def feed(self):
            for i in range(10):
                url = 'http://example.com/page_url/{}'.format(i + 1)
                self.url_queue.put(url)

  2. Parser

    The method you need to override is

    parser.parse(response, **kwargs)

    response is the HTTP response for a page whose URL came from url_queue. You need to parse the page, extract the image URLs, and put them into task_queue. The Beautiful Soup package is recommended for parsing HTML pages. Taking GoogleParser as an example,

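    import json

    from bs4 import BeautifulSoup
    from icrawler import Parser

    class GoogleParser(Parser):

        def parse(self, response):
            # Each matching div holds a JSON blob describing one image.
            soup = BeautifulSoup(response.content, 'lxml')
            image_divs = soup.find_all('div', class_='rg_di rg_el ivg-i')
            for div in image_divs:
                meta = json.loads(div.text)
                if 'ou' in meta:
                    # The 'ou' field carries the image URL.
                    self.put_task_into_queue(dict(img_url=meta['ou']))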

  3. Downloader

    If you just want to change the filename of downloaded images, you can override the method

    downloader.set_file_path(img_task)

    The default names of downloaded images are counting numbers, from 000001 to 999999.

    If you want to process metadata, for example to save some annotations of the images, you can override the method

    downloader.process_meta(img_task)

    Note that your parser needs to put metadata as well as image URLs into task_queue.

    If you want to do more with the downloader, you can also override the method

    downloader.download(img_task, request_timeout, max_retry=3,
                        min_size=None, max_size=None, **kwargs)

    You can retrieve tasks from task_queue and then process them however you like.

  4. Crawler

    You can either use the base class ImageCrawler or inherit from it. The two main APIs are

    crawler.__init__(self, img_dir='images', feeder_cls=Feeder, parser_cls=Parser,
                     downloader_cls=Downloader, log_level=logging.INFO)

    and

    crawler.crawl(self, feeder_thr_num=1, parser_thr_num=1,
                  downloader_thr_num=1, feeder_kwargs={},
                  parser_kwargs={}, downloader_kwargs={})

    So you can use your crawler like this

    crawler = Crawler(feeder_cls=SimpleSEFeeder, parser_cls=MyParser)
    crawler.crawl(feeder_thr_num=1, parser_thr_num=1, downloader_thr_num=4,
                  feeder_kwargs=dict(
                      url_template='https://www.some_search_engine.com/search?keyword={}&start={}',
                      keyword='cat',
                      offset=0,
                      max_num=1000,
                      page_step=50
                      ),
                  downloader_kwargs=dict(
                      max_num=1000,
                      min_size=None,
                      max_size=None
                      )
                  )

    Or define a class to avoid using complex and ugly dictionaries as arguments.

    class MyCrawler(Crawler):
    
        def __init__(self, img_dir='images', log_level=logging.INFO):
            super(MyCrawler, self).__init__(
                img_dir, feeder_cls=SimpleSEFeeder,
                parser_cls=MyParser, log_level=log_level)
    
        def crawl(self, keyword, offset=0, max_num=1000, feeder_thr_num=1, parser_thr_num=1,
                  downloader_thr_num=1, min_size=None, max_size=None):
            feeder_kwargs = dict(
                url_template='https://www.some_search_engine.com/search?keyword={}&start={}',
                keyword=keyword,
                offset=offset,
                max_num=max_num,
                page_step=50
            )
            downloader_kwargs = dict(
                max_num=max_num,
                min_size=min_size,
                max_size=max_size
            )
            super(MyCrawler, self).crawl(
                feeder_thr_num, parser_thr_num, downloader_thr_num,
                feeder_kwargs=feeder_kwargs,
                downloader_kwargs=downloader_kwargs)
    
    crawler = MyCrawler()
    crawler.crawl(keyword='cat', offset=0, max_num=1000, feeder_thr_num=1,
                  parser_thr_num=1, downloader_thr_num=4, max_size=(1000,800))
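
To make the feeder arguments above more concrete, here is a rough sketch of how a search engine feeder might turn url_template, keyword, offset, max_num and page_step into page URLs. The function build_page_urls is hypothetical, and the assumption that the two placeholders are filled with the keyword and the start index is ours, not a statement about how SimpleSEFeeder works.

```python
def build_page_urls(url_template, keyword, offset=0, max_num=1000,
                    page_step=50):
    """Return one result-page URL per step of page_step images."""
    urls = []
    # Assumption: each page lists page_step images, starting at `start`.
    for start in range(offset, offset + max_num, page_step):
        urls.append(url_template.format(keyword, start))
    return urls


template = 'https://www.some_search_engine.com/search?keyword={}&start={}'
for url in build_page_urls(template, 'cat', offset=0, max_num=120):
    print(url)
```

A feeder built this way would put each generated URL into url_queue instead of collecting them in a list.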

API reference

To be continued.
