imagebot

A web bot to crawl websites and scrape images.

Development Status
- 4 - Beta
Environment
- Console
Framework
- Scrapy
- Twisted
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- Microsoft :: Windows
- POSIX :: Linux
Programming Language
- Python :: 2.7
Topic
- Internet :: WWW/HTTP :: Indexing/Search

Project description

This bot (an image scraper) crawls one or more given URLs and downloads all the images it finds.

Features

Supported platforms: Linux and Windows, with Python 2.7.
Maintains a database of all downloaded images to avoid duplicate downloads.
Optionally scrapes only under a particular URL, e.g. scraping http://website.com/albums/new with this option downloads only from the "new" album.
Filters URLs by regex.
Filters images by minimum size.
Scrapes through JavaScript popup links (limited support).
Live monitor window for displaying images as they are scraped.
Asynchronous I/O design using Scrapy and Twisted.
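
The duplicate-avoidance feature can be pictured with a small sketch. This is not imagebot's actual schema or code: the DownloadLog class, the table layout, and the choice of sqlite3 are all assumptions made for illustration.

```python
import sqlite3


class DownloadLog(object):
    """Hypothetical helper: remembers which image URLs were fetched per job."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS images ("
            "job TEXT NOT NULL, url TEXT NOT NULL, PRIMARY KEY (job, url))"
        )

    def seen(self, job, url):
        # True if this URL was already recorded for the given job.
        row = self.conn.execute(
            "SELECT 1 FROM images WHERE job = ? AND url = ?", (job, url)
        ).fetchone()
        return row is not None

    def record(self, job, url):
        # INSERT OR IGNORE makes repeated calls for the same URL harmless.
        self.conn.execute(
            "INSERT OR IGNORE INTO images (job, url) VALUES (?, ?)", (job, url)
        )
        self.conn.commit()


log = DownloadLog()
url = "http://website.com/a.jpg"
first_time = not log.seen("website.com", url)   # not yet recorded
log.record("website.com", url)
second_time = not log.seen("website.com", url)  # now recorded
```

A crawler built this way would call seen() before each download and record() after it, so a later run of the same job skips images it already has.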

Usage

crawl command:

Scrape images:

imagebot crawl http://website.com
imagebot crawl http://website.com,http://otherwebsite.com

Options for crawl command:

-d, --domains

Allow images to be downloaded from additional domains (give multiple domains as a comma-separated list). The domains of the start URL(s) are allowed by default.

imagebot crawl http://website.com -d otherwebsite.com,anotherwebsite.com
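
As a rough sketch of how such a domain allow-list could behave (not imagebot's actual code), the functions below build the set of allowed domains and test a URL against it. The names allowed_domains and is_allowed are hypothetical, and treating subdomains as allowed is an assumption.

```python
from urllib.parse import urlparse  # on Python 2.7: from urlparse import urlparse


def allowed_domains(start_urls, extra_domains=""):
    """Combine the start URL hosts with a comma-separated list of extra domains."""
    hosts = (urlparse(u).hostname for u in start_urls)
    domains = set(h for h in hosts if h)
    domains.update(d.strip().lower() for d in extra_domains.split(",") if d.strip())
    return domains


def is_allowed(url, domains):
    """Assumption: a host is allowed if it equals, or is a subdomain of, a listed domain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


domains = allowed_domains(
    ["http://website.com"], "otherwebsite.com,anotherwebsite.com"
)
```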

-is, --images-store

Specify image store location. Default: ~/Pictures/crawled/[jobname]

imagebot crawl http://website.com -is /home/images

-s, --min-size

Specify minimum size of image to be downloaded (width x height).

imagebot crawl http://website.com -s 300x300
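
A minimal sketch of how a "width x height" value such as 300x300 might be parsed and applied. The function names are hypothetical, and requiring both dimensions to meet the minimum is an assumption; the source does not say how imagebot compares sizes.

```python
def parse_min_size(spec):
    """Turn a string like '300x300' into a (width, height) tuple of ints."""
    width, height = spec.lower().split("x")
    return int(width), int(height)


def is_large_enough(image_size, min_size):
    """Assumption: both width and height must reach the minimum."""
    return image_size[0] >= min_size[0] and image_size[1] >= min_size[1]


min_size = parse_min_size("300x300")
```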

-u, --stay-under

Stay under the start url. Only those urls that have the start url as prefix will be crawled. Useful, for example, to crawl an album or a subsection on a website.

imagebot crawl http://website.com/albums/new -u
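
The prefix rule can be illustrated with a literal string comparison. This is only a sketch of the rule as described, with a hypothetical function name; whether imagebot normalizes URLs before comparing is not stated.

```python
def is_under(start_url, url):
    """True if url has start_url as a literal prefix."""
    return url.startswith(start_url)


start = "http://website.com/albums/new"
```

Note that a literal prefix test also accepts http://website.com/albums/newest, so under this reading a trailing slash on the start URL would narrow the crawl to the album itself.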

-m, --monitor

Launch monitor window for displaying images as they are scraped.

imagebot crawl http://website.com -m

-a, --user-agent

Set the user-agent string. Default: imagebot. Changing it to identify your own bot is recommended as a matter of responsible crawling.

imagebot crawl http://website.com -a "my_imagebot(http://mysite.com)"

-r, --url-regex

Specify regex for urls. Only those urls matching the regex will be crawled. It does not apply to start url(s).

imagebot crawl http://website.com -r ".*gallery.*"
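
The filtering rule described for this option can be sketched as follows. The function name is hypothetical, and using re.search rather than a full match is an assumption; the source only says that matching URLs are crawled and that start URLs are exempt.

```python
import re


def should_crawl(url, pattern, start_urls):
    """Start URLs are always crawled; other URLs must match the regex."""
    if url in start_urls:
        return True
    return re.search(pattern, url) is not None


start_urls = ["http://website.com"]
pattern = r".*gallery.*"
```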

-dl, --depth-limit

Specify the depth limit for crawling. Use a value of 0 to scrape only the start URL(s).

imagebot crawl http://website.com -dl 2

--no-cdns

A list of well-known CDNs is included and enabled by default for image downloads. Use this option to disable it.

-at, --auto-throttle

Enable the auto-throttle feature of Scrapy (see the Scrapy documentation for details).

-j, --jobname

Specify a job name. It is used to store image metadata in the database. By default, the domain name of the start URL is used as the job name.
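
A sketch of the two defaults described above: the job name taken from the start URL's domain, and the image store location ~/Pictures/crawled/[jobname]. The function names are hypothetical, and how imagebot handles multiple start URLs is not covered here.

```python
import os
from urllib.parse import urlparse  # on Python 2.7: from urlparse import urlparse


def default_jobname(start_url):
    """Use the domain of the start URL as the job name."""
    return urlparse(start_url).hostname


def default_images_store(jobname):
    """Build ~/Pictures/crawled/[jobname] for the current user."""
    return os.path.join(os.path.expanduser("~"), "Pictures", "crawled", jobname)


jobname = default_jobname("http://website.com/albums/new")
store = default_images_store(jobname)
```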

-nc, --no-cache

Disable http caching.

-l, --log-level

Specify log level. Supported levels: info, silent, critical, error, debug, warning. Default: error.

imagebot crawl http://website.com -l debug

-h, --help

Get help on crawl command options.
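
Several crawl options resemble standard Scrapy settings. The mapping below is an assumption for orientation only; the source does not say how imagebot passes these options to Scrapy. The setting names themselves are real Scrapy settings, and the values echo the examples above.

```python
# Hypothetical mapping from imagebot crawl options to Scrapy settings.
SCRAPY_SETTINGS_SKETCH = {
    "USER_AGENT": "my_imagebot(http://mysite.com)",  # -a
    "DEPTH_LIMIT": 2,                                # -dl 2
    "AUTOTHROTTLE_ENABLED": True,                    # -at
    "HTTPCACHE_ENABLED": False,                      # -nc
    "LOG_LEVEL": "DEBUG",                            # -l debug
    "IMAGES_STORE": "/home/images",                  # -is /home/images
    "IMAGES_MIN_WIDTH": 300,                         # -s 300x300
    "IMAGES_MIN_HEIGHT": 300,
}
```

One difference worth keeping in mind: in Scrapy itself, DEPTH_LIMIT = 0 means no limit, whereas the -dl option documented here uses 0 to mean start URL(s) only, so the two values cannot be identical in that case.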

clear command:

This command is useful for various kinds of cleanup.

Options for clear command:

--cache

Clear http cache.

--db

Remove image metadata for a job from the database.

imagebot clear --db website.com

--duplicate-images

Multiple copies of the same image may be downloaded when it is served under different URLs. Use this option to delete duplicate images for a job.

imagebot clear --duplicate-images website.com
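
One common way to detect identical files saved under different names is to compare content hashes. The sketch below assumes that approach; the source does not say how imagebot identifies duplicates, and the function names are hypothetical. It only lists duplicates and does not delete anything.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """SHA-256 of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory):
    """Return paths whose content matches an earlier file (by sorted name)."""
    seen = {}
    duplicates = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        key = file_digest(path)
        if key in seen:
            duplicates.append(path)
        else:
            seen[key] = path
    return duplicates
```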

-h, --help

Get help on clear command options.

Dependencies

pywin32 (http://sourceforge.net/projects/pywin32/)

Needed on Windows.

python-gi (Python GObject Introspection API)

Optional. Needed on Linux for the GTK UI. If it is not found, Python's built-in Tkinter is used instead. On Ubuntu: apt-get install python-gi

scrapy (web crawling framework)

Installed automatically by pip.

Pillow (Python Imaging Library)

Installed automatically by pip.

Download

PyPI: http://pypi.python.org/pypi/imagebot/
Source: https://github.com/amol9/imagebot/

Release history

1.2.1: Jul 13, 2015 (this version)
1.2.0: Mar 6, 2015
1.1.1: Feb 24, 2015
1.1.0: Feb 23, 2015
1.0.3: Feb 3, 2015
1.0.2: Feb 1, 2015
1.0.1: Jan 31, 2015
1.0: Jan 31, 2015

Download files

Source Distribution

imagebot-1.2.1.tar.gz (15.9 kB)

Uploaded Jul 13, 2015 Source

File details

Details for the file imagebot-1.2.1.tar.gz.

File metadata

Download URL: imagebot-1.2.1.tar.gz
Upload date: Jul 13, 2015
Size: 15.9 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for imagebot-1.2.1.tar.gz
Algorithm	Hash digest
SHA256	`6a5f44d3a97310b7f5b2af3b088718bc5becf5560ba423ff72e8b88dd1341d37`
MD5	`b51b896d7fb0b177d1ef6eacafdcb914`
BLAKE2b-256	`8e446384661b9942f16056240ba1fb9c0f4397686e4cfa2a45e1e4c059e65bae`
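
A downloaded archive can be checked against the SHA256 digest listed above with Python's standard hashlib module. This is a minimal sketch; the function names are illustrative, and the local file path is whatever you saved the archive as.

```python
import hashlib

# SHA256 digest listed above for imagebot-1.2.1.tar.gz
EXPECTED_SHA256 = "6a5f44d3a97310b7f5b2af3b088718bc5becf5560ba423ff72e8b88dd1341d37"


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the expected one (case-insensitive)."""
    return sha256_of(path) == expected.lower()
```

Calling verify("imagebot-1.2.1.tar.gz") on the downloaded file should return True only if the archive matches the published digest.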

