crawl-me

中文README (README in Chinese).

Crawl-me is a lightweight, fast, plugin-based web picture crawler. If a website is supported, you can download your favorite pictures from it through the corresponding plugin. For now, the available plugins are gamersky and pixiv. If you want to contribute, please feel free to contact me.

Fork me on GitHub :) https://github.com/nyankosama/crawl-me

Features

  • The crawl-me core supports multi-threaded downloading using HTTP Range headers, so it’s very fast.
  • It’s plugin based, so you are free to add any plugin you want.
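The Range-header idea behind the multi-threaded download can be sketched independently of crawl-me's internals. The helper names below (`split_ranges`, `range_download`) are hypothetical, and the sketch uses Python 3's standard library for brevity even though crawl-me itself targets Python 2.7:

```python
import threading
import urllib.request

def split_ranges(total_size, n_parts):
    """Split [0, total_size) into n contiguous (start, end) byte ranges,
    end inclusive, as used in an HTTP 'Range: bytes=start-end' header."""
    part = total_size // n_parts
    ranges = []
    for i in range(n_parts):
        start = i * part
        end = total_size - 1 if i == n_parts - 1 else start + part - 1
        ranges.append((start, end))
    return ranges

def fetch_part(url, start, end, buf, index):
    """Fetch one byte range of the file into its slot in buf."""
    req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
    buf[index] = urllib.request.urlopen(req).read()

def range_download(url, total_size, n_threads=4):
    """Download url in n_threads parallel chunks and return the joined bytes."""
    buf = [None] * n_threads
    threads = []
    for i, (start, end) in enumerate(split_ranges(total_size, n_threads)):
        t = threading.Thread(target=fetch_part, args=(url, start, end, buf, i))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return b"".join(buf)
```

A real downloader would first check that the server honours Range requests (crawl-me's changelog mentions an auto-check for exactly this) and fall back to a single-threaded download otherwise.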

Available plugins

  • pixiv : This plugin downloads all of a given author’s paintings from the pixiv site.
  • gamersky : This plugin downloads all pictures in a given topic from the gamersky site.
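A plugin-based crawler like this usually keeps a mapping from plugin name to handler object, so the core stays plugin agnostic. Crawl-me's actual plugin API is not documented here, so the class and function names below are purely illustrative:

```python
class GamerskyPlugin(object):
    """Illustrative stand-in for a real plugin (hypothetical API)."""
    name = "gamersky"

    def crawl(self, url, save_path, begin_page, end_page):
        # A real plugin would fetch pages and save images here.
        return "crawl %s pages %d-%d into %s" % (url, begin_page, end_page, save_path)

# The core only needs a name -> plugin lookup; adding a plugin
# means writing one class and registering it here.
PLUGINS = {cls.name: cls for cls in (GamerskyPlugin,)}

def run(plugin_name, *args):
    """Dispatch the command-line arguments to the chosen plugin."""
    return PLUGINS[plugin_name]().crawl(*args)
```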

Installation

install via pip

Make sure you have already installed Python 2.7 and pip.

Because this package depends on lxml, on Linux please make sure you have installed libxslt-devel and libxml2-devel. On Windows, please select a suitable lxml installer and install it.

And then:

$ pip install crawl-me

On Windows, please add {$python-home}/Scripts/ to your system PATH.

install via git

1. Ubuntu

Install the prerequisite library first:

sudo apt-get install libxml2-dev
sudo apt-get install libxslt1-dev

Then install setuptools so that you can run the setup.py file:

sudo apt-get install python-setuptools

Finally, git clone the source, and install:

$ git clone https://github.com/nyankosama/crawl-me.git
$ cd crawl-me/
$ sudo python setup.py install

2. Windows

Make sure you have already installed Python 2.7 and pip.

You can install Python 2.7 via the Windows installer, and pip by downloading get-pip.py and running it with Python:

python get-pip.py

Then install the prerequisite library lxml: please select a suitable lxml installer and install it.

Finally, git clone the source and install:

$ git clone https://github.com/nyankosama/crawl-me.git
$ cd crawl-me/
$ python setup.py install

Please add {$python-home}/Scripts/ to your system PATH.

Usage

Examples

  1. Download the pictures from pages 1 to 10 of http://www.gamersky.com/ent/201404/352055.shtml on the gamersky site, and store them in a local directory.

    crawl-me gamersky http://www.gamersky.com/ent/201404/352055.shtml ./gamersky-crawl 1 10
    
  2. Download all the paintings of 藤原 (Fujiwara, pixiv ID = 27517), and store them in a local directory.

    crawl-me pixiv 27517 ./pixiv-crawl <your pixiv loginid> <your password>
    

Command line options

  1. general help

    $ crawl-me -h
    
    usage: crawl-me [-h] plugin
    
    positional arguments:
        plugin      plugin the crawler uses
    
    optional arguments:
        -h, --help  show this help message and exit
    
    available plugins:
    ----gamersky
    ----pixiv
    
  2. gamersky

    $ crawl-me gamersky -h
    
    usage: crawl-me [-h] plugin url savePath beginPage endPage
    
    positional arguments:
        plugin      plugin the crawler uses
        url         your url to crawl
        savePath    the path where the imgs are saved
        beginPage   the page where we start crawling
        endPage     the page where we end crawling
    
    optional arguments:
        -h, --help  show this help message and exit
    
  3. pixiv

    $ crawl-me pixiv -h
    
    usage: crawl-me [-h] plugin authorId savePath pixivId password
    
    positional arguments:
        plugin      plugin the crawler uses
        authorId    the author id you want to crawl
        savePath    the path where the imgs are saved
        pixivId     your pixiv login id
        password    your pixiv login password
    
    optional arguments:
        -h, --help  show this help message and exit
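The two-level help shown above (a generic parser, then per-plugin positional arguments) can be reproduced with Python's argparse. This is a hypothetical sketch, not crawl-me's actual source:

```python
import argparse

def build_parser(plugin=None):
    """Build the generic parser, then add the chosen plugin's arguments."""
    parser = argparse.ArgumentParser(prog="crawl-me")
    parser.add_argument("plugin", help="plugin the crawler uses")
    if plugin == "gamersky":
        parser.add_argument("url", help="your url to crawl")
        parser.add_argument("savePath", help="the path where the imgs are saved")
        parser.add_argument("beginPage", type=int, help="the page where we start crawling")
        parser.add_argument("endPage", type=int, help="the page where we end crawling")
    elif plugin == "pixiv":
        parser.add_argument("authorId", help="the author id you want to crawl")
        parser.add_argument("savePath", help="the path where the imgs are saved")
        parser.add_argument("pixivId", help="your pixiv login id")
        parser.add_argument("password", help="your pixiv login password")
    return parser
```

Peeking at the first positional argument to pick the plugin, then re-parsing with the plugin-specific parser, yields exactly the kind of per-plugin `-h` output listed above.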
    

TODO

  • Functions:
    • support breakpoint resume
  • Plugins:
    • weibo
    • qq zone
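The planned breakpoint resume fits naturally with the Range headers the core already uses: resume from however many bytes are already on disk. A hypothetical sketch (these helpers are not part of crawl-me):

```python
import os

def resume_offset(path):
    """Bytes already downloaded: the partial file's size, or 0 if absent."""
    return os.path.getsize(path) if os.path.exists(path) else 0

def resume_header(path):
    """HTTP header asking the server for everything after the saved bytes."""
    offset = resume_offset(path)
    return {"Range": "bytes=%d-" % offset} if offset else {}
```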

License

MIT

ChangeLog

0.1.9dev-20140617-1

Date: 2014-06-17

  • add projconf.py to the crawl_me package
  • bug fix: pixiv plugin gets page size <= 9

0.1.8

Date: 2014-06-15

  • add English README

0.1.8dev-20140615

Date: 2014-06-15

  • bug fix: the -v/--version option failed to load project.json

0.1.8dev-20140612

Date: 2014-06-12

  • add a -v/--version option to the main executable to show the package version

0.1.7

Date: 2014-06-11

  • add auto-detection of HTTP Range header support

0.1.6

Date: 2014-06-11

  • bug fix: terminals without colour support don’t display the syslog prefix

0.1.5

Date: 2014-06-11

  • bug fix: pip install bug on the Windows platform

0.1.5dev-20140611

Date: 2014-06-11

  • bug fix: PyPI data_files

0.1.4

Date: 2014-06-11

  • the latest release

0.1.4dev-20140611

Date: 2014-06-11

  • modify README.md; it is now in rst format so that it displays correctly on PyPI

0.1.4dev-20140610

Date: 2014-06-10

  • add support for installing from pip

0.1.4dev3

Date: 2014-06-10

  • bug fix: the binary write problem on the Windows platform

0.1.4dev2

Date: 2014-06-10

  • add setuptools install support

0.1.4dev1

Date: 2014-06-09

  • bug fix: rangedownloader: HTTP Range headers may not be supported by the server

0.1.3

Date: 2014-06-07

  • some refactoring
  • add conf dictionary

0.1.2

Date: 2014-06-06

  • add plugin
  • pixiv

0.1.1

Date: 2014-06-05

  • add plugin
  • gamersky

0.0.1

Date: 2014-06-05

  • init the project
