This is a plugin-based web picture crawler. pixiv and gamersky plugins are now available!


crawl-me

Chinese README.

Crawl-me is a lightweight, fast, plugin-based web picture crawler. If a website is supported by a plugin, you can download your favorite pictures from it. For now, the available plugins are gamersky and pixiv. If you want to contribute, please feel free to contact me.

Fork me on GitHub :) https://github.com/nyankosama/crawl-me

Features

  • Crawl-me's core supports multi-threaded downloading using HTTP Range headers, so it is very fast (see the sketch after this list).

  • It is plugin-based, so you can freely add any plugin you want (a hypothetical plugin shape is sketched after the plugin list below).
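
To make the first point concrete, here is a minimal sketch of multi-threaded downloading with HTTP Range headers in Python 2.7 (the version crawl-me targets). It is only an illustration of the technique, not crawl-me's actual implementation; every name in it is made up, and it assumes the server honours Range requests:

# Minimal sketch of multi-threaded downloading with HTTP Range headers
# (Python 2.7, standard library only). Illustrative, not crawl-me's code.
import threading
import urllib2

def fetch_range(url, start, end, parts, i):
    # Ask the server for bytes [start, end] of the resource only.
    req = urllib2.Request(url, headers={'Range': 'bytes=%d-%d' % (start, end)})
    parts[i] = urllib2.urlopen(req).read()

def download(url, save_path, n_threads=4):
    # Probe the total size from the Content-Length header.
    size = int(urllib2.urlopen(url).info().getheader('Content-Length'))
    chunk = size // n_threads
    parts = [None] * n_threads
    workers = []
    for i in range(n_threads):
        start = i * chunk
        end = size - 1 if i == n_threads - 1 else start + chunk - 1
        t = threading.Thread(target=fetch_range, args=(url, start, end, parts, i))
        t.start()
        workers.append(t)
    for t in workers:
        t.join()
    with open(save_path, 'wb') as f:
        f.write(''.join(parts))  # reassemble the chunks in order

Each worker fetches its own byte range in parallel, and the chunks are stitched back together in order; this is what makes Range-based downloading fast on servers that support it.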

Available plugins

  • pixiv : This plugin allows you to download any artist's paintings from the pixiv site.

  • gamersky : This plugin supports downloading all pictures in a given topic from the gamersky site.
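
As promised above, here is a purely hypothetical sketch of what a plugin boils down to under this design. The class and method names are assumptions for illustration, not the project's actual API (see the repository for the real interface):

# Hypothetical shape of a crawl-me style plugin (Python 2.7).
# All names here are illustrative assumptions, not the project's real API.
import os

class GamerskyPlugin(object):
    name = 'gamersky'

    def __init__(self, url, save_path, begin_page, end_page):
        self.url = url
        self.save_path = save_path
        self.begin_page = int(begin_page)
        self.end_page = int(end_page)

    def run(self):
        # Walk the topic pages, parse out the image URLs (e.g. with lxml),
        # and hand each one to a downloader such as the sketch above.
        for page in range(self.begin_page, self.end_page + 1):
            for img_url in self.collect_image_urls(page):
                target = os.path.join(self.save_path, img_url.split('/')[-1])
                download(img_url, target)

    def collect_image_urls(self, page):
        raise NotImplementedError  # site-specific page parsing lives here

The core downloader stays generic; only the page-parsing part differs from site to site, which is why new plugins are cheap to add.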

Installation

install via pip

Make sure you have already installed Python 2.7 and pip.

Because the package relies on lxml, if your platform is Linux, please make sure you have installed libxslt-devel and libxml2-devel. On Windows, please select a suitable lxml installer and install it.

And then:

$ pip install crawl-me

On Windows, please add {$python-home}/Scripts/ to the system PATH.

install via git

1. Ubuntu

Install the prerequisite libraries first:

sudo apt-get install libxml2-dev
sudo apt-get install libxslt1-dev

Then install setuptools in order to run the setup.py file:

sudo apt-get install python-setuptools

Finally, git clone the source, and install:

$ git clone https://github.com/nyankosama/crawl-me.git
$ cd crawl-me/
$ sudo python setup.py install

2. Windows

Make sure you have already installed Python 2.7 and pip.

You can install Python 2.7 via the Windows installer, and pip by downloading get-pip.py and running it with Python:

python get-pip.py

Then install the prerequisite library lxml: please select a suitable lxml installer and install it.

Finally, git clone the source, and install:

$ git clone https://github.com/nyankosama/crawl-me.git
$ cd crawl-me/
$ python setup.py install

Again, remember to add {$python-home}/Scripts/ to the system PATH.

Usage

Examples

  1. Download the pictures on pages 1 to 10 of http://www.gamersky.com/ent/201404/352055.shtml from the gamersky site, and store them in a local directory.

    crawl-me gamersky http://www.gamersky.com/ent/201404/352055.shtml ./gamersky-crawl 1 10
  2. Download all the paintings by 藤原 (Fujiwara, pixiv ID 27517), and store them in a local directory.

    crawl-me pixiv 27517 ./pixiv-crawl <your pixiv loginid> <your password>

Command line options

  1. general help

    $ crawl-me -h
    
    usage: crawl-me [-h] plugin
    
    positional arguments:
        plugin      plugin the crawler uses
    
    optional arguments:
        -h, --help  show this help message and exit
    
    available plugins:
    ----gamersky
    ----pixiv
  2. gamersky

    $ crawl-me gamersky -h
    
    usage: crawl-me [-h] plugin url savePath beginPage endPage
    
    positional arguments:
        plugin      plugin the crawler uses
        url         your url to crawl
        savePath    the path where the images are saved
        beginPage   the page where we start crawling
        endPage     the page where we end crawling
    
    optional arguments:
        -h, --help  show this help message and exit
  3. pixiv

    $ crawl-me pixiv -h
    
    usage: crawl-me [-h] plugin authorId savePath pixivId password
    
    positional arguments:
        plugin      plugin the crawler uses
        authorId    the author id you want to crawl
        savePath    the path where the images are saved
        pixivId     your pixiv login id
        password    your pixiv login password
    
    optional arguments:
        -h, --help  show this help message and exit
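
The two help screens above suggest a two-stage parse: the plugin name is read first, and the remaining positional arguments are interpreted according to the chosen plugin. Below is a hedged sketch of how such a scheme could be wired with argparse; it is illustrative only, and the actual entry point in the repository may differ:

# Illustrative two-stage CLI parsing (Python 2.7), not crawl-me's actual code.
import argparse

PLUGIN_ARGS = {
    'gamersky': ['url', 'savePath', 'beginPage', 'endPage'],
    'pixiv': ['authorId', 'savePath', 'pixivId', 'password'],
}

def parse_args(argv=None):
    # Stage 1: only find out which plugin is requested.
    top = argparse.ArgumentParser(prog='crawl-me')
    top.add_argument('plugin', help='plugin the crawler uses')
    known, rest = top.parse_known_args(argv)

    # Stage 2: parse the plugin-specific positional arguments.
    sub = argparse.ArgumentParser(prog='crawl-me ' + known.plugin)
    for name in PLUGIN_ARGS[known.plugin]:
        sub.add_argument(name)
    return known.plugin, sub.parse_args(rest)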

TODO

  • Functions:

    • support resuming interrupted downloads (breakpoint resume)

  • Plugins:

    • weibo

    • qq zone

License

MIT

ChangeLog

0.1.9dev-20140617-1

Date: 2014-06-17

  • add projconf.py to the crawl_me package

  • bug fix: pixiv plugin returned a page size <= 9

0.1.8

Date: 2014-06-15

  • add English README

0.1.8dev-20140615

Date: 2014-06-15

  • bug fix: the -v/--version option failed to load project.json

0.1.8dev-20140612

Date: 2014-06-12

  • add a -v/--version option to the main executable to show the package version

0.1.7

Date: 2014-06-11

  • add an auto-check for HTTP Range header support

0.1.6

Date: 2014-06-11

  • bug fix: terminals without colour support didn't display the syslog prefix

0.1.5

Date: 2014-06-11

  • bug fix: pip install failure on the Windows platform

0.1.5dev-20140611

Date: 2014-06-11

  • bug fix: PyPI data_files

0.1.4

Date: 2014-06-11

  • the latest release

0.1.4dev-20140611

Date: 2014-06-11

  • modify README.md so it is consistent with reST format for display on PyPI

0.1.4dev-20140610

Date: 2014-06-10

  • add support for installing via pip

0.1.4dev3

Date: 2014-06-10

  • bug fix: binary write problem on the Windows platform

0.1.4dev2

Date: 2014-06-10

  • add setuptools install support

0.1.4dev1

Date: 2014-06-09

  • bug fix: rangedownloader, handle servers where HTTP Range headers are not supported

0.1.3

Date: 2014-06-07

  • do some refactoring

  • add conf dictionary

0.1.2

Date: 2014-06-06

  • add plugin: pixiv

0.1.1

Date: 2014-06-05

  • add plugin: gamersky

0.0.1

Date: 2014-06-05

  • init the project

Download files

Source distribution: crawl-me-0.1.9dev-20140617-1.tar.gz (12.2 kB)

Hashes for crawl-me-0.1.9dev-20140617-1.tar.gz:

  • SHA256: 5ba607bfc10dd6e55d3c706af2b676b63fb8f4aaa5cc09407b27a32102369520

  • MD5: 5a4cba468ee0b3656222aac6e9b86d0d

  • BLAKE2b-256: af60796e4352dc7b6f2f22c128cfea58ef5e02dc3b032d078d5c3bebbe483fac
