An image crawler with extendible modules and a GUI.
Comic Crawler
=============
Comic Crawler is a Python script for scraping images. It features a
simple download manager, a library, and easy extensibility.
Update 2016.2.27
----------------
- "www.comicvip.com" 被 "www.comicbus.com" 取代。詳細請參考 `#7 <https://github.com/eight04/ComicCrawler/issues/7>`__
Todos
-----
- Make grabber be able to return verbose info?
- Need a better error log system.
- Support pool in Sankaku.
- Add ``module.get_episode_id`` to let the module decide how to compare episodes.
Features
--------
- Extendible module design.
- Easy-to-use helper functions ``grabhtml`` and ``grabimg``.
- Automatically sets the Referer and other common headers.
Dependencies
------------
- docopt - command line interface.
- pyexecjs - to execute JavaScript.
- pythreadworker - a small threading library.
Development Dependencies
------------------------
- wheel - to build Python wheels.
- twine - to upload packages.
Download and install (Windows)
------------------------------
Comic Crawler is on
`PyPI <https://pypi.python.org/pypi/comiccrawler/2016.4.8>`__. After
installing Python, it can be installed with the pip command.
Install Python
~~~~~~~~~~~~~~
You need Python 3.4 or later. The installer can be downloaded from the
`official website <https://www.python.org/>`__.
During installation, remember to check "Add python.exe to Path" so that
the pip command is available.
Install Node.js
~~~~~~~~~~~~~~~
Some sites' JavaScript fails to parse under the Windows built-in
Windows Script Host, so installing `Node.js <https://nodejs.org/>`__ is
recommended.
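To check which JavaScript runtime pyexecjs actually picked up, a quick
sketch (assuming pyexecjs is already installed):

.. code:: python

    import execjs

    # Prints the auto-detected runtime, e.g. Node.js if it is installed.
    print(execjs.get().name)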
Install Comic Crawler
~~~~~~~~~~~~~~~~~~~~~
Enter the following command in cmd:

::

    pip install comiccrawler

To upgrade:

::

    pip install --upgrade comiccrawler
Supported domains
-----------------
chan.sankakucomplex.com comic.acgn.cc comic.ck101.com comic.sfacg.com danbooru.donmai.us deviantart.com exhentai.org g.e-hentai.org imgbox.com konachan.com m.dmzj.com manhua.dmzj.com seiga.nicovideo.jp tel.dm5.com tsundora.com tumblr.com tw.seemh.com www.8comic.com www.99comic.com www.chuixue.com www.comicbus.com www.comicvip.com www.dm5.com www.facebook.com www.iibq.com www.manhuadao.com www.pixiv.net www.seemh.com yande.re
Usage
-----
::

    Usage:
      comiccrawler domains
      comiccrawler download URL [--dest SAVE_FOLDER]
      comiccrawler gui
      comiccrawler migrate
      comiccrawler (--help | --version)

    Commands:
      domains             List supported domains
      download URL        Download from the specified URL
      gui                 Launch the main window
      migrate             Convert save.dat and library.dat in the current
                          directory to the new format

    Options:
      --dest SAVE_FOLDER  Set the download folder (default: ".")
      --help              Show help message
      --version           Show version
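For example, to download a gallery into a specific folder (the URL
below is a placeholder, not a real gallery):

::

    comiccrawler download "http://www.example.com/comic/123" --dest "C:\comics"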
GUI
---
.. figure:: http://i.imgur.com/ZzF0YFx.png
   :alt: Main window

   Main window
- Paste a URL into the text field, then click "Add URL" or press Enter.
- If the clipboard contains a supported URL and the text field is empty, the URL is pasted in automatically.
- Right-click a mission to add it to the library. Missions in the library are checked for updates each time the program starts.
Configuration
-------------
::

    [DEFAULT]
    ; Program to run after a mission finishes downloading; the path of
    ; the download folder is passed to it as an argument
    runafterdownload =
    ; Automatically check the library for updates on startup
    libraryautocheck = true
    ; Download destination folder
    savepath = ~/comiccrawler/download
    ; Enable grabber debugging
    errorlog = false
    ; Auto-save every 5 minutes
    autosave = 5
- The configuration file is located at ``%USERPROFILE%\comiccrawler\setting.ini``.
- Run ``comiccrawler gui`` once and close it; the configuration file is generated automatically.
- Individual sites may have their own settings, usually login-related information.
- Changes take effect after a restart. If Comic Crawler is running, you can click "Reload config" to load the new settings.
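Because ``runafterdownload`` receives the download folder as its only
argument, any executable can serve as a post-download hook. A minimal
sketch of such a hook script (the script itself is a hypothetical
example, not part of Comic Crawler):

.. code:: python

    #! python3
    """Hypothetical runafterdownload hook: report what was downloaded.

    Comic Crawler runs the configured program with the download folder
    path as an argument, so it arrives here as sys.argv[1].
    """
    import os
    import sys

    folder = sys.argv[1]
    count = len(os.listdir(folder))
    print("Downloaded {} file(s) into {}".format(count, folder))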
Module example
--------------
.. code:: python

    #! python3
    """
    This is an example showing how to write a Comic Crawler module.
    """

    import re, urllib.parse
    from ..core import Episode

    # The header used by the grabber method
    header = {}

    # The cookies
    cookie = {}

    # Matched domains. Sub-domains are supported.
    domain = ["www.example.com", "comic.example.com"]

    # Module name
    name = "Example"

    # With noepfolder = True, Comic Crawler won't generate a subfolder
    # for each episode.
    noepfolder = False

    # Wait 5 seconds between each download.
    rest = 5

    # Module-specific user settings
    config = {
        "user": "user-default-value",
        "hash": "hash-default-value"
    }

    def load_config():
        """Called each time the config is reloaded."""
        cookie.update(config)

    def get_title(html, url):
        """Return the mission title.

        The title is used in the save path, so be sure to avoid
        duplicate titles.
        """
        return re.search("<h1 id='title'>(.+?)</h1>", html).group(1)

    def get_episodes(html, url):
        """Return the episode list, sorted by date, oldest first."""
        match_iter = re.finditer("<a href='(.+?)'>(.+?)</a>", html)
        episodes = []
        for match in match_iter:
            m_url, title = match.groups()
            episodes.append(Episode(title, urllib.parse.urljoin(url, m_url)))
        return episodes

    def get_images(html, url):
        """Return the URLs of all images as a list, an iterator, or a
        string.

        The list or iterator may yield URL strings or callback
        functions that return a URL string.
        """
        match_iter = re.finditer("<img src='(.+?)'>", html)
        return [match.group(1) for match in match_iter]

    def get_next_page(html, url):
        """Return the URL of the next page."""
        match = re.search("<a id='nextpage' href='(.+?)'>next</a>", html)
        if match:
            return match.group(1)

    def errorhandler(error, episode):
        """Called by the downloader when an error happens while
        downloading an image. Normally you can just ignore this
        function.
        """
        pass
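As the ``get_images`` docstring notes, the returned list may contain
callback functions instead of plain URL strings, which is useful when
the real image URL sits behind an extra request. A minimal sketch of
that pattern, assuming ``grabhtml`` can be imported from ``..core``
like ``Episode`` (the import path and the selectors here are
assumptions for illustration):

.. code:: python

    from ..core import grabhtml  # assumed import path, as with Episode

    def get_images(html, url):
        """Return one callback per page; each callback resolves the
        real image URL only when the downloader needs it."""
        page_urls = re.findall("<a class='page' href='(.+?)'>", html)

        def make_resolver(page_url):
            def resolve():
                page_html = grabhtml(urllib.parse.urljoin(url, page_url))
                return re.search(
                    "<img id='comic' src='(.+?)'>", page_html).group(1)
            return resolve

        return [make_resolver(p) for p in page_urls]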
Changelog
---------
- 2016.4.8
- Fix get_next_page error.
- Fix key error in CLI.
- 2016.4.4
- Use new API!
- Analyzer will check the last episode to decide whether to analyze all pages.
- Support multiple images in one page.
- Change how getimgurl and getimgurls work.
- 2016.4.2
- Add tumblr module.
- Enhance: support sub-domain in ``mods.get_module``.
- 2016.3.27
- Fix: handle deleted post (konachan).
- Fix: enhance dialog. try to fix `#8 <https://github.com/eight04/ComicCrawler/issues/8>`__.
- 2016.2.29
- Fix: use latest comicview.js (8comic).
- 2016.2.27
- Fix: lastcheckupdate doesn't work.
- Add: comicbus domain (8comic).
- 2016.2.15.1
- Fix: cannot add missions.
- 2016.2.15
- Add ``lastcheckupdate`` setting. Now the library will only automatically check updates once a day.
- Refactor. Use MissionProxy, Mission doesn't inherit UserWorker anymore.
- 2016.1.26
- Change: checking updates won't affect missions that are downloading.
- Fix: page won't skip if the savepath contains "~".
- Add: a new url pattern in facebook.
- 2016.1.17
- Fix: a URL matching issue in Facebook.
- Enhance: on a crawlpage error, the downloader moves on to other episodes rather than stopping the current mission.
- 2016.1.15
- Fix: ComicCrawler doesn't save session during downloading.
- 2016.1.13
- Handle HTTPError 429.
- 2016.1.12
- Add facebook module.
- Add ``circular`` option in module, which should be set to ``True`` if the downloader doesn't know which page is the last in the album (e.g. Facebook).
- 2016.1.3
- Fix downloading failed in seemh.
- 2015.12.9
- Fix build-time dependencies.
- 2015.11.8
- Fix next page issue in danbooru.
- 2015.10.25
- Support nico seiga.
- Try to fix MemoryError when writing files.
- 2015.10.9
- Fix unicode range error in gui. See http://is.gd/F6JfjD
- 2015.10.8
- Fix: unable to skip episodes in the pixiv module.
- 2015.10.7
- Fix: unable to create a folder if the title contains "{}" characters.
- 2015.10.6
- Support search page in pixiv module.
- 2015.9.29
- Support http://www.chuixue.com.
- 2015.8.7
- Fixed sfacg bug.
- 2015.7.31
- Fixed: libraryautocheck option does not work.
- 2015.7.23
- Add module dmzj\_m. Some expunged manga may still be accessible from
  the mobile page:
  ``http://manhua.dmzj.com/name => http://m.dmzj.com/info/name.html``
- 2015.7.22
- Fix bug in module eight.
- 2015.7.17
- Fix episode selecting bug.
- 2015.7.16
- Added:
- Cleanup unused missions after session loads.
- Handle ajax episode list in seemh.
- Show an error if no update to download when clicking "download
updates".
- Show an error if failing to load session.
- Changed:
- Always use "UPDATE" state if the mission is not complete after
re-analyzing.
- Create backup if failing to load session instead of moving them
to "invalid-save" folder.
- Check edit flag in MissionManager.save().
- Fixed:
- Cannot download "updated" missions.
- Update checking will stop on error.
- Sankaku module is still using old method to create Episode.
- 2015.7.15
- Add module seemh.
- 2015.7.14
- Refactor: pull out download\_manager, mission\_manager.
- Enhance content\_write: use os.replace.
- Fix mission\_manager save loop interval.
- 2015.7.7
- Fix danbooru bug.
- Fix dmzj bug.
- 2015.7.6
- Fix getepisodes regex in exh.
- 2015.7.5
- Add error handler to dm5.
- Add error handler to acgn.
- 2015.7.4
- Support imgbox.
- 2015.6.22
- Support tsundora.
- 2015.6.18
- Fix url quoting issue.
- 2015.6.14
- Enhance ``safeprint``. Use ``echo`` command.
- Enhance ``content_write``. Add ``append=False`` option.
- Enhance ``Crawler``. Cache imgurl.
- Enhance ``grabber``. Add ``cookie=None`` option. Change errorlog
behavior.
- Fix ``grabber`` unicode encoding issue.
- Some module update.
- 2015.6.13
- Fix ``clean_finished``
- Fix ``console_download``
- Enhance ``get_by_state``
Author
------
- eight eight04@gmail.com