# webcrawl
[![Build Status](https://api.travis-ci.org/listen-lavender/webcrawl.svg?branch=master)](https://api.travis-ci.org/listen-lavender/webcrawl)
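webcrawl wraps the tools commonly used for web scraping, including requests, lxml, and phantomjs, and implements a workflow so that, by following its conventions, coders can focus on the scraping logic itself and build stable projects quickly. It also wraps a few other useful tools: rsa.py is a Python port of http://www.ohdave.com/rsa, which many websites use, and atlas.py handles some map-coordinate processing.
## HTTP request enhancements
handleRequest.py wraps the HTTP methods of the requests module that are commonly used for scraping, together with lxml parsing. It also supports a phantomjs proxy and handles several common kinds of content: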
> - html
> - xml
> - json
> - text
> - response object
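
To illustrate the idea of handling several body formats behind one call, here is a minimal sketch. It is not the API of handleRequest.py: `parse_body` and its `fmt` parameter are hypothetical names, and the sketch uses only the standard library (`json`, `xml.etree.ElementTree`, `html.parser`) as a stand-in for requests and lxml so that it runs without dependencies. The raw response object case is left out.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for an lxml XPath query)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def parse_body(body, fmt):
    """Hypothetical helper: turn a response body (str) into a Python object."""
    if fmt == "json":
        return json.loads(body)
    if fmt == "xml":
        return ET.fromstring(body)
    if fmt == "html":
        collector = _LinkCollector()
        collector.feed(body)
        return collector.links
    if fmt == "text":
        return body
    raise ValueError("unsupported format: %r" % fmt)
```

With a helper like this, the scraping code only names the format it expects, for example `parse_body(text, "json")`, and the parsing details stay in one place.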
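## Simple task control
task.py (work.py) implements the task workflow. It is data-driven and executes asynchronously, similar to Celery's composite types such as chain, group, and chord, and is intended to be more powerful and easier to use than Celery in this respect. It also governs the conventions for writing scraping code, and it depends on the pjq queue.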
> - workflow
> - priority
> - selfloop
> - subtask timeout
> - task timeout
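
The data-driven chaining described above can be sketched roughly as follows. This is not the interface of task.py: `run_workflow` and the three step functions are hypothetical, the sketch is single-threaded rather than asynchronous, and it assumes that lower priority numbers run first. Timeouts and self-looping steps are not covered.

```python
import heapq
import itertools


def run_workflow(steps, seed):
    """Run `steps` as a chain: every item a step yields becomes a subtask
    for the next step. Lower priority numbers run first (assumption)."""
    counter = itertools.count()  # tie-breaker so the heap never compares payloads
    heap = [(0, next(counter), 0, seed)]  # (priority, order, step index, payload)
    results = []
    while heap:
        _, _, index, payload = heapq.heappop(heap)
        for item in steps[index](payload):
            if index + 1 < len(steps):
                # Deeper steps get a smaller number so started work finishes first.
                heapq.heappush(heap, (-(index + 1), next(counter), index + 1, item))
            else:
                results.append(item)
    return results


def list_pages(site):
    # Stand-in for fetching an index page and extracting page URLs.
    for n in (1, 2):
        yield "%s/page/%d" % (site, n)


def list_items(page):
    # Stand-in for fetching one page and extracting item URLs.
    for name in ("a", "b"):
        yield "%s/item/%s" % (page, name)


def store_item(item_url):
    # Final step: yield the record that would be persisted.
    yield {"url": item_url}
```

Calling `run_workflow([list_pages, list_items, store_item], "http://example.com")` returns one record per item. Each step only knows its own input and output, which is the property that lets a workflow engine schedule the steps independently.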
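## Queue support
pjq.py is a priority join queue, written to support the task workflow. Of its backends, the mongo queue is the more capable one: it supports adding, querying, and modifying tasks, so subtasks remain controllable while a task is running.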
> - workflow
> - priority
> - selfloop
> - subtask timeout
> - task timeout
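
For readers unfamiliar with the term, the behaviour of a priority join queue can be sketched with Python's standard `queue.PriorityQueue`, which already provides `task_done()` and `join()`. This is an illustration of the concept only, not pjq.py itself; `drain` is a hypothetical function, and the sketch assumes that lower numbers mean higher priority.

```python
import queue
import threading


def drain(tasks, worker_count=2):
    """Process (priority, name) tuples with worker threads; return finished names."""
    q = queue.PriorityQueue()
    for task in tasks:
        q.put(task)  # lowest priority number is handed out first
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            _, name = q.get()
            try:
                if name is None:  # sentinel: stop this worker
                    return
                with lock:
                    done.append(name)
            finally:
                q.task_done()  # lets q.join() know this entry is finished

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    q.join()  # block until every queued task has been marked done
    for _ in threads:
        q.put((float("inf"), None))  # one sentinel per worker
    for t in threads:
        t.join()
    return done
```

The `join()` call is what makes the queue usable for workflows: the caller can wait until all subtasks of a step have finished before moving on.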
## mongo queue
```
|-------put ---------- get insert insert
| / \ | |
| WAIT---[ready]--- RUNNING --------COMPLETED |
| | |
| | |
RETRY----------------------|----------------------ERROR
| |
| |
|__________________________________________________|
WAIT : 2
RUNNING : 3
RETRY : 4
ABANDONED: 5
COMPLETED: 1
ERROR : 0
ready - 10
```
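
The status codes in the legend can be written down as an enum. In the sketch below the numeric values are copied from the legend, while the transition table is only one possible reading of the diagram and should be treated as an assumption; `Status`, `TRANSITIONS`, and `advance` are hypothetical names, not part of webcrawl. ABANDONED is not drawn in the diagram, so it has no edges here, and the `ready - 10` entry is not modelled.

```python
from enum import IntEnum


class Status(IntEnum):
    # Codes copied from the legend above.
    ERROR = 0
    COMPLETED = 1
    WAIT = 2
    RUNNING = 3
    RETRY = 4
    ABANDONED = 5


# Assumed transitions, read off the diagram; not confirmed by the source code.
TRANSITIONS = {
    Status.WAIT: {Status.RUNNING},                     # get
    Status.RUNNING: {Status.COMPLETED, Status.ERROR},  # finished or failed
    Status.ERROR: {Status.RETRY},
    Status.RETRY: {Status.WAIT},                       # put back on the queue
    Status.COMPLETED: set(),
    Status.ABANDONED: set(),
}


def advance(current, target):
    """Return `target` if the move is allowed, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(
            "illegal transition: %s -> %s" % (current.name, target.name)
        )
    return target
```

Storing the integer code in each task document and checking moves through a function like `advance` keeps invalid state changes from being written to the queue.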
# Getting started
No examples yet.
## Installation
To install webcrawl, simply:
````bash
$ pip install webcrawl
✨🍰✨
````
## Discussion and support
Report bugs on the [GitHub issue tracker](https://github.com/listen-lavender/webcrawl/issues).