wecatch webcrawl

# webcrawl
[![Build Status](https://api.travis-ci.org/listen-lavender/webcrawl.svg?branch=master)](https://api.travis-ci.org/listen-lavender/webcrawl)

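webcrawl wraps the tools most commonly used for web scraping, including requests, lxml, and phantomjs, and implements a workflow so that developers who follow its conventions can concentrate on the scraping logic and quickly build stable projects. It also wraps a few other useful tools: rsa.py is a Python port of http://www.ohdave.com/rsa, which many websites use, and atlas.py handles some map-coordinate processing.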

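## Enhanced HTTP requests
handleRequest.py wraps the HTTP methods of the requests module that are commonly used for scraping, together with lxml parsing and phantomjs proxy support. It also handles several common kinds of content: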
> - html
> - xml
> - json
> - text
> - response object
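
The idea of dispatching on the expected content kind can be sketched as follows. This is a minimal illustration, not the API of handleRequest.py: the function name `parse_body` and its `fmt` parameter are hypothetical, and the standard-library `json` and `xml.etree.ElementTree` modules stand in for requests and lxml so the example runs without extra dependencies. The html and response-object cases are left out.

```python
import json
import xml.etree.ElementTree as ET


def parse_body(body, fmt):
    """Parse a response body according to the requested format (hypothetical helper)."""
    if fmt == "json":
        return json.loads(body)
    if fmt == "xml":
        return ET.fromstring(body)
    if fmt == "text":
        return body
    raise ValueError("unsupported format: %s" % fmt)


# The same entry point returns a dict, an element tree, or plain text.
data = parse_body('{"title": "webcrawl"}', "json")
root = parse_body("<items><item>a</item><item>b</item></items>", "xml")
items = [node.text for node in root.iter("item")]
```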

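## Simple task control
task.py (work.py) implements the task workflow. Execution is data-driven and asynchronous, similar to celery's composite types such as chain, group, and chord, but more powerful and easier to use than celery in this respect. It also enforces the conventions for writing scraping code, and it depends on the pjq queue.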
> - workflow
> - priority
> - selfloop
> - subtask timeout
> - task timeout
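
A data-driven chain, where every item produced by one step becomes the input of the next, might look roughly like this. It is a sketch only: `list_pages`, `fetch_detail`, and `run_chain` are hypothetical names, not part of task.py, and the loop runs synchronously for clarity, whereas the workflow described above is asynchronous.

```python
import queue


def list_pages(seed):
    # Hypothetical first step: expand a seed URL into page URLs.
    for n in range(1, 3):
        yield "%s/page/%d" % (seed, n)


def fetch_detail(page):
    # Hypothetical second step: turn each page URL into a record.
    yield {"page": page}


def run_chain(steps, seed):
    """Run steps in order; each item a step yields feeds the next step."""
    pending = queue.Queue()
    pending.put((0, seed))
    results = []
    while not pending.empty():
        index, item = pending.get()
        for output in steps[index](item):
            if index + 1 < len(steps):
                pending.put((index + 1, output))  # hand over to the next step
            else:
                results.append(output)  # last step: collect the output
    return results


records = run_chain([list_pages, fetch_detail], "http://example.com")
```

Because the steps are generators, a single input can fan out into many subtasks, which is the property that makes the flow data-driven.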

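## Queue support
pjq.py is a priority join queue that exists to support the workflow implementation. Of the queue backends, the mongo queue is the most capable: it supports adding, querying, and modifying tasks, which means subtasks remain controllable while they are executing.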
> - workflow
> - priority
> - selfloop
> - subtask timeout
> - task timeout
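
The general shape of a priority join queue can be illustrated with the standard library, whose `queue.PriorityQueue` already offers `task_done()` and `join()`. This is not pjq.py itself; the task names are made up, and the convention that a lower number means higher priority is that of the standard library, assumed here for illustration.

```python
import queue
import threading

tasks = queue.PriorityQueue()
done = []


def worker():
    while True:
        priority, name = tasks.get()
        if name is None:  # sentinel: stop the worker
            tasks.task_done()
            break
        done.append(name)
        tasks.task_done()  # lets join() know this task is finished


# Lower number is served first (standard-library convention).
for priority, name in [(2, "detail"), (1, "list"), (3, "image")]:
    tasks.put((priority, name))

thread = threading.Thread(target=worker)
thread.start()
tasks.join()            # block until every queued task is marked done
tasks.put((99, None))   # then ask the worker to exit
thread.join()
```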

## mongo queue
```
|-------put ---------- get insert insert
| / \ | |
| WAIT---[ready]--- RUNNING --------COMPLETED |
| | |
| | |
RETRY----------------------|----------------------ERROR
| |
| |
|__________________________________________________|

WAIT : 2
RUNNING : 3
RETRY : 4
ABANDONED: 5
COMPLETED: 1
ERROR : 0
ready - 10
```
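
The status codes listed above can be written as an enum. The values come from the list; everything else in this sketch is an assumption: `MemoryTaskQueue` is a hypothetical in-memory stand-in rather than the mongo queue, it models only the put, get, and finish transitions, it treats a lower number as higher priority, and it does not model RETRY, ABANDONED, or the `ready` value.

```python
from enum import IntEnum


class Status(IntEnum):
    ERROR = 0
    COMPLETED = 1
    WAIT = 2
    RUNNING = 3
    RETRY = 4
    ABANDONED = 5


class MemoryTaskQueue:
    """In-memory stand-in for a status-tracked queue (hypothetical)."""

    def __init__(self):
        self.tasks = []

    def put(self, payload, priority=0):
        # New tasks start in WAIT.
        task = {"payload": payload, "priority": priority, "status": Status.WAIT}
        self.tasks.append(task)
        return task

    def get(self):
        # Pick the waiting task with the smallest priority number and mark it RUNNING.
        waiting = [t for t in self.tasks if t["status"] == Status.WAIT]
        if not waiting:
            return None
        task = min(waiting, key=lambda t: t["priority"])
        task["status"] = Status.RUNNING
        return task

    def finish(self, task, ok=True):
        # A running task ends as COMPLETED or ERROR.
        task["status"] = Status.COMPLETED if ok else Status.ERROR


q = MemoryTaskQueue()
q.put("a", priority=2)
q.put("b", priority=1)
first = q.get()
q.finish(first)
```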

# Getting started

No examples are available yet.

## Installation

To install webcrawl, simply:

````bash

$ pip install webcrawl
✨🍰✨
````

## Discussion and support

Report bugs on the [GitHub issue tracker](https://github.com/listen-lavender/webcrawl/issues).
