A high-level Web Crawling and Web Scraping framework
boris-spider
An easy-to-use, highly available Python crawler framework with support for distributed and batch crawling.
Features
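- Automatically downloads pages and retries failed requests. The returned response can be parsed with XPath, CSS selectors, or regular expressions, and garbled Chinese text caused by encoding problems is handled automatically.
- Supports distributed spiders and batch spiders. A batch spider encapsulates the logic of periodic data collection: when a batch starts, it automatically dispatches tasks, scrapes the data, tracks the task processing speed, estimates whether the batch will time out, and raises an alert on timeout.
- Spiders can be stopped and restarted at any time without losing tasks or missing data.
- Supports registering multiple templates: parsing templates for many websites can be registered in a single spider, which manages them all (use case: to scrape 100 news sites, you only need to start one spider).
- Easy to get started with, yet able to handle complex crawling requirements.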
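The batch-timeout estimate mentioned in the features can be pictured with a little arithmetic. The following is a minimal sketch using only the standard library; `BatchProgress` and `estimate_overrun` are hypothetical names invented for illustration, and this is not how boris-spider itself is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatchProgress:
    """Hypothetical snapshot of a batch; not part of boris-spider's API."""
    total_tasks: int        # tasks dispatched at the start of the batch
    done_tasks: int         # tasks finished so far
    elapsed_seconds: float  # time since the batch started
    batch_seconds: float    # length of one batch period


def estimate_overrun(p: BatchProgress) -> Optional[float]:
    """Return the estimated seconds past the deadline (<= 0 means on time).

    Returns None when no task has finished yet, since no speed is known.
    """
    if p.done_tasks == 0 or p.elapsed_seconds <= 0:
        return None
    speed = p.done_tasks / p.elapsed_seconds      # tasks per second so far
    remaining = p.total_tasks - p.done_tasks
    projected_total = p.elapsed_seconds + remaining / speed
    return projected_total - p.batch_seconds


progress = BatchProgress(total_tasks=1000, done_tasks=200,
                         elapsed_seconds=600, batch_seconds=3600)
overrun = estimate_overrun(progress)
if overrun is not None and overrun > 0:
    print(f"batch expected to overrun by {overrun:.0f}s")
else:
    print("batch on track")
```

The sketch assumes a constant processing speed; a real scheduler would probably smooth the speed over a recent window before deciding to alert.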
Installation
From PyPI:
pip3 install boris-spider
From Git:
pip3 install git+https://github.com/Boris-code/boris-spider.git
Quick start
Supported commands:
> spider
Spider 0.0.4

Usage:
  spider <command> [options] [args]

Available commands:
  create        create spider、parser、item and so on
  shell         debug response

Use "spider <command> -h" to see more info about a command
Generate a spider template
spider create -p first_spider
The generated template looks like this:
import spider


class FirstSpider(spider.SingleSpider):
    def start_requests(self, *args, **kws):
        yield spider.Request("https://www.baidu.com")

    def parser(self, request, response):
        # print(response.text)
        print(response.xpath('//input[@type="submit"]/@value').extract_first())


if __name__ == "__main__":
    FirstSpider().start()
Run it directly; it prints the following:
Thread-2|2020-05-19 18:23:41,128|request.py|get_response|line:283|DEBUG|
-------------- FirstSpider.parser request for ----------------
url = https://www.baidu.com
method = GET
body = {'timeout': 22, 'stream': True, 'verify': False, 'headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'}}
百度一下
Thread-2|2020-05-19 18:23:41,727|parser_control.py|run|line:415|INFO| parser 等待任务 ...
FirstSpider|2020-05-19 18:23:44,735|single_spider.py|run|line:83|DEBUG| 无任务,爬虫结束
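The `百度一下` line in the output is the `value` attribute of the submit button selected by the XPath in `parser`. To show what that expression selects without running the framework, here is a minimal sketch using only the standard library. The `PAGE` markup is a hypothetical, well-formed stand-in rather than Baidu's real page, and `xml.etree.ElementTree` supports only a subset of XPath, so the attribute is read with `.get()` instead of `/@value`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a page; not Baidu's real markup.
PAGE = """
<html>
  <body>
    <form>
      <input type="text" name="wd" />
      <input type="submit" value="百度一下" />
    </form>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# Select the first <input> whose type attribute is "submit" ...
button = root.find(".//input[@type='submit']")

# ... then read its value attribute (the equivalent of /@value).
value = button.get("value") if button is not None else None
print(value)
```

Real-world HTML is rarely well-formed XML, which is one reason to rely on the parsing support built into the framework's response object instead.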
Learn more
To be continued...