lightsmile's personal spider for crawling data
Project description
lightSpider
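A mini general-purpose crawler framework, written by lightsmile for personal use, for crawling publicly available corpus data from the web.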
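Notes
- This project is only a simple experiment of mine, and its functionality is far from complete.
- Existing features:
  - Proxy pool support, which reduces the losses caused by an IP being banned
  - Multiprocessing, which speeds up crawling
  - Resumable crawling: even if the program dies for internal or external reasons, rerunning the task script continues the crawl where it stopped
  - A progress bar that shows overall crawl progress and crawl speed in real time
- This project does not have:
  - CAPTCHA-based login
  - Assorted customization options
  - And so on.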
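The resumable-crawl feature listed above can be pictured with a small sketch. This is not lightSpider's actual implementation, whose internals the source does not describe. The checkpoint file, `load_done`, `crawl` and the `fetch` callback are all hypothetical names, and the idea shown is simply to record each finished task on disk and skip it on the next run.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of tasks already recorded in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line) for line in f if line.strip()}


def crawl(tasks, fetch, checkpoint_path):
    """Run fetch(task) for every task not yet recorded, logging each success."""
    done = load_done(checkpoint_path)
    processed = []
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for task in tasks:
            if task in done:
                continue  # finished in an earlier run, skip it
            fetch(task)  # may raise; the task then stays unrecorded
            f.write(json.dumps(task) + "\n")
            f.flush()  # push the record to the file before moving on
            processed.append(task)
    return processed


def flaky_fetch(task):
    """Hypothetical fetcher that fails on task 3 to simulate a crash."""
    if task == 3:
        raise RuntimeError("simulated crash")


checkpoint = os.path.join(tempfile.mkdtemp(), "done.jsonl")
try:
    crawl([1, 2, 3, 4], flaky_fetch, checkpoint)
except RuntimeError:
    pass  # the first run dies partway through

# A second run with a working fetcher only processes what is left.
remaining = crawl([1, 2, 3, 4], lambda task: None, checkpoint)
print(remaining)
```

Tasks 1 and 2 are recorded before the simulated crash, so the second run starts again from task 3.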
Installation
pip install lightSpider
If you are in mainland China, installing from a domestic mirror is recommended, for example:
pip install -i https://pypi.douban.com/simple/ lightSpider
Usage
Step 1: Start the proxy pool service
For details, see Python3WebSpider/ProxyPool: Proxy Pool
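As a rough way to check that the service from step 1 is running, a proxy can be requested from it by hand. The snippet below assumes the pool listens at `http://localhost:5555/random` and answers with a plain `host:port` string; check the ProxyPool README for the address in your setup. `get_random_proxy` and `to_proxies` are hypothetical helper names, not part of lightSpider, which presumably talks to the pool on its own.

```python
from urllib.request import urlopen

# Assumed address of a locally running ProxyPool service.
PROXY_POOL_URL = "http://localhost:5555/random"


def get_random_proxy(url=PROXY_POOL_URL, timeout=5):
    """Ask the proxy pool for one proxy; returns a 'host:port' string."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def to_proxies(proxy):
    """Build a scheme-to-proxy-URL mapping from a 'host:port' string."""
    return {"http": "http://" + proxy, "https": "http://" + proxy}
```

With the service running, `to_proxies(get_random_proxy())` gives a mapping in the shape that HTTP clients such as requests accept for their `proxies` argument.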
Step 2: Import the required libraries
import re
from lxml import etree
from lightspider import Spider
from lightspider import light
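Step 3: Write the page-parsing function
For example:
# Page-parsing function: receives the page HTML and returns the extracted data
@light
def handler(html):
    html = etree.HTML(html)
    info = html.xpath('//div[@class="col-md-8"]')[0]
    # Take the text of every <b> element except the last, dropping "(digits)" markers
    words = [re.sub(r'\(\d+\)', '', item.xpath('string(.)')) for item in info.xpath('./b')[:-1]]
    mean = info.xpath('./a/text()')[0]
    return {
        'mean': mean,
        'words': words
    }, None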
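To make the regular expression in the handler concrete, here is a small standalone illustration. The sample strings are made up for the example and are not taken from the site. The pattern `\(\d+\)` removes a parenthesized number such as `(12)` from each entry and leaves the rest of the text untouched.

```python
import re

# Hypothetical text values, standing in for item.xpath('string(.)') results.
samples = ["happy(12)", "glad(3)", "joyful"]

# Same substitution as in the handler: drop every "(digits)" group.
words = [re.sub(r'\(\d+\)', '', s) for s in samples]
print(words)
```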
Step 4: Write the script that builds the task list
For example:
tasks = []
for i in range(1, 30):
    tasks.append(i)
Step 5: Create the Spider object
For example:
base_url = 'https://www.cilin.org/jyc/b_{}.html'
spider = Spider(base_url=base_url, style='json', save_path=r'D:\Data\NLP\corpus\test')
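The `{}` placeholder in `base_url` suggests that each task is substituted into the URL template. The sketch below assumes this is done with `str.format`, which the source does not confirm. It only shows which pages the task list from step 4 would then cover.

```python
base_url = 'https://www.cilin.org/jyc/b_{}.html'
tasks = list(range(1, 30))  # 1 through 29; range() excludes the stop value

# Assumed expansion: one URL per task.
urls = [base_url.format(task) for task in tasks]
print(len(urls))
print(urls[0])
print(urls[-1])
```

Under that assumption, the crawl covers 29 pages, from `b_1.html` to `b_29.html`.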
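Step 6: Call the Spider object's run method
For example:
if __name__ == '__main__':
    spider.run(tasks, handler)
Note: the if __name__ == '__main__' guard must not be omitted! The crawler runs with multiple processes, and on platforms where Python starts child processes by re-importing the main module (Windows, for example), unguarded top-level code would run again in every child.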
Complete example
from lightspider import Spider
from lightspider import light
from lxml import etree
import re

# Page-parsing function: receives the page HTML and returns the extracted data
@light
def handler(html):
    html = etree.HTML(html)
    info = html.xpath('//div[@class="col-md-8"]')[0]
    # Take the text of every <b> element except the last, dropping "(digits)" markers
    words = [re.sub(r'\(\d+\)', '', item.xpath('string(.)')) for item in info.xpath('./b')[:-1]]
    mean = info.xpath('./a/text()')[0]
    return {
        'mean': mean,
        'words': words
    }, None

# Task list: page numbers 1 through 29
tasks = []
for i in range(1, 30):
    tasks.append(i)

base_url = 'https://www.cilin.org/jyc/b_{}.html'
spider = Spider(base_url=base_url, style='json', save_path=r'D:\Data\NLP\corpus\test')

if __name__ == '__main__':
    spider.run(tasks, handler)
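Results
[Figure: screenshot of a crawl run; image not available]
[Figure: screenshot of resuming an interrupted crawl; image not available]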