
lightsmile's personal spider for crawling data


lightSpider

A mini general-purpose crawler framework that lightsmile uses personally to crawl publicly available corpus data from the web.

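Disclaimer

  1. This project is only a simple experiment of mine, and its functionality is far from complete.
  2. Existing features:
    • Proxy pool support, which reduces the cost of having an IP blocked
    • Multiprocessing, which speeds up crawling
    • Resumable crawling: even if the program dies for internal or external reasons, rerunning the task script continues the crawl where it stopped
    • A progress bar that shows overall crawl progress and crawl speed in real time, for a better user experience
  3. This project does not provide:
    • Login with CAPTCHA handling
    • Custom extension features
    • And so on.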
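
The resumable-crawling feature listed above can be pictured with a small sketch. This is not lightSpider's actual implementation, which the description does not show; it is one common approach, assuming finished tasks are appended to a checkpoint file and skipped on the next run. The names `load_done`, `crawl`, `fake_fetch`, and `done.jsonl` are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_done(checkpoint: Path) -> set:
    """Return the set of task ids already recorded in the checkpoint file."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open(encoding="utf-8") as f:
        return {json.loads(line)["task"] for line in f if line.strip()}


def crawl(tasks, fetch, checkpoint: Path) -> int:
    """Process only tasks missing from the checkpoint; return how many ran."""
    done = load_done(checkpoint)
    ran = 0
    with checkpoint.open("a", encoding="utf-8") as f:
        for task in tasks:
            if task in done:
                continue  # finished in an earlier run, skip it
            record = {"task": task, "result": fetch(task)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # push each record out of Python's buffer right away
            ran += 1
    return ran


def fake_fetch(task):
    # Stand-in for a real download; hypothetical, for illustration only.
    return f"page-{task}"


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "done.jsonl"
    first = crawl([1, 2, 3], fake_fetch, ckpt)     # simulates an interrupted run
    second = crawl(range(1, 6), fake_fetch, ckpt)  # rerun with the full task list
    finished = load_done(ckpt)
```

The second call only processes tasks 4 and 5, because tasks 1 to 3 are already in the checkpoint file.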

Installation

pip install lightSpider

Users in mainland China may want to install from a domestic mirror, for example:

pip install -i https://pypi.douban.com/simple/ lightSpider

Usage

Step 1: Start the proxy pool service

For details, see Python3WebSpider/ProxyPool: Proxy Pool
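
Before starting a crawl, it can help to confirm that the proxy pool is reachable. The sketch below is an assumption-laden illustration, not part of lightSpider: it assumes ProxyPool is running locally and serving a random `host:port` string at `http://localhost:5555/random` (the address given in that project's README; adjust it if your deployment differs). The helper names `to_proxies` and `get_random_proxy` are hypothetical.

```python
from urllib.request import urlopen

# Assumption: ProxyPool is running locally with its default API address.
PROXY_POOL_URL = "http://localhost:5555/random"


def to_proxies(address: str) -> dict:
    """Turn a 'host:port' string into a scheme-to-proxy-URL mapping."""
    address = address.strip()
    return {"http": f"http://{address}", "https": f"http://{address}"}


def get_random_proxy(url: str = PROXY_POOL_URL, timeout: float = 5.0) -> dict:
    """Ask the pool for one proxy; needs the service from step 1 running."""
    with urlopen(url, timeout=timeout) as resp:
        return to_proxies(resp.read().decode("utf-8"))


# Offline demonstration of the conversion step only (no network access).
example = to_proxies("127.0.0.1:8080\n")
```

If `get_random_proxy()` raises a connection error, the proxy pool service is most likely not running yet.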

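Step 2: Import the required dependencies

from lightspider import Spider
from lightspider import light

from lxml import etree  # used by the parsing function in step 3
import re               # used by the parsing function in step 3

Step 3: Write the page-parsing function

For example:

# Page-parsing function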
@light
def handler(html):
    html = etree.HTML(html)
    info = html.xpath('//div[@class="col-md-8"]')[0]
    words = [re.sub(r'\(\d+\)', '', item.xpath('string(.)')) for item in info.xpath('./b')[:-1]]
    mean = info.xpath('./a/text()')[0]
    return {
        'mean': mean,
        'words': words
    }, None
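
To make the list comprehension in the handler easier to follow, here is a minimal sketch of its two steps run on plain strings instead of parsed HTML. The sample strings are hypothetical and only stand in for the text of each `<b>` element; the slice and the regular expression are the ones used in the handler.

```python
import re

# Hypothetical sample strings standing in for the text of each <b> element.
raw_items = ["apple(3)", "pear(12)", "plum", "trailing"]

# Same two steps as in handler: drop the last element with [:-1],
# then strip every "(digits)" group from the remaining strings.
words = [re.sub(r'\(\d+\)', '', item) for item in raw_items[:-1]]
```

Here `words` ends up as `['apple', 'pear', 'plum']`: the last item is discarded by the slice, and the parenthesized numbers are removed by `re.sub`.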

Step 4: Write the script that builds the task list

For example:

tasks = []
for i in range(1, 30):
    tasks.append(i)

Step 5: Create the Spider object

For example:

base_url = 'https://www.cilin.org/jyc/b_{}.html'
spider = Spider(base_url=base_url, style='json', save_path=r'D:\Data\NLP\corpus\test')
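
The `{}` placeholder in `base_url` suggests that each task is substituted into the URL. The description does not show how Spider does this internally, so the sketch below is only an assumption: it uses `str.format` to show which pages the task list from step 4 would correspond to.

```python
base_url = 'https://www.cilin.org/jyc/b_{}.html'
tasks = list(range(1, 30))  # 1 through 29, as in step 4

# Assumption: one URL per task, produced with str.format.
urls = [base_url.format(task) for task in tasks]
```

Under that assumption the crawl covers 29 pages, from `b_1.html` to `b_29.html`, because `range(1, 30)` excludes 30.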

Step 6: Call the Spider object's run method

For example:

if __name__ == '__main__':
    spider.run(tasks, handler)

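Note: the if __name__ == '__main__': guard must not be omitted! The spider uses multiprocessing, and on platforms that start worker processes by re-importing the main module (Windows, for example), unguarded top-level code would run again in every worker.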

Complete example

from lightspider import Spider
from lightspider import light

from lxml import etree
import re

# Page-parsing function
@light
def handler(html):
    html = etree.HTML(html)
    info = html.xpath('//div[@class="col-md-8"]')[0]
    words = [re.sub(r'\(\d+\)', '', item.xpath('string(.)')) for item in info.xpath('./b')[:-1]]
    mean = info.xpath('./a/text()')[0]
    return {
        'mean': mean,
        'words': words
    }, None


tasks = []
for i in range(1, 30):
    tasks.append(i)

base_url = 'https://www.cilin.org/jyc/b_{}.html'
spider = Spider(base_url=base_url, style='json', save_path=r'D:\Data\NLP\corpus\test')


if __name__ == '__main__':
    spider.run(tasks, handler)

Results

See figure: Demo

Resumable crawling in action, see figures: Resumable crawl 1, Resumable crawl 2

References

  1. Python3WebSpider/ProxyPool: Proxy Pool
  2. How to use tqdm with multiprocessing in Python? - VoidCC
  3. Writing a multiprocess crawler - Zhihu

