A Simple Distributed Web Crawler
Project description
simplified-scrapy
simplified scrapy, a simple web crawler
Requirements
- Python 2.7, 3.0+
- Works on Linux, Windows, Mac OS X, BSD
Running
From the project root directory, run:
python start.py
Demo
A demo spider is included in the project in the spiders folder, in the file demoSpider.py. Custom spider classes must inherit from the Spider class:
from core.spider import Spider
class DemoSpider(Spider):
A spider needs a name, the entry URL(s), and the names of the models used to extract data. Below is an example that collects health news from Sina. The model auto_main_2 extracts links that share the same second-level domain, and auto_obj automatically extracts article data from the page, including the title, body, and time.
name = 'demo-spider'
start_urls = ['http://health.sina.com.cn/']
models = ['auto_main_2','auto_obj']
The model files are stored in the models folder. If you need to define custom models, you can use the model tool; see the download link and the usage instructions.
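Putting the snippets above together, a complete spiders/demoSpider.py would look like the sketch below. This assumes the project's `core.spider.Spider` base class is importable from the project root; only the attributes shown in this page are used.

```python
# spiders/demoSpider.py - minimal spider sketch assembled from the
# snippets above; assumes the project's core.spider.Spider base class.
from core.spider import Spider

class DemoSpider(Spider):
    # Unique name identifying this spider
    name = 'demo-spider'
    # Entry URL(s) the crawl starts from
    start_urls = ['http://health.sina.com.cn/']
    # Extraction models: auto_main_2 follows links on the same
    # second-level domain; auto_obj extracts title, body, and time
    models = ['auto_main_2', 'auto_obj']
```

Save the file in the spiders folder and start the crawl with `python start.py` from the project root.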
Installing with pip
pip install simplified-scrapy
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size | File type | Python version
---|---|---
simplified_scrapy-0.0.12-py2.py3-none-any.whl (25.3 kB) | Wheel | py2.py3
Hashes for simplified_scrapy-0.0.12-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | a01aed9e6241eb6000e2e32f2ccd8ec545e9c003cd3695dfb7306a9afc528c89
MD5 | 724aacb2630600ef56501b64587eba34
BLAKE2-256 | 68408e52dfd77f8311e1bf3d33f0e504d89361127a1c940518409c77e6d9f197