A Simple Distributed Web Crawler
Project description
simplified-scrapy
Simplified Scrapy, a simple web crawler
Requirements
- Python 2.7, 3.0+
- Works on Linux, Windows, Mac OSX, BSD
Running
Enter the project root directory and run the following command:
python start.py
Demo
An example spider for the project is in the spiders folder, in the file demoSpider.py. A custom spider class must inherit from the Spider class:
from core.spider import Spider
class DemoSpider(Spider):
A spider needs a name, the entry link URLs, and the names of the models used for data extraction. Below is an example that collects news data from Sina Health. The model auto_main_2 extracts links that share the same second-level domain, and auto_obj automatically extracts the news data from each page, including title, body, and publication time.
name = 'demo-spider'
start_urls = ['http://health.sina.com.cn/']
models = ['auto_main_2','auto_obj']
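The "same second-level domain" link filter that auto_main_2 performs can be illustrated in plain Python. This is a hypothetical re-implementation for explanation only, not the library's actual code; a naive suffix table stands in for the full public suffix list:

```python
from urllib.parse import urlparse

# Incomplete set of two-part public suffixes; real code should consult
# the full public suffix list instead of this stand-in table.
TWO_PART_SUFFIXES = {'com.cn', 'net.cn', 'org.cn', 'co.uk'}

def registered_domain(url):
    """Return the second-level (registered) domain of a URL,
    e.g. 'http://health.sina.com.cn/' -> 'sina.com.cn'."""
    host = (urlparse(url).hostname or '').lower()
    labels = host.split('.')
    # Suffixes like 'com.cn' take up two labels, so keep three in total.
    if len(labels) >= 3 and '.'.join(labels[-2:]) in TWO_PART_SUFFIXES:
        return '.'.join(labels[-3:])
    return '.'.join(labels[-2:]) if len(labels) >= 2 else host

def same_site(link, start_url='http://health.sina.com.cn/'):
    """Keep only links whose registered domain matches the start URL's."""
    return registered_domain(link) == registered_domain(start_url)
```

With the start URL above, a link like `http://news.sina.com.cn/...` would be followed, while `http://example.com/` would be discarded.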
The model files are in the models folder. If you need to define a custom model, you can use the model tool; see the download address. Usage instructions are here.
Installing with pip
pip install simplified-scrapy
Download files
Download the file for your platform.
| Filename, size | File type | Python version | Upload date | Hashes |
|---|---|---|---|---|
| simplified_scrapy-0.1.22-py2.py3-none-any.whl (26.7 kB) | Wheel | py2.py3 | | View hashes |
Hashes for simplified_scrapy-0.1.22-py2.py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | cea77ebec65764817a079f9a1576a7645bf9f37380229b9efcb3279fcf1a98d1 |
| MD5 | 19bad1dc1e9df8114ea8d37b9bc4335c |
| BLAKE2-256 | 1f7601bc37cf93db90be7175c88db8f3f2a4f0211aac20b566eeb60b94b75a00 |
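After downloading the wheel, you can verify its integrity against the SHA256 digest listed above. A minimal sketch using only the standard library (the filename and expected digest are the ones from this page):

```python
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Compute the SHA256 hex digest of a file, reading it in chunks
    so that large downloads are not loaded into memory at once."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Compare the local file against the digest published above:
expected = 'cea77ebec65764817a079f9a1576a7645bf9f37380229b9efcb3279fcf1a98d1'
# assert sha256_of_file('simplified_scrapy-0.1.22-py2.py3-none-any.whl') == expected
```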