
A Simple Distributed Web Crawler

Project description

simplified-scrapy

simplified-scrapy, a simple web crawler

Requirements

  • Python 2.7, 3.0+
  • Works on Linux, Windows, Mac OSX, BSD

Run

From the project root directory, run:

python start.py

Demo

An example spider is included in the spiders folder, in the file demoSpider.py. A custom spider class must inherit from the Spider class:

from core.spider import Spider 
class DemoSpider(Spider):

A spider needs a name, a set of entry URLs, and the names of the models used for data extraction. Below is an example that collects article-type data: auto_main extracts links from the same domain, and auto_obj automatically extracts article data from each page, including the title, body, and publication time. You can also override the extraction methods to implement custom extraction.

name = 'demo-spider'
start_urls = ['http://www.scrapyd.cn/']
models = ['auto_main','auto_obj']
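Putting the pieces above together, here is a minimal, self-contained sketch of the subclass pattern. The stub `Spider` base class below is an assumption added for illustration only, so the snippet runs on its own; in simplified-scrapy the real base class comes from `from core.spider import Spider`.

```python
# Stand-in base class for illustration; the real one is
# `from core.spider import Spider` in simplified-scrapy.
class Spider:
    name = None          # unique spider name
    start_urls = []      # entry links to begin crawling from
    models = []          # extraction models applied to fetched pages

class DemoSpider(Spider):
    name = 'demo-spider'
    start_urls = ['http://www.scrapyd.cn/']
    # auto_main: follow links on the same domain;
    # auto_obj: automatically extract article title, body, and time.
    models = ['auto_main', 'auto_obj']
```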

The model files are in the models folder. If you need a custom model, you can build one with the model tool; see the download link. Usage instructions are provided there.

Installation with pip

pip install simplified-scrapy

Examples



Download files

Files for simplified-scrapy, version 0.2.35:

  • simplified_scrapy-0.2.35-py2.py3-none-any.whl (32.0 kB), Wheel, py2.py3
