Scrapy util
A collection of extensions built on top of Scrapy.
pypi: https://pypi.org/project/scrapy-util
github: https://github.com/mouday/scrapy-util
pip install six scrapy-util
Enable stats collection
This feature is designed to work together with spider-admin-pro.
# Set the URL that run stats are reported to; the extension POSTs JSON data to spider-admin-pro
# Note: this configuration is only an example; set it to the real address of your spider-admin-pro
# Suppose our spider-admin-pro is running at http://127.0.0.1:5001
STATS_COLLECTION_URL = "http://127.0.0.1:5001/api/statsCollection/addItem"
# Enable the stats collection extensions
EXTENSIONS = {
    # ===========================================
    # Optional: if the collected timestamps are in UTC, use this
    # extension to collect local time instead
    'scrapy.extensions.corestats.CoreStats': None,
    'scrapy_util.extensions.LocaltimeCoreStats': 0,

    # ===========================================
    # Optional: print how long the crawl took
    'scrapy_util.extensions.ShowDurationExtension': 100,

    # Enable the stats collection extension
    'scrapy_util.extensions.StatsCollectorExtension': 100
}
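For a quick local check without spider-admin-pro running, a throwaway endpoint can print whatever the extension POSTs. This is only a sketch, not part of scrapy-util or spider-admin-pro: Flask, the file name, and the response body are assumptions; only the route path mirrors STATS_COLLECTION_URL above.

# stats_receiver.py -- hypothetical local stand-in for spider-admin-pro,
# useful only to inspect the JSON the extension submits.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/api/statsCollection/addItem", methods=["POST"])
def add_item():
    # Print the collected stats payload as-is; the exact schema is
    # defined by scrapy-util / spider-admin-pro, not by this sketch.
    print(request.get_json())
    return jsonify({"code": 0})

if __name__ == "__main__":
    app.run(port=5001)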
Script spider
Runs as a plain script; the Request never goes out over the network.
# -*- coding: utf-8 -*-
from scrapy import cmdline

from scrapy_util.spiders import ScriptSpider


class BaiduScriptSpider(ScriptSpider):
    name = 'baidu_script'

    def execute(self):
        print("hi")


if __name__ == '__main__':
    cmdline.execute('scrapy crawl baidu_script'.split())
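Everything inside execute() is plain Python, so the body can do real work instead of printing. A hypothetical variant (the class name and the disk check are illustrative, not part of scrapy-util):

# -*- coding: utf-8 -*-
import shutil

from scrapy_util.spiders import ScriptSpider


class CleanupScriptSpider(ScriptSpider):
    name = 'cleanup_script'

    def execute(self):
        # Report free disk space; runs once, no network requests are made.
        usage = shutil.disk_usage('/')
        print('free bytes:', usage.free)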
List spider
ListNextRequestSpider is implemented on top of ListSpider; if you need a custom page cache, you can override its methods.
# -*- coding: utf-8 -*-
from scrapy import cmdline

from scrapy_util.spiders import ListNextRequestSpider


class BaiduListSpider(ListNextRequestSpider):
    name = 'list_spider'
    page_key = "list_spider"

    # Required method: build the URL for a given page number
    def get_url(self, page):
        return 'http://127.0.0.1:5000/list?page=' + str(page)

    def parse(self, response):
        print(response.text)

        # Request the next page; start_requests also calls this method
        # once automatically for the first page.
        # If you do not want to keep paging, simply don't call it.
        yield self.next_request(response)


if __name__ == '__main__':
    cmdline.execute('scrapy crawl list_spider'.split())
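The example above crawls a local test service at http://127.0.0.1:5000/list, which is not part of scrapy-util. A minimal stand-in for it (Flask, the file name, and the response shape are assumptions) could look like:

# list_service.py -- hypothetical test service for the list spider above.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/list")
def list_page():
    # Echo the requested page number back with some dummy rows.
    page = int(request.args.get("page", 0))
    return jsonify({"page": page, "items": ["item-%d" % i for i in range(3)]})

if __name__ == "__main__":
    app.run(port=5000)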
MongoDB pipeline
Usage example
settings.py
# 1. Set the MongoDB connection URI
MONGO_URI = "mongodb://localhost:27017/"

# 2. Enable the MongoPipeline
ITEM_PIPELINES = {
    'scrapy_util.pipelines.MongoPipeline': 100,
}
# -*- coding: utf-8 -*-
import scrapy
from scrapy import cmdline

from scrapy_util.items import MongoItem


class BaiduMongoSpider(scrapy.Spider):
    name = 'baidu_mongo'
    start_urls = ['http://baidu.com/']

    # 1. Set the target database and collection
    custom_settings = {
        'MONGO_DATABASE': 'data',
        'MONGO_TABLE': 'table'
    }

    def parse(self, response):
        title = response.css('title::text').extract_first()

        item = {
            'data': {
                'title': title
            }
        }

        # 2. Return a MongoItem
        return MongoItem(item)


if __name__ == '__main__':
    cmdline.execute('scrapy crawl baidu_mongo'.split())
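To verify the write, you can query the configured database and collection directly with pymongo. This sketch assumes MongoPipeline stores items in the MONGO_DATABASE / MONGO_TABLE configured above; the file name is illustrative.

# check_mongo.py -- quick verification query, independent of scrapy-util.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")

# 'data' and 'table' match MONGO_DATABASE / MONGO_TABLE in the spider above.
doc = client["data"]["table"].find_one()
print(doc)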
If you need to fine-tune the behavior, you can subclass MongoPipeline and override its methods, as in the sketch below.
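A minimal sketch, assuming MongoPipeline implements the standard Scrapy pipeline interface (process_item) and that items carry the dict-like 'data' payload used above; the subclass name and the added field are hypothetical.

from scrapy_util.pipelines import MongoPipeline


class TaggedMongoPipeline(MongoPipeline):
    """Hypothetical subclass that tags each item before it is stored."""

    def process_item(self, item, spider):
        # Assumes the item exposes a dict-like 'data' payload,
        # as in the MongoItem example above.
        item['data']['spider_name'] = spider.name
        return super().process_item(item, spider)

ITEM_PIPELINES in settings.py would then reference TaggedMongoPipeline instead of the stock MongoPipeline.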
Utility functions
Run a spider programmatically
import scrapy

from scrapy_util import spider_util


class BaiduSpider(scrapy.Spider):
    name = 'baidu_spider'


if __name__ == '__main__':
    # Equivalent to: cmdline.execute('scrapy crawl baidu_spider'.split())
    spider_util.run_spider(BaiduSpider)