A spider admin tool built on the Scrapyd API and APScheduler
SpiderAdmin
Features
- Wraps the Scrapyd API in a web UI, supporting viewing and deleting Scrapy spider projects (a Scrapyd API sketch follows this list)
- Editing and adding projects are not implemented; for deployment, the recommended command is:
  $ scrapyd-deploy -a
- Scheduled spider runs, supporting APScheduler's three trigger types plus a random mode, four modes in total (an APScheduler sketch follows this list):
  - date: run once
  - cron: periodic run
  - interval: fixed-interval run
  - random: randomly timed run
- Simple access control based on Flask-BasicAuth (a sketch follows the Getting started commands)
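For reference, the "view" and "delete" operations correspond to two documented Scrapyd JSON endpoints. A minimal sketch of calling them directly with requests; the host, port, and the project name myproject are placeholder assumptions:

import requests

SCRAPYD = "http://127.0.0.1:6800"  # assumed local Scrapyd instance

# view: list every deployed project
projects = requests.get(SCRAPYD + "/listprojects.json").json()["projects"]
print(projects)

# delete: remove a project and all of its uploaded versions
resp = requests.post(SCRAPYD + "/delproject.json", data={"project": "myproject"})
print(resp.json())  # {"status": "ok", ...} on success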
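The four scheduling modes map onto APScheduler triggers. A minimal sketch, assuming APScheduler 3.5+; run_spider is a placeholder, and the jitter-based random mode is only an approximation, not necessarily how SpiderAdmin implements it:

from datetime import datetime, timedelta
from apscheduler.schedulers.blocking import BlockingScheduler


def run_spider():
    # placeholder: trigger the spider here, e.g. via Scrapyd's schedule.json
    print("run spider")


scheduler = BlockingScheduler()

# date: run once at a fixed time
scheduler.add_job(run_spider, "date", run_date=datetime.now() + timedelta(minutes=1))

# cron: crontab-style periodic run (daily at 03:00 here)
scheduler.add_job(run_spider, "cron", hour=3, minute=0)

# interval: run every fixed interval
scheduler.add_job(run_spider, "interval", minutes=30)

# random: APScheduler has no random trigger; an interval job with jitter
# (a random +/- offset in seconds per run) is one way to approximate it
scheduler.add_job(run_spider, "interval", minutes=30, jitter=300)

scheduler.start()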
Getting started
$ pip3 install spideradmin
$ spideradmin init    # initialize; configuration is optional, the defaults also work
$ spideradmin         # start the service
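The access control mentioned above comes from Flask-BasicAuth. A minimal sketch of how that library protects a Flask app; the app and credentials here are placeholders, not SpiderAdmin's actual configuration:

from flask import Flask
from flask_basicauth import BasicAuth

app = Flask(__name__)

# placeholder credentials; SpiderAdmin reads its own configuration instead
app.config["BASIC_AUTH_USERNAME"] = "admin"
app.config["BASIC_AUTH_PASSWORD"] = "admin"
app.config["BASIC_AUTH_FORCE"] = True  # require auth on every route

basic_auth = BasicAuth(app)


@app.route("/")
def index():
    return "authenticated"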
Screenshots
TODO
- Add a login page for access control
- Add more variety to the scheduling options
- Add randomly timed scheduled runs
When deploying Scrapyd, mind the version compatibility:
- Scrapyd==1.2.0
- Scrapy==1.6.0
- Twisted==18.9.0
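One way to install that exact combination in a single step, with the pins taken from the list above:
$ pip3 install Scrapy==1.6.0 Twisted==18.9.0 Scrapyd==1.2.0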
Enabling the run-result statistics extension
Install the dependency:
pip install PureMySQL
Create the table:
CREATE TABLE `log_spider` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`spider_name` varchar(50) DEFAULT NULL,
`item_count` int(11) DEFAULT NULL,
`duration` int(11) DEFAULT NULL,
`log_error` int(11) DEFAULT NULL,
`create_time` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='spider run statistics';
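Once the extension starts writing rows, per-spider statistics can be pulled with an ordinary aggregate query, for example:

-- runs, average duration, and total errors per spider
SELECT spider_name,
       COUNT(*)       AS runs,
       AVG(duration)  AS avg_seconds,
       SUM(log_error) AS total_errors
FROM log_spider
GROUP BY spider_name;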
1. Add a stats-collection extension file, item_count_extension.py, to the Scrapy project:
# -*- coding: utf-8 -*-
import logging
from datetime import datetime

from scrapy import signals
from puremysql import PureMysql


class SpiderItemCountExtension(object):
    # database connection URL, e.g. mysql://root:123456@127.0.0.1:3306/mydata
    ITEM_LOG_DATABASE_URL = None

    # table that receives one row per finished spider run
    ITEM_LOG_TABLE = "log_spider"

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        stats = spider.crawler.stats.get_stats()

        scraped_count = stats.get("item_scraped_count", 0)
        dropped_count = stats.get("item_dropped_count", 0)
        log_error = stats.get("log_count/ERROR", 0)

        start_time = stats.get("start_time")
        finish_time = stats.get("finish_time")
        # total_seconds() also counts whole days; .seconds alone would drop them
        duration = int((finish_time - start_time).total_seconds())

        count = scraped_count + dropped_count

        logging.debug("*" * 50)
        logging.debug("* {}".format(spider.name))
        logging.debug("* item count: {}".format(count))
        logging.debug("*" * 50)

        item = {
            "spider_name": spider.name,
            "item_count": count,
            "duration": duration,
            "log_error": log_error,
            "create_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        }

        mysql = PureMysql(db_url=self.ITEM_LOG_DATABASE_URL)
        table = mysql.table(self.ITEM_LOG_TABLE)
        table.insert(item)
        mysql.close()
2. Enable the extension in the Scrapy project's settings.py:
EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
"item_count_extension.SpiderItemCountExtension": 100
}
3. Configure the same db_url in SpiderAdmin's default_config.py:
ITEM_LOG_DATABASE_URL = "mysql://root:123456@127.0.0.1:3306/mydata"
ITEM_LOG_TABLE = "log_spider"
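To check that both sides agree, a test row can be inserted from a Python shell using only the PureMysql calls the extension above already relies on; the values below are throwaway test data:

from datetime import datetime
from puremysql import PureMysql

# same URL and table as configured above
mysql = PureMysql(db_url="mysql://root:123456@127.0.0.1:3306/mydata")
table = mysql.table("log_spider")
table.insert({
    "spider_name": "smoke_test",
    "item_count": 0,
    "duration": 0,
    "log_error": 0,
    "create_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
})
mysql.close()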
Changelog

Version | Date | Description
---|---|---
0.0.17 | 2019-07-02 | File cleanup, improved random scheduling, added scheduling-history statistics and visualization
0.0.20 | 2019-10-08 | Added run-result statistics