A spider admin tool based on the Scrapyd API and APScheduler
Project description
SpiderAdmin
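Features
- Provides a visual wrapper around the Scrapyd API for viewing and deleting Scrapy spider projects
- Adding and modifying projects is not implemented; for deployment, the recommended command is
  $ scrapyd-deploy -a
- Schedules spiders as timed jobs, supporting the three APScheduler trigger types plus a random delay, four modes in total
  - Single run: date
  - Periodic run: cron
  - Interval run: interval
  - Random run: random
- Uses Flask-BasicAuth for simple access control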
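The four scheduling modes can be pictured as a mapping onto APScheduler's `date`, `cron`, and `interval` trigger aliases. The sketch below is only an illustration under stated assumptions: `build_trigger`, `min_seconds`, `max_seconds`, and `run_spider` are hypothetical names, and the `random` mode is modeled here as a one-off `date` run after a random delay, which may differ from how SpiderAdmin implements it.

```python
import random
from datetime import datetime, timedelta


def build_trigger(mode, now=None, **options):
    """Return (trigger_alias, trigger_args) for a given scheduling mode.

    The result is shaped for APScheduler's
    scheduler.add_job(func, trigger_alias, **trigger_args).
    """
    now = now or datetime.now()
    if mode == "date":
        # Single run at a fixed moment.
        return "date", {"run_date": options["run_date"]}
    if mode == "cron":
        # Periodic run on calendar fields, e.g. hour=3, minute=0.
        return "cron", dict(options)
    if mode == "interval":
        # Repeated run every N seconds/minutes/hours.
        return "interval", dict(options)
    if mode == "random":
        # Assumption: one run after a random delay in [min_seconds, max_seconds].
        delay = random.randint(options["min_seconds"], options["max_seconds"])
        return "date", {"run_date": now + timedelta(seconds=delay)}
    raise ValueError("unknown mode: {}".format(mode))


# Usage with APScheduler (not executed here; run_spider is a hypothetical callable):
# from apscheduler.schedulers.background import BackgroundScheduler
# scheduler = BackgroundScheduler()
# alias, args = build_trigger("cron", hour=3, minute=0)
# scheduler.add_job(run_spider, alias, **args)
# scheduler.start()
```

Keeping the mode-to-trigger mapping in one small function makes it easy to see that only the `random` mode needs logic beyond what APScheduler already provides.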
Install and run
$ pip3 install spideradmin
$ spideradmin init # initialize; optional, the default configuration also works
$ spideradmin # start the service
Screenshots
TODO
- Add a login page for access control
- Add more scheduling options
- Add randomized scheduled runs
Mind these versions when deploying Scrapyd
- Scrapyd==1.2.0
- Scrapy==1.6.0
- Twisted==18.9.0
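Once a Scrapyd instance is running, the JSON API that SpiderAdmin wraps for viewing and deleting projects can also be called directly. The following is a minimal sketch using only the standard library. It assumes Scrapyd listens on its default address `http://127.0.0.1:6800`; the helper names (`build_request`, `call`, and the three `*_request` functions) are hypothetical, while the endpoints `listprojects.json`, `listspiders.json`, and `delproject.json` are part of Scrapyd's API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: Scrapyd's default bind address and port.
SCRAPYD_URL = "http://127.0.0.1:6800"


def build_request(endpoint, params=None, post=False, base_url=SCRAPYD_URL):
    """Build a urllib Request for one Scrapyd JSON endpoint."""
    url = "{}/{}".format(base_url.rstrip("/"), endpoint)
    encoded = urlencode(params or {})
    if post:
        # Scrapyd expects form-encoded POST bodies for mutating calls.
        return Request(url, data=encoded.encode("utf-8"), method="POST")
    if encoded:
        url = "{}?{}".format(url, encoded)
    return Request(url, method="GET")


def call(request, timeout=10):
    """Send the request and decode Scrapyd's JSON response."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def list_projects_request():
    return build_request("listprojects.json")


def list_spiders_request(project):
    return build_request("listspiders.json", {"project": project})


def delete_project_request(project):
    return build_request("delproject.json", {"project": project}, post=True)


# Usage (requires a running Scrapyd; "demo" is a placeholder project name):
# print(call(list_projects_request()))
# print(call(list_spiders_request("demo")))
# print(call(delete_project_request("demo")))
```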
Enabling the run-result extension
Install the dependency
pip install PureMySQL
1. Add the data-collection extension file to your Scrapy project
item_count_extension.py
# -*- coding: utf-8 -*-
import logging
from datetime import datetime

from scrapy import signals
from puremysql import PureMysql


class SpiderItemCountExtension(object):
    # Set to the database connection URL, e.g. mysql://root:123456@127.0.0.1:3306/mydata
    ITEM_LOG_DATABASE_URL = None

    # Name of the table that receives one row per spider run
    ITEM_LOG_TABLE = "log_spider"

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Record the run statistics when the spider closes
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        stats = spider.crawler.stats.get_stats()
        scraped_count = stats.get("item_scraped_count", 0)
        dropped_count = stats.get("item_dropped_count", 0)
        log_error = stats.get("log_count/ERROR", 0)
        start_time = stats.get("start_time")
        finish_time = stats.get("finish_time")

        # total_seconds() also counts whole days; .seconds alone would drop them
        duration = int((finish_time - start_time).total_seconds())
        count = scraped_count + dropped_count

        logging.debug("*" * 50)
        logging.debug("* {}".format(spider.name))
        logging.debug("* item count: {}".format(count))
        logging.debug("*" * 50)

        item = {
            "spider_name": spider.name,
            "item_count": count,
            "duration": duration,
            "log_error": log_error,
            "create_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        }

        # Insert the row into the log table, then release the connection
        mysql = PureMysql(db_url=self.ITEM_LOG_DATABASE_URL)
        table = mysql.table(self.ITEM_LOG_TABLE)
        table.insert(item)
        mysql.close()
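To check what the extension would write without running Scrapy or MySQL, the row-building step can be isolated as a plain function. This is only a sketch: `build_log_row` is a hypothetical helper, and `stats` stands in for the dictionary returned by Scrapy's stats collector. Because `timedelta.seconds` alone omits whole days, the sketch uses `total_seconds()` for the duration.

```python
from datetime import datetime, timedelta


def build_log_row(spider_name, stats, now=None):
    """Turn a Scrapy-style stats dict into the row stored in the log table."""
    scraped = stats.get("item_scraped_count", 0)
    dropped = stats.get("item_dropped_count", 0)
    start_time = stats.get("start_time")
    finish_time = stats.get("finish_time")

    # Fall back to 0 when either timestamp is missing from the stats.
    if start_time is not None and finish_time is not None:
        duration = int((finish_time - start_time).total_seconds())
    else:
        duration = 0

    now = now or datetime.now()
    return {
        "spider_name": spider_name,
        "item_count": scraped + dropped,
        "duration": duration,
        "log_error": stats.get("log_count/ERROR", 0),
        "create_time": now.strftime("%Y-%m-%d %H:%M:%S"),
    }


# Example input: a run that lasted one day and five seconds.
start = datetime(2019, 10, 8, 12, 0, 0)
example_stats = {
    "item_scraped_count": 7,
    "item_dropped_count": 3,
    "start_time": start,
    "finish_time": start + timedelta(days=1, seconds=5),
}
example_row = build_log_row("demo_spider", example_stats, now=start)
```

The keys of the returned dictionary are the columns the `log_spider` table needs to provide.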
2. Enable the extension in the Scrapy project's settings.py
EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    "item_count_extension.SpiderItemCountExtension": 100
}
3. Configure the same db_url in default_config.py
ITEM_LOG_DATABASE_URL = "mysql://root:123456@127.0.0.1:3306/mydata"
ITEM_LOG_TABLE = "log_spider"
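Since the same URL has to appear in both places, a quick sanity check of its parts can catch typos early. The sketch below uses only the standard library and is not how PureMySQL parses the URL; `parse_db_url` is a hypothetical helper, and the fallback to port 3306 is an assumption based on MySQL's default.

```python
from urllib.parse import unquote, urlsplit


def parse_db_url(db_url):
    """Split a mysql://user:password@host:port/database URL into its parts."""
    parts = urlsplit(db_url)
    if parts.scheme != "mysql":
        raise ValueError("expected a mysql:// URL, got: {!r}".format(parts.scheme))
    database = parts.path.lstrip("/")
    if not database:
        raise ValueError("the URL does not name a database")
    return {
        "host": parts.hostname,
        "port": parts.port or 3306,  # assumption: MySQL's default port
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
        "database": database,
    }


config = parse_db_url("mysql://root:123456@127.0.0.1:3306/mydata")
```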
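Changelog
Version | Date | Description |
---|---|---|
0.0.17 | 2019-07-02 | Cleaned up files, improved random scheduling, added scheduling history statistics and visualization |
0.0.20 | 2019-10-08 | Added run-result statistics |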