A spider admin tool based on the Scrapyd API and APScheduler

Project description

SpiderAdmin

Features

  1. Wraps the Scrapyd API in a visual interface; Scrapy spider projects can be viewed and deleted.

  2. Editing and adding projects are not implemented; for deployment, the recommended command is:

$ scrapyd-deploy -a

  3. Spiders can be scheduled as timed tasks; the three APScheduler trigger types plus a random delay are supported, four modes in total (see the sketch after this list):
  • date: run once at a fixed time
  • cron: run periodically on a cron schedule
  • interval: run at a fixed interval
  • random: run with a random delay

  4. Simple access control based on Flask-BasicAuth.
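
The four modes map naturally onto APScheduler triggers. Below is a minimal sketch of what the scheduling could look like with plain APScheduler; it is not SpiderAdmin's internal code, and run_spider is a hypothetical callable that would send a schedule request to Scrapyd.

# A rough sketch of the four modes using plain APScheduler (not SpiderAdmin internals).
# `run_spider` is a hypothetical stand-in for a call to Scrapyd's schedule endpoint.
from datetime import datetime, timedelta

from apscheduler.schedulers.blocking import BlockingScheduler


def run_spider():
    print("schedule spider via Scrapyd at", datetime.now())


scheduler = BlockingScheduler()

# date: run once at a fixed point in time
scheduler.add_job(run_spider, "date", run_date=datetime.now() + timedelta(minutes=5))

# cron: run periodically on a cron schedule, e.g. every day at 02:30
scheduler.add_job(run_spider, "cron", hour=2, minute=30)

# interval: run at a fixed interval, e.g. every 30 minutes
scheduler.add_job(run_spider, "interval", minutes=30)

# random: one way to approximate random runs is an interval trigger with jitter (seconds)
scheduler.add_job(run_spider, "interval", minutes=30, jitter=600)

scheduler.start()  # blocks and runs the jobs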

Install and Run

$ pip3 install spideradmin

$ spideradmin init  # initialize; configuration is optional, the defaults also work

$ spideradmin       # start the service

Visit: http://127.0.0.1:5000/

Screenshots

TODO

  1. Add a login page for access control
  2. Add more scheduling options
  3. Add randomly scheduled runs

Version notes for deploying Scrapyd

  • Scrapyd==1.2.0
  • Scrapy==1.6.0
  • Twisted==18.9.0
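
For reference, one way to install these pinned versions (adjust as needed for your environment):

$ pip3 install Scrapyd==1.2.0 Scrapy==1.6.0 Twisted==18.9.0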

Enabling the execution-result extension

Install the dependency

pip install PureMySQL

1. Add a stats-collection extension file to the Scrapy project

item_count_extension.py

# -*- coding: utf-8 -*-

import logging

from scrapy import signals
from puremysql import PureMysql
from datetime import datetime


class SpiderItemCountExtension(object):

    # Set this to the database connection URL, e.g. mysql://root:123456@127.0.0.1:3306/mydata
    ITEM_LOG_DATABASE_URL = None

    # Set this to the table that receives the log rows
    ITEM_LOG_TABLE = "log_spider"

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Run spider_closed when each spider finishes
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        # Collect crawl statistics once the spider has finished
        stats = spider.crawler.stats.get_stats()
        scraped_count = stats.get("item_scraped_count", 0)
        dropped_count = stats.get("item_dropped_count", 0)

        log_error = stats.get("log_count/ERROR", 0)
        start_time = stats.get("start_time")
        finish_time = stats.get("finish_time")
        duration = (finish_time - start_time).seconds

        count = scraped_count + dropped_count

        logging.debug("*" * 50)
        logging.debug("* {}".format(spider.name))
        logging.debug("* item count: {}".format(count))
        logging.debug("*" * 50)

        item = {
            "spider_name": spider.name,
            "item_count": count,
            "duration": duration,
            "log_error": log_error,
            "create_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        }

        # Write one summary row per spider run to the log table
        mysql = PureMysql(db_url=self.ITEM_LOG_DATABASE_URL)
        table = mysql.table(self.ITEM_LOG_TABLE)
        table.insert(item)
        mysql.close()
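
The two class attributes at the top of the extension need real values before deployment. As a minimal example (the credentials below are placeholders; use your own), edit them directly in item_count_extension.py so they match the values configured in step 3 below:

# In item_count_extension.py, replace the placeholders with your own database:
ITEM_LOG_DATABASE_URL = "mysql://root:123456@127.0.0.1:3306/mydata"
ITEM_LOG_TABLE = "log_spider"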

2. Enable the extension in the Scrapy project's settings.py

EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    "item_count_extension.SpiderItemCountExtension": 100
}

3. Configure the same db_url in default_config.py

ITEM_LOG_DATABASE_URL = "mysql://root:123456@127.0.0.1:3306/mydata"
ITEM_LOG_TABLE = "log_spider"
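
To check that the database is reachable before running real crawls, a small smoke test can reuse only the PureMysql calls already shown in the extension above (the connection URL is a placeholder):

# Smoke test: insert one dummy row using the same PureMysql calls as the extension.
from datetime import datetime

from puremysql import PureMysql

mysql = PureMysql(db_url="mysql://root:123456@127.0.0.1:3306/mydata")
table = mysql.table("log_spider")
table.insert({
    "spider_name": "connectivity_check",
    "item_count": 0,
    "duration": 0,
    "log_error": 0,
    "create_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
})
mysql.close()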

Changelog

Version | Date       | Description
0.0.17  | 2019-07-02 | Optimized files, improved random scheduling, added scheduling-history statistics and visualization
0.0.20  | 2019-10-08 | Added execution-result statistics
