A simple crawler framework that implements both single run and distributed run based on Celery, allowing you to write your crawler code as you please

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Backflow

Backflow 是一个简洁且灵活的爬虫框架，支持使用 Celery 实现分布式爬取。你可以根据自己的需求编写爬虫代码，并且框架提供了丰富的配置选项和扩展能力。

功能

支持单机多线程爬取
支持分布式爬取（基于 Celery）
灵活的配置选项
易于扩展的爬虫模板

运行平台

Linux
macOS
Windows

运行环境

Python 3.9+
Linux / Windows

Quick Start

安装依赖

pip install backflow

修改配置文件conf And settings

celery:
  host: 127.0.0.1
  port: 27017
  xxxx: xxxx

es:
  host: 127.0.0.1
  port: 9200
...

运行程序

# 查看命令帮助
backflow --help
# 查看目前所有的回流平台的爬虫列表
backflow list
# 运行单个爬虫
backflow crawl baijiahao  # Single run
backflow crawl jinritoutiao  # Single run

# Single run  The default is 20 pages. If you want to set page numbers, use the --page parameter
# For detailed parameters, please refer to the default parameter START_PAGE END_PAGE PAGE_STEP in settings.py
backflow crawl baijiahao --page 4  
# 运行所有爬虫 多线程单机运行
# main_back.py  # Can run all crawlers

分布式运行 Execute Cellery Distributed

 # 启动celery worker
celery -A tasks worker --loglevel=info 

# run distributed command / 分布式运行
backflow crawl baijiahao --distributed

# if you need to specify the number of pages for a distributed crawler, 100 pages, and a step size of 10
backflow crawl baijiahao --distributed --page 100 --step 10

添加新爬虫新增名为newspider的爬虫

# add new spider
backflow addspider newspider

模版爬虫代码

当你使用命令backflow addspider newspider创建好一个新的spider模版后，请确保你的爬虫代码请求可以正确采集到数据

from loguru import logger
from utils.tools import Tools
from Backflows.base import BackFlow
from Backflows.middleware import Request
import traceback


class {spider_name.capitalize()}(BackFlow):
    name = '{spider_name}'

    def __init__(self):
        super().__init__()
        self.ck = None
        self.headers = {{
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like "
                          "Gecko) Chrome/108.0.0.0 Safari/537.36"
        }}

    async def get_page_request(self, page):
        url = 'http://example.com/api/data?page={{}}'.format(page)
        return Request('GET', url=url, headers=self.headers, cookies=self.ck, meta={{'page': page}})

    async def parse(self, response):

        resp = response.json()
        datas = resp.get('data', {{}})
        if not datas:
            print(f'{spider_name} cookie might be expired or no data returned.')
            return
        for con in datas:
            news = {{
                'url': con.get('item_id', ''),
            }}
            yield news

正式服务器配置

server:
  120.xx.xx.249

demo server:
  host: 120.xx.xx.249
  path: /opt/workstation/xx/

定时任务:
  crontab/supervisor

框架运行步骤

启动程序：在命令行中运行runner.py，并传入相应的参数，例如爬虫的名称、是否以分布式模式运行、停止页码和步长。
命令行解析： argparse库解析命令行参数，并根据参数确定运行模式。
查找爬虫： SpiderRunner类的find_spiders方法搜索所有已定义的爬虫模块（如spiders），并导入这些模块。它查找所有继承自BackFlow类的爬虫，并将它们存储在一个字典中，键为爬虫的名称。
加载中间件&&Settings： load_middlewares方法根据settings.MIDDLEWARES配置加载中间件。中间件包括用户代理、重试和代理中间件，它们用于修改请求和处理响应。
运行爬虫：如果选择了分布式模式，run方法会为每个页面生成一个run_spider任务，并使用Celery将其添加到任务队列中。否则，它会为每个页面创建一个异步任务，并使用asyncio.gather等待所有任务完成。
执行任务：在分布式模式下，Celery worker会从任务队列中获取任务并执行。每个任务都会调用async_run_spider异步函数。
获取页面请求： async_run_spider函数会创建一个Page对象，并调用爬虫的get_page_request方法来获取该页面的请求对象。
处理请求： process_request方法使用中间件处理请求，例如添加用户代理、设置代理等。
抓取页面： fetch_page方法使用httpx.AsyncClient发送请求并获取响应。它使用信号量来限制并发请求数量。
处理响应： process_response方法使用中间件处理响应，例如检查响应状态码。
解析页面：爬虫的parse方法解析响应内容，提取数据项。对于每个提取的数据项，它会调用管道的process_item方法进行处理。
管道处理数据项： Pipeline的process_item方法将数据项存储到MongoDB或文件中。
统计信息：爬虫会更新已抓取页面数和已发送请求数。在所有页面抓取完成后，print_stats方法会打印这些统计信息。
错误处理：如果任何步骤出现错误，异常会被捕获，错误信息会被记录，并通过钉钉发送通知。

注意事项

本项目仅供学习交流使用，未经本地测试请不要在生产环境使用。
分布式运行需要启动celery worker。
restart_celery.py 脚本用于监控并重启celery worker。需要在supervisor中配置或者直接运行
爬虫代码需要在backflow/spiders目录下。
爬虫配置文件需要在backflow/conf目录和 backflow/settings.py中配置。

Project details

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

0.2.4

Jan 15, 2025

This version

0.2.3

Jan 9, 2025

0.2.2

Jan 9, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

backflow-0.2.3.tar.gz (24.6 kB view details)

Uploaded Jan 9, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

backflow-0.2.3-py3-none-any.whl (35.1 kB view details)

Uploaded Jan 9, 2025 Python 3

File details

Details for the file backflow-0.2.3.tar.gz.

File metadata

Download URL: backflow-0.2.3.tar.gz
Upload date: Jan 9, 2025
Size: 24.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.11.9

File hashes

Hashes for backflow-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`031ad23ca94994fafbea4c36881cf5bca2ee1a0f7788decbd784ae6a0570b044`
MD5	`4006e1f353e64a3396910672f2b9386d`
BLAKE2b-256	`82c320bbc45df8bfe1197796286514e74fa73c1d9d10bcc3f57ce7a27bbf6c71`

See more details on using hashes here.

File details

Details for the file backflow-0.2.3-py3-none-any.whl.

File metadata

Download URL: backflow-0.2.3-py3-none-any.whl
Upload date: Jan 9, 2025
Size: 35.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.11.9

File hashes

Hashes for backflow-0.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4f7a74be10a7d34c60f2ef6653cf3c12c2366d3c5cf25eed2f02dfa79c856860`
MD5	`8d665cf599466b06ab74a277d3ecc3ec`
BLAKE2b-256	`ab26b2a202f2e5cc139988a6814d4c9754e5ed521440e6faa43b03333424f96d`

See more details on using hashes here.

backflow 0.2.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Backflow

目录

功能

运行平台

运行环境

Quick Start

安装依赖

修改配置文件conf And settings

运行程序

分布式运行 Execute Cellery Distributed

添加新爬虫 新增名为newspider的爬虫

模版爬虫代码

正式服务器配置

框架运行步骤

注意事项

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

添加新爬虫新增名为newspider的爬虫