Skip to main content

高效异步并发爬虫框架

Project description

HunterX

Static Badge GitHub License

项目背景

这个项目为作者在工作学习中诞生的,一直以来作为本人的工作利器,经过多年的实战打磨,决定开源出来和大家一起学习进步,项目中也存在诸多可优化迭代的方向,期待和你一起完善。

项目简介

HunterX 是一款可以帮助你快速开发一个网络爬虫应用的一套异步并发框架,他提供了许多内置方法, 让你的开发代码更加的简洁,爬虫代码更加规范,方便维护,除此以外还可以多线程并发的做一些数据处理的工作, 更多功能请查看 官方文档 或添加开发者的微信 YSH026-

官方文档

快速开始

环境准备

  • python3.11及以上版本

安装说明

执行以下命令安装hunterx

pip install hunterx

安装完成后执行以下命令

hunterx

成功执行后你将看到以下输出,输入和选择你的创建信息

  • ManagerRabbitmq: 以 rabbitmq 作为优先级队列的爬虫任务。
  • ManagerRedis: 以 redis 作为优先级队列的爬虫任务。
  • ManagerMemory: 以 内存 作为优先级队列的爬虫任务。
? You are about to create a new project. Please follow the prompts to fill in the information. Yes
? 📁Enter the project name for your project: my_project
? 💡Enter the task name for the project: first_spider
? ⚙️Please select a kernel: 
  ManagerRabbitmq
  ManagerRedis
❯ ManagerMemory
? You are about to create a new project. Please follow the prompts to fill in the information. Yes
? 📁Enter the project name for your project: my_project
? 💡Enter the task name for the project: first_spider
? ⚙️Please select a kernel: ManagerMemory
The project name is: my_project.
The task name is: first_spider.
The selected kernel is: ManagerMemory.
Created file: my_project/generator.py
Created file: my_project/__init__.py
Created file: my_project/items.py
Created file: my_project/middleware.py
Created file: my_project/pipelines.py
Created file: my_project/settings.py
Created file: my_project/spiders/__init__.py
Created file: my_project/spiders/first_spider.py
Project structure created at: /your_path/my_project

将创建完成后项目根目录下的 settings.py 文件中的各项配置改为自己配置信息

项目结构

my_project
    ├── spiders
    │    ├── __init__.py
    │    └── first_spider.py
    ├── __init__.py
    ├── generator.py
    ├── items.py
    ├── middleware.py
    ├── pipelines.py
    └── settings.py

测试运行

  • 使用命令行
cd my_project/spiders
python first_spider.py
  • 使用IDE

执行 spiders 文件夹下的 first_spider.py

创建爬虫

  • 打开 generator.py 文件,根据里面的提示填写信息,完成后运行即可创建

示例:

from hunterx.utils.generator import production

# spider_dir: 爬虫分层目录名称(路径不存在时会自动创建,无需手动创建目录)
# spider_name: 创建的爬虫名称
# kernel_code: 需要使用的核心引擎 默认优先使用内存优先级队列,默认为3(内存队列),1为rabbitmq队列,2为redis队列
production(spider_name='second_spider', kernel_code=3)

执行后你将在 spiders 目录下看到刚才创建的名为 second_spider 的爬虫文件

item配置

打开 items.py 文件,您应该可以看到以下内容

# -*- coding: utf-8 -*-
# @Description: 自定义item类
# Define here the models for your scraped items
from hunterx.items.baseitem import Item, dataclass, field


@dataclass
class MyProjectItem(Item):
    # name: str = field(default="")
    pass

那么你可以根据示例 name: str = field(default="") 继续创建更多字段,注意要设置好字段类型

接下来你可以在爬虫中这样使用

# -*- coding: utf-8 -*-
import hunterx
from hunterx.spiders import MemorySpider
from items import MyProjectItem


class FirstSpiderSpider(MemorySpider):
    name = 'first_spider'

    def __init__(self):
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
        }

    def start_requests(self):
        url = 'https://www.example.com/'
        yield hunterx.Requests(url=url, headers=self.header, callback=self.parse, level=1)

    async def parse(self, response):
        item = MyProjectItem()
        item.name = 'hunterx'
        yield item


if __name__ == '__main__':
    start_run = FirstSpiderSpider()
    start_run.run()

这样在执行后设置的字段就可以被正确的赋值了,接下来可以使用管道 pipelines.py 中进行下一步的处理

pipline配置

打开 pipelines.py 文件,你应该可以看到以下内容

from hunterx.piplines.basepipeline import Pipeline
from hunterx.test.my_project.items import MyProjectItem


class MyProjectPipeline(Pipeline):

    async def process_item(self, item, spider):
        if isinstance(item, MyProjectItem):
            print(item)
            print(spider.name)

在这里可以获取到在 items.py 中设置的字段的值,你可以在这里进一步的对数据进行处理,当然这需要爬虫中正确调用并传递。

以上就是一个快速简单的使用案例,更多使用技巧请查看 官方文档

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hunterx-0.1.2.tar.gz (55.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hunterx-0.1.2-py3-none-any.whl (92.8 kB view details)

Uploaded Python 3

File details

Details for the file hunterx-0.1.2.tar.gz.

File metadata

  • Download URL: hunterx-0.1.2.tar.gz
  • Upload date:
  • Size: 55.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.5 CPython/3.11.11 Darwin/21.6.0

File hashes

Hashes for hunterx-0.1.2.tar.gz
Algorithm Hash digest
SHA256 60338b3dec874d4a9ad11c9a6724cb11d9e161bd5c6f0a8fd92128e6a21ddcaa
MD5 00acd07900f343055d2b6dc11394b783
BLAKE2b-256 5295c8c260eda2ebf20e6a94d4b72ca0a481d7fba7346d1fe14d9b0dccebf82d

See more details on using hashes here.

File details

Details for the file hunterx-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: hunterx-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 92.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.5 CPython/3.11.11 Darwin/21.6.0

File hashes

Hashes for hunterx-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 7c7ce65e9261cdaf635687aa9f00485fb5b111c50fbd2f809010f20bbb25de19
MD5 b282422906cec519df553d1d68598975
BLAKE2b-256 90c7f64824e4f30e4d83fa86f620ab5fb4b012bb6e3833b43ad040ad13aa614c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page