
A high-level web crawling and web scraping framework based on asyncio

Project description

AioScrapy

AioScrapy is a powerful asynchronous web crawling framework built on Python's asyncio library. It is inspired by Scrapy but completely reimplemented with asynchronous IO, offering higher performance and more flexible configuration options.

Features

  • Fully Asynchronous: Built on Python's asyncio for efficient concurrent crawling

  • Multiple Download Handlers: Support for various HTTP clients including aiohttp, httpx, requests, pyhttpx, curl_cffi, DrissionPage, playwright, and sbcdp

  • Flexible Middleware System: Easily add custom functionality and processing logic

  • Powerful Data Processing Pipelines: Support for various database storage options

  • Built-in Signal System: Convenient event handling mechanism

  • Rich Configuration Options: Highly customizable crawler behavior (a settings sketch follows this list)

  • Distributed Crawling: Support for distributed crawling using Redis and RabbitMQ

  • Database Integration: Built-in support for Redis, MySQL, MongoDB, PostgreSQL, and RabbitMQ
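
As a concrete illustration of the configuration options above, the sketch below sets per-spider behavior through custom_settings. Apart from CLOSE_SPIDER_ON_IDLE, which appears verbatim in the quick-start further down, the keys are assumptions that aioscrapy mirrors Scrapy's setting names; verify them against the aioscrapy documentation.

from aioscrapy import Spider

# Hedged sketch only: CLOSE_SPIDER_ON_IDLE is taken from the quick-start
# below; the other keys are *assumed* to mirror Scrapy's setting names and
# may be spelled differently in aioscrapy.
class ConfiguredSpider(Spider):
    name = 'configured'
    custom_settings = {
        "CLOSE_SPIDER_ON_IDLE": True,   # stop once the request queue stays idle
        "CONCURRENT_REQUESTS": 16,      # assumed Scrapy-style concurrency cap
        "DOWNLOAD_DELAY": 0.25,         # assumed Scrapy-style politeness delay
    }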

Installation

Requirements

  • Python 3.9+

Install with pip

pip install aio-scrapy

# Or install the latest development version from GitHub:
# pip install git+https://github.com/ConlinH/aio-scrapy
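
To confirm the installation, read the installed distribution's version from its metadata (standard library only; no aioscrapy API is assumed):

# Smoke test: report the installed version via distribution metadata.
import importlib.metadata

print(importlib.metadata.version("aio-scrapy"))  # e.g. "2.1.9"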

Quick Start

from aioscrapy import Spider, logger


class MyspiderSpider(Spider):
    name = 'myspider'
    custom_settings = {
        # Shut the spider down once the request queue stays idle.
        "CLOSE_SPIDER_ON_IDLE": True
    }
    start_urls = ["https://quotes.toscrape.com"]

    @staticmethod
    async def process_request(request, spider):
        """Request middleware: runs before each request is downloaded."""
        pass

    @staticmethod
    async def process_response(request, response, spider):
        """Response middleware: runs on each downloaded response."""
        return response

    @staticmethod
    async def process_exception(request, exception, spider):
        """Exception middleware: runs when downloading raises an error."""
        pass

    async def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css('div.quote'):
            item = {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }
            yield item

    async def process_item(self, item):
        # Item hook: here each scraped item is simply logged.
        logger.info(item)


if __name__ == '__main__':
    MyspiderSpider.start()
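
Running the script directly (python myspider.py) starts the crawler via Spider.start(). The parse coroutine can also yield follow-up requests; the variant below adds pagination to the spider above. It assumes aioscrapy mirrors Scrapy's response.follow helper, which should be checked against the aioscrapy documentation.

# Hypothetical pagination variant of parse(). It assumes aioscrapy's
# Response offers a Scrapy-style follow() helper for relative URLs; if it
# does not, construct the next request with aioscrapy's Request class.
async def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'author': quote.xpath('span/small/text()').get(),
            'text': quote.css('span.text::text').get(),
        }
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        # Resolve the relative link and reuse this callback for the next page.
        yield response.follow(next_page, callback=self.parse)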

Documentation

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

QQ: 995018884
WeChat: h995018884


Download files

Download the file for your platform.

Source Distribution

aio_scrapy-2.1.9.tar.gz (299.3 kB, Source)

Built Distribution

aio_scrapy-2.1.9-py3-none-any.whl (400.9 kB, Python 3)

File details

Details for the file aio_scrapy-2.1.9.tar.gz.

File metadata

  • Download URL: aio_scrapy-2.1.9.tar.gz
  • Size: 299.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.9

File hashes

Hashes for aio_scrapy-2.1.9.tar.gz

  • SHA256: a6811e9ea0b9349b1a5e29bf06fe2d04345a6cbc9bd76a136087e8b091da8921
  • MD5: 8879009c8f331151c2bcee14f474b998
  • BLAKE2b-256: 1f39be43b8da441fd0cd9c19f995e6634b8c8481d7f1faa13e963b7b33295b3e

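To check a downloaded file against these digests, the standard library suffices. This sketch verifies the sdist's SHA256; the same approach applies to the wheel below with its own digest.

# Verify a downloaded artifact against the published SHA256 digest
# (standard library only).
import hashlib

EXPECTED = "a6811e9ea0b9349b1a5e29bf06fe2d04345a6cbc9bd76a136087e8b091da8921"

with open("aio_scrapy-2.1.9.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == EXPECTED, f"hash mismatch: {digest}"
print("aio_scrapy-2.1.9.tar.gz verified")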

File details

Details for the file aio_scrapy-2.1.9-py3-none-any.whl.

File metadata

  • Download URL: aio_scrapy-2.1.9-py3-none-any.whl
  • Size: 400.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.9

File hashes

Hashes for aio_scrapy-2.1.9-py3-none-any.whl

  • SHA256: 1355921f7d03089bf44d5a3870fb415ae14c35412db03a4538f0044fec07c160
  • MD5: 6c1076249b5bf9c484234a8b266ced50
  • BLAKE2b-256: 67263d62ad5a36a9c75ce1268d1710cf9357672ca4fcc51fd606faacb3490b99

