A high-level Web Crawling and Web Scraping framework based on Asyncio

aio-scrapy

An asyncio + aiolibs crawler framework modeled on Scrapy

Overview

  • aio-scrapy is based on the open-source projects Scrapy and scrapy_redis.
  • aio-scrapy is compatible with scrapyd.
  • aio-scrapy implements a Redis queue and a RabbitMQ queue.
  • aio-scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
  • aio-scrapy supports distributed crawling and scraping.

Requirements

  • Python 3.7+
  • Works on Linux, Windows, macOS, BSD

Install

The quick way:

# Install the latest aio-scrapy from GitHub
pip install git+https://github.com/conlin-huang/aio-scrapy

# Install the default release from PyPI
pip install aio-scrapy

# Install with all optional dependencies (quoted so the shell does not expand the brackets)
pip install "aio-scrapy[all]"

# Install only the extras you need, e.g. for mysql/httpx/rabbitmq/mongo
pip install "aio-scrapy[aiomysql,httpx,aio-pika,mongo]"
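
After installing, it can be useful to confirm that the package and any optional backends are importable. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, `aioscrapy` is the import name used in the examples in this README, and the listed optional modules are the usual import names of the `aiomysql`, `httpx`, and `aio-pika` distributions.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # `aioscrapy` is the import name used by the spiders in this README.
    print("aioscrapy:", is_installed("aioscrapy"))

    # Optional backends; only present if the matching extras were installed.
    for optional in ("aiomysql", "httpx", "aio_pika"):
        print(f"{optional}:", is_installed(optional))
```

The check only locates the module; it does not import it, so it has no side effects.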

Usage

Create a project spider:

aioscrapy startproject project_quotes
cd project_quotes
aioscrapy genspider quotes

quotes.py:

from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'

    start_urls = ['https://quotes.toscrape.com']

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


if __name__ == '__main__':
    QuotesMemorySpider.start()

Run the spider:

aioscrapy crawl quotes
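
The `parse` method in quotes.py reads the `href` of the "next" link, which is a relative URL, and passes it to `response.follow`. In Scrapy, `response.follow` resolves such a link against the URL of the current page; this sketch assumes aio-scrapy behaves the same way and reproduces only the resolution step with the standard library. The helper name `resolve_next_page` is hypothetical.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url: str, href: str) -> str:
    """Resolve a relative link found on `page_url` into an absolute URL."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    # The "next" link on the start page is a path such as /page/2/.
    print(resolve_next_page("https://quotes.toscrape.com", "/page/2/"))
    # → https://quotes.toscrape.com/page/2/
    print(resolve_next_page("https://quotes.toscrape.com/page/2/", "/page/3/"))
    # → https://quotes.toscrape.com/page/3/
```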

Create a single-script spider:

aioscrapy genspider single_quotes -t single

single_quotes.py:

from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
        'CLOSE_SPIDER_ON_IDLE': True,
        # 'DOWNLOAD_DELAY': 3,
        # 'RANDOMIZE_DOWNLOAD_DELAY': True,
        # 'CONCURRENT_REQUESTS': 1,
        # 'LOG_LEVEL': 'INFO'
    }

    start_urls = ['https://quotes.toscrape.com']

    @staticmethod
    async def process_request(request, spider):
        """ request middleware """
        return request

    @staticmethod
    async def process_response(request, response, spider):
        """ response middleware """
        return response

    @staticmethod
    async def process_exception(request, exception, spider):
        """ exception middleware """
        pass

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

    async def process_item(self, item):
        print(item)


if __name__ == '__main__':
    QuotesMemorySpider.start()

Run the spider:

aioscrapy runspider single_quotes.py

More commands:

aioscrapy -h
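
In both spiders above, `parse` is an async generator that yields two kinds of results: item dicts and follow-up requests. The sketch below shows, in plain asyncio and without aio-scrapy, one way such a generator could be drained. It is not aio-scrapy's actual engine: `FakeRequest`, `PAGES`, and `crawl` are hypothetical names, and the in-memory `PAGES` table stands in for real HTTP responses.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a follow-up request; not an aio-scrapy class."""
    url: str


# Hypothetical in-memory "site": each page maps to (quotes, next page or None).
PAGES = {
    "/page/1/": (["quote A", "quote B"], "/page/2/"),
    "/page/2/": (["quote C"], None),
}


async def parse(url):
    """Async generator shaped like the spider's parse(): items first, then a request."""
    quotes, next_page = PAGES[url]
    for text in quotes:
        yield {"text": text}
    if next_page is not None:
        yield FakeRequest(next_page)


async def crawl(start_url):
    """Drain parse() page by page, routing dicts to `items` and requests to `pending`."""
    items, pending = [], [start_url]
    while pending:
        url = pending.pop(0)
        async for result in parse(url):
            if isinstance(result, FakeRequest):
                pending.append(result.url)  # schedule the next page
            else:
                items.append(result)  # a real spider would pass this to process_item
    return items


if __name__ == "__main__":
    print(asyncio.run(crawl("/page/1/")))
    # → [{'text': 'quote A'}, {'text': 'quote B'}, {'text': 'quote C'}]
```

The sketch processes one page at a time for clarity; a real crawler would typically fetch several pending requests concurrently.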

Documentation

doc

Ready

Please submit your suggestions to the owner by opening an issue.

Thanks

aiohttp

scrapy
