Aioscpy

An asyncio + aiolibs crawler that imitates the Scrapy framework.

English | 中文

Overview

The Aioscpy framework is based on the open-source projects Scrapy and scrapy_redis.

Aioscpy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.

It implements dynamic variable injection and supports asynchronous coroutines.

Distributed crawling/scraping is also supported.
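
The coroutine-based model can be sketched with the standard library alone. This is an illustrative stand-in, not aioscpy's API: the `fetch` coroutine here is hypothetical, and it only shows how asyncio schedules many requests concurrently.

```python
import asyncio


async def fetch(url: str) -> str:
    # Stand-in for a real HTTP request (e.g. via aiohttp);
    # sleep(0) yields control so other coroutines can run.
    await asyncio.sleep(0)
    return f"<html>body of {url}</html>"


async def crawl(urls):
    # Schedule all fetches concurrently and collect the results.
    return await asyncio.gather(*(fetch(u) for u in urls))


pages = asyncio.run(crawl(["https://example.com/1", "https://example.com/2"]))
print(len(pages))  # → 2
```

Because each coroutine yields while "waiting", many pages are in flight at once on a single thread, which is where the speedup over sequential fetching comes from.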

Requirements

  • Python 3.8+
  • Works on Linux, Windows, macOS, BSD

Install

The quick way:

# default
pip install aioscpy

# latest version from GitHub
pip install git+https://github.com/ihandmine/aioscpy

# install all dependencies 
pip install aioscpy[all]

# install extra packages
pip install aioscpy[aiohttp,httpx]

Usage

Create a project spider:

aioscpy startproject project_quotes
cd project_quotes
aioscpy genspider quotes 

tree

quotes.py:

from aioscpy.spider import Spider


class QuotesSpider(Spider):
    name = 'quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
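
`response.follow` resolves the (often relative) `href` against the current page URL before scheduling the request. The same resolution can be reproduced with the standard library; the sample `href` value below is illustrative:

```python
from urllib.parse import urljoin

page_url = "https://quotes.toscrape.com/tag/humor/"
next_page = "/tag/humor/page/2/"  # a typical value from li.next a::attr("href")

# urljoin applies the same relative-reference resolution rules (RFC 3986)
# that response.follow uses under the hood.
print(urljoin(page_url, next_page))  # → https://quotes.toscrape.com/tag/humor/page/2/
```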

Create a single-script spider:

aioscpy onespider single_quotes

single_quotes.py:

from aioscpy.spider import Spider
from anti_header import Header
from pprint import pprint, pformat


class SingleQuotesSpider(Spider):
    name = 'single_quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/',
    ]

    async def process_request(self, request):
        request.headers = Header(url=request.url, platform='windows', connection=True).random
        return request

    async def process_response(self, request, response):
        if response.status in [404, 503]:
            return request
        return response

    async def process_exception(self, request, exc):
        raise exc

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    async def process_item(self, item):
        self.logger.info("{item}", **{'item': pformat(item)})


if __name__ == '__main__':
    quotes = SingleQuotesSpider()
    quotes.start()
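
The `process_request`/`process_response` hooks above form a small middleware chain: returning the request from `process_response` signals a retry. A minimal plain-Python sketch of that control flow follows; the class names and the scripted downloader are illustrative, not aioscpy internals:

```python
import asyncio


class Request:
    def __init__(self, url):
        self.url = url
        self.headers = {}


class Response:
    def __init__(self, request, status):
        self.request = request
        self.status = status


async def download(request, statuses):
    # Stand-in downloader: returns one scripted status per call.
    return Response(request, statuses.pop(0))


async def fetch_with_hooks(spider, request, statuses, max_retries=3):
    for _ in range(max_retries):
        request = await spider.process_request(request)
        response = await download(request, statuses)
        result = await spider.process_response(request, response)
        if isinstance(result, Response):
            return result
        # process_response returned the request -> retry
    return response


class DemoSpider:
    async def process_request(self, request):
        request.headers["User-Agent"] = "demo"
        return request

    async def process_response(self, request, response):
        if response.status in (404, 503):
            return request  # ask for a retry
        return response


resp = asyncio.run(
    fetch_with_hooks(DemoSpider(), Request("https://example.com"), [503, 200])
)
print(resp.status)  # → 200
```

The first scripted response (503) is rejected and retried; the second (200) passes through, which is the same round trip the spider above performs against real responses.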

Run the spider:

aioscpy crawl quotes
aioscpy runspider quotes.py

Run

start.py:

from aioscpy.crawler import call_grace_instance
from aioscpy.utils.tools import get_project_settings

"""Start spiders directly (method one):
from cegex.baidu import BaiduSpider
from cegex.httpbin import HttpBinSpider

process = CrawlerProcess()
process.crawl(HttpBinSpider)
process.crawl(BaiduSpider)
process.start()
"""


def load_file_to_execute():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.load_spider(path='./cegex', spider_like='baidu')
    process.start()


def load_name_to_execute():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.crawl('baidu', path="./cegex")
    process.start()


if __name__ == '__main__':
    load_file_to_execute()

More commands:

aioscpy -h

Ready

Please submit your suggestions to the owner via a GitHub issue.

Thanks

aiohttp

scrapy

loguru

httpx
