Aioscpy

An asyncio + aiolibs crawler that imitates the Scrapy framework.

English | 中文

Overview

The Aioscpy framework is based on the open-source projects Scrapy and scrapy_redis.

Aioscpy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.

It implements dynamic variable injection and supports asynchronous coroutines.

Distributed crawling/scraping is also supported.
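
The coroutine-based model can be sketched with the standard library alone. This is an illustrative stand-in, not aioscpy's API: the `fetch` coroutine here is hypothetical, and it only shows how asyncio schedules many requests concurrently.

```python
import asyncio


async def fetch(url: str) -> str:
    # Stand-in for a real HTTP request (e.g. via aiohttp);
    # sleep(0) yields control so other coroutines can run.
    await asyncio.sleep(0)
    return f"<html>body of {url}</html>"


async def crawl(urls):
    # Schedule all fetches concurrently and collect the results.
    return await asyncio.gather(*(fetch(u) for u in urls))


pages = asyncio.run(crawl(["https://example.com/1", "https://example.com/2"]))
print(len(pages))  # → 2
```

Because each coroutine yields while "waiting", many pages are in flight at once on a single thread, which is where the speedup over sequential fetching comes from.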

Requirements

  • Python 3.8+
  • Works on Linux, Windows, macOS, BSD

Install

The quick way:

# default
pip install aioscpy

# latest version from GitHub
pip install git+https://github.com/ihandmine/aioscpy

# install all dependencies 
pip install aioscpy[all]

# install extra packages
pip install aioscpy[aiohttp,httpx]

Usage

Create a project spider:

aioscpy startproject project_quotes
cd project_quotes
aioscpy genspider quotes 

tree

quotes.py:

from aioscpy.spider import Spider


class QuotesSpider(Spider):
    name = 'quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
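
`response.follow` resolves the (often relative) `href` against the current page URL before scheduling the request. The same resolution can be reproduced with the standard library; the sample `href` value below is illustrative:

```python
from urllib.parse import urljoin

page_url = "https://quotes.toscrape.com/tag/humor/"
next_page = "/tag/humor/page/2/"  # a typical value from li.next a::attr("href")

# urljoin applies the same relative-reference resolution rules (RFC 3986)
# that response.follow uses under the hood.
print(urljoin(page_url, next_page))  # → https://quotes.toscrape.com/tag/humor/page/2/
```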

Create a single-script spider:

aioscpy onespider single_quotes

single_quotes.py:

from aioscpy.spider import Spider
from anti_header import Header
from pprint import pprint, pformat


class SingleQuotesSpider(Spider):
    name = 'single_quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/',
    ]

    async def process_request(self, request):
        request.headers = Header(url=request.url, platform='windows', connection=True).random
        return request

    async def process_response(self, request, response):
        if response.status in [404, 503]:
            return request
        return response

    async def process_exception(self, request, exc):
        raise exc

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    async def process_item(self, item):
        self.logger.info("{item}", **{'item': pformat(item)})


if __name__ == '__main__':
    quotes = SingleQuotesSpider()
    quotes.start()
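
The `process_request`/`process_response` hooks above form a small middleware chain: returning the request from `process_response` signals a retry. A minimal plain-Python sketch of that control flow follows; the class names and the scripted downloader are illustrative, not aioscpy internals:

```python
import asyncio


class Request:
    def __init__(self, url):
        self.url = url
        self.headers = {}


class Response:
    def __init__(self, request, status):
        self.request = request
        self.status = status


async def download(request, statuses):
    # Stand-in downloader: returns one scripted status per call.
    return Response(request, statuses.pop(0))


async def fetch_with_hooks(spider, request, statuses, max_retries=3):
    for _ in range(max_retries):
        request = await spider.process_request(request)
        response = await download(request, statuses)
        result = await spider.process_response(request, response)
        if isinstance(result, Response):
            return result
        # process_response returned the request -> retry
    return response


class DemoSpider:
    async def process_request(self, request):
        request.headers["User-Agent"] = "demo"
        return request

    async def process_response(self, request, response):
        if response.status in (404, 503):
            return request  # ask for a retry
        return response


resp = asyncio.run(
    fetch_with_hooks(DemoSpider(), Request("https://example.com"), [503, 200])
)
print(resp.status)  # → 200
```

The first scripted response (503) is rejected and retried; the second (200) passes through, which is the same round trip the spider above performs against real responses.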

Run the spider:

aioscpy crawl quotes
aioscpy runspider quotes.py

Run

start.py:

from aioscpy.crawler import call_grace_instance
from aioscpy.utils.tools import get_project_settings

"""Start spiders directly (method one):
from cegex.baidu import BaiduSpider
from cegex.httpbin import HttpBinSpider

process = CrawlerProcess()
process.crawl(HttpBinSpider)
process.crawl(BaiduSpider)
process.start()
"""


def load_file_to_execute():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.load_spider(path='./cegex', spider_like='baidu')
    process.start()


def load_name_to_execute():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.crawl('baidu', path="./cegex")
    process.start()


if __name__ == '__main__':
    load_file_to_execute()

More commands:

aioscpy -h

Ready

Please submit your suggestions to the owner via a GitHub issue.

Thanks

aiohttp

scrapy

loguru

httpx
