An asyncio + aio-libs crawler that imitates the Scrapy framework
Aioscpy
A powerful, high-performance asynchronous web crawling and scraping framework built on Python's asyncio ecosystem.
English | 中文
Overview
Aioscpy is a fast, high-level web crawling and scraping framework for crawling websites and extracting structured data from their pages. It draws inspiration from Scrapy and scrapy_redis but is designed from the ground up to leverage the full power of asynchronous programming.
Key Features
- Fully Asynchronous: Built on Python's asyncio for high-performance concurrent operations
- Scrapy-like API: Familiar API for those coming from Scrapy
- Distributed Crawling: Support for distributed crawling using Redis
- Multiple HTTP Backends: Support for aiohttp, httpx, and requests
- Dynamic Variable Injection: Powerful dependency injection system
- Flexible Middleware System: Customizable request/response processing pipeline
- Robust Item Processing: Pipeline for processing scraped items (a minimal sketch follows this list)
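As a quick taste of item processing, a spider can define an async process_item hook that receives every item it yields; the hook appears in full in the advanced example below. The sketch here normalizes a field in place before logging it, and uses only constructs demonstrated later in this README:

```python
from aioscpy.spider import Spider


class NormalizingSpider(Spider):
    name = 'normalizing'
    start_urls = ['https://quotes.toscrape.com/']

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

    async def process_item(self, item):
        # Called for each yielded item; strip stray whitespace before use.
        if item.get('text'):
            item['text'] = item['text'].strip()
        self.logger.info("{item}", **{'item': item})
```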
Requirements
- Python 3.8+
- Works on Linux, Windows, macOS, BSD
Installation
Basic Installation
```bash
pip install aioscpy
```
With All Dependencies
```bash
pip install aioscpy[all]
```
With Specific HTTP Backends
```bash
pip install aioscpy[aiohttp,httpx]
```
Latest Version from GitHub
```bash
pip install git+https://github.com/ihandmine/aioscpy
```
Quick Start
Creating a New Project
```bash
aioscpy startproject myproject
cd myproject
```
Creating a Spider
```bash
aioscpy genspider myspider
```
This will create a basic spider in the spiders directory.
Example Spider
```python
from aioscpy.spider import Spider


class QuotesSpider(Spider):
    name = 'quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
Creating a Single Spider Script
```bash
aioscpy onespider single_quotes
```
Advanced Spider Example
```python
from aioscpy.spider import Spider
from anti_header import Header
from pprint import pformat


class SingleQuotesSpider(Spider):
    name = 'single_quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/',
    ]

    async def process_request(self, request):
        request.headers = Header(url=request.url, platform='windows', connection=True).random
        return request

    async def process_response(self, request, response):
        if response.status in [404, 503]:
            return request
        return response

    async def process_exception(self, request, exc):
        raise exc

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    async def process_item(self, item):
        self.logger.info("{item}", **{'item': pformat(item)})


if __name__ == '__main__':
    quotes = SingleQuotesSpider()
    quotes.start()
```
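In this example, process_request attaches a randomized header set from anti_header to every outgoing request; process_response returns the original request when the status is 404 or 503 (by convention, returning a request from this hook causes it to be scheduled again rather than passed to parse); process_exception simply re-raises; and process_item receives each yielded item, here pretty-printing it through the spider's logger.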
Running Spiders
```bash
# Run a spider from a project
aioscpy crawl quotes

# Run a single spider script
aioscpy runspider quotes.py
```
Running from Python Code
```python
from aioscpy.crawler import call_grace_instance
from aioscpy.utils.tools import get_project_settings


# Method 1: Load all spiders from a directory
def load_spiders_from_directory():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.load_spider(path='./spiders')
    process.start()


# Method 2: Run a specific spider by name
def run_specific_spider():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.crawl('myspider')
    process.start()


if __name__ == '__main__':
    run_specific_spider()
```
Configuration
Aioscpy can be configured through the settings.py file in your project. Here are the most important settings:
Concurrency Settings
```python
# Maximum number of concurrent items being processed
CONCURRENT_ITEMS = 100

# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 16

# Maximum number of concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Maximum number of concurrent requests per IP
CONCURRENT_REQUESTS_PER_IP = 0
```
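A per-IP value of 0 presumably disables the per-IP cap, mirroring Scrapy's convention for the same setting name (aioscpy's docs don't spell this out here), leaving CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN as the effective limits.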
Download Settings
```python
# Delay between requests (in seconds)
DOWNLOAD_DELAY = 0

# Timeout for requests (in seconds)
DOWNLOAD_TIMEOUT = 20

# Whether to randomize the download delay
RANDOMIZE_DOWNLOAD_DELAY = True

# HTTP backend to use
DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.httpx.HttpxDownloadHandler"
# Other options:
# DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler"
# DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.requests.RequestsDownloadHandler"
```
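Since the spider examples above override settings through custom_settings, the backend can presumably be swapped per spider the same way. A minimal sketch, assuming custom_settings accepts any setting name, including DOWNLOAD_HANDLER (the per-spider override is an assumption, not documented here):

```python
from aioscpy.spider import Spider


class AiohttpBackedSpider(Spider):
    name = 'aiohttp_backed'
    # Assumption: custom_settings can override DOWNLOAD_HANDLER per spider,
    # just as it overrides SPIDER_IDLE in the examples above.
    custom_settings = {
        "DOWNLOAD_HANDLER": "aioscpy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler",
        "DOWNLOAD_TIMEOUT": 30,
    }
    start_urls = ['https://quotes.toscrape.com/']

    async def parse(self, response):
        yield {'title': response.css('title::text').get()}
```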
Scheduler Settings
```python
# Scheduler to use (memory-based or Redis-based)
SCHEDULER = "aioscpy.core.scheduler.memory.MemoryScheduler"
# For distributed crawling:
# SCHEDULER = "aioscpy.core.scheduler.redis.RedisScheduler"

# Redis connection settings (for the Redis scheduler)
REDIS_URI = "redis://localhost:6379"
QUEUE_KEY = "%(spider)s:queue"
```
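The %(spider)s placeholder is standard Python %-formatting filled with the spider's name, so a spider named quotes would use the Redis key quotes:queue.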
Response API
Aioscpy provides a rich API for working with responses:
Extracting Data
```python
# Using CSS selectors
title = response.css('title::text').get()
all_links = response.css('a::attr(href)').getall()

# Using XPath
title = response.xpath('//title/text()').get()
all_links = response.xpath('//a/@href').getall()
```
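The .get()/.getall() API above matches Scrapy's parsel selectors. Assuming that parity extends to default values and chained selection (an assumption; aioscpy's docs don't enumerate the full selector API), patterns like these should also work:

```python
# Fallback value when nothing matches (assumed parsel-parity API).
title = response.css('title::text').get(default='(no title)')

# Chained selection: query within a previously selected node.
for quote in response.css('div.quote'):
    tags = quote.css('a.tag::text').getall()
```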
Following Links
```python
# Follow a link, parsing the result with the default parse callback
yield response.follow('next-page.html', self.parse)

# Follow a link with a different callback
yield response.follow('details.html', self.parse_details)

# Follow all links matching a CSS selector
yield from response.follow_all(css='a.product::attr(href)', callback=self.parse_product)
```
More Commands
```bash
aioscpy -h
```
Distributed Crawling
To enable distributed crawling with Redis:
- Configure Redis in your settings:

```python
SCHEDULER = "aioscpy.core.scheduler.redis.RedisScheduler"
REDIS_URI = "redis://localhost:6379"
QUEUE_KEY = "%(spider)s:queue"
```
- Run multiple instances of your spider on different machines, all connecting to the same Redis server.
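Putting it together, the Redis settings can live in the spider itself. A minimal sketch, reusing the setting names from the snippet above and assuming (as with SPIDER_IDLE in the earlier examples) that custom_settings may override them per spider:

```python
from aioscpy.spider import Spider


class DistributedQuotesSpider(Spider):
    name = 'distributed_quotes'
    custom_settings = {
        # Setting names taken from the snippet above; overriding them
        # per spider via custom_settings is an assumption.
        "SCHEDULER": "aioscpy.core.scheduler.redis.RedisScheduler",
        "REDIS_URI": "redis://localhost:6379",
        "QUEUE_KEY": "%(spider)s:queue",
    }

    async def parse(self, response):
        yield {'title': response.css('title::text').get()}
```

Every machine runs the same spider name, so all instances share the distributed_quotes:queue key on the common Redis server.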
Contributing
Please submit suggestions and report bugs by opening an issue on GitHub.
Project details
Download files
- Source Distribution: aioscpy-0.3.13.tar.gz
- Built Distribution: aioscpy-0.3.13-py3-none-any.whl
File details
Details for the file aioscpy-0.3.13.tar.gz.
File metadata
- Download URL: aioscpy-0.3.13.tar.gz
- Upload date:
- Size: 63.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 59b34f9db25cc6745e17df78de9951404d15bbf665b630f1c09ed74684330baa |
| MD5 | a26d9f9ae6088d7a9fa8a6161e878365 |
| BLAKE2b-256 | 8a01f01a8e1b2171924ebe702092da93341d32271c316fd66cdc5122a24240ba |
File details
Details for the file aioscpy-0.3.13-py3-none-any.whl.
File metadata
- Download URL: aioscpy-0.3.13-py3-none-any.whl
- Upload date:
- Size: 86.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d0d38a63f365d63a1022526a44bb315dbfa10c67c49ec5fbee16f0261c4b2b21 |
| MD5 | 3a3bd56ed5b86c44d4a5106b7ff71e7c |
| BLAKE2b-256 | 87aaa16e166b606ed09de8f8d306c4e6d890cd1ea8e966fc340d296332095f8a |