An asyncio + aio-libs crawler that imitates the Scrapy framework
Aioscpy
A powerful, high-performance asynchronous web crawling and scraping framework built on Python's asyncio ecosystem.
English | 中文
Overview
Aioscpy is a fast, high-level web crawling and scraping framework for crawling websites and extracting structured data from their pages. It draws inspiration from Scrapy and scrapy_redis but is designed from the ground up to leverage the full power of asynchronous programming.
Key Features
- Fully Asynchronous: Built on Python's asyncio for high-performance concurrent operations
- Scrapy-like API: Familiar API for those coming from Scrapy
- Distributed Crawling: Support for distributed crawling using Redis
- Multiple HTTP Backends: Support for aiohttp, httpx, and requests
- Dynamic Variable Injection: Powerful dependency injection system
- Flexible Middleware System: Customizable request/response processing pipeline
- Robust Item Processing: Pipeline for processing scraped items (a minimal sketch follows this list)
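As a quick taste of item processing, a spider can define an async process_item hook that receives every item it yields; the hook appears in full in the advanced example below. The sketch here normalizes a field in place before logging it, and uses only constructs demonstrated later in this README:

```python
from aioscpy.spider import Spider


class NormalizingSpider(Spider):
    name = 'normalizing'
    start_urls = ['https://quotes.toscrape.com/']

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

    async def process_item(self, item):
        # Called for each yielded item; strip stray whitespace before use.
        if item.get('text'):
            item['text'] = item['text'].strip()
        self.logger.info("{item}", **{'item': item})
```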
Requirements
- Python 3.8+
- Works on Linux, Windows, macOS, BSD
Installation
Basic Installation
```bash
pip install aioscpy
```
With All Dependencies
```bash
pip install aioscpy[all]
```
With Specific HTTP Backends
```bash
pip install aioscpy[aiohttp,httpx]
```
Latest Version from GitHub
```bash
pip install git+https://github.com/ihandmine/aioscpy
```
Quick Start
Creating a New Project
```bash
aioscpy startproject myproject
cd myproject
```
Creating a Spider
```bash
aioscpy genspider myspider
```
This will create a basic spider in the spiders directory.
Example Spider
```python
from aioscpy.spider import Spider


class QuotesSpider(Spider):
    name = 'quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
Creating a Single Spider Script
```bash
aioscpy onespider single_quotes
```
Advanced Spider Example
```python
from aioscpy.spider import Spider
from anti_header import Header
from pprint import pformat


class SingleQuotesSpider(Spider):
    name = 'single_quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/',
    ]

    async def process_request(self, request):
        request.headers = Header(url=request.url, platform='windows', connection=True).random
        return request

    async def process_response(self, request, response):
        if response.status in [404, 503]:
            return request
        return response

    async def process_exception(self, request, exc):
        raise exc

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    async def process_item(self, item):
        self.logger.info("{item}", **{'item': pformat(item)})


if __name__ == '__main__':
    quotes = SingleQuotesSpider()
    quotes.start()
```
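In this example, process_request attaches a randomized header set from anti_header to every outgoing request; process_response returns the original request when the status is 404 or 503 (by convention, returning a request from this hook causes it to be scheduled again rather than passed to parse); process_exception simply re-raises; and process_item receives each yielded item, here pretty-printing it through the spider's logger.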
Running Spiders
```bash
# Run a spider from a project
aioscpy crawl quotes

# Run a single spider script
aioscpy runspider quotes.py
```
Running from Python Code
```python
from aioscpy.crawler import call_grace_instance
from aioscpy.utils.tools import get_project_settings


# Method 1: Load all spiders from a directory
def load_spiders_from_directory():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.load_spider(path='./spiders')
    process.start()


# Method 2: Run a specific spider by name
def run_specific_spider():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.crawl('myspider')
    process.start()


if __name__ == '__main__':
    run_specific_spider()
```
Configuration
Aioscpy can be configured through the settings.py file in your project. Here are the most important settings:
Concurrency Settings
```python
# Maximum number of concurrent items being processed
CONCURRENT_ITEMS = 100

# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 16

# Maximum number of concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Maximum number of concurrent requests per IP
CONCURRENT_REQUESTS_PER_IP = 0
```
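A per-IP value of 0 presumably disables the per-IP cap, mirroring Scrapy's convention for the same setting name (aioscpy's docs don't spell this out here), leaving CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN as the effective limits.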
Download Settings
```python
# Delay between requests (in seconds)
DOWNLOAD_DELAY = 0

# Timeout for requests (in seconds)
DOWNLOAD_TIMEOUT = 20

# Whether to randomize the download delay
RANDOMIZE_DOWNLOAD_DELAY = True

# HTTP backend to use
DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.httpx.HttpxDownloadHandler"
# Other options:
# DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler"
# DOWNLOAD_HANDLER = "aioscpy.core.downloader.handlers.requests.RequestsDownloadHandler"
```
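Since the spider examples above override settings through custom_settings, the backend can presumably be swapped per spider the same way. A minimal sketch, assuming custom_settings accepts any setting name, including DOWNLOAD_HANDLER (the per-spider override is an assumption, not documented here):

```python
from aioscpy.spider import Spider


class AiohttpBackedSpider(Spider):
    name = 'aiohttp_backed'
    # Assumption: custom_settings can override DOWNLOAD_HANDLER per spider,
    # just as it overrides SPIDER_IDLE in the examples above.
    custom_settings = {
        "DOWNLOAD_HANDLER": "aioscpy.core.downloader.handlers.aiohttp.AioHttpDownloadHandler",
        "DOWNLOAD_TIMEOUT": 30,
    }
    start_urls = ['https://quotes.toscrape.com/']

    async def parse(self, response):
        yield {'title': response.css('title::text').get()}
```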
Scheduler Settings
```python
# Scheduler to use (memory-based or Redis-based)
SCHEDULER = "aioscpy.core.scheduler.memory.MemoryScheduler"
# For distributed crawling:
# SCHEDULER = "aioscpy.core.scheduler.redis.RedisScheduler"

# Redis connection settings (for the Redis scheduler)
REDIS_URI = "redis://localhost:6379"
QUEUE_KEY = "%(spider)s:queue"
```
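The %(spider)s placeholder is standard Python %-formatting filled with the spider's name, so a spider named quotes would use the Redis key quotes:queue.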
Response API
Aioscpy provides a rich API for working with responses:
Extracting Data
```python
# Using CSS selectors
title = response.css('title::text').get()
all_links = response.css('a::attr(href)').getall()

# Using XPath
title = response.xpath('//title/text()').get()
all_links = response.xpath('//a/@href').getall()
```
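The .get()/.getall() API above matches Scrapy's parsel selectors. Assuming that parity extends to default values and chained selection (an assumption; aioscpy's docs don't enumerate the full selector API), patterns like these should also work:

```python
# Fallback value when nothing matches (assumed parsel-parity API).
title = response.css('title::text').get(default='(no title)')

# Chained selection: query within a previously selected node.
for quote in response.css('div.quote'):
    tags = quote.css('a.tag::text').getall()
```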
Following Links
```python
# Follow a link, parsing the result with the default parse callback
yield response.follow('next-page.html', self.parse)

# Follow a link with a different callback
yield response.follow('details.html', self.parse_details)

# Follow all links matching a CSS selector
yield from response.follow_all(css='a.product::attr(href)', callback=self.parse_product)
```
More Commands
```bash
aioscpy -h
```
Distributed Crawling
To enable distributed crawling with Redis:
- Configure Redis in your settings:

```python
SCHEDULER = "aioscpy.core.scheduler.redis.RedisScheduler"
REDIS_URI = "redis://localhost:6379"
QUEUE_KEY = "%(spider)s:queue"
```
- Run multiple instances of your spider on different machines, all connecting to the same Redis server.
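Putting it together, the Redis settings can live in the spider itself. A minimal sketch, reusing the setting names from the snippet above and assuming (as with SPIDER_IDLE in the earlier examples) that custom_settings may override them per spider:

```python
from aioscpy.spider import Spider


class DistributedQuotesSpider(Spider):
    name = 'distributed_quotes'
    custom_settings = {
        # Setting names taken from the snippet above; overriding them
        # per spider via custom_settings is an assumption.
        "SCHEDULER": "aioscpy.core.scheduler.redis.RedisScheduler",
        "REDIS_URI": "redis://localhost:6379",
        "QUEUE_KEY": "%(spider)s:queue",
    }

    async def parse(self, response):
        yield {'title': response.css('title::text').get()}
```

Every machine runs the same spider name, so all instances share the distributed_quotes:queue key on the common Redis server.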
Contributing
Please submit suggestions and report bugs by opening an issue on GitHub.
Project details
Download files
- Source Distribution: aioscpy-0.3.13.tar.gz
- Built Distribution: aioscpy-0.3.13-py3-none-any.whl
File details
Details for the file aioscpy-0.3.13.tar.gz.
File metadata
- Download URL: aioscpy-0.3.13.tar.gz
- Upload date:
- Size: 63.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 59b34f9db25cc6745e17df78de9951404d15bbf665b630f1c09ed74684330baa |
| MD5 | a26d9f9ae6088d7a9fa8a6161e878365 |
| BLAKE2b-256 | 8a01f01a8e1b2171924ebe702092da93341d32271c316fd66cdc5122a24240ba |
File details
Details for the file aioscpy-0.3.13-py3-none-any.whl.
File metadata
- Download URL: aioscpy-0.3.13-py3-none-any.whl
- Upload date:
- Size: 86.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d0d38a63f365d63a1022526a44bb315dbfa10c67c49ec5fbee16f0261c4b2b21 |
| MD5 | 3a3bd56ed5b86c44d4a5106b7ff71e7c |
| BLAKE2b-256 | 87aaa16e166b606ed09de8f8d306c4e6d890cd1ea8e966fc340d296332095f8a |