A high-level Web Crawling and Web Scraping framework based on Asyncio
aio-scrapy
An asyncio + aiolibs crawler framework modeled on Scrapy
Overview
- aio-scrapy is based on the open-source projects Scrapy and scrapy_redis.
- aio-scrapy is compatible with scrapyd.
- aio-scrapy supports Redis and RabbitMQ queues.
- aio-scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages.
- aio-scrapy supports distributed crawling and scraping.
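To give a feel for what an asyncio-based crawler does, here is a minimal sketch of bounded concurrent fetching with plain asyncio. It is not aio-scrapy's implementation: `fake_fetch`, `crawl`, and the `concurrency` parameter are hypothetical names, and the network call is simulated with `asyncio.sleep`.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP download (hypothetical helper)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html>{url}</html>"


async def crawl(urls: list[str], concurrency: int = 2) -> dict[str, str]:
    """Fetch all URLs, allowing at most `concurrency` downloads in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    pages: dict[str, str] = {}

    async def worker(url: str) -> None:
        async with semaphore:  # wait for a free download slot
            pages[url] = await fake_fetch(url)

    await asyncio.gather(*(worker(u) for u in urls))
    return pages


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    result = asyncio.run(crawl(urls))
    print(len(result), "pages fetched")
```

The semaphore plays roughly the role that a setting such as `CONCURRENT_REQUESTS` plays in a real framework: it caps how many downloads run at once while the event loop keeps the rest waiting without blocking a thread.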
Requirements
- Python 3.9+
- Works on Linux, Windows, macOS, BSD
Install
The quick way:
# Install the latest aio-scrapy
pip install git+https://github.com/conlin-huang/aio-scrapy
# default
pip install aio-scrapy
# Install all dependencies (quotes keep shells such as zsh from expanding the brackets)
pip install "aio-scrapy[all]"
# When you need to use mysql/httpx/rabbitmq/mongo
pip install "aio-scrapy[aiomysql,httpx,aio-pika,mongo]"
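After installing with extras, it can be handy to confirm which optional backends are actually importable. The sketch below uses only the standard library. The module names passed in (`aiomysql`, `httpx`, `aio_pika`, `motor`) are assumptions about what the extras install, not names taken from the aio-scrapy documentation, so adjust them to match your environment.

```python
from importlib.util import find_spec


def available_modules(module_names):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    # Assumed import names for the optional extras; verify against your install.
    assumed = ["aiomysql", "httpx", "aio_pika", "motor"]
    for name, ok in available_modules(assumed).items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```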
Usage
Create a project spider:
aioscrapy startproject project_quotes
cd project_quotes
aioscrapy genspider quotes
quotes.py:
from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'
    start_urls = ['https://quotes.toscrape.com']

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


if __name__ == '__main__':
    QuotesMemorySpider.start()
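The `next_page` value extracted in `parse` is a relative link such as `/page/2/`. In Scrapy, `response.follow` resolves such links against the URL of the current response; assuming aio-scrapy keeps that behavior, the resolution step is equivalent to the standard library's `urljoin`. The helper name `resolve_next_page` is hypothetical.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_next_page(current_url: str, href: Optional[str]) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    if href is None:  # no "next" link: pagination is finished
        return None
    return urljoin(current_url, href)


if __name__ == "__main__":
    print(resolve_next_page("https://quotes.toscrape.com/", "/page/2/"))
    # https://quotes.toscrape.com/page/2/
```

Returning `None` when no link is found mirrors the `if next_page is not None` guard in the spider, which is what ends the crawl on the last page.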
Run the spider:
aioscrapy crawl quotes
Create a single-script spider:
aioscrapy genspider single_quotes -t single
single_quotes.py:
from aioscrapy.spiders import Spider


class QuotesMemorySpider(Spider):
    name = 'QuotesMemorySpider'

    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
        'CLOSE_SPIDER_ON_IDLE': True,
        # 'DOWNLOAD_DELAY': 3,
        # 'RANDOMIZE_DOWNLOAD_DELAY': True,
        # 'CONCURRENT_REQUESTS': 1,
        # 'LOG_LEVEL': 'INFO'
    }

    start_urls = ['https://quotes.toscrape.com']

    @staticmethod
    async def process_request(request, spider):
        """ request middleware """
        pass

    @staticmethod
    async def process_response(request, response, spider):
        """ response middleware """
        return response

    @staticmethod
    async def process_exception(request, exception, spider):
        """ exception middleware """
        pass

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

    async def process_item(self, item):
        print(item)


if __name__ == '__main__':
    QuotesMemorySpider.start()
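The three static methods in the script follow the downloader-middleware convention from Scrapy: one hook before the download, one on the response, and one when the download raises. The sketch below shows that call order in plain asyncio. It is an illustration only, not aio-scrapy's engine; `download_with_hooks`, `DemoHooks`, and the two downloader callables are hypothetical, and the hooks omit the `spider` argument for brevity.

```python
import asyncio


async def download_with_hooks(request, hooks, downloader):
    """Run the request hook, then the download, then the response or exception hook."""
    replacement = await hooks.process_request(request)
    if replacement is not None:  # a hook may short-circuit the download
        return replacement
    try:
        response = await downloader(request)
    except Exception as exc:
        recovered = await hooks.process_exception(request, exc)
        if recovered is None:  # nobody handled the error: propagate it
            raise
        return recovered
    return await hooks.process_response(request, response)


class DemoHooks:
    """Hypothetical hooks that record the order in which they are called."""

    def __init__(self):
        self.calls = []

    async def process_request(self, request):
        self.calls.append("request")
        return None

    async def process_response(self, request, response):
        self.calls.append("response")
        return response.upper()

    async def process_exception(self, request, exception):
        self.calls.append("exception")
        return "fallback"


async def ok_downloader(request):
    return f"body of {request}"


async def failing_downloader(request):
    raise ConnectionError("simulated network failure")


if __name__ == "__main__":
    hooks = DemoHooks()
    print(asyncio.run(download_with_hooks("https://example.com", hooks, ok_downloader)))
    print(asyncio.run(download_with_hooks("https://example.com", hooks, failing_downloader)))
    print(hooks.calls)
```

In this simplified model a response recovered by `process_exception` is returned directly; a real framework may route it back through the response hooks.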
Run the spider:
aioscrapy runspider single_quotes.py
More commands:
aioscrapy -h
More examples
Documentation
Ready
Please submit your suggestions to the owner by opening an issue.
Thanks