An asyncio + aiolibs crawler that imitates the Scrapy framework
Project description
Aioscpy
Overview
The Aioscpy framework is based on the open-source projects Scrapy and scrapy_redis.
Aioscpy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
It implements dynamic variable injection and supports asynchronous coroutines.
It also supports distributed crawling and scraping.
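The overview does not spell out how dynamic variable injection works. As a rough illustration of the general idea only, and not of Aioscpy's actual mechanism, the sketch below fills a function's parameters by name from a registry of components. `REGISTRY`, `inject`, `describe`, and the values inside the registry are all hypothetical.

```python
import inspect

# Hypothetical registry of named components; Aioscpy's real one may look different.
REGISTRY = {
    "settings": {"CONCURRENT_REQUESTS": 16},
    "logger_name": "quotes",
}


def inject(func):
    """Call func, filling each parameter from REGISTRY by parameter name."""
    params = inspect.signature(func).parameters
    missing = [name for name in params if name not in REGISTRY]
    if missing:
        raise KeyError(f"no component registered for: {missing}")
    return func(**{name: REGISTRY[name] for name in params})


def describe(settings, logger_name):
    # Both arguments are supplied by inject(), not by the caller.
    return f"{logger_name}: {settings['CONCURRENT_REQUESTS']} concurrent requests"


result = inject(describe)
print(result)
```

The caller never passes `settings` or `logger_name` explicitly; the parameter names alone decide what gets supplied.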
Requirements
- Python 3.7+
- Works on Linux, Windows, macOS, BSD
Install
The quick way:
pip install aioscpy
Usage
Create a project spider:
aioscpy startproject project_quotes
cd project_quotes
aioscpy genspider quotes
quotes.py:
from aioscpy.spider import Spider


class QuotesSpider(Spider):
    name = 'quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
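The `next_page` value is typically a relative link. Assuming `response.follow` resolves it against the current page URL the way Scrapy's method of the same name does, the resolution step can be sketched with the standard library. The `next_href` value is an assumed example, not output captured from the site.

```python
from urllib.parse import urljoin

page_url = "https://quotes.toscrape.com/tag/humor/"
next_href = "/tag/humor/page/2/"  # assumed value of the li.next link

# Resolve the relative href against the URL of the page it was found on.
next_url = urljoin(page_url, next_href)
print(next_url)
```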
Create a single-script spider:
aioscpy onespider single_quotes
single_quotes.py:
from aioscpy.spider import Spider
from anti_header import Header
from pprint import pformat


class SingleQuotesSpider(Spider):
    name = 'single_quotes'
    custom_settings = {
        "SPIDER_IDLE": False
    }
    start_urls = [
        'https://quotes.toscrape.com/',
    ]

    async def process_request(self, request):
        # Replace the headers with a randomized set from anti_header.
        request.headers = Header(url=request.url, platform='windows', connection=True).random
        return request

    async def process_response(self, request, response):
        # Return the request instead of the response on 404 or 503.
        if response.status in [404, 503]:
            return request
        return response

    async def process_exception(self, request, exc):
        raise exc

    async def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    async def process_item(self, item):
        self.logger.info("{item}", **{'item': pformat(item)})


if __name__ == '__main__':
    # Instantiate the spider defined in this file.
    quotes = SingleQuotesSpider()
    quotes.start()
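The `process_request`, `process_response`, and `process_exception` hooks resemble Scrapy's downloader middleware. Assuming Aioscpy treats a request returned from `process_response` as a signal to schedule it again, the control flow can be simulated with plain asyncio. `fake_download`, `fetch`, `MAX_RETRIES`, and the dict-based request and response are hypothetical stand-ins, not Aioscpy APIs.

```python
import asyncio

RETRY_STATUSES = {404, 503}
MAX_RETRIES = 3  # hypothetical cap so the simulation always terminates


async def fake_download(request, attempt):
    """Stand-in downloader: answers 503 twice, then 200."""
    await asyncio.sleep(0)
    status = 503 if attempt < 2 else 200
    return {"url": request["url"], "status": status}


async def process_response(request, response):
    # Returning the request asks for a retry; returning the response accepts it.
    if response["status"] in RETRY_STATUSES:
        return request
    return response


async def fetch(request):
    for attempt in range(MAX_RETRIES):
        response = await fake_download(request, attempt)
        outcome = await process_response(request, response)
        if outcome is not request:
            return outcome, attempt + 1
    raise RuntimeError(f"gave up on {request['url']} after {MAX_RETRIES} attempts")


response, attempts = asyncio.run(fetch({"url": "https://quotes.toscrape.com/"}))
print(response["status"], attempts)
```

A real crawler would usually push the retried request back onto its scheduler queue instead of looping inline; the loop here only keeps the sketch short.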
Run the spider:
aioscpy crawl quotes
aioscpy runspider quotes.py
start.py:
from aioscpy import call_grace_instance
from aioscpy.utils.tools import get_project_settings


def load_file_to_execute():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.load_spider(path='./spiders')
    process.start()


def load_name_to_execute():
    process = call_grace_instance("crawler_process", get_project_settings())
    process.crawl('[spider_name]')
    process.start()
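Both functions build a crawler process and start it. As a sketch of the idea behind driving several spiders in one event loop, and assuming nothing about how Aioscpy schedules them internally, the standard library alone is enough. `run_spider` and `main` are hypothetical names, and no network access takes place.

```python
import asyncio


async def run_spider(name, pages):
    """Hypothetical spider run: 'crawls' a fixed number of pages."""
    scraped = []
    for page in range(1, pages + 1):
        await asyncio.sleep(0)  # yield control, as a real network wait would
        scraped.append(f"{name}-page-{page}")
    return scraped


async def main():
    # gather() runs both coroutines concurrently and keeps the argument order.
    return await asyncio.gather(
        run_spider("quotes", 2),
        run_spider("single_quotes", 1),
    )


results = asyncio.run(main())
print(results)
```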
More commands:
aioscpy -h
Ready
Please submit your suggestions to the owner by opening an issue.
Thanks
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file aioscpy-0.2.11.tar.gz.
File metadata
- Download URL: aioscpy-0.2.11.tar.gz
- Upload date:
- Size: 57.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.9.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | ab3426c461035e9d9034c2e8810b16bcc32b9db883deccbc1a2cd2e4f3cdff65
MD5 | 626e8580e55940ba5bd51cb577cb9e54
BLAKE2b-256 | 5b2bcff7fefd97dee232bf98067e08e1d791eb3d789e6627b4091d742546ec0c
File details
Details for the file aioscpy-0.2.11-py3-none-any.whl.
File metadata
- Download URL: aioscpy-0.2.11-py3-none-any.whl
- Upload date:
- Size: 79.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.9.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | f4848c6d5746cfcf70779ce896cba979e4d1c38fd2b9998591c016d40aba506c
MD5 | 618b7bcdf99e3a71a7d9e1520a425da0
BLAKE2b-256 | 40a3930467b880ec12be9d003e4fe71c70987f3e78e8f6c85e47b1dd1d3a82fe