Ruia - An async web scraping micro-framework based on asyncio.

## Ruia

### Overview

Ruia is an async web scraping micro-framework written with `asyncio` and `aiohttp`. It aims to make crawling URLs as convenient as possible.

Write less, run faster:

- Documentation: Chinese documentation | English documentation

### Installation

``` shell
# For Linux & Mac
pip install -U ruia[uvloop]

# For Windows
pip install -U ruia

# New features
pip install git+
```

### Usage

#### Request & Response

Ruia provides an easy way to `request` a URL and get back a friendly `response`:

``` python
import asyncio

from ruia import Request

request = Request("")
response = asyncio.get_event_loop().run_until_complete(request.fetch())

# Output
# [2018-07-25 11:23:42,620]-Request-INFO <GET:>
# <Response url[text]: status:200 metadata:{}>
```

**JavaScript Support**:

You can load JavaScript by using ruia-pyppeteer.

For example:

``` python
import asyncio

from ruia_pyppeteer import PyppeteerRequest as Request

request = Request("", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
```

#### Item

Let's take a look at a quick example of using `Item` to extract target data. Start off by adding the following to a Python script:

``` python
import asyncio

from ruia import AttrField, TextField, Item

class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')


items = asyncio.get_event_loop().run_until_complete(HackerNewsItem.get_items(url=""))
for item in items:
    print(item.title, item.url)
```

Run the script with `python`:

``` shell
Notorious ‘Hijack Factory’ Shunned from Web
```
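
To make the idea behind `Item` more concrete, here is a minimal standard-library sketch of the same kind of extraction: for every `tr.athing` row, take the text and `href` of its `a.storylink` anchor. It does not use ruia at all. `StoryLinkParser`, `extract_items`, and `SAMPLE_HTML` are hypothetical names and data invented for illustration, and ruia's own selector handling is assumed to be considerably more general.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, shaped like the rows the selectors above target.
SAMPLE_HTML = """
<table>
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">First story</a></td></tr>
  <tr class="athing"><td><a class="storylink" href="https://example.com/b">Second story</a></td></tr>
  <tr class="spacer"><td><a class="storylink" href="https://example.com/c">Ignored</a></td></tr>
</table>
"""


class StoryLinkParser(HTMLParser):
    """Collect title/url pairs from a.storylink anchors inside tr.athing rows."""

    def __init__(self):
        super().__init__()
        self.items = []        # extracted {"title": ..., "url": ...} dicts
        self._in_row = False   # True while inside a <tr class="athing">
        self._current = None   # item being built while inside a matching anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "tr":
            self._in_row = "athing" in classes
        elif tag == "a" and self._in_row and "storylink" in classes:
            self._current = {"title": "", "url": attrs.get("href")}

    def handle_data(self, data):
        # Text only counts while a matching anchor is open.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.items.append(self._current)
            self._current = None
        elif tag == "tr":
            self._in_row = False


def extract_items(html):
    """Parse the markup and return the collected items."""
    parser = StoryLinkParser()
    parser.feed(html)
    return parser.items


items = extract_items(SAMPLE_HTML)
for item in items:
    print(item["title"], item["url"])
```

The declarative `Item` class replaces all of this bookkeeping with one field per value to extract, which is the main convenience it offers.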

#### Spider

To crawl multiple pages, use `Spider`.

Create a spider script:

``` python
import aiofiles

from ruia import AttrField, TextField, Item, Spider

class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        # Hook for post-processing the extracted title; returned unchanged here
        return value


class HackerNewsSpider(Spider):
    start_urls = ['', '']

    async def parse(self, res):
        items = await HackerNewsItem.get_items(html=res.html)
        for item in items:
            # Append each title to a local file without blocking the event loop
            async with aiofiles.open('./hacker_news.txt', 'a') as f:
                await f.write(item.title + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()
```

Run the spider:

``` shell
[2018-09-21 17:27:14,497]-ruia-INFO spider::l54: Spider started!
[2018-09-21 17:27:14,502]-Request-INFO request::l77: <GET:>
[2018-09-21 17:27:14,527]-Request-INFO request::l77: <GET:>
[2018-09-21 17:27:16,388]-ruia-INFO spider::l122: Stopping spider: ruia
[2018-09-21 17:27:16,389]-ruia-INFO spider::l68: Total requests: 2
[2018-09-21 17:27:16,389]-ruia-INFO spider::l71: Time usage: 0:00:01.891688
[2018-09-21 17:27:16,389]-ruia-INFO spider::l72: Spider finished!
```
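
The flow that `Spider` automates can be approximated in plain `asyncio`: request every entry in `start_urls` concurrently, then hand each response to a `parse` callback. The sketch below only illustrates that pattern and is not ruia's implementation. `fake_fetch`, `crawl`, the `concurrency` parameter, and the `example.com` URLs are hypothetical stand-ins, and no network access takes place.

```python
import asyncio


async def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request; returns canned HTML."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html><title>{url}</title></html>"


async def parse(url, html, results):
    """Callback run once per response; here it only records the page size."""
    results[url] = len(html)


async def crawl(start_urls, concurrency=2):
    """Fetch all start URLs concurrently, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    results = {}

    async def worker(url):
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            html = await fake_fetch(url)
        await parse(url, html, results)

    await asyncio.gather(*(worker(url) for url in start_urls))
    return results


start_urls = ["https://example.com/?p=1", "https://example.com/?p=2"]
results = asyncio.run(crawl(start_urls))
print(f"Total requests: {len(results)}")
```

Because the workers overlap while waiting on I/O, the total time stays close to that of the slowest request rather than the sum of all requests.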

#### Custom middleware

`ruia` provides an easy way to customize requests through middleware. A middleware function modifies the request it receives in place and *must not return it*.

The following middleware code is based on the above example:

``` python
from ruia import Middleware

middleware = Middleware()

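@middleware.request
async def print_on_request(request):
    request.metadata = {
        'index': request.url.split('=')[-1]
    }
    print(f"request: {request.metadata}")
    # Only modify the request object; do not return anything


@middleware.response
async def print_on_response(request, response):
    print(f"response: {response.metadata}")

# Add HackerNewsSpider

if __name__ == '__main__':
    HackerNewsSpider.start(middleware=middleware)
```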
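
A common way to build this kind of middleware is to register hook functions through a decorator and call them on each request before it is sent. The example below is a simplified standard-library illustration, not ruia's actual `Middleware` class: `MiniMiddleware`, `FakeRequest`, `apply`, and `add_index` are hypothetical names. It shows why a hook does not need to return anything: it mutates the request object it is given.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical request object with a URL and a mutable metadata dict."""
    url: str
    metadata: dict = field(default_factory=dict)


class MiniMiddleware:
    """Keeps a list of request hooks registered through a decorator."""

    def __init__(self):
        self.request_hooks = []

    def request(self, func):
        # Register the hook and hand the function back unchanged.
        self.request_hooks.append(func)
        return func

    async def apply(self, request):
        # Hooks change `request` in place; their return values are ignored.
        for hook in self.request_hooks:
            await hook(request)


middleware = MiniMiddleware()


@middleware.request
async def add_index(request):
    # Same idea as print_on_request: derive metadata from the URL.
    request.metadata["index"] = request.url.split("=")[-1]


request = FakeRequest(url="https://example.com/news?p=2")
asyncio.run(middleware.apply(request))
print(request.metadata)
```

Ignoring return values keeps the contract simple: every hook sees the same object, so hooks can be chained without passing results from one to the next.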

### Features

- Custom middleware
- JavaScript support
- Friendly response

### TODO

- [ ] Distributed crawling/scraping

### Contribution

- Pull Request
- Open Issue

### Thanks

- sanic
- demiurge

