
Ruia - An async web scraping micro-framework based on asyncio.

Project description



An async web scraping micro-framework, written with asyncio and aiohttp, that aims to make crawling URLs as convenient as possible.

Write less, run faster:


  • Easy: Declarative programming
  • Fast: Powered by asyncio
  • Extensible: Middlewares and plugins
  • Powerful: JavaScript support


Installation

# For Linux & Mac
pip install -U ruia[uvloop]

# For Windows
pip install -U ruia

# For the latest features, install from the Git repository
pip install git+



Item

Item can be used standalone, for testing and for tiny crawlers.

import asyncio

from ruia import AttrField, TextField, Item

class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

async def main():
    items = await HackerNewsItem.get_items(url='https://news.ycombinator.com/news?p=1')
    for item in items:
        print(item.title, item.url)

asyncio.run(main())  # Python 3.7+

Run the script, and you will see output like:

Notorious ‘Hijack Factory’ Shunned from Web
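Under the hood, the declarative style amounts to collecting the Field attributes declared on an Item subclass. A stdlib-only sketch of that idea (not Ruia's actual implementation; all names here are illustrative):

```python
class Field:
    def __init__(self, css_select):
        self.css_select = css_select  # selector to apply against the HTML

class Item:
    @classmethod
    def fields(cls):
        # Collect every Field declared on the subclass, mimicking how
        # declarative item definitions are discovered.
        return {name: value for name, value in vars(cls).items()
                if isinstance(value, Field)}

class HackerNewsItem(Item):
    title = Field(css_select='a.storylink')
    url = Field(css_select='a.storylink')

print(sorted(HackerNewsItem.fields()))  # ['title', 'url']
```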

Spider Control

Spider is used for better control over requests. It supports concurrency control, which is very important for crawlers.

import aiofiles

from ruia import AttrField, TextField, Item, Spider

class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        """Define clean_* functions for data cleaning"""
        return value.strip()

class HackerNewsSpider(Spider):
    start_urls = [f'https://news.ycombinator.com/news?p={index}' for index in range(1, 3)]

    async def parse(self, response):
        items = await HackerNewsItem.get_items(html=response.html)
        for item in items:
            async with aiofiles.open('./hacker_news.txt', mode='a', encoding='utf-8') as f:
                await f.write(item.title + '\n')

if __name__ == '__main__':
    HackerNewsSpider.start()


[2018-09-21 17:27:14,497]-ruia-INFO  spider::l54: Spider started!
[2018-09-21 17:27:14,502]-Request-INFO  request::l77: <GET:>
[2018-09-21 17:27:14,527]-Request-INFO  request::l77: <GET:>
[2018-09-21 17:27:16,388]-ruia-INFO  spider::l122: Stopping spider: ruia
[2018-09-21 17:27:16,389]-ruia-INFO  spider::l68: Total requests: 2
[2018-09-21 17:27:16,389]-ruia-INFO  spider::l71: Time usage: 0:00:01.891688
[2018-09-21 17:27:16,389]-ruia-INFO  spider::l72: Spider finished!
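Concurrency control of this kind is commonly implemented by gating requests behind a semaphore. A stdlib-only sketch of the idea (illustrative names, not Ruia's internals):

```python
import asyncio

async def fetch(url, semaphore):
    # The semaphore caps how many fetches run at once; extra tasks wait here.
    async with semaphore:
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        return f'body of {url}'

async def crawl(urls, concurrency=3):
    semaphore = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(fetch(url, semaphore) for url in urls))

results = asyncio.run(crawl([f'page-{i}' for i in range(5)]))
print(results[0])  # body of page-0
```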

Custom middleware

Ruia provides an easy way to customize requests through middleware.

The following middleware is based on the above example:

from ruia import Middleware

middleware = Middleware()

@middleware.request
async def print_on_request(request):
    request.metadata = {
        'index': request.url.split('=')[-1]
    }
    print(f"request: {request.metadata}")
    # Just operate on the request object; do not return anything.

@middleware.response
async def print_on_response(request, response):
    print(f"response: {response.metadata}")

# Add HackerNewsSpider from the example above

if __name__ == '__main__':
    HackerNewsSpider.start(middleware=middleware)
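The hook pattern behind Middleware is easy to see in isolation. This stdlib-only sketch (hypothetical names, not Ruia's implementation) shows how registered coroutines run around a fetch:

```python
import asyncio

class Middleware:
    """Collect coroutines to run before each request and after each response."""

    def __init__(self):
        self.request_hooks = []
        self.response_hooks = []

    def request(self, fn):
        self.request_hooks.append(fn)
        return fn

    def response(self, fn):
        self.response_hooks.append(fn)
        return fn

middleware = Middleware()
log = []

@middleware.request
async def print_on_request(request):
    request['metadata'] = {'index': request['url'].split('=')[-1]}
    log.append(f"request: {request['metadata']}")

@middleware.response
async def print_on_response(request, response):
    log.append(f"response: {response['status']}")

async def fetch(request):
    # Run request hooks, do the (faked) fetch, then run response hooks.
    for hook in middleware.request_hooks:
        await hook(request)
    response = {'status': 200}  # stand-in for a real HTTP response
    for hook in middleware.response_hooks:
        await hook(request, response)
    return response

asyncio.run(fetch({'url': 'https://news.ycombinator.com/news?p=1'}))
print(log)
```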

JavaScript Support

You can load JavaScript-rendered pages by using ruia-pyppeteer.

For example:

import asyncio

from ruia_pyppeteer import PyppeteerRequest as Request

# The target URL was omitted in the original; substitute your own.
request = Request('https://www.example.com', load_js=True)
response = asyncio.run(request.fetch())  # Python 3.7


TODO

  • Cache for debugging, to reduce pressure from request limits
  • Distributed crawling/scraping


Contribution

Ruia is still under development; feel free to open issues and pull requests:

  • Report or fix bugs
  • Request or publish plugins
  • Write or fix documentation
  • Add test cases



Download files

Download the file for your platform.

Files for ruia, version 0.2.0:

  • ruia-0.2.0-py3-none-any.whl (19.6 kB), Wheel, Python py3
  • ruia-0.2.0.tar.gz (16.4 kB), Source distribution
