An async web scraping micro-framework based on asyncio.

Overview

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, that aims to make crawling URLs as convenient as possible.

Write less, run faster.

Features

  • Easy: Declarative programming

  • Fast: Powered by asyncio

  • Extensible: Middlewares and plugins

  • Powerful: JavaScript support

Installation

# For Linux & Mac
pip install -U ruia[uvloop]

# For Windows
pip install -U ruia

# Latest development version (new features)
pip install git+https://github.com/howie6879/ruia

Tutorials

  1. Overview

  2. Installation

  3. Define Data Items

  4. Spider Control

  5. Request & Response

  6. Customize Middleware

  7. Write a Plugin

Usage

Item

Item can be used standalone, for testing, and for tiny crawlers.

import asyncio

from ruia import AttrField, TextField, Item


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

async def main():
    async for item in HackerNewsItem.get_items(url="https://news.ycombinator.com/"):
        print(item.title, item.url)

if __name__ == '__main__':
    # asyncio.run() requires Python 3.7+; main() prints the items and returns nothing.
    asyncio.run(main())

Run: python demo.py

Notorious ‘Hijack Factory’ Shunned from Web https://krebsonsecurity.com/2018/07/notorious-hijack-factory-shunned-from-web/
 ......

Spider Control

Spider is used to control requests. It supports concurrency control, which is very important for crawlers.

import aiofiles

from ruia import AttrField, TextField, Item, Spider


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        """Define clean_* functions for data cleaning"""
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [f'https://news.ycombinator.com/news?p={index}' for index in range(1, 3)]

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=response.html):
            yield item

    async def process_item(self, item: HackerNewsItem):
        """Ruia built-in method"""
        async with aiofiles.open('./hacker_news.txt', 'a') as f:
            await f.write(str(item.title) + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()

Run hacker_news_spider.py:

[2018-09-21 17:27:14,497]-ruia-INFO  spider::l54: Spider started!
[2018-09-21 17:27:14,502]-Request-INFO  request::l77: <GET: https://news.ycombinator.com/news?p=2>
[2018-09-21 17:27:14,527]-Request-INFO  request::l77: <GET: https://news.ycombinator.com/news?p=1>
[2018-09-21 17:27:16,388]-ruia-INFO  spider::l122: Stopping spider: ruia
[2018-09-21 17:27:16,389]-ruia-INFO  spider::l68: Total requests: 2
[2018-09-21 17:27:16,389]-ruia-INFO  spider::l71: Time usage: 0:00:01.891688
[2018-09-21 17:27:16,389]-ruia-INFO  spider::l72: Spider finished!
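
The idea behind concurrency control can be sketched with the standard library alone. The block below is only an illustration, not Ruia's implementation: `fetch`, `crawl`, and the `concurrency` parameter are hypothetical names, and the download is simulated with `asyncio.sleep` instead of a real HTTP call. It shows how an `asyncio.Semaphore` caps the number of requests in flight.

```python
import asyncio


async def fetch(url, semaphore, state):
    """Pretend to download `url` while holding one slot of the semaphore."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stands in for the network round trip
        state["active"] -= 1
        return f"<html>{url}</html>"


async def crawl(urls, concurrency=3):
    """Fetch every URL, never running more than `concurrency` at once."""
    semaphore = asyncio.Semaphore(concurrency)
    state = {"active": 0, "peak": 0}  # bookkeeping to observe the limit
    pages = await asyncio.gather(*(fetch(url, semaphore, state) for url in urls))
    return pages, state["peak"]


if __name__ == "__main__":
    urls = [f"https://news.ycombinator.com/news?p={i}" for i in range(1, 11)]
    pages, peak = asyncio.run(crawl(urls, concurrency=3))
    print(len(pages), "pages fetched, peak concurrency:", peak)
```

All ten coroutines are scheduled immediately, but only three can hold the semaphore at a time, so a target site never sees more than three simultaneous requests from this crawler.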

Custom middleware

Ruia provides an easy way to customize requests through middleware.

The following middleware is based on the above example:

from ruia import Middleware

middleware = Middleware()


@middleware.request
async def print_on_request(request):
    request.metadata = {
        'index': request.url.split('=')[-1]
    }
    print(f"request: {request.metadata}")
    # Just modify the request object in place; do not return anything.


@middleware.response
async def print_on_response(request, response):
    print(f"response: {response.metadata}")

# Add HackerNewsSpider

if __name__ == '__main__':
    HackerNewsSpider.start(middleware=middleware)
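
To make the decorator pattern above more concrete, here is a minimal standard-library sketch of how a hook registry of this kind could work. It is an assumption-laden illustration, not Ruia's `Middleware` class: `MiniMiddleware`, `fetch_with_hooks`, and the dict-based request and response objects are all hypothetical, and no real HTTP request is made.

```python
import asyncio


class MiniMiddleware:
    """Hypothetical stand-in for a hook registry; not Ruia's Middleware class."""

    def __init__(self):
        self.request_hooks = []
        self.response_hooks = []

    def request(self, func):
        """Decorator: register a coroutine to run before a request is sent."""
        self.request_hooks.append(func)
        return func

    def response(self, func):
        """Decorator: register a coroutine to run after a response arrives."""
        self.response_hooks.append(func)
        return func


async def fetch_with_hooks(url, middleware):
    """Run request hooks, fake a download, then run response hooks."""
    request = {"url": url, "metadata": {}}
    for hook in middleware.request_hooks:
        await hook(request)
    # A real framework would perform the HTTP call here.
    response = {"url": url, "metadata": request["metadata"], "html": "<html></html>"}
    for hook in middleware.response_hooks:
        await hook(request, response)
    return response


middleware = MiniMiddleware()


@middleware.request
async def tag_page_index(request):
    # Mutate the request in place and return nothing.
    request["metadata"] = {"index": request["url"].split("=")[-1]}


@middleware.response
async def copy_index(request, response):
    # The metadata set on the request is visible on the response.
    response["seen_index"] = response["metadata"]["index"]


if __name__ == "__main__":
    url = "https://news.ycombinator.com/news?p=2"
    result = asyncio.run(fetch_with_hooks(url, middleware))
    print(result["seen_index"])
```

The decorators only append the coroutine to a list and return it unchanged, which is why a hook can still be called or tested on its own.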

JavaScript Support

You can load JavaScript-rendered pages by using ruia-pyppeteer.

For example:

import asyncio

from ruia_pyppeteer import PyppeteerRequest as Request

request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.run(request.fetch())  # asyncio.run() requires Python 3.7+
print(response.html)

TODO

  • Cache for debugging, to reduce repeated requests and avoid hitting request limits

  • Distributed crawling/scraping

Contribution

Ruia is still under development; feel free to open issues and pull requests:

  • Report or fix bugs

  • Request or publish plugins

  • Write or fix documentation

  • Add test cases

Thanks
