ruia · PyPI

An async web scraping micro-framework based on asyncio.

Project description

Overview

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, that aims to make crawling URLs as convenient as possible.

Write less, run faster:

Documentation: Chinese documentation | documentation
Awesome: https://github.com/ruia-plugins/awesome-ruia
Organization: https://github.com/ruia-plugins

Features

Easy: Declarative programming
Fast: Powered by asyncio
Extensible: Middlewares and plugins
Powerful: JavaScript support

Installation

# For Linux & Mac
pip install -U ruia[uvloop]

# For Windows
pip install -U ruia

# Latest development version from GitHub
pip install git+https://github.com/howie6879/ruia

Tutorials

Usage

Item

Item can be used standalone, for testing, and for tiny crawlers.

import asyncio

from ruia import AttrField, TextField, Item


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

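async def main():
    # get_items fetches the page and yields one item per target_item match
    async for item in HackerNewsItem.get_items(url="https://news.ycombinator.com/"):
        print(item.title, item.url)

if __name__ == '__main__':
    # main() only prints, so there is no return value to keep
    asyncio.run(main())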

Run: python demo.py

Notorious ‘Hijack Factory’ Shunned from Web https://krebsonsecurity.com/2018/07/notorious-hijack-factory-shunned-from-web/
 ......
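
To make the declarative style above more concrete, here is a toy, standard-library-only sketch of the same idea: fields are declared as class attributes, a `target_item` field splits the page into blocks, and optional `clean_*` methods post-process each value. It is not ruia's implementation. `RegexField`, `MiniItem` and `StoryItem` are hypothetical names, the sketch is synchronous, and it uses regular expressions on a hand-written HTML snippet where ruia uses CSS selectors.

```python
import re
from typing import Dict, Iterator, Optional


class RegexField:
    """Hypothetical field: extracts the first regex group from a text block."""

    def __init__(self, pattern: str) -> None:
        self.pattern = re.compile(pattern, re.S)

    def extract(self, text: str) -> Optional[str]:
        match = self.pattern.search(text)
        return match.group(1) if match else None


class MiniItem:
    """Hypothetical base class that collects fields declared on subclasses."""

    target_item: RegexField  # splits the page into one block per item
    _fields: Dict[str, RegexField] = {}

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Remember every declared field except the block splitter itself.
        cls._fields = {
            name: value
            for name, value in vars(cls).items()
            if isinstance(value, RegexField) and name != "target_item"
        }

    @classmethod
    def get_items(cls, html: str) -> Iterator["MiniItem"]:
        for block in cls.target_item.pattern.findall(html):
            item = cls()
            for name, field in cls._fields.items():
                value = field.extract(block)
                cleaner = getattr(item, f"clean_{name}", None)
                if cleaner is not None and value is not None:
                    value = cleaner(value)
                setattr(item, name, value)
            yield item


class StoryItem(MiniItem):
    target_item = RegexField(r'<tr class="athing">(.*?)</tr>')
    title = RegexField(r'<a class="storylink"[^>]*>(.*?)</a>')
    url = RegexField(r'<a class="storylink" href="([^"]*)"')

    def clean_title(self, value: str) -> str:
        return value.strip()


# Made-up sample markup, not a real Hacker News response.
SAMPLE_HTML = """
<table>
<tr class="athing"><td><a class="storylink" href="https://example.com/a"> First story </a></td></tr>
<tr class="athing"><td><a class="storylink" href="https://example.com/b">Second story</a></td></tr>
</table>
"""

if __name__ == "__main__":
    for story in StoryItem.get_items(SAMPLE_HTML):
        print(story.title, story.url)
```

The point of the pattern is that a crawler author only lists what to extract; the base class decides how to iterate over the page and when to call the cleaning hooks.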

Spider Control

Spider gives you finer control over requests. It supports concurrency control, which matters for any crawler that should not overload its target site.

import aiofiles

from ruia import AttrField, TextField, Item, Spider


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        """Define clean_* functions for data cleaning"""
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [f'https://news.ycombinator.com/news?p={index}' for index in range(1, 3)]

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=response.html):
            yield item

    async def process_item(self, item: HackerNewsItem):
        """Ruia built-in method"""
        async with aiofiles.open('./hacker_news.txt', 'a') as f:
            await f.write(str(item.title) + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()

Run: python hacker_news_spider.py

[2018-09-21 17:27:14,497]-ruia-INFO  spider::l54: Spider started!
[2018-09-21 17:27:14,502]-Request-INFO  request::l77: <GET: https://news.ycombinator.com/news?p=2>
[2018-09-21 17:27:14,527]-Request-INFO  request::l77: <GET: https://news.ycombinator.com/news?p=1>
[2018-09-21 17:27:16,388]-ruia-INFO  spider::l122: Stopping spider: ruia
[2018-09-21 17:27:16,389]-ruia-INFO  spider::l68: Total requests: 2
[2018-09-21 17:27:16,389]-ruia-INFO  spider::l71: Time usage: 0:00:01.891688
[2018-09-21 17:27:16,389]-ruia-INFO  spider::l72: Spider finished!
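
The text above mentions concurrency control without showing how a limit works. As a general illustration (not ruia's scheduler), the sketch below caps the number of in-flight requests with an `asyncio.Semaphore`. `fake_fetch` and `crawl` are hypothetical names, and the "request" is a short sleep, so nothing touches the network.

```python
import asyncio
from typing import Dict, List


async def fake_fetch(url: str, sem: asyncio.Semaphore, stats: Dict[str, int]) -> str:
    """Stand-in for an HTTP request; sleeps instead of using the network."""
    async with sem:  # at most `concurrency` coroutines pass this point at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)
        stats["active"] -= 1
        return f"<html>{url}</html>"


async def crawl(urls: List[str], concurrency: int = 3) -> Dict[str, int]:
    """Fetch all URLs while never exceeding `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    stats = {"active": 0, "peak": 0}
    pages = await asyncio.gather(*(fake_fetch(u, sem, stats) for u in urls))
    stats["pages"] = len(pages)
    return stats


if __name__ == "__main__":
    urls = [f"https://news.ycombinator.com/news?p={i}" for i in range(1, 11)]
    print(asyncio.run(crawl(urls, concurrency=3)))
```

All ten tasks are scheduled immediately, but the semaphore makes the extra ones wait, so the recorded peak never exceeds the configured limit.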

Custom middleware

Ruia provides an easy way to customize requests through middleware.

The following middleware builds on the example above:

from ruia import Middleware

middleware = Middleware()


@middleware.request
async def print_on_request(request):
    request.metadata = {
        'index': request.url.split('=')[-1]
    }
    print(f"request: {request.metadata}")
    # Modify the request object in place; do not return anything.


@middleware.response
async def print_on_response(request, response):
    print(f"response: {response.metadata}")

# Define HackerNewsSpider here, as in the previous example

if __name__ == '__main__':
    HackerNewsSpider.start(middleware=middleware)
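
The decorator style used above can be pictured as a small hook registry: each decorator stores the coroutine, and the fetch routine awaits the stored hooks before and after the request. The following is a minimal sketch of that pattern under stated assumptions, not ruia's `Middleware` class. `MiniMiddleware` and `fetch_with_hooks` are hypothetical names, and the response is built locally instead of being downloaded.

```python
import asyncio
from types import SimpleNamespace
from typing import Awaitable, Callable, List

Hook = Callable[..., Awaitable[None]]


class MiniMiddleware:
    """Hypothetical registry that stores hooks via decorators."""

    def __init__(self) -> None:
        self.request_hooks: List[Hook] = []
        self.response_hooks: List[Hook] = []

    def request(self, func: Hook) -> Hook:
        self.request_hooks.append(func)
        return func

    def response(self, func: Hook) -> Hook:
        self.response_hooks.append(func)
        return func


async def fetch_with_hooks(url: str, middleware: MiniMiddleware) -> SimpleNamespace:
    request = SimpleNamespace(url=url, metadata={})
    for hook in middleware.request_hooks:
        await hook(request)  # hooks mutate the request and return nothing
    # No network call here: the fake response just carries the request metadata.
    response = SimpleNamespace(url=url, metadata=request.metadata)
    for hook in middleware.response_hooks:
        await hook(request, response)
    return response


mini = MiniMiddleware()


@mini.request
async def tag_page_index(request):
    request.metadata = {"index": request.url.split("=")[-1]}


@mini.response
async def mark_seen(request, response):
    response.metadata["seen"] = True


if __name__ == "__main__":
    result = asyncio.run(fetch_with_hooks("https://news.ycombinator.com/news?p=2", mini))
    print(result.metadata)
```

Because hooks only mutate the objects they receive, several of them can be stacked without having to agree on a return type.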

JavaScript Support

You can load JavaScript-rendered pages by using ruia-pyppeteer.

For example:

import asyncio

from ruia_pyppeteer import PyppeteerRequest as Request

request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.run(request.fetch())  # asyncio.run requires Python 3.7+
print(response.html)

TODO

Cache for debugging, to avoid hitting request limits
Distributed crawling/scraping

Contribution

Ruia is still under development; feel free to open issues and pull requests:

Report or fix bugs
Request or publish plugins
Write or fix documentation
Add test cases

Thanks

Release history

Version	Release date
0.8.5	Sep 6, 2022
0.8.4	May 7, 2021
0.8.3	Apr 9, 2021
0.8.2	Feb 6, 2021
0.8.0	Dec 27, 2020
0.7.1	Dec 5, 2020
0.7.0	Oct 21, 2020
0.6.9	Aug 15, 2020
0.6.8	Jun 7, 2020
0.6.7	Mar 22, 2020
0.6.6	Mar 18, 2020
0.6.5	Feb 23, 2020
0.6.4	Jan 1, 2020
0.6.3	Oct 18, 2019
0.6.2	Aug 27, 2019
0.6.1	Jul 12, 2019
0.6.0	Jun 15, 2019
0.5.9	Jun 10, 2019
0.5.8	May 4, 2019
0.5.7	Mar 13, 2019
0.5.6	Mar 13, 2019
0.5.5	Feb 23, 2019
0.5.4	Feb 22, 2019
0.5.3	Feb 18, 2019
0.5.2	Feb 17, 2019
0.5.1	Feb 16, 2019
0.5.0	Feb 14, 2019
0.4.9	Feb 14, 2019
0.4.8	Feb 14, 2019
0.4.7	Feb 10, 2019
0.4.6	Feb 9, 2019
0.4.5	Jan 28, 2019
0.4.4 (this version)	Jan 28, 2019
0.4.3	Jan 26, 2019
0.4.2	Jan 25, 2019
0.4.1	Jan 25, 2019
0.4.0	Jan 25, 2019
0.3.3	Jan 24, 2019
0.3.2	Jan 23, 2019
0.3.1	Jan 23, 2019
0.3.0	Jan 23, 2019
0.2.1	Jan 22, 2019
0.2.0	Jan 22, 2019
0.1.3	Jan 19, 2019
0.1.2	Jan 16, 2019
0.1.1	Jan 16, 2019
0.1.0	Dec 28, 2018
0.0.9	Dec 20, 2018
0.0.8	Nov 30, 2018
0.0.7	Nov 30, 2018
0.0.5	Nov 23, 2018
0.0.4	Nov 23, 2018
0.0.3	Nov 22, 2018
0.0.2	Oct 17, 2018
0.0.1	Oct 17, 2018

Download files

Source Distribution

ruia-0.4.4.tar.gz (18.1 kB)

Uploaded Jan 28, 2019 Source

Built Distribution

ruia-0.4.4-py2.py3-none-any.whl (21.4 kB)

Uploaded Jan 28, 2019 Python 2, Python 3

File details

Details for the file ruia-0.4.4.tar.gz.

File metadata

Download URL: ruia-0.4.4.tar.gz
Upload date: Jan 28, 2019
Size: 18.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/3.6.7

File hashes

Hashes for ruia-0.4.4.tar.gz
Algorithm	Hash digest
SHA256	`3783c9fe0fd42a3afb89201d316e0f35520d7cdeb9853e270902cd6b1f3bdf4f`
MD5	`2028edaf23548acafcfc852e9ce7d5e7`
BLAKE2b-256	`2f3ca13d5b372cd33823f8fbc757dbcbd8c3bda04fa3cab598c8b15dd7b65a6e`
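
A downloaded archive can be checked against the SHA256 digest listed above with the standard library alone. The sketch below assumes the file was saved as `ruia-0.4.4.tar.gz` in the current directory; `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for ruia-0.4.4.tar.gz
EXPECTED_SHA256 = "3783c9fe0fd42a3afb89201d316e0f35520d7cdeb9853e270902cd6b1f3bdf4f"


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large archives need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("ruia-0.4.4.tar.gz")  # assumed download location
    if archive.exists():
        print("hash ok:", matches(archive))
    else:
        print("archive not found; download it first")
```

The same helpers work for the wheel file by passing its own digest as `expected`.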

File details

Details for the file ruia-0.4.4-py2.py3-none-any.whl.

File metadata

Download URL: ruia-0.4.4-py2.py3-none-any.whl
Upload date: Jan 28, 2019
Size: 21.4 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/3.6.7

File hashes

Hashes for ruia-0.4.4-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`4d2dcdbb5e2aa0d1424b240d53a727e43cf49bce8e92f5f89d54a6f66fb42a26`
MD5	`b72e87dd1a68d3a9c3be09d1ec0e4928`
BLAKE2b-256	`e1958c9035cc200ded37234dbc10836b09a176f1bf9dec3dd0abbe25b3883002`
