
Python Library to Build Web Robots


Python Web Robot Builder


Forked from ettoreleandrotognoli/etto-robot

The main idea of py-robot is to simplify the code of web crawlers and improve their performance.

Install

pip install git+https://github.com/ettoreleandrotognoli/py-robot

Intro

Below is a simple crawler that fetches a page and then, for each item of interest, fetches another page. Because it does not use asynchronous requests, each request starts only after the previous one has finished.

# examples/iot_eetimes.py

import requests
import json

from lxml import html
from pyquery.pyquery import PyQuery as pq

page = requests.get('https://iot.eetimes.com/')
dom = pq(html.fromstring(page.content.decode()))

result = []
for link in dom.find('.theiaStickySidebar ul li'):
    news = {
        'category': pq(link).find('span').text(),
        'url': pq(link).find('a[href]').attr('href'),
    }
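    # Fetch the linked article; this blocks until the response arrives.
    news_page = requests.get(news['url'])
    # Use a separate name so the listing page's `dom` is not overwritten.
    news_dom = pq(news_page.content.decode())
    news['body'] = news_dom.find('p').text()
    news['title'] = news_dom.find('h1.post-title').text()
    result.append(news)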

print(json.dumps(result, indent=4))

We can rewrite it using py-robot, and it will look like this:

# examples/iot_eetimes2.py

import json
from robot import Robot
from robot.collector.shortcut import *
import logging

logging.basicConfig(level=logging.DEBUG)

collector = pipe(
    const('https://iot.eetimes.com/'),
    get(),
    css('.theiaStickySidebar ul li'),
    foreach(dict(
        pipe(
            css('a[href]'), attr('href'), any(),
            get(),
            dict(
                body=pipe(css('p'), as_text()),
                title=pipe(css('h1.post-title'), as_text()),
            )
        ),
        category=pipe(css('span'), as_text()),
        url=pipe(css('a[href]'), attr('href'), any(), url())
    ))
)

with Robot() as robot:
    result = robot.sync_run(collector)
print(json.dumps(result, indent=4))

Now all the requests are asynchronous: the requests for all items start at the same time instead of one after another, which improves the performance of the crawler.
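
To illustrate why starting the per-item requests together helps, here is a minimal sketch using only the standard library's asyncio. It is not py-robot's implementation: fake_get, fetch_sequential, fetch_concurrent and timed are hypothetical names, the example.com URLs are placeholders, and each "request" is simulated with a sleep so the script runs without network access.

```python
import asyncio
import time


# Hypothetical stand-in for an HTTP GET: it only sleeps, so no network is needed.
async def fake_get(url, delay=0.1):
    await asyncio.sleep(delay)
    return f'<html>{url}</html>'


async def fetch_sequential(urls):
    # Each request starts only after the previous one has finished.
    return [await fake_get(url) for url in urls]


async def fetch_concurrent(urls):
    # All requests are started together; gather keeps the input order.
    return await asyncio.gather(*(fake_get(url) for url in urls))


def timed(coro_fn, urls):
    # Run one strategy to completion and measure the elapsed wall-clock time.
    start = time.perf_counter()
    pages = asyncio.run(coro_fn(urls))
    return pages, time.perf_counter() - start


# Placeholder URLs; nothing is actually requested.
urls = [f'https://example.com/item/{i}' for i in range(5)]
seq_pages, seq_time = timed(fetch_sequential, urls)
con_pages, con_time = timed(fetch_concurrent, urls)
print(f'sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s')
```

With five simulated requests, the sequential version waits for the delays one after another, while the concurrent version waits for them together, so its total time stays close to that of a single request. Both return the same pages in the same order.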

