Python Library to Build Web Robots

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Framework
- AsyncIO
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Internet :: WWW/HTTP :: Indexing/Search

Project description

Python Web Robot Builder

The main idea of py-robot is to simplify the code of web crawlers and improve their performance.

Install

pip install ciag-robot

Intro

Below is a simple example of a crawler that fetches a page and then, for each item found on it, fetches another page. Because it does not use async requests, it makes one request at a time and starts the next only when the previous one has finished.

# examples/iot_eetimes.py

import json

import requests
from lxml import html
from pyquery.pyquery import PyQuery as pq

# Fetch the index page and parse it.
page = requests.get('https://iot.eetimes.com/')
dom = pq(html.fromstring(page.content.decode()))

result = []
for link in dom.find('.theiaStickySidebar ul li'):
    # Collect the category and article URL from each sidebar item.
    news = {
        'category': pq(link).find('span').text(),
        'url': pq(link).find('a[href]').attr('href'),
    }
    # Blocking request: the loop waits here before moving to the next item.
    news_page = requests.get(news['url'])
    # Use a separate name so the index page's dom is not overwritten.
    news_dom = pq(news_page.content.decode())
    news['body'] = news_dom.find('p').text()
    news['title'] = news_dom.find('h1.post-title').text()
    result.append(news)

print(json.dumps(result, indent=4))

We can rewrite it using py-robot, and it will look like this:

# examples/iot_eetimes2.py

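import json
import logging

from robot import Robot
# The star import provides the collector helpers used below. Note that dict and
# any, as used here, are py-robot collectors that shadow the Python builtins.
from robot.collector.shortcut import *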

logging.basicConfig(level=logging.DEBUG)

collector = pipe(
    const('https://iot.eetimes.com/'),
    get(),
    css('.theiaStickySidebar ul li'),
    foreach(dict(
        pipe(
            css('a[href]'), attr('href'), any(),
            get(),
            dict(
                body=pipe(css('p'), as_text()),
                title=pipe(css('h1.post-title'), as_text()),
            )
        ),
        category=pipe(css('span'), as_text()),
        url=pipe(css('a[href]'), attr('href'), any(), url())
    ))
)

with Robot() as robot:
    result = robot.sync_run(collector)
print(json.dumps(result, indent=4))

Now all the requests are asynchronous: the requests for every item are started at the same time instead of one after another, which improves the performance of the crawler.
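To make the difference concrete, here is a minimal sketch using only the standard asyncio module. It is not py-robot's implementation, and it makes no real HTTP requests: fake_fetch is a hypothetical stand-in that just sleeps for a fixed delay, and the example.com URLs are placeholders. It only illustrates why starting all requests together finishes sooner than awaiting them one by one.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for an HTTP GET: waits, then returns a fake body."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_sequentially(urls):
    # Each request starts only after the previous one has finished.
    return [await fake_fetch(url) for url in urls]


async def fetch_concurrently(urls):
    # All requests are started together; gather returns results in input order.
    return list(await asyncio.gather(*(fake_fetch(url) for url in urls)))


def timed(coro_fn, urls):
    # Run the coroutine function to completion and measure wall-clock time.
    start = time.perf_counter()
    pages = asyncio.run(coro_fn(urls))
    return pages, time.perf_counter() - start


urls = [f"https://example.com/item/{i}" for i in range(5)]  # placeholder URLs
seq_pages, seq_seconds = timed(fetch_sequentially, urls)
con_pages, con_seconds = timed(fetch_concurrently, urls)
print(f"sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

With five simulated requests, the sequential run takes roughly the sum of the delays, while the concurrent run takes roughly the longest single delay. Both return the same pages in the same order.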

Release history

- 0.4.dev1697464011 (pre-release), Oct 16, 2023
- 0.4.dev1644352361 (pre-release), Feb 8, 2022
- 0.4.dev1610168567 (pre-release), Jan 9, 2021
- 0.4.dev1610153303 (pre-release), Jan 9, 2021
- 0.4.dev1610152548 (pre-release, this version), Jan 9, 2021
- 0.4.dev1610151867 (pre-release), Jan 9, 2021
- 0.4.dev1610151595 (pre-release), Jan 9, 2021
- 0.4.dev1610040836 (pre-release), Jan 7, 2021
- 0.3.0, Jan 7, 2021
- 0.3.dev1610040110 (pre-release), Jan 7, 2021
- 0.3.dev1609437896 (pre-release), Dec 31, 2020
- 0.3.dev1609346365 (pre-release), Dec 30, 2020
- 0.3.dev1608816307 (pre-release), Dec 24, 2020
- 0.3.dev1608813685 (pre-release), Dec 24, 2020
- 0.3.dev1608394521 (pre-release), Dec 19, 2020
- 0.3.dev1608231497 (pre-release), Dec 17, 2020
- 0.3.dev1608229779 (pre-release), Dec 17, 2020
- 0.3.dev1608146104 (pre-release), Dec 16, 2020
- 0.3.dev1608127091 (pre-release), Dec 16, 2020
- 0.3.dev1607862911 (pre-release), Dec 13, 2020
- 0.3.dev1607821683 (pre-release), Dec 13, 2020
- 0.3.dev1607821221 (pre-release), Dec 13, 2020
- 0.3.dev1607820209 (pre-release), Dec 13, 2020
- 0.3.dev1607819237 (pre-release), Dec 13, 2020
- 0.3.dev1607790087 (pre-release), Dec 12, 2020
- 0.3.dev1607777884 (pre-release), Dec 12, 2020
- 0.3.dev1607777640 (pre-release), Dec 12, 2020
- 0.3.dev1607735393 (pre-release), Dec 12, 2020
- 0.3.dev1607733776 (pre-release), Dec 12, 2020
- 0.3.dev1607732818 (pre-release), Dec 12, 2020
- 0.2.3, Dec 9, 2020
- 0.2.2.post1607473204, Dec 9, 2020
- 0.2.1, Dec 8, 2020
- 0.2.0, Dec 8, 2020
- 0.2.dev1607473775 (pre-release), Dec 9, 2020
- 0.0.1, Aug 13, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ciag-robot-0.4.dev1610152548.tar.gz (14.3 kB)

Uploaded Jan 9, 2021 Source

File details

Details for the file ciag-robot-0.4.dev1610152548.tar.gz.

File metadata

Download URL: ciag-robot-0.4.dev1610152548.tar.gz
Upload date: Jan 9, 2021
Size: 14.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.0 setuptools/51.0.0 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.7.9

File hashes

Hashes for ciag-robot-0.4.dev1610152548.tar.gz
Algorithm	Hash digest
SHA256	`cab669a4d773fb91574b681bc230ceef4201adeede4ab288f3dd0eef168df923`
MD5	`15ce71ae54a59a883c24f6cfc8138d83`
BLAKE2b-256	`c3f15163ae5e09812d884a7738bab7e4fb7743f07fc19821015540b32b723f3c`
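
The published digests can be used to check a downloaded archive. The following is a minimal sketch using the standard hashlib module. It assumes the archive was saved in the current directory under its original file name; the helper names sha256_of and matches_published_hash are made up for this illustration and are not part of py-robot or PyPI tooling.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for ciag-robot-0.4.dev1610152548.tar.gz.
EXPECTED_SHA256 = "cab669a4d773fb91574b681bc230ceef4201adeede4ab288f3dd0eef168df923"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Compare the file's digest with the expected hex digest."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path: the archive saved next to this script.
    archive = Path("ciag-robot-0.4.dev1610152548.tar.gz")
    if archive.exists():
        print("hash OK" if matches_published_hash(archive) else "hash MISMATCH")
```

Reading in chunks keeps memory use constant regardless of file size. pip can also enforce hashes at install time through its hash-checking mode.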


ciag-robot 0.4.dev1610152548
