dude uncomplicated data extraction (For Pyto on iOS)

dude_pyto is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, lets you build a web scraper in just a few lines of code with an easy-to-learn syntax.

🚨 dude_pyto is currently in Pre-Alpha. Please expect breaking changes.

Special Version for Pyto

This branch makes Braveblock an optional dependency for use with Pyto on iOS.

Pyto, and other similar iOS apps, do not support the compilation of code after the app has been approved.

So, the Rust-based code of Braveblock will not be downloaded through Pyto.
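As a rough illustration of what an optional dependency means in practice, the sketch below checks whether the braveblock package can be imported and falls back gracefully when it cannot. This is a general Python pattern, not necessarily how this branch implements it; the names BRAVEBLOCK_AVAILABLE and adblock_status are hypothetical.

```python
import importlib.util

# True only when the optional package is importable in this environment.
# On Pyto, where Braveblock cannot be installed, this would be False.
BRAVEBLOCK_AVAILABLE = importlib.util.find_spec("braveblock") is not None


def adblock_status() -> str:
    """Describe whether ad blocking can be used (hypothetical helper)."""
    if BRAVEBLOCK_AVAILABLE:
        return "ad blocking enabled"
    return "ad blocking disabled (braveblock not installed)"


print(adblock_status())
```

Checking availability once at import time keeps the rest of the code free of repeated try/except blocks.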

Please visit roniemartinez/dude for the original repository.

Installation

To install, run the following from a terminal.

pip install pydude_pyto
playwright install  # Install playwright binaries for Chrome, Firefox and Webkit.

Minimal web scraper

The simplest web scraper will look like this:

from dude_pyto import select


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

The example above gets all the hyperlink elements in a page and calls the handler function get_link() for each element.
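To make the decorator idea concrete, here is a minimal, self-contained sketch of a Flask-style handler registry. It is a conceptual illustration only, not dude_pyto's actual implementation: this select, RULES, FakeElement and run are all hypothetical stand-ins written for the example.

```python
# Conceptual sketch only: NOT dude_pyto's actual implementation.
from typing import Any, Callable, Dict, List, Optional, Tuple

Handler = Callable[[Any], Dict[str, Any]]

# Registry of (CSS selector, handler) pairs, filled in when decorators run.
RULES: List[Tuple[str, Handler]] = []


def select(css: str) -> Callable[[Handler], Handler]:
    """Register the decorated function as the handler for a CSS selector."""

    def decorator(func: Handler) -> Handler:
        RULES.append((css, func))
        return func  # the function itself is returned unchanged

    return decorator


class FakeElement:
    """Hypothetical stand-in for a browser element."""

    def __init__(self, attributes: Dict[str, str]) -> None:
        self._attributes = attributes

    def get_attribute(self, name: str) -> Optional[str]:
        return self._attributes.get(name)


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}


def run(elements_by_selector: Dict[str, List[FakeElement]]) -> List[Dict[str, Any]]:
    """Call every registered handler once per matching element."""
    results = []
    for css, handler in RULES:
        for element in elements_by_selector.get(css, []):
            results.append(handler(element))
    return results


links = [FakeElement({"href": "/url-1.html"}), FakeElement({"href": "/url-2.html"})]
scraped = run({"a": links})
print(scraped)
```

In the real framework the elements come from the selected parser backend rather than from a hand-built dictionary.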

How to run the scraper

You can run your scraper from a terminal by passing the URLs, an output filename of your choice, and the paths to your Python scripts to the dude_pyto scrape command.

dude_pyto scrape --url "<url>" --output data.json path/to/script.py

The output in data.json should contain the extracted URLs along with metadata fields, whose names are prefixed with an underscore.

[
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 0,
    "url": "/url-1.html"
  },
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 1,
    "url": "/url-2.html"
  },
  {
    "_page_number": 1,
    "_page_url": "https://dude.ron.sh/",
    "_group_id": 4502003824,
    "_group_index": 0,
    "_element_index": 2,
    "url": "/url-3.html"
  }
]
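Because metadata keys start with an underscore, they are easy to separate from the scraped fields when post-processing the output. The following is a small sketch using only the standard library; the split_record helper is hypothetical, and the sample data is an abbreviated copy of the output shown above rather than a file read from disk.

```python
import json

# Abbreviated copy of the data.json content shown above.
SAMPLE = """
[
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
   "_group_index": 0, "_element_index": 0, "url": "/url-1.html"},
  {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
   "_group_index": 0, "_element_index": 1, "url": "/url-2.html"}
]
"""


def split_record(record):
    """Return (metadata, data): keys starting with '_' are metadata."""
    metadata = {k: v for k, v in record.items() if k.startswith("_")}
    data = {k: v for k, v in record.items() if not k.startswith("_")}
    return metadata, data


records = json.loads(SAMPLE)
urls = [split_record(record)[1]["url"] for record in records]
print(urls)
```

To process a real run, replace json.loads(SAMPLE) with json.load() on the opened data.json file.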

Changing the output to --output data.csv should produce the same records in CSV form.
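For a rough idea of what the CSV form of those records looks like, the sketch below converts two of the JSON records with Python's csv module. It assumes one column per JSON key, in the same order as the keys; the CSV that dude_pyto itself writes may order or format columns differently.

```python
import csv
import io

# Two records copied from the JSON output above.
records = [
    {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
     "_group_index": 0, "_element_index": 0, "url": "/url-1.html"},
    {"_page_number": 1, "_page_url": "https://dude.ron.sh/", "_group_id": 4502003824,
     "_group_index": 0, "_element_index": 1, "url": "/url-2.html"},
]

# Assumption: columns follow the key order of the first record.
fieldnames = list(records[0].keys())

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```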

Features

  • Simple Flask-inspired design - build a scraper with decorators.
  • Uses Playwright API - run your scraper in Chromium, Firefox and WebKit and leverage Playwright's powerful selector engine supporting CSS, XPath, text, regex, etc.
  • Data grouping - group related results.
  • URL pattern matching - run functions on matched URLs.
  • Priority - reorder functions based on priority.
  • Setup function - enable setup steps (clicking dialogs or login).
  • Navigate function - enable navigation steps to move to other pages.
  • Custom storage - option to save data to other formats or database.
  • Async support - write async handlers.
  • Option to use other parser backends aside from Playwright.
  • Option to follow all links indefinitely (Crawler/Spider).
  • Events - attach functions to startup, pre-setup, post-setup and shutdown events.
  • Option to save data on every page.
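As a conceptual sketch of how URL pattern matching and priority can work together, the example below picks the handlers whose pattern matches a URL and orders them by priority. It does not use dude_pyto's API: Rule and handlers_for are hypothetical, and the convention that a lower number runs first is an assumption made for this example.

```python
import re
from typing import Callable, List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical handler registration: pattern, priority and callback."""

    url_pattern: str
    priority: int
    handler: Callable[[str], str]


def handlers_for(url: str, rules: List[Rule]) -> List[Callable[[str], str]]:
    """Return handlers whose pattern matches the URL, lowest priority number first."""
    matching = [rule for rule in rules if re.search(rule.url_pattern, url)]
    return [rule.handler for rule in sorted(matching, key=lambda rule: rule.priority)]


rules = [
    Rule(r"example\.com/products", 2, lambda url: "product details"),
    Rule(r"example\.com", 1, lambda url: "site-wide banner"),
    Rule(r"other\.org", 0, lambda url: "unrelated"),
]

page = "https://example.com/products/1"
order = [handler(page) for handler in handlers_for(page, rules)]
print(order)
```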

Supported Parser Backends

By default, dude_pyto uses Playwright but gives you an option to use parser backends that you are familiar with. It is possible to use parser backends like BeautifulSoup4, Parsel, lxml, Pyppeteer, and Selenium.

Here is a summary of the features supported by each parser backend. CSS, XPath, Text and Regex are the selector types.

Parser Backend   Sync?   Async?   CSS   XPath   Text   Regex   Setup Handler   Navigate Handler
Playwright       ✅      ✅       ✅    ✅      ✅     ✅      ✅              ✅
BeautifulSoup4   ✅      ✅       ✅    🚫      🚫     🚫      🚫              🚫
Parsel           ✅      ✅       ✅    ✅      ✅     ✅      🚫              🚫
lxml             ✅      ✅       ✅    ✅      ✅     ✅      🚫              🚫
Pyppeteer        🚫      ✅       ✅    ✅      ✅     🚫      ✅              ✅
Selenium         ✅      ✅       ✅    ✅      ✅     🚫      ✅              ✅

Using the Docker image

Pull the docker image using the following command.

docker pull roniemartinez/dude

Assuming that script.py exists in the current directory, run Dude using the following command.

docker run -it --rm -v "$PWD":/code roniemartinez/dude dude scrape --url <url> script.py

Documentation

Read the complete documentation at https://roniemartinez.github.io/dude/. All the advanced and useful features are documented there.

Requirements

  • ✅ Any dude should know how to work with selectors (CSS or XPath).
  • ✅ Familiarity with any backends that you love (see Supported Parser Backends)
  • ✅ Python decorators... you'll live, dude!

Why name this project "dude"?

  • ✅ A recursive acronym looks nice.
  • ✅ Adding "uncomplicated" (like ufw) into the name says it is a very simple framework.
  • ✅ Puns! I also think that if you want to do web scraping, there's probably some random dude around the corner who can make it very easy for you to start with it. 😊

Author

Ronie Martinez

Contributors ✨

Thanks goes to these wonderful people (emoji key):


Ronie Martinez

🚧 💻 📖 🚇

This project follows the all-contributors specification. Contributions of any kind welcome!
