dude uncomplicated data extraction

Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, makes it easy to build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

🚨 Dude is currently in Pre-Alpha. Please expect breaking changes.

Minimal web scraper

The simplest web scraper will look like this:

from dude import select


@select(selector="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

The example above gets all the hyperlink elements on a page and calls the handler function get_link() for each element. To start scraping, simply run this in your terminal:

dude scrape --url "<url>" path/to/file.py

Another option is to run from Python code by calling dude.run() as shown below and then running python path/to/file.py:

from dude import select


@select(selector="a")
def get_link(element):
    return {"url": element.get_attribute("href")}


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh/"])

Features

  • Simple Flask-inspired design - build a scraper with decorators.
  • Uses Playwright API - run your scraper in Chromium, Firefox, and WebKit, and leverage Playwright's powerful selector engine supporting CSS, XPath, text, regex, etc.
  • Data grouping - group related scraping data.
  • URL pattern matching - run functions on specific URLs.
  • Priority - reorder functions based on priority.
  • Setup function - enable setup steps (clicking dialogs or login).
  • Navigate function - enable navigation steps to move to other pages.
  • Custom storage - option to save data to other formats or database.
  • Async support - write async handlers.
  • BeautifulSoup4 - option to use BeautifulSoup4 instead of Playwright.

Support

This project is at a very early stage. This dude needs some love! ❤️

Contribute to this project by requesting features, discussing ideas, reporting bugs, opening pull requests, or through Github Sponsors. Your help is highly appreciated.


How to use

Requirements

  • ✅ Any dude should know how to work with selectors (CSS or XPath).
  • ✅ This library was built on top of Playwright. Any dude should be at least familiar with the basics of Playwright, which also extends the selectors to support text, regular expressions, etc. See Selectors | Playwright Python.
  • ✅ Python decorators... you'll live, dude!

Installation

To install, simply run:

pip install pydude
playwright install

The second command installs the Playwright browser binaries for Chromium, Firefox, and WebKit. See https://playwright.dev/python/docs/intro#pip
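
If you only need one of the browsers, Playwright also lets you install a single browser binary, for example:

playwright install chromium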

Basic Usage

To use dude, start by importing the library.

from dude import select

A basic handler function has the structure shown below. A handler function should accept one argument (the element) and should be decorated with @select(). The handler should return a dictionary.

@select(selector="<put-your-selector-here>")
def handler(element):
    ...
    # This dictionary can contain multiple items
    return {"<key>": "<value-extracted-from-element>"}

The example handler below extracts the text content of any element that matches the selector css=.title.

@select(selector="css=.title")
def result_title(element):
    """
    Result title.
    """
    return {"title": element.text_content()}

To run your handler functions, simply call dude.run(urls=["<url-you-want-to-scrape>"]).

import dude

dude.run(urls=["https://dude.ron.sh/"])

It is possible to attach a single handler to multiple selectors.

@select(selector="<a-selector>")
@select(selector="<another-selector>")
def handler(element):
    return {"<key>": "<value-extracted-from-element>"}

Check out the example in examples/flat.py and run it on your terminal using the command python examples/flat.py.

Advanced Usage

Setup

Some websites might require you to click on dialog buttons first. You can pass the setup=True parameter to declare setup actions.

@select(selector="text=I agree", setup=True)
def agree(element, page):
    """
    Clicks "I agree" in order to use the website.
    """
    with page.expect_navigation():
        element.click()

Navigate

To navigate to another page, you can pass the navigate=True parameter to declare navigation actions.

@select(selector="text=Next", navigate=True)
def next_page(element, page):
    """
    Clicks the Next button/link to navigate to the next page.
    """
    with page.expect_navigation():
        element.click()

Grouping Results

When scraping a page that contains a list of information (for example, a Search Engine Results Page (SERP) can have URLs, titles, and descriptions), it is important to know how the data should be grouped. By default, all scraped results are grouped by :root, which is the root document, creating a flat list. To specify grouping, pass group=<selector-for-grouping> to the @select() decorator.

In the example below, the results are grouped by an element with class custom-group. The matched selectors should be children of this element.

@select(selector="css=.title", group="css=.custom-group")
def result_title(element):
    return {"title": element.text_content()}

A more extensive example can be found at examples/grouping.py.

The group parameter has the advantage of making sure that items end up in their correct group. Take for example the HTML below; notice that the second div has no description.

    <div class="custom-group">
        <p class="title">Title 1</p>
        <p class="description">Description 1</p>
    </div>
    <div class="custom-group">
        <p class="title">Title 2</p>
    </div>
    <div class="custom-group">
        <p class="title">Title 3</p>
        <p class="description">Description 3</p>
    </div>

When the group is not specified, "Description 3" ends up being grouped with "Title 2".

[
  {
    "_page_number": 1,
    // ...
    "description": "Description 1",
    "title": "Title 1"
  },
  {
    "_page_number": 1,
    // ...
    "description": "Description 3",
    "title": "Title 2"
  },
  {
    "_page_number": 1,
    // ...
    "title": "Title 3"
  }
]

By specifying the group with @select(..., group="css=.custom-group"), we get the correct grouping.

[
  {
    "_page_number": 1,
    // ...
    "description": "Description 1",
    "title": "Title 1"
  },
  {
    "_page_number": 1,
    // ...
    "title": "Title 2"
  },
  {
    "_page_number": 1,
    // ...
    "description": "Description 3",
    "title": "Title 3"
  }
]

The group parameter simplifies how you write your code

ℹ️ The examples below are both acceptable ways to write a scraper. You have the option to choose how you write the code.

A common way developers write scrapers is illustrated in the example below (see examples/single_handler.py for the complete script). While this works, it can be hard to maintain.

@select(selector="css=.custom-group")
def result_handler(element):
    """
    Perform all the heavy-lifting in a single handler.
    """
    data = {}

    url = element.query_selector("a.url")
    if url:
        data["url"] = url.get_attribute("href")

    title = element.query_selector(".title")
    if title:
        data["title"] = title.text_content()

    description = element.query_selector(".description")
    if description:
        data["description"] = description.text_content()

    return data

This can be rewritten in a much simpler way as shown below (see examples/grouping.py for the complete script). It requires writing 3 simple functions but is much easier to read, as you don't have to deal with querying the child elements.

@select(selector="css=a.url", group="css=.custom-group")
def result_url(element):
    return {"url": element.get_attribute("href")}


@select(selector="css=.title", group="css=.custom-group")
def result_title(element):
    return {"title": element.text_content()}


@select(selector="css=.description", group="css=.custom-group")
def result_description(element):
    return {"description": element.text_content()}

URL Pattern Matching

To restrict a handler function to specific websites, a url pattern parameter can be passed to @select(). The url pattern should be a valid regular expression. The example below will only run if the URL of the current page matches .*\.com.

@select(selector="css=.title", url=r".*\.com")
def result_title(element):
    return {"title": element.text_content()}

A more extensive example can be found at examples/url_pattern.py.

Prioritization

Handlers are sorted based on the following sequence:

  1. URL Pattern
  2. Group
  3. Selector
  4. Priority

If all handlers have the same priority value, they are executed in the order they were inserted into the rule list. This order depends on how handlers are defined inside the Python files and which Python file was imported first. If no priority is provided to the @select() decorator, the value defaults to 100.

The example below makes sure that result_description() will be called first before result_title().

@select(selector="css=.title", priority=1)
def result_title(element):
    return {"title": element.text_content()}


@select(selector="css=.description", priority=0)
def result_description(element):
    return {"description": element.text_content()}

The priority value is most useful on Setup and Navigate handlers. In the example below, the selector css=#pnnext is queried before looking for text=Next. Take note that if css=#pnnext exists, then text=Next will not be queried anymore.

@select(selector="text=Next", navigate=True)
@select(selector="css=#pnnext", navigate=True, priority=0)
def next_page(element, page):
    with page.expect_navigation():
        element.click()

A more extensive example can be found at examples/priority.py.

Custom Storage

Dude currently supports the json, yaml/yml, and csv formats only (the Scraper class supports json only). However, this can be extended to support custom storage, or to override the existing formats, using the @save() decorator. The save function should accept 2 parameters: data (a list of dictionaries of scraped data) and an optional output (a filename or None). Take note that the save function must return a boolean indicating success.

The example below prints the output to the terminal using tabulate, for illustration purposes only. You can also use the @save() decorator in other ways, like saving the scraped data to spreadsheets or a database, or sending it to an API (a sketch of a database-backed handler is shown at the end of this section).

import tabulate

from dude import save


@save("table")
def save_table(data, output) -> bool:
    """
    Prints data to stdout using tabulate.
    """
    print(tabulate.tabulate(tabular_data=data, headers="keys", maxcolwidths=50))
    return True

The custom storage can then be called using any of these methods:

  1. From terminal
    dude scrape --url "<url>" path/to/file.py --format table
    
  2. From python
    if __name__ == "__main__":
        import dude
    
        dude.run(urls=["<url>"], pages=2, format="table")
    

A more extensive example can be found at examples/custom_storage.py.
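
As another illustration, scraped data could be written to a database. The sketch below stores each item as a JSON string in a local SQLite file; the format name "sqlite", the table name, and the fallback filename are assumptions for this example, not part of Dude itself.

import json
import sqlite3

from dude import save


@save("sqlite")
def save_sqlite(data, output) -> bool:
    """
    Saves the scraped data into a local SQLite database.
    """
    connection = sqlite3.connect(output or "output.db")  # "output.db" is an arbitrary fallback filename
    with connection:
        connection.execute("CREATE TABLE IF NOT EXISTS results (item TEXT)")
        # Store each scraped dictionary as a JSON string to avoid assuming a fixed column schema.
        connection.executemany(
            "INSERT INTO results (item) VALUES (?)",
            [(json.dumps(item),) for item in data],
        )
    connection.close()
    return True

It can then be selected the same way as any other format, e.g. dude scrape --url "<url>" path/to/file.py --format sqlite.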

Using the Scraper application class

The decorator @select() and the function run() simplify the usage of the framework. It is also possible to create your own scraper application object, as in the example below.

🚨 WARNING: This is not currently supported by the command line interface! Please use the command python path/to/file.py to run the scraper application.

from dude import Scraper

app = Scraper()


@app.select(selector="css=.title")
def result_title(element):
    return {"title": element.text_content()}


if __name__ == '__main__':
    app.run(urls=["https://dude.ron.sh/"])

A more extensive example can be found at examples/application.py.

Async Support

Handler functions can be converted to async. It is not possible to mix async and sync scraping handlers since Playwright does not support this. It is, however, possible to have async and sync storage handlers at the same time, since storage is no longer tied to Playwright.

@select(selector="css=.title")
async def result_title(element):
    return {"title": await element.text_content()}

@save("json")
async def save_json(data, output) -> bool:
    ...
    return True

@save("xml")
def save_xml(data, output) -> bool:
    # sync storage handler can be used on sync and async mode
    ...
    return True

A more extensive example can be found at examples/async.py.

Using BeautifulSoup4

The option to use BeautifulSoup4 is now available. To install it, run:

pip install pydude[bs4]

Attributes and text from soup objects can be accessed as shown in the examples below:

@select(selector="a.url")
def result_url(soup):
    return {"url": soup["href"]}


@select(selector=".title")
def result_title(soup):
    return {"title": soup.get_text()}

To use BeautifulSoup4 from the command line, just add the --bs4 argument:

dude scrape --url "<url>" --bs4 path/to/file.py

To use BeautifulSoup4 from Python code, just pass the parameter parser="bs4" to the run() function.

dude.run(urls=["https://dude.ron.sh/"], parser="bs4")

Examples can be found at examples/soup.py and examples/async_soup.py.

CLI

% dude scrape -h                                                                 
usage: dude scrape [-h] --url URL [--playwright | --bs4] [--headed] [--browser {chromium,webkit,firefox}] [--pages PAGES] [--output OUTPUT] [--format FORMAT] [--proxy-server PROXY_SERVER]
                   [--proxy-user PROXY_USER] [--proxy-pass PROXY_PASS]
                   PATH [PATH ...]

Run the dude scraper.

options:
  -h, --help            show this help message and exit

required arguments:
  PATH                  Path to python file/s containing the handler functions.
  --url URL             Website URL to scrape. Accepts one or more url (e.g. "dude scrape --url <url1> --url <url2> ...")

optional arguments:
  --playwright          Use Playwright.
  --bs4                 Use BeautifulSoup4.
  --headed              Run headed browser.
  --browser {chromium,webkit,firefox}
                        Browser type to use.
  --pages PAGES         Maximum number of pages to crawl before exiting (default=1). This is only valid when a navigate handler is defined.
  --output OUTPUT       Output file. If not provided, prints into the terminal.
  --format FORMAT       Output file format. If not provided, uses the extension of the output file or defaults to "json". Supports "json", "yaml/yml", and "csv" but can be extended using the @save()
                        decorator.
  --proxy-server PROXY_SERVER
                        Proxy server.
  --proxy-user PROXY_USER
                        Proxy username.
  --proxy-pass PROXY_PASS
                        Proxy password.
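
For example, to run a scraper headed in Firefox and write the results to a JSON file (the URL and file path here are only illustrative):

dude scrape --url "https://dude.ron.sh/" --browser firefox --headed --output data.json examples/flat.py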

Why name this project "dude"?

  • ✅ A recursive acronym looks nice.
  • ✅ Adding "uncomplicated" (like ufw) to the name says it is a very simple framework.
  • ✅ Puns! I also think that if you want to do web scraping, there's probably some random dude around the corner who can make it very easy for you to start with it. 😊

Author

Ronie Martinez
