dude uncomplicated data extraction

Project description

Dude is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, aims to let you build a web scraper in just a few lines of code with an easy-to-learn syntax.

🚨 Dude is currently in Pre-Alpha. Please expect breaking changes.

Minimal web scraper

The simplest web scraper will look like this:

from dude import select


@select(selector="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

The example above gets all the hyperlink elements in a page and calls the handler function get_link() for each element. To start scraping, run the dude scrape command in your terminal, as described in the next section.

How to run the scraper

You can run your scraper from a terminal by passing the URLs, an output filename of your choice, and the paths to your Python files to the dude scrape command.

dude scrape --url "<url>" --output data.json path/to/file.py

Features

  • Simple Flask-inspired design - build a scraper with decorators.
  • Uses the Playwright API - run your scraper in Chrome, Firefox, and WebKit, and leverage Playwright's powerful selector engine, which supports CSS, XPath, text, regex, etc.
  • Data grouping - group related scraping data.
  • URL pattern matching - run functions on specific URLs.
  • Priority - reorder functions based on priority.
  • Setup function - enable setup steps (clicking dialogs or login).
  • Navigate function - enable navigation steps to move to other pages.
  • Custom storage - option to save data to other formats or to a database.
  • Async support - write async handlers.
  • BeautifulSoup4 - option to use BeautifulSoup4 instead of Playwright.
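To make the URL pattern matching and priority features more concrete, here is a minimal sketch in plain Python. It does not use Dude itself. The `Rule` tuple, the `url` and `priority` parameter names, the default of 100, the `handlers_for()` helper, and the convention that a lower number runs first are all assumptions made for illustration; check the documentation for the real parameters.

```python
import re
from typing import Callable, List, NamedTuple, Optional


class Rule(NamedTuple):
    selector: str
    url_pattern: Optional[str]  # None means "applies to every URL"
    priority: int
    handler: Callable


RULES: List[Rule] = []


def select(selector: str, url: Optional[str] = None, priority: int = 100):
    """Hypothetical decorator; parameter names are illustrative only."""
    def decorator(func):
        RULES.append(Rule(selector, url, priority, func))
        return func
    return decorator


@select(selector="h1", priority=1)
def title(element):
    return {"title": element}


@select(selector=".price", url=r"/products/")
def price(element):
    return {"price": element}


@select(selector="a", priority=0)
def link(element):
    return {"url": element}


def handlers_for(page_url: str) -> List[str]:
    """Return the names of applicable handlers, lowest priority number first."""
    applicable = [
        rule for rule in RULES
        if rule.url_pattern is None or re.search(rule.url_pattern, page_url)
    ]
    applicable.sort(key=lambda rule: rule.priority)  # stable sort
    return [rule.handler.__name__ for rule in applicable]


print(handlers_for("https://example.com/products/1"))
print(handlers_for("https://example.com/about"))
```

In this sketch the price handler only applies to URLs containing /products/, and handlers are ordered by their priority number regardless of the order in which they were defined.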

Documentation

Read the complete documentation at https://roniemartinez.github.io/dude/. All the advanced and useful features are documented there.

Support

This project is at a very early stage. This dude needs some love! ❤️

Contribute to this project through feature requests, idea discussions, bug reports, pull requests, or GitHub Sponsors. Your help is highly appreciated.

Requirements

  • ✅ Any dude should know how to work with selectors (CSS or XPath).
  • ✅ This library is built on top of Playwright. Any dude should be at least familiar with the basics of Playwright, which extends selectors to support text, regular expressions, etc. See Selectors | Playwright Python.
  • ✅ Python decorators... you'll live, dude!

Why name this project "dude"?

  • ✅ A recursive acronym looks nice.
  • ✅ Adding "uncomplicated" (like ufw) into the name says it is a very simple framework.
  • ✅ Puns! I also think that if you want to do web scraping, there's probably some random dude around the corner who can make it very easy for you to start with it. 😊

Author

Ronie Martinez

Download files

Source Distribution: pydude-0.3.0.tar.gz (24.4 kB)

Built Distribution: pydude-0.3.0-py3-none-any.whl (26.1 kB)
