dude uncomplicated data extraction
Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, aims to let you build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.
🚨 Dude is currently in Pre-Alpha. Please expect breaking changes.
Minimal web scraper
The simplest web scraper will look like this:
from dude import select


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}
The example above finds all the hyperlink elements in a page and calls the handler function get_link() for each element.
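To make the "one handler call per matched element" idea concrete, here is a minimal standard-library sketch of a decorator registry. It is not how Dude is implemented: select_tag, scrape and the tag-name-only matching are hypothetical stand-ins, and each element is passed as a plain dict of attributes rather than a Playwright element.

```python
from html.parser import HTMLParser

# Hypothetical registry: maps a tag name to the handlers registered for it.
_HANDLERS = {}


def select_tag(tag):
    """Toy stand-in for a selector decorator; matches elements by tag name only."""
    def decorator(func):
        _HANDLERS.setdefault(tag, []).append(func)
        return func
    return decorator


class _Collector(HTMLParser):
    """Calls every registered handler once per matching start tag."""

    def __init__(self):
        super().__init__()
        self.results = []

    def handle_starttag(self, tag, attrs):
        element = dict(attrs)  # attributes of the matched element
        for handler in _HANDLERS.get(tag, []):
            self.results.append(handler(element))


def scrape(html):
    """Feed the HTML through the collector and return all handler results."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return collector.results


@select_tag("a")
def get_link(element):
    # Assumption: the element is a dict here, so we read "href" with .get().
    return {"url": element.get("href")}


if __name__ == "__main__":
    page = '<p><a href="/first">one</a> and <a href="/second">two</a></p>'
    print(scrape(page))
```

The decorator only records the function; the matching and the per-element calls happen later, when a page is actually processed.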
How to run the scraper
You can run your scraper from the terminal by passing the URLs, an output filename of your choice, and the paths to your Python scripts to the dude scrape command.
dude scrape --url "<url>" --output data.json path/to/file.py
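If you would rather launch the same command from a Python script, one possible approach is to assemble the argument list and hand it to subprocess. The helper build_scrape_command below is hypothetical, and it only uses the --url and --output flags shown above.

```python
import shlex


def build_scrape_command(url, output, script_path):
    """Return the argument list for the `dude scrape` command shown above."""
    return ["dude", "scrape", "--url", url, "--output", output, script_path]


if __name__ == "__main__":
    command = build_scrape_command("https://example.com", "data.json", "path/to/file.py")
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(command))
    # To actually run it (requires dude to be installed):
    # import subprocess
    # subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a URL contains characters such as & or ?.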
Features
- Simple Flask-inspired design - build a scraper with decorators.
- Uses Playwright API - run your scraper in Chrome, Firefox and WebKit and leverage Playwright's powerful selector engine supporting CSS, XPath, text, regex, etc.
- Data grouping - group related scraping data.
- URL pattern matching - run functions on specific URLs.
- Priority - reorder functions based on priority.
- Setup function - enable setup steps (clicking dialogs or login).
- Navigate function - enable navigation steps to move to other pages.
- Custom storage - option to save data to other formats or databases.
- Async support - write async handlers.
- Option to use other parsers aside from Playwright.
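As a rough illustration of how URL pattern matching and priority could combine, the sketch below keeps a list of registered functions and, for a given URL, returns the matching ones in priority order. This is not Dude's API: the handler decorator, its url_pattern and priority parameters, the default of 100, and the "lower number runs first" rule are all assumptions made for this example.

```python
import re

# Hypothetical registry entries: (priority, registration index, pattern, function).
_REGISTRY = []


def handler(url_pattern=r".*", priority=100):
    """Register a function for URLs matching url_pattern (assumed semantics)."""
    def decorator(func):
        _REGISTRY.append((priority, len(_REGISTRY), re.compile(url_pattern), func))
        return func
    return decorator


def handlers_for(url):
    """Return functions whose pattern matches url, lowest priority number first."""
    matching = [entry for entry in _REGISTRY if entry[2].search(url)]
    # Ties on priority fall back to registration order.
    matching.sort(key=lambda entry: (entry[0], entry[1]))
    return [entry[3] for entry in matching]


@handler()
def default_priority():
    return "runs on every URL, default priority"


@handler(priority=1)
def everywhere_first():
    return "runs on every URL, early"


@handler(url_pattern=r"/blog/", priority=0)
def blog_only():
    return "runs only on blog URLs, earliest"


if __name__ == "__main__":
    for url in ("https://example.com/blog/post", "https://example.com/about"):
        print(url, [func.__name__ for func in handlers_for(url)])
```

Sorting on a (priority, registration index) pair keeps the order deterministic when two functions share the same priority.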
Documentation
Read the complete documentation at https://roniemartinez.github.io/dude/. All the advanced and useful features are documented there.
Support
This project is at a very early stage. This dude needs some love! ❤️
Contribute to this project through feature requests, idea discussions, bug reports, pull requests, or GitHub Sponsors. Your help is highly appreciated.
Requirements
- ✅ Any dude should know how to work with selectors (CSS or XPath).
- ✅ This library is built on top of Playwright. Any dude should be at least familiar with the basics of Playwright, which extends selectors to support text, regular expressions, etc. See Selectors | Playwright Python.
- ✅ Python decorators... you'll live, dude!
Why name this project "dude"?
- ✅ A recursive acronym looks nice.
- ✅ Adding "uncomplicated" (like ufw) to the name says it is a very simple framework.
- ✅ Puns! I also think that if you want to do web scraping, there's probably some random dude around the corner who can make it very easy for you to start with it. 😊