Web scraping engine

Rubbernecker

A web scraping engine built with Python and SeleniumBase that crawls web pages, stores raw HTML, and parses structured data. It supports configurable page actions and depth-based crawling.
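
To illustrate what depth-based crawling means in general, here is a minimal sketch of a breadth-first crawl with a depth limit. It is not Rubbernecker's implementation: the `crawl` function, the `get_links` callback, and the toy `SITE` graph are all hypothetical names chosen for this example, and no browser or network access is involved.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
) -> dict[str, int]:
    """Breadth-first crawl that returns each visited URL mapped to its depth."""
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = depths[url]
        if depth >= max_depth:
            continue  # pages at the depth limit are recorded but not expanded
        for link in get_links(url):
            if link not in depths:  # skip URLs that were already seen
                depths[link] = depth + 1
                queue.append(link)
    return depths


# Toy link graph standing in for real pages (hypothetical data).
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/d"],
}


def toy_links(url: str) -> list[str]:
    """Look up outgoing links in the toy graph instead of fetching a page."""
    return SITE.get(url, [])


result = crawl("https://example.com/", toy_links, max_depth=2)
print(result)
```

With `max_depth=2`, the page `/c` is visited at depth 2 but its link to `/d` is not followed. In a real crawler, `get_links` would load the page and extract its anchors.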

Installation

Prerequisites

Python 3.12+

Google Chrome

macOS:

brew install --cask google-chrome

Fedora/RHEL (including WSL 2):

sudo dnf install -y fedora-workstation-repositories
sudo dnf config-manager setopt google-chrome.enabled=1
sudo dnf install -y google-chrome-stable

Ubuntu/Debian:

wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo gpg --dearmor -o /usr/share/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] https://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt update
sudo apt install -y google-chrome-stable
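
As a quick way to confirm both prerequisites before running setup, the following sketch checks the Python version and looks for a Chrome executable on `PATH`. It is not part of Rubbernecker. The function names and the list of executable names in `CHROME_NAMES` are assumptions for this example, and the actual executable name varies by platform and install method.

```python
import shutil
import sys
from typing import Callable, Optional, Sequence

# Assumed executable names; adjust for your platform.
CHROME_NAMES = ("google-chrome", "google-chrome-stable", "chrome")


def python_ok(
    version: Sequence[int] = sys.version_info,
    minimum: tuple[int, int] = (3, 12),
) -> bool:
    """Return True if the major.minor version meets the minimum."""
    return tuple(version[:2]) >= minimum


def find_chrome(
    names: Sequence[str] = CHROME_NAMES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first Chrome executable found on PATH, if any."""
    for name in names:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print("Python 3.12+:", python_ok())
    print("Chrome on PATH:", find_chrome() or "not found")
```

On macOS, Chrome installed through Homebrew Cask lives inside an application bundle and is usually not on `PATH`, so a "not found" result there does not necessarily mean Chrome is missing.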

Setup

make install

Or manually:

uv sync

Quick Start

See QUICKSTART.md for a step-by-step tutorial.

Commands

Command   Description
crawl     Scrape websites and save raw HTML to Avro files
fetch     Download assets from a list of URLs
parse     Extract structured data from crawled HTML
sitemap   Discover page URLs from sitemaps or robots.txt
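
For readers unfamiliar with how page URLs can be discovered from sitemaps or robots.txt, here is a minimal standard-library sketch of the general technique. It does not show how the `sitemap` command is implemented: `sitemaps_from_robots`, `urls_from_sitemap`, and the sample `ROBOTS` and `SITEMAP` strings are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Sample inputs (hypothetical data, no network access).
ROBOTS = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)
SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "</urlset>"
)

print(sitemaps_from_robots(ROBOTS))
print(urls_from_sitemap(SITEMAP))
```

Sitemap index files also use `<loc>` elements, so for an index `urls_from_sitemap` would return the child sitemap URLs, which would then need to be fetched and parsed in turn.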

License

Apache-2.0
