Rubbernecker

A web scraping engine built with Python and SeleniumBase that crawls web pages, stores raw HTML, and parses structured data. Supports configurable page actions and depth-based crawling.
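As a rough illustration of what depth-based crawling means, the sketch below does a breadth-first walk that stops following links after a fixed number of hops. It is not Rubbernecker's implementation: the names `crawl`, `fetch`, `LinkExtractor`, and `SITE` are hypothetical, and an in-memory dictionary stands in for the browser session that SeleniumBase would normally provide.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first crawl that follows links at most max_depth hops.

    `fetch` is a hypothetical callable mapping a URL to its HTML, or None
    when the page is unavailable. Returns a dict of URL -> raw HTML.
    """
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue  # keep the page, but do not follow its links
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# Tiny in-memory "site" standing in for real HTTP or browser fetches.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/c">C</a>',
    "https://example.com/b": "<p>leaf</p>",
    "https://example.com/c": "<p>two hops away</p>",
}

pages = crawl("https://example.com/", SITE.get, max_depth=1)
print(sorted(pages))
# ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```

With `max_depth=1` the page at `/c` is never fetched, because it is two hops from the start URL. Raising the limit to 2 would include it.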

Installation

Prerequisites

Python 3.12+

Google Chrome

macOS:

brew install --cask google-chrome

Fedora/RHEL (including WSL 2):

sudo dnf install -y fedora-workstation-repositories
sudo dnf config-manager setopt google-chrome.enabled=1
sudo dnf install -y google-chrome-stable

Ubuntu/Debian:

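# apt-key is deprecated; store the signing key in a dedicated keyring instead
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo gpg --dearmor -o /usr/share/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] https://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt update
sudo apt install -y google-chrome-stable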

Setup

make install

Or manually:

uv sync

Quick Start

See QUICKSTART.md for a step-by-step tutorial.

Commands

Command   Description
crawl     Scrape websites and save raw HTML to Avro files
fetch     Download assets from a list of URLs
parse     Extract structured data from crawled HTML
sitemap   Discover page URLs from sitemaps or robots.txt
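To make the sitemap step more concrete, here is a minimal sketch of how page URLs can be discovered from robots.txt and a sitemap file using only the standard library. It does not reflect how the `sitemap` command is implemented; the function names and the sample inputs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Return sitemap URLs declared via 'Sitemap:' lines in robots.txt."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def urls_from_sitemap(xml_text):
    """Return (kind, locs); kind is 'urlset' or 'sitemapindex'."""
    root = ET.fromstring(xml_text)
    kind = root.tag.removeprefix(SITEMAP_NS)
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    return kind, locs


# Hypothetical sample inputs.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(sitemaps_from_robots(ROBOTS_TXT))
# ['https://example.com/sitemap.xml']
print(urls_from_sitemap(SITEMAP_XML))
# ('urlset', ['https://example.com/', 'https://example.com/about'])
```

When the returned kind is `sitemapindex`, each `loc` points at another sitemap rather than a page, so a caller would fetch and parse those in turn.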

License

Apache-2.0

Download files

Source Distribution

rubbernecker-0.0.8.tar.gz (102.0 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing: yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7
  • Publisher: release.yml on brandtg/rubbernecker

Algorithm     Hash digest
SHA256        bb24b01505b46e24609c3dc49d694f01d7ffc72fcd501f8b0053ae861f7a56b6
MD5           db132b262e1191de3d4ed0761aee8780
BLAKE2b-256   dde37a430efb2a35408a9d0c6a6401c1bb282aa8fb52fc86874ae844538a82b4

Built Distribution

rubbernecker-0.0.8-py3-none-any.whl (33.5 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing: yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7
  • Publisher: release.yml on brandtg/rubbernecker

Algorithm     Hash digest
SHA256        2c66cc6b7aa68ff9db645b5e66ba910719c0e852c481b7d6bb0791897529f59a
MD5           3fea6058df2f65465b18bce03fde5e8b
BLAKE2b-256   6cddddb0d9bd6e180c4a0fa79f32a297308638e634dbbdacd2e8b73a2ef6b30f
