Rubbernecker
A web scraping engine built with Python and SeleniumBase that crawls web pages, stores raw HTML, and parses structured data. Supports configurable page actions and depth-based crawling.
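Depth-based crawling can be illustrated with a short breadth-first sketch. This is not Rubbernecker's implementation. The `crawl` function, the `get_links` callback, and the in-memory `SITE` graph are hypothetical names chosen for illustration, and the example makes no network requests.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
) -> dict[str, int]:
    """Breadth-first crawl that stops following links at max_depth.

    Returns a mapping of each visited URL to the depth at which it was found.
    """
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = depths[url]
        if depth >= max_depth:
            continue  # Visit this page but do not follow its links.
        for link in get_links(url):
            if link not in depths:  # Skip pages that were already seen.
                depths[link] = depth + 1
                queue.append(link)
    return depths


# Hypothetical link graph standing in for real page fetches.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": ["https://example.com/d"],
}

visited = crawl("https://example.com/", lambda url: SITE.get(url, []), max_depth=2)
print(visited)
```

With `max_depth=2`, the page at `/c` is recorded at depth 2 but its link to `/d` is never followed, so the depth limit bounds how far the crawl spreads from the start URL.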
Installation
Prerequisites
Python 3.12+
Google Chrome
macOS:
brew install --cask google-chrome
Fedora/RHEL (including WSL 2):
sudo dnf install -y fedora-workstation-repositories
sudo dnf config-manager setopt google-chrome.enabled=1
sudo dnf install -y google-chrome-stable
Ubuntu/Debian:
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo gpg --dearmor -o /usr/share/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] https://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt update
sudo apt install -y google-chrome-stable
Setup
make install
Or manually:
uv sync
Quick Start
See QUICKSTART.md for a step-by-step tutorial.
Commands
| Command | Description |
|---|---|
| crawl | Scrape websites and save raw HTML to Avro files |
| fetch | Download assets from a list of URLs |
| parse | Extract structured data from crawled HTML |
| sitemap | Discover page URLs from sitemaps or robots.txt |
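As a rough illustration of what sitemap discovery involves, the sketch below extracts `Sitemap:` entries from a robots.txt body and `<loc>` URLs from a sitemap document using only the standard library. It is an assumption about the general technique, not the code behind the `sitemap` command. The function names and the sample `ROBOTS` and `SITEMAP` strings are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own colons survive.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> value in a sitemap or sitemap index document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Hypothetical inputs; a real run would download these over HTTP.
ROBOTS = "User-agent: *\nDisallow: /private\nSitemap: https://example.com/sitemap.xml\n"
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(sitemaps_from_robots(ROBOTS))
print(urls_from_sitemap(SITEMAP))
```

Because a sitemap index also stores its child sitemaps in `<loc>` elements, the same `urls_from_sitemap` helper can be applied to each level when indexes are nested.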
Documentation
- Action Scripts — Automate page interactions during crawls
- Output Formats — Avro schemas for all command outputs
- Development — Testing, linting, and build commands
License
Apache-2.0
File details
Details for the file rubbernecker-0.0.7.tar.gz.
File metadata
- Download URL: rubbernecker-0.0.7.tar.gz
- Upload date:
- Size: 96.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 00833689a2ea90c021aa9e4d8fc19489f54d79f0ac3caf88a832fdf7f12a8215 |
| MD5 | ce2e034f14e1df8671dfd3f4faac7908 |
| BLAKE2b-256 | e94d226151cf452937f650b81d37f810cbaad2ee8e3cb645d50af5e6b21645b9 |
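A downloaded file can be checked against its published SHA256 digest with the standard library. The sketch below is one possible approach. The helper names `sha256_of` and `verify` are hypothetical, and the file path in the usage note assumes the archive was saved to the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for rubbernecker-0.0.7.tar.gz.
EXPECTED_SDIST_SHA256 = "00833689a2ea90c021aa9e4d8fc19489f54d79f0ac3caf88a832fdf7f12a8215"


def sha256_of(path: Path) -> str:
    """Return the hex SHA256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with the expected value, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify(Path("rubbernecker-0.0.7.tar.gz"), EXPECTED_SDIST_SHA256)` should return `True` for an intact download of the source distribution.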
Provenance
The following attestation bundles were made for rubbernecker-0.0.7.tar.gz:
Publisher: release.yml on brandtg/rubbernecker
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: rubbernecker-0.0.7.tar.gz
  - Subject digest: 00833689a2ea90c021aa9e4d8fc19489f54d79f0ac3caf88a832fdf7f12a8215
- Sigstore transparency entry: 1191992790
- Sigstore integration time:
- Permalink: brandtg/rubbernecker@b5f78ccfc5b2e2334ad52c585c4b79781deb0303
- Branch / Tag: refs/tags/v0.0.7
- Owner: https://github.com/brandtg
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b5f78ccfc5b2e2334ad52c585c4b79781deb0303
- Trigger Event: release
File details
Details for the file rubbernecker-0.0.7-py3-none-any.whl.
File metadata
- Download URL: rubbernecker-0.0.7-py3-none-any.whl
- Upload date:
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3602187bc53222a2cdcdef0cf195b9c909c47de81585ba9b96528f60b8704d71 |
| MD5 | 915a0d1d93a85e7baff13b529bad6ff3 |
| BLAKE2b-256 | b1a23007b339e015088267c42fbd6140886baee46c25cfeca88f360160495cb9 |
Provenance
The following attestation bundles were made for rubbernecker-0.0.7-py3-none-any.whl:
Publisher: release.yml on brandtg/rubbernecker
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: rubbernecker-0.0.7-py3-none-any.whl
  - Subject digest: 3602187bc53222a2cdcdef0cf195b9c909c47de81585ba9b96528f60b8704d71
- Sigstore transparency entry: 1191992791
- Sigstore integration time:
- Permalink: brandtg/rubbernecker@b5f78ccfc5b2e2334ad52c585c4b79781deb0303
- Branch / Tag: refs/tags/v0.0.7
- Owner: https://github.com/brandtg
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b5f78ccfc5b2e2334ad52c585c4b79781deb0303
- Trigger Event: release