Rubbernecker
A powerful web scraping engine built with Python and SeleniumBase that crawls web pages, stores raw HTML, and parses structured data. Rubbernecker supports configurable page actions, depth-based crawling, and proxy integration.
Overview
Rubbernecker provides four main commands:
- chrome - Launch a Chrome browser instance with debugging capabilities
- crawl - Scrape websites and save raw HTML to Avro files
- parse - Extract structured data from crawled HTML
- proxy - Run a local proxy server for routing requests
Installation
Prerequisites
Python 3.12+
Rubbernecker requires Python 3.12 or higher.
Google Chrome
Rubbernecker uses SeleniumBase with Chrome for web crawling.
macOS:
brew install --cask google-chrome
Fedora/RHEL (including WSL 2):
sudo dnf install -y fedora-workstation-repositories
sudo dnf config-manager setopt google-chrome.enabled=1
sudo dnf install -y google-chrome-stable
Ubuntu/Debian:
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt update
sudo apt install -y google-chrome-stable
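On Linux you can confirm the install (and see which version SeleniumBase will drive) with:
google-chrome --version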
Setup
Install dependencies and set up environment:
make install
Or manually:
poetry env use python3.12
poetry install
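To confirm the environment is set up, the CLI should respond to the usual help flag (assuming the standard --help convention):
poetry run rubbernecker --help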
Quick Start
See QUICKSTART.md for a step-by-step tutorial to get up and running in minutes.
Commands
rubbernecker chrome
Launch a Chrome browser instance with DevTools Protocol enabled.
Options:
- --headless - Run Chrome in headless mode (no GUI)
- --chrome_debug_port PORT - Port for Chrome DevTools Protocol (default: 9222)
- --proxy_server URL - Route traffic through a proxy server
Examples:
# Launch Chrome with visual interface
poetry run rubbernecker chrome
# Launch headless Chrome on custom port
poetry run rubbernecker chrome --headless --chrome_debug_port 9223
# Launch Chrome through a proxy
poetry run rubbernecker chrome --proxy_server "http://localhost:3128"
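Once Chrome is up, you can check that the DevTools endpoint is reachable; /json/version is a standard Chrome DevTools Protocol endpoint, independent of Rubbernecker:
curl http://localhost:9222/json/version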
rubbernecker crawl
Crawl web pages and save raw HTML to Avro files.
Syntax:
rubbernecker crawl [OPTIONS] INPUT_URL OUTPUT_URL
Arguments:
- INPUT_URL - File containing URLs to crawl (text, JSON, or Avro format)
- OUTPUT_URL - Path where crawled data will be saved (Avro format)
Key Options:
- --input_format FORMAT - Input file format: TEXT, JSON, or AVRO (a sample plain-text file is shown below)
- --chrome_debug_port PORT - Connect to Chrome on this port (default: 9222)
- --max_depth N - Maximum crawl depth for following links (default: 0)
- --max_retries N - Retry failed requests up to N times
- --sleep_success SECONDS - Wait time after successful requests
- --sleep_error SECONDS - Wait time after errors
- --load_actions FILE - Actions to perform after page load
- --crawl_actions FILE - Actions to discover and crawl additional links
- --use_bloom_filter - Skip duplicate URLs (useful for large crawls)
- --max_errors N - Stop after N errors
- --interactive - Prompt before each crawl action
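A plain-text input file simply lists one URL per line; the examples in this README pass such files without an --input_format flag:
cat > tmp/urls.txt << EOF
https://news.ycombinator.com/
https://example.com/
EOF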
Examples:
# Basic crawl
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro
# Crawl with depth (follow links up to 2 levels)
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro --max_depth 2
# Crawl with custom actions
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro \
--load_actions tmp/load-actions.txt \
--crawl_actions tmp/crawl-actions.txt
# Crawl with error handling
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro \
--max_retries 3 \
--max_errors 10 \
--sleep_error 5
rubbernecker parse
Extract structured data from crawled HTML using parsers.
Syntax:
rubbernecker parse PARSER_CLASS INPUT_URL OUTPUT_URL
Arguments:
- PARSER_CLASS - Fully qualified parser class name
- INPUT_URL - Avro file from the crawl command
- OUTPUT_URL - Path for parsed output (Avro format)
Available Parsers:
- rubbernecker.parse.standard.StandardPageParser - Extracts title, headers, links, and body text (a custom-parser sketch follows the examples below)
Examples:
# Parse with standard parser
poetry run rubbernecker parse rubbernecker.parse.standard.StandardPageParser \
tmp/raw.avro tmp/parsed.avro
# View parsed results
poetry run avrokit tojson tmp/parsed.avro | jq .
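Because PARSER_CLASS is resolved from a fully qualified name, a custom parser can in principle be supplied. The parser base-class contract is not documented here, so everything in this sketch is an assumption: the no-argument constructor, the parse method name and signature, and the dict-shaped record. Check the StandardPageParser source for the real interface:
# my_parsers.py -- hypothetical custom parser; the interface shown here is an
# assumption, not the documented Rubbernecker contract.
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed


class MetaTagParser:
    """Extract <meta name=...> tags from each crawled record."""

    def parse(self, record: dict) -> dict:
        # Crawl records carry url/body fields per the output schema below.
        soup = BeautifulSoup(record.get("body") or "", "html.parser")
        metas = {
            tag["name"]: tag.get("content", "")
            for tag in soup.find_all("meta")
            if tag.get("name")
        }
        return {"url": record.get("url"), "meta": metas}
If the assumed interface held, this would be invoked as poetry run rubbernecker parse my_parsers.MetaTagParser tmp/raw.avro tmp/meta.avro.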
rubbernecker proxy
Run a local proxy server to route requests through an upstream proxy.
Syntax:
rubbernecker proxy UPSTREAM [LISTEN]
Arguments:
- UPSTREAM - Upstream proxy (e.g., username:password@proxy.example.com:8000)
- LISTEN - Local address to listen on (default: 127.0.0.1:3128)
Example:
# Start proxy server
poetry run rubbernecker proxy "$PROXY_USER:$PROXY_PASS@proxy.example.com:8000" "127.0.0.1:3128"
# Use proxy in Chrome
poetry run rubbernecker chrome --proxy_server "http://127.0.0.1:3128" --headless
# Use proxy in crawl (crawl attaches to the proxied Chrome via the debug port, so no extra proxy flag is needed)
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro --chrome_debug_port 9222
Action Scripts
Action scripts define automated interactions with web pages using CSS selectors.
Action Script Format
[url_pattern_regex]
ACTION_NAME selector arguments
ACTION_NAME selector arguments
...
Available Actions
- SLEEP seconds - Wait for the specified duration
- SCROLL pixels - Scroll the page vertically
- INPUT selector text - Fill a form input with text
- CLICK selector - Click an element
- CLICK_IF_EXISTS selector - Click an element if it is present
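INPUT and CLICK compose into simple form flows. A hypothetical example: the selectors .cookie-accept, #q, and button[type=submit] are placeholders for whatever the target site actually uses:
cat > tmp/form-actions.txt << EOF
[example\.com]
CLICK_IF_EXISTS .cookie-accept
INPUT #q web scraping
CLICK button[type=submit]
SLEEP 2
EOF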
Example: Load Actions
Actions to perform after each page load (use the --load_actions flag):
cat > tmp/load-actions.txt << EOF
[news\.ycombinator\.com]
SLEEP 2
SCROLL 500
SLEEP 1
EOF
Example: Crawl Actions
Actions to discover additional URLs during crawling (use the --crawl_actions flag):
cat > tmp/crawl-actions.txt << EOF
[news\.ycombinator\.com]
CLICK a.morelink
EOF
This will click the "More" link on Hacker News to discover additional pages.
Advanced Usage
Full Crawl Example
Complete example crawling Hacker News with actions:
# Prepare directories
mkdir -p tmp
# Create URL list
cat > tmp/requests.txt << EOF
https://news.ycombinator.com/
EOF
# Create load actions (wait for page to stabilize)
cat > tmp/load-actions.txt << EOF
[news\.ycombinator\.com]
SLEEP 1
SCROLL 500
EOF
# Create crawl actions (discover more pages)
cat > tmp/crawl-actions.txt << EOF
[news\.ycombinator\.com]
CLICK a.morelink
EOF
# Start Chrome
poetry run rubbernecker chrome --headless &
# Crawl with depth 2
poetry run rubbernecker crawl tmp/requests.txt tmp/hn-raw.avro \
--load_actions tmp/load-actions.txt \
--crawl_actions tmp/crawl-actions.txt \
--max_depth 2 \
--max_retries 2 \
--sleep_success 1
# Parse results
poetry run rubbernecker parse rubbernecker.parse.standard.StandardPageParser \
tmp/hn-raw.avro tmp/hn-parsed.avro
# View results
poetry run avrokit tojson tmp/hn-parsed.avro | jq '.title, .links | length'
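If Chrome was launched in the background from the same interactive shell, stop the background job when finished (here assumed to be job %1):
kill %1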
Using with Proxies
Route traffic through a commercial proxy service:
# Start local proxy server
poetry run rubbernecker proxy \
"$PROXY_USER:$PROXY_PASS@residential.proxy.com:8000" \
"127.0.0.1:3128" &
# Start Chrome through proxy
poetry run rubbernecker chrome \
--proxy_server "http://127.0.0.1:3128" \
--chrome_debug_port 9222 \
--headless &
# Crawl through proxy
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro \
--chrome_debug_port 9222
Output Formats
Crawl Output (Raw HTML)
Avro schema with fields:
- url (string) - Crawled URL
- timestamp (long) - Unix timestamp in milliseconds
- body (string|null) - Raw HTML content
- error (string|null) - Error message if the request failed
- metadata (map|null) - Custom metadata
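The raw records can also be read directly in Python. A minimal sketch, assuming the fastavro package is installed (it is not a stated dependency of this project):
from fastavro import reader

with open("tmp/output.avro", "rb") as f:
    for record in reader(f):
        # Each record carries url, timestamp, body, error, and metadata
        print(record["url"], "error" if record["error"] else "ok")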
Parse Output (StandardPageParser)
Avro schema with fields:
- url (string) - Page URL
- timestamp (long) - Crawl timestamp
- title (string|null) - Page title
- content_length (int) - HTML content length
- body_text (string|null) - Extracted text content
- headers (array|null) - H1-H6 headers with level and text
- links (array|null) - Links with text, URL, and external flag
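For example, to list only the external links from parsed records (the field name external is an assumption based on the description above; verify against your data):
poetry run avrokit tojson tmp/parsed.avro | jq '.links[]? | select(.external)'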
Troubleshooting
Chrome connection issues:
- Ensure Chrome is running with a --chrome_debug_port that matches the crawl command
- Check whether port 9222 is already in use: lsof -i :9222
SeleniumBase errors:
- Update Chrome to the latest version
Memory issues with large crawls:
- Use --use_bloom_filter to reduce memory for duplicate detection
- Process in smaller batches with multiple crawl commands (see the sketch below)
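One way to batch, using the standard split utility to break the URL list into 1000-line chunks:
split -l 1000 tmp/urls.txt tmp/urls-part-
for part in tmp/urls-part-*; do
  poetry run rubbernecker crawl "$part" "${part}.avro" --use_bloom_filter
done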
Development
Run tests:
make test
Run all tests (including integration tests):
make test-all
Run tests with coverage:
make test-coverage
Lint and type check:
make lint
make typecheck
Format code:
make format
Build the package:
make build
Clean up build artifacts:
make clean
Run with debug logging:
poetry run rubbernecker --debug crawl tmp/urls.txt tmp/output.avro
License
Apache-2.0