
Web scraping engine


Rubbernecker

A powerful web scraping engine built with Python and SeleniumBase that crawls web pages, stores raw HTML, and parses structured data. Rubbernecker supports configurable page actions, depth-based crawling, and proxy integration.

Overview

Rubbernecker provides four main commands:

  • chrome - Launch a Chrome browser instance with debugging capabilities
  • crawl - Scrape websites and save raw HTML to Avro files
  • parse - Extract structured data from crawled HTML
  • proxy - Run a local proxy server for routing requests

Installation

Prerequisites

Python 3.12+

Rubbernecker requires Python 3.12 or higher.

Google Chrome

Rubbernecker uses SeleniumBase with Chrome for web crawling.

macOS:

brew install --cask google-chrome

Fedora/RHEL (including WSL 2):

sudo dnf install -y fedora-workstation-repositories
sudo dnf config-manager setopt google-chrome.enabled=1
sudo dnf install -y google-chrome-stable

Ubuntu/Debian:

sudo install -d /etc/apt/keyrings
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo gpg --dearmor -o /etc/apt/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/google-chrome.gpg] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt update
sudo apt install -y google-chrome-stable

Setup

Install dependencies and set up environment:

make install

Or manually:

poetry env use python3.12
poetry install

Quick Start

See QUICKSTART.md for a step-by-step tutorial to get up and running in minutes.

Commands

rubbernecker chrome

Launch a Chrome browser instance with DevTools Protocol enabled.

Options:

  • --headless - Run Chrome in headless mode (no GUI)
  • --chrome_debug_port PORT - Port for Chrome DevTools Protocol (default: 9222)
  • --proxy_server URL - Route traffic through a proxy server

Examples:

# Launch Chrome with visual interface
poetry run rubbernecker chrome

# Launch headless Chrome on custom port
poetry run rubbernecker chrome --headless --chrome_debug_port 9223

# Launch Chrome through a proxy
poetry run rubbernecker chrome --proxy_server "http://localhost:3128"

rubbernecker crawl

Crawl web pages and save raw HTML to Avro files.

Syntax:

rubbernecker crawl [OPTIONS] INPUT_URL OUTPUT_URL

Arguments:

  • INPUT_URL - File containing URLs to crawl (text, JSON, or Avro format)
  • OUTPUT_URL - Path where crawled data will be saved (Avro format)

Key Options:

  • --input_format FORMAT - Input file format: TEXT, JSON, or AVRO
  • --chrome_debug_port PORT - Connect to Chrome on this port (default: 9222)
  • --max_depth N - Maximum crawl depth for following links (default: 0)
  • --max_retries N - Retry failed requests up to N times
  • --sleep_success SECONDS - Wait time after successful requests
  • --sleep_error SECONDS - Wait time after errors
  • --load_actions FILE - Actions to perform after page load
  • --crawl_actions FILE - Actions to discover and crawl additional links
  • --use_bloom_filter - Skip duplicate URLs (useful for large crawls)
  • --max_errors N - Stop after N errors
  • --interactive - Prompt before each crawl action
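The interplay between --max_depth and duplicate detection can be sketched as a breadth-first frontier over discovered links. This is an illustrative sketch only; LINKS and crawl below are hypothetical stand-ins, not Rubbernecker's internals:

```python
from collections import deque

# Toy link graph standing in for pages discovered during a crawl.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/c": ["https://example.com/d"],
}

def crawl(seeds, max_depth=0):
    """Visit seed URLs, then follow discovered links up to max_depth levels."""
    seen = set(seeds)  # exact duplicate detection; --use_bloom_filter trades
                       # this growing set for a fixed-size probabilistic filter
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

print(crawl(["https://example.com/"], max_depth=0))  # seeds only
print(crawl(["https://example.com/"], max_depth=2))
```

With the default --max_depth 0, only the seed URLs are fetched; raising the depth follows links discovered on each visited page.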

Examples:

# Basic crawl
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro

# Crawl with depth (follow links up to 2 levels)
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro --max_depth 2

# Crawl with custom actions
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro \
    --load_actions tmp/load-actions.txt \
    --crawl_actions tmp/crawl-actions.txt

# Crawl with error handling
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro \
    --max_retries 3 \
    --max_errors 10 \
    --sleep_error 5

rubbernecker parse

Extract structured data from crawled HTML using parsers.

Syntax:

rubbernecker parse PARSER_CLASS INPUT_URL OUTPUT_URL

Arguments:

  • PARSER_CLASS - Fully qualified parser class name
  • INPUT_URL - Avro file from crawl command
  • OUTPUT_URL - Path for parsed output (Avro format)

Available Parsers:

  • rubbernecker.parse.standard.StandardPageParser - Extracts title, headers, links, and body text

Examples:

# Parse with standard parser
poetry run rubbernecker parse rubbernecker.parse.standard.StandardPageParser \
    tmp/raw.avro tmp/parsed.avro

# View parsed results
poetry run avrokit tojson tmp/parsed.avro | jq .

rubbernecker proxy

Run a local proxy server to route requests through an upstream proxy.

Syntax:

rubbernecker proxy UPSTREAM [LISTEN]

Arguments:

  • UPSTREAM - Upstream proxy (e.g., username:password@proxy.example.com:8000)
  • LISTEN - Local address to listen on (default: 127.0.0.1:3128)

Example:

# Start proxy server
poetry run rubbernecker proxy "$PROXY_USER:$PROXY_PASS@proxy.example.com:8000" "127.0.0.1:3128"

# Use proxy in Chrome
poetry run rubbernecker chrome --proxy_server "http://127.0.0.1:3128" --headless

# Crawl via the proxied Chrome instance (crawl itself takes no proxy flag)
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro --chrome_debug_port 9222

Action Scripts

Action scripts define automated interactions with web pages using CSS selectors.

Action Script Format

[url_pattern_regex]
ACTION_NAME selector arguments
ACTION_NAME selector arguments
...

Available Actions

  • SLEEP seconds - Wait for specified duration
  • SCROLL pixels - Scroll page vertically
  • INPUT selector text - Fill form input with text
  • CLICK selector - Click an element
  • CLICK_IF_EXISTS selector - Click if element is present
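To make the format concrete, here is an illustrative parser for such a script. It is a sketch only, not Rubbernecker's own parser; parse_actions is a hypothetical name:

```python
import re

def parse_actions(text):
    """Return a list of (url_pattern, [(action, args), ...]) sections."""
    sections = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new section: the bracketed body is a URL-matching regex.
            sections.append((re.compile(line[1:-1]), []))
        else:
            # An action line: first token is the action, the rest its arguments.
            name, _, rest = line.partition(" ")
            sections[-1][1].append((name, rest))
    return sections

script = """\
[news\\.ycombinator\\.com]
SLEEP 2
SCROLL 500
CLICK a.morelink
"""
for pattern, actions in parse_actions(script):
    if pattern.search("https://news.ycombinator.com/news"):
        print(actions)
```

Each section applies only to pages whose URL matches the bracketed regex, so one file can hold distinct action lists for different sites.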

Example: Load Actions

Actions to perform after each page loads (use --load_actions flag):

cat > tmp/load-actions.txt << EOF
[news\.ycombinator\.com]
SLEEP 2
SCROLL 500
SLEEP 1
EOF

Example: Crawl Actions

Actions to discover additional URLs during crawling (use --crawl_actions flag):

cat > tmp/crawl-actions.txt << EOF
[news\.ycombinator\.com]
CLICK a.morelink
EOF

This will click the "More" link on Hacker News to discover additional pages.

Advanced Usage

Full Crawl Example

Complete example crawling Hacker News with actions:

# Prepare directories
mkdir -p tmp

# Create URL list
cat > tmp/requests.txt << EOF
https://news.ycombinator.com/
EOF

# Create load actions (wait for page to stabilize)
cat > tmp/load-actions.txt << EOF
[news\.ycombinator\.com]
SLEEP 1
SCROLL 500
EOF

# Create crawl actions (discover more pages)
cat > tmp/crawl-actions.txt << EOF
[news\.ycombinator\.com]
CLICK a.morelink
EOF

# Start Chrome
poetry run rubbernecker chrome --headless &

# Crawl with depth 2
poetry run rubbernecker crawl tmp/requests.txt tmp/hn-raw.avro \
    --load_actions tmp/load-actions.txt \
    --crawl_actions tmp/crawl-actions.txt \
    --max_depth 2 \
    --max_retries 2 \
    --sleep_success 1

# Parse results
poetry run rubbernecker parse rubbernecker.parse.standard.StandardPageParser \
    tmp/hn-raw.avro tmp/hn-parsed.avro

# View results
poetry run avrokit tojson tmp/hn-parsed.avro | jq '.title, (.links | length)'

Using with Proxies

Route traffic through a commercial proxy service:

# Start local proxy server
poetry run rubbernecker proxy \
    "$PROXY_USER:$PROXY_PASS@residential.proxy.com:8000" \
    "127.0.0.1:3128" &

# Start Chrome through proxy
poetry run rubbernecker chrome \
    --proxy_server "http://127.0.0.1:3128" \
    --chrome_debug_port 9222 \
    --headless &

# Crawl via the proxied Chrome instance
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro \
    --chrome_debug_port 9222

Output Formats

Crawl Output (Raw HTML)

Avro schema with fields:

  • url (string) - Crawled URL
  • timestamp (long) - Unix timestamp in milliseconds
  • body (string|null) - Raw HTML content
  • error (string|null) - Error message if request failed
  • metadata (map|null) - Custom metadata
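Reconstructed from the field list above, the record schema might look like the following Avro declaration. This is illustrative only; the exact schema bundled with the package may differ in names and details:

```python
import json

# Illustrative Avro record schema for crawl output, rebuilt from the
# field list above (string|null fields become Avro unions).
CRAWL_SCHEMA = {
    "type": "record",
    "name": "CrawlRecord",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "body", "type": ["string", "null"]},
        {"name": "error", "type": ["string", "null"]},
        {"name": "metadata", "type": [{"type": "map", "values": "string"}, "null"]},
    ],
}

print(json.dumps([f["name"] for f in CRAWL_SCHEMA["fields"]]))
```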

Parse Output (StandardPageParser)

Avro schema with fields:

  • url (string) - Page URL
  • timestamp (long) - Crawl timestamp
  • title (string|null) - Page title
  • content_length (int) - HTML content length
  • body_text (string|null) - Extracted text content
  • headers (array|null) - H1-H6 headers with level and text
  • links (array|null) - Links with text, URL, and external flag
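After conversion to JSON (e.g. via avrokit tojson), parsed records are plain dictionaries with this shape, so post-processing is straightforward. A sketch using a hypothetical record, not real crawl output:

```python
# Hypothetical record shaped like the StandardPageParser output above.
record = {
    "url": "https://news.ycombinator.com/",
    "title": "Hacker News",
    "headers": [{"level": 1, "text": "Hacker News"}],
    "links": [
        {"text": "comments", "url": "item?id=1", "external": False},
        {"text": "Example", "url": "https://example.com/", "external": True},
    ],
}

# Collect only links that leave the crawled site.
external = [link["url"] for link in record["links"] if link["external"]]
print(external)
```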

Troubleshooting

Chrome connection issues:

  • Ensure Chrome is running and that its --chrome_debug_port matches the port passed to the crawl command
  • Check if port 9222 is available: lsof -i :9222

SeleniumBase errors:

  • Update Chrome to the latest version

Memory issues with large crawls:

  • Use --use_bloom_filter to reduce memory for duplicate detection
  • Process in smaller batches with multiple crawl commands
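--use_bloom_filter saves memory because a Bloom filter records set membership in a fixed-size bit array instead of storing every URL string, at the cost of occasional false positives (a URL wrongly reported as seen and therefore skipped). A minimal sketch of the idea, not Rubbernecker's implementation:

```python
import hashlib

class BloomFilter:
    """Fixed-memory set membership with a small false-positive rate."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

bf = BloomFilter()
bf.add("https://example.com/")
print("https://example.com/" in bf)       # True
print("https://example.com/other" in bf)  # almost certainly False
```

Memory use is constant regardless of how many URLs are added, which is why it suits very large crawls better than an exact set.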

Development

Run tests:

make test

Run all tests (including integration tests):

make test-all

Run tests with coverage:

make test-coverage

Lint and type check:

make lint
make typecheck

Format code:

make format

Build the package:

make build

Clean up build artifacts:

make clean

Run with debug logging:

poetry run rubbernecker --debug crawl tmp/urls.txt tmp/output.avro

License

Apache-2.0

Download files

Download the file for your platform.

Source Distribution

rubbernecker-0.0.2.tar.gz (21.9 kB)


Built Distribution


rubbernecker-0.0.2-py3-none-any.whl (25.6 kB)


File details

Details for the file rubbernecker-0.0.2.tar.gz.

File metadata

  • Size: 21.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rubbernecker-0.0.2.tar.gz:

  • SHA256: eb6cb3057d81d12dfd2482377cd5763244d257f25e943f2e201478badf9f458c
  • MD5: e0b1de9f6c6adfda7fe35721df8ae47f
  • BLAKE2b-256: 555afd26f08e60cf650e9cc372484b3edbd62a85d393a04e45c75281d75c7633


Provenance

The following attestation bundles were made for rubbernecker-0.0.2.tar.gz:

Publisher: release.yml on brandtg/rubbernecker


File details

Details for the file rubbernecker-0.0.2-py3-none-any.whl.

File metadata

  • Size: 25.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rubbernecker-0.0.2-py3-none-any.whl:

  • SHA256: 6fa6fdfba473bea6fc0bdeb6251858e0a23c957208c4a094cbda6b729477ed19
  • MD5: 1d46698b21a971fc957319c0898e4f25
  • BLAKE2b-256: a2563bb57e18a3ece753542af9aa4b9ed6fa33dc8af4f82756c52b241cbb26ae


Provenance

The following attestation bundles were made for rubbernecker-0.0.2-py3-none-any.whl:

Publisher: release.yml on brandtg/rubbernecker

