
Web scraping engine


Rubbernecker

A powerful web scraping engine built with Python and SeleniumBase that crawls web pages, stores raw HTML, and parses structured data. Rubbernecker supports configurable page actions, depth-based crawling, and proxy integration.

Overview

Rubbernecker provides four main commands:

  • chrome - Launch a Chrome browser instance with debugging capabilities
  • crawl - Scrape websites and save raw HTML to Avro files
  • parse - Extract structured data from crawled HTML
  • proxy - Run a local proxy server for routing requests

Installation

Prerequisites

Python 3.12+

Rubbernecker requires Python 3.12 or higher.

Google Chrome

Rubbernecker uses SeleniumBase with Chrome for web crawling.

macOS:

brew install --cask google-chrome

Fedora/RHEL (including WSL 2):

sudo dnf install -y fedora-workstation-repositories
sudo dnf config-manager setopt google-chrome.enabled=1
sudo dnf install -y google-chrome-stable

Ubuntu/Debian:

sudo install -d /etc/apt/keyrings
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo gpg --dearmor -o /etc/apt/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/google-chrome.gpg] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt update
sudo apt install -y google-chrome-stable

Setup

Install dependencies and set up environment:

make install

Or manually:

poetry env use python3.12
poetry install

Quick Start

See QUICKSTART.md for a step-by-step tutorial to get up and running in minutes.

Commands

rubbernecker chrome

Launch a Chrome browser instance with DevTools Protocol enabled.

Options:

  • --headless - Run Chrome in headless mode (no GUI)
  • --chrome_debug_port PORT - Port for Chrome DevTools Protocol (default: 9222)
  • --proxy_server URL - Route traffic through a proxy server

Examples:

# Launch Chrome with visual interface
poetry run rubbernecker chrome

# Launch headless Chrome on custom port
poetry run rubbernecker chrome --headless --chrome_debug_port 9223

# Launch Chrome through a proxy
poetry run rubbernecker chrome --proxy_server "http://localhost:3128"

rubbernecker crawl

Crawl web pages and save raw HTML to Avro files.

Syntax:

rubbernecker crawl [OPTIONS] INPUT_URL OUTPUT_URL

Arguments:

  • INPUT_URL - File containing URLs to crawl (text, JSON, or Avro format)
  • OUTPUT_URL - Path where crawled data will be saved (Avro format)

Key Options:

  • --input_format FORMAT - Input file format: TEXT, JSON, or AVRO
  • --chrome_debug_port PORT - Connect to Chrome on this port (default: 9222)
  • --max_depth N - Maximum crawl depth for following links (default: 0)
  • --max_retries N - Retry failed requests up to N times
  • --sleep_success SECONDS - Wait time after successful requests
  • --sleep_error SECONDS - Wait time after errors
  • --load_actions FILE - Actions to perform after page load
  • --crawl_actions FILE - Actions to discover and crawl additional links
  • --use_bloom_filter - Skip duplicate URLs (useful for large crawls)
  • --max_errors N - Stop after N errors
  • --interactive - Prompt before each crawl action
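The interplay between --max_depth and duplicate detection can be sketched as a breadth-first frontier over discovered links. This is an illustrative sketch only; LINKS and crawl below are hypothetical stand-ins, not Rubbernecker's internals:

```python
from collections import deque

# Toy link graph standing in for pages discovered during a crawl.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/c": ["https://example.com/d"],
}

def crawl(seeds, max_depth=0):
    """Visit seed URLs, then follow discovered links up to max_depth levels."""
    seen = set(seeds)  # exact duplicate detection; --use_bloom_filter trades
                       # this growing set for a fixed-size probabilistic filter
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

print(crawl(["https://example.com/"], max_depth=0))  # seeds only
print(crawl(["https://example.com/"], max_depth=2))
```

With the default --max_depth 0, only the seed URLs are fetched; raising the depth follows links discovered on each visited page.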

Examples:

# Basic crawl
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro

# Crawl with depth (follow links up to 2 levels)
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro --max_depth 2

# Crawl with custom actions
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro \
    --load_actions tmp/load-actions.txt \
    --crawl_actions tmp/crawl-actions.txt

# Crawl with error handling
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro \
    --max_retries 3 \
    --max_errors 10 \
    --sleep_error 5

rubbernecker parse

Extract structured data from crawled HTML using parsers.

Syntax:

rubbernecker parse PARSER_CLASS INPUT_URL OUTPUT_URL

Arguments:

  • PARSER_CLASS - Fully qualified parser class name
  • INPUT_URL - Avro file from crawl command
  • OUTPUT_URL - Path for parsed output (Avro format)

Available Parsers:

  • rubbernecker.parse.standard.StandardPageParser - Extracts title, headers, links, and body text

Examples:

# Parse with standard parser
poetry run rubbernecker parse rubbernecker.parse.standard.StandardPageParser \
    tmp/raw.avro tmp/parsed.avro

# View parsed results
poetry run avrokit tojson tmp/parsed.avro | jq .

rubbernecker proxy

Run a local proxy server to route requests through an upstream proxy.

Syntax:

rubbernecker proxy UPSTREAM [LISTEN]

Arguments:

  • UPSTREAM - Upstream proxy (e.g., username:password@proxy.example.com:8000)
  • LISTEN - Local address to listen on (default: 127.0.0.1:3128)

Example:

# Start proxy server
poetry run rubbernecker proxy "$PROXY_USER:$PROXY_PASS@proxy.example.com:8000" "127.0.0.1:3128"

# Use proxy in Chrome
poetry run rubbernecker chrome --proxy_server "http://127.0.0.1:3128" --headless

# Crawl via the proxied Chrome instance (crawl itself takes no proxy flag)
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro --chrome_debug_port 9222

Action Scripts

Action scripts define automated interactions with web pages using CSS selectors.

Action Script Format

[url_pattern_regex]
ACTION_NAME selector arguments
ACTION_NAME selector arguments
...

Available Actions

  • SLEEP seconds - Wait for specified duration
  • SCROLL pixels - Scroll page vertically
  • INPUT selector text - Fill form input with text
  • CLICK selector - Click an element
  • CLICK_IF_EXISTS selector - Click if element is present
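To make the format concrete, here is an illustrative parser for such a script. It is a sketch only, not Rubbernecker's own parser; parse_actions is a hypothetical name:

```python
import re

def parse_actions(text):
    """Return a list of (url_pattern, [(action, args), ...]) sections."""
    sections = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new section: the bracketed body is a URL-matching regex.
            sections.append((re.compile(line[1:-1]), []))
        else:
            # An action line: first token is the action, the rest its arguments.
            name, _, rest = line.partition(" ")
            sections[-1][1].append((name, rest))
    return sections

script = """\
[news\\.ycombinator\\.com]
SLEEP 2
SCROLL 500
CLICK a.morelink
"""
for pattern, actions in parse_actions(script):
    if pattern.search("https://news.ycombinator.com/news"):
        print(actions)
```

Each section applies only to pages whose URL matches the bracketed regex, so one file can hold distinct action lists for different sites.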

Example: Load Actions

Actions to perform after each page loads (use --load_actions flag):

cat > tmp/load-actions.txt << EOF
[news\.ycombinator\.com]
SLEEP 2
SCROLL 500
SLEEP 1
EOF

Example: Crawl Actions

Actions to discover additional URLs during crawling (use --crawl_actions flag):

cat > tmp/crawl-actions.txt << EOF
[news\.ycombinator\.com]
CLICK a.morelink
EOF

This will click the "More" link on Hacker News to discover additional pages.

Advanced Usage

Full Crawl Example

Complete example crawling Hacker News with actions:

# Prepare directories
mkdir -p tmp

# Create URL list
cat > tmp/requests.txt << EOF
https://news.ycombinator.com/
EOF

# Create load actions (wait for page to stabilize)
cat > tmp/load-actions.txt << EOF
[news\.ycombinator\.com]
SLEEP 1
SCROLL 500
EOF

# Create crawl actions (discover more pages)
cat > tmp/crawl-actions.txt << EOF
[news\.ycombinator\.com]
CLICK a.morelink
EOF

# Start Chrome
poetry run rubbernecker chrome --headless &

# Crawl with depth 2
poetry run rubbernecker crawl tmp/requests.txt tmp/hn-raw.avro \
    --load_actions tmp/load-actions.txt \
    --crawl_actions tmp/crawl-actions.txt \
    --max_depth 2 \
    --max_retries 2 \
    --sleep_success 1

# Parse results
poetry run rubbernecker parse rubbernecker.parse.standard.StandardPageParser \
    tmp/hn-raw.avro tmp/hn-parsed.avro

# View results
poetry run avrokit tojson tmp/hn-parsed.avro | jq '.title, (.links | length)'

Using with Proxies

Route traffic through a commercial proxy service:

# Start local proxy server
poetry run rubbernecker proxy \
    "$PROXY_USER:$PROXY_PASS@residential.proxy.com:8000" \
    "127.0.0.1:3128" &

# Start Chrome through proxy
poetry run rubbernecker chrome \
    --proxy_server "http://127.0.0.1:3128" \
    --chrome_debug_port 9222 \
    --headless &

# Crawl via the proxied Chrome instance
poetry run rubbernecker crawl tmp/urls.txt tmp/output.avro \
    --chrome_debug_port 9222

Output Formats

Crawl Output (Raw HTML)

Avro schema with fields:

  • url (string) - Crawled URL
  • timestamp (long) - Unix timestamp in milliseconds
  • body (string|null) - Raw HTML content
  • error (string|null) - Error message if request failed
  • metadata (map|null) - Custom metadata
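Reconstructed from the field list above, the record schema might look like the following Avro declaration. This is illustrative only; the exact schema bundled with the package may differ in names and details:

```python
import json

# Illustrative Avro record schema for crawl output, rebuilt from the
# field list above (string|null fields become Avro unions).
CRAWL_SCHEMA = {
    "type": "record",
    "name": "CrawlRecord",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "body", "type": ["string", "null"]},
        {"name": "error", "type": ["string", "null"]},
        {"name": "metadata", "type": [{"type": "map", "values": "string"}, "null"]},
    ],
}

print(json.dumps([f["name"] for f in CRAWL_SCHEMA["fields"]]))
```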

Parse Output (StandardPageParser)

Avro schema with fields:

  • url (string) - Page URL
  • timestamp (long) - Crawl timestamp
  • title (string|null) - Page title
  • content_length (int) - HTML content length
  • body_text (string|null) - Extracted text content
  • headers (array|null) - H1-H6 headers with level and text
  • links (array|null) - Links with text, URL, and external flag
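After conversion to JSON (e.g. via avrokit tojson), parsed records are plain dictionaries with this shape, so post-processing is straightforward. A sketch using a hypothetical record, not real crawl output:

```python
# Hypothetical record shaped like the StandardPageParser output above.
record = {
    "url": "https://news.ycombinator.com/",
    "title": "Hacker News",
    "headers": [{"level": 1, "text": "Hacker News"}],
    "links": [
        {"text": "comments", "url": "item?id=1", "external": False},
        {"text": "Example", "url": "https://example.com/", "external": True},
    ],
}

# Collect only links that leave the crawled site.
external = [link["url"] for link in record["links"] if link["external"]]
print(external)
```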

Troubleshooting

Chrome connection issues:

  • Ensure Chrome is running and that its --chrome_debug_port matches the port passed to the crawl command
  • Check if port 9222 is available: lsof -i :9222

SeleniumBase errors:

  • Update Chrome to the latest version

Memory issues with large crawls:

  • Use --use_bloom_filter to reduce memory for duplicate detection
  • Process in smaller batches with multiple crawl commands
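--use_bloom_filter saves memory because a Bloom filter records set membership in a fixed-size bit array instead of storing every URL string, at the cost of occasional false positives (a URL wrongly reported as seen and therefore skipped). A minimal sketch of the idea, not Rubbernecker's implementation:

```python
import hashlib

class BloomFilter:
    """Fixed-memory set membership with a small false-positive rate."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

bf = BloomFilter()
bf.add("https://example.com/")
print("https://example.com/" in bf)       # True
print("https://example.com/other" in bf)  # almost certainly False
```

Memory use is constant regardless of how many URLs are added, which is why it suits very large crawls better than an exact set.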

Development

Run tests:

make test

Run all tests (including integration tests):

make test-all

Run tests with coverage:

make test-coverage

Lint and type check:

make lint
make typecheck

Format code:

make format

Build the package:

make build

Clean up build artifacts:

make clean

Run with debug logging:

poetry run rubbernecker --debug crawl tmp/urls.txt tmp/output.avro

License

Apache-2.0

Download files

Download the file for your platform.

Source Distribution

rubbernecker-0.0.2.tar.gz (21.9 kB)


Built Distribution


rubbernecker-0.0.2-py3-none-any.whl (25.6 kB)


File details

Details for the file rubbernecker-0.0.2.tar.gz.

File metadata

  • Size: 21.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rubbernecker-0.0.2.tar.gz:

  • SHA256: eb6cb3057d81d12dfd2482377cd5763244d257f25e943f2e201478badf9f458c
  • MD5: e0b1de9f6c6adfda7fe35721df8ae47f
  • BLAKE2b-256: 555afd26f08e60cf650e9cc372484b3edbd62a85d393a04e45c75281d75c7633


Provenance

The following attestation bundles were made for rubbernecker-0.0.2.tar.gz:

Publisher: release.yml on brandtg/rubbernecker


File details

Details for the file rubbernecker-0.0.2-py3-none-any.whl.

File metadata

  • Size: 25.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for rubbernecker-0.0.2-py3-none-any.whl:

  • SHA256: 6fa6fdfba473bea6fc0bdeb6251858e0a23c957208c4a094cbda6b729477ed19
  • MD5: 1d46698b21a971fc957319c0898e4f25
  • BLAKE2b-256: a2563bb57e18a3ece753542af9aa4b9ed6fa33dc8af4f82756c52b241cbb26ae


Provenance

The following attestation bundles were made for rubbernecker-0.0.2-py3-none-any.whl:

Publisher: release.yml on brandtg/rubbernecker

