Web Grabber

Web Grabber is a CLI tool for crawling websites and downloading their content, including HTML, images, and videos. It offers three browsing modes: plain HTTP requests, Selenium for JavaScript rendering, and camoufox for evading anti-bot detection.

Features

  • Complete Website Crawling: Download all HTML pages, images, and videos from a website
  • Anti-Bot Protection: Use camoufox to avoid detection by anti-bot mechanisms
  • Tor Integration: Route traffic through Tor network for anonymity
  • Selenium Support: Render JavaScript for dynamic websites
  • Multi-threaded Downloading: Efficiently download resources in parallel
  • Targeted Scraping: Extract specific elements using CSS selectors

Installation

Using Rye (Recommended)

# Clone the repository
git clone https://github.com/tadeasf/web-grabber.git
cd web-grabber

# Install with Rye
rye sync
rye build
pip install dist/*.whl

Alternative: Standard pip installation

# Clone the repository
git clone https://github.com/tadeasf/web-grabber.git
cd web-grabber

# Install dependencies and package
pip install .

Usage

Once installed, the web-grabber command is available directly from your terminal.

Basic Website Crawling

Download an entire website including all HTML, images, and videos:

web-grabber grab https://example.com --output-dir ./example_site --depth 2

Alternatively, you can run the module directly:

python -m web_grabber grab https://example.com --output-dir ./example_site --depth 2

Options for grab Command

  • url: URL of the website to crawl
  • --output-dir PATH: Directory to save downloaded content (default: ./grabbed_site)
  • --depth INT: Maximum crawl depth (default: 100, effectively unlimited for most sites)
  • --tor: Route traffic through Tor network
  • --selenium: Use Selenium for JavaScript rendered content
  • --camoufox: Use camoufox for anti-bot protection (overrides --selenium)
  • --threads INT: Number of concurrent threads for crawling (default: 5)
  • --delay FLOAT: Delay between requests in seconds (default: 0.5)
  • --timeout INT: Request timeout in seconds (default: 30)
  • --user-agent TEXT: Custom user agent string
  • --verbose: Enable verbose logging

Targeted Scraping

Extract specific elements from a website using CSS selectors:

web-grabber scrape https://example.com --selector "div.product" --output-file products.json

Options for scrape Command

  • url: URL of the website to scrape
  • --selector TEXT: CSS selector to extract specific elements
  • --output-file PATH: Output file for scraped data (default: scraped_data.json)
  • --format TEXT: Output format (json, csv, txt) (default: json)
  • --tor: Route traffic through Tor network
  • --selenium: Use Selenium for JavaScript rendered content
  • --camoufox: Use camoufox for anti-bot protection (overrides --selenium)
  • --user-agent TEXT: Custom user agent string
  • --verbose: Enable verbose logging
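The three --format values can be pictured with a small serializer. This is only a sketch: the record shape (dictionaries with a "text" key) and the serialize function are hypothetical, and the files Web Grabber actually writes may be structured differently.

```python
import csv
import io
import json
from typing import Dict, List


def serialize(records: List[Dict[str, str]], fmt: str = "json") -> str:
    """Render scraped records as json, csv, or txt.

    The record shape is an assumption made for illustration only.
    """
    if fmt == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)
    if fmt == "csv":
        buffer = io.StringIO()
        # Use the union of all keys so no field is silently dropped.
        fieldnames = sorted({key for record in records for key in record})
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    if fmt == "txt":
        # One line of extracted text per matched element.
        return "\n".join(record.get("text", "") for record in records)
    raise ValueError(f"unsupported format: {fmt!r}")
```

JSON keeps any nesting intact, CSV suits flat records that will be opened in a spreadsheet, and txt keeps only the text content.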

Anti-Bot Protection

Web Grabber can use camoufox to evade anti-bot detection. camoufox spoofs the browser fingerprint so that automated traffic is harder to identify:

web-grabber grab https://example.com --camoufox

Tor Integration

Route your traffic through the Tor network for anonymity:

# Make sure Tor is running on 127.0.0.1:9050
web-grabber grab https://example.com --tor
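Before passing --tor, it can help to confirm that something is listening on the expected address. The sketch below uses only the standard library; the tor_port_open helper is hypothetical and not part of Web Grabber. It checks that the port accepts TCP connections, not that the listener is actually Tor.

```python
import socket


def tor_port_open(host: str = "127.0.0.1", port: int = 9050,
                  timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

If tor_port_open() returns False, start the Tor service or check which port it is configured to use before running the crawl.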

Examples

Download a website with JavaScript rendering

web-grabber grab https://example.com --selenium --depth 3

Scrape product information with anti-bot protection

web-grabber scrape https://example.com/products --selector ".product-card" --camoufox

Anonymous crawling with Tor

web-grabber grab https://example.com --tor --delay 1.0
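To run several crawls from a script, the commands above can be assembled in Python and passed to subprocess. The helper names below are hypothetical. Only flags documented in this README are used, and the meaning of the CLI's exit code is an assumption.

```python
import subprocess
from typing import List, Optional


def build_grab_command(url: str,
                       output_dir: Optional[str] = None,
                       depth: Optional[int] = None,
                       delay: Optional[float] = None,
                       tor: bool = False,
                       selenium: bool = False,
                       camoufox: bool = False) -> List[str]:
    """Build the argument list for `web-grabber grab`."""
    cmd = ["web-grabber", "grab", url]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if depth is not None:
        cmd += ["--depth", str(depth)]
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if tor:
        cmd.append("--tor")
    if selenium:
        cmd.append("--selenium")
    if camoufox:
        cmd.append("--camoufox")
    return cmd


def run_grab(url: str, **options) -> int:
    """Run the crawl and return the process exit code."""
    # Passing a list (no shell) avoids quoting problems in URLs.
    return subprocess.run(build_grab_command(url, **options), check=False).returncode
```

For example, run_grab("https://example.com", tor=True, delay=1.0) launches the same command as the Tor example above.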

Requirements

  • Python 3.8+
  • Required dependencies (automatically installed):
    • camoufox[geoip] (for anti-bot protection)
    • typer (for CLI interface)
    • selenium (for JavaScript rendering)
  • Tor (optional, for anonymous browsing)
  • Chrome/Chromium (for Selenium and camoufox modes)

License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.

Download files

Source Distribution

web_grabber-0.5.45.tar.gz (38.0 kB, Source)

Built Distribution

web_grabber-0.5.45-py3-none-any.whl (47.9 kB, Python 3)

File details

Details for the file web_grabber-0.5.45.tar.gz.

File metadata

  • Download URL: web_grabber-0.5.45.tar.gz
  • Upload date:
  • Size: 38.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.8

File hashes

Hashes for web_grabber-0.5.45.tar.gz:

  • SHA256: a2844cf8ac0691d30a72fa36a964289ad7a5fd7b4eb90db488e2cab33d7d691a
  • MD5: 971cbbd3149d7a5960502afd723c524e
  • BLAKE2b-256: 5fe7a247f8776939a15bffd03f31623db1b79de0050d4ef38c1181421a578758

File details

Details for the file web_grabber-0.5.45-py3-none-any.whl.

File metadata

  • Download URL: web_grabber-0.5.45-py3-none-any.whl
  • Upload date:
  • Size: 47.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.8

File hashes

Hashes for web_grabber-0.5.45-py3-none-any.whl:

  • SHA256: 7bc1847cb895bb47ee736bf16c52cdd4dba257cf25c83fcb8167d710314df113
  • MD5: 23c3f6d83c1ddbae81fda5aa9c132178
  • BLAKE2b-256: 7b9f653f158768d60aa81563e81c6be0922a85f2f10cccd05884460c92231c3e
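A downloaded file can be checked against the SHA256 digests listed above with the standard library. The helper names sha256_of and matches are hypothetical; the sketch reads the file in chunks so that large archives do not have to fit in memory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of the file at path."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex: str) -> bool:
    """Compare the file's digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches("web_grabber-0.5.45.tar.gz", "a2844cf8ac0691d30a72fa36a964289ad7a5fd7b4eb90db488e2cab33d7d691a") should return True for an intact copy of the source distribution.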
