Web Grabber

Web Grabber is a CLI tool for crawling websites and downloading their content, including HTML pages, images, and videos. It offers three browsing modes: normal HTTP requests, Selenium for JavaScript rendering, and camoufox for evading anti-bot detection.

Features

Complete Website Crawling: Download all HTML pages, images, and videos from a website
Interactive Mode: Choose where to save content with path completion
Automatic Directory Naming: Creates a directory named after the website's domain
Anti-Bot Protection: Use camoufox to avoid detection by anti-bot mechanisms
Tor Integration: Route traffic through the Tor network for anonymity
Selenium Support: Render JavaScript for dynamic websites
Multi-threaded Downloading: Efficiently download resources in parallel
Targeted Scraping: Extract specific elements using CSS selectors
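
As a rough illustration of the multi-threaded downloading feature listed above, the following sketch runs downloads on a pool of worker threads. It is not Web Grabber's actual code: `download_all` and the `download_one` callable are hypothetical names, and the default of 5 workers only mirrors the documented `--threads` default.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download_one, threads=5):
    """Run download_one(url) for every URL on a pool of worker threads.

    download_one is a hypothetical callable supplied by the caller.
    Returns a dict mapping each URL to its result, or to the exception raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Submit everything first so the downloads overlap.
        futures = {url: pool.submit(download_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed download should not stop the rest
                results[url] = exc
    return results
```

Collecting exceptions per URL, instead of letting the first one propagate, keeps a single broken resource from aborting the whole batch.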

Installation

Using Rye (Recommended)

# Clone the repository
git clone https://github.com/tadeasf/web-grabber.git
cd web-grabber

# Install with Rye
rye sync
rye build
pip install dist/*.whl

Alternative: Standard pip installation

# Clone the repository
git clone https://github.com/tadeasf/web-grabber.git
cd web-grabber

# Install dependencies and package
pip install .

Usage

Once installed, the web-grabber command is available directly from your terminal.

Basic Website Crawling

Download an entire website including all HTML, images, and videos:

web-grabber grab https://example.com

Web Grabber will interactively prompt you where to save the content, with the domain name as the default directory.
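
The domain-based default can be pictured with a short sketch. This is an assumption about the behaviour, not the tool's source: `default_output_dir` is a hypothetical helper that takes the host part of the URL as the directory name.

```python
from urllib.parse import urlparse


def default_output_dir(url):
    """Return the host part of the URL, used here as the default directory name."""
    host = urlparse(url).hostname  # lower-cased, without port or credentials
    if not host:
        raise ValueError(f"cannot derive a directory name from {url!r}")
    return host


print(default_output_dir("https://example.com/some/page"))
```

For `https://example.com/some/page` this prints `example.com`, matching the default shown in the interactive example later in this document.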

For non-interactive use with an explicit output directory:

web-grabber grab https://example.com --output-dir ./example_site --non-interactive

Alternatively, you can run the module directly:

python -m web_grabber grab https://example.com
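
When driving Web Grabber from a script, it can help to assemble the command as an argument list rather than a string. The sketch below uses only flags documented in this README. `build_grab_command` itself is a hypothetical helper, and it always adds `--non-interactive` on the assumption that a script cannot answer prompts.

```python
import shlex


def build_grab_command(url, output_dir=None, depth=None, delay=None,
                       tor=False, selenium=False, camoufox=False):
    """Assemble the argument list for a non-interactive `web-grabber grab` run."""
    cmd = ["web-grabber", "grab", url, "--non-interactive"]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if depth is not None:
        cmd += ["--depth", str(depth)]
    if delay is not None:
        cmd += ["--delay", str(delay)]
    for flag, enabled in (("--tor", tor), ("--selenium", selenium),
                          ("--camoufox", camoufox)):
        if enabled:
            cmd.append(flag)
    return cmd


cmd = build_grab_command("https://example.com", output_dir="./example_site", depth=3)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once web-grabber is installed.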

Options for `grab` Command

url: URL of the website to crawl
--output-dir PATH: Directory to save downloaded content (defaults to domain name if not specified)
--non-interactive: Run in non-interactive mode (no prompts)
--depth INT: Maximum crawl depth (default: 100, effectively unlimited for most sites)
--tor: Route traffic through the Tor network
--selenium: Use Selenium for JavaScript-rendered content
--camoufox: Use camoufox for anti-bot protection (overrides --selenium)
--threads INT: Number of concurrent threads for crawling (default: 5)
--delay FLOAT: Delay between requests in seconds (default: 0.5)
--timeout INT: Request timeout in seconds (default: 30)
--user-agent TEXT: Custom user agent string
--verbose: Enable verbose logging
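
To make the effect of `--depth` concrete, here is a minimal sketch of a depth-limited breadth-first crawl. It rests on assumptions: the start page is counted as depth 0, pages at the maximum depth are still visited but their links are not followed, and `fetch_links` is a hypothetical callable that returns the URLs found on a page. Web Grabber's own traversal may differ.

```python
from collections import deque


def crawl(start_url, fetch_links, max_depth=100):
    """Breadth-first crawl that stops following links beyond max_depth.

    Returns the visited pages as (url, depth) pairs in visiting order.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # page is visited, but its links are not followed
        for link in fetch_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# A tiny in-memory "site" stands in for real HTTP requests.
site = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": ["e"]}
print(crawl("a", lambda url: site.get(url, []), max_depth=2))
```

With `max_depth=2` the page `e`, three links away from the start, is never reached.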

Targeted Scraping

Extract specific elements from a website using CSS selectors:

web-grabber scrape https://example.com --selector "div.product" --output-file products.json
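
For readers unfamiliar with CSS selectors, the sketch below shows what a selector such as `div.product` or `.product-card` picks out. It is a deliberately simplified, standard-library illustration that handles only the `tag.class` and `.class` forms. `SelectorTextExtractor` and `extract_text` are hypothetical names, and this README does not state which selector engine Web Grabber uses.

```python
from html.parser import HTMLParser


class SelectorTextExtractor(HTMLParser):
    """Collect the text of each element matching a simple `tag.class` selector."""

    def __init__(self, selector):
        super().__init__()
        tag, _, self.want_class = selector.partition(".")
        self.want_tag = tag.lower()
        self.open_tag = None  # tag name of the element being captured
        self.depth = 0        # > 0 while inside a matching element
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.open_tag:
                self.depth += 1  # nested element with the same tag name
            return
        classes = (dict(attrs).get("class") or "").split()
        tag_ok = not self.want_tag or tag == self.want_tag
        class_ok = not self.want_class or self.want_class in classes
        if tag_ok and class_ok:
            self.open_tag, self.depth, self.buffer = tag, 1, []

    def handle_endtag(self, tag):
        if self.depth and tag == self.open_tag:
            self.depth -= 1
            if self.depth == 0:
                # Join the text pieces and collapse runs of whitespace.
                self.results.append(" ".join(" ".join(self.buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_text(html, selector):
    parser = SelectorTextExtractor(selector)
    parser.feed(html)
    parser.close()
    return parser.results
```

Real selector engines support far more syntax (descendants, attributes, pseudo-classes); the point here is only that a selector names a set of elements whose content is then extracted.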

Options for `scrape` Command

url: URL of the website to scrape
--selector TEXT: CSS selector to extract specific elements
--output-file PATH: Output file for scraped data (default: scraped_data.json)
--format TEXT: Output format (json, csv, txt) (default: json)
--tor: Route traffic through the Tor network
--selenium: Use Selenium for JavaScript-rendered content
--camoufox: Use camoufox for anti-bot protection (overrides --selenium)
--user-agent TEXT: Custom user agent string
--verbose: Enable verbose logging

Anti-Bot Protection

Web Grabber can browse through camoufox, which spoofs browser fingerprints to help avoid detection by anti-bot systems:

web-grabber grab https://example.com --camoufox

Tor Integration

Route your traffic through the Tor network for anonymity:

# Make sure Tor is running on 127.0.0.1:9050
web-grabber grab https://example.com --tor
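
Before running with `--tor`, it can be useful to confirm that something is listening on the expected address. The sketch below is a plain TCP reachability check, with `is_port_open` as a hypothetical helper name. A successful connection shows only that the port is open, not that the listener is actually Tor.

```python
import socket


def is_port_open(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False


if __name__ == "__main__":
    if is_port_open():
        print("A service is listening on 127.0.0.1:9050")
    else:
        print("Nothing is listening on 127.0.0.1:9050; start Tor first")
```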

Examples

Interactive mode with automatic domain-based directory

web-grabber grab https://example.com
# Prompts for a save location, with "example.com" as the default directory

Download a website with JavaScript rendering

web-grabber grab https://example.com --selenium --depth 3

Non-interactive mode with explicit output directory

web-grabber grab https://example.com --output-dir ./custom_folder --non-interactive

Scrape product information with anti-bot protection

web-grabber scrape https://example.com/products --selector ".product-card" --camoufox

Anonymous crawling with Tor

web-grabber grab https://example.com --tor --delay 1.0

Requirements

Python 3.8+
Required dependencies (automatically installed):
- prompt-toolkit (for interactive CLI features)
- camoufox[geoip] (for anti-bot protection)
- typer (for CLI interface)
- selenium (for JavaScript rendering)
Tor (optional, for anonymous browsing)
Chrome/Chromium (for Selenium and camoufox modes)

License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.

Project details

Release history

0.5.46 (this version): Mar 4, 2025
0.5.45: Mar 4, 2025

Download files

Source Distribution

web_grabber-0.5.46.tar.gz (38.4 kB), uploaded Mar 4, 2025 (Source)

Built Distribution

web_grabber-0.5.46-py3-none-any.whl (48.1 kB), uploaded Mar 4, 2025 (Python 3)

File details

Details for the file web_grabber-0.5.46.tar.gz.

File metadata

Download URL: web_grabber-0.5.46.tar.gz
Upload date: Mar 4, 2025
Size: 38.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.8

File hashes

Hashes for web_grabber-0.5.46.tar.gz
Algorithm	Hash digest
SHA256	`18ca5ee6913637657f4bdbb27344ce106b7b1bde0cba2ac5beead63a5e6a9eb5`
MD5	`3ee6410b2e10503e08b7df04f8378dd3`
BLAKE2b-256	`d6f85eddb3a3c2f73e967774598d1a14b09f2ac1c2d5ca133725540ee6a08232`
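
A downloaded file can be checked against the SHA256 digest listed above with a few lines of standard-library Python. This is a generic sketch, with `sha256_matches` as a hypothetical helper name; the file is read in chunks so large archives do not need to fit in memory.

```python
import hashlib


def sha256_matches(path, expected_hex, chunk_size=65536):
    """Return True if the file's SHA256 digest equals expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For example, `sha256_matches("web_grabber-0.5.46.tar.gz", "<SHA256 from the table above>")` should return True for an unmodified download.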


File details

Details for the file web_grabber-0.5.46-py3-none-any.whl.

File metadata

Download URL: web_grabber-0.5.46-py3-none-any.whl
Upload date: Mar 4, 2025
Size: 48.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.8

File hashes

Hashes for web_grabber-0.5.46-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1380d63bbc512d4fd9eba98afff4d53d34a98c4d73fd75fd7f62fe5d9d489f03`
MD5	`4e413396f8634048efd2132bb7e05cfa`
BLAKE2b-256	`23d75a6307ea141a76f5f568b5ad011a74a02046512bc6250e9a534edc30dfde`
