# 🕷️ WebScraperToolkit

A powerful, standalone web scraping toolkit using Playwright and various parsers.

A production-grade, multimodal scraping engine designed for AI agents. It converts the web into LLM-ready assets (Markdown, JSON, PDF) with robust anti-bot evasion.
## ✨ Design Goals

- **LLM-Native Output**: optimized for context windows — clean Markdown, semantic JSON metadata, and noise-free text extraction.
- **Robust Failover**: smart detection of anti-bot challenges (Cloudflare / 403s) automatically switches from headless to visible browser mode to pass checks.
- **Privacy & Stealth**: uses `playwright-stealth` and randomized user agents to mimic human behavior; does not leak automation headers.
- **Agent Friendly**: fully typed Python API that integrates as tool definitions for MCP servers via `fastmcp`.
- **Operational Excellence**:
  - Process Isolation: uses `ProcessPoolExecutor` to sandbox scraping tasks, preventing browser crashes from killing the main agent process.
  - Unified Logging: centralized logging ensures consistent observability across CLI and server modes.
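The process-isolation idea above can be sketched with the standard library alone. This is a minimal illustration, not the toolkit's actual implementation: `scrape_one` is a stand-in for a real browser-driven scrape.

```python
from concurrent.futures import ProcessPoolExecutor

def scrape_one(url: str) -> str:
    # A real implementation would drive a browser here; a hard crash
    # (segfault, OOM kill) terminates only this worker process.
    return f"scraped:{url}"

def scrape_isolated(urls, workers=2):
    # Each task runs in its own OS process; a dead worker surfaces as
    # an exception on its future instead of killing the parent.
    results = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {url: pool.submit(scrape_one, url) for url in urls}
        for url, fut in futures.items():
            try:
                results[url] = fut.result(timeout=60)
            except Exception as exc:
                results[url] = f"error: {exc!r}"
    return results

if __name__ == "__main__":
    print(scrape_isolated(["https://a.example", "https://b.example"]))
```

The key design point: exceptions (including a broken worker process) are confined to the individual future, so one crashed scrape cannot take the agent down with it.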
## ⭐ Features

- **Multimodal Extraction**:
  - Markdown: clean, structured text preserving headers, lists, and tables.
  - PDF: high-fidelity captures with auto-scroll enforcement for lazy-loaded assets.
  - Screenshot: full-page visual captures.
  - Metadata: extracts JSON-LD, OpenGraph, and meta tags.
- **Anti-Bot Evasion**:
  - "Smart Fetch" logic retries blocked requests in headed mode.
  - Spatial solver for Cloudflare Turnstile widgets.
- **Discovery**:
  - Sitemap (XML) parsing to extract all navigable URLs.
  - Recursive crawling of same-domain links.
- **Performance**:
  - Parallel processing via `asyncio` and `ProcessPoolExecutor`.
  - Customizable concurrency and politeness delays.
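The sitemap-discovery step above boils down to pulling every `<loc>` URL out of a `sitemap.xml` document. A minimal sketch using only the standard library (standard sitemaps live in the `http://www.sitemaps.org/schemas/sitemap/0.9` namespace):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(xml_text: str) -> list[str]:
    # Namespaced iteration finds <loc> elements in both <urlset> and
    # nested <sitemapindex> documents.
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(extract_sitemap_urls(sample))
# ['https://example.com/', 'https://example.com/about']
```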
## 🚀 Installation

### PyPI (Recommended)

```bash
pip install web-scraper-toolkit
playwright install  # Required to download browser binaries
```

### From Source

```bash
# Clone and install
git clone https://github.com/imyourboyroy/WebScraperToolkit.git
cd WebScraperToolkit
pip install -e .
playwright install
```

Requires Python 3.10+.
## 🧪 Quick Start

### CLI (Global Command)

```bash
# Basic Markdown extraction (best for RAG)
web-scraper --url https://example.com --format markdown

# High-fidelity PDF with auto-scroll
web-scraper --url https://example.com --format pdf --output-name example.pdf

# Sitemap to JSON (site mapping)
web-scraper --input https://example.com/sitemap.xml --site-tree --format json --output-name map.json
```

### Standalone (No Install)

If you prefer running without a full installation:

```bash
python scraper_cli.py --url https://example.com --format markdown
```
## 🛠️ CLI Reference

```
web-scraper [OPTIONS]
```

| Option | Shorthand | Description | Default |
|---|---|---|---|
| `--url` | `-u` | Single target URL to scrape. | None |
| `--input` | `-i` | Input file (`.txt`, `.csv`, `.json`, sitemap `.xml`) or URL. | None |
| `--format` | `-f` | Output: `markdown`, `pdf`, `screenshot`, `json`, `html`, `csv`. | `markdown` |
| `--headless` | | Run the browser in headless mode (off/visible by default for stability). | `False` |
| `--workers` | `-w` | Number of concurrent workers. Pass `max` for CPU count minus 1. | `1` |
| `--merge` | `-m` | Merge all outputs into a single file (e.g., one book PDF). | `False` |
| `--site-tree` | | Extract URLs from sitemap input without crawling content. | `False` |
| `--verbose` | `-v` | Enable verbose logging. | `False` |
## 🤖 Python API (for Agents/MCP)

Integrate the `WebCrawler` directly into your Python applications.

```python
import asyncio

from web_scraper_toolkit import WebCrawler, ScraperConfig

async def agent_task():
    # 1. Configure
    config = ScraperConfig.load({
        "scraper_settings": {"headless": True},
        "workers": 2,
    })

    # 2. Instantiate
    crawler = WebCrawler(config=config)

    # 3. Run
    results = await crawler.run(
        urls=["https://example.com"],
        output_format="markdown",
        output_dir="./memory",
    )
    print(results)

if __name__ == "__main__":
    asyncio.run(agent_task())
```
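The "smart fetch" failover described in the design goals reduces to a try-headless-then-retry-headed loop. The sketch below illustrates the control flow only; `fetch_page` is a hypothetical callable, not the library's real API, and the block markers are illustrative.

```python
# Heuristic markers that commonly indicate a Cloudflare-style challenge page.
BLOCK_MARKERS = ("just a moment", "checking your browser", "cf-challenge")

def looks_blocked(status: int, body: str) -> bool:
    lowered = body.lower()
    return status == 403 or any(m in lowered for m in BLOCK_MARKERS)

def smart_fetch(url: str, fetch_page) -> str:
    # First attempt: cheap headless fetch.
    status, body = fetch_page(url, headless=True)
    if looks_blocked(status, body):
        # Headed (visible) browsers pass many fingerprint checks that
        # headless ones fail, at the cost of needing a display.
        status, body = fetch_page(url, headless=False)
    if looks_blocked(status, body):
        raise RuntimeError(f"Blocked in both modes: {url}")
    return body

# Stub transport that only succeeds in headed mode:
def stub_fetch(url, headless):
    return (403, "Just a moment...") if headless else (200, "<h1>ok</h1>")

print(smart_fetch("https://example.com", stub_fetch))  # <h1>ok</h1>
```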
## ✅ Verified Outputs

The output below matches exactly what the test script produces.

Command:

```bash
web-scraper --url https://example.com --format markdown --headless --verbose
```

Stdout:

```
2025-12-10 11:15:00 - DEBUG - Verbose logging enabled.
========================================
Active Configuration
========================================
ScraperConfig:
{'delay': 0.0,
 'scraper_settings': {'headless': True, ...},
 'workers': 1}
========================================
--- Starting Single Target Scrape: https://example.com ---
Format: MARKDOWN
[1/1] Processing: https://example.com
--- Content Start ---
=== SCRAPED FROM: https://example.com/ (MARKDOWN) ===
# Example Domain
This domain is for use in documentation examples...
[Learn more](https://iana.org/domains/example)
--- Content End ---
Done.
```
## 🧰 Development

```bash
# Setup check
python run_tests.py

# Run verification suite
python scripts/verify_real_world.py
```
## 📜 License

MIT. See LICENSE.

## ⭐ Support

- Created by: Roy Dawson IV
- GitHub: https://github.com/imyourboyroy
- PyPI: https://pypi.org/user/ImYourBoyRoy/

If this tool helps you, star the repo and share it. Issues and PRs welcome.