Web Scraper Toolkit (Agentic + MCP Ready)
A powerful, standalone web scraping toolkit using Playwright and various parsers.
web-scraper-toolkit is a modular scraping and crawling toolkit designed for:
- Agentic runtimes (MCP clients like Claude Desktop, Cursor, custom orchestrators)
- Programmatic Python usage
- CLI workflows for scripts and batch pipelines
It focuses on robust extraction, safe automation, dynamic concurrency, and configurable runtime behavior without hardcoded operational limits.
Operational deployment guide: INSTRUCTIONS.md (Ubuntu/Windows/macOS service and remote runbooks).
Why this toolkit
- Agent-first MCP envelopes (status/meta/data JSON response shape)
- Dynamic concurrency (CLI + MCP + crawler workers scale by host capacity/config)
- Adaptive timeout profiles (fast, standard, research, long)
- Async job lifecycle for long tasks (start_job, poll_job, cancel_job, list_jobs)
- Remote MCP hosting support (stdio, http, sse, streamable-http)
- Optional API-key middleware for remote MCP endpoints
- Path safety rails for file-writing tools (screenshot/pdf/download)
Installation
pip install web-scraper-toolkit
playwright install
From source:
git clone https://github.com/imyourboyroy/WebScraperToolkit.git
cd WebScraperToolkit
pip install -e .
playwright install
Runtime config hierarchy (important)
Effective precedence:
- CLI arguments
- Environment variables (WST_*)
- Local cfg override (settings.local.cfg or settings.cfg)
- config.json
- Built-in defaults
Use settings.example.cfg as your local override template.
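For example, a flag overrides the matching environment variable (pairing the --workers flag with WST_CLI_WORKERS_DEFAULT, both documented below; per the precedence above, the flag wins):
# Env default asks for 8 workers, but the explicit flag takes precedence:
WST_CLI_WORKERS_DEFAULT=8 web-scraper --url https://example.com --workers 4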
Standalone usage (CLI)
Entry point:
web-scraper --help
Core examples:
# Single URL
web-scraper --url https://example.com --format markdown --export
# Batch input with dynamic workers
web-scraper --input urls.txt --format text --workers auto --merge --output-name merged.txt
# Sitemap tree extraction only
web-scraper --input https://example.com/sitemap.xml --site-tree --format json --output-name sitemap_tree.json
# Use custom config files
web-scraper --config ./config.json --local-config ./settings.local.cfg --url https://example.com
Key CLI options:
- --url, --input, --crawl
- --format (markdown, text, html, metadata, screenshot, pdf, json, xml, csv)
- --workers (auto | max | dynamic | <int>)
- --delay
- --export, --merge, --output-dir, --temp-dir, --output-name, --clean
- --contacts
- --playbook
- --config, --local-config
- --headless, --verbose, --diagnostics
MCP server usage (agentic integration)
Entry point:
web-scraper-server --help
Local stdio (recommended for desktop agents)
web-scraper-server --stdio
Remote HTTP/streamable-http
web-scraper-server \
--transport streamable-http \
--host 0.0.0.0 \
--port 8000 \
--path /mcp
With API key:
export WST_MCP_API_KEY=your-secret-key   # Windows: set WST_MCP_API_KEY=your-secret-key
web-scraper-server --transport streamable-http --host 0.0.0.0 --port 8000 --path /mcp
Or:
web-scraper-server --transport streamable-http --api-key your-secret-key
Recommended remote deployment shape (best practice)
- Run MCP server as a system service on Ubuntu.
- Put Nginx/Caddy in front for TLS termination (an Nginx sketch follows the systemd unit below).
- Keep require_api_key=true for remote access.
- Tune concurrency in settings.local.cfg.
- Use start_job/poll_job for large workloads.
Example systemd unit:
[Unit]
Description=Web Scraper Toolkit MCP Server
After=network.target
[Service]
User=ubuntu
WorkingDirectory=/opt/webscraper
Environment=WST_MCP_API_KEY=change-me
ExecStart=/usr/bin/web-scraper-server --transport streamable-http --host 127.0.0.1 --port 8000 --path /mcp --config /opt/webscraper/config.json --local-config /opt/webscraper/settings.local.cfg
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
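And a matching Nginx sketch for the TLS layer mentioned above. Certificate paths and the server name are placeholders; it proxies to the loopback address and path the unit binds:
server {
    listen 443 ssl;
    server_name mcp.example.com;

    ssl_certificate     /etc/letsencrypt/live/mcp.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/mcp.example.com/privkey.pem;

    location /mcp {
        proxy_pass http://127.0.0.1:8000/mcp;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_buffering off;       # keep streamed MCP responses flowing
        proxy_read_timeout 300s;   # allow long-running tool calls
    }
}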
MCP tools
Scraping
- scrape_url(url, selector?, max_length?, format?, timeout_profile?)
- batch_scrape(urls, format?, timeout_profile?, workers?)
- screenshot(url, path, timeout_profile?)
- save_pdf(url, path, timeout_profile?)
- get_metadata(url, timeout_profile?)
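Any of these can be invoked through an MCP client. A minimal fastmcp sketch of a single scrape_url call (the exact fields inside the status/meta/data envelope may vary):
import asyncio
from fastmcp import Client

async def main():
    # Remote endpoint shown; a local stdio server works the same way via client config.
    async with Client("https://mcp.example.com/mcp", auth="YOUR_API_KEY") as client:
        result = await client.call_tool("scrape_url", {
            "url": "https://example.com",
            "format": "markdown",
            "timeout_profile": "fast"
        })
        print(result.data)  # status/meta/data envelope

asyncio.run(main())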
Discovery
- get_sitemap(url, keywords?, limit?, timeout_profile?)
- crawl_site(url, timeout_profile?)
- extract_contacts(url, timeout_profile?)
- batch_contacts(urls, timeout_profile?)
- extract_links(url, filter_external?, timeout_profile?)
- search_web(query, timeout_profile?)
- deep_research(query, timeout_profile?)
Forms / utility
- fill_form(url, fields, submit_selector?, save_session?, session_name?, timeout_profile?)
- extract_tables(url, table_selector?, timeout_profile?)
- click_element(url, selector, timeout_profile?)
- health_check()
- validate_url(url, timeout_profile?)
- detect_content_type(url, timeout_profile?)
- download_file(url, path, timeout_profile?)
Content
- chunk_text(text, max_chunk_size?, overlap?)
- get_token_count(text, model?)
- truncate_text(text, max_tokens?, model?)
Management + runtime
- configure_scraper(headless?, browser_type?, timeout_ms?)
- configure_stealth(respect_robots?, stealth_mode?)
- configure_runtime(overrides_json)
- reload_runtime_config(config_path?, local_config_path?)
- get_config()
- configure_retry(max_attempts?, initial_delay?, max_delay?)
- clear_cache(), get_cache_stats()
- clear_session(session_id?), new_session(), list_sessions()
- get_history(limit?), clear_history()
- run_playbook(playbook_json, proxies_json?, timeout_profile?)
Async job lifecycle (long-running tasks)
- start_job(job_type, payload_json, timeout_profile?)
- poll_job(job_id, include_result?)
- cancel_job(job_id)
- list_jobs(limit?)
Supported start_job types:
- batch_scrape
- deep_research
- run_playbook
- batch_contacts
Concurrency and timeout model
Concurrency
- CLI workers resolve dynamically from host capacity when using auto.
- MCP process workers and inflight limits are dynamic/configurable.
- Batch operations use dedicated, configurable worker limits.
- Crawler defaults can be tuned globally in runtime config.
Timeout profiles
Built-in profiles:
- fast
- standard
- research
- long
Profiles include:
- soft_seconds
- hard_seconds
- extension_seconds
- allow_extension
Timeouts are scaled by work units for batch/heavier calls.
Fast result handling pattern (remote horsepower, local control)
For high parallel workloads (e.g., 40–80+ concurrent tasks on server hardware):
- Call start_job(...) from your local agent/runtime.
- Poll with poll_job(job_id) until a terminal state.
- Pull the structured result payload into local memory/store.
This avoids long blocking calls and keeps laptop resources light.
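A minimal sketch of that loop with the fastmcp client (the terminal state names and envelope field paths are assumptions; check poll_job's actual response shape):
import asyncio
from fastmcp import Client

TERMINAL = {"completed", "failed", "cancelled"}  # assumed terminal states

async def main():
    async with Client("https://mcp.example.com/mcp", auth="YOUR_API_KEY") as client:
        job = await client.call_tool("start_job", {
            "job_type": "batch_scrape",
            "payload_json": "{\"urls\": [\"https://example.com\"], \"format\": \"markdown\"}"
        })
        job_id = job.data["data"]["job_id"]  # assumed location in the envelope
        while True:
            status = await client.call_tool("poll_job", {"job_id": job_id, "include_result": True})
            state = status.data.get("data", {}).get("status")  # assumed field path
            if state in TERMINAL:
                print(status.data)  # pull the structured payload into local memory
                break
            await asyncio.sleep(5)

asyncio.run(main())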
Remote file output strategy
If your goal is “compute remotely, consume locally,” prefer:
- scrape_url, batch_scrape, extract_contacts, deep_research
- async jobs (start_job / poll_job)
Use remote file tools only when explicitly needed:
screenshot, save_pdf, download_file
All file writes are constrained to runtime.safe_output_root.
Set safe_output_root to an isolated directory if remote files are required.
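A short remote-file sketch (assuming relative paths resolve beneath safe_output_root; paths that escape it should be rejected by the path safety rails):
import asyncio
from fastmcp import Client

async def main():
    async with Client("https://mcp.example.com/mcp", auth="YOUR_API_KEY") as client:
        res = await client.call_tool("screenshot", {
            "url": "https://example.com",
            "path": "shots/example.png",  # relative: lands under runtime.safe_output_root
            "timeout_profile": "standard"
        })
        print(res.data)

asyncio.run(main())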
Config files
config.json
Use the runtime section for dynamic behavior:
{
  "runtime": {
    "default_timeout_profile": "standard",
    "safe_output_root": "./output",
    "concurrency": {
      "cli_workers_default": "auto",
      "mcp_process_workers": 0,
      "mcp_inflight_limit": 0,
      "mcp_batch_workers": 0,
      "crawler_default_workers": 0
    },
    "server": {
      "transport": "stdio",
      "host": "127.0.0.1",
      "port": 8000,
      "path": "/mcp",
      "require_api_key": false,
      "api_key_env": "WST_MCP_API_KEY"
    }
  }
}
settings.local.cfg / settings.cfg
Use for machine/local overrides.
See: settings.example.cfg.
Environment variables
Common runtime env vars:
- WST_CONFIG_JSON
- WST_LOCAL_CFG
- WST_TIMEOUT_PROFILE
- WST_MCP_PROCESS_WORKERS
- WST_MCP_INFLIGHT_LIMIT
- WST_MCP_BATCH_WORKERS
- WST_CLI_WORKERS_DEFAULT
- WST_SERVER_TRANSPORT
- WST_SERVER_HOST
- WST_SERVER_PORT
- WST_SERVER_PATH
- WST_SERVER_REQUIRE_API_KEY
- WST_SERVER_API_KEY_ENV
- WST_MCP_API_KEY
- WST_SAFE_OUTPUT_ROOT
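For example, a server can be configured entirely from the environment (per the precedence hierarchy, these override cfg/json values but lose to CLI flags):
export WST_SERVER_TRANSPORT=streamable-http
export WST_SERVER_HOST=0.0.0.0
export WST_SERVER_PORT=8000
export WST_SERVER_PATH=/mcp
export WST_MCP_API_KEY=your-secret-key
web-scraper-server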
Agent integration snippets
Claude Desktop / Cursor style (stdio)
{
  "mcpServers": {
    "web-scraper": {
      "command": "web-scraper-server",
      "args": ["--stdio"]
    }
  }
}
Remote MCP endpoint
Point your client to:
http://<host>:<port>/<path>
with x-api-key header (or Bearer token) if enabled.
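A quick reachability check from the shell (the exact status code depends on the transport; the goal is to confirm TLS, routing, and the API-key gate, not to speak full MCP):
curl -i -H "x-api-key: your-secret-key" https://mcp.example.com/mcp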
Python client example:
import asyncio
from fastmcp import Client

async def main():
    async with Client("https://mcp.example.com/mcp", auth="YOUR_API_KEY") as client:
        result = await client.call_tool("start_job", {
            "job_type": "batch_scrape",
            "payload_json": "{\"urls\": [\"https://readyforus.app\", \"https://claragurney.com\"], \"format\": \"markdown\"}",
            "timeout_profile": "research"
        })
        print(result.data)

asyncio.run(main())
Remote integration testing
Smoke script (recommended)
python verify_remote_mcp.py --remote-url https://mcp.example.com/mcp --targets https://readyforus.app https://claragurney.com
Environment-based variant:
export WST_REMOTE_MCP_URL=https://mcp.example.com/mcp
export WST_REMOTE_MCP_API_KEY=your-secret-key
python verify_remote_mcp.py
Pytest remote suite (optional)
export WST_REMOTE_MCP_URL=https://mcp.example.com/mcp
export WST_REMOTE_MCP_API_KEY=your-secret-key
pytest -q tests/test_remote_mcp_integration.py
These tests are skipped unless WST_REMOTE_MCP_URL is set.
Notes
- No local machine paths, private hostnames, or private IPs should be committed.
- Keep secrets in environment variables or local cfg files ignored by git.
- For heavy remote deployments, tune concurrency + timeout profiles together.
Author
Created by Roy Dawson IV
GitHub: https://github.com/imyourboyroy
PyPI: https://pypi.org/user/ImYourBoyRoy/