
A powerful, standalone web scraping toolkit using Playwright and various parsers.

Project description

Web Scraper Toolkit (Agentic + MCP Ready)

web-scraper-toolkit is a modular scraping and crawling toolkit designed for:

  • Agentic runtimes (MCP clients like Claude Desktop, Cursor, custom orchestrators)
  • Programmatic Python usage
  • CLI workflows for scripts and batch pipelines

It focuses on robust extraction, safe automation, dynamic concurrency, and configurable runtime behavior without hardcoded operational limits.

Operational deployment guide: INSTRUCTIONS.md (Ubuntu/Windows/macOS service and remote runbooks).


Why this toolkit

  • Agent-first MCP envelopes (status/meta/data JSON response shape; see the sketch after this list)
  • Dynamic concurrency (CLI + MCP + crawler workers scale by host capacity/config)
  • Adaptive timeout profiles (fast, standard, research, long)
  • Async job lifecycle for long tasks (start_job, poll_job, cancel_job, list_jobs)
  • Remote MCP hosting support (stdio, http, sse, streamable-http)
  • Optional API-key middleware for remote MCP endpoints
  • Path safety rails for file-writing tools (screenshot/pdf/download)
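
For orientation, here is a minimal illustration of the envelope shape mentioned in the first bullet. Only the top-level status/meta/data keys come from this README; the nested field names and values below are placeholders, not the toolkit's actual schema.

# Illustrative MCP tool response envelope (shape only).
# status/meta/data are the documented top-level keys; nested fields are placeholders.
example_envelope = {
    "status": "success",                                        # outcome of the tool call
    "meta": {"tool": "scrape_url", "elapsed_ms": 842},          # placeholder metadata
    "data": {"url": "https://example.com", "content": "..."},   # tool-specific payload
}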

Installation

pip install web-scraper-toolkit
playwright install

From source:

git clone https://github.com/imyourboyroy/WebScraperToolkit.git
cd WebScraperToolkit
pip install -e .
playwright install

Runtime config hierarchy (important)

Effective precedence:

  1. CLI arguments
  2. Environment variables (WST_*)
  3. Local cfg override (settings.local.cfg or settings.cfg)
  4. config.json
  5. Built-in defaults

Use settings.example.cfg as your local override template.
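
As a mental model, the precedence above behaves like a layered lookup where earlier layers win. The sketch below is illustrative only; the resolve_setting helper and the layer dictionaries are hypothetical, not part of the toolkit.

# Hypothetical sketch of how a single setting resolves across the layers above.
# Earlier layers win; this is a mental model, not the toolkit's implementation.
def resolve_setting(name, cli_args, env, local_cfg, config_json, defaults):
    for layer in (cli_args, env, local_cfg, config_json, defaults):
        if name in layer and layer[name] is not None:
            return layer[name]
    return None

# Example: a CLI flag beats a WST_* env var, which beats settings.local.cfg, and so on.
workers = resolve_setting(
    "workers",
    cli_args={"workers": "8"},
    env={"workers": "auto"},          # e.g. WST_CLI_WORKERS_DEFAULT
    local_cfg={},
    config_json={"workers": "auto"},
    defaults={"workers": "auto"},
)
# workers == "8"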


Standalone usage (CLI)

Entry point:

web-scraper --help

Core examples:

# Single URL
web-scraper --url https://example.com --format markdown --export

# Batch input with dynamic workers
web-scraper --input urls.txt --format text --workers auto --merge --output-name merged.txt

# Sitemap tree extraction only
web-scraper --input https://example.com/sitemap.xml --site-tree --format json --output-name sitemap_tree.json

# Use custom config files
web-scraper --config ./config.json --local-config ./settings.local.cfg --url https://example.com

Key CLI options:

  • --url, --input, --crawl
  • --format (markdown, text, html, metadata, screenshot, pdf, json, xml, csv)
  • --workers (auto|max|dynamic|<int>)
  • --delay
  • --export, --merge, --output-dir, --temp-dir, --output-name, --clean
  • --contacts
  • --playbook
  • --config, --local-config
  • --headless, --verbose, --diagnostics

MCP server usage (agentic integration)

Entry point:

web-scraper-server --help

Local stdio (recommended for desktop agents)

web-scraper-server --stdio

Remote HTTP/streamable-http

web-scraper-server \
  --transport streamable-http \
  --host 0.0.0.0 \
  --port 8000 \
  --path /mcp

With API key:

set WST_MCP_API_KEY=your-secret-key
web-scraper-server --transport streamable-http --host 0.0.0.0 --port 8000 --path /mcp

Or:

web-scraper-server --transport streamable-http --api-key your-secret-key

Recommended remote deployment shape (best practice)

  1. Run MCP server as a system service on Ubuntu.
  2. Put Nginx/Caddy in front for TLS termination.
  3. Keep require_api_key=true for remote access.
  4. Tune concurrency in settings.local.cfg.
  5. Use start_job/poll_job for large workloads.

Example systemd unit:

[Unit]
Description=Web Scraper Toolkit MCP Server
After=network.target

[Service]
User=ubuntu
WorkingDirectory=/opt/webscraper
Environment=WST_MCP_API_KEY=change-me
ExecStart=/usr/bin/web-scraper-server --transport streamable-http --host 127.0.0.1 --port 8000 --path /mcp --config /opt/webscraper/config.json --local-config /opt/webscraper/settings.local.cfg
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

MCP tools

Scraping

  • scrape_url(url, selector?, max_length?, format?, timeout_profile?)
  • batch_scrape(urls, format?, timeout_profile?, workers?)
  • screenshot(url, path, timeout_profile?)
  • save_pdf(url, path, timeout_profile?)
  • get_metadata(url, timeout_profile?)
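
A minimal sketch of calling one of these tools from Python over a remote endpoint, using the same fastmcp Client pattern as the example later in this README. The endpoint URL and API key are placeholders.

import asyncio
from fastmcp import Client

async def main():
    # Placeholder endpoint and key; see "Agent integration snippets" below.
    async with Client("https://mcp.example.com/mcp", auth="YOUR_API_KEY") as client:
        result = await client.call_tool("scrape_url", {
            "url": "https://example.com",
            "format": "markdown",
            "timeout_profile": "fast",
        })
        print(result.data)

asyncio.run(main())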

Discovery

  • get_sitemap(url, keywords?, limit?, timeout_profile?)
  • crawl_site(url, timeout_profile?)
  • extract_contacts(url, timeout_profile?)
  • batch_contacts(urls, timeout_profile?)
  • extract_links(url, filter_external?, timeout_profile?)
  • search_web(query, timeout_profile?)
  • deep_research(query, timeout_profile?)

Forms / utility

  • fill_form(url, fields, submit_selector?, save_session?, session_name?, timeout_profile?)
  • extract_tables(url, table_selector?, timeout_profile?)
  • click_element(url, selector, timeout_profile?)
  • health_check()
  • validate_url(url, timeout_profile?)
  • detect_content_type(url, timeout_profile?)
  • download_file(url, path, timeout_profile?)

Content

  • chunk_text(text, max_chunk_size?, overlap?)
  • get_token_count(text, model?)
  • truncate_text(text, max_tokens?, model?)
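
These content helpers can be chained to keep scraped text within a model's context budget. A hedged sketch using the same client pattern; the chunking parameter values are arbitrary examples.

import asyncio
from fastmcp import Client

async def main():
    text = "...long scraped page content..."
    async with Client("https://mcp.example.com/mcp", auth="YOUR_API_KEY") as client:
        # Count tokens first, then split into overlapping chunks for prompting.
        tokens = await client.call_tool("get_token_count", {"text": text})
        chunks = await client.call_tool("chunk_text", {
            "text": text,
            "max_chunk_size": 2000,   # example value
            "overlap": 200,           # example value
        })
        print(tokens.data, chunks.data)

asyncio.run(main())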

Management + runtime

  • configure_scraper(headless?, browser_type?, timeout_ms?)
  • configure_stealth(respect_robots?, stealth_mode?)
  • configure_runtime(overrides_json)
  • reload_runtime_config(config_path?, local_config_path?)
  • get_config()
  • configure_retry(max_attempts?, initial_delay?, max_delay?)
  • clear_cache(), get_cache_stats()
  • clear_session(session_id?), new_session(), list_sessions()
  • get_history(limit?), clear_history()
  • run_playbook(playbook_json, proxies_json?, timeout_profile?)
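
A hedged sketch of adjusting runtime behavior on the fly with configure_runtime. It assumes overrides_json accepts the same shape as the runtime section of config.json shown later in this README, which may not be exact; check get_config() output against your deployment.

import asyncio
import json
from fastmcp import Client

async def main():
    async with Client("https://mcp.example.com/mcp", auth="YOUR_API_KEY") as client:
        # Assumption: overrides_json mirrors the "runtime" section of config.json.
        overrides = {
            "default_timeout_profile": "research",
            "concurrency": {"mcp_batch_workers": 8},
        }
        await client.call_tool("configure_runtime", {"overrides_json": json.dumps(overrides)})
        config = await client.call_tool("get_config", {})
        print(config.data)

asyncio.run(main())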

Async job lifecycle (long-running tasks)

  • start_job(job_type, payload_json, timeout_profile?)
  • poll_job(job_id, include_result?)
  • cancel_job(job_id)
  • list_jobs(limit?)

Supported start_job types:

  • batch_scrape
  • deep_research
  • run_playbook
  • batch_contacts

Concurrency and timeout model

Concurrency

  • CLI workers resolve dynamically from host capacity when using auto.
  • MCP process workers and inflight limits are dynamic/configurable.
  • Batch operations use dedicated, configurable worker limits.
  • Crawler defaults can be tuned globally in runtime config.

Timeout profiles

Built-in profiles:

  • fast
  • standard
  • research
  • long

Profiles include:

  • soft_seconds
  • hard_seconds
  • extension_seconds
  • allow_extension

Timeouts are scaled by work units for batch/heavier calls.
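
For orientation, a sketch of what one profile's fields might look like. The field names come from the list above; the numeric values and interpretations below are placeholders, not the toolkit's shipped defaults.

# Placeholder values and interpretations; real defaults live in the runtime config.
standard_profile = {
    "soft_seconds": 60,        # soft budget (placeholder value)
    "hard_seconds": 120,       # hard cutoff (placeholder value)
    "extension_seconds": 30,   # extra time when an extension is granted (placeholder)
    "allow_extension": True,   # whether the soft budget may be extended
}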


Fast result handling pattern (remote horsepower, local control)

For high parallel workloads (e.g., 40–80+ concurrent tasks on server hardware):

  1. Call start_job(...) from your local agent/runtime.
  2. Poll with poll_job(job_id) until terminal state.
  3. Pull structured result payload into local memory/store.

This avoids long blocking calls and keeps local resource usage light. A sketch of the pattern follows.
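
The sketch below uses the fastmcp client. The polling loop assumes the start_job/poll_job results expose a job_id and a status field that reaches a terminal value such as completed/failed/cancelled; adapt the checks to the actual status/meta/data envelope returned by your server.

import asyncio
import json
from fastmcp import Client

async def main():
    async with Client("https://mcp.example.com/mcp", auth="YOUR_API_KEY") as client:
        payload = {"urls": ["https://example.com", "https://example.org"], "format": "markdown"}
        started = await client.call_tool("start_job", {
            "job_type": "batch_scrape",
            "payload_json": json.dumps(payload),
            "timeout_profile": "research",
        })
        # Assumption: the job id is returned in the envelope's data section.
        job_id = started.data["data"]["job_id"]

        while True:
            polled = await client.call_tool("poll_job", {"job_id": job_id, "include_result": True})
            # Assumption: the job state is exposed as a status field in data.
            state = polled.data["data"].get("status")
            if state in ("completed", "failed", "cancelled"):
                break
            await asyncio.sleep(5)

        print(polled.data)   # structured result payload, kept in local memory

asyncio.run(main())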


Remote file output strategy

If your goal is “compute remotely, consume locally,” prefer:

  • scrape_url, batch_scrape, extract_contacts, deep_research
  • async jobs (start_job/poll_job)

Use remote file tools only when explicitly needed:

  • screenshot, save_pdf, download_file

All file writes are constrained to runtime.safe_output_root. Set safe_output_root to an isolated directory if remote files are required.


Config files

config.json

Use the runtime section for dynamic behavior:

{
  "runtime": {
    "default_timeout_profile": "standard",
    "safe_output_root": "./output",
    "concurrency": {
      "cli_workers_default": "auto",
      "mcp_process_workers": 0,
      "mcp_inflight_limit": 0,
      "mcp_batch_workers": 0,
      "crawler_default_workers": 0
    },
    "server": {
      "transport": "stdio",
      "host": "127.0.0.1",
      "port": 8000,
      "path": "/mcp",
      "require_api_key": false,
      "api_key_env": "WST_MCP_API_KEY"
    }
  }
}

settings.local.cfg / settings.cfg

Use for machine/local overrides.
See: settings.example.cfg.


Environment variables

Common runtime env vars:

  • WST_CONFIG_JSON
  • WST_LOCAL_CFG
  • WST_TIMEOUT_PROFILE
  • WST_MCP_PROCESS_WORKERS
  • WST_MCP_INFLIGHT_LIMIT
  • WST_MCP_BATCH_WORKERS
  • WST_CLI_WORKERS_DEFAULT
  • WST_SERVER_TRANSPORT
  • WST_SERVER_HOST
  • WST_SERVER_PORT
  • WST_SERVER_PATH
  • WST_SERVER_REQUIRE_API_KEY
  • WST_SERVER_API_KEY_ENV
  • WST_MCP_API_KEY
  • WST_SAFE_OUTPUT_ROOT

Agent integration snippets

Claude Desktop / Cursor style (stdio)

{
  "mcpServers": {
    "web-scraper": {
      "command": "web-scraper-server",
      "args": ["--stdio"]
    }
  }
}

Remote MCP endpoint

Point your client to:

http://<host>:<port>/<path>

with an x-api-key header (or Bearer token) if API-key authentication is enabled.

Python client example:

import asyncio
from fastmcp import Client

async def main():
    async with Client("https://mcp.example.com/mcp", auth="YOUR_API_KEY") as client:
        result = await client.call_tool("start_job", {
            "job_type": "batch_scrape",
            "payload_json": "{\"urls\": [\"https://readyforus.app\", \"https://claragurney.com\"], \"format\": \"markdown\"}",
            "timeout_profile": "research"
        })
        print(result.data)

asyncio.run(main())

Remote integration testing

Smoke script (recommended)

python verify_remote_mcp.py --remote-url https://mcp.example.com/mcp --targets https://readyforus.app https://claragurney.com

Environment-based variant:

export WST_REMOTE_MCP_URL=https://mcp.example.com/mcp
export WST_REMOTE_MCP_API_KEY=your-secret-key
python verify_remote_mcp.py

Pytest remote suite (optional)

export WST_REMOTE_MCP_URL=https://mcp.example.com/mcp
export WST_REMOTE_MCP_API_KEY=your-secret-key
pytest -q tests/test_remote_mcp_integration.py

These tests are skipped unless WST_REMOTE_MCP_URL is set.


Notes

  • No local machine paths, private hostnames, or private IPs should be committed.
  • Keep secrets in environment variables or local cfg files ignored by git.
  • For heavy remote deployments, tune concurrency + timeout profiles together.

Author

Created by Roy Dawson IV
GitHub: https://github.com/imyourboyroy
PyPI: https://pypi.org/user/ImYourBoyRoy/
