Reusable web-scraping toolkit — Pattern A/B/C/D ladder, TLS-impersonation fallback chain, deterministic fixture-replay testing, and an optional MCP server for LLM agents.

Maintainers

ValeroK

scrapper-tool

A reusable Python web-scraping toolkit — production-grade primitives, anti-bot ladder, fixture-replay testing.

Built from the scraping core behind PartsPilot, extracted as an open-source library so other projects (and LLM agents) can pick up the same patterns without redoing the reverse-engineering work.

Quickstart · Settings · Pattern E (LLM agent) · MCP integration · Docker · Changelog

Status (2026-05-02): stable (v1.0.0). The public Python API and MCP tool surface are SemVer-stable. v0.1.0 covered the core pattern ladder, anti-bot helpers, and deterministic fixture-replay testing. v0.2.0 added an MCP server for LLM agents. v1.0.0 adds Pattern E — local-LLM-driven scraping for any protected site, via Camoufox + browser-use + Crawl4AI + Ollama (zero API cost), and graduates the project out of alpha. See docs/patterns/e-llm-agent.md.

Why scrapper-tool
The five scraping patterns
Architecture
Install
Quickstart
Run as an MCP server
Run as an HTTP REST sidecar
Run with Docker
Settings
Documentation
Why these tools?
Roadmap
Contributing
Contributors
Acknowledgements
License

Why scrapper-tool

Most scrapers are written from scratch every time, even though 90% of the work is the same: pick the right extraction pattern, survive the TLS fingerprint, retry/backoff sanely, and write tests that don't drift the moment a site updates.

scrapper-tool packages the parts that don't change per vendor, so you only write the parts that do.

Pattern-first design. Five named, documented extraction patterns (A–E) — pick the one DevTools points at, skip the rest.
Anti-bot ladder built in. Auto-walks chrome133a → chrome124 → safari18_0 → firefox135 when a profile gets fingerprinted.
Deterministic tests. Fixture-replay (FakeCurlSession, replay_fixture, golden snapshots) — no live HTTP in CI.
Optional hostile mode. Cloudflare Turnstile / Akamai EVA defeat path via Scrapling — opt-in extra, no Playwright bloat by default.
LLM-agent ready. v0.2.0+ ships an MCP server so Claude, AutoGen, LangChain, etc. can drive the scraper directly.
Local-LLM scraping for any protected site (v1.0.0+). Pattern E adds Camoufox + browser-use + Crawl4AI + Ollama — zero API cost, two modes (agent_extract for fast 1-call extraction, agent_browse for interactive multi-step tasks). Auto-cascade captcha solver (Camoufox auto-pass → Theyka → optional paid). Humanlike-behavior layer defeats DataDome.
Boring stack. httpx, curl_cffi, selectolax, extruct. No managed SaaS bundled — your code, your egress.
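
The fixture-replay idea from the list above can be sketched with the standard library alone. This is an illustration under assumptions, not the library's implementation: `RecordedResponse`, `ReplaySession`, and `FIXTURES` are hypothetical names, and the real `FakeCurlSession` / `replay_fixture` helpers may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedResponse:
    """A response captured once from the live site and stored as a fixture."""

    status_code: int
    text: str


class ReplaySession:
    """Hypothetical stand-in for a fixture-replaying session (no network I/O)."""

    def __init__(self, fixtures: dict[str, RecordedResponse]) -> None:
        self._fixtures = fixtures
        self.requested: list[str] = []  # lets a test assert which URLs were hit

    def get(self, url: str) -> RecordedResponse:
        self.requested.append(url)
        try:
            return self._fixtures[url]
        except KeyError:
            # Fail loudly instead of silently falling back to live HTTP.
            raise LookupError(f"no fixture recorded for {url}") from None


# Made-up fixture content, keyed by the URL it was recorded from.
FIXTURES = {
    "https://example-shop.test/product/123": RecordedResponse(
        200, '<meta itemprop="price" content="19.99">'
    ),
}

session = ReplaySession(FIXTURES)
response = session.get("https://example-shop.test/product/123")
```

Because an unknown URL raises instead of reaching the network, a test that drifts from its fixtures fails immediately rather than passing against live data.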

The five scraping patterns

Web scraping in 2026 is dominated by five recurring patterns. This lib gives each pattern a documented helper plus the surrounding infrastructure (HTTP client with TLS-impersonation fallback, retry/backoff, fixture-replay testing) so you don't reinvent them per vendor.

Pattern	When to use	Helper	Cost
A — JSON API	DevTools shows an XHR returning the price-bearing JSON. Anonymous or OAuth.	`vendor_client()` + your own response model	Lowest — parse, validate, done.
B — Embedded JSON	Document HTML carries `<script type="application/ld+json">`, `__NEXT_DATA__`, `__NUXT__`, or `self.__next_f.push(...)`.	`patterns.b.extract_product_offer()` (via `extruct`)	Low — one call, broad markup coverage.
C — CSS / microdata	Price visible in HTML, no embedded JSON. Prefer `itemprop="price"` schema.org microdata.	`patterns.c.extract_microdata_price()` (via `selectolax`)	Medium — selectors break on ancestor reshuffles.
D — Hostile	Cloudflare Turnstile, Akamai EVA, etc. defeat both default `httpx` and `curl_cffi`.	`patterns.d.hostile_client()` (via Scrapling) — `pip install scrapper-tool[hostile]`	High — Playwright runtime, ≈400 MB image bloat.
E — LLM agent (v1.0.0+)	Pattern D still gets blocked, OR the page needs interaction (login, multi-step nav, dynamic forms), OR there's no stable selector.	`agent_extract()` (Crawl4AI + Ollama) and `agent_browse()` (browser-use + Camoufox + Ollama) — `pip install scrapper-tool[llm-agent]`	Highest — local-LLM latency. Free at run-time (no API). See Pattern E docs.

Plus a four-profile anti-bot ladder (chrome133a → chrome124 → safari18_0 → firefox135) that auto-walks when a profile gets fingerprinted, and a scrapper-tool canary CLI for nightly fingerprint-health probes.

Architecture

flowchart TD
    A[Your scraper code or LLM agent] --> B[vendor_client / request_with_retry]
    B --> C{TLS-sensitive?}
    C -- no --> D[httpx]
    C -- yes --> E[curl_cffi ladder]
    E --> E1[chrome133a] --> E2[chrome124] --> E3[safari18_0] --> E4[firefox135]
    D --> F[Response]
    E4 --> F
    F --> G{Pattern}
    G -- A --> H[JSON API model]
    G -- B --> I[extruct: ld+json / next_data / nuxt]
    G -- C --> J[selectolax: microdata / CSS]
    G -- D --> K["Scrapling (Playwright + Turnstile)"]
    G -- "BlockedError + interactive" --> M["Pattern E: agent_extract / agent_browse"]
    M --> M1["Stealth browser (Camoufox / Patchright / Zendriver)"]
    M1 --> M2["Local LLM (Ollama, qwen3-vl:8b)"]
    M2 --> M3["Captcha cascade (Camoufox auto → Theyka → paid)"]
    M3 --> L[Validated product data]
    H --> L
    I --> L
    J --> L
    K --> L
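
The pattern decision node in the flowchart can be approximated with cheap substring checks on the fetched HTML. This is a rough heuristic sketch, not part of the library: `suggest_pattern` is a hypothetical helper, and the markers are the ones named in the pattern table above.

```python
# Markers that indicate embedded JSON (Pattern B), taken from the pattern table.
EMBEDDED_JSON_MARKERS = (
    'type="application/ld+json"',
    "__NEXT_DATA__",
    "__NUXT__",
    "self.__next_f.push(",
)


def suggest_pattern(html: str) -> str:
    """Return "B", "C", or "escalate" from substring checks on the HTML."""
    if any(marker in html for marker in EMBEDDED_JSON_MARKERS):
        return "B"
    if 'itemprop="price"' in html:
        return "C"
    # Pattern A needs DevTools recon of XHR traffic; D and E need a browser.
    return "escalate"


suggestion = suggest_pattern('<script id="__NEXT_DATA__">{}</script>')
```

Pattern A cannot be detected this way because the JSON endpoint is discovered from network traffic, not from the document HTML.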

Install

Recommended — all five patterns in one install (uv):

uv pip install 'scrapper-tool[full,agent]'  # Pattern A/B/C/D/E + MCP server (quotes keep zsh from globbing the brackets)
camoufox fetch                              # ~300 MB — best-stealth Firefox (Pattern E)
patchright install chromium                 # ~250 MB — fast-mode Chromium (Pattern E)
ollama pull qwen3-vl:8b                     # default model (16 GB VRAM); use qwen3-vl:4b on 8 GB

[full] bundles [hostile] + [llm-agent] + [turnstile-solver] so every pattern works in one environment. It's uv-only because Scrapling pins lxml>=6 and Crawl4AI pins lxml~=5.3, and only uv honors the [tool.uv] override-dependencies = ["lxml>=6.0.3"] declared in pyproject.toml. The override is safe — both libraries use the stable lxml.html/XPath surface that's compatible across lxml 5/6.

À la carte (when you don't need everything)

pip install scrapper-tool                   # core: httpx + curl_cffi + selectolax + extruct
pip install 'scrapper-tool[agent]'          # adds the MCP server
pip install 'scrapper-tool[hostile]'        # Pattern D — Scrapling
pip install 'scrapper-tool[llm-agent]'      # Pattern E — Camoufox + browser-use + Crawl4AI + Ollama

[hostile] and [llm-agent] are mutually exclusive under plain pip (lxml conflict). For both in one env, use uv pip install scrapper-tool[full,agent] above, or pip with a constraints file pinning lxml>=6.0.3.

Quickstart

import asyncio
from scrapper_tool import vendor_client, request_with_retry
from scrapper_tool.patterns.b import extract_product_offer

async def main() -> None:
    async with vendor_client() as client:
        resp = await request_with_retry(client, "GET", "https://example-shop.test/product/123")
        product = extract_product_offer(resp.text, base_url=str(resp.url))
        print(product)

asyncio.run(main())
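
For readers curious what a retry helper of this kind typically does, here is a minimal sketch of retry with exponential backoff and jitter. It is an assumption-laden illustration, not the implementation of `request_with_retry`: the function name, the retryable status set, and the default delays are all hypothetical.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable

# Hypothetical choice of statuses worth retrying.
RETRYABLE = {429, 500, 502, 503, 504}


async def retry_with_backoff(
    send: Callable[[], Awaitable[int]],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> int:
    """Call send() until it returns a non-retryable status or attempts run out."""
    status = await send()
    for attempt in range(1, attempts):
        if status not in RETRYABLE:
            break
        # Exponential backoff: base delays of 0.5 s, 1 s, 2 s, plus random jitter.
        delay = base_delay * 2 ** (attempt - 1)
        await sleep(delay + random.uniform(0, delay / 2))
        status = await send()
    return status


async def demo() -> tuple[int, int]:
    """Simulate a server that fails twice and then succeeds."""
    statuses = iter([503, 429, 200])
    calls = 0

    async def flaky_send() -> int:
        nonlocal calls
        calls += 1
        return next(statuses)

    async def no_sleep(_: float) -> None:
        return None  # skip real waiting so the demo finishes instantly

    status = await retry_with_backoff(flaky_send, sleep=no_sleep)
    return status, calls


final_status, call_count = asyncio.run(demo())
```

Injecting the `sleep` callable keeps the retry loop testable without real delays, in the same spirit as the fixture-replay helpers.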

For TLS-sensitive vendors, flip one switch:

async with vendor_client(use_curl_cffi=True) as client:
    ...   # walks chrome133a → chrome124 → safari → firefox until one returns 200

For protected sites (Cloudflare, DataDome, Akamai) where Pattern D fails, escalate to Pattern E:

import asyncio
from scrapper_tool.agent import agent_extract, agent_browse

# E1 — fast extraction-after-render. 1 LLM call, default for "scrape this data".
result = asyncio.run(
    agent_extract(
        "https://quotes.toscrape.com/",
        schema={
            "type": "object",
            "properties": {
                "quotes": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "text": {"type": "string"},
                            "author": {"type": "string"},
                        },
                    },
                }
            },
        },
    )
)
print(result.data)

# E2 — multi-step interactive task (login, paginate, fill forms).
result = asyncio.run(
    agent_browse(
        "https://example.com/login",
        instruction="Log in with username 'demo' and password 'demo123', "
                    "then return the user's email shown on the dashboard.",
    )
)

See docs/quickstart.md for a 5-minute on-ramp covering all five patterns and docs/patterns/e-llm-agent.md for Pattern E specifics (when to use which mode, hardware sizing, captcha cascade, ToS notes).
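
Since the model's output is only as reliable as the model, it can be worth checking `result.data` against the schema you passed in. The sketch below is a deliberately tiny checker covering just the `type`, `properties`, and `items` keywords used in the example above. `matches` is a hypothetical helper, the sample data is invented, and a full JSON Schema validator would be the better choice in production.

```python
def matches(schema: dict, value: object) -> bool:
    """Check value against a tiny JSON Schema subset: type, properties, items."""
    expected = schema.get("type")
    if expected == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        # Only properties that are present are checked (nothing is "required").
        return all(
            matches(sub, value[key]) for key, sub in props.items() if key in value
        )
    if expected == "array":
        if not isinstance(value, list):
            return False
        items = schema.get("items")
        return items is None or all(matches(items, item) for item in value)
    if expected == "string":
        return isinstance(value, str)
    return True  # types outside this subset are not checked


QUOTES_SCHEMA = {
    "type": "object",
    "properties": {
        "quotes": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "author": {"type": "string"},
                },
            },
        }
    },
}

# Invented data in the shape the E1 example asks for.
sample_data = {"quotes": [{"text": "A sample quote.", "author": "Someone"}]}
is_valid = matches(QUOTES_SCHEMA, sample_data)
```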

Run as an MCP server

scrapper-tool ships an MCP server that exposes every pattern as a tool any MCP-aware client (Claude Desktop, Claude Code, OpenClaw, Hermes Agent, AutoGen, LangChain) can call.

Tools exposed

Tool	Purpose
`auto_scrape(url, schema_json, instruction, model, browser, timeout_s)` (v1.1.0+)	Recommended first tool. Auto-escalating ladder A/B/C → E1 → E2 in a single call. Returns `pattern_used`.
`fetch_with_ladder(url, method, use_curl_cffi, extract_structured)`	HTTP fetch through the TLS-impersonation ladder. With `extract_structured=True` (v1.1.0+) also runs Pattern B + C.
`extract_product(html, base_url)`	Pattern B — schema.org Product+Offer parser.
`extract_microdata_price(html)`	Pattern C — `<meta itemprop="price">` parser.
`canary(url, profiles)`	Walk the impersonation ladder and report which profile won.
`agent_extract(url, schema_json, instruction, model, browser, headful, timeout_s)`	Pattern E1 — render with a stealth browser, 1 LLM call to extract structured JSON. Requires `[llm-agent]` extra.
`agent_browse(url, instruction, schema_json, model, browser, max_steps, headful, timeout_s)`	Pattern E2 — multi-step browser-use agent loop for interactive tasks. Requires `[llm-agent]` extra.

How it runs

The server speaks three transports — pick the one your client supports:

Transport	Used by	How
stdio (default)	Claude Desktop, Claude Code (local)	Client spawns `scrapper-tool-mcp` as a subprocess; JSON-RPC over stdin/stdout.
streamable-http	Cursor, Claude Code (remote), mcp-use, any 2026 MCP-aware app	Long-running service; client connects via `url:` config.
sse	Older clients still on Server-Sent Events	Same as streamable-http but at `/sse`.

pip install 'scrapper-tool[agent]'            # MCP only
pip install 'scrapper-tool[agent,llm-agent]'  # MCP + Pattern E

scrapper-tool-mcp                           # stdio (default)
scrapper-tool-mcp --transport streamable-http --host 0.0.0.0 --port 8765
scrapper-tool-mcp --help                    # full flag reference
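
Under the stdio transport, each tool invocation is a JSON-RPC 2.0 `tools/call` request written as a single line to the server's stdin. The sketch below only builds that line; `tools_call_message` is a hypothetical helper. A real client must complete the MCP `initialize` handshake first, so in practice you would let an MCP SDK handle the session.

```python
import json


def tools_call_message(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a newline-terminated JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps escapes newlines inside strings, so the frame stays on one line.
    return json.dumps(message) + "\n"


line = tools_call_message(
    1,
    "extract_microdata_price",
    {"html": '<meta itemprop="price" content="19.99">'},
)
```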

Or via Docker (recommended — bundles all five patterns):

# HTTP service on host port 8765 — ready for Cursor / Claude Code / mcp-use:
SCRAPPER_TOOL_MCP_PORT=8765 \
SCRAPPER_TOOL_AGENT_LLM=openai_compat \
SCRAPPER_TOOL_AGENT_OLLAMA_URL=http://host.docker.internal:1234 \
SCRAPPER_TOOL_AGENT_MODEL=qwen3-vl-8b-instruct \
docker compose --profile http up -d scrapper-tool-mcp-http

Wire into Claude Code / Cursor / Claude Desktop

Recommended — point at the Docker HTTP service

Once docker compose --profile http up -d scrapper-tool-mcp-http is running, any URL-aware MCP client connects with one line:

// Cursor — Settings → MCP → Add Server, OR ~/.cursor/mcp.json
{
  "mcpServers": {
    "scrapper-tool": {
      "url": "http://localhost:8765/mcp",
      "type": "http"
    }
  }
}

// Claude Code — .mcp.json (project) or claude_desktop_config.json (global)
{
  "mcpServers": {
    "scrapper-tool": {
      "url": "http://localhost:8765/mcp"
    }
  }
}

This is the production shape: one warm container, many concurrent agents, clean URL config, no per-call cold-start. Restart-as-a-service via docker compose --profile http restart scrapper-tool-mcp-http.

Local-binary stdio (Claude Desktop pattern)

If your client only supports the spawn-a-binary pattern:

{
  "mcpServers": {
    "scrapper-tool": {
      "command": "scrapper-tool-mcp",
      "args": [],
      "env": {
        "SCRAPPER_TOOL_AGENT_BROWSER": "patchright",
        "SCRAPPER_TOOL_AGENT_MODEL": "qwen3-vl:8b",
        "SCRAPPER_TOOL_AGENT_OLLAMA_URL": "http://localhost:11434"
      }
    }
  }
}

Or spawn the Docker container per call (Pattern E works on Windows hosts this way because the agent runs Linux-side):

{
  "mcpServers": {
    "scrapper-tool": {
      "command": "docker",
      "args": [
        "compose", "-f", "/abs/path/to/scrapper-tool/docker-compose.yml",
        "run", "--rm", "-T", "scrapper-tool"
      ]
    }
  }
}

For framework-specific wiring (AutoGen, LangChain, mcp-use, OpenClaw, Hermes Agent), see docs/agent-integration.md.

Run as an HTTP REST sidecar

Available since v1.1.0.

When the consumer is a service (not an LLM agent) — for example the affiliate service, a Node/Go backend, or a Python worker that already speaks HTTP — spawn the REST sidecar on port 5792:

pip install 'scrapper-tool[http]'
scrapper-tool-serve

Or via Docker (bundles all five patterns):

docker compose --profile rest up -d scrapper-tool-rest
curl http://localhost:5792/health    # {"status": "ok"}

The primary endpoint is POST /scrape — it runs the full A/B/C → E1 → E2 escalation ladder server-side so callers don't need per-pattern decision logic:

curl -s -X POST http://localhost:5792/scrape \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/product/123"}'

Endpoint	Purpose
`POST /scrape`	Primary. Auto-escalating ladder A/B/C → E1 → E2. Returns `pattern_used`.
`POST /fetch`	Pattern A/B/C with optional Pattern B/C structured extraction.
`POST /extract`	Pattern E1 direct (Crawl4AI + LLM, 1 call).
`POST /browse`	Pattern E2 direct (browser-use multi-step agent).
`GET /health`	Liveness probe — always 200.
`GET /ready`	Readiness with detailed component checks (Ollama, model, browser).
`GET /version`	Version + installed-extras info.
`GET /docs`	Swagger UI.
`GET /openapi.json`	Raw OpenAPI 3.1 spec — for typed-client codegen.

Optional X-API-Key auth via SCRAPPER_TOOL_HTTP_API_KEY. Full reference and examples in docs/http-sidecar.md; static OpenAPI spec at docs/openapi/openapi.yaml for generating typed clients (Python via openapi-python-client, TypeScript via openapi-typescript-codegen).

Run with Docker

The repository ships one image — Dockerfile — that bundles all five patterns (A/B/C/D/E + MCP server): Scrapling, Camoufox-ready, Patchright, Crawl4AI, browser-use, captcha solvers. Built on the [full] extra.

The image does NOT bundle an LLM. You bring your own — Ollama, LM Studio, llama.cpp, vLLM — running on the host (or a remote server) and the container talks to it over host.docker.internal (Mac/Windows Docker Desktop maps this natively; on Linux the compose file declares extra_hosts).

One-liner — assuming Ollama on host

ollama pull qwen3-vl:8b                           # one-time on the host
docker compose run --rm scrapper-tool python -c "
import asyncio
from scrapper_tool.agent import agent_extract
print(asyncio.run(agent_extract(
    'https://quotes.toscrape.com/',
    schema={'type':'object','properties':{'quotes':{'type':'array'}}},
)))
"

The container resolves SCRAPPER_TOOL_AGENT_OLLAMA_URL=http://host.docker.internal:11434 by default. Override in .env or environment to point elsewhere — see the external LLM section below.

What's in the image

Capability	Status
Pattern A (JSON API), B (embedded JSON), C (CSS / microdata)	✅ always
Pattern D (Scrapling hostile-site fetcher)	✅ pre-installed
Pattern E1 (`agent_extract`)	✅ pre-installed
Pattern E2 (`agent_browse`)	✅ pre-installed
Browser: Patchright (Pattern E "fast mode")	✅ pre-installed
Browser: Playwright Chromium (Pattern D Scrapling)	✅ pre-installed
Browser: Camoufox (Pattern E best-stealth)	optional via `--build-arg INSTALL_CAMOUFOX=1` (+300 MB)
Browser: Zendriver / Botasaurus	rebuild with the matching `--extra ...-backend`
LLM: external Ollama / LM Studio / llama.cpp / vLLM	✅ via `host.docker.internal` (see below). The image does NOT bundle an LLM.
Captcha Tier 0 (Camoufox auto-pass)	✅ when `INSTALL_CAMOUFOX=1`
Captcha Tier 1 (Theyka)	✅ pre-installed
Captcha Tier 2 (CapSolver / NopeCHA / 2Captcha)	✅ via env key
MCP server (stdio JSON-RPC)	✅ default entrypoint
Canary CLI (`scrapper-tool`)	✅

Why this works — the `[full]` extra and the lxml override

Scrapling pins lxml>=6.0.3 and Crawl4AI pins lxml~=5.3. These are conservative pins, not real API breakage — both libraries use the stable lxml.html / XPath surface that's compatible across lxml 5/6. pyproject.toml declares [tool.uv] override-dependencies = ["lxml>=6.0.3"], which forces a single resolved lxml across both packages. Verified in CI: 238 tests pass with both extras installed simultaneously.

Plain pip doesn't honor [tool.uv] overrides. Either switch to uv, or run pip install --constraint constraints.txt 'scrapper-tool[full]' with lxml>=6.0.3 in constraints.txt.

Pull the published image

Tagged releases are published to GitHub Container Registry. Pull the latest:

docker pull ghcr.io/valerok/scrapper-tool:latest
# or pin to a specific version
docker pull ghcr.io/valerok/scrapper-tool:1.0.0

Tags published per release: <major>.<minor>.<patch>, <major>.<minor>, and latest (only on non-prerelease tags).

Build options (local / fork)

# All five patterns in one image (~1.6 GB).
docker build -t scrapper-tool .
# Or via compose: docker compose build scrapper-tool

# Plus Camoufox baked in (~+300 MB; highest-stealth backend).
docker build --build-arg INSTALL_CAMOUFOX=1 -t scrapper-tool:camoufox .

External LLMs (LM Studio, llama.cpp, vLLM, remote Ollama)

The image talks to whichever LLM server you run, on the host or remotely. Set the right SCRAPPER_TOOL_AGENT_* env vars in your .env next to docker-compose.yml:

Server	`SCRAPPER_TOOL_AGENT_LLM`	`SCRAPPER_TOOL_AGENT_OLLAMA_URL`
Ollama on host (default)	`ollama`	`http://host.docker.internal:11434`
LM Studio on host	`openai_compat`	`http://host.docker.internal:1234`
llama.cpp `server` on host	`llama_cpp`	`http://host.docker.internal:8080`
vLLM on host	`vllm`	`http://host.docker.internal:8000`
Remote Ollama / OpenAI-compat	`ollama` / `openai_compat`	`https://my-llm.example/v1` etc.
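
If you generate `.env` files from a deployment script, the table above reduces to a small lookup. In this sketch the dictionary keys (`"lm_studio"` and so on) and the helper names are hypothetical. The variable names and values are the ones from the table.

```python
# (SCRAPPER_TOOL_AGENT_LLM, SCRAPPER_TOOL_AGENT_OLLAMA_URL) per host-side server.
LLM_BACKENDS = {
    "ollama": ("ollama", "http://host.docker.internal:11434"),
    "lm_studio": ("openai_compat", "http://host.docker.internal:1234"),
    "llama_cpp": ("llama_cpp", "http://host.docker.internal:8080"),
    "vllm": ("vllm", "http://host.docker.internal:8000"),
}


def agent_env(server: str, model: str) -> dict[str, str]:
    """Return the SCRAPPER_TOOL_AGENT_* variables for a host-side LLM server."""
    llm, url = LLM_BACKENDS[server]
    return {
        "SCRAPPER_TOOL_AGENT_LLM": llm,
        "SCRAPPER_TOOL_AGENT_OLLAMA_URL": url,
        "SCRAPPER_TOOL_AGENT_MODEL": model,
    }


def to_dotenv(env: dict[str, str]) -> str:
    """Render the variables as KEY=value lines for a .env file."""
    return "".join(f"{key}={value}\n" for key, value in env.items())


dotenv_text = to_dotenv(agent_env("lm_studio", "qwen3-vl-8b-instruct"))
```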

LM Studio example:

LM Studio → Developer / Local Server tab → Start Server (port 1234 by default).
Note the model name shown there (e.g. qwen3-vl-8b-instruct).

.env:

SCRAPPER_TOOL_AGENT_LLM=openai_compat
SCRAPPER_TOOL_AGENT_OLLAMA_URL=http://host.docker.internal:1234
SCRAPPER_TOOL_AGENT_MODEL=qwen3-vl-8b-instruct

Then run docker compose run --rm -T scrapper-tool.

The compose file already declares extra_hosts: ["host.docker.internal:host-gateway"] so host.docker.internal resolves on Linux too (Mac/Windows Docker Desktop maps it natively).

Run as MCP server in Docker

The image's default entrypoint is scrapper-tool-mcp (stdio MCP server). Wire your MCP client to invoke docker compose run --rm -T scrapper-tool, as in the JSON example above. The -T flag disables pseudo-TTY allocation, so the JSON-RPC stream on stdin/stdout reaches the client unmodified.

Live integration tests inside Docker

docker compose --profile live up canary    # runs tests/integration/test_agent_live.py

Settings

scrapper-tool is configured via SCRAPPER_TOOL_* environment variables, an AgentConfig Python object, or per-call kwargs.

Resolution order (highest first): explicit kwargs → config=AgentConfig(...) → env vars → built-in defaults.

Where do settings go when used as a library?

You have three valid places to put them. Pick whichever fits your deployment.

Option A — env vars in your shell or process manager (simplest, deployment-friendly):

export SCRAPPER_TOOL_AGENT_BROWSER=patchright
export SCRAPPER_TOOL_AGENT_MODEL=qwen3-vl:8b
export SCRAPPER_TOOL_CAPTCHA_KEY=sk_capsolver_xxx
python my_scraper.py

In Python, just call AgentConfig.from_env() (or use the bare functions — they do this automatically when you don't pass config=):

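import asyncio
from scrapper_tool.agent import agent_extract

# Reads SCRAPPER_TOOL_* env at call time. No setup needed.
# asyncio.run() drives the coroutine; inside async code, use `await` instead.
result = asyncio.run(agent_extract("https://example.com", schema={"type": "object"}))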

Option B — a .env file loaded by your app (great for local dev):

scrapper-tool itself does not auto-load .env. Either let your runner do it (uv run --env-file .env python my_scraper.py, docker compose, or your process manager), or load it explicitly in your entry point with python-dotenv:

# my_scraper.py
from dotenv import load_dotenv
load_dotenv()                      # MUST be called BEFORE importing scrapper_tool

import asyncio
from scrapper_tool.agent import agent_extract

result = asyncio.run(
    agent_extract("https://example.com", schema={"type": "object"})
)

Copy .env.example → .env and edit. The example file documents every supported variable with safe defaults.

Option C — pass an AgentConfig in code (most explicit, ideal for tests):

from scrapper_tool.agent import AgentConfig, agent_extract, agent_session
from pydantic import SecretStr

cfg = AgentConfig(
    browser="patchright",
    model="qwen3-vl:8b",
    ollama_url="http://localhost:11434",
    behavior="humanlike",
    captcha_solver="auto",
    captcha_api_key=SecretStr("sk_capsolver_xxx"),
    timeout_s=180,
)

# Per-call:
result = await agent_extract(url, schema=..., config=cfg)

# Or hold a session for many calls (warm browser + LLM context):
async with agent_session(config=cfg) as s:
    a = await s.extract(url_a, schema=...)
    b = await s.browse(url_b, "log in and ...")

Per-call overrides layer on top of any of the above:

# cfg.model is "qwen3-vl:8b" but THIS call uses qwen3-coder:30b.
result = await agent_extract(url, schema=..., config=cfg, model="qwen3-coder:30b")

Reference

docs/SETTINGS.md — every variable, default, choice list, and recommendation.
.env.example — drop-in starter file with every documented variable annotated.

Documentation


Document	Contents
Quickstart	5-minute on-ramp.
Settings reference	Every env var, default, choice list. (v1.0.0+)
`.env.example`	Drop-in starter file with every variable annotated.
E2E test plan	Operator-runnable end-to-end suite — library / Docker / MCP modes against LM Studio. (v1.0.0+)
`scripts/e2e/`	Runnable test scripts referenced by the E2E plan.
Recon playbook	DevTools-driven reverse-engineering of a new vendor site.
Pattern A — JSON API	Vendor exposes an XHR / JSON endpoint.
Pattern B — Embedded JSON	`ld+json`, `__NEXT_DATA__`, `__NUXT__`, RSC payloads.
Pattern C — CSS / microdata	`itemprop="price"`, fallback selectors.
Pattern D — Hostile	Cloudflare Turnstile, Akamai EVA.
Pattern E — LLM agent	Local-LLM-driven scraping for any protected site. (v1.0.0+)
Anti-bot ladder reference	How the ladder walks, when to bump the primary profile.
Test helpers	`FakeCurlSession`, `replay_fixture`, golden-snapshot pattern.
Agent integration	MCP wiring for Claude, OpenClaw, Hermes Agent, AutoGen, LangChain. (v0.2.0+)
2026-04-30 landscape research	Why these tools, sourced.

Why these tools?

Short version: curl_cffi is the only actively-maintained TLS-impersonation lib with chrome131+/chrome133a/chrome142/chrome146 profiles; puppeteer-stealth and playwright-extra were deprecated in February 2025; Scrapling is the only OSS Playwright-based stack with a working Turnstile auto-solve as of 2026; managed SaaS (Firecrawl, ZenRows, Bright Data) is deliberately not bundled.

Full sourced rationale: docs/research/2026-04-30-landscape.md.

Roadmap

v0.1.0 — Core HTTP client, retry/backoff, anti-bot ladder, patterns A–D, fixture-replay test helpers.
v0.2.0 — MCP server for LLM agents; canary CLI for nightly fingerprint-health probes.
v1.0.0 — Pattern E: local-LLM-driven scraping (Camoufox + browser-use + Crawl4AI + Ollama), captcha cascade, humanlike-behavior layer, full Docker stack. Public API + MCP tool surface stable under SemVer.
v1.1.0 — Pluggable rate-limit / robots.txt policies; per-vendor profile presets; agent_session() warm-browser pooling; broader Pattern E backends.

See CHANGELOG.md for landed changes and open issues for what's in flight.

Contributing

PRs and issues are welcome. Every PR that meaningfully changes how we scrape lands a CHANGELOG.md row.

Read CONTRIBUTING.md for the maintenance contract.
Read CODE_OF_CONDUCT.md before opening a discussion.
Good first issues live under the good first issue label.

Contributors

Want to see your avatar here? Check CONTRIBUTING.md and open a PR.

Acknowledgements

scrapper-tool stands on the shoulders of these projects:

httpx — async HTTP client
curl_cffi — TLS / JA3 impersonation
selectolax — fast HTML parsing
extruct — ld+json, microdata, RDFa extraction
Scrapling — Playwright-based hostile-site backend

License

MIT © scrapper-tool contributors.

If scrapper-tool saves you time, consider starring the repo — it helps others find it.

Release history

Version	Released
1.4.2 (this version)	May 6, 2026
1.1.2	May 3, 2026
1.1.1	May 2, 2026
1.1.0	May 2, 2026
1.0.0	May 2, 2026
0.2.0	May 1, 2026
0.1.0	Apr 30, 2026
