Reusable web-scraping toolkit — Pattern A/B/C/D ladder, TLS-impersonation fallback chain, deterministic fixture-replay testing, and an optional MCP server for LLM agents.

Project description

scrapper-tool

A reusable Python web-scraping toolkit — production-grade primitives, anti-bot ladder, fixture-replay testing.

Built from the scraping core behind PartsPilot, extracted as an open-source library so other projects (and LLM agents) can pick up the same patterns without redoing the reverse-engineering work.


CI PyPI version Python versions Downloads License: MIT Code style: ruff Type-checked: mypy PRs Welcome GitHub Stars GitHub Forks

Quickstart · Documentation · Recon playbook · Changelog · Contributing


Status (2026-04-30): alpha. v0.1.0 covers the core pattern ladder, anti-bot helpers, and deterministic fixture-replay testing. v0.2.0 adds an MCP server for LLM agents (Claude, OpenClaw, Hermes Agent, AutoGen, LangChain).

Why scrapper-tool

Most scrapers are written from scratch every time, even though 90% of the work is the same: pick the right extraction pattern, survive TLS fingerprinting, retry and back off sanely, and write tests that don't drift the moment a site updates.

scrapper-tool packages the parts that don't change per vendor, so you only write the parts that do.

  • Pattern-first design. Four named, documented extraction patterns (A–D) — pick the one DevTools points at, skip the rest.
  • Anti-bot ladder built in. Auto-walks chrome133a → chrome124 → safari18_0 → firefox135 when a profile gets fingerprinted.
  • Deterministic tests. Fixture-replay (FakeCurlSession, replay_fixture, golden snapshots) — no live HTTP in CI.
  • Optional hostile mode. Cloudflare Turnstile / Akamai EVA defeat path via Scrapling — opt-in extra, no Playwright bloat by default.
  • LLM-agent ready. v0.2.0+ ships an MCP server so Claude, AutoGen, LangChain, etc. can drive the scraper directly.
  • Boring stack. httpx, curl_cffi, selectolax, extruct. No managed SaaS bundled — your code, your egress.
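The fixture-replay idea behind the "Deterministic tests" bullet can be sketched in a few lines. FakeCurlSession and replay_fixture are the library's names; this minimal stand-in (assumed shape, not the library's actual API) only illustrates how a fake session serves saved fixtures instead of doing live HTTP, and refuses anything unrecorded so CI can never silently hit the network:

```python
# Minimal sketch of a fixture-replay session (hypothetical stand-in for
# FakeCurlSession): URLs map to on-disk fixture files, and any URL
# without a recorded fixture raises instead of making a live request.
from pathlib import Path


class FakeSession:
    """Serves recorded fixtures; raises KeyError for unrecorded URLs."""

    def __init__(self, fixtures: dict[str, Path]) -> None:
        self._fixtures = fixtures

    def get(self, url: str) -> str:
        if url not in self._fixtures:
            # Fail loudly rather than fall back to live HTTP in CI.
            raise KeyError(f"no fixture recorded for {url}")
        return self._fixtures[url].read_text()
```

A test then pins the parsed output of the fixture as a golden snapshot, so a parser regression shows up as a diff rather than a flaky network failure.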

The four scraping patterns

Web scraping in 2026 is dominated by four recurring patterns. This lib gives each pattern a documented helper plus the surrounding infrastructure (HTTP client with TLS-impersonation fallback, retry/backoff, fixture-replay testing) so you don't reinvent them per vendor.

| Pattern | When to use | Helper | Cost |
| --- | --- | --- | --- |
| A — JSON API | DevTools shows an XHR returning the price-bearing JSON. Anonymous or OAuth. | vendor_client() + your own response model | Lowest — parse, validate, done. |
| B — Embedded JSON | Document HTML carries <script type="application/ld+json">, __NEXT_DATA__, __NUXT__, or self.__next_f.push(...). | patterns.b.extract_product_offer() (via extruct) | Low — one call, broad markup coverage. |
| C — CSS / microdata | Price visible in HTML, no embedded JSON. Prefer itemprop="price" schema.org microdata. | patterns.c.extract_microdata_price() (via selectolax) | Medium — selectors break on ancestor reshuffles. |
| D — Hostile | Cloudflare Turnstile, Akamai EVA, etc. defeat both default httpx and curl_cffi. | patterns.d.hostile_client() (via Scrapling) — pip install scrapper-tool[hostile] | Highest — Playwright runtime, ≈400 MB image bloat. |
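To make pattern B concrete, here is a stdlib-only illustration of the core move: pulling `<script type="application/ld+json">` payloads out of a page. The library routes this through extruct (which also handles __NEXT_DATA__ and microdata); this sketch is not the library's implementation, just the shape of the extraction:

```python
# Stdlib-only sketch of pattern B: collect every JSON-LD payload
# embedded in the page's <script type="application/ld+json"> tags.
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self._in_ldjson = False
        self.payloads: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ldjson = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ldjson = False

    def handle_data(self, data):
        # html.parser delivers the whole script body as one data chunk.
        if self._in_ldjson and data.strip():
            self.payloads.append(json.loads(data))


def extract_ld_json(html: str) -> list[dict]:
    parser = LdJsonExtractor()
    parser.feed(html)
    return parser.payloads
```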

Plus a four-profile anti-bot ladder (chrome133a → chrome124 → safari18_0 → firefox135) that auto-walks when a profile gets fingerprinted, and a scrapper-tool canary CLI for nightly fingerprint-health probes.
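The ladder walk itself is simple to sketch. In the real client each rung goes through curl_cffi (roughly requests.get(url, impersonate=profile)); here the HTTP call is injected as a callable so the walk can be shown offline, and the function name is this sketch's own, not the library's:

```python
# Hedged sketch of the anti-bot ladder walk: try each TLS profile in
# order and stop at the first one that is not fingerprinted. 403/429
# are treated as "fingerprinted"; anything else wins.
from typing import Callable, Optional

LADDER = ["chrome133a", "chrome124", "safari18_0", "firefox135"]


def first_working_profile(
    url: str, status_for: Callable[[str, str], int]
) -> Optional[str]:
    """Return the first profile whose response is not 403/429, else None."""
    for profile in LADDER:
        if status_for(url, profile) not in (403, 429):
            return profile
    return None
```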

Architecture

flowchart TD
    A[Your scraper code] --> B[vendor_client / request_with_retry]
    B --> C{TLS-sensitive?}
    C -- no --> D[httpx]
    C -- yes --> E[curl_cffi ladder]
    E --> E1[chrome133a] --> E2[chrome124] --> E3[safari18_0] --> E4[firefox135]
    D --> F[Response]
    E4 --> F
    F --> G{Pattern}
    G -- A --> H[JSON API model]
    G -- B --> I[extruct: ld+json / next_data / nuxt]
    G -- C --> J[selectolax: microdata / CSS]
    G -- D --> K["Scrapling (Playwright + Turnstile)"]
    H --> L[Validated product data]
    I --> L
    J --> L
    K --> L

Install

pip install scrapper-tool                # core: httpx + curl_cffi + selectolax + extruct
pip install scrapper-tool[hostile]       # adds Scrapling for Cloudflare Turnstile
pip install scrapper-tool[agent]         # adds the MCP server (v0.2.0+) for LLM agents

Tip. The [hostile] extra pulls Playwright (~400 MB). Don't install it unless you actually need pattern D.

Quickstart

import asyncio
from scrapper_tool import vendor_client, request_with_retry
from scrapper_tool.patterns.b import extract_product_offer

async def main() -> None:
    async with vendor_client() as client:
        resp = await request_with_retry(client, "GET", "https://example-shop.test/product/123")
        product = extract_product_offer(resp.text, base_url=str(resp.url))
        print(product)

asyncio.run(main())

For TLS-sensitive vendors, flip one switch:

async with vendor_client(use_curl_cffi=True) as client:
    ...   # walks chrome133a → chrome124 → safari → firefox until one returns 200
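Under the hood, request_with_retry presumably applies some retry/backoff policy; the real helper is async and httpx-based, and its exact schedule isn't documented here, so the following is only a plausible sketch of one standard choice (full-jitter exponential backoff on retryable statuses):

```python
# Hypothetical sketch of a retry schedule: full-jitter exponential
# backoff, delay_i drawn uniformly from [0, min(cap, base * 2**i)].
import random

RETRYABLE = {429, 500, 502, 503, 504}  # statuses worth retrying


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Return one jittered delay (seconds) per retry attempt."""
    return [random.uniform(0, min(cap, base * 2**i)) for i in range(attempts)]
```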

See docs/quickstart.md for a 5-minute on-ramp covering all four patterns.
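Pattern C can be illustrated the same way. The library's patterns.c helper uses selectolax (roughly tree.css_first('[itemprop="price"]')); this stdlib-only stand-in shows the logic — prefer the machine-readable content attribute, fall back to visible text:

```python
# Stdlib-only sketch of pattern C: find itemprop="price" microdata.
from html.parser import HTMLParser


class PriceFinder(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.price = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("itemprop") == "price":
            if a.get("content"):
                self.price = a["content"]  # machine-readable value wins
            else:
                self._in_price = True  # fall back to the visible text

    def handle_data(self, data):
        if self._in_price and self.price is None and data.strip():
            self.price = data.strip()
            self._in_price = False


def extract_microdata_price(html: str):
    parser = PriceFinder()
    parser.feed(html)
    return parser.price
```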

Documentation

  • Quickstart: 5-minute on-ramp.
  • Recon playbook: DevTools-driven reverse-engineering of a new vendor site.
  • Pattern A — JSON API: vendor exposes an XHR / JSON endpoint.
  • Pattern B — Embedded JSON: ld+json, __NEXT_DATA__, __NUXT__, RSC payloads.
  • Pattern C — CSS / microdata: itemprop="price", fallback selectors.
  • Pattern D — Hostile: Cloudflare Turnstile, Akamai EVA.
  • Anti-bot ladder reference: how the ladder walks, when to bump the primary profile.
  • Test helpers: FakeCurlSession, replay_fixture, golden-snapshot pattern.
  • Agent integration: MCP wiring for Claude, OpenClaw, Hermes Agent, AutoGen, LangChain. (v0.2.0+)
  • 2026-04-30 landscape research: why these tools, sourced.

Why these tools?

Short version: curl_cffi is the only actively-maintained TLS-impersonation lib with chrome131+/chrome133a/chrome142/chrome146 profiles; puppeteer-stealth and playwright-extra were deprecated in 2025-02; Scrapling is the only OSS Playwright-based stack with a working Turnstile auto-solve as of 2026; managed SaaS (Firecrawl, ZenRows, Bright Data) is deliberately not bundled.

Full sourced rationale: docs/research/2026-04-30-landscape.md.

Roadmap

  • v0.1.0 — Core HTTP client, retry/backoff, anti-bot ladder, patterns A–D, fixture-replay test helpers.
  • v0.2.0 — MCP server for LLM agents; canary CLI for nightly fingerprint-health probes.
  • v0.3.0 — Pluggable rate-limit / robots.txt policies; per-vendor profile presets.
  • v1.0.0 — API stability guarantee; broader pattern-D backends.

See CHANGELOG.md for landed changes and open issues for what's in flight.

Contributing

PRs and issues are welcome. Every PR that meaningfully changes scraping behavior should add a row to CHANGELOG.md.

Contributors

Want to see your avatar here? Check CONTRIBUTING.md and open a PR.

Acknowledgements

scrapper-tool stands on the shoulders of these projects:

  • httpx — async HTTP client
  • curl_cffi — TLS / JA3 impersonation
  • selectolax — fast HTML parsing
  • extruct — ld+json, microdata, RDFa extraction
  • Scrapling — Playwright-based hostile-site backend

License

MIT © scrapper-tool contributors.

If scrapper-tool saves you time, consider starring the repo — it helps others find it.

Project details


Download files

Download the file for your platform.

Source Distribution

scrapper_tool-0.1.0.tar.gz (55.5 kB)

Uploaded Source

Built Distribution

scrapper_tool-0.1.0-py3-none-any.whl (35.4 kB)

Uploaded Python 3

File details

Details for the file scrapper_tool-0.1.0.tar.gz.

File metadata

  • Download URL: scrapper_tool-0.1.0.tar.gz
  • Upload date:
  • Size: 55.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scrapper_tool-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a99b92affe77ea622f86d8fac4f2e0f906b41c9b2287431e88fd8da13fe6d7c8
MD5 caef3ed95f215d7fcc7e8b1a470c212e
BLAKE2b-256 b6a5eaec6fed996ce5734aa50f3cf4b4f48437b6c3d68f453e7c85e3ef5c05e6

Provenance

The following attestation bundles were made for scrapper_tool-0.1.0.tar.gz:

Publisher: release.yml on ValeroK/scrapper-tool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scrapper_tool-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: scrapper_tool-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 35.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scrapper_tool-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0fd1ceaff27da38b79727351a32d31bb2b516af68c7e1240106303f235148dcd
MD5 5b0934486873b5cc585c3204e9fae64a
BLAKE2b-256 b431c11d030d2f1baa2b5a47b57a5f6410259b210318d0f7aa67295c63502888

Provenance

The following attestation bundles were made for scrapper_tool-0.1.0-py3-none-any.whl:

Publisher: release.yml on ValeroK/scrapper-tool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
