scrapper-tool
A reusable Python web-scraping toolkit — production-grade primitives, the Pattern A/B/C/D ladder, a TLS-impersonation fallback chain, deterministic fixture-replay testing, and an optional MCP server for LLM agents.
Built from the scraping core behind PartsPilot, extracted as an open-source library so other projects (and LLM agents) can pick up the same patterns without redoing the reverse-engineering work.
Quickstart · Documentation · Recon playbook · Changelog · Contributing
Status (2026-04-30): alpha.
v0.1.0 covers the core pattern ladder, anti-bot helpers, and deterministic fixture-replay testing. v0.2.0 adds an MCP server for LLM agents (Claude, OpenClaw, Hermes Agent, AutoGen, LangChain).
Table of contents
- Why scrapper-tool
- The four scraping patterns
- Architecture
- Install
- Quickstart
- Documentation
- Why these tools?
- Roadmap
- Contributing
- Contributors
- Acknowledgements
- License
Why scrapper-tool
Most scrapers are written from scratch every time, even though 90% of the work is the same: pick the right extraction pattern, survive the TLS fingerprint, retry/backoff sanely, and write tests that don't drift the moment a site updates.
scrapper-tool packages the parts that don't change per vendor, so you only write the parts that do.
- Pattern-first design. Four named, documented extraction patterns (A–D) — pick the one DevTools points at, skip the rest.
- Anti-bot ladder built in. Auto-walks `chrome133a → chrome124 → safari18_0 → firefox135` when a profile gets fingerprinted.
- Deterministic tests. Fixture-replay (`FakeCurlSession`, `replay_fixture`, golden snapshots) — no live HTTP in CI.
- Optional hostile mode. Cloudflare Turnstile / Akamai EVA defeat path via Scrapling — opt-in extra, no Playwright bloat by default.
- LLM-agent ready. v0.2.0+ ships an MCP server so Claude, AutoGen, LangChain, etc. can drive the scraper directly.
- Boring stack. `httpx`, `curl_cffi`, `selectolax`, `extruct`. No managed SaaS bundled — your code, your egress.
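The fixture-replay idea in miniature: a fake session serves canned bodies instead of doing live HTTP, and records every call so tests can assert on it. This is a hypothetical `FakeSession` sketch to show the shape of the technique; the library's actual `FakeCurlSession` / `replay_fixture` API may differ.

```python
import json


class FakeSession:
    """Replays canned (method, url) -> body fixtures instead of doing live HTTP."""

    def __init__(self, fixtures: dict):
        self.fixtures = fixtures
        self.calls = []  # every request is recorded for assertions

    def request(self, method: str, url: str) -> str:
        self.calls.append((method, url))
        try:
            return self.fixtures[(method, url)]
        except KeyError:
            raise AssertionError(f"unexpected request: {method} {url}") from None


# Usage: a test that never touches the network.
session = FakeSession({("GET", "https://shop.test/p/1"): '{"price": "19.99"}'})
body = session.request("GET", "https://shop.test/p/1")
price = json.loads(body)["price"]
```

Because the fixture dict is the single source of truth, a site update only means re-recording one fixture file, not rewriting the test.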
The four scraping patterns
Web scraping in 2026 is dominated by four recurring patterns. This lib gives each pattern a documented helper plus the surrounding infrastructure (HTTP client with TLS-impersonation fallback, retry/backoff, fixture-replay testing) so you don't reinvent them per vendor.
| Pattern | When to use | Helper | Cost |
|---|---|---|---|
| A — JSON API | DevTools shows an XHR returning the price-bearing JSON. Anonymous or OAuth. | `vendor_client()` + your own response model | Lowest — parse, validate, done. |
| B — Embedded JSON | Document HTML carries `<script type="application/ld+json">`, `__NEXT_DATA__`, `__NUXT__`, or `self.__next_f.push(...)`. | `patterns.b.extract_product_offer()` (via extruct) | Low — one call, broad markup coverage. |
| C — CSS / microdata | Price visible in HTML, no embedded JSON. Prefer `itemprop="price"` schema.org microdata. | `patterns.c.extract_microdata_price()` (via selectolax) | Medium — selectors break on ancestor reshuffles. |
| D — Hostile | Cloudflare Turnstile, Akamai EVA, etc. defeat both default httpx and curl_cffi. | `patterns.d.hostile_client()` (via Scrapling) — `pip install scrapper-tool[hostile]` | Highest — Playwright runtime, ≈400 MB image bloat. |
Plus a four-profile anti-bot ladder (chrome133a → chrome124 → safari18_0 → firefox135) that auto-walks when a profile gets fingerprinted, and a scrapper-tool canary CLI for nightly fingerprint-health probes.
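The ladder walk is, at its core, a loop over impersonation profiles with an injected fetcher. Below is a minimal sketch of that idea, not the library's actual API: `walk_ladder`, `Fingerprinted`, and `fake_fetch` are hypothetical names, and in the real client the fetcher would be a curl_cffi request using each profile as its impersonation target.

```python
PROFILES = ["chrome133a", "chrome124", "safari18_0", "firefox135"]


class Fingerprinted(Exception):
    """Raised when a profile is blocked (e.g. a 403/429 challenge page)."""


def walk_ladder(fetch, profiles=PROFILES):
    """Try each impersonation profile in order; return the first successful body."""
    last_error = None
    for profile in profiles:
        try:
            return fetch(profile)
        except Fingerprinted as exc:
            last_error = exc  # this profile is burned; fall through to the next
    raise RuntimeError(f"all profiles fingerprinted: {last_error!r}")


# Usage with a stand-in fetcher: chrome133a is "blocked", chrome124 succeeds.
def fake_fetch(profile):
    if profile == "chrome133a":
        raise Fingerprinted(profile)
    return f"200 OK via {profile}"
```

Injecting the fetcher is also what makes the ladder trivially testable offline, which is how the fixture-replay helpers slot in.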
Architecture
```mermaid
flowchart TD
    A[Your scraper code] --> B[vendor_client / request_with_retry]
    B --> C{TLS-sensitive?}
    C -- no --> D[httpx]
    C -- yes --> E[curl_cffi ladder]
    E --> E1[chrome133a] --> E2[chrome124] --> E3[safari18_0] --> E4[firefox135]
    D --> F[Response]
    E4 --> F
    F --> G{Pattern}
    G -- A --> H[JSON API model]
    G -- B --> I[extruct: ld+json / next_data / nuxt]
    G -- C --> J[selectolax: microdata / CSS]
    G -- D --> K["Scrapling (Playwright + Turnstile)"]
    H --> L[Validated product data]
    I --> L
    J --> L
    K --> L
```
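The `request_with_retry` box in the diagram is, at heart, exponential backoff with full jitter. A minimal sketch of that pattern follows; the function names and signatures here are illustrative, not the library's actual interface.

```python
import random
import time


def backoff_delays(attempts, base=0.5, cap=30.0):
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)]."""
    return [random.uniform(0, min(cap, base * 2**n)) for n in range(attempts)]


def request_with_retry_sketch(do_request, attempts=4, base=0.5):
    """Call do_request(); on a retryable error, sleep a jittered delay and retry."""
    delays = backoff_delays(attempts, base=base)
    for i, delay in enumerate(delays):
        try:
            return do_request()
        except ConnectionError:
            if i == len(delays) - 1:
                raise  # out of attempts; surface the last error
            time.sleep(delay)
```

Full jitter (rather than a fixed `base * 2**n` delay) spreads retries out so a burst of failing scrapers doesn't hammer the vendor in lockstep.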
Install
```shell
pip install scrapper-tool            # core: httpx + curl_cffi + selectolax + extruct
pip install scrapper-tool[hostile]   # adds Scrapling for Cloudflare Turnstile
pip install scrapper-tool[agent]     # adds the MCP server (v0.2.0+) for LLM agents
```
Tip. The `[hostile]` extra pulls Playwright (~400 MB). Don't install it unless you actually need pattern D.
Quickstart
```python
import asyncio

from scrapper_tool import vendor_client, request_with_retry
from scrapper_tool.patterns.b import extract_product_offer


async def main() -> None:
    async with vendor_client() as client:
        resp = await request_with_retry(client, "GET", "https://example-shop.test/product/123")
        product = extract_product_offer(resp.text, base_url=str(resp.url))
        print(product)


asyncio.run(main())
```
For TLS-sensitive vendors, flip one switch:
```python
async with vendor_client(use_curl_cffi=True) as client:
    ...  # walks chrome133a → chrome124 → safari → firefox until one returns 200
```
See docs/quickstart.md for a 5-minute on-ramp covering all four patterns.
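For intuition, the most common Pattern B case (schema.org `ld+json`) can be reproduced with the standard library alone. This is a sketch of the underlying idea, not what `extract_product_offer` actually does — the library delegates to extruct, which also covers microdata, RDFa, and framework payloads.

```python
import json
from html.parser import HTMLParser


class LdJsonParser(HTMLParser):
    """Collects every <script type="application/ld+json"> block as parsed JSON."""

    def __init__(self):
        super().__init__()
        self._in_ldjson = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        self._in_ldjson = (
            tag == "script" and dict(attrs).get("type") == "application/ld+json"
        )

    def handle_data(self, data):
        if self._in_ldjson and data.strip():
            self.blocks.append(json.loads(data))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ldjson = False


def first_offer_price(html):
    """Return the first schema.org Product offer price found in embedded ld+json."""
    parser = LdJsonParser()
    parser.feed(html)
    for block in parser.blocks:
        if block.get("@type") == "Product":
            return block.get("offers", {}).get("price")
    return None


html = """<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"@type": "Offer", "price": "19.99"}}
</script>
</head></html>"""
```

Real-world embedded JSON is messier (`@graph` wrappers, lists of offers, `__NEXT_DATA__` trees), which is exactly the variance the pattern-B helper is there to absorb.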
Documentation
| Doc | Summary |
|---|---|
| Quickstart | 5-minute on-ramp. |
| Recon playbook | DevTools-driven reverse-engineering of a new vendor site. |
| Pattern A — JSON API | Vendor exposes an XHR / JSON endpoint. |
| Pattern B — Embedded JSON | `ld+json`, `__NEXT_DATA__`, `__NUXT__`, RSC payloads. |
| Pattern C — CSS / microdata | `itemprop="price"`, fallback selectors. |
| Pattern D — Hostile | Cloudflare Turnstile, Akamai EVA. |
| Anti-bot ladder reference | How the ladder walks, when to bump the primary profile. |
| Test helpers | `FakeCurlSession`, `replay_fixture`, golden-snapshot pattern. |
| Agent integration | MCP wiring for Claude, OpenClaw, Hermes Agent, AutoGen, LangChain. (v0.2.0+) |
| 2026-04-30 landscape research | Why these tools, sourced. |
Why these tools?
Short version: curl_cffi is the only actively-maintained TLS-impersonation lib with chrome131+/chrome133a/chrome142/chrome146 profiles; puppeteer-stealth and playwright-extra were deprecated in 2025-02; Scrapling is the only OSS Playwright-based stack with a working Turnstile auto-solve as of 2026; managed SaaS (Firecrawl, ZenRows, Bright Data) is deliberately not bundled.
Full sourced rationale: docs/research/2026-04-30-landscape.md.
Roadmap
- v0.1.0 — Core HTTP client, retry/backoff, anti-bot ladder, patterns A–D, fixture-replay test helpers.
- v0.2.0 — MCP server for LLM agents; canary CLI for nightly fingerprint-health probes.
- v0.3.0 — Pluggable rate-limit / robots.txt policies; per-vendor profile presets.
- v1.0.0 — API stability guarantee; broader pattern-D backends.
See CHANGELOG.md for landed changes and open issues for what's in flight.
Contributing
PRs and issues are welcome. Every PR that meaningfully changes how we scrape lands a CHANGELOG.md row.
- Read `CONTRIBUTING.md` for the maintenance contract.
- Read `CODE_OF_CONDUCT.md` before opening a discussion.
- Good first issues live under the `good first issue` label.
Contributors
Want to see your avatar here? Check CONTRIBUTING.md and open a PR.
Acknowledgements
scrapper-tool stands on the shoulders of these projects:
- `httpx` — async HTTP client
- `curl_cffi` — TLS / JA3 impersonation
- `selectolax` — fast HTML parsing
- `extruct` — `ld+json`, microdata, RDFa extraction
- `Scrapling` — Playwright-based hostile-site backend
License
MIT © scrapper-tool contributors.
If scrapper-tool saves you time, consider starring the repo — it helps others find it.