scrapper-tool
A reusable Python web-scraping toolkit — production-grade primitives, the Pattern A/B/C/D ladder, a TLS-impersonation fallback chain, deterministic fixture-replay testing, and an optional MCP server for LLM agents.
Built from the scraping core behind PartsPilot, extracted as an open-source library so other projects (and LLM agents) can pick up the same patterns without redoing the reverse-engineering work.
Quickstart · Documentation · Recon playbook · Changelog · Contributing
Status (2026-04-30): alpha.
v0.1.0 covers the core pattern ladder, anti-bot helpers, and deterministic fixture-replay testing. v0.2.0 adds an MCP server for LLM agents (Claude, OpenClaw, Hermes Agent, AutoGen, LangChain).
Table of contents
- Why scrapper-tool
- The four scraping patterns
- Architecture
- Install
- Quickstart
- Documentation
- Why these tools?
- Roadmap
- Contributing
- Contributors
- Acknowledgements
- License
Why scrapper-tool
Most scrapers are written from scratch every time, even though 90% of the work is the same: pick the right extraction pattern, survive the TLS fingerprint, retry/backoff sanely, and write tests that don't drift the moment a site updates.
scrapper-tool packages the parts that don't change per vendor, so you only write the parts that do.
- Pattern-first design. Four named, documented extraction patterns (A–D) — pick the one DevTools points at, skip the rest.
- Anti-bot ladder built in. Auto-walks `chrome133a → chrome124 → safari18_0 → firefox135` when a profile gets fingerprinted.
- Deterministic tests. Fixture-replay (`FakeCurlSession`, `replay_fixture`, golden snapshots) — no live HTTP in CI.
- Optional hostile mode. Cloudflare Turnstile / Akamai EVA defeat path via Scrapling — opt-in extra, no Playwright bloat by default.
- LLM-agent ready. v0.2.0+ ships an MCP server so Claude, AutoGen, LangChain, etc. can drive the scraper directly.
- Boring stack. `httpx`, `curl_cffi`, `selectolax`, `extruct`. No managed SaaS bundled — your code, your egress.
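The fixture-replay idea in miniature: a fake session serves canned bodies instead of doing live HTTP, and records every call so tests can assert on it. This is a hypothetical `FakeSession` sketch to show the shape of the technique; the library's actual `FakeCurlSession` / `replay_fixture` API may differ.

```python
import json


class FakeSession:
    """Replays canned (method, url) -> body fixtures instead of doing live HTTP."""

    def __init__(self, fixtures: dict):
        self.fixtures = fixtures
        self.calls = []  # every request is recorded for assertions

    def request(self, method: str, url: str) -> str:
        self.calls.append((method, url))
        try:
            return self.fixtures[(method, url)]
        except KeyError:
            raise AssertionError(f"unexpected request: {method} {url}") from None


# Usage: a test that never touches the network.
session = FakeSession({("GET", "https://shop.test/p/1"): '{"price": "19.99"}'})
body = session.request("GET", "https://shop.test/p/1")
price = json.loads(body)["price"]
```

Because the fixture dict is the single source of truth, a site update only means re-recording one fixture file, not rewriting the test.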
The four scraping patterns
Web scraping in 2026 is dominated by four recurring patterns. This lib gives each pattern a documented helper plus the surrounding infrastructure (HTTP client with TLS-impersonation fallback, retry/backoff, fixture-replay testing) so you don't reinvent them per vendor.
| Pattern | When to use | Helper | Cost |
|---|---|---|---|
| A — JSON API | DevTools shows an XHR returning the price-bearing JSON. Anonymous or OAuth. | `vendor_client()` + your own response model | Lowest — parse, validate, done. |
| B — Embedded JSON | Document HTML carries `<script type="application/ld+json">`, `__NEXT_DATA__`, `__NUXT__`, or `self.__next_f.push(...)`. | `patterns.b.extract_product_offer()` (via extruct) | Low — one call, broad markup coverage. |
| C — CSS / microdata | Price visible in HTML, no embedded JSON. Prefer `itemprop="price"` schema.org microdata. | `patterns.c.extract_microdata_price()` (via selectolax) | Medium — selectors break on ancestor reshuffles. |
| D — Hostile | Cloudflare Turnstile, Akamai EVA, etc. defeat both default httpx and curl_cffi. | `patterns.d.hostile_client()` (via Scrapling) — `pip install scrapper-tool[hostile]` | Highest — Playwright runtime, ≈400 MB image bloat. |
Plus a four-profile anti-bot ladder (chrome133a → chrome124 → safari18_0 → firefox135) that auto-walks when a profile gets fingerprinted, and a scrapper-tool canary CLI for nightly fingerprint-health probes.
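The ladder walk is, at its core, a loop over impersonation profiles with an injected fetcher. Below is a minimal sketch of that idea, not the library's actual API: `walk_ladder`, `Fingerprinted`, and `fake_fetch` are hypothetical names, and in the real client the fetcher would be a curl_cffi request using each profile as its impersonation target.

```python
PROFILES = ["chrome133a", "chrome124", "safari18_0", "firefox135"]


class Fingerprinted(Exception):
    """Raised when a profile is blocked (e.g. a 403/429 challenge page)."""


def walk_ladder(fetch, profiles=PROFILES):
    """Try each impersonation profile in order; return the first successful body."""
    last_error = None
    for profile in profiles:
        try:
            return fetch(profile)
        except Fingerprinted as exc:
            last_error = exc  # this profile is burned; fall through to the next
    raise RuntimeError(f"all profiles fingerprinted: {last_error!r}")


# Usage with a stand-in fetcher: chrome133a is "blocked", chrome124 succeeds.
def fake_fetch(profile):
    if profile == "chrome133a":
        raise Fingerprinted(profile)
    return f"200 OK via {profile}"
```

Injecting the fetcher is also what makes the ladder trivially testable offline, which is how the fixture-replay helpers slot in.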
Architecture
```mermaid
flowchart TD
    A[Your scraper code] --> B[vendor_client / request_with_retry]
    B --> C{TLS-sensitive?}
    C -- no --> D[httpx]
    C -- yes --> E[curl_cffi ladder]
    E --> E1[chrome133a] --> E2[chrome124] --> E3[safari18_0] --> E4[firefox135]
    D --> F[Response]
    E4 --> F
    F --> G{Pattern}
    G -- A --> H[JSON API model]
    G -- B --> I[extruct: ld+json / next_data / nuxt]
    G -- C --> J[selectolax: microdata / CSS]
    G -- D --> K["Scrapling (Playwright + Turnstile)"]
    H --> L[Validated product data]
    I --> L
    J --> L
    K --> L
```
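The `request_with_retry` box in the diagram is, at heart, exponential backoff with full jitter. A minimal sketch of that pattern follows; the function names and signatures here are illustrative, not the library's actual interface.

```python
import random
import time


def backoff_delays(attempts, base=0.5, cap=30.0):
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)]."""
    return [random.uniform(0, min(cap, base * 2**n)) for n in range(attempts)]


def request_with_retry_sketch(do_request, attempts=4, base=0.5):
    """Call do_request(); on a retryable error, sleep a jittered delay and retry."""
    delays = backoff_delays(attempts, base=base)
    for i, delay in enumerate(delays):
        try:
            return do_request()
        except ConnectionError:
            if i == len(delays) - 1:
                raise  # out of attempts; surface the last error
            time.sleep(delay)
```

Full jitter (rather than a fixed `base * 2**n` delay) spreads retries out so a burst of failing scrapers doesn't hammer the vendor in lockstep.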
Install
```shell
pip install scrapper-tool            # core: httpx + curl_cffi + selectolax + extruct
pip install scrapper-tool[hostile]   # adds Scrapling for Cloudflare Turnstile
pip install scrapper-tool[agent]     # adds the MCP server (v0.2.0+) for LLM agents
```
Tip. The `[hostile]` extra pulls Playwright (~400 MB). Don't install it unless you actually need pattern D.
Quickstart
```python
import asyncio

from scrapper_tool import vendor_client, request_with_retry
from scrapper_tool.patterns.b import extract_product_offer


async def main() -> None:
    async with vendor_client() as client:
        resp = await request_with_retry(client, "GET", "https://example-shop.test/product/123")
        product = extract_product_offer(resp.text, base_url=str(resp.url))
        print(product)


asyncio.run(main())
```
For TLS-sensitive vendors, flip one switch:
```python
async with vendor_client(use_curl_cffi=True) as client:
    ...  # walks chrome133a → chrome124 → safari → firefox until one returns 200
```
See docs/quickstart.md for a 5-minute on-ramp covering all four patterns.
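For intuition, the most common Pattern B case (schema.org `ld+json`) can be reproduced with the standard library alone. This is a sketch of the underlying idea, not what `extract_product_offer` actually does — the library delegates to extruct, which also covers microdata, RDFa, and framework payloads.

```python
import json
from html.parser import HTMLParser


class LdJsonParser(HTMLParser):
    """Collects every <script type="application/ld+json"> block as parsed JSON."""

    def __init__(self):
        super().__init__()
        self._in_ldjson = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        self._in_ldjson = (
            tag == "script" and dict(attrs).get("type") == "application/ld+json"
        )

    def handle_data(self, data):
        if self._in_ldjson and data.strip():
            self.blocks.append(json.loads(data))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ldjson = False


def first_offer_price(html):
    """Return the first schema.org Product offer price found in embedded ld+json."""
    parser = LdJsonParser()
    parser.feed(html)
    for block in parser.blocks:
        if block.get("@type") == "Product":
            return block.get("offers", {}).get("price")
    return None


html = """<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"@type": "Offer", "price": "19.99"}}
</script>
</head></html>"""
```

Real-world embedded JSON is messier (`@graph` wrappers, lists of offers, `__NEXT_DATA__` trees), which is exactly the variance the pattern-B helper is there to absorb.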
Documentation
| Doc | Summary |
|---|---|
| Quickstart | 5-minute on-ramp. |
| Recon playbook | DevTools-driven reverse-engineering of a new vendor site. |
| Pattern A — JSON API | Vendor exposes an XHR / JSON endpoint. |
| Pattern B — Embedded JSON | `ld+json`, `__NEXT_DATA__`, `__NUXT__`, RSC payloads. |
| Pattern C — CSS / microdata | `itemprop="price"`, fallback selectors. |
| Pattern D — Hostile | Cloudflare Turnstile, Akamai EVA. |
| Anti-bot ladder reference | How the ladder walks, when to bump the primary profile. |
| Test helpers | `FakeCurlSession`, `replay_fixture`, golden-snapshot pattern. |
| Agent integration | MCP wiring for Claude, OpenClaw, Hermes Agent, AutoGen, LangChain. (v0.2.0+) |
| 2026-04-30 landscape research | Why these tools, sourced. |
Why these tools?
Short version: curl_cffi is the only actively-maintained TLS-impersonation lib with chrome131+/chrome133a/chrome142/chrome146 profiles; puppeteer-stealth and playwright-extra were deprecated in 2025-02; Scrapling is the only OSS Playwright-based stack with a working Turnstile auto-solve as of 2026; managed SaaS (Firecrawl, ZenRows, Bright Data) is deliberately not bundled.
Full sourced rationale: docs/research/2026-04-30-landscape.md.
Roadmap
- v0.1.0 — Core HTTP client, retry/backoff, anti-bot ladder, patterns A–D, fixture-replay test helpers.
- v0.2.0 — MCP server for LLM agents; canary CLI for nightly fingerprint-health probes.
- v0.3.0 — Pluggable rate-limit / robots.txt policies; per-vendor profile presets.
- v1.0.0 — API stability guarantee; broader pattern-D backends.
See CHANGELOG.md for landed changes and open issues for what's in flight.
Contributing
PRs and issues are welcome. Every PR that meaningfully changes how we scrape lands a CHANGELOG.md row.
- Read `CONTRIBUTING.md` for the maintenance contract.
- Read `CODE_OF_CONDUCT.md` before opening a discussion.
- Good first issues live under the `good first issue` label.
Contributors
Want to see your avatar here? Check CONTRIBUTING.md and open a PR.
Acknowledgements
scrapper-tool stands on the shoulders of these projects:
- `httpx` — async HTTP client
- `curl_cffi` — TLS / JA3 impersonation
- `selectolax` — fast HTML parsing
- `extruct` — `ld+json`, microdata, RDFa extraction
- `Scrapling` — Playwright-based hostile-site backend
License
MIT © scrapper-tool contributors.
If scrapper-tool saves you time, consider starring the repo — it helps others find it.