fetch, but it tells you the truth: a verified-fetch primitive that returns a portable trust verdict, not just bytes.
veriscrape
fetch, but it tells you the truth. A verified-fetch primitive for web scraping: every fetch returns the bytes plus a portable trust verdict, so you know the moment your data is silently wrong, not three days later through a broken downstream report.
import veriscrape
r = veriscrape.get("https://example.com")
r.verdict # OK | BLOCKED | CHALLENGE | HONEYPOT | SOFT_404 | LOGIN_WALL | EMPTY_SHELL | UNVERIFIED
r.cause # "cloudflare_challenge" | "datadome" | "js_app_shell" | ...
r.confidence # 0.0 to 1.0
r.ok # True only when the content is positively real
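One way to act on these fields is to route each result before it reaches storage. The following is a minimal sketch that assumes only the four attributes shown above. The `route` function, the bucket names, and the 0.8 threshold are hypothetical and not part of veriscrape, and the sample records are stand-ins built with `SimpleNamespace`, not real fetch results.

```python
from types import SimpleNamespace


def route(record, min_confidence=0.8):
    """Decide what to do with a fetch result (hypothetical policy, not veriscrape's)."""
    if record.ok:
        return "store"      # content was positively confirmed
    if record.verdict == "UNVERIFIED" or record.confidence < min_confidence:
        return "review"     # the classifier abstained or was unsure
    return "discard"        # a confident problem verdict


# Stand-in records that only mimic the attribute names shown above.
shell = SimpleNamespace(verdict="EMPTY_SHELL", cause="js_app_shell", confidence=0.97, ok=False)
unsure = SimpleNamespace(verdict="UNVERIFIED", cause=None, confidence=0.0, ok=False)

print(route(shell))   # discard
print(route(unsure))  # review
```

Keeping the abstention case separate from the problem case mirrors the distinction the verdict makes: UNVERIFIED is not a failure, only a lack of confirmation.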
The problem
Every scraping tool hands you bytes and a 200 and calls it success. In 2026 a 200 OK is no longer
ground truth: it is often a challenge page, a login wall, a soft-404, or an empty JS shell.
Status-code retry logic (the industry default) never notices, so the corruption is stored as data and
surfaces days later. veriscrape classifies the response deterministically (no LLM) into a
verdict, with the evidence and a confidence score.
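The blind spot can be shown without any network access. This is an illustrative sketch of typical status-code retry logic, not code from veriscrape or from any fetcher named in this README. `fetch_with_status_retry`, `fake_fetch`, and the retryable status set are hypothetical.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (status code, body text)

RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_status_retry(fetch: Callable[[str], Response], url: str, attempts: int = 3) -> Response:
    """Retry only on retryable status codes; any other response is returned as-is."""
    response = fetch(url)
    for _ in range(attempts - 1):
        if response[0] not in RETRYABLE:
            break
        response = fetch(url)
    return response


calls = []


def fake_fetch(url: str) -> Response:
    # Simulates a site that answers 200 with a login gate instead of the data.
    calls.append(url)
    return 200, "<html><title>Sign in</title><form action='/login'></form></html>"


status, body = fetch_with_status_retry(fake_fetch, "https://example.com/data")
print(status, len(calls))  # 200 1
```

The helper returns after a single call because nothing in the status code says anything is wrong. The login form is then stored as if it were the requested page.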
Verdicts
| verdict | meaning |
|---|---|
| OK | genuine origin content |
| BLOCKED | a hard anti-bot deny |
| CHALLENGE | a JS / CAPTCHA interstitial (solvable, not content) |
| HONEYPOT | a decoy / AI-Labyrinth trap |
| SOFT_404 | a "not found" served as 200 |
| LOGIN_WALL | a sign-in / paywall gate instead of the data |
| EMPTY_SHELL | a JS app skeleton with no server-rendered content |
| UNVERIFIED | couldn't tell, abstains rather than guess |
Detection is two-key and conservative: it would rather abstain (UNVERIFIED) than emit a
confident wrong OK, because a silent false OK is the exact failure the tool exists to prevent.
Today it detects BLOCKED, CHALLENGE, SOFT_404, LOGIN_WALL, and EMPTY_SHELL across Cloudflare,
DataDome, Akamai, and vendor-agnostic signals. (HONEYPOT and a positive OK confirmation are on the
roadmap.)
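The README does not spell out what "two-key" means internally. The sketch below assumes it means a verdict is emitted only when two independent kinds of evidence agree, and otherwise the classifier abstains. `Signal`, `decide`, and the source names are hypothetical illustrations of that rule, not veriscrape's actual detectors.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Signal:
    verdict: str  # e.g. "CHALLENGE"
    source: str   # where the evidence came from, e.g. "headers" or "body"


def decide(signals: Iterable[Signal]) -> str:
    """Emit a verdict only when two different sources agree on it; otherwise abstain."""
    sources_by_verdict = {}
    for signal in signals:
        sources_by_verdict.setdefault(signal.verdict, set()).add(signal.source)
    confirmed = [v for v, sources in sources_by_verdict.items() if len(sources) >= 2]
    # Exactly one confirmed verdict is required; none or a conflict means abstain.
    return confirmed[0] if len(confirmed) == 1 else "UNVERIFIED"


print(decide([Signal("CHALLENGE", "headers"), Signal("CHALLENGE", "body")]))  # CHALLENGE
print(decide([Signal("CHALLENGE", "body")]))                                  # UNVERIFIED
```

Under this assumption a single matching string in the body is never enough on its own, which trades some recall for fewer confident mistakes.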
CLI
$ veriscrape check https://discord.com/app
https://discord.com/app
!! EMPTY_SHELL (js_app_shell) confidence=0.97
HTTP 200
The exit code is pipeline-friendly: 0 when the content looks fine (OK / UNVERIFIED), 1 when a
problem is detected. Drop it into CI to fail a job that silently scraped a wall. veriscrape check --file response.html classifies a saved response without touching the network; --json emits the record.
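The exit-code rule can be restated in a few lines of Python, which is handy when a job classifies many pages and needs one overall status. This is a sketch of the rule as described above, not veriscrape's source. `exit_code` and `PASSING` are hypothetical names, and folding a batch into one code is an assumption about how a pipeline might use it.

```python
PASSING = frozenset({"OK", "UNVERIFIED"})


def exit_code(verdicts):
    """Return 0 when every verdict is OK or UNVERIFIED, 1 as soon as any problem appears."""
    return 0 if all(v in PASSING for v in verdicts) else 1


print(exit_code(["OK", "UNVERIFIED"]))   # 0
print(exit_code(["OK", "EMPTY_SHELL"]))  # 1
```

A script would typically pass the result to `sys.exit()` so the CI step fails when any page came back as a wall.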
The finding
We ran popular fetchers against protected sites and used veriscrape to classify what they actually
got back (benchmark/, dated 2026-06-07, 9 targets × 3 requests, every result stable):
requests / curl_cffi / scrapling returned HTTP 200 "success" where the content was actually junk (a JS app-shell, a login wall). Scrapling, which markets "blocked request detection," was the worst (33%): its browser-impersonating fetch got a 200 past a DataDome block, but that 200 was a login gate it reported as success. Status-code-only detection cannot see it. veriscrape flagged every one.
Reproduce: uv run --extra benchmark python -m benchmark.run.
Use it with your existing stack
veriscrape.get() is the drop-in for requests.get, but you don't have to switch fetchers. Add
the verdict to what you already have:
from veriscrape.adapters import from_requests, from_response
record = from_requests(requests.get(url)) # a requests.Response
record = from_response(status, headers, body, url=url) # any stack (httpx, Playwright, ...)
Scrapy: add veriscrape.adapters.VeriscrapeMiddleware to DOWNLOADER_MIDDLEWARES, then read
response.meta["veriscrape"] in your spider. Same verdict object everywhere.
Why a verdict, not just bytes
The FetchRecord verdict is portable JSON you own: the same shape travels across stacks
(requests / Scrapy / Playwright) and trends per-domain over time. Every fetch emits one; that shared
object is the spine. Deterministic-first by design: verdicts are computed from status / headers /
cookies / body, dated and reproducible, never a black box.
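Because each record is plain JSON, per-domain trending needs nothing beyond the standard library. The sketch below assumes one JSON object per line with `url` and `verdict` keys. That layout is a guess for illustration and not the documented FetchRecord schema, and `verdicts_by_domain` is a hypothetical helper.

```python
import json
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def verdicts_by_domain(json_lines):
    """Count verdicts per hostname from an iterable of JSON-encoded records."""
    trend = defaultdict(Counter)
    for line in json_lines:
        record = json.loads(line)
        host = urlsplit(record["url"]).hostname
        trend[host][record["verdict"]] += 1
    return {host: dict(counts) for host, counts in trend.items()}


# Hand-written sample records using the assumed key names.
sample = [
    '{"url": "https://a.example/p/1", "verdict": "OK"}',
    '{"url": "https://a.example/p/2", "verdict": "LOGIN_WALL"}',
    '{"url": "https://b.example/", "verdict": "EMPTY_SHELL"}',
]
print(verdicts_by_domain(sample))
```

A rising share of non-OK verdicts for one host is the kind of signal a status-code dashboard would never show.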
Status
Pre-alpha · deterministic-first · Apache-2.0 · drop-in for requests.get.
$ uv sync # from a clone (not yet on PyPI)
$ uv run pytest # 70 tests