
fetch, but it tells you the truth: a verified-fetch primitive that returns a portable trust verdict, not just bytes.


veriscrape


fetch, but it tells you the truth. A verified-fetch primitive for web scraping: every fetch returns the bytes plus a portable trust verdict, so you know the moment your data is silently wrong, not three days later through a broken downstream report.

import veriscrape

r = veriscrape.get("https://example.com")
r.verdict      # OK | BLOCKED | CHALLENGE | HONEYPOT | SOFT_404 | LOGIN_WALL | EMPTY_SHELL | UNVERIFIED
r.cause        # "cloudflare_challenge" | "datadome" | "js_app_shell" | ...
r.confidence   # 0.0 to 1.0
r.ok           # True only when the content is positively real

The problem

Every scraping tool hands you bytes and a 200 and calls it success. In 2026 a 200 OK is no longer ground truth: it is often a challenge page, a login wall, a soft-404, or an empty JS shell. Status-code retry logic (the industry default) never notices, so the corruption is stored as data and surfaces days later. veriscrape classifies the response deterministically (no LLM) into a verdict, with the evidence and a confidence score.
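To make the gap concrete, here is a minimal sketch of why a status-code check cannot see a challenge page. It does not use veriscrape: `Response`, `status_only_ok`, `looks_suspicious`, and the marker strings are hypothetical names chosen for illustration, and a single substring match is far weaker than real detection.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical stand-in for an HTTP response (not a veriscrape type)."""
    status: int
    body: str


def status_only_ok(resp: Response) -> bool:
    # The common default: any 2xx status counts as success.
    return 200 <= resp.status < 300


# Illustrative markers only; these are assumptions, not veriscrape's rules.
SUSPICIOUS_MARKERS = ("just a moment", "verify you are human", "sign in to continue")


def looks_suspicious(resp: Response) -> bool:
    # Inspect the body instead of trusting the status line.
    text = resp.body.lower()
    return any(marker in text for marker in SUSPICIOUS_MARKERS)


challenge = Response(200, "<html><title>Just a moment...</title></html>")
article = Response(200, "<html><h1>Quarterly report</h1><p>Revenue rose.</p></html>")

for name, resp in (("challenge", challenge), ("article", article)):
    print(name, "status_ok =", status_only_ok(resp), "suspicious =", looks_suspicious(resp))
```

Both responses pass the status check; only the body check tells them apart, which is the distinction a verdict is meant to carry.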

Verdicts

verdict      meaning
OK           genuine origin content
BLOCKED      a hard anti-bot deny
CHALLENGE    a JS / CAPTCHA interstitial (solvable, not content)
HONEYPOT     a decoy / AI-Labyrinth trap
SOFT_404     a "not found" served as 200
LOGIN_WALL   a sign-in / paywall gate instead of the data
EMPTY_SHELL  a JS app skeleton with no server-rendered content
UNVERIFIED   couldn't tell, abstains rather than guess

Detection is two-key and conservative: it would rather abstain (UNVERIFIED) than emit a confident wrong OK, because a silent false OK is the exact failure the tool exists to prevent. Today it detects BLOCKED, CHALLENGE, SOFT_404, LOGIN_WALL, and EMPTY_SHELL across Cloudflare, DataDome, Akamai, and vendor-agnostic signals. (HONEYPOT and a positive OK confirmation are on the roadmap.)
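One way to read "two-key" is that two independent signals must agree before a problem verdict is emitted. The sketch below illustrates that reading only; it is not veriscrape's classifier. `Verdict`, `classify`, the choice of header and body signals, and the confidence numbers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Verdict:
    """Hypothetical result type mirroring the verdict / cause / confidence fields."""
    verdict: str
    cause: Optional[str]
    confidence: float


def classify(status: int, headers: Mapping[str, str], body: str) -> Verdict:
    """Toy two-key check: a header signal and a body signal must both fire."""
    # Key 1 (assumed signal): a header marking the response as a challenge.
    header_key = headers.get("cf-mitigated", "").lower() == "challenge"
    # Key 2 (assumed signal): interstitial wording in the body.
    body_key = "just a moment" in body.lower()

    if header_key and body_key:
        # The confidence value is illustrative, not calibrated.
        return Verdict("CHALLENGE", "cloudflare_challenge", 0.95)

    # One key alone is weak evidence, and this sketch has no way to positively
    # confirm real content, so it abstains instead of claiming OK.
    return Verdict("UNVERIFIED", None, 0.0)


print(classify(403, {"cf-mitigated": "challenge"}, "<title>Just a moment...</title>"))
print(classify(200, {}, "<title>Just a moment...</title>"))
```

The second call abstains even though the body looks like an interstitial: with only one key present, UNVERIFIED is the conservative answer.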

CLI

$ veriscrape check https://discord.com/app
https://discord.com/app
  !! EMPTY_SHELL (js_app_shell)  confidence=0.97
  HTTP 200

The exit code is pipeline-friendly: 0 when no problem is detected (OK or UNVERIFIED), 1 when a problem is detected. Drop it into CI to fail a job that silently scraped a wall. Two options help there: veriscrape check --file response.html classifies a saved response without touching the network, and --json emits the record.
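The exit-code rule can be sketched as a small mapping, which is handy if you gate a batch job on verdicts you have already collected. The helper names `exit_code_for` and `gate` are hypothetical, and the sketch assumes verdicts arrive as plain strings.

```python
# Verdicts that exit 0 according to the rule above: content looks fine,
# or the classifier abstained.
PASSING = {"OK", "UNVERIFIED"}


def exit_code_for(verdict: str) -> int:
    """Map a single verdict name to a process exit code."""
    return 0 if verdict in PASSING else 1


def gate(verdicts) -> int:
    """Return 1 if any verdict in the batch signals a problem, else 0."""
    return max((exit_code_for(v) for v in verdicts), default=0)


print(gate(["OK", "UNVERIFIED"]))
print(gate(["OK", "EMPTY_SHELL"]))
```

Note that UNVERIFIED passes the gate. A stricter pipeline could treat an abstention as a failure, at the cost of more false alarms.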

The finding

We ran popular fetchers against protected sites and used veriscrape to classify what they actually got back (benchmark/, dated 2026-06-07, 9 targets × 3 requests, every result stable):

requests / curl_cffi / scrapling returned HTTP 200 "success" where the content was actually junk (a JS app-shell, a login wall). Scrapling, which markets "blocked request detection," was the worst (33%): its browser-impersonating fetch got a 200 past a DataDome block, but that 200 was a login gate it reported as success. Status-code-only detection cannot see it. veriscrape flagged every one.

Reproduce: uv run --extra benchmark python -m benchmark.run.

Use it with your existing stack

veriscrape.get() is a drop-in replacement for requests.get, but you don't have to switch fetchers. Add the verdict to what you already have:

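import requests
from veriscrape.adapters import from_requests, from_response

url = "https://example.com"
record = from_requests(requests.get(url))          # wrap a requests.Response
# Any other stack (httpx, Playwright, ...): pass the status, headers, and body you already hold.
record = from_response(status, headers, body, url=url)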

Scrapy: add veriscrape.adapters.VeriscrapeMiddleware to DOWNLOADER_MIDDLEWARES, then read response.meta["veriscrape"] in your spider. Same verdict object everywhere.
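A sketch of the Scrapy wiring described above. The middleware path and the response.meta["veriscrape"] key come from this README; the priority value 543, the `PROBLEM_FREE` set, the assumption that `verdict` compares equal to a plain string, and the standalone `parse` function (a method on your spider in a real project) are illustrative choices.

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    # 543 is an assumed priority; pick one that fits your middleware order.
    "veriscrape.adapters.VeriscrapeMiddleware": 543,
}

# Verdicts this sketch lets through, mirroring the CLI exit-code rule.
PROBLEM_FREE = {"OK", "UNVERIFIED"}


def parse(response):
    """Spider callback sketch: keep a page only if no problem was detected."""
    record = response.meta["veriscrape"]
    if record.verdict not in PROBLEM_FREE:
        # A wall, challenge, or shell was detected: do not store it as data.
        return
    yield {"url": response.url, "verdict": record.verdict}
```

Storing the verdict next to each item keeps the evidence with the data, so a later audit can tell abstentions apart from confirmed content.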

Why a verdict, not just bytes

The FetchRecord verdict is portable JSON you own: the same shape travels across stacks (requests / Scrapy / Playwright) and trends per-domain over time. Every fetch emits one; that shared object is the spine. Deterministic-first by design: verdicts are computed from status / headers / cookies / body, dated and reproducible, never a black box.
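As a rough illustration of "portable JSON that trends per-domain", the sketch below serializes a stand-in record to one JSON line and counts verdicts per hostname. `Record` is hypothetical: the real FetchRecord's field names and JSON shape may differ, and the sample URLs and values are made up.

```python
import json
from collections import Counter, defaultdict
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class Record:
    """Hypothetical stand-in; the real FetchRecord fields may differ."""
    url: str
    verdict: str
    cause: Optional[str]
    confidence: float


def to_json(record: Record) -> str:
    # One JSON object per line is easy to append to a log and replay later.
    return json.dumps(asdict(record), sort_keys=True)


def verdicts_by_domain(lines):
    """Count verdicts per hostname from an iterable of JSON lines."""
    trend = defaultdict(Counter)
    for line in lines:
        data = json.loads(line)
        trend[urlsplit(data["url"]).hostname][data["verdict"]] += 1
    return trend


# Made-up sample records for illustration.
records = [
    Record("https://example.com/app", "EMPTY_SHELL", "js_app_shell", 0.97),
    Record("https://example.com/feed", "EMPTY_SHELL", "js_app_shell", 0.97),
    Record("https://example.org/", "UNVERIFIED", None, 0.0),
]

trend = verdicts_by_domain(to_json(r) for r in records)
for domain, counts in sorted(trend.items()):
    print(domain, dict(counts))
```

Because the log is plain JSON lines, the same counting works whether the records came from requests, Scrapy, or a browser-based fetcher.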

Status

Pre-alpha · deterministic-first · Apache-2.0 · drop-in for requests.get.

$ uv sync          # from a clone (not yet on PyPI)
$ uv run pytest    # 70 tests


