
Anti-blok web scraping stack: 4-tier decision tree for Cloudflare Turnstile, x.com login walls, ISP DNS poisoning


🌐 unblock-web

Anti-blok web scraping stack for AI agents

Cloudflare Turnstile · ISP DNS poison · X.com login walls: solved.

License: MIT · Docker · Python · Patchright · Cloudflare Bypass · CI

🌐 English · 🇮🇩 Bahasa Indonesia

🚀 Quick Start · 📖 Decision Tree · 🛡️ Tiers · 🧪 Verified Targets · 🤝 Contributing


🎯 What This Solves

You hit a URL. It returns junk:

❌ "Please enable JavaScript"   ← x.com tweets, SPAs
❌ "Checking your browser..."   ← Cloudflare Turnstile
❌ HTTP 403 / 503               ← bot detection
❌ "internet-positif.info"      ← ISP DNS poison (🇮🇩)
❌ "Sign in to view"            ← login walls

unblock-web is a decision tree + verified scripts that pick the right tool for each block class. Drop it into any AI agent (Claude, Hermes, Cursor, Aider, your own) and stop guessing with raw curl/wget/playwright.
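To make the block classes above concrete, here is a minimal sketch of how a response could be sorted into one of them before a tier is picked. It is not taken from the unblock-web source: the function name `classify_block` and the class labels are hypothetical, and the marker strings are simply the ones listed above.

```python
from typing import Optional

# Marker strings come from the failure list above. The labels on the right
# ("isp_dns_block", "cloudflare", ...) are hypothetical names for this sketch.
BODY_MARKERS = [
    ("internet-positif", "isp_dns_block"),
    ("checking your browser", "cloudflare"),
    ("sign in to view", "login_wall"),
    ("please enable javascript", "js_required"),
]


def classify_block(status: int, body: str) -> Optional[str]:
    """Return a block-class label, or None when the response looks usable."""
    text = body.lower()
    # Body markers are checked first: a challenge page often arrives with a
    # 403/503 status, and the body tells us more than the status code does.
    for marker, label in BODY_MARKERS:
        if marker in text:
            return label
    if status in (403, 503):
        return "bot_detection"
    return None


if __name__ == "__main__":
    print(classify_block(503, "<p>Checking your browser...</p>"))
    print(classify_block(403, "Forbidden"))
    print(classify_block(200, "<h1>Hello</h1>"))
```

Substring matching is deliberately crude; a real classifier would likely also look at headers and redirects.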

Status (May 2026): All 4 tiers verified working on Ubuntu 26.04 + WSL2.


✨ Features

| 🎨 What | Why it matters |
|---|---|
| 🛡️ 4-tier escalation | Right tool per block class, no shotgun retries |
| 🚫 Cloudflare Turnstile bypass | Patchright stealth, no paid SaaS |
| 🐦 X.com tweets without login | DOM captured before login modal mounts |
| 🌐 ISP DNS bypass | Geo-proxy via TinyFish (free unlimited) |
| 🔧 Self-healing | One script reinstalls Chromium when an update wipes it |
| 🩺 Built-in canary | 3-tier health probe, drops into your CI or session-start hook |
| 📦 Zero paid services | Local Chromium + free TinyFish API + free aggregator mirrors |
| 🐍 Python stdlib only | No requests, no httpx, no extras for the canary itself |

🚀 Quick Start

Pick your favorite install method. All four work right now.

⚡ One-liner (zero-config)

curl -fsSL https://raw.githubusercontent.com/kevinnft/unblock-web/main/scripts/install.sh | bash

Picks a working Python (3.11–3.13), creates an isolated venv at ~/.unblock-web, installs Chromium via heal, and symlinks unblock-web into ~/.local/bin. Reversible: rm -rf ~/.unblock-web ~/.local/bin/unblock-web.

🐍 pip

pip install 'unblock-web[stealth] @ git+https://github.com/kevinnft/unblock-web.git'
unblock-web heal              # one-time: auto-detects OS, installs Chromium
unblock-web verify            # 3-tier health check
unblock-web fetch https://x.com/elonmusk/status/123456789

We're on git-install while we wait for PyPI. The git URL works the same as PyPI would. See docs/publishing.md.

🐳 Docker (zero-install)

docker run --rm ghcr.io/kevinnft/unblock-web:latest fetch https://example.com

# With TinyFish (Tier 2 geo-proxy)
docker run --rm \
  -e TINYFISH_API_KEY=$TINYFISH_API_KEY \
  ghcr.io/kevinnft/unblock-web:latest fetch https://blocked.com --proxy US

📦 From source

git clone https://github.com/kevinnft/unblock-web.git
cd unblock-web
pip install -e '.[stealth]'
unblock-web heal
unblock-web verify --verbose

🛠️ Library

from unblock_web import fetch

# Auto-pilot: picks the right tier per URL
page = fetch("https://x.com/seelffff/status/2055155782367187375")
print(page.text)
print(f"Used tier: {page.tier}")

# Force ISP/geo bypass
page = fetch("https://web3.okx.com", proxy_country="US")

# Force a specific tier
page = fetch("https://target.com", tier="T1", wait=8000)

🔌 In an AI agent

Hermes Agent example โ€” drop the canary into session-start:

# ~/.hermes/config.yaml
hooks:
  on_session_start:
    - command: "unblock-web verify"
      timeout: 30

📖 Decision Tree

flowchart TD
    A[🌐 URL incoming] --> B{What kind of block?}

    B -->|Plain blog/docs/<br/>GitHub README| T0[⚡ Tier 0: scrapling.get<br/>fastest, no browser]
    B -->|JS-rendered SPA<br/>React/Next/Vue| T1[🛡️ Tier 1: stealthy_fetch<br/>+ network_idle + wait]
    B -->|Cloudflare Turnstile<br/>'Checking browser'| T1B[🛡️ Tier 1: stealthy_fetch<br/>+ solve_cloudflare=True]
    B -->|x.com tweet body| T1C[🛡️ Tier 1: stealthy_fetch<br/>captures DOM pre-modal]
    B -->|x.com replies/thread| T3[🪞 Tier 3: xcancel.com mirror<br/>via Tier 1 stealth]
    B -->|🇮🇩 ISP DNS block<br/>internet-positif| T2[🌐 Tier 2: TinyFish<br/>--proxy US]
    B -->|Geo-locked content| T2B[🌐 Tier 2: TinyFish<br/>--proxy XX]
    B -->|Login required<br/>DMs/private/paywall| T4[🔑 Tier 4: xurl + bearer<br/>or cookie injection]

    T0 --> R[✅ Markdown out]
    T1 --> R
    T1B --> R
    T1C --> R
    T3 --> R
    T2 --> R
    T2B --> R
    T4 --> R

    style T0 fill:#10b981,stroke:#059669,color:#fff
    style T1 fill:#f59e0b,stroke:#d97706,color:#fff
    style T1B fill:#f59e0b,stroke:#d97706,color:#fff
    style T1C fill:#f59e0b,stroke:#d97706,color:#fff
    style T2 fill:#06b6d4,stroke:#0891b2,color:#fff
    style T2B fill:#06b6d4,stroke:#0891b2,color:#fff
    style T3 fill:#a855f7,stroke:#9333ea,color:#fff
    style T4 fill:#ef4444,stroke:#dc2626,color:#fff
    style R fill:#22c55e,stroke:#16a34a,color:#fff
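The flowchart can also be read as a lookup table. The sketch below encodes its eight branches as data; it is an illustration of the diagram, not the package's internal router. The names `Plan` and `choose_plan` and the block-class keys are hypothetical, and the option values (`wait=5000`, `solve_cloudflare=True`, `proxy`) are the ones quoted elsewhere in this README.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Plan:
    """One leaf of the decision tree: which tier, which tool, which options."""
    tier: str
    tool: str
    options: Dict[str, object] = field(default_factory=dict)


def choose_plan(block_class: str, country: str = "US") -> Plan:
    """Map a block class (hypothetical labels) to the tier the flowchart names."""
    table = {
        "plain": Plan("T0", "scrapling.get"),
        "spa": Plan("T1", "stealthy_fetch", {"network_idle": True, "wait": 5000}),
        "cloudflare": Plan("T1", "stealthy_fetch", {"solve_cloudflare": True}),
        "x_tweet": Plan("T1", "stealthy_fetch", {"wait": 5000}),
        "x_replies": Plan("T3", "xcancel.com mirror via stealthy_fetch",
                          {"solve_cloudflare": True}),
        "isp_dns_block": Plan("T2", "TinyFish", {"proxy": "US"}),
        "geo_locked": Plan("T2", "TinyFish", {"proxy": country}),
        "login_required": Plan("T4", "xurl + bearer"),
    }
    try:
        return table[block_class]
    except KeyError:
        raise ValueError(f"unknown block class: {block_class!r}") from None


if __name__ == "__main__":
    for name in ("plain", "cloudflare", "geo_locked"):
        print(name, "->", choose_plan(name, country="JP"))
```

Keeping the tree as data rather than nested conditionals makes it easy to add a branch when a new block class shows up.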

🛡️ The 4-Tier Stack

⚡ Tier 0: Plain HTTP

Tool: scrapling.Fetcher().get(url)
Cost: Free, ~100ms
Use for: Static HTML, GitHub READMEs, JSON APIs, blogs without anti-bot
Fails on: Anything client-rendered

🛡️ Tier 1: Scrapling Stealth (PRIMARY)

Tool: mcp_scrapling_stealthy_fetch / StealthyFetcher.fetch()
Engine: Patchright (anti-fingerprint Playwright fork driving Chromium)
Cost: Free, local CPU, ~5-15s
Use for: x.com tweets · Cloudflare Turnstile · React/Next/Vue SPAs · 99% of "hard" pages
Killer flags: solve_cloudflare=True, network_idle=True, wait=5000

from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    url,                      # target URL (string)
    network_idle=True,        # wait for XHR settle
    solve_cloudflare=True,    # auto-handle Turnstile JS
    wait=5000,                # ms, let the SPA hydrate
)

📚 Full param reference: docs/tier-1-scrapling.md

🌐 Tier 2: TinyFish (geo-proxy)

Tool: scripts/tinyfish_fetch.py
Engine: Remote browser farm via REST API
Cost: Free unlimited (no credit card, no rate limit advertised)
Use for: ISP DNS blocks (🇮🇩 Internet Positif) · geo-locked content · second opinion · when local Chromium is busy
Fails on: x.com tweets (their SSR drops out before x.com's React boots), login walls

python3 scripts/tinyfish_fetch.py "https://blocked-site.com" --proxy US
python3 scripts/tinyfish_fetch.py --search "your query"  # bonus: free search API

📚 Setup + edge cases: docs/tier-2-tinyfish.md

🪞 Tier 3: Aggregator Mirrors

Tool: Tier 1 stealth → xcancel.com/<user>/status/<id>
Cost: Free
Use for: X/Twitter replies, threads, full conversation context that won't render unauthenticated
Bonus: Multilingual replies preserved (verified: EN/JP/CN/VI/IT in one fetch)

📚 Mirror rotation tips: docs/tier-3-mirrors.md
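The Tier 3 step starts with a URL rewrite from the original status link to the xcancel.com/<user>/status/<id> form shown above. A minimal sketch of that rewrite follows; the helper name `to_xcancel` is hypothetical, and accepting twitter.com hosts alongside x.com is an assumption of this example rather than documented behaviour.

```python
import re
from urllib.parse import urlsplit

# Hosts accepted by this sketch (assumption: legacy twitter.com links too).
X_HOSTS = {"x.com", "www.x.com", "twitter.com", "www.twitter.com", "mobile.twitter.com"}

# /<user>/status/<id>; anything after the numeric id is ignored.
STATUS_PATH = re.compile(r"^/([A-Za-z0-9_]{1,15})/status/(\d+)")


def to_xcancel(url: str) -> str:
    """Rewrite an X/Twitter status URL to its xcancel.com mirror URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in X_HOSTS:
        raise ValueError(f"not an x.com/twitter.com URL: {url}")
    match = STATUS_PATH.match(parts.path)
    if match is None:
        raise ValueError(f"not a status URL: {url}")
    user, status_id = match.groups()
    # Query strings and fragments (tracking parameters) are dropped on purpose.
    return f"https://xcancel.com/{user}/status/{status_id}"


if __name__ == "__main__":
    print(to_xcancel("https://x.com/seelffff/status/2055155782367187375?s=20"))
```

The rewritten URL would then go through the Tier 1 stealth fetch, since the mirror itself is listed as Cloudflare-protected in the verified targets.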

🔑 Tier 4: Authenticated APIs

Tool: xurl + bearer token
Cost: Free tier (1500 reads/mo on X)
Use for: DMs · private accounts · POST operations · paywalled content
Setup: One-time signup at developer.x.com

📚 Step-by-step bearer setup: docs/tier-4-authenticated.md


🧪 Verified Targets

The stack was tested against these targets (May 2026); every result is reproducible:

| 🎯 Target | 🛠️ Tier | 📦 Result |
|---|---|---|
| 🐦 x.com/<user>/status/<id> (no auth) | T1 + wait=5000 | ✅ Full tweet body + meta + view count + quoted tweet |
| 🛡️ nowsecure.nl (Cloudflare anti-bot test) | T1 + solve_cloudflare=True | ✅ Returns "NOWSECURE / by nodriver" (only served to humans) |
| 🪞 xcancel.com/<user>/status/<id> (CF-protected) | T1 + solve_cloudflare=True | ✅ Tweet + 11 replies (multilingual) |
| 🇮🇩 web3.okx.com (Indonesian ISP block) | T2 + --proxy US | ✅ Full JS render + prize pool data |
| 📚 GitHub README | T0 | ✅ Markdown extract |
| 📰 News-site SPA (React) | T1 + wait=8000 | ✅ Article body |

Reproduce these: see examples/ for runnable scripts.


🩺 Health Monitoring

Three layers, no cron required (built for laptops that sleep):

🚦 Session-start canary

Drop into any agent's session-start hook. Silent on healthy state, alert on regression:

# Hermes Agent example (~/.hermes/config.yaml)
hooks:
  on_session_start:
    - command: "/path/to/scripts/verify-stack.py"
      timeout: 30
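The "silent on healthy state, alert on regression" behaviour can be sketched as a small probe runner. This is not the contents of scripts/verify-stack.py, which are not shown here; `run_canary` and the probe names are hypothetical, and the probes are stubs standing in for real per-tier checks.

```python
import sys
from typing import Callable, Dict, List


def run_canary(probes: Dict[str, Callable[[], bool]], verbose: bool = False) -> List[str]:
    """Run each probe; print only failures unless verbose. Returns failed names."""
    failures: List[str] = []
    for name, probe in probes.items():
        detail = ""
        try:
            ok = bool(probe())
        except Exception as exc:  # a crashing probe counts as a regression
            ok = False
            detail = f" ({exc})"
        if not ok:
            failures.append(name)
            print(f"[canary] {name}: FAILED{detail}", file=sys.stderr)
        elif verbose:
            print(f"[canary] {name}: ok")
    return failures


if __name__ == "__main__":
    # Stub probes (assumptions); real ones would fetch a known page per tier.
    demo_probes = {
        "tier0_plain_http": lambda: True,
        "tier1_stealth": lambda: True,
        "tier2_tinyfish": lambda: True,
    }
    failed = run_canary(demo_probes)
    # A session-start hook would typically exit non-zero when `failed` is non-empty.
    print("failed probes:", failed)
```

Returning the list of failed probes, rather than exiting inside the function, keeps the runner usable both from a hook script and from a CI job.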

🔧 Self-heal on Chromium loss

When stealthy_fetch errors with "Executable doesn't exist" (for example after a venv recreate), auto-run:

bash scripts/heal-chromium.sh

Idempotent. Safe to run anytime.
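One way to wire the heal step into a fetch path is a retry-once wrapper, sketched below. The wrapper name `fetch_with_heal` is hypothetical and the way the error is detected (substring match on the exception message) is an assumption; only the script path and the error text come from this README.

```python
import subprocess
from typing import Callable, TypeVar

T = TypeVar("T")

# Error text quoted in the README; matched as a substring of the exception message.
MISSING_BROWSER = "Executable doesn't exist"


def heal_chromium() -> None:
    """Run the heal script from the repo root (path as given in the README)."""
    subprocess.run(["bash", "scripts/heal-chromium.sh"], check=True)


def fetch_with_heal(fetch: Callable[[], T], heal: Callable[[], None] = heal_chromium) -> T:
    """Call fetch(); if Chromium is missing, heal once and retry once."""
    try:
        return fetch()
    except Exception as exc:
        if MISSING_BROWSER not in str(exc):
            raise  # unrelated failure: do not mask it
        heal()
        return fetch()  # a second failure propagates to the caller
```

A caller would pass a zero-argument callable, for example `fetch_with_heal(lambda: StealthyFetcher.fetch(url))`. Retrying only once avoids a reinstall loop when the heal itself does not fix the problem.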

👀 On-demand audit

python3 scripts/verify-stack.py --verbose

🎨 Why "Anti-Blok"?

Because every "scraping tutorial" online stops at:

"Just install Playwright! Just use Selenium! Just pay for ScrapingBee!"

Then you hit the real world:

  • 🇮🇩 ISP poisoning your DNS
  • 🇨🇳 GFW dropping your packets
  • ☁️ Cloudflare upgrading Turnstile every quarter
  • 🐦 X.com adding login walls overnight
  • 🐧 Ubuntu 26.04 breaking Playwright install

unblock-web is the field-tested decision tree from those battles. Free tools only. No API keys hoarded. Reproducible against listed targets.


📁 Repository Structure

unblock-web/
├── 📖 README.md              ← you are here
├── 📜 LICENSE                 ← MIT
├── 📚 docs/                   ← per-tier deep dives
│   ├── tier-1-scrapling.md
│   ├── tier-2-tinyfish.md
│   ├── tier-3-mirrors.md
│   ├── tier-4-authenticated.md
│   └── ubuntu-26-04-fix.md
├── 🛠️ scripts/                ← drop-in tools
│   ├── verify-stack.py        ← 3-tier canary
│   ├── heal-chromium.sh       ← Ubuntu 26.04 fix
│   └── tinyfish_fetch.py      ← Tier 2 wrapper
├── 🧪 examples/               ← reproducible cases
│   ├── x_com_tweet.py
│   ├── cloudflare_bypass.py
│   ├── indonesian_isp_bypass.py
│   └── xcancel_replies.py
├── ⚙️  .github/workflows/      ← CI canary
│   └── canary.yml
└── 🎨 assets/                 ← logo + diagrams

🤝 Contributing

Found a target the stack can't crack? Open an issue with:

  1. ❓ The URL (or pattern)
  2. 📋 What each tier returned (paste the failure)
  3. 🤔 Hypothesis (login? CF v3? new anti-bot?)

Or send a PR to docs/known-targets.md when you find a workaround.


⚖️ Ethics & Legal

This stack is for reading publicly accessible content:

✅ Public tweets, blogs, docs, GitHub
✅ Content you're entitled to read in a browser
✅ APIs you have keys for

❌ Don't use it to:

  • Scrape behind authentication you don't own
  • Violate site Terms of Service
  • Mass-extract copyrighted content
  • Build credential-harvesting / phishing tools

Respect robots.txt. Respect rate limits. Be a good citizen of the open web.
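Both rules can be enforced in code before any tier runs. The sketch below uses only the standard library: `urllib.robotparser` for the robots.txt check and a small minimum-interval limiter. The helper names `allowed` and `MinInterval` and the user-agent string are hypothetical, and the robots.txt text is passed in directly so the example runs offline.

```python
import time
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "unblock-web-example") -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class MinInterval:
    """Block until at least `seconds` have passed since the previous call."""

    def __init__(self, seconds: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._seconds = seconds
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        if self._last is not None:
            remaining = self._seconds - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    limiter = MinInterval(1.0)
    for target in ("https://example.com/blog/post", "https://example.com/private/data"):
        if allowed(robots, target):
            limiter.wait()  # a real fetch would follow this line
            print("ok to fetch:", target)
        else:
            print("skipping (robots.txt):", target)
```

The clock and sleep functions are injectable so the limiter can be tested without real delays.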


🙏 Credits

Stack composed from:

  • 🛡️ Scrapling: the unified scraping library
  • 🥷 Patchright: anti-fingerprint Playwright fork
  • 🐟 TinyFish: free fetch + search API for AI agents
  • 🪞 xcancel.com: Twitter content mirror that survives
  • 🤝 xurl: official X CLI

Built with 🥷 by @kevinnft
Field-tested in Indonesian internet conditions.


