Anti-blok web scraping stack: 4-tier decision tree for Cloudflare Turnstile, x.com login walls, ISP DNS poisoning

๐ŸŒ unblock-web

Anti-blok web scraping stack for AI agents

Cloudflare Turnstile · ISP DNS poisoning · X.com login walls: solved.

License: MIT

๐ŸŒ English ยท ๐Ÿ‡ฎ๐Ÿ‡ฉ Bahasa Indonesia

🚀 Quick Start · 📖 Decision Tree · 🛡️ Tiers · 🧪 Verified Targets · 🤝 Contributing


🎯 What This Solves

You hit a URL. It returns junk:

โŒ "Please enable JavaScript"   โ† x.com tweets, SPAs
โŒ "Checking your browser..."   โ† Cloudflare Turnstile
โŒ HTTP 403 / 503               โ† bot detection
โŒ "internet-positif.info"      โ† ISP DNS poison (๐Ÿ‡ฎ๐Ÿ‡ฉ)
โŒ "Sign in to view"            โ† login walls

unblock-web is a decision tree + verified scripts that pick the right tool for each block class. Drop it into any AI agent (Claude, Hermes, Cursor, Aider, your own) and stop guessing with raw curl/wget/playwright.
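
As a rough illustration of what "block class" means here, the failure strings listed above can be matched against a response before choosing a tool. This is a minimal sketch, not code from unblock-web: the function name `classify_block` and the class labels are hypothetical, and only the marker strings and status codes come from the list above.

```python
import re

# Marker strings come from the failure list above; the labels on the left
# are hypothetical names, not identifiers from unblock-web.
BLOCK_MARKERS = [
    ("isp_dns_block", re.compile(r"internet-positif", re.I)),
    ("cloudflare_challenge", re.compile(r"checking your browser", re.I)),
    ("login_wall", re.compile(r"sign in to view", re.I)),
    ("js_required", re.compile(r"please enable javascript", re.I)),
]


def classify_block(status: int, body: str) -> str:
    """Return a coarse block class for an HTTP response, or "ok"."""
    for name, pattern in BLOCK_MARKERS:
        if pattern.search(body):
            return name
    # No known marker in the body: fall back to the listed status codes.
    if status in (403, 503):
        return "bot_detection"
    return "ok"
```

Body markers are checked before the status code because a challenge page is often served with a 403 or 503, and the marker is the more specific signal.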

Status (May 2026): All 4 tiers verified working on Ubuntu 26.04 + WSL2.


✨ Features

| 🎨 What | Why it matters |
|---|---|
| 🛡️ 4-tier escalation | Right tool per block class, no shotgun retries |
| 🚫 Cloudflare Turnstile bypass | Patchright stealth, no paid SaaS |
| 🐦 X.com tweets without login | DOM captured before login modal mounts |
| 🌐 ISP DNS bypass | Geo-proxy via TinyFish (free unlimited) |
| 🔧 Self-healing | One script reinstalls Chromium when an update wipes it |
| 🩺 Built-in canary | 3-tier health probe, drops into your CI or session-start hook |
| 📦 Zero paid services | Local Chromium + free TinyFish API + free aggregator mirrors |
| 🐍 Python stdlib only | No requests, no httpx, no extras for the canary itself |

🚀 Quick Start

Pick your favorite install method. All four work right now.

⚡ One-liner (zero-config)

curl -fsSL https://raw.githubusercontent.com/kevinnft/unblock-web/main/scripts/install.sh | bash

Picks a working Python (3.11–3.13), creates an isolated venv at ~/.unblock-web, installs Chromium via heal, and symlinks unblock-web into ~/.local/bin. Reversible: rm -rf ~/.unblock-web ~/.local/bin/unblock-web.

๐Ÿ pip

pip install 'unblock-web[stealth]'
unblock-web heal              # one-time: auto-detects OS, installs Chromium
unblock-web verify            # 3-tier health check
unblock-web fetch https://x.com/elonmusk/status/123456789

๐Ÿณ Docker (zero-install)

docker run --rm ghcr.io/kevinnft/unblock-web:latest fetch https://example.com

# With TinyFish (Tier 2 geo-proxy)
docker run --rm \
  -e TINYFISH_API_KEY=$TINYFISH_API_KEY \
  ghcr.io/kevinnft/unblock-web:latest fetch https://blocked.com --proxy US

📦 From source

git clone https://github.com/kevinnft/unblock-web.git
cd unblock-web
pip install -e '.[stealth]'
unblock-web heal
unblock-web verify --verbose

๐Ÿ› ๏ธ Library

from unblock_web import fetch

# Auto-pilot: picks the right tier per URL
page = fetch("https://x.com/seelffff/status/2055155782367187375")
print(page.text)
print(f"Used tier: {page.tier}")

# Force ISP/geo bypass
page = fetch("https://web3.okx.com", proxy_country="US")

# Force a specific tier
page = fetch("https://target.com", tier="T1", wait=8000)
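
One way to picture what an auto-pilot call might do internally is an ordered walk over the tiers that stops at the first usable result. The sketch below is an assumption about the general idea, not the library's implementation: `fetch_with_escalation`, `TierFetch`, and the two stand-in fetchers are hypothetical names, and the stand-ins exist only so the example runs without a browser or network.

```python
from typing import Callable, Optional, Sequence, Tuple

# A tier fetcher returns page text, or None when it considers itself blocked.
# All names in this sketch are hypothetical.
TierFetch = Callable[[str], Optional[str]]


def fetch_with_escalation(
    url: str, tiers: Sequence[Tuple[str, TierFetch]]
) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, text) from the first success."""
    for name, fetcher in tiers:
        text = fetcher(url)
        if text:  # non-empty text counts as success in this sketch
            return name, text
    raise RuntimeError(f"all tiers failed for {url}")


# Stand-in fetchers so the sketch runs offline.
def plain_http(url: str) -> Optional[str]:
    return None  # pretend the page needs JavaScript


def stealth_browser(url: str) -> Optional[str]:
    return f"rendered:{url}"
```

Calling `fetch_with_escalation("https://example.com", [("T0", plain_http), ("T1", stealth_browser)])` returns the Tier 1 result, which mirrors how `page.tier` reports the tier that was used.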

🔌 In an AI agent

Hermes Agent example, dropping the canary into the session-start hook:

# ~/.hermes/config.yaml
hooks:
  on_session_start:
    - command: "unblock-web verify"
      timeout: 30

📖 Decision Tree

flowchart TD
    A[๐ŸŒ URL incoming] --> B{What kind of block?}

    B -->|Plain blog/docs/<br/>GitHub README| T0[โšก Tier 0: scrapling.get<br/>fastest, no browser]
    B -->|JS-rendered SPA<br/>React/Next/Vue| T1[๐Ÿ›ก๏ธ Tier 1: stealthy_fetch<br/>+ network_idle + wait]
    B -->|Cloudflare Turnstile<br/>'Checking browser'| T1B[๐Ÿ›ก๏ธ Tier 1: stealthy_fetch<br/>+ solve_cloudflare=True]
    B -->|x.com tweet body| T1C[๐Ÿ›ก๏ธ Tier 1: stealthy_fetch<br/>captures DOM pre-modal]
    B -->|x.com replies/thread| T3[๐Ÿชž Tier 3: xcancel.com mirror<br/>via Tier 1 stealth]
    B -->|๐Ÿ‡ฎ๐Ÿ‡ฉ ISP DNS block<br/>internet-positif| T2[๐ŸŒ Tier 2: TinyFish<br/>--proxy US]
    B -->|Geo-locked content| T2B[๐ŸŒ Tier 2: TinyFish<br/>--proxy XX]
    B -->|Login required<br/>DMs/private/paywall| T4[๐Ÿ”‘ Tier 4: xurl + bearer<br/>or cookie injection]

    T0 --> R[โœ… Markdown out]
    T1 --> R
    T1B --> R
    T1C --> R
    T3 --> R
    T2 --> R
    T2B --> R
    T4 --> R

    style T0 fill:#10b981,stroke:#059669,color:#fff
    style T1 fill:#f59e0b,stroke:#d97706,color:#fff
    style T1B fill:#f59e0b,stroke:#d97706,color:#fff
    style T1C fill:#f59e0b,stroke:#d97706,color:#fff
    style T2 fill:#06b6d4,stroke:#0891b2,color:#fff
    style T2B fill:#06b6d4,stroke:#0891b2,color:#fff
    style T3 fill:#a855f7,stroke:#9333ea,color:#fff
    style T4 fill:#ef4444,stroke:#dc2626,color:#fff
    style R fill:#22c55e,stroke:#16a34a,color:#fff
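
The flowchart above can also be read as a lookup table from block class to tier plus options. The following is a minimal sketch under that reading: the `Plan` type, `plan_for`, and the block-class keys are hypothetical, while the tiers and flags mirror the branches of the flowchart (the `wait` value is borrowed from the Tier 1 example later in this README).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Plan:
    tier: str
    options: Dict[str, object] = field(default_factory=dict)


# Keys are hypothetical block-class labels; tiers and options follow the
# branches of the flowchart above.
DECISION_TABLE = {
    "static": Plan("T0"),
    "spa": Plan("T1", {"network_idle": True, "wait": 5000}),
    "cloudflare_turnstile": Plan("T1", {"solve_cloudflare": True}),
    "x_tweet": Plan("T1"),
    "x_thread": Plan("T3", {"mirror": "xcancel.com"}),
    "isp_dns_block": Plan("T2", {"proxy": "US"}),
    "geo_locked": Plan("T2"),
    "login_required": Plan("T4"),
}


def plan_for(block_class: str) -> Plan:
    """Look up the tier plan for a block class, rejecting unknown classes."""
    try:
        return DECISION_TABLE[block_class]
    except KeyError:
        raise ValueError(f"unknown block class: {block_class!r}") from None
```

Keeping the mapping as data rather than nested conditionals makes it easy to add a row when a new block class shows up.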

๐Ÿ›ก๏ธ The 4-Tier Stack

โšก Tier 0: Plain HTTP

Tool scrapling.Fetcher().get(url)
Cost Free, ~100ms
Use for Static HTML, GitHub READMEs, JSON APIs, blogs without anti-bot
Fails on Anything client-rendered

๐Ÿ›ก๏ธ Tier 1: Scrapling Stealth (PRIMARY)

Tool mcp_scrapling_stealthy_fetch / StealthyFetcher.fetch()
Engine Patchright (anti-fingerprint Chromium fork)
Cost Free, local CPU, ~5-15s
Use for x.com tweets ยท Cloudflare Turnstile ยท React/Next/Vue SPAs ยท 99% of "hard" pages
Killer flags solve_cloudflare=True, network_idle=True, wait=5000

from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    url,
    network_idle=True,        # wait for XHR traffic to settle
    solve_cloudflare=True,    # auto-handle the Turnstile JS challenge
    wait=5000,                # ms; gives the SPA time to hydrate
)

📚 Full param reference: docs/tier-1-scrapling.md

🌐 Tier 2: TinyFish (geo-proxy)

Tool: scripts/tinyfish_fetch.py
Engine: Remote browser farm via REST API
Cost: Free unlimited (no credit card, no rate limit advertised)
Use for: ISP DNS blocks (🇮🇩 Internet Positif) · geo-locked content · second opinion · when local Chromium is busy
Fails on: x.com tweets (TinyFish's server-side render returns before x.com's React app boots), login walls
python3 scripts/tinyfish_fetch.py "https://blocked-site.com" --proxy US
python3 scripts/tinyfish_fetch.py --search "your query"  # bonus: free search API

📚 Setup + edge cases: docs/tier-2-tinyfish.md

🪞 Tier 3: Aggregator Mirrors

Tool: Tier 1 stealth → xcancel.com/<user>/status/<id>
Cost: Free
Use for: X/Twitter replies, threads, full conversation context that won't render unauthenticated
Bonus: Multilingual replies preserved (verified: EN/JP/CN/VI/IT in one fetch)
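
The mirror step amounts to rewriting a status URL onto the mirror host before handing it to Tier 1. A minimal sketch of that rewrite, assuming the mirror keeps the same /<user>/status/<id> path shown above; `to_mirror_url` and the accepted host list are hypothetical, not part of unblock-web.

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts treated as X/Twitter in this sketch (an assumption, extend as needed).
X_HOSTS = {"x.com", "www.x.com", "twitter.com", "www.twitter.com"}


def to_mirror_url(url: str, mirror_host: str = "xcancel.com") -> str:
    """Rewrite an X status URL to the same /<user>/status/<id> path on a mirror."""
    parts = urlsplit(url)
    if parts.hostname not in X_HOSTS:
        raise ValueError(f"not an X/Twitter URL: {url}")
    segments = [s for s in parts.path.split("/") if s]
    # Expect /<user>/status/<id>; anything else is not a tweet permalink.
    if len(segments) < 3 or segments[1] != "status" or not segments[2].isdigit():
        raise ValueError(f"not a status URL: {url}")
    path = "/" + "/".join(segments[:3])
    # Query string and fragment are dropped; tracking parameters are not needed.
    return urlunsplit(("https", mirror_host, path, "", ""))
```

Passing `mirror_host` as a parameter keeps the function usable if a different mirror has to be swapped in.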

📚 Mirror rotation tips: docs/tier-3-mirrors.md

🔑 Tier 4: Authenticated APIs

Tool: xurl + bearer token
Cost: Free tier (1500 reads/mo on X)
Use for: DMs · private accounts · POST operations · paywalled content
Setup: One-time signup at developer.x.com
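
For readers who want to see what a bearer-authenticated read looks like without xurl, here is a minimal sketch using only the standard library. It swaps in `urllib` for xurl and assumes the X API v2 tweet lookup endpoint (`/2/tweets/<id>`); `build_tweet_request` is a hypothetical helper, and the request is only built, not sent.

```python
import urllib.request


def build_tweet_request(tweet_id: str, bearer_token: str) -> urllib.request.Request:
    """Build (but do not send) a bearer-authenticated tweet lookup request."""
    if not tweet_id.isdigit():
        raise ValueError("tweet_id must be numeric")
    # Assumed endpoint: X API v2 single-tweet lookup.
    url = f"https://api.x.com/2/tweets/{tweet_id}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {bearer_token}"},
        method="GET",
    )
```

Sending it would be a matter of passing the request to `urllib.request.urlopen`, with the token read from an environment variable rather than hard-coded.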

📚 Step-by-step bearer setup: docs/tier-4-authenticated.md


🧪 Verified Targets

The stack was tested against these targets (May 2026); every result is reproducible:

| 🎯 Target | 🛠️ Tier | 📦 Result |
|---|---|---|
| 🐦 x.com/<user>/status/<id> (no auth) | T1 + wait=5000 | ✅ Full tweet body + meta + view count + quoted tweet |
| 🛡️ nowsecure.nl (Cloudflare anti-bot test) | T1 + solve_cloudflare=True | ✅ Returns "NOWSECURE / by nodriver" (only served to humans) |
| 🪞 xcancel.com/<user>/status/<id> (CF-protected) | T1 + solve_cloudflare=True | ✅ Tweet + 11 replies (multilingual) |
| 🇮🇩 web3.okx.com (Indonesian ISP block) | T2 + --proxy US | ✅ Full JS render + prize pool data |
| 📚 GitHub README | T0 | ✅ Markdown extract |
| 📰 News-site SPA (React) | T1 + wait=8000 | ✅ Article body |

Reproduce these: see examples/ for runnable scripts.


🩺 Health Monitoring

Three layers, no cron required (built for laptops that sleep):

🚦 Session-start canary

Drop it into any agent's session-start hook. It stays silent when the stack is healthy and alerts on regression:

# Hermes Agent example (~/.hermes/config.yaml)
hooks:
  on_session_start:
    - command: "/path/to/scripts/verify-stack.py"
      timeout: 30
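
The "silent when healthy, loud on regression" behaviour can be sketched as a small probe runner. This is an illustration of the pattern, not the contents of verify-stack.py: `run_canary`, `main`, and the probe names are hypothetical, and each probe is simply a callable that returns True when its tier works.

```python
import sys
from typing import Callable, Dict, List


def run_canary(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each probe and return the names of the ones that failed."""
    failed = []
    for name, probe in probes.items():
        try:
            ok = probe()
        except Exception:  # a crashing probe counts as a failure
            ok = False
        if not ok:
            failed.append(name)
    return failed


def main(probes: Dict[str, Callable[[], bool]]) -> int:
    """Return a process exit code: 0 when healthy, 1 on any regression."""
    failed = run_canary(probes)
    if failed:
        # Only speak up on regression; a healthy run prints nothing.
        print("unhealthy tiers: " + ", ".join(failed), file=sys.stderr)
        return 1
    return 0
```

A hook runner can then treat a non-zero exit code as the alert, which is why the healthy path writes nothing at all.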

🔧 Self-heal on Chromium loss

When stealthy_fetch fails with the error "Executable doesn't exist" (for example, after the venv is recreated), run:

bash scripts/heal-chromium.sh

Idempotent. Safe to run anytime.
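
Because the heal script is idempotent, it is safe to wire into a retry wrapper. A minimal sketch, assuming the error message contains the text "Executable doesn't exist" as described above; `with_self_heal` and `run_heal_script` are hypothetical helpers, and the script path is the one shown in this section.

```python
import subprocess
from typing import Callable, TypeVar

T = TypeVar("T")

MISSING_BROWSER = "Executable doesn't exist"
HEAL_COMMAND = ["bash", "scripts/heal-chromium.sh"]


def run_heal_script() -> None:
    """Reinstall Chromium by running the heal script from the repo root."""
    subprocess.run(HEAL_COMMAND, check=True)


def with_self_heal(
    action: Callable[[], T], heal: Callable[[], object] = run_heal_script
) -> T:
    """Run action; if Chromium is missing, heal once and retry once."""
    try:
        return action()
    except Exception as exc:
        if MISSING_BROWSER not in str(exc):
            raise  # unrelated failure: do not mask it
    heal()
    return action()
```

The wrapper retries exactly once, so a failure that persists after healing still surfaces to the caller.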

👀 On-demand audit

python3 scripts/verify-stack.py --verbose

🎨 Why "Anti-Blok"?

Because every "scraping tutorial" online stops at:

"Just install Playwright! Just use Selenium! Just pay for ScrapingBee!"

Then you hit the real world:

  • 🇮🇩 ISP poisoning your DNS
  • 🇨🇳 GFW dropping your packets
  • ☁️ Cloudflare upgrading Turnstile every quarter
  • 🐦 X.com adding login walls overnight
  • 🐧 Ubuntu 26.04 breaking Playwright install

unblock-web is the field-tested decision tree from those battles. Free tools only. No API keys hoarded. Reproducible against listed targets.


๐Ÿ“ Repository Structure

unblock-web/
โ”œโ”€โ”€ ๐Ÿ“– README.md              โ† you are here
โ”œโ”€โ”€ ๐Ÿ“œ LICENSE                 โ† MIT
โ”œโ”€โ”€ ๐Ÿ“š docs/                   โ† per-tier deep dives
โ”‚   โ”œโ”€โ”€ tier-1-scrapling.md
โ”‚   โ”œโ”€โ”€ tier-2-tinyfish.md
โ”‚   โ”œโ”€โ”€ tier-3-mirrors.md
โ”‚   โ”œโ”€โ”€ tier-4-authenticated.md
โ”‚   โ””โ”€โ”€ ubuntu-26-04-fix.md
โ”œโ”€โ”€ ๐Ÿ› ๏ธ scripts/                โ† drop-in tools
โ”‚   โ”œโ”€โ”€ verify-stack.py        โ† 3-tier canary
โ”‚   โ”œโ”€โ”€ heal-chromium.sh       โ† Ubuntu 26.04 fix
โ”‚   โ””โ”€โ”€ tinyfish_fetch.py      โ† Tier 2 wrapper
โ”œโ”€โ”€ ๐Ÿงช examples/               โ† reproducible cases
โ”‚   โ”œโ”€โ”€ x_com_tweet.py
โ”‚   โ”œโ”€โ”€ cloudflare_bypass.py
โ”‚   โ”œโ”€โ”€ indonesian_isp_bypass.py
โ”‚   โ””โ”€โ”€ xcancel_replies.py
โ”œโ”€โ”€ โš™๏ธ  .github/workflows/      โ† CI canary
โ”‚   โ””โ”€โ”€ canary.yml
โ””โ”€โ”€ ๐ŸŽจ assets/                 โ† logo + diagrams

๐Ÿค Contributing

Found a target the stack can't crack? Open an issue with:

  1. โ“ The URL (or pattern)
  2. ๐Ÿ“‹ What each tier returned (paste the failure)
  3. ๐Ÿค” Hypothesis (login? CF v3? new anti-bot?)

Or send a PR to docs/known-targets.md when you find a workaround.


โš–๏ธ Ethics & Legal

This stack is for reading publicly accessible content:

โœ… Public tweets, blogs, docs, GitHub
โœ… Content you're entitled to read in a browser
โœ… APIs you have keys for

โŒ Don't use it to:

  • Scrape behind authentication you don't own
  • Violate site Terms of Service
  • Mass-extract copyrighted content
  • Build credential-harvesting / phishing tools

Respect robots.txt. Respect rate limits. Be a good citizen of the open web.
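
Checking robots.txt before fetching can be done with the standard library alone. The sketch below parses robots.txt text that has already been downloaded, so it runs offline; `allowed_by_robots` and the user-agent string in the usage note are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With a robots.txt containing `User-agent: *` and `Disallow: /private/`, a call such as `allowed_by_robots(text, "my-agent", "https://example.com/private/x")` reports that the page should be skipped.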


๐Ÿ™ Credits

Stack composed from:

  • ๐Ÿ›ก๏ธ Scrapling โ€” the unified scraping library
  • ๐Ÿฅท Patchright โ€” anti-fingerprint Playwright fork
  • ๐ŸŸ TinyFish โ€” free fetch + search API for AI agents
  • ๐Ÿชž xcancel.com โ€” Twitter content mirror that survives
  • ๐Ÿค xurl โ€” official X CLI

Built with ๐Ÿฅท by @kevinnft
Field-tested in Indonesian internet conditions.
