Anti-blok web scraping stack: 4-tier decision tree for Cloudflare Turnstile, x.com login walls, ISP DNS poisoning
unblock-web
Anti-blok web scraping stack for AI agents
Cloudflare Turnstile · ISP DNS poison · X.com login walls: solved.
English · 🇮🇩 Bahasa Indonesia
Quick Start · Decision Tree · Tiers · Verified Targets · Contributing
What This Solves
You hit a URL. It returns junk:
- "Please enable JavaScript" (x.com tweets, SPAs)
- "Checking your browser..." (Cloudflare Turnstile)
- HTTP 403 / 503 (bot detection)
- "internet-positif.info" (ISP DNS poison, 🇮🇩)
- "Sign in to view" (login walls)
unblock-web is a decision tree + verified scripts that pick the right tool for each block class. Drop it into any AI agent (Claude, Hermes, Cursor, Aider, your own) and stop guessing with raw curl/wget/playwright.
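To make the block classes above concrete, here is a minimal sketch of how a response could be sorted into one of them with plain string checks. The function name `classify_block` and the class labels are hypothetical and are not part of the unblock-web API; the marker strings are the ones quoted in the list above.

```python
def classify_block(status: int, body: str) -> str:
    """Map an HTTP status and response body to a coarse block class.

    The labels returned here are illustrative only, not unblock-web's own.
    """
    text = body.lower()
    # Body markers are checked before the status code, because a challenge
    # page is often served with a 403 or 503 status as well.
    if "internet-positif" in text:
        return "isp_dns"
    if "checking your browser" in text:
        return "cloudflare"
    if "sign in to view" in text:
        return "login_wall"
    if "please enable javascript" in text:
        return "js_required"
    if status in (403, 503):
        return "bot_detection"
    return "ok"


if __name__ == "__main__":
    print(classify_block(503, "<title>Checking your browser...</title>"))
```

Checking the body first is a deliberate choice in this sketch: the status code alone cannot distinguish a Turnstile challenge from generic bot detection.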
Status (May 2026): All 4 tiers verified working on Ubuntu 26.04 + WSL2.
Features
| What | Why it matters |
|---|---|
| 4-tier escalation | Right tool per block class, no shotgun retries |
| Cloudflare Turnstile bypass | Patchright stealth, no paid SaaS |
| X.com tweets without login | DOM captured before login modal mounts |
| ISP DNS bypass | Geo-proxy via TinyFish (free unlimited) |
| Self-healing | One script reinstalls Chromium when an update wipes it |
| Built-in canary | 3-tier health probe, drops into your CI or session-start hook |
| Zero paid services | Local Chromium + free TinyFish API + free aggregator mirrors |
| Python stdlib only | No requests, no httpx, no extras for the canary itself |
Quick Start
Pick your favorite install method. All four work right now.
One-liner (zero-config)
curl -fsSL https://raw.githubusercontent.com/kevinnft/unblock-web/main/scripts/install.sh | bash
Picks a working Python (3.11 to 3.13), creates an isolated venv at ~/.unblock-web, installs Chromium via heal, and symlinks unblock-web into ~/.local/bin. Reversible: rm -rf ~/.unblock-web ~/.local/bin/unblock-web.
pip
pip install 'unblock-web[stealth]'
unblock-web heal # one-time: auto-detects OS, installs Chromium
unblock-web verify # 3-tier health check
unblock-web fetch https://x.com/elonmusk/status/123456789
Docker (zero-install)
docker run --rm ghcr.io/kevinnft/unblock-web:latest fetch https://example.com
# With TinyFish (Tier 2 geo-proxy)
docker run --rm \
-e TINYFISH_API_KEY=$TINYFISH_API_KEY \
ghcr.io/kevinnft/unblock-web:latest fetch https://blocked.com --proxy US
From source
git clone https://github.com/kevinnft/unblock-web.git
cd unblock-web
pip install -e '.[stealth]'
unblock-web heal
unblock-web verify --verbose
Library
from unblock_web import fetch
# Auto-pilot: picks the right tier per URL
page = fetch("https://x.com/seelffff/status/2055155782367187375")
print(page.text)
print(f"Used tier: {page.tier}")
# Force ISP/geo bypass
page = fetch("https://web3.okx.com", proxy_country="US")
# Force a specific tier
page = fetch("https://target.com", tier="T1", wait=8000)
In an AI agent
Hermes Agent example (drop the canary into session-start):
# ~/.hermes/config.yaml
hooks:
  on_session_start:
    - command: "unblock-web verify"
      timeout: 30
Decision Tree
flowchart TD
A[URL incoming] --> B{What kind of block?}
B -->|Plain blog/docs/<br/>GitHub README| T0[Tier 0: scrapling.get<br/>fastest, no browser]
B -->|JS-rendered SPA<br/>React/Next/Vue| T1[Tier 1: stealthy_fetch<br/>+ network_idle + wait]
B -->|Cloudflare Turnstile<br/>'Checking browser'| T1B[Tier 1: stealthy_fetch<br/>+ solve_cloudflare=True]
B -->|x.com tweet body| T1C[Tier 1: stealthy_fetch<br/>captures DOM pre-modal]
B -->|x.com replies/thread| T3[Tier 3: xcancel.com mirror<br/>via Tier 1 stealth]
B -->|🇮🇩 ISP DNS block<br/>internet-positif| T2[Tier 2: TinyFish<br/>--proxy US]
B -->|Geo-locked content| T2B[Tier 2: TinyFish<br/>--proxy XX]
B -->|Login required<br/>DMs/private/paywall| T4[Tier 4: xurl + bearer<br/>or cookie injection]
T0 --> R[Markdown out]
T1 --> R
T1B --> R
T1C --> R
T3 --> R
T2 --> R
T2B --> R
T4 --> R
style T0 fill:#10b981,stroke:#059669,color:#fff
style T1 fill:#f59e0b,stroke:#d97706,color:#fff
style T1B fill:#f59e0b,stroke:#d97706,color:#fff
style T1C fill:#f59e0b,stroke:#d97706,color:#fff
style T2 fill:#06b6d4,stroke:#0891b2,color:#fff
style T2B fill:#06b6d4,stroke:#0891b2,color:#fff
style T3 fill:#a855f7,stroke:#9333ea,color:#fff
style T4 fill:#ef4444,stroke:#dc2626,color:#fff
style R fill:#22c55e,stroke:#16a34a,color:#fff
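The flowchart can also be read as a lookup table. The sketch below is one way to express it in code, assuming hypothetical block-class labels and a hypothetical `pick_tier` helper; neither is part of the unblock-web API. The option values come from the diagram, except `wait=5000`, which is borrowed from the Tier 1 flags elsewhere in this README.

```python
from typing import Dict, Optional, Tuple

# Block class -> (tier, suggested options). Labels are illustrative only.
ROUTES: Dict[str, Tuple[str, dict]] = {
    "static": ("T0", {}),
    "spa": ("T1", {"network_idle": True, "wait": 5000}),
    "cloudflare": ("T1", {"solve_cloudflare": True}),
    "x_tweet": ("T1", {}),
    "x_thread": ("T3", {"mirror": "xcancel.com"}),
    "isp_dns": ("T2", {"proxy": "US"}),
    "geo_locked": ("T2", {}),
    "login_required": ("T4", {}),
}


def pick_tier(block_class: str, proxy_country: Optional[str] = None) -> Tuple[str, dict]:
    """Return the tier and a fresh options dict for a block class."""
    try:
        tier, options = ROUTES[block_class]
    except KeyError:
        raise ValueError(f"unknown block class: {block_class!r}") from None
    options = dict(options)  # copy so callers cannot mutate the table
    if tier == "T2" and proxy_country:
        options["proxy"] = proxy_country  # caller-chosen exit country
    return tier, options


if __name__ == "__main__":
    print(pick_tier("cloudflare"))
    print(pick_tier("geo_locked", proxy_country="JP"))
```

Keeping the routing in a table rather than a chain of `if` statements means a new block class is a one-line change.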
The 4-Tier Stack
Tier 0: Plain HTTP
| Tool | scrapling.Fetcher().get(url) |
| Cost | Free, ~100ms |
| Use for | Static HTML, GitHub READMEs, JSON APIs, blogs without anti-bot |
| Fails on | Anything client-rendered |
Tier 1: Scrapling Stealth (PRIMARY)
| Tool | mcp_scrapling_stealthy_fetch / StealthyFetcher.fetch() |
| Engine | Patchright (anti-fingerprint Playwright fork driving Chromium) |
| Cost | Free, local CPU, ~5-15s |
| Use for | x.com tweets · Cloudflare Turnstile · React/Next/Vue SPAs · most "hard" pages |
| Killer flags | solve_cloudflare=True, network_idle=True, wait=5000 |
from scrapling.fetchers import StealthyFetcher  # import path may differ by Scrapling version
page = StealthyFetcher.fetch(
    url,                    # target URL, defined by the caller
    network_idle=True,      # wait for XHR traffic to settle
    solve_cloudflare=True,  # auto-handle the Turnstile challenge
    wait=5000,              # ms; give the SPA time to hydrate
)
Full param reference: docs/tier-1-scrapling.md
Tier 2: TinyFish (geo-proxy)
| Tool | scripts/tinyfish_fetch.py |
| Engine | Remote browser farm via REST API |
| Cost | Free unlimited (no credit card, no rate limit advertised) |
| Use for | ISP DNS blocks (🇮🇩 Internet Positif) · geo-locked content · second opinion · when local Chromium is busy |
| Fails on | x.com tweets (TinyFish's render returns before x.com's React app boots), login walls |
python3 scripts/tinyfish_fetch.py "https://blocked-site.com" --proxy US
python3 scripts/tinyfish_fetch.py --search "your query" # bonus: free search API
Setup + edge cases: docs/tier-2-tinyfish.md
Tier 3: Aggregator Mirrors
| Tool | Tier 1 stealth → xcancel.com/<user>/status/<id> |
| Cost | Free |
| Use for | X/Twitter replies, threads, full conversation context that won't render unauthenticated |
| Bonus | Multilingual replies preserved (verified: EN/JP/CN/VI/IT in one fetch) |
Mirror rotation tips: docs/tier-3-mirrors.md
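As an illustration of the Tier 3 URL pattern, the sketch below rewrites a tweet URL to its mirror form using only the standard library. The helper name `to_mirror` and the accepted host list (including the twitter.com variants) are assumptions, not part of unblock-web; the resulting URL would still be fetched through Tier 1.

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts treated as X/Twitter in this sketch (an assumption, extend as needed).
X_HOSTS = {"x.com", "www.x.com", "twitter.com", "www.twitter.com"}


def to_mirror(url: str, mirror_host: str = "xcancel.com") -> str:
    """Rewrite https://x.com/<user>/status/<id> to the same path on a mirror."""
    parts = urlsplit(url)
    if parts.hostname not in X_HOSTS:
        raise ValueError(f"not an x.com URL: {url!r}")
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 3 or segments[1] != "status" or not segments[2].isdigit():
        raise ValueError(f"not a tweet URL: {url!r}")
    # Keep only <user>/status/<id>; tracking parameters and fragments are dropped.
    path = f"/{segments[0]}/status/{segments[2]}"
    return urlunsplit(("https", mirror_host, path, "", ""))


if __name__ == "__main__":
    print(to_mirror("https://x.com/seelffff/status/2055155782367187375?s=20"))
```

Passing `mirror_host` as a parameter leaves room for rotating to another mirror without touching the parsing logic.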
Tier 4: Authenticated APIs
| Tool | xurl + bearer token |
| Cost | Free tier (1500 reads/mo on X) |
| Use for | DMs · private accounts · POST operations · paywalled content |
| Setup | One-time signup at developer.x.com |
Step-by-step bearer setup: docs/tier-4-authenticated.md
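For readers who want to see what a bearer-token call looks like without xurl, here is a rough standard-library sketch. It swaps in raw HTTP rather than showing how xurl works internally. The helper name `build_tweet_request` is hypothetical, and the endpoint shape (`/2/tweets/<id>` on `api.x.com`) is an assumption based on the public X API v2; check developer.x.com before relying on it.

```python
import urllib.request


def build_tweet_request(tweet_id: str, bearer_token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request for one tweet."""
    if not tweet_id.isdigit():
        raise ValueError(f"tweet id must be numeric: {tweet_id!r}")
    # Assumed endpoint shape; confirm against the official API reference.
    url = f"https://api.x.com/2/tweets/{tweet_id}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {bearer_token}"},
        method="GET",
    )


if __name__ == "__main__":
    req = build_tweet_request("2055155782367187375", "YOUR_TOKEN_HERE")
    print(req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=15)
```

Building the request separately from sending it keeps the token handling easy to inspect and test offline.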
Verified Targets
Stack was tested against these (May 2026); every result is reproducible:
| Target | Tier | Result |
|---|---|---|
| x.com/<user>/status/<id> (no auth) | T1 + wait=5000 | Full tweet body + meta + view count + quoted tweet |
| nowsecure.nl (Cloudflare anti-bot test) | T1 + solve_cloudflare=True | Returns "NOWSECURE / by nodriver" (only served to humans) |
| xcancel.com/<user>/status/<id> (CF-protected) | T1 + solve_cloudflare=True | Tweet + 11 replies (multilingual) |
| 🇮🇩 web3.okx.com (Indonesian ISP block) | T2 + --proxy US | Full JS render + prize pool data |
| GitHub README | T0 | Markdown extract |
| News-site SPA (React) | T1 + wait=8000 | Article body |
Reproduce these: see examples/ for runnable scripts.
Health Monitoring
Three layers, no cron required (built for laptops that sleep):
Session-start canary
Drop into any agent's session-start hook. Silent on healthy state, alert on regression:
# Hermes Agent example (~/.hermes/config.yaml)
hooks:
  on_session_start:
    - command: "/path/to/scripts/verify-stack.py"
      timeout: 30
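The "silent on healthy state, alert on regression" behaviour can be sketched with the standard library alone. This is not the contents of verify-stack.py; `run_canary` and the probe names are hypothetical, and each probe is just a callable that returns a truthy value on success.

```python
import sys
from typing import Callable, Dict, TextIO


def run_canary(probes: Dict[str, Callable[[], bool]], out: TextIO = sys.stderr) -> int:
    """Run each probe; print only failures; return a shell-style exit code."""
    failures = []
    for name, probe in probes.items():
        try:
            if probe():
                continue  # healthy probes produce no output at all
            failures.append(f"{name}: probe returned a falsy result")
        except Exception as exc:  # a crashing probe counts as a failure
            failures.append(f"{name}: {type(exc).__name__}: {exc}")
    for line in failures:
        print(f"[canary] FAIL {line}", file=out)
    return 1 if failures else 0


if __name__ == "__main__":
    # Placeholder probes; real ones would exercise each tier.
    sys.exit(run_canary({"tier0": lambda: True, "tier1": lambda: True}))
```

Returning a non-zero exit code on any failure lets the same script serve both a session-start hook and a CI job.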
Self-heal on Chromium loss
When stealthy_fetch fails with "Executable doesn't exist" (for example, after the venv is recreated), run the heal script:
bash scripts/heal-chromium.sh
Idempotent. Safe to run anytime.
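One way an agent could wire this up is to catch the error, run the heal script once, and retry. The sketch below assumes the error message contains the text quoted above; `needs_heal` and `run_with_heal` are hypothetical helpers, and the `runner` parameter exists only so the subprocess call can be replaced in tests.

```python
import subprocess
from typing import Callable, Sequence

HEAL_MARKER = "Executable doesn't exist"


def needs_heal(error: BaseException) -> bool:
    """True when the error text looks like a missing-browser failure."""
    return HEAL_MARKER in str(error)


def run_with_heal(
    action: Callable[[], object],
    heal_cmd: Sequence[str] = ("bash", "scripts/heal-chromium.sh"),
    runner: Callable = subprocess.run,
):
    """Run action; on a missing-browser error, heal once and retry once."""
    try:
        return action()
    except Exception as exc:
        if not needs_heal(exc):
            raise  # unrelated failures are not masked
        runner(list(heal_cmd), check=True)
        return action()  # a second failure propagates to the caller


if __name__ == "__main__":
    print(needs_heal(RuntimeError("Executable doesn't exist at /tmp/chrome")))
```

Retrying only once is intentional: if the heal script did not fix the problem, repeating it is unlikely to help.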
On-demand audit
python3 scripts/verify-stack.py --verbose
Why "Anti-Blok"?
Because every "scraping tutorial" online stops at:
"Just install Playwright! Just use Selenium! Just pay for ScrapingBee!"
Then you hit the real world:
- 🇮🇩 ISP poisoning your DNS
- 🇨🇳 GFW dropping your packets
- Cloudflare upgrading Turnstile every quarter
- X.com adding login walls overnight
- Ubuntu 26.04 breaking Playwright install
unblock-web is the field-tested decision tree from those battles. Free tools only. No API keys hoarded. Reproducible against listed targets.
Repository Structure
unblock-web/
├── README.md              ← you are here
├── LICENSE                ← MIT
├── docs/                  ← per-tier deep dives
│   ├── tier-1-scrapling.md
│   ├── tier-2-tinyfish.md
│   ├── tier-3-mirrors.md
│   ├── tier-4-authenticated.md
│   └── ubuntu-26-04-fix.md
├── scripts/               ← drop-in tools
│   ├── verify-stack.py    ← 3-tier canary
│   ├── heal-chromium.sh   ← Ubuntu 26.04 fix
│   └── tinyfish_fetch.py  ← Tier 2 wrapper
├── examples/              ← reproducible cases
│   ├── x_com_tweet.py
│   ├── cloudflare_bypass.py
│   ├── indonesian_isp_bypass.py
│   └── xcancel_replies.py
├── .github/workflows/     ← CI canary
│   └── canary.yml
└── assets/                ← logo + diagrams
Contributing
Found a target the stack can't crack? Open an issue with:
- The URL (or pattern)
- What each tier returned (paste the failure)
- Hypothesis (login? CF v3? new anti-bot?)
Or send a PR to docs/known-targets.md when you find a workaround.
Ethics & Legal
This stack is for reading publicly accessible content:
- Public tweets, blogs, docs, GitHub
- Content you're entitled to read in a browser
- APIs you have keys for
Don't use it to:
- Scrape behind authentication you don't own
- Violate site Terms of Service
- Mass-extract copyrighted content
- Build credential-harvesting / phishing tools
Respect robots.txt. Respect rate limits. Be a good citizen of the open web.
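Checking robots.txt before a fetch takes only a few lines with Python's standard library. The sketch below parses robots.txt text that the caller has already downloaded; the helper name `allowed_by_robots` and the `unblock-web` user-agent string are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "unblock-web") -> bool:
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network call
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "https://example.com/blog/post"))
    print(allowed_by_robots(rules, "https://example.com/private/data"))
```

Taking the robots.txt text as an argument keeps the check offline and lets the caller decide how, and how often, to download the file.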
Credits
Stack composed from:
- Scrapling: the unified scraping library
- Patchright: anti-fingerprint Playwright fork
- TinyFish: free fetch + search API for AI agents
- xcancel.com: Twitter content mirror that survives
- xurl: official X CLI
Built with 🥷 by @kevinnft
Field-tested in Indonesian internet conditions.