Advanced ML-Guided Anti-Bot Evasion and Stealth Scraping Framework with AI Extraction
ghost_bypass
Advanced ML-Guided Anti-Bot Evasion and Stealth Scraping Framework
Scrape any website. Works on Cloudflare-protected sites, WAFs, anti-bot systems, GDPR walls, and plain HTTP sites — automatically choosing the right technique.
✨ What makes it different
| Feature | ghost_bypass |
|---|---|
| ML level selection | UCB1 bandit remembers which bypass level works per domain |
| Domain-aware proxies | Proxy A banned on site-X ≠ Proxy A banned on site-Y |
| 12 bypass levels (L0–L11) | Auto-escalates from fast → stealthy → headful browser |
| CF jump logic | Detects Cloudflare → immediately promotes to headful UC |
| Versatile extraction | Returns HTML, text, links, images, meta — works on any site |
| Custom extractors | Pass your own fn(html, url) to get structured data in one call |
| Zero config | Works out of the box with BypassEngine() (raises clear errors if optional extras are missing) |
Installation
# Minimum (requests only — L0, L11)
pip install ghost-bypass
# With playwright + selenium + TLS fingerprinting (recommended)
pip install "ghost-bypass[full]"
# Specific extras
pip install "ghost-bypass[playwright]" # L3–L6
pip install "ghost-bypass[selenium]" # L7–L8
pip install "ghost-bypass[tls]" # L1–L2
After installing Playwright extras:
playwright install chromium
Quick start
from ghost_bypass import BypassEngine
engine = BypassEngine()
result = engine.scrape("https://any-website.com/page/")
print(result['success']) # True
print(result['method']) # "L0:L0_requests_basic"
print(result['html']) # full page HTML
print(result['links']) # all absolute links
print(result['images']) # all image URLs
print(result['title']) # page title
Full ML stack (recommended)
from ghost_bypass import BypassEngine, SiteLearner, MLProxyManager
engine = BypassEngine(
    proxy_manager=MLProxyManager(),  # domain-aware UCB proxy rotation
    site_learner=SiteLearner(),      # per-domain level memory
)
result = engine.scrape("https://cloudflare-protected-site.com/")
- First run → tries L0, L1, L2… until success.
- Second run → jumps directly to what worked (e.g. L3), skipping slower levels.
- CF detected → immediately jumps to L8 (headful UC with Turnstile support).
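The per-domain memory described above can be pictured as a UCB1 bandit over bypass levels. The following is a minimal sketch under that assumption, not the library's actual code: `LevelBandit` and its methods are hypothetical names, and the scoring is the textbook UCB1 formula (mean success rate plus an exploration bonus).

```python
import math


class LevelBandit:
    """Toy UCB1 ranking of bypass levels for one domain (hypothetical helper)."""

    def __init__(self, levels):
        self.levels = list(levels)
        self.pulls = {lv: 0 for lv in self.levels}
        self.wins = {lv: 0 for lv in self.levels}

    def record(self, level, success):
        # Store the outcome of one attempt at this level.
        self.pulls[level] += 1
        self.wins[level] += int(success)

    def score(self, level):
        n = self.pulls[level]
        if n == 0:
            return float("inf")  # untried levels are explored first
        total = sum(self.pulls.values())
        mean = self.wins[level] / n
        return mean + math.sqrt(2 * math.log(total) / n)

    def ranked(self):
        # sorted() is stable, so tied levels keep their original order.
        return sorted(self.levels, key=self.score, reverse=True)


bandit = LevelBandit(["L0_requests_basic", "L1_requests_tls", "L3_playwright_stealth"])
for _ in range(5):
    bandit.record("L0_requests_basic", False)
    bandit.record("L1_requests_tls", False)
    bandit.record("L3_playwright_stealth", True)
print(bandit.ranked()[0])  # L3_playwright_stealth
```

With no history every level scores infinity, so the default order is preserved; once one level has a record of successes it moves to the front of the chain.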
Bypass levels (L0 → L11)
| Level | Name | Technology | CF bypass |
|---|---|---|---|
| L0 | requests_basic | requests + real headers | ❌ |
| L1 | requests_tls | curl_cffi Chrome TLS fingerprint | ⚠️ partial |
| L2 | httpx_http2 | httpx HTTP/2 | ❌ |
| L3 | playwright_stealth | Playwright headless + stealth JS | ⚠️ partial |
| L4 | playwright_headful | Playwright visible + stealth JS | ✅ most sites |
| L5 | playwright_mobile_headless | Mobile emulation, headless | ⚠️ |
| L6 | playwright_mobile_headful | Mobile emulation, visible | ✅ |
| L7 | uc_headless | Undetected ChromeDriver headless | ✅ |
| L8 | uc_headful | Undetected ChromeDriver visible + Turnstile | ✅✅ best |
| L9 | drission | DrissionPage Chromium hybrid | ✅ |
| L10 | requests_html | pyppeteer JS rendering | ⚠️ partial |
| L11 | mechanize | Classic HTTP (legacy sites) | ❌ |
Result dict
result = engine.scrape(url)
result['success'] # bool
result['url'] # final URL after all redirects
result['status_code'] # HTTP status (or None for browser methods)
result['html'] # full page HTML
result['text'] # plain text (stripped HTML)
result['title'] # <title> tag content
result['meta'] # {name: content} for all <meta> tags
result['links'] # deduplicated list of absolute <a href> links
result['images'] # deduplicated list of absolute <img src> URLs
result['scripts'] # absolute <script src> URLs
result['cookies'] # {name: value} dict
result['headers'] # response headers dict
result['method'] # e.g. "L3:L3_playwright_stealth" (format: "L{n}:{level_name}")
result['level'] # integer 0–11
result['cf_detected'] # True if Cloudflare was detected on any attempt
result['duration'] # total seconds across all attempts
result['attempts'] # list of per-attempt detail dicts
result['data'] # custom extractor output (if extractor= provided)
result['error'] # error message if failed, else None
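The `links` field is documented as a deduplicated list of absolute `<a href>` URLs. As a rough illustration of what that normalization involves (not the library's implementation), the sketch below uses only the standard library; `LinkCollector` and `absolute_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects <a href> values, resolved against a base URL, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolves relative paths
        if absolute not in self._seen:  # keep first occurrence, preserve order
            self._seen.add(absolute)
            self.links.append(absolute)


def absolute_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


page = '<a href="/a">A</a><a href="b">B</a><a href="/a">again</a>'
print(absolute_links(page, "https://example.com/dir/page"))
# ['https://example.com/a', 'https://example.com/dir/b']
```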
Domain-aware proxy rotation
from ghost_bypass import MLProxyManager
mgr = MLProxyManager()
# Add your own proxies
mgr.add_proxies([
    "http://1.2.3.4:8080",
    "http://5.6.7.8:3128",
], tier="custom")
# Optionally fetch free public proxies (commented out because free proxies
# are unreliable against Cloudflare — use your own paid proxies for CF sites)
# mgr.fetch_free_proxies()
# Get best proxy for a specific domain
proxy = mgr.get_best_proxy(domain="example.com")
# Report outcome (feeds the UCB model)
mgr.report_result(
    proxy=proxy,
    domain="example.com",
    success=True,
    latency=1.2,
    cloudflare_blocked=False,
)
# Proxy reports
print(mgr.pool_summary())
print(mgr.best_for_domain("example.com", top_n=5))
print(mgr.get_banned_proxies())
print(mgr.get_banned_proxies(domain="example.com"))
# Unban manually
mgr.unban_proxy("http://1.2.3.4:8080") # global
mgr.unban_proxy("http://1.2.3.4:8080", domain="site.com") # domain only
How domain-aware banning works
Proxy "http://1.2.3.4:8080"
├── global: healthy (success_rate=0.85)
├── example.com: healthy (3 successes, 0 failures)
├── cloudflare-site.com: CF-BANNED for 1h (got 403)
└── slow-site.org: domain-banned for 15m (< 15% success)
A proxy banned on cloudflare-site.com is still available for example.com.
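One way to model this is a ban table keyed by (proxy, domain), where a `None` domain stands for a global ban. The sketch below is illustrative only: `BanTable` is a hypothetical class, and the one-hour duration simply mirrors the diagram above.

```python
import time


class BanTable:
    """Tracks bans per (proxy, domain); domain=None means a global ban."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._until = {}  # (proxy, domain or None) -> expiry timestamp

    def ban(self, proxy, seconds, domain=None):
        self._until[(proxy, domain)] = self._clock() + seconds

    def unban(self, proxy, domain=None):
        self._until.pop((proxy, domain), None)

    def is_banned(self, proxy, domain):
        now = self._clock()
        # A proxy is unusable if it is banned globally or for this domain.
        for key in ((proxy, None), (proxy, domain)):
            expiry = self._until.get(key)
            if expiry is not None and expiry > now:
                return True
        return False


bans = BanTable()
bans.ban("http://1.2.3.4:8080", seconds=3600, domain="cloudflare-site.com")
print(bans.is_banned("http://1.2.3.4:8080", "cloudflare-site.com"))  # True
print(bans.is_banned("http://1.2.3.4:8080", "example.com"))  # False
```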
Site memory (SiteLearner)
from ghost_bypass import SiteLearner
sl = SiteLearner()
# What does it know about a domain?
print(sl.domain_summary("example.com"))
# {
# "domain": "example.com",
# "cf_detected": false,
# "js_required": false,
# "last_success_method": "L0_requests_basic", # level_name format
# "last_seen": 1716823456.0,
# "methods_tried": 3
# }
# Get the ML-ranked level chain for a domain
print(sl.get_chain("example.com"))
# ["L0_requests_basic", "L1_requests_tls", "L3_playwright_stealth", ...]
# ^ Uses level_name format (no "L3:" prefix). The "L3:L3_xxx" format
# appears only in result['method'] after scraping.
# CF-incapable methods are automatically filtered if CF was previously detected
# All domains with stored memory
print(sl.all_domains())
# Erase memory for a domain (reset its chain)
sl.forget_domain("example.com")
Thread-Safe Proxy Leasing & Dynamic Delays
To scale concurrent scraping without triggering IP blocks or rate limits, ghost_bypass applies two concurrency controls: per-domain proxy leasing and adaptive rate-limit pacing.
1. Concurrent Proxy Leasing
When multiple workers run concurrently (e.g. in scrape_many), they must not make requests to the same target domain using the same proxy IP at the same time. The MLProxyManager enforces a lease mechanism:
- Lease Acquisition: When a worker attempts a bypass level, it borrows a highly rated proxy specifically leased for that target domain.
- Exclusion: Concurrent workers requesting the same domain will automatically bypass the leased proxy and select the next-best alternative.
- Lease Release: The proxy is always released back to the pool inside a finally block, whether the request succeeds or fails.
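The acquire, exclude, and release steps above can be sketched with a lock-protected set and a context manager. This is a simplified illustration under assumed names (`LeasePool`, `lease`); the real MLProxyManager also ranks proxies by their statistics, which this sketch replaces with a fixed best-first list.

```python
import threading
from contextlib import contextmanager


class LeasePool:
    """Hands out each proxy to at most one worker per domain at a time."""

    def __init__(self, ranked_proxies):
        self._ranked = list(ranked_proxies)  # assumed to be sorted best-first
        self._leased = set()  # (domain, proxy) pairs currently in use
        self._lock = threading.Lock()

    @contextmanager
    def lease(self, domain):
        with self._lock:
            # Pick the best proxy not already leased for this domain.
            proxy = next(
                (p for p in self._ranked if (domain, p) not in self._leased), None
            )
            if proxy is not None:
                self._leased.add((domain, proxy))
        try:
            yield proxy  # None means every proxy is busy for this domain
        finally:
            # Runs on success and on exceptions, so leases never leak.
            if proxy is not None:
                with self._lock:
                    self._leased.discard((domain, proxy))


pool = LeasePool(["http://1.2.3.4:8080", "http://5.6.7.8:3128"])
with pool.lease("example.com") as first:
    with pool.lease("example.com") as second:
        print(first, second)  # http://1.2.3.4:8080 http://5.6.7.8:3128
```

Keying the lease on the (domain, proxy) pair rather than on the proxy alone lets the same proxy serve a different domain concurrently, which matches the domain-aware design described earlier.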
You can toggle proxy leasing off if desired:
engine.scrape(url, lease_proxies=False)
2. Adaptive Rate Limit Pacing
SiteLearner monitors target domains for HTTP 429 (Too Many Requests) rate-limiting responses.
- Automatic Backoff: If a 429 is encountered, SiteLearner instantly raises the recommended delay for that domain.
- Decay: Over successful cycles, the pacing delay naturally decays back to the minimum configured baseline.
- Worker Sync: Concurrent workers in scrape_many automatically coordinate using a per-domain thread lock and respect the maximum of either:
  - User-specified custom/random delays (e.g., domain_delay=(2, 5))
  - SiteLearner's adaptive backoff delay.
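A small sketch of the backoff, decay, and "maximum of either" rules follows. The class name `AdaptivePacer` and the parameters `baseline`, `factor`, `decay`, and `ceiling` are assumptions for illustration; the source does not state SiteLearner's actual multipliers.

```python
import random


class AdaptivePacer:
    """Per-domain delay that grows on HTTP 429 and decays on other responses."""

    def __init__(self, baseline=0.5, factor=2.0, decay=0.8, ceiling=60.0):
        self.baseline = baseline  # minimum delay in seconds
        self.factor = factor  # multiplier applied after a 429
        self.decay = decay  # multiplier applied after a non-429 response
        self.ceiling = ceiling  # upper bound so the delay cannot grow forever
        self._delay = {}  # domain -> current adaptive delay

    def observe(self, domain, status_code):
        current = self._delay.get(domain, self.baseline)
        if status_code == 429:
            current = min(current * self.factor, self.ceiling)
        else:
            current = max(current * self.decay, self.baseline)
        self._delay[domain] = current

    def effective_delay(self, domain, domain_delay=0.0):
        # A (low, high) tuple means "pick a random delay in that range".
        if isinstance(domain_delay, tuple):
            user = random.uniform(*domain_delay)
        else:
            user = float(domain_delay)
        return max(user, self._delay.get(domain, self.baseline))


pacer = AdaptivePacer(baseline=1.0)
for _ in range(3):
    pacer.observe("site.com", 429)
print(pacer.effective_delay("site.com", domain_delay=(2.0, 5.0)))  # 8.0
```

After three consecutive 429 responses the adaptive delay (8 seconds) exceeds the user's 2 to 5 second range, so it wins; once successful responses decay it back down, the user-specified range takes over again.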
To invoke concurrent scraping with dynamic pacing:
urls = ["https://site.com/p1", "https://site.com/p2", "https://other.com/p1"]
# Scrape concurrently with 5 workers, a custom delay range, and ML pacing
results = engine.scrape_many(
    urls,
    workers=5,
    domain_delay=(2.0, 5.0),  # random delay between 2 and 5 seconds per domain
)
3-Tier Extraction
Extract structured data from pages immediately, with or without coding.
Tier 1: CSS selector dictionary
engine = BypassEngine()
result = engine.scrape("https://shop.example.com/product/", extract={
    "price": ".price",
    "title": "h1",
})
print(result['data']) # {"price": "$19.99", "title": "Cool Widget"}
Tier 2: Custom Python function
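from bs4 import BeautifulSoup

def my_extractor(html: str, url: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    stock = soup.select_one("#stock")
    # select_one returns None when nothing matches, so guard before reading text
    return {"stock": stock.get_text(strip=True) if stock else None}

result = engine.scrape(url, extractor=my_extractor)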
Tier 3: AI-powered extraction (Requires ghost-bypass[ai])
Pass a plain English prompt. Auto-detects local models (Ollama, LM Studio) or uses OpenAI/Anthropic/Gemini keys.
result = engine.scrape(url, prompt="extract product name, price, and stock status")
print(result['data']) # {"name": "Widget", "price": "$19.99", "stock": "In Stock"}
Rate limiting & parallel scraping
Scrape multiple URLs in parallel with scrape_many. Built-in per-domain locking spaces out requests when multiple workers hit the same domain, which reduces the risk of IP bans.
from ghost_bypass import BypassEngine
engine = BypassEngine(request_timeout=30)
urls = ["https://site.com/page1", "https://site.com/page2", "https://site.com/page3"]
# 5 parallel workers, but guarantees 2.0s delay between requests to site.com
results = engine.scrape_many(urls, workers=5, domain_delay=2.0)
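The guarantee of a minimum gap between requests to one domain, even with several workers, can be pictured as a per-domain lock plus a "last request" timestamp. The sketch below is an assumption about how such a throttle might look, not the library's code; `DomainThrottle` and `wait` are hypothetical names.

```python
import threading
import time
from collections import defaultdict


class DomainThrottle:
    """Blocks callers so requests to one domain are at least `delay` seconds apart."""

    def __init__(self, delay):
        self.delay = delay
        self._locks = defaultdict(threading.Lock)  # one lock per domain
        self._last = {}  # domain -> time of the most recent request
        self._guard = threading.Lock()  # protects creation of per-domain locks

    def wait(self, domain):
        with self._guard:
            lock = self._locks[domain]
        with lock:  # workers for the same domain queue up here
            last = self._last.get(domain)
            if last is not None:
                remaining = self.delay - (time.monotonic() - last)
                if remaining > 0:
                    time.sleep(remaining)
            self._last[domain] = time.monotonic()


throttle = DomainThrottle(delay=0.05)
start = time.monotonic()
throttle.wait("site.com")  # first call returns immediately
throttle.wait("site.com")  # second call sleeps for the rest of the gap
elapsed = time.monotonic() - start
print(f"two requests to site.com took {elapsed:.2f}s")
```

Workers targeting different domains take different locks, so they never wait on each other.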
For manual loops, add your own delays:
import time, random
for url in urls:
    result = engine.scrape(url)
    time.sleep(random.uniform(1.0, 3.0))  # min_delay=1.0, max_delay=3.0
Note: For aggressive scraping, use domain_delay=0 and supply a proxy pool to distribute requests across IPs.
The ghost CLI
ghost_bypass comes with a powerful CLI for scraping, proxy management, and AI key management.
# Scrape from the terminal
ghost scrape https://example.com --extract '{"title":"h1","price":".price"}'
ghost scrape https://example.com --prompt "extract product info"
# Parallel scraping
ghost scrape-many https://example.com/1 https://example.com/2 --workers 5
# Manage proxies
ghost proxy fetch # fetch free proxies
ghost proxy ping # test all proxies
ghost proxy list # list healthy proxies
# Manage site memory
ghost memory list # see which domains have CF detected
# Interactive REPL
ghost repl
# > /scrape https://example.com
# > /extract https://example.com {"title":"h1"}
# > /keys autodetect
AI Keys & Local Models
Use the CLI to manage keys for Tier 3 extraction:
ghost keys autodetect # Auto-discover Ollama/LM Studio running locally
ghost keys add openai sk-... # Add an API key
Security Note: API keys are stored in ~/.ghost_bypass/ai_keys.json using XOR obfuscation. This is NOT cryptographic encryption; it only prevents casual plaintext reading. For production security, use environment variables (OPENAI_API_KEY, etc.) or a secret vault.
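To see why XOR obfuscation is not encryption, consider the generic sketch below. It does not reproduce the library's actual storage format; the mask `b"ghost"` and the sample key are made up for the demonstration. The point is that anyone who knows a few plaintext bytes (API keys often share a fixed prefix) can recover the corresponding mask bytes.

```python
import os
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key; applying it twice restores the input."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


secret = b"sk-example-not-a-real-key"  # placeholder, not a real key
mask = b"ghost"  # hypothetical mask, chosen only for this demo

obfuscated = xor_bytes(secret, mask)
assert xor_bytes(obfuscated, mask) == secret  # the same operation reverses it

# Known-plaintext recovery: XOR the ciphertext with the known prefix "sk-".
recovered_mask_prefix = xor_bytes(obfuscated[:3], b"sk-")
print(recovered_mask_prefix)  # b'gho'

# The safer route named in the note: read the key from the environment.
api_key = os.environ.get("OPENAI_API_KEY")  # None if the variable is unset
```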
Cookie persistence
from ghost_bypass import BypassEngine, CookieManager
# Cookies auto-expire after ttl_days (default: 7)
cm = CookieManager(ttl_days=7)
engine = BypassEngine(cookie_manager=cm)
result = engine.scrape("https://cf-protected-site.com/")
# On repeat visits, saved cookies skip the CF challenge
# Manage cookies manually
print(cm.list_domains()) # domains with saved cookies
cm.clear("https://example.com") # clear one domain
cm.clear_all() # wipe all
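The TTL rule (cookies older than ttl_days are discarded) amounts to a timestamp comparison. The sketch below assumes a simple in-memory store with a `saved_at` field; that layout and the `prune_expired` helper are hypothetical, not CookieManager's real on-disk format.

```python
import time

SECONDS_PER_DAY = 86_400


def prune_expired(store, ttl_days=7, now=None):
    """Return only the entries saved within the last ttl_days days."""
    now = time.time() if now is None else now
    cutoff = now - ttl_days * SECONDS_PER_DAY
    return {
        domain: entry for domain, entry in store.items() if entry["saved_at"] >= cutoff
    }


store = {
    "example.com": {"saved_at": 1_000_000, "cookies": {"session": "abc"}},
    "old-site.org": {"saved_at": 100_000, "cookies": {"session": "xyz"}},
}
fresh = prune_expired(store, ttl_days=7, now=1_000_000 + 3 * SECONDS_PER_DAY)
print(sorted(fresh))  # ['example.com']
```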
Ad blocker & popup closer
from ghost_bypass import AdBlocker, PopupCloser
# Playwright
blocker = AdBlocker(max_iterations=5)
blocker.handle_playwright(page, original_url)
# Selenium — blocking mode
closer = PopupCloser()
closer.close_all(driver, original_url)
# Selenium — background thread + JS interval monitor
import threading
lock = threading.Lock()
closer.start_monitoring(driver, lock, interval=2.0)
# ... do your scraping ...
closer.stop_monitoring()
Human behavior simulation
HumanBehavior is applied automatically in headful browser levels (L4, L6, L8)
when using Selenium/UC. It provides Bézier-curve mouse movements, momentum scrolling,
and realistic typing to avoid bot detection.
You can also use it directly:
from ghost_bypass import HumanBehavior
human = HumanBehavior(min_delay=0.08, max_delay=0.45, movement_speed="medium")
# Use with any Selenium driver
human.human_scroll(driver, direction="down", smooth=True)
human.human_click(driver, element, overshoot=True)
human.type_like_human(element, "search query")
human.page_view_pattern(driver, duration=3.0) # realistic browsing
Architecture
ghost_bypass/
├── engine/
│ ├── engine.py ← BypassEngine (L0–L11 dispatch + ML orchestration)
│ └── site_learner.py ← SiteLearner (per-domain UCB method memory)
├── proxy/
│ └── manager.py ← MLProxyManager (domain-aware UCB proxy rotation)
├── cloudflare/
│ └── handler.py ← CloudflareHandler (detect + wait for CF to resolve)
├── ad_blocker/
│ ├── blocker.py ← AdBlocker (overlay/modal/cookie banner closer)
│ └── popup_closer.py ← PopupCloser (window + JS interval monitor)
└── support/
├── stealth.py ← StealthConfig (anti-bot JS patches)
├── cookies.py ← CookieManager (per-domain persistence, configurable TTL)
└── human.py ← HumanBehavior (Bézier mouse, scroll, typing — auto-applied in headful levels)
Escalation flow
URL requested
│
├─ SiteLearner.get_chain(domain) ─→ UCB-ranked level list
│ (new domain: L0→L11 default order)
│ (known domain: starts at best-known level)
│
└─ For each level in chain:
│
├─ MLProxyManager.get_best_proxy(domain) ─→ best proxy for this site
│ (UCB: blends global + domain-specific stats)
│
├─ Run level method (L0 → L11)
│
├─ CF detected? ──yes──→ inject L8 as next attempt immediately
│
├─ Proxy failed? ─────→ try next-best proxy for same level
│
├─ Level failed? ─────→ escalate to next level
│
└─ Success? ──────────→ record stats, return rich result dict
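The control flow in the diagram can be condensed into a short loop. The sketch below is a simplification under stated assumptions: `run_chain` and `fake_level` are hypothetical, the per-level proxy retry is omitted, and each level is represented by a callable that returns a dict with `success` and `cf_detected` keys.

```python
def run_chain(chain, run_level, cf_level="L8_uc_headful"):
    """Try levels in order; on a Cloudflare signal, try cf_level next."""
    queue = list(chain)
    tried = set()
    attempts = []
    while queue:
        level = queue.pop(0)
        if level in tried:
            continue
        tried.add(level)
        outcome = run_level(level)
        attempts.append((level, outcome["success"]))
        if outcome["success"]:
            return level, attempts
        if outcome.get("cf_detected") and cf_level not in tried:
            # Promote the CF-capable level to the front of the queue.
            if cf_level in queue:
                queue.remove(cf_level)
            queue.insert(0, cf_level)
    return None, attempts


def fake_level(level):
    # Stub standing in for a real request: only the headful UC level succeeds.
    return {"success": level == "L8_uc_headful", "cf_detected": True}


winner, attempts = run_chain(
    ["L0_requests_basic", "L1_requests_tls", "L8_uc_headful"], fake_level
)
print(winner, len(attempts))  # L8_uc_headful 2
```

In this run L0 fails with a Cloudflare signal, so L8 is attempted next and L1 is skipped entirely, which is the "CF jump" shown in the flow above.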
Contributing
See CONTRIBUTING.md. PRs welcome — especially new bypass levels!
License
MIT — see LICENSE.