web4agent
Free, open-source, async-first web scraping toolkit for LLM agents.
No commercial APIs. No rate-limit bills. Self-hostable.
Features
| Function | Description |
|---|---|
| `read_url` | Auto-degradation: fast → crawl4ai → browser → wayback → ddg |
| `read_fast` | curl_cffi TLS impersonation + realistic headers; httpx fallback |
| `read_browser` | Stealth headless Chromium (patchright); canvas noise; Playwright fallback |
| `read_crawl4ai` | Crawl4AI LLM-friendly Markdown output |
| `read_wayback` | Wayback Machine archive fallback; no API key needed |
| `read_ddg` | DuckDuckGo snippet fallback; no API key needed |
| `read_many` | Concurrent batch fetch with deduplication |
| `discover_links` | Extract, normalize, and deduplicate hrefs |
| `agent_read_url` | Single-URL fetch returning a slim LLM-ready dict |
| `agent_read_urls` | Batch fetch with summary stats for LLM context |
| `run_doctor` | Diagnose optional deps, upstream connectivity, circuit-breaker state |
| FastAPI server | Optional HTTP API (`/read`, `/read_many`, `/discover_links`) |
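
To make "extract, normalize, and deduplicate hrefs" concrete, here is a minimal standard-library sketch of that kind of pipeline. It is not web4agent's implementation: `extract_links` and `_HrefCollector` are hypothetical names, and the normalization rules (resolve against the base URL, drop fragments, keep only http/https) are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collects raw href values from <a> tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str, same_domain: bool = False) -> list[str]:
    """Return absolute, fragment-free, de-duplicated http(s) links in page order."""
    collector = _HrefCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc.lower()
    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        # Normalize: resolve relative hrefs, then strip the #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if same_domain and parsed.netloc.lower() != base_host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Dropping fragments before the duplicate check means `/a` and `/a#top` count as one link, which is usually what a crawler wants.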
Installation
Minimal (httpx + trafilatura, covers most use cases):

```bash
pip install web4agent
```

With optional extras:

```bash
# TLS impersonation + realistic headers (bypass most bot-detection without a browser)
pip install "web4agent[stealth]"

# Headless browser with stealth context (JS-heavy pages, Cloudflare-protected sites)
pip install "web4agent[browser]"
patchright install chromium

# Crawl4AI strategy (LLM-optimised Markdown)
pip install "web4agent[crawl4ai]"

# FastAPI server
pip install "web4agent[server]"

# Everything
pip install "web4agent[all]"
patchright install chromium
```

From source (development):

```bash
git clone https://github.com/lipiji/web4agent
cd web4agent
pip install -e ".[dev]"
```
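
After installing, it can be handy to confirm which optional extras are importable in the current environment. The sketch below is a rough stand-alone check, not a replacement for `web4agent doctor`. The mapping from extra name to import name (`curl_cffi`, `patchright`, `crawl4ai`, `fastapi`) is an assumption inferred from the descriptions above, and `available_extras` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping of extra name -> top-level module it provides
OPTIONAL_MODULES = {
    "stealth": "curl_cffi",
    "browser": "patchright",
    "crawl4ai": "crawl4ai",
    "server": "fastapi",
}


def available_extras(modules: dict[str, str] = OPTIONAL_MODULES) -> dict[str, bool]:
    """Report, per extra, whether its module can be found without importing it."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra:10s} {'installed' if ok else 'missing'}")
```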
Quick Start
CLI
```bash
# Fetch a single page
web4agent read https://en.wikipedia.org/wiki/Web_scraping

# Batch fetch
web4agent many https://example.com https://python.org --concurrency 5

# Extract links
web4agent links https://docs.python.org/3/ --same-domain --max-links 30

# Check optional deps, upstream connectivity, and circuit-breaker state
web4agent doctor
```
Python
```python
import asyncio
from web4agent import read_url, read_many, discover_links


async def main():
    # Single URL: auto strategy (fast → crawl4ai → browser)
    result = await read_url("https://en.wikipedia.org/wiki/Web_scraping")
    print(result.title)
    print(result.text[:500])

    # Batch
    results = await read_many(
        ["https://example.com", "https://python.org"],
        concurrency=5,
        strategy="fast",
    )
    for r in results:
        print(r.url, "OK" if r.success else r.error)

    # Links
    links = await discover_links("https://docs.python.org/3/", same_domain=True)
    print(links[:5])


asyncio.run(main())
```
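
For readers curious what a `concurrency` limit combined with deduplication looks like in plain asyncio, here is a minimal sketch. It is an illustration only: `fetch_many` and the injected `fetch` coroutine are hypothetical and say nothing about how `read_many` is actually implemented.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_many(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
) -> dict[str, str]:
    """Fetch each distinct URL once, with at most `concurrency` fetches in flight."""
    unique = list(dict.fromkeys(urls))  # de-duplicate, keeping first-seen order
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> tuple[str, str]:
        async with semaphore:  # blocks while `concurrency` fetches are running
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded(u) for u in unique))
    return dict(pairs)
```

Passing the fetcher in as a parameter keeps the sketch testable without network access; any coroutine that takes a URL and returns text will do.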
Proxy rotation
```python
import asyncio
from web4agent import agent_read_urls


async def main():
    proxies = [
        "http://user:pass@proxy1:8080",
        "socks5://proxy2:1080",
    ]
    summary = await agent_read_urls(
        ["https://example.com", "https://python.org"],
        proxies=proxies,
        proxy_mode="round_robin",  # or "random"
    )
    print(summary)


asyncio.run(main())
```
Or manage rotation manually:
```python
import asyncio
from web4agent import ProxyRotator, read_url


async def main():
    rotator = ProxyRotator(["http://p1:8080", "http://p2:8080"])
    proxy = rotator.next()
    result = await read_url("https://example.com", proxy=proxy)
    if result.success:
        rotator.mark_success(proxy)
    else:
        rotator.mark_failed(proxy)
    print(rotator.stats())


asyncio.run(main())
```
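
The difference between the `round_robin` and `random` modes can be illustrated with a small stand-in class. This is a hypothetical sketch, not the library's `ProxyRotator`: `SimpleProxyPool` and its failure-count bookkeeping are assumptions, and it only borrows the method names used above.

```python
import random


class SimpleProxyPool:
    """Hypothetical stand-in for illustration; not web4agent's ProxyRotator."""

    def __init__(self, proxies: list[str], mode: str = "round_robin") -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        if mode not in ("round_robin", "random"):
            raise ValueError(f"unknown mode: {mode}")
        self._proxies = list(proxies)
        self._mode = mode
        self._index = 0
        self._failures = {p: 0 for p in self._proxies}

    def next(self) -> str:
        if self._mode == "random":
            return random.choice(self._proxies)
        # round_robin: walk the list in order and wrap around
        proxy = self._proxies[self._index % len(self._proxies)]
        self._index += 1
        return proxy

    def mark_failed(self, proxy: str) -> None:
        self._failures[proxy] += 1

    def mark_success(self, proxy: str) -> None:
        self._failures[proxy] = 0  # a success clears the failure streak

    def stats(self) -> dict[str, int]:
        """Consecutive failure count per proxy."""
        return dict(self._failures)
```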
Agent interface (slim dicts for LLM context)
```python
import asyncio
from web4agent import agent_read_url, agent_read_urls


async def main():
    # Single: returns {"url", "title", "content", "success", "strategy_used", "error"}
    r = await agent_read_url("https://example.com")
    print(r["title"])
    print(r["content"][:300])

    # Batch: returns {"results", "total", "succeeded", "failed"}
    summary = await agent_read_urls(
        ["https://example.com", "https://python.org"],
        concurrency=5,
    )
    print(f"Fetched {summary['succeeded']}/{summary['total']}")
    for item in summary["results"]:
        print(item["url"], item["success"])


asyncio.run(main())
```
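
One way to turn the batch summary into text for an LLM prompt is sketched below. `summary_to_context` is a hypothetical helper, and it assumes each entry in `results` carries the same keys as the single-URL dict (`url`, `title`, `content`, `success`, `error`).

```python
def summary_to_context(summary: dict, max_chars_per_page: int = 500) -> str:
    """Render a batch summary as a compact, prompt-ready text block."""
    lines = [
        f"Fetched {summary['succeeded']}/{summary['total']} pages "
        f"({summary['failed']} failed)."
    ]
    for item in summary["results"]:
        if item["success"]:
            title = item.get("title") or "(untitled)"
            # Trim each page so one long document cannot crowd out the rest
            body = (item.get("content") or "")[:max_chars_per_page]
            lines.append(f"## {title}\nSource: {item['url']}\n{body}")
        else:
            lines.append(f"## FAILED {item['url']}\nError: {item.get('error')}")
    return "\n\n".join(lines)
```

Keeping failed URLs in the output, rather than silently dropping them, lets the model know which sources it did not see.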
Full working examples: `examples/example.py`
Strategies
| Strategy | How it works | Best for |
|---|---|---|
| `fast` | httpx + trafilatura (+ BS4 fallback) | Static pages, high concurrency |
| `crawl4ai` | Crawl4AI `AsyncWebCrawler` | Docs, structured Markdown output |
| `browser` | Playwright headless Chromium | JS-heavy SPAs, lazy-loaded content |
| `auto` | Degrades: fast → crawl4ai → browser | Unknown pages |
Auto-degradation triggers when:
- HTTP status ≥ 400
- Extracted text is shorter than `MIN_TEXT_LENGTH` (default 300 chars)
- Page looks like a JS-only shell (empty `#root`/`#app` div)
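
The three triggers above can be sketched as a single predicate plus a loop over tiers. This is a simplified, synchronous illustration rather than the library's code: `needs_fallback`, `first_acceptable`, the shell-detection regex, and the choice to treat a missing status code as a failure are all assumptions.

```python
import re

MIN_TEXT_LENGTH = 300  # mirrors the documented default

# Assumed heuristic: an empty <div id="root"> or <div id="app"> mount point
_EMPTY_SHELL = re.compile(
    r'<div[^>]*\bid=["\'](?:root|app)["\'][^>]*>\s*</div>', re.IGNORECASE
)


def needs_fallback(status_code, text, html, min_text_length=MIN_TEXT_LENGTH) -> bool:
    """Return True if the result should be retried with the next strategy."""
    if status_code is None or status_code >= 400:  # None treated as failure (assumption)
        return True
    if len(text or "") < min_text_length:
        return True
    return bool(_EMPTY_SHELL.search(html or ""))


def first_acceptable(fetchers, url):
    """Try (name, fetch) pairs in order; fetch(url) returns (status, text, html)."""
    attempts = []
    for name, fetch in fetchers:
        status, text, html = fetch(url)
        attempts.append(name)
        if not needs_fallback(status, text, html):
            return name, text, attempts
    return None, None, attempts
```

With stub fetchers, a page that first comes back as an empty shell and then renders fully in a browser tier would report `["fast", "browser"]` as its attempts.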
Result shape
All read functions return a WebReadResult:
```python
class WebReadResult(BaseModel):
    url: str
    final_url: str | None        # after redirects
    title: str | None
    text: str | None             # plain text
    markdown: str | None         # Markdown version
    html: str | None             # raw HTML
    status_code: int | None
    success: bool
    strategy_used: str | None
    attempts: list[FetchAttempt]
    error: str | None
    fetched_at: str              # ISO-8601 UTC
    elapsed_ms: int | None
    metadata: dict               # e.g. screenshot_b64 for browser reads
```
Configuration
Set via environment variables (or a .env file):
| Variable | Default | Description |
|---|---|---|
| `WRT_TIMEOUT` | `20` | HTTP timeout in seconds |
| `WRT_FAST_CONCURRENCY` | `50` | Max concurrent fast requests |
| `WRT_CRAWL4AI_CONCURRENCY` | `10` | Max concurrent crawl4ai requests |
| `WRT_BROWSER_CONCURRENCY` | `3` | Max simultaneous Playwright pages |
| `WRT_MIN_TEXT_LENGTH` | `300` | Min chars to consider a fetch successful |
| `WRT_AGENT_MAX_CONTENT_CHARS` | `8000` | Content truncation limit for agent output |
| `WRT_USER_AGENT` | Chrome 124 | User-Agent header string |
| `WRT_HEALTH_FAILURE_THRESHOLD` | `3` | Consecutive failures before a fallback tier is circuit-broken |
| `WRT_HEALTH_COOLDOWN_SECONDS` | `60` | Cooldown before a circuit-broken tier is retried |
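
To show how the two `WRT_HEALTH_*` settings interact, here is a minimal circuit-breaker sketch that reads them from the environment with the documented defaults. `TierBreaker` and its method names are hypothetical, and the reset-after-cooldown behaviour is an assumption about what "retried" means, not a description of the library's internals.

```python
import os
import time


class TierBreaker:
    """Hypothetical circuit breaker for one fallback tier."""

    def __init__(self, failure_threshold=None, cooldown_seconds=None, clock=time.monotonic):
        # Explicit arguments win; otherwise fall back to env vars, then defaults
        self.failure_threshold = int(
            failure_threshold if failure_threshold is not None
            else os.environ.get("WRT_HEALTH_FAILURE_THRESHOLD", "3")
        )
        self.cooldown_seconds = float(
            cooldown_seconds if cooldown_seconds is not None
            else os.environ.get("WRT_HEALTH_COOLDOWN_SECONDS", "60")
        )
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None  # time the breaker tripped, or None if closed

    def allow(self) -> bool:
        """Return True if the tier may be tried right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and permit a retry
            self._opened_at = None
            self._consecutive_failures = 0
            return True
        return False

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

The injectable `clock` exists only so the cooldown can be exercised in tests without sleeping.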
FastAPI Server
```bash
pip install "web4agent[server]"
uvicorn web4agent.server:app --host 0.0.0.0 --port 8000
```
| Method | Path | Body |
|---|---|---|
| GET | `/health` | (none) |
| POST | `/read` | `{"url": "...", "strategy": "auto"}` |
| POST | `/read_many` | `{"urls": [...], "concurrency": 10, "strategy": "auto"}` |
| POST | `/discover_links` | `{"url": "...", "same_domain": true, "max_links": 100}` |
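
A client for the `/read` endpoint can be written with the standard library alone. The sketch below builds the documented request body. `build_read_request` and `read_via_server` are hypothetical helpers, and the assumption that the server replies with a JSON object is not confirmed by the table above.

```python
import json
import urllib.request


def build_read_request(base_url: str, url: str, strategy: str = "auto") -> urllib.request.Request:
    """Build a POST /read request carrying the documented JSON body."""
    payload = json.dumps({"url": url, "strategy": strategy}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/read",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def read_via_server(base_url: str, url: str, strategy: str = "auto", timeout: float = 30.0) -> dict:
    """Send the request and decode the reply (assumed to be a JSON object)."""
    request = build_read_request(base_url, url, strategy)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from the `uvicorn` command running locally, `read_via_server("http://localhost:8000", "https://example.com")` would issue the call.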
Running Tests
```bash
pip install -e ".[dev]"
pytest
```
Compliance
- `robots.txt`: not enforced automatically; check it yourself before scraping.
- Rate limiting: use the `concurrency` parameter; add delays for the same domain.
- Terms of Service: always review a site's ToS before scraping.
- Intended for lawful, authorized use only.
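
Since `robots.txt` checks and per-domain delays are left to the caller, a small pre-flight helper may be useful. The following is a standard-library sketch under stated assumptions: `is_allowed` and `DomainThrottle` are hypothetical names, the robots.txt text is assumed to have been downloaded already, and the one-second default interval is arbitrary.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class DomainThrottle:
    """Computes how long to wait so requests to one host are spaced apart."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._last: dict[str, float] = {}  # host -> time of its latest scheduled request

    def wait_time(self, url: str) -> float:
        """Return seconds to sleep before requesting `url`, and reserve that slot."""
        host = urlparse(url).netloc.lower()
        now = self._clock()
        ready_at = self._last.get(host, now - self.min_interval) + self.min_interval
        delay = max(0.0, ready_at - now)
        self._last[host] = now + delay
        return delay
```

In async code the returned delay would typically be passed to `asyncio.sleep` just before the fetch for that URL.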
License
MIT