web4agent

Free, open-source, async-first web scraping toolkit for LLM agents.
No commercial APIs. No rate-limit bills. Self-hostable.


Features

Function          Description
read_url          Auto-degradation: fast → crawl4ai → browser → wayback → ddg
read_fast         curl_cffi TLS impersonation + realistic headers; httpx fallback
read_browser      Stealth headless Chromium (patchright); canvas noise; Playwright fallback
read_crawl4ai     Crawl4AI LLM-friendly Markdown output
read_wayback      Wayback Machine archive fallback (no API key needed)
read_ddg          DuckDuckGo snippet fallback (no API key needed)
read_many         Concurrent batch fetch with deduplication
discover_links    Extract, normalize, and deduplicate hrefs
agent_read_url    Single-URL fetch returning a slim LLM-ready dict
agent_read_urls   Batch fetch with summary stats for LLM context
run_doctor        Diagnose optional deps, upstream connectivity, circuit-breaker state
FastAPI server    Optional HTTP API (/read, /read_many, /discover_links)
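
The read_many row describes a concurrent batch fetch with deduplication. The sketch below shows that general pattern with the standard library only. It assumes nothing about web4agent's internals: fetch_all and fake_fetch are hypothetical names, and fake_fetch stands in for a real network call so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
) -> dict[str, str]:
    """Fetch each distinct URL once, with at most `concurrency` in flight."""
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    sem = asyncio.Semaphore(concurrency)

    async def guarded(url: str) -> tuple[str, str]:
        async with sem:  # blocks while `concurrency` fetches are running
            return url, await fetch(url)

    pairs = await asyncio.gather(*(guarded(u) for u in unique))
    return dict(pairs)


async def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a network request.
    await asyncio.sleep(0)
    return f"body of {url}"


results = asyncio.run(
    fetch_all(
        ["https://example.com", "https://python.org", "https://example.com"],
        fake_fetch,
        concurrency=2,
    )
)
print(list(results))
```

The semaphore bounds the number of in-flight requests, while dict.fromkeys removes repeated URLs before any work is scheduled.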

Installation

Minimal (httpx + trafilatura, covers most use cases):

pip install web4agent

With optional extras:

# TLS impersonation + realistic headers (gets past many bot-detection checks without a browser)
pip install "web4agent[stealth]"

# Headless browser with stealth context (JS-heavy pages, Cloudflare-protected sites)
pip install "web4agent[browser]"
patchright install chromium

# Crawl4AI strategy (LLM-optimised Markdown)
pip install "web4agent[crawl4ai]"

# FastAPI server
pip install "web4agent[server]"

# Everything
pip install "web4agent[all]"
patchright install chromium

From source (development):

git clone https://github.com/lipiji/web4agent
cd web4agent
pip install -e ".[dev]"

Quick Start

CLI

# Fetch a single page
web4agent read https://en.wikipedia.org/wiki/Web_scraping

# Batch fetch
web4agent many https://example.com https://python.org --concurrency 5

# Extract links
web4agent links https://docs.python.org/3/ --same-domain --max-links 30

# Check optional deps, upstream connectivity, and circuit-breaker state
web4agent doctor

Python

import asyncio
from web4agent import read_url, read_many, discover_links

async def main():
    # Single URL — auto strategy (fast → crawl4ai → browser)
    result = await read_url("https://en.wikipedia.org/wiki/Web_scraping")
    print(result.title)
    print(result.text[:500])

    # Batch
    results = await read_many(
        ["https://example.com", "https://python.org"],
        concurrency=5,
        strategy="fast",
    )
    for r in results:
        print(r.url, "OK" if r.success else r.error)

    # Links
    links = await discover_links("https://docs.python.org/3/", same_domain=True)
    print(links[:5])

asyncio.run(main())
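
As a rough illustration of what "extract, normalize, and deduplicate hrefs" can mean, here is a standard-library sketch. It is not discover_links itself: extract_links and its parameters are hypothetical, chosen to mirror the same_domain and max_links options shown in this README, and the normalization (resolve relative URLs, drop fragments, keep only http/https) is an assumption about what is useful.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collect raw href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(
    html: str, base_url: str, same_domain: bool = False, max_links: int = 100
) -> list[str]:
    parser = _HrefCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    seen: dict[str, None] = {}  # insertion-ordered set
    for href in parser.hrefs:
        # Resolve relative links, then strip the #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if same_domain and parts.netloc != base_host:
            continue
        seen.setdefault(absolute, None)
    return list(seen)[:max_links]


sample = (
    '<a href="/3/tutorial/">Tutorial</a>'
    '<a href="tutorial/#intro">Intro</a>'
    '<a href="https://pypi.org/">PyPI</a>'
    '<a href="mailto:x@y.z">Mail</a>'
)
links = extract_links(sample, "https://docs.python.org/3/", same_domain=True)
print(links)
```

Both tutorial links resolve to the same absolute URL once the fragment is removed, so only one copy survives.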

Proxy rotation

import asyncio
from web4agent import agent_read_urls

async def main():
    proxies = [
        "http://user:pass@proxy1:8080",
        "socks5://proxy2:1080",
    ]
    summary = await agent_read_urls(
        ["https://example.com", "https://python.org"],
        proxies=proxies,
        proxy_mode="round_robin",  # or "random"
    )
    print(summary)

asyncio.run(main())

Or manage rotation manually:

import asyncio
from web4agent import ProxyRotator, read_url

async def main():
    rotator = ProxyRotator(["http://p1:8080", "http://p2:8080"])
    proxy = rotator.next()
    result = await read_url("https://example.com", proxy=proxy)
    if result.success:
        rotator.mark_success(proxy)
    else:
        rotator.mark_failed(proxy)
    print(rotator.stats())

asyncio.run(main())
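
To make the rotation idea concrete, here is a minimal round-robin rotator with failure tracking. It is a hypothetical sketch, not web4agent's ProxyRotator: the class name SimpleRotator, the max_failures parameter, and the shape of the stats() output are all assumptions. Only the method names mirror the calls shown in this README.

```python
from collections import Counter
from itertools import cycle


class SimpleRotator:
    """Hypothetical round-robin rotator; not web4agent's ProxyRotator."""

    def __init__(self, proxies: list[str], max_failures: int = 2) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._cycle = cycle(self._proxies)
        self._failures: Counter[str] = Counter()
        self._max_failures = max_failures

    def next(self) -> str:
        # Look at each proxy at most once per call, skipping unhealthy ones.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if self._failures[proxy] < self._max_failures:
                return proxy
        raise RuntimeError("all proxies are marked unhealthy")

    def mark_success(self, proxy: str) -> None:
        self._failures[proxy] = 0  # a success resets the failure streak

    def mark_failed(self, proxy: str) -> None:
        self._failures[proxy] += 1

    def stats(self) -> dict[str, int]:
        # Consecutive failure count per proxy.
        return {p: self._failures[p] for p in self._proxies}


rotator = SimpleRotator(["http://p1:8080", "http://p2:8080"])
first = rotator.next()
rotator.mark_failed(first)
rotator.mark_failed(first)  # first proxy reaches the failure threshold
picks = [rotator.next() for _ in range(3)]
print(first, picks, rotator.stats())
```

Once a proxy reaches max_failures it is skipped until a mark_success call resets it; a production rotator would likely also add a time-based recovery.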

Agent interface (slim dicts for LLM context)

import asyncio
from web4agent import agent_read_url, agent_read_urls

async def main():
    # Single — returns {"url", "title", "content", "success", "strategy_used", "error"}
    r = await agent_read_url("https://example.com")
    print(r["title"])
    print(r["content"][:300])

    # Batch — returns {"results", "total", "succeeded", "failed"}
    summary = await agent_read_urls(
        ["https://example.com", "https://python.org"],
        concurrency=5,
    )
    print(f"Fetched {summary['succeeded']}/{summary['total']}")
    for item in summary["results"]:
        print(item["url"], item["success"])

asyncio.run(main())
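
One way to use the batch summary is to fold the successful items into a single prompt-ready string. The helper below is a hypothetical sketch: build_context and max_chars are not part of web4agent, and the sample dict is invented data that only follows the key names documented in the comments above, so it runs without network access.

```python
def build_context(summary: dict, max_chars: int = 2000) -> str:
    """Concatenate successful results into one block within a character budget."""
    sections: list[str] = []
    used = 0
    for item in summary["results"]:
        if not item["success"]:
            continue  # failed fetches carry no usable content
        heading = item["title"] or item["url"]
        section = f"## {heading}\nSource: {item['url']}\n{item['content']}\n"
        if used + len(section) > max_chars:
            break  # stop before exceeding the budget
        sections.append(section)
        used += len(section)
    return "".join(sections)


# Invented sample shaped like the documented agent_read_urls summary.
sample_summary = {
    "results": [
        {
            "url": "https://example.com",
            "title": "Example Domain",
            "content": "This domain is for use in examples.",
            "success": True,
            "strategy_used": "fast",
            "error": None,
        },
        {
            "url": "https://python.org",
            "title": None,
            "content": "",
            "success": False,
            "strategy_used": None,
            "error": "timeout",
        },
    ],
    "total": 2,
    "succeeded": 1,
    "failed": 1,
}

context = build_context(sample_summary)
print(context)
```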

Full working examples: examples/example.py


Strategies

Strategy   How it works                           Best for
fast       httpx + trafilatura (+ BS4 fallback)   Static pages, high concurrency
crawl4ai   Crawl4AI AsyncWebCrawler               Docs, structured Markdown output
browser    Playwright headless Chromium           JS-heavy SPAs, lazy-loaded content
auto       Degrades: fast → crawl4ai → browser    Unknown pages

Auto-degradation triggers when:

  • HTTP status ≥ 400
  • Extracted text is shorter than MIN_TEXT_LENGTH (set via WRT_MIN_TEXT_LENGTH, default 300 chars)
  • Page looks like a JS-only shell (empty #root / #app div)
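
The three triggers above can be sketched as a single predicate. This is an illustrative guess at the logic, not the library's code: should_degrade and the regular expression used to spot an empty #root / #app div are assumptions.

```python
import re

MIN_TEXT_LENGTH = 300  # default quoted in the list above

# An empty <div id="root"></div> or <div id="app"></div>, allowing whitespace.
_EMPTY_SHELL = re.compile(
    r'<div[^>]*\bid=["\'](?:root|app)["\'][^>]*>\s*</div>', re.IGNORECASE
)


def should_degrade(status_code: int, text: str, html: str) -> bool:
    """Return True when the next strategy in the chain should be tried."""
    if status_code >= 400:
        return True  # HTTP error
    if len(text.strip()) < MIN_TEXT_LENGTH:
        return True  # too little extracted text
    return bool(_EMPTY_SHELL.search(html))  # JS-only shell


long_text = "x" * 500
print(should_degrade(404, long_text, ""))
print(should_degrade(200, "short", ""))
print(should_degrade(200, long_text, '<div id="root"></div>'))
print(should_degrade(200, long_text, "<p>hello</p>"))
```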

Result shape

All read functions return a WebReadResult:

class WebReadResult(BaseModel):
    url: str
    final_url: str | None       # after redirects
    title: str | None
    text: str | None            # plain text
    markdown: str | None        # Markdown version
    html: str | None            # raw HTML
    status_code: int | None
    success: bool
    strategy_used: str | None
    attempts: list[FetchAttempt]
    error: str | None
    fetched_at: str             # ISO-8601 UTC
    elapsed_ms: int | None
    metadata: dict              # e.g. screenshot_b64 for browser reads
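
When consuming a result, a common step is to pick one body to pass downstream. The sketch below uses a stand-in dataclass with a subset of the fields listed above so it runs without web4agent installed. ResultStub and best_content are hypothetical names, and the 8000-character cap simply borrows the WRT_AGENT_MAX_CONTENT_CHARS default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultStub:
    """Stand-in carrying a subset of the WebReadResult fields listed above."""

    url: str
    success: bool
    text: Optional[str] = None
    markdown: Optional[str] = None
    error: Optional[str] = None


def best_content(result: ResultStub, limit: int = 8000) -> str:
    """Prefer Markdown, fall back to plain text, and cap the length."""
    if not result.success:
        return f"[fetch failed: {result.error or 'unknown error'}]"
    body = result.markdown or result.text or ""
    return body[:limit]


ok = ResultStub(url="https://example.com", success=True, text="plain", markdown="# md")
failed = ResultStub(url="https://example.com", success=False, error="timeout")
print(best_content(ok))
print(best_content(failed))
```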

Configuration

Set via environment variables (or a .env file):

Variable                       Default      Description
WRT_TIMEOUT                    20           HTTP timeout in seconds
WRT_FAST_CONCURRENCY           50           Max concurrent fast requests
WRT_CRAWL4AI_CONCURRENCY       10           Max concurrent crawl4ai requests
WRT_BROWSER_CONCURRENCY        3            Max simultaneous Playwright pages
WRT_MIN_TEXT_LENGTH            300          Min chars to consider a fetch successful
WRT_AGENT_MAX_CONTENT_CHARS    8000         Content truncation limit for agent output
WRT_USER_AGENT                 Chrome 124   User-Agent header string
WRT_HEALTH_FAILURE_THRESHOLD   3            Consecutive failures before a fallback tier is circuit-broken
WRT_HEALTH_COOLDOWN_SECONDS    60           Cooldown before a circuit-broken tier is retried
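
The last two settings describe a circuit breaker. A minimal sketch of how a failure threshold and a cooldown typically interact is shown below. It is not web4agent's implementation: TierBreaker, allow, and record are hypothetical names, and only the two environment variable names and their defaults come from the table.

```python
import os
import time


class TierBreaker:
    """Hypothetical per-tier circuit breaker; not web4agent's implementation."""

    def __init__(self, threshold: int, cooldown: float, clock=time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock  # injectable so the cooldown can be tested
        self._failures = 0
        self._opened_at = None  # time the breaker tripped, or None if closed

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cooldown, let an attempt through again (half-open state).
        return self._clock() - self._opened_at >= self.cooldown

    def record(self, success: bool) -> None:
        if success:
            self._failures = 0
            self._opened_at = None
            return
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()


# Read the documented variables, falling back to the documented defaults.
breaker = TierBreaker(
    threshold=int(os.environ.get("WRT_HEALTH_FAILURE_THRESHOLD", "3")),
    cooldown=float(os.environ.get("WRT_HEALTH_COOLDOWN_SECONDS", "60")),
)
print(breaker.allow())
```

With the defaults, three consecutive failures would stop a tier from being tried for 60 seconds, after which one success closes the breaker again.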

FastAPI Server

pip install "web4agent[server]"
uvicorn web4agent.server:app --host 0.0.0.0 --port 8000

Method   Path              Body
GET      /health           (none)
POST     /read             {"url": "...", "strategy": "auto"}
POST     /read_many        {"urls": [...], "concurrency": 10, "strategy": "auto"}
POST     /discover_links   {"url": "...", "same_domain": true, "max_links": 100}
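
A client for these endpoints can be written with the standard library alone. The sketch below assumes the server is reachable at http://localhost:8000 and that responses are JSON; the response shape is not documented here, so call() just returns whatever the server sends. build_post and call are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: matches the uvicorn command above


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the documented paths."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(req: urllib.request.Request, timeout: float = 30.0):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


read_req = build_post("/read", {"url": "https://example.com", "strategy": "auto"})
links_req = build_post(
    "/discover_links",
    {"url": "https://docs.python.org/3/", "same_domain": True, "max_links": 100},
)
print(read_req.get_method(), read_req.full_url)
```

With the server running, call(read_req) would send the request; building the requests separately keeps the example runnable offline.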

Running Tests

pip install -e ".[dev]"
pytest

Compliance

  • robots.txt: not enforced automatically; check it yourself before scraping.
  • Rate limiting: cap load with the concurrency parameter, and add delays between requests to the same domain.
  • Terms of Service: always review a site's ToS before scraping.
  • Intended for lawful, authorized use only.
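
Since robots.txt and per-domain delays are left to the caller, here is a standard-library sketch of both checks. The helper names allowed_by_robots and start_offsets are hypothetical, and the robots.txt content is an invented sample. In practice you would first download the site's robots.txt (for example with read_url) and pass its text in.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against already-downloaded robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def start_offsets(urls: list[str], delay: float = 1.0) -> list[tuple[str, float]]:
    """Give each URL a start offset so requests to one host are `delay` seconds apart."""
    next_slot: dict[str, float] = {}
    plan: list[tuple[str, float]] = []
    for url in urls:
        host = urlparse(url).netloc
        offset = next_slot.get(host, 0.0)
        plan.append((url, offset))
        next_slot[host] = offset + delay  # next request to this host waits longer
    return plan


sample_robots = "User-agent: *\nDisallow: /private/\n"  # invented sample
print(allowed_by_robots(sample_robots, "https://example.com/private/page"))
print(allowed_by_robots(sample_robots, "https://example.com/public"))
print(start_offsets(["https://a.com/1", "https://b.com/1", "https://a.com/2"], delay=2.0))
```

The offsets can be fed to asyncio.sleep before each fetch, so that different hosts still proceed in parallel.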

License

MIT
