web4agent

Free, open-source, async-first web scraping toolkit for LLM agents.
No commercial APIs. No rate-limit bills. Self-hostable.


Features

| Function | Description |
| --- | --- |
| read_url | Auto-degradation: fast → crawl4ai → browser → wayback → ddg |
| read_fast | curl_cffi TLS impersonation + realistic headers; httpx fallback |
| read_browser | Stealth headless Chromium (patchright); canvas noise; Playwright fallback |
| read_crawl4ai | Crawl4AI LLM-friendly Markdown output |
| read_wayback | Wayback Machine archive fallback (no API key needed) |
| read_ddg | DuckDuckGo snippet fallback (no API key needed) |
| read_many | Concurrent batch fetch with deduplication |
| discover_links | Extract, normalize, and deduplicate hrefs |
| agent_read_url | Single-URL fetch returning a slim LLM-ready dict |
| agent_read_urls | Batch fetch with summary stats for LLM context |
| FastAPI server | Optional HTTP API (/read, /read_many, /discover_links) |

Installation

Minimal (httpx + trafilatura, covers most use cases):

pip install web4agent

With optional extras:

# TLS impersonation + realistic headers (bypass most bot-detection without a browser)
pip install "web4agent[stealth]"

# Headless browser with stealth context (JS-heavy pages, Cloudflare-protected sites)
pip install "web4agent[browser]"
patchright install chromium

# Crawl4AI strategy (LLM-optimised Markdown)
pip install "web4agent[crawl4ai]"

# FastAPI server
pip install "web4agent[server]"

# Everything
pip install "web4agent[all]"
patchright install chromium

From source (development):

git clone https://github.com/lipiji/web4agent
cd web4agent
pip install -e ".[dev]"

Quick Start

CLI

# Fetch a single page
web4agent read https://en.wikipedia.org/wiki/Web_scraping

# Batch fetch
web4agent many https://example.com https://python.org --concurrency 5

# Extract links
web4agent links https://docs.python.org/3/ --same-domain --max-links 30

Python

import asyncio
from web4agent import read_url, read_many, discover_links

async def main():
    # Single URL — auto strategy (fast → crawl4ai → browser)
    result = await read_url("https://en.wikipedia.org/wiki/Web_scraping")
    print(result.title)
    print(result.text[:500])

    # Batch
    results = await read_many(
        ["https://example.com", "https://python.org"],
        concurrency=5,
        strategy="fast",
    )
    for r in results:
        print(r.url, "OK" if r.success else r.error)

    # Links
    links = await discover_links("https://docs.python.org/3/", same_domain=True)
    print(links[:5])

asyncio.run(main())
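
For intuition about what "extract, normalize, and deduplicate hrefs" can involve, here is a minimal standard-library sketch. It is not web4agent's implementation: `extract_links` and `_HrefCollector` are hypothetical names, and the normalization rules (resolve against the base URL, drop fragments, keep only http/https) are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collects raw href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str, same_domain: bool = False,
                  max_links: int = 100) -> list[str]:
    """Hypothetical helper: resolve, de-fragment, filter, and dedupe links."""
    collector = _HrefCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen: set[str] = set()
    links: list[str] = []
    for href in collector.hrefs:
        # Resolve relative links, then drop the #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if same_domain and parsed.netloc != base_host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
        if len(links) >= max_links:
            break
    return links
```

The result preserves first-seen order, which keeps the output stable when the same page is processed twice.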

Proxy rotation

import asyncio
from web4agent import agent_read_urls

async def main():
    proxies = [
        "http://user:pass@proxy1:8080",
        "socks5://proxy2:1080",
    ]
    summary = await agent_read_urls(
        ["https://example.com", "https://python.org"],
        proxies=proxies,
        proxy_mode="round_robin",  # or "random"
    )
    print(summary)

asyncio.run(main())

Or manage rotation manually:

import asyncio
from web4agent import ProxyRotator, read_url

async def main():
    rotator = ProxyRotator(["http://p1:8080", "http://p2:8080"])
    proxy = rotator.next()
    result = await read_url("https://example.com", proxy=proxy)
    if result.success:
        rotator.mark_success(proxy)
    else:
        rotator.mark_failed(proxy)
    print(rotator.stats())

asyncio.run(main())
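
To make the difference between the "round_robin" and "random" modes concrete, the sketch below shows one way such a rotator could be built. `SimpleRotator` is a hypothetical stand-in written for this illustration, not the library's `ProxyRotator`; the shape of the stats dict and the `seed` parameter are assumptions.

```python
import itertools
import random


class SimpleRotator:
    """Illustrative stand-in, not web4agent's ProxyRotator."""

    def __init__(self, proxies, mode="round_robin", seed=None):
        if not proxies:
            raise ValueError("at least one proxy is required")
        if mode not in ("round_robin", "random"):
            raise ValueError(f"unknown mode: {mode}")
        self._proxies = list(proxies)
        self._mode = mode
        self._cycle = itertools.cycle(self._proxies)
        self._rng = random.Random(seed)
        self._stats = {p: {"success": 0, "failed": 0} for p in self._proxies}

    def next(self):
        # round_robin walks the list in order; random picks uniformly.
        if self._mode == "round_robin":
            return next(self._cycle)
        return self._rng.choice(self._proxies)

    def mark_success(self, proxy):
        self._stats[proxy]["success"] += 1

    def mark_failed(self, proxy):
        self._stats[proxy]["failed"] += 1

    def stats(self):
        # Return a copy so callers cannot mutate the internal counters.
        return {p: dict(counts) for p, counts in self._stats.items()}
```

Round-robin spreads load evenly and is predictable; random selection makes the request pattern harder to fingerprint at the cost of uneven short-term distribution.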

Agent interface (slim dicts for LLM context)

import asyncio
from web4agent import agent_read_url, agent_read_urls

async def main():
    # Single — returns {"url", "title", "content", "success", "strategy_used", "error"}
    r = await agent_read_url("https://example.com")
    print(r["title"])
    print(r["content"][:300])

    # Batch — returns {"results", "total", "succeeded", "failed"}
    summary = await agent_read_urls(
        ["https://example.com", "https://python.org"],
        concurrency=5,
    )
    print(f"Fetched {summary['succeeded']}/{summary['total']}")
    for item in summary["results"]:
        print(item["url"], item["success"])

asyncio.run(main())
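
Once a batch summary is in hand, it usually has to be flattened into text before it goes into a prompt. A minimal sketch of that step follows. `format_for_prompt` and `max_chars_per_page` are hypothetical names, and the code assumes each entry in "results" carries the same keys as the single-URL dict shown above.

```python
def format_for_prompt(summary: dict, max_chars_per_page: int = 2000) -> str:
    """Hypothetical helper: render an agent_read_urls-style summary as text."""
    lines = [f"Fetched {summary['succeeded']}/{summary['total']} pages."]
    for item in summary["results"]:
        if item.get("success"):
            title = item.get("title") or "(untitled)"
            # Trim each page so one long document cannot crowd out the rest.
            content = (item.get("content") or "")[:max_chars_per_page]
            lines.append(f"\n## {title}\nSource: {item['url']}\n{content}")
        else:
            lines.append(f"\n[failed] {item['url']}: {item.get('error')}")
    return "\n".join(lines)
```

Keeping failed URLs in the output, rather than dropping them silently, lets the model know which sources it could not see.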

Full working examples: examples/example.py


Strategies

| Strategy | How it works | Best for |
| --- | --- | --- |
| fast | httpx + trafilatura (+ BS4 fallback) | Static pages, high concurrency |
| crawl4ai | Crawl4AI AsyncWebCrawler | Docs, structured Markdown output |
| browser | Playwright headless Chromium | JS-heavy SPAs, lazy-loaded content |
| auto | Degrades: fast → crawl4ai → browser | Unknown pages |

Auto-degradation triggers when:

  • HTTP status ≥ 400
  • Extracted text is shorter than MIN_TEXT_LENGTH (default 300 chars)
  • Page looks like a JS-only shell (empty #root / #app div)
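
The three triggers above can be sketched as a single check. This is an illustration under stated assumptions, not the library's code: `should_degrade`, the reason strings, the regex used to spot an empty #root / #app div, and the choice to treat a missing status code as a failure are all hypothetical.

```python
import re

MIN_TEXT_LENGTH = 300  # mirrors the documented default

# Matches an empty <div id="root"> or <div id="app"> (whitespace only inside).
_EMPTY_SHELL = re.compile(
    r'<div[^>]*\bid=["\'](?:root|app)["\'][^>]*>\s*</div>', re.IGNORECASE
)


def should_degrade(status_code, text, html, min_text_length=MIN_TEXT_LENGTH):
    """Hypothetical check: return a reason string, or None if the fetch is OK."""
    if status_code is None or status_code >= 400:
        return "http_error"
    # Checked before text length so a JS shell gets the more specific reason.
    if _EMPTY_SHELL.search(html or ""):
        return "js_shell"
    if len((text or "").strip()) < min_text_length:
        return "text_too_short"
    return None
```

A caller would move on to the next strategy in the chain whenever this returns a reason, and stop at the first strategy for which it returns None.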

Result shape

All read functions return a WebReadResult:

class WebReadResult(BaseModel):
    url: str
    final_url: str | None       # after redirects
    title: str | None
    text: str | None            # plain text
    markdown: str | None        # Markdown version
    html: str | None            # raw HTML
    status_code: int | None
    success: bool
    strategy_used: str | None
    attempts: list[FetchAttempt]
    error: str | None
    fetched_at: str             # ISO-8601 UTC
    elapsed_ms: int | None
    metadata: dict              # e.g. screenshot_b64 for browser reads
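
Because `markdown` and `text` may each be None, callers typically need a small fallback when pulling content out of a result. The sketch below shows one option. `best_content` is a hypothetical helper, and the stand-in object only mimics the attribute names listed above so the example runs without the library installed.

```python
from types import SimpleNamespace


def best_content(result, max_chars: int = 8000) -> str:
    """Hypothetical helper: prefer Markdown, fall back to plain text."""
    if not result.success:
        raise ValueError(f"fetch failed for {result.url}: {result.error}")
    body = result.markdown or result.text or ""
    return body[:max_chars]


# Stand-in with the same attribute names as WebReadResult (not the real class).
fake = SimpleNamespace(url="https://example.com", success=True,
                       markdown=None, text="Example Domain", error=None)
print(best_content(fake))
```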

Configuration

Set via environment variables (or a .env file):

| Variable | Default | Description |
| --- | --- | --- |
| WRT_TIMEOUT | 20 | HTTP timeout in seconds |
| WRT_FAST_CONCURRENCY | 50 | Max concurrent fast requests |
| WRT_CRAWL4AI_CONCURRENCY | 10 | Max concurrent crawl4ai requests |
| WRT_BROWSER_CONCURRENCY | 3 | Max simultaneous Playwright pages |
| WRT_MIN_TEXT_LENGTH | 300 | Min chars to consider a fetch successful |
| WRT_AGENT_MAX_CONTENT_CHARS | 8000 | Content truncation limit for agent output |
| WRT_USER_AGENT | Chrome 124 | User-Agent header string |
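
As a rough picture of how these variables map onto typed settings, here is a small sketch. It is not the package's configuration code: `Settings` and `load_settings` are hypothetical names, and WRT_USER_AGENT is left out because the table gives only a description of its default, not the literal string.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the table above.
    timeout: float = 20.0
    fast_concurrency: int = 50
    crawl4ai_concurrency: int = 10
    browser_concurrency: int = 3
    min_text_length: int = 300
    agent_max_content_chars: int = 8000


def load_settings(env=None) -> Settings:
    """Hypothetical loader: read WRT_* variables, falling back to defaults."""
    env = os.environ if env is None else env
    d = Settings()
    return Settings(
        timeout=float(env.get("WRT_TIMEOUT", d.timeout)),
        fast_concurrency=int(env.get("WRT_FAST_CONCURRENCY", d.fast_concurrency)),
        crawl4ai_concurrency=int(
            env.get("WRT_CRAWL4AI_CONCURRENCY", d.crawl4ai_concurrency)
        ),
        browser_concurrency=int(
            env.get("WRT_BROWSER_CONCURRENCY", d.browser_concurrency)
        ),
        min_text_length=int(env.get("WRT_MIN_TEXT_LENGTH", d.min_text_length)),
        agent_max_content_chars=int(
            env.get("WRT_AGENT_MAX_CONTENT_CHARS", d.agent_max_content_chars)
        ),
    )
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real process environment.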

FastAPI Server

pip install "web4agent[server]"
uvicorn web4agent.server:app --host 0.0.0.0 --port 8000

| Method | Path | Body |
| --- | --- | --- |
| GET | /health | (none) |
| POST | /read | {"url": "...", "strategy": "auto"} |
| POST | /read_many | {"urls": [...], "concurrency": 10, "strategy": "auto"} |
| POST | /discover_links | {"url": "...", "same_domain": true, "max_links": 100} |
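
A client for the /read endpoint needs nothing beyond the standard library. The sketch below builds the request body from the table above. `build_read_request` and `read_via_server` are hypothetical helper names, and the code assumes the server answers with a JSON document, which the table does not spell out.

```python
import json
import urllib.request


def build_read_request(base_url: str, url: str, strategy: str = "auto"):
    """Build a POST /read request; base_url is wherever uvicorn is listening."""
    payload = json.dumps({"url": url, "strategy": strategy}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/read",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def read_via_server(base_url: str, url: str, strategy: str = "auto",
                    timeout: float = 30):
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_read_request(base_url, url, strategy)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started as shown above, `read_via_server("http://localhost:8000", "https://example.com")` would send the request to the local instance.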

Running Tests

pip install -e ".[dev]"
pytest

Compliance

  • robots.txt: not enforced automatically; check it yourself before scraping.
  • Rate limiting: cap parallelism with the concurrency parameter, and add delays between requests to the same domain.
  • Terms of Service: always review a site's ToS before scraping.
  • Intended for lawful, authorized use only.
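
Since the toolkit leaves robots.txt checks and per-domain pacing to the caller, the sketch below shows one way to handle both with the standard library. `allowed_by_robots` and `DomainThrottle` are hypothetical helpers, not part of web4agent, and the one-second default interval is an arbitrary assumption.

```python
import asyncio
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class DomainThrottle:
    """Hypothetical helper: enforce a minimum gap between hits to one host."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self._min_interval = min_interval
        self._next_allowed: dict[str, float] = {}

    async def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        now = time.monotonic()
        slot = max(now, self._next_allowed.get(host, now))
        # Reserve the slot before sleeping so concurrent callers queue up.
        self._next_allowed[host] = slot + self._min_interval
        if slot > now:
            await asyncio.sleep(slot - now)
```

A caller would fetch each site's robots.txt once, skip any URL for which `allowed_by_robots` returns False, and `await throttle.wait(url)` before each request so that hits to the same host are spaced out even when overall concurrency is high.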

License

MIT
