Maintainers

lipiji1986

web4agent

Free, open-source, async-first web scraping toolkit for LLM agents.
No commercial APIs. No rate-limit bills. Self-hostable.

Features

| Function | Description |
| --- | --- |
| `read_url` | Auto-degradation: fast → crawl4ai → browser → wayback → ddg |
| `read_fast` | curl_cffi TLS impersonation + realistic headers; httpx fallback |
| `read_browser` | Stealth headless Chromium (patchright); canvas noise; Playwright fallback |
| `read_crawl4ai` | Crawl4AI LLM-friendly Markdown output |
| `read_wayback` | Wayback Machine archive fallback; no API key needed |
| `read_ddg` | DuckDuckGo snippet fallback; no API key needed |
| `read_many` | Concurrent batch fetch with deduplication |
| `discover_links` | Extract, normalize, and deduplicate hrefs |
| `agent_read_url` | Single-URL fetch returning a slim LLM-ready dict |
| `agent_read_urls` | Batch fetch with summary stats for LLM context |
| `run_doctor` | Diagnose optional deps, upstream connectivity, circuit-breaker state |
| FastAPI server | Optional HTTP API (`/read`, `/read_many`, `/discover_links`) |
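
To make the `read_many` row concrete, here is a minimal offline sketch of "concurrent batch fetch with deduplication". It is not the library's implementation: `fetch_many` and `fake_fetch` are hypothetical names, and the semaphore-based cap is only one plausible way to honour a `concurrency` setting.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_many(
    urls: list[str],
    fetch_one: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
) -> dict[str, str]:
    """Fetch each distinct URL once, with at most `concurrency` in flight."""
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url: str) -> tuple[str, str]:
        async with semaphore:  # cap the number of simultaneous fetches
            return url, await fetch_one(url)

    pairs = await asyncio.gather(*(guarded(u) for u in unique))
    return dict(pairs)


async def fake_fetch(url: str) -> str:
    """Stand-in for a real network call so the sketch runs offline."""
    await asyncio.sleep(0)
    return f"body of {url}"


demo = asyncio.run(
    fetch_many(
        ["https://example.com", "https://python.org", "https://example.com"],
        fake_fetch,
    )
)
print(list(demo))
```

The duplicate `https://example.com` entry is fetched only once, and the result keeps the order in which URLs first appeared.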

Installation

Minimal (httpx + trafilatura, covers most use cases):

pip install web4agent

With optional extras:

# TLS impersonation + realistic headers (bypass most bot-detection without a browser)
pip install "web4agent[stealth]"

# Headless browser with stealth context (JS-heavy pages, Cloudflare-protected sites)
pip install "web4agent[browser]"
patchright install chromium

# Crawl4AI strategy (LLM-optimised Markdown)
pip install "web4agent[crawl4ai]"

# FastAPI server
pip install "web4agent[server]"

# Everything
pip install "web4agent[all]"
patchright install chromium

From source (development):

git clone https://github.com/lipiji/web4agent
cd web4agent
pip install -e ".[dev]"

Quick Start

CLI

# Fetch a single page
web4agent read https://en.wikipedia.org/wiki/Web_scraping

# Batch fetch
web4agent many https://example.com https://python.org --concurrency 5

# Extract links
web4agent links https://docs.python.org/3/ --same-domain --max-links 30

# Check optional deps, upstream connectivity, and circuit-breaker state
web4agent doctor

Python

import asyncio
from web4agent import read_url, read_many, discover_links

async def main():
    # Single URL — auto strategy (fast → crawl4ai → browser)
    result = await read_url("https://en.wikipedia.org/wiki/Web_scraping")
    print(result.title)
    print(result.text[:500])

    # Batch
    results = await read_many(
        ["https://example.com", "https://python.org"],
        concurrency=5,
        strategy="fast",
    )
    for r in results:
        print(r.url, "OK" if r.success else r.error)

    # Links
    links = await discover_links("https://docs.python.org/3/", same_domain=True)
    print(links[:5])

asyncio.run(main())
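
The link-discovery step (extract, normalize, deduplicate) can be illustrated with the standard library alone. This is a sketch under assumptions, not web4agent's code: `extract_links` is a hypothetical helper, and dropping URL fragments and non-HTTP schemes are normalization choices made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collect raw href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(
    html: str,
    base_url: str,
    same_domain: bool = False,
    max_links: int = 100,
) -> list[str]:
    """Return absolute, fragment-free, de-duplicated http(s) links."""
    collector = _HrefCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen: dict[str, None] = {}  # dict keeps insertion order
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if same_domain and parsed.netloc != base_host:
            continue
        seen.setdefault(absolute)
    return list(seen)[:max_links]


SAMPLE_HTML = (
    '<a href="/a">A</a><a href="/a#top">A again</a>'
    '<a href="https://other.org/x">X</a><a href="mailto:me@example.com">mail</a>'
)
print(extract_links(SAMPLE_HTML, "https://example.com/docs/", same_domain=True))
```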

Proxy rotation

import asyncio
from web4agent import agent_read_urls

async def main():
    proxies = [
        "http://user:pass@proxy1:8080",
        "socks5://proxy2:1080",
    ]
    summary = await agent_read_urls(
        ["https://example.com", "https://python.org"],
        proxies=proxies,
        proxy_mode="round_robin",  # or "random"
    )
    print(summary)

asyncio.run(main())

Or manage rotation manually:

import asyncio
from web4agent import ProxyRotator, read_url

async def main():
    rotator = ProxyRotator(["http://p1:8080", "http://p2:8080"])
    proxy = rotator.next()
    result = await read_url("https://example.com", proxy=proxy)
    if result.success:
        rotator.mark_success(proxy)
    else:
        rotator.mark_failed(proxy)
    print(rotator.stats())

asyncio.run(main())
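
For intuition about the two rotation modes, the sketch below shows one way a rotator could pick proxies and skip ones that keep failing. `SketchRotator` is a hypothetical stand-in, not the library's `ProxyRotator`; in particular the `max_failures` cutoff is an assumption made for this example.

```python
import itertools
import random


class SketchRotator:
    """Hypothetical rotator: round-robin or random choice with failure counts."""

    def __init__(
        self,
        proxies: list[str],
        mode: str = "round_robin",
        max_failures: int = 3,  # assumed cutoff, not taken from web4agent
    ) -> None:
        if mode not in ("round_robin", "random"):
            raise ValueError(f"unknown mode: {mode}")
        self.proxies = list(proxies)
        self.mode = mode
        self.max_failures = max_failures
        self.failures = {p: 0 for p in self.proxies}
        self._cycle = itertools.cycle(self.proxies)

    def _healthy(self) -> list[str]:
        return [p for p in self.proxies if self.failures[p] < self.max_failures]

    def next(self) -> str:
        healthy = self._healthy()
        if not healthy:
            raise RuntimeError("no healthy proxies left")
        if self.mode == "random":
            return random.choice(healthy)
        # Round-robin: advance the cycle until it lands on a healthy proxy.
        while True:
            candidate = next(self._cycle)
            if candidate in healthy:
                return candidate

    def mark_success(self, proxy: str) -> None:
        self.failures[proxy] = 0  # a success clears the failure streak

    def mark_failed(self, proxy: str) -> None:
        self.failures[proxy] += 1


rotator = SketchRotator(["http://p1:8080", "http://p2:8080"], max_failures=1)
print(rotator.next(), rotator.next())
```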

Agent interface (slim dicts for LLM context)

import asyncio
from web4agent import agent_read_url, agent_read_urls

async def main():
    # Single — returns {"url", "title", "content", "success", "strategy_used", "error"}
    r = await agent_read_url("https://example.com")
    print(r["title"])
    print(r["content"][:300])

    # Batch — returns {"results", "total", "succeeded", "failed"}
    summary = await agent_read_urls(
        ["https://example.com", "https://python.org"],
        concurrency=5,
    )
    print(f"Fetched {summary['succeeded']}/{summary['total']}")
    for item in summary["results"]:
        print(item["url"], item["success"])

asyncio.run(main())
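
The "slim dict" idea is to pass the model only the fields listed above, with long content truncated. A rough sketch follows, assuming a full result is available as a plain dict; `to_slim_dict` is a hypothetical helper, and preferring `markdown` over `text` is an assumption rather than documented behaviour. The 8000-character default mirrors `WRT_AGENT_MAX_CONTENT_CHARS`.

```python
def to_slim_dict(result: dict, max_chars: int = 8000) -> dict:
    """Keep only the fields an LLM needs and cap the content length."""
    # Assumption: prefer Markdown when present, otherwise fall back to plain text.
    content = result.get("markdown") or result.get("text") or ""
    return {
        "url": result.get("url"),
        "title": result.get("title"),
        "content": content[:max_chars],
        "success": bool(result.get("success")),
        "strategy_used": result.get("strategy_used"),
        "error": result.get("error"),
    }


full = {
    "url": "https://example.com",
    "title": "Example Domain",
    "text": "x" * 20000,
    "markdown": None,
    "html": "<html>...</html>",
    "success": True,
    "strategy_used": "fast",
    "error": None,
}
slim = to_slim_dict(full)
print(len(slim["content"]), sorted(slim))
```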

Full working examples: examples/example.py

Strategies

| Strategy | How it works | Best for |
| --- | --- | --- |
| `fast` | httpx + trafilatura (+ BS4 fallback) | Static pages, high concurrency |
| `crawl4ai` | Crawl4AI `AsyncWebCrawler` | Docs, structured Markdown output |
| `browser` | Playwright headless Chromium | JS-heavy SPAs, lazy-loaded content |
| `auto` | Degrades: fast → crawl4ai → browser | Unknown pages |

Auto-degradation triggers when:

- HTTP status ≥ 400
- Extracted text is shorter than MIN_TEXT_LENGTH (default 300 chars)
- Page looks like a JS-only shell (empty #root / #app div)
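
The three triggers above can be sketched as a single predicate. This is an illustration only: `should_degrade` is a hypothetical name, and the regular expression is a simplified guess at what "empty #root / #app div" detection might look like.

```python
import re

MIN_TEXT_LENGTH = 300  # mirrors the documented default

# Matches an empty <div id="root"></div> or <div id="app"></div> mount point.
_EMPTY_MOUNT = re.compile(
    r'<div[^>]*\bid=["\'](?:root|app)["\'][^>]*>\s*</div>',
    re.IGNORECASE,
)


def should_degrade(
    status_code: int,
    text: str,
    html: str,
    min_text_length: int = MIN_TEXT_LENGTH,
) -> bool:
    """Return True when the next strategy in the chain should be tried."""
    if status_code >= 400:
        return True  # HTTP error
    if len(text.strip()) < min_text_length:
        return True  # too little extracted text
    return bool(_EMPTY_MOUNT.search(html))  # JS-only shell


print(should_degrade(200, "x" * 500, "<p>plenty of content</p>"))
print(should_degrade(200, "x" * 500, '<div id="root"></div>'))
```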

Result shape

All read functions return a WebReadResult:

class WebReadResult(BaseModel):
    url: str
    final_url: str | None       # after redirects
    title: str | None
    text: str | None            # plain text
    markdown: str | None        # Markdown version
    html: str | None            # raw HTML
    status_code: int | None
    success: bool
    strategy_used: str | None
    attempts: list[FetchAttempt]
    error: str | None
    fetched_at: str             # ISO-8601 UTC
    elapsed_ms: int | None
    metadata: dict              # e.g. screenshot_b64 for browser reads

Configuration

Set via environment variables (or a .env file):

| Variable | Default | Description |
| --- | --- | --- |
| `WRT_TIMEOUT` | `20` | HTTP timeout in seconds |
| `WRT_FAST_CONCURRENCY` | `50` | Max concurrent fast requests |
| `WRT_CRAWL4AI_CONCURRENCY` | `10` | Max concurrent crawl4ai requests |
| `WRT_BROWSER_CONCURRENCY` | `3` | Max simultaneous Playwright pages |
| `WRT_MIN_TEXT_LENGTH` | `300` | Min chars to consider a fetch successful |
| `WRT_AGENT_MAX_CONTENT_CHARS` | `8000` | Content truncation limit for agent output |
| `WRT_USER_AGENT` | Chrome 124 | User-Agent header string |
| `WRT_HEALTH_FAILURE_THRESHOLD` | `3` | Consecutive failures before a fallback tier is circuit-broken |
| `WRT_HEALTH_COOLDOWN_SECONDS` | `60` | Cooldown before a circuit-broken tier is retried |
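
The two `WRT_HEALTH_*` settings describe a circuit breaker: after a run of consecutive failures a tier is skipped until a cooldown has passed. The sketch below shows that pattern in isolation. `TierBreaker` and its methods are hypothetical names, and the injectable `clock` exists only so the example can be exercised without waiting; none of this is taken from the library's source.

```python
import time
from typing import Callable, Optional


class TierBreaker:
    """Skip a tier after repeated failures, then retry it after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,     # cf. WRT_HEALTH_FAILURE_THRESHOLD
        cooldown_seconds: float = 60.0, # cf. WRT_HEALTH_COOLDOWN_SECONDS
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if the tier may be tried right now."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)start the cooldown


now = [0.0]  # fake clock so the demo does not sleep
breaker = TierBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
print(breaker.allow())
now[0] = 60.0
print(breaker.allow())
```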

FastAPI Server

pip install "web4agent[server]"
uvicorn web4agent.server:app --host 0.0.0.0 --port 8000

| Method | Path | Body |
| --- | --- | --- |
| `GET` | `/health` | (none) |
| `POST` | `/read` | `{"url": "...", "strategy": "auto"}` |
| `POST` | `/read_many` | `{"urls": [...], "concurrency": 10, "strategy": "auto"}` |
| `POST` | `/discover_links` | `{"url": "...", "same_domain": true, "max_links": 100}` |
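
As a client-side illustration, the snippet below builds a `POST /read` request with the body shape from the table, using only the standard library. `build_read_request` is a hypothetical helper, and the host and port simply match the `uvicorn` command above. The response format is not documented here, so the example stops short of parsing it.

```python
import json
import urllib.request


def build_read_request(
    base_url: str,
    url: str,
    strategy: str = "auto",
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /read endpoint."""
    payload = json.dumps({"url": url, "strategy": strategy}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/read",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_read_request("http://localhost:8000", "https://example.com")
print(request.get_method(), request.full_url)

# With the server running, the request could be sent like this:
#     with urllib.request.urlopen(request, timeout=30) as response:
#         result = json.load(response)
```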

Running Tests

pip install -e ".[dev]"
pytest

Compliance

- robots.txt: not enforced automatically; check it yourself before scraping.
- Rate limiting: use the concurrency parameter; add delays for the same domain.
- Terms of Service: always review a site's ToS before scraping.
- Intended for lawful, authorized use only.
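
Since robots.txt is not enforced automatically, a check can be done up front with Python's standard `urllib.robotparser`. The sketch below parses an inline sample so it runs offline; `is_allowed` and the `my-agent` user agent are illustrative names. In real use, `RobotFileParser.set_url()` followed by `read()` fetches the site's actual file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-agent", "https://example.com/docs/"))
print(is_allowed(SAMPLE_ROBOTS, "my-agent", "https://example.com/private/x"))
```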

License

MIT

Release history

0.2.0 (this version): Jun 21, 2026
0.1.0: Jun 16, 2026

Download files

Source Distribution

web4agent-0.2.0.tar.gz (60.6 kB)

Uploaded Jun 21, 2026 Source

Built Distribution

web4agent-0.2.0-py3-none-any.whl (36.5 kB)

Uploaded Jun 21, 2026 Python 3

File details

Details for the file web4agent-0.2.0.tar.gz.

File metadata

Download URL: web4agent-0.2.0.tar.gz
Upload date: Jun 21, 2026
Size: 60.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for web4agent-0.2.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `d25ffe8ee8959f84f6b01b7c37c9eb88d7d4fdb6ca7cc39497e92e4436344377` |
| MD5 | `a90454b31c5aaba1522fac9e1ce1f68e` |
| BLAKE2b-256 | `ae1555e8b448d207a4232e0ea8774cb4c521054cc9f90c01c12d41ceb27e4d0d` |
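
A downloaded file can be checked against the published SHA256 digest with `hashlib`. The sketch below is one way to do it; `sha256_matches` is a hypothetical helper, and the local file path in the comment is an assumption about where the archive was saved.

```python
import hashlib
import hmac

# SHA256 digest published for web4agent-0.2.0.tar.gz
EXPECTED_SHA256 = "d25ffe8ee8959f84f6b01b7c37c9eb88d7d4fdb6ca7cc39497e92e4436344377"


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    """Compare the SHA256 of `data` with an expected hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, expected_hex.lower())


# Typical use, assuming the archive sits in the current directory:
#     with open("web4agent-0.2.0.tar.gz", "rb") as fh:
#         ok = sha256_matches(fh.read(), EXPECTED_SHA256)
print(sha256_matches(b"not the real archive", EXPECTED_SHA256))
```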

Provenance

The following attestation bundles were made for web4agent-0.2.0.tar.gz:

Publisher: publish.yml on lipiji/web4agent

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: web4agent-0.2.0.tar.gz
- Subject digest: d25ffe8ee8959f84f6b01b7c37c9eb88d7d4fdb6ca7cc39497e92e4436344377
- Sigstore transparency entry: 1892930156
- Sigstore integration time: Jun 21, 2026
Source repository:
- Permalink: lipiji/web4agent@db892e2adeecb573ff0418b45ca456cf6549e94b
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/lipiji
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@db892e2adeecb573ff0418b45ca456cf6549e94b
- Trigger Event: push

File details

Details for the file web4agent-0.2.0-py3-none-any.whl.

File metadata

Download URL: web4agent-0.2.0-py3-none-any.whl
Upload date: Jun 21, 2026
Size: 36.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for web4agent-0.2.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `e70da97937cc900ebd9dd1235557f27651ea5e4f984dc1913790a0b02145c25e` |
| MD5 | `7130c16bf148eba3b17ceebacd4bf4b5` |
| BLAKE2b-256 | `e43a1a0c1a7fc72091149c713990689d0c9b0ff6187f9cd3125df47c78e5cf60` |

Provenance

The following attestation bundles were made for web4agent-0.2.0-py3-none-any.whl:

Publisher: publish.yml on lipiji/web4agent

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: web4agent-0.2.0-py3-none-any.whl
- Subject digest: e70da97937cc900ebd9dd1235557f27651ea5e4f984dc1913790a0b02145c25e
- Sigstore transparency entry: 1892930286
- Sigstore integration time: Jun 21, 2026
Source repository:
- Permalink: lipiji/web4agent@db892e2adeecb573ff0418b45ca456cf6549e94b
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/lipiji
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@db892e2adeecb573ff0418b45ca456cf6549e94b
- Trigger Event: push
