
Agentic web research tool — smarter than search, faster than deep research. Search, scrape, and synthesize web content using LLMs.


web-scout-ai


AI-powered web research in one async call.

pip install web-scout-ai
web-scout-setup

from web_scout import run_web_research

result = await run_web_research("climate risk for agriculture in Kenya")
print(result.synthesis)

What Problem It Solves

Built-in web search tools in frameworks like the OpenAI Agents SDK return snippets — short excerpts from search results that the model has to reason from. They don't read the actual pages.

web-scout-ai goes deeper: it scrapes, converts, and extracts relevant content from real pages — static HTML, JS-rendered sites, PDFs, DOCX/PPTX/XLSX, and JSON endpoints. Legacy Office binaries such as .doc, .xls, and .ppt are detected and skipped explicitly. You also control exactly which sources get scraped, how deep the pipeline goes, and what counts as good enough coverage before synthesis.

No Tavily + crawl4ai + custom glue code. No open-ended agent you cannot control in production.


Two Real Use Cases

1. Climate and policy evidence retrieval

Query institutional sources and get a cited synthesis — not just links.

result = await run_web_research(
    "drought impact on smallholder farmers in sub-Saharan Africa",
    include_domains=["fao.org", "ipcc.ch", "worldbank.org"],
    cache=True,  # reuse successful URL source artifacts for this Python process
)

2. Rapid literature scanning

Point it at a report library or database page. It detects list pages, follows item links, and reads the actual documents.

result = await run_web_research(
    "sustainable land management technologies",
    direct_url="https://wocat.net/en/database/list/?type=technology&country=ke",
)

Quick Start

Install

pip install web-scout-ai
web-scout-setup   # installs Chromium for JS-rendered pages

First run

import asyncio
from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="What are the main threats to coral reefs worldwide?",
        models={"web_researcher": "openai/gpt-4o-mini", "content_extractor": "openai/gpt-4o-mini"},
        search_backend="serper",
        cache=True,
    )
    print(result.synthesis)
    for source in result.scraped:
        print(f"- {source.title or source.url}: {source.url}")

asyncio.run(main())

What You Get Back

class WebResearchResult(BaseModel):
    synthesis: str
    scraped: list[UrlEntry]
    scrape_failed: list[UrlEntry]
    blocked_by_policy: list[UrlEntry]
    source_http_error: list[UrlEntry]
    scraped_irrelevant: list[UrlEntry]
    bot_detected: list[UrlEntry]
    snippet_only: list[UrlEntry]
    queries: list[SearchQuery]

  • synthesis: final grounded answer with inline source citations
  • scraped: URLs successfully read, with extracted relevant content
  • scrape_failed: URLs attempted but could not be scraped
  • blocked_by_policy: URLs skipped because they match the built-in block policy
  • source_http_error: URLs that failed because the source returned HTTP/network errors
  • scraped_irrelevant: URLs that were fetched successfully but did not contain relevant content
  • bot_detected: URLs blocked by bot protection
  • snippet_only: search results kept only as snippets
  • queries: all search queries executed during the run

UrlEntry contains url, title, and content. SearchQuery contains query, num_results_returned, and domains_restricted.
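
As a rough illustration of how these buckets might be consumed, the sketch below counts the URLs in each one. It relies only on the field names listed above. `bucket_counts` and `URL_BUCKETS` are hypothetical names, not part of the package, and the demo uses a stand-in object rather than a real result.

```python
from types import SimpleNamespace

# Bucket names copied from the WebResearchResult fields listed above.
URL_BUCKETS = (
    "scraped",
    "scrape_failed",
    "blocked_by_policy",
    "source_http_error",
    "scraped_irrelevant",
    "bot_detected",
    "snippet_only",
)


def bucket_counts(result) -> dict[str, int]:
    """Return {bucket name: number of URL entries} for a result-like object.

    Hypothetical helper: it only reads the list attributes named above and
    treats a missing attribute as an empty bucket.
    """
    return {name: len(getattr(result, name, [])) for name in URL_BUCKETS}


# Stand-in object for demonstration; a real run would pass the value
# returned by run_web_research(...).
demo = SimpleNamespace(
    scraped=[SimpleNamespace(url="https://example.org/a", title="A", content="...")],
    bot_detected=[SimpleNamespace(url="https://example.org/b", title=None, content="")],
)
counts = bucket_counts(demo)
print(counts)
```

A summary like this can help decide whether a run needs a retry, for example when most URLs land in bot_detected or source_http_error.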


API At A Glance

result = await run_web_research(
    query="latest IPCC findings on sea level rise",
    models={
        "web_researcher": "openai/gpt-4o-mini",
        "content_extractor": "gemini/gemini-2.0-flash",
    },
    search_backend="serper",
    research_depth="standard",           # or "deep"
    include_domains=["ipcc.ch"],         # optional
    direct_url=None,                     # optional
    domain_expertise="climate science",  # optional
    allowed_domains=None,                # optional
    max_pdf_pages=50,                    # optional, default 50
    max_interactive_clicks=5,            # optional, max Playwright clicks per page during extraction
    cache=False,                         # optional, reuse successful source artifacts in this Python process
)

How It Works

See the maintained flow doc: [docs/pipeline-flow.md](docs/pipeline-flow.md)

  1. Generate targeted search queries.
  2. Search the web with Serper.
  3. Triage the best URLs across result sets.
  4. Scrape and extract relevant content in parallel.
  5. After each non-final search iteration, run the coverage evaluator to decide whether the evidence actually answers the question.
  6. If coverage is still weak, either reuse promising backlog URLs or run follow-up searches.
  7. Produce a grounded synthesis with inline citations.
  8. Run a deterministic citation check before returning.
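
The loop structure behind these steps can be sketched roughly as follows. This is a simplified illustration, not the package's internals: the callables `search`, `scrape`, `is_covered`, and `follow_up` are hypothetical stand-ins, the code runs synchronously, and triage, backlog reuse, synthesis, and the citation check are left out.

```python
from typing import Callable, Optional


def research_loop(
    queries: list[str],
    search: Callable[[str], list[str]],           # query -> candidate URLs
    scrape: Callable[[str], Optional[str]],       # URL -> extracted text or None
    is_covered: Callable[[list[str]], bool],      # evidence -> good enough?
    follow_up: Callable[[list[str]], list[str]],  # evidence -> new queries
    max_iterations: int = 2,
) -> list[str]:
    """Collect evidence until coverage is sufficient or iterations run out."""
    evidence: list[str] = []
    seen: set[str] = set()
    for iteration in range(max_iterations):
        for query in queries:
            for url in search(query):
                if url in seen:
                    continue  # never scrape the same URL twice
                seen.add(url)
                text = scrape(url)
                if text:
                    evidence.append(text)
        is_final = iteration == max_iterations - 1
        # Coverage is only evaluated after non-final iterations.
        if is_final or is_covered(evidence):
            break
        queries = follow_up(evidence)
    return evidence


# Demo with stub callables: one first-round query, one follow-up query.
pages = {"q1": ["u1", "u2"], "q2": ["u2", "u3"]}
evidence = research_loop(
    queries=["q1"],
    search=lambda q: pages.get(q, []),
    scrape=lambda u: f"text from {u}",
    is_covered=lambda ev: len(ev) >= 3,
    follow_up=lambda ev: ["q2"],
)
print(evidence)
```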

Research Modes

# 1) Open web research
await run_web_research(query="...", models=models, search_backend="serper")

# 2) Domain-restricted research
await run_web_research(query="...", models=models, include_domains=["iucn.org", "wwf.org"])

# 3) Direct URL extraction (skip search)
await run_web_research(query="...", models=models, direct_url="https://example.org/report.pdf")

# 4) Direct URL list-page deepening
await run_web_research(query="...", models=models, direct_url="https://wocat.net/en/database/list/?type=technology&country=ke")

If the URL is a list, index, or database page, the pipeline detects it, collects relevant item links, follows them, and takes one pagination hop when present.

How URL Outcomes Are Classified

| What happened | Result bucket | Meaning |
| --- | --- | --- |
| Scrape and extraction succeeded | scraped | The URL produced usable extracted content |
| Search result was seen but never scraped | snippet_only | Only the search snippet is kept |
| URL matched a blocked domain policy | blocked_by_policy | Skipped before normal extraction |
| Source returned HTTP/network errors | source_http_error | The source failed, not the package logic |
| Bot protection or anti-automation page detected | bot_detected | The URL was reachable but blocked |
| Page loaded but content was not useful for the query | scraped_irrelevant | Fetch succeeded, relevance failed |
| Extraction failed for other reasons | scrape_failed | Generic scrape or extraction failure |
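
The outcome rows above can be read as a chain of checks. The sketch below is one possible reading, not the package's code: `FetchOutcome` and its fields are hypothetical, and the order of the checks is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FetchOutcome:
    """Hypothetical record of what happened to one URL."""
    attempted: bool = True        # False: seen in search results only
    policy_blocked: bool = False  # matched the block policy
    http_error: bool = False      # source returned an HTTP/network error
    bot_page: bool = False        # anti-automation page detected
    extracted: bool = False       # extraction finished without error
    relevant: bool = False        # extracted content was useful for the query


def classify(outcome: FetchOutcome) -> str:
    """Map an outcome to one of the result bucket names listed above."""
    if outcome.policy_blocked:
        return "blocked_by_policy"
    if not outcome.attempted:
        return "snippet_only"
    if outcome.http_error:
        return "source_http_error"
    if outcome.bot_page:
        return "bot_detected"
    if outcome.extracted and outcome.relevant:
        return "scraped"
    if outcome.extracted:
        return "scraped_irrelevant"
    return "scrape_failed"


print(classify(FetchOutcome(extracted=True, relevant=True)))
```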

Follow-Up Rules

| Situation | What the pipeline does next |
| --- | --- |
| direct_url is a list / index / database page | Extract ranked detail links, allow one next-page hop, then scrape selected follow-ups |
| direct_url is a document | Do not fan out into site chrome or navigation pages |
| Search mode completes a non-final iteration | Run coverage evaluation to decide whether current evidence is sufficient |
| Search mode has weak coverage but promising snippet-only URLs | Scrape backlog URLs before running new searches |
| Search mode has weak coverage and backlog looks weak | Generate follow-up search queries |
| Domain-restricted mode finds a hub page | Deepen within the same domain before broadening search |
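
For the search-mode rows, the decision can be sketched as a small function. This is an illustrative assumption about the ordering of the rules, with hypothetical names (`next_step` and its return labels). It does not cover the direct_url or domain-restricted rows.

```python
def next_step(coverage_ok: bool, promising_backlog: int, is_final_iteration: bool) -> str:
    """Pick the next action after a search iteration (search mode only)."""
    # The final iteration skips coverage evaluation and goes to synthesis.
    if is_final_iteration or coverage_ok:
        return "synthesize"
    # Weak coverage: prefer snippet-only URLs that already look promising.
    if promising_backlog > 0:
        return "scrape_backlog"
    # Weak coverage and nothing promising left: search again.
    return "follow_up_search"


print(next_step(coverage_ok=False, promising_backlog=3, is_final_iteration=False))
```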

Search Backends

await run_web_research(query=..., models=..., search_backend="serper")

  • serper: Google-quality results with rich metadata (date, rank, People Also Ask, Knowledge Graph). Requires SERPER_API_KEY; Serper is generous with free-tier limits.

Additional backends can be added by the community — see SearchBackend in [search_backends.py](src/web_scout/search_backends.py).


Research Depth

# Standard (default): usually up to ~10 sources
await run_web_research(query=..., models=..., research_depth="standard")

# Deep: usually up to ~28 sources
await run_web_research(query=..., models=..., research_depth="deep")

| Parameter | Standard | Deep |
| --- | --- | --- |
| Max iterations | 2 | 3 |
| Search queries (first round) | 3 | 5 |
| Search queries (follow-up) | 2 | 4 |
| URLs scraped (first round) | 6 | 12 |
| URLs scraped (follow-up) | 4 | 8 |
| Hub deepening cap | 10 | 15 |
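
The "~10" and "~28" source figures appear to follow from the parameter values above. The worked arithmetic below assumes one first round plus a follow-up round for each remaining iteration, and does not count hub deepening. `DEPTH_PRESETS` and `max_scraped_sources` are hypothetical names used only for this illustration.

```python
# Values copied from the parameter list above.
DEPTH_PRESETS = {
    "standard": {"max_iterations": 2, "first_round_urls": 6, "follow_up_urls": 4},
    "deep": {"max_iterations": 3, "first_round_urls": 12, "follow_up_urls": 8},
}


def max_scraped_sources(depth: str) -> int:
    """Upper bound on scraped URLs: first round plus every follow-up round."""
    preset = DEPTH_PRESETS[depth]
    follow_up_rounds = preset["max_iterations"] - 1
    return preset["first_round_urls"] + follow_up_rounds * preset["follow_up_urls"]


print(max_scraped_sources("standard"))  # 6 + 1 * 4 = 10
print(max_scraped_sources("deep"))      # 12 + 2 * 8 = 28
```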

Caching

await run_web_research(
    query="climate adaptation finance in Kenya",
    models=models,
    cache=True,
)

When cache=True, web-scout-ai keeps a process-local in-memory cache of successful URL source artifacts:

  • lifetime: the current Python process only
  • scope: reused across multiple run_web_research(...) calls in that same process
  • cleared automatically when Python exits

What is cached:

  • successful query-agnostic page/document source content
  • successful image/scanned-PDF source payloads, which are then reprocessed per query

What is not cached:

  • query-specific extracted summaries
  • final synthesis
  • failed scrapes
  • interactive click-driven exploration results

This means the same URL can be reused across queries without being fetched again, while still producing different extracted summaries when the query changes.


Configuration

Models

Model IDs follow LiteLLM provider naming:

models = {
    # Required
    "web_researcher": "openai/gpt-4o-mini",
    "content_extractor": "gemini/gemini-2.0-flash",

    # Optional step-specific overrides (default: web_researcher)
    "query_generator": "openai/gpt-4o-mini",
    "coverage_evaluator": "openai/gpt-4o-mini",
    "synthesiser": "openai/gpt-4o-mini",

    # Optional fallback for scanned PDFs, image URLs, or empty JS pages
    "vision_fallback": "gemini/gemini-2.0-flash",
}
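
The fallback described in the comments above (optional step models default to web_researcher) could be resolved along these lines. This is a sketch under that assumption: `resolve_models` is a hypothetical helper, not a package function, and vision_fallback is left unset when it is not provided.

```python
REQUIRED_KEYS = ("web_researcher", "content_extractor")
OPTIONAL_STEPS = ("query_generator", "coverage_evaluator", "synthesiser")


def resolve_models(models: dict[str, str]) -> dict[str, str]:
    """Fill optional step models with the web_researcher model."""
    missing = [key for key in REQUIRED_KEYS if key not in models]
    if missing:
        raise ValueError(f"missing required model keys: {missing}")
    resolved = dict(models)  # do not mutate the caller's dict
    for step in OPTIONAL_STEPS:
        resolved.setdefault(step, models["web_researcher"])
    return resolved


resolved = resolve_models(
    {
        "web_researcher": "openai/gpt-4o-mini",
        "content_extractor": "gemini/gemini-2.0-flash",
        "synthesiser": "gemini/gemini-2.0-flash",  # explicit override is kept
    }
)
print(resolved)
```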

Domain Control

# Restrict discovery to selected domains
await run_web_research(query=..., models=..., include_domains=["fao.org", "ipcc.ch"])

# Re-allow domains that are blocked by default
await run_web_research(query=..., models=..., allowed_domains=["reddit.com"])

By default, the scraper blocks common social and video platforms. allowed_domains lets you opt specific domains back in.
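
The interaction between the default block policy and allowed_domains might look like the sketch below. The contents of `DEFAULT_BLOCKED` are illustrative only, since the real built-in list is not given here, and `is_blocked` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlparse

# Illustrative entries only; the package's actual block list may differ.
DEFAULT_BLOCKED = {"reddit.com", "youtube.com"}


def _matches(host: str, domains: set[str]) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def is_blocked(url: str, allowed_domains: Optional[list[str]] = None) -> bool:
    """Apply the default block list unless the domain was opted back in."""
    host = (urlparse(url).hostname or "").lower()
    allowed = {d.lower() for d in (allowed_domains or [])}
    if _matches(host, allowed):
        return False
    return _matches(host, DEFAULT_BLOCKED)


print(is_blocked("https://www.reddit.com/r/climate"))
print(is_blocked("https://www.reddit.com/r/climate", allowed_domains=["reddit.com"]))
```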


Where It Fits Best

web-scout-ai is a strong fit when you need:

  • up-to-date answers grounded in real web sources
  • multi-source synthesis without building a full deep-research stack
  • a reusable research tool inside an agent workflow
  • better handling of report libraries, list pages, and mixed web/document sources

It is probably not the right tool if you only need simple search snippets or if you want a fully autonomous long-form research agent that decides everything itself.


Requirements

  • Python >=3.10
  • API key for at least one supported LLM provider
  • SERPER_API_KEY for the Serper search backend (generous free tier)

Brand Assets

  • Full logo: [assets/web-scout-logo.svg](assets/web-scout-logo.svg)
  • Square logo mark (avatar-safe): [assets/web-scout-logo-mark.svg](assets/web-scout-logo-mark.svg)
  • Social card preview: [assets/web-scout-social-card.svg](assets/web-scout-social-card.svg)

License

MIT
