
Agentic web research tool — smarter than search, faster than deep research. Search, scrape, and synthesize web content using LLMs.


web-scout-ai



AI-powered web research in one async call.

pip install web-scout-ai
web-scout-setup

from web_scout import run_web_research

# Run inside an async function, or in a REPL that supports top-level await
result = await run_web_research("climate risk for agriculture in Kenya")
print(result.synthesis)

What Problem It Solves

Built-in web search tools in frameworks like the OpenAI Agents SDK return snippets — short excerpts from search results that the model has to reason from. They don't read the actual pages.

web-scout-ai goes deeper: it scrapes, converts, and extracts relevant content from real pages — static HTML, JS-rendered sites, PDFs, DOCX/PPTX/XLSX, and JSON endpoints. Legacy Office binaries such as .doc, .xls, and .ppt are detected and skipped explicitly. You also control exactly which sources get scraped, how deep the pipeline goes, and what counts as good enough coverage before synthesis.

No Tavily + crawl4ai + custom glue code. No open-ended agent you cannot control in production.


Real Use Cases

1. Climate and policy evidence retrieval

Query institutional sources and get a cited synthesis — not just links.

result = await run_web_research(
    "drought impact on smallholder farmers in sub-Saharan Africa",
    include_domains=["fao.org", "ipcc.ch", "worldbank.org"],
    cache=True,  # reuse successful URL source artifacts for this Python process
)

2. Rapid literature scanning

Point it at a report library or database page. It detects list pages, follows item links, and reads the actual documents.

result = await run_web_research(
    "sustainable land management technologies",
    direct_url="https://wocat.net/en/database/list/?type=technology&country=ke",
)

Quick Start

Install

pip install web-scout-ai
web-scout-setup   # installs Chromium for JS-rendered pages

First run

import asyncio
from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="What are the main threats to coral reefs worldwide?",
        models={"web_researcher": "openai/gpt-4o-mini", "content_extractor": "openai/gpt-4o-mini"},
        search_backend="serper",
        cache=True,
    )
    print(result.synthesis)
    for source in result.scraped:
        print(f"- {source.title or source.url}: {source.url}")

asyncio.run(main())

What You Get Back

class WebResearchResult(BaseModel):
    synthesis: str
    scraped: list[UrlEntry]
    scrape_failed: list[UrlEntry]
    blocked_by_policy: list[UrlEntry]
    source_http_error: list[UrlEntry]
    scraped_irrelevant: list[UrlEntry]
    bot_detected: list[UrlEntry]
    snippet_only: list[UrlEntry]
    queries: list[SearchQuery]
  • synthesis: final grounded answer with inline source citations
  • scraped: URLs successfully read, with extracted relevant content
  • scrape_failed: URLs attempted but could not be scraped
  • blocked_by_policy: URLs skipped because they match the built-in block policy
  • source_http_error: URLs that failed because the source returned HTTP/network errors
  • scraped_irrelevant: URLs that were fetched successfully but did not contain relevant content
  • bot_detected: URLs blocked by bot protection
  • snippet_only: search results kept only as snippets
  • queries: all search queries executed during the run

UrlEntry contains url, title, and content. SearchQuery contains query, num_results_returned, and domains_restricted.
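
To see at a glance how a run went, it can help to count the entries in each bucket. The helper below is a minimal sketch and not part of the package: `summarize_buckets` is a hypothetical name, and the stand-in object only mimics the attribute names listed above so the snippet runs without network access.

```python
from types import SimpleNamespace

# Bucket names copied from the WebResearchResult fields listed above.
BUCKETS = (
    "scraped",
    "scrape_failed",
    "blocked_by_policy",
    "source_http_error",
    "scraped_irrelevant",
    "bot_detected",
    "snippet_only",
)


def summarize_buckets(result) -> dict:
    """Count the URL entries in each outcome bucket of a result object."""
    return {name: len(getattr(result, name, [])) for name in BUCKETS}


# Stand-in with the same attribute names (assumption for illustration);
# a real run would pass the object returned by run_web_research(...).
fake_result = SimpleNamespace(
    synthesis="...",
    scraped=[SimpleNamespace(url="https://example.org/a", title="A", content="...")],
    bot_detected=[SimpleNamespace(url="https://example.org/b", title=None, content="")],
)
counts = summarize_buckets(fake_result)
print(counts)
```

A high `bot_detected` or `source_http_error` count relative to `scraped` is a hint that the synthesis rests on fewer sources than the search returned.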


API At A Glance

result = await run_web_research(
    query="latest IPCC findings on sea level rise",
    models={                                         # optional, defaults to gemini-3-flash-preview
        "web_researcher": "openai/gpt-4o-mini",
        "content_extractor": "gemini/gemini-2.0-flash",
    },
    search_backend="serper",
    research_depth="standard",           # or "deep"
    include_domains=["ipcc.ch"],         # optional
    direct_url=None,                     # optional
    domain_expertise="climate science",  # optional
    allowed_domains=None,                # optional
    max_pdf_pages=50,                    # optional, default 50
    max_content_chars=30_000,            # optional, max chars fed to extractor per page, default 30_000
    cache=False,                         # optional, reuse successful source artifacts in this Python process
    coverage_criteria=None,              # optional, extra instructions for the coverage evaluator
)

How It Works

See the maintained flow doc: [docs/pipeline-flow.md](docs/pipeline-flow.md)

  1. Generate targeted search queries.
  2. Search the web with Serper.
  3. Triage the best URLs across result sets.
  4. Scrape and extract relevant content in parallel.
  5. After each non-final search iteration, run the coverage evaluator to decide whether the evidence actually answers the question.
  6. If coverage is still weak, either reuse promising backlog URLs or run follow-up searches.
  7. Produce a grounded synthesis with inline citations.
  8. Run a deterministic citation check before returning.
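
The control flow of steps 2 to 6 can be sketched as a small loop. This is an illustrative sketch under stated assumptions, not the package's implementation: `research_loop` and its callable parameters are hypothetical, and query generation, triage, synthesis, and the citation check are left out.

```python
from typing import Callable, List, Set


def research_loop(
    question: str,
    search: Callable[[str], List[str]],
    scrape: Callable[[str], str],
    is_covered: Callable[[str, List[str]], bool],
    max_iterations: int = 2,
) -> List[str]:
    """Collect evidence until coverage is good enough or iterations run out."""
    evidence: List[str] = []
    seen: Set[str] = set()
    for iteration in range(max_iterations):
        for url in search(question):
            if url not in seen:  # never scrape the same URL twice
                seen.add(url)
                evidence.append(scrape(url))
        is_final = iteration == max_iterations - 1
        # Coverage is only evaluated after non-final iterations.
        if not is_final and is_covered(question, evidence):
            break
    return evidence


# Stub callables so the sketch runs offline.
search_calls = []


def fake_search(q: str) -> List[str]:
    search_calls.append(q)
    return ["https://example.org/1", "https://example.org/2"]


evidence = research_loop(
    "coral reef threats",
    search=fake_search,
    scrape=lambda url: f"text from {url}",
    is_covered=lambda q, ev: len(ev) >= 2,
)
print(len(evidence), len(search_calls))
```

With these stubs the coverage check passes after the first round, so the second search never runs.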

Research Modes

# 1) Open web research
await run_web_research(query="...", models=models, search_backend="serper")

# 2) Domain-restricted research
await run_web_research(query="...", models=models, include_domains=["iucn.org", "wwf.org"])

# 3) Direct URL extraction (skip search)
await run_web_research(query="...", models=models, direct_url="https://example.org/report.pdf")

# 4) Direct URL list-page deepening
await run_web_research(query="...", models=models, direct_url="https://wocat.net/en/database/list/?type=technology&country=ke")

If the URL is a list, index, or database page, the pipeline detects it, collects relevant item links, follows them, and takes one pagination hop when present.

How URL Outcomes Are Classified

| What happened | Result bucket | Meaning |
| --- | --- | --- |
| Scrape and extraction succeeded | scraped | The URL produced usable extracted content |
| Search result was seen but never scraped | snippet_only | Only the search snippet is kept |
| URL matched a blocked domain policy | blocked_by_policy | Skipped before normal extraction |
| Source returned HTTP/network errors | source_http_error | The source failed, not the package logic |
| Bot protection or anti-automation page detected | bot_detected | The URL was reachable but blocked |
| Page loaded but content was not useful for the query | scraped_irrelevant | Fetch succeeded, relevance failed |
| Extraction failed for other reasons | scrape_failed | Generic scrape or extraction failure |
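
The classification above can be read as a chain of checks. The sketch below is one possible ordering, written for illustration: `FetchOutcome`, its flags, and the precedence of the checks are assumptions, and only the bucket names come from this page.

```python
from dataclasses import dataclass


@dataclass
class FetchOutcome:
    """Hypothetical record of what happened to one URL."""

    attempted: bool = True        # False if the URL was only seen in search results
    blocked_domain: bool = False  # matched the block policy
    http_error: bool = False      # source returned an HTTP/network error
    bot_page: bool = False        # bot protection page detected
    fetched_ok: bool = False      # page or document content was obtained
    relevant: bool = False        # content was useful for the query


def bucket_for(outcome: FetchOutcome) -> str:
    """Map an outcome to one of the result bucket names."""
    if outcome.blocked_domain:
        return "blocked_by_policy"
    if not outcome.attempted:
        return "snippet_only"
    if outcome.http_error:
        return "source_http_error"
    if outcome.bot_page:
        return "bot_detected"
    if outcome.fetched_ok and outcome.relevant:
        return "scraped"
    if outcome.fetched_ok:
        return "scraped_irrelevant"
    return "scrape_failed"


print(bucket_for(FetchOutcome(fetched_ok=True, relevant=True)))
print(bucket_for(FetchOutcome(fetched_ok=True)))
print(bucket_for(FetchOutcome(attempted=False)))
```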

Follow-Up Rules

| Situation | What the pipeline does next |
| --- | --- |
| direct_url is a list / index / database page | Extract ranked detail links, allow one next-page hop, then scrape selected follow-ups |
| direct_url is a document | Do not fan out into site chrome or navigation pages |
| Search mode completes a non-final iteration | Run coverage evaluation to decide whether current evidence is sufficient |
| Search mode has weak coverage but promising snippet-only URLs | Scrape backlog URLs before running new searches |
| Search mode has weak coverage and backlog looks weak | Generate follow-up search queries |
| Domain-restricted mode finds a hub page | Deepen within the same domain before broadening search |
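
These rules amount to a small decision function. The version below is a sketch only: the function name, the mode strings, and the returned action labels are invented for illustration and do not correspond to identifiers in the package.

```python
from typing import Optional


def next_step(
    mode: str,
    direct_url_kind: Optional[str] = None,
    coverage_ok: bool = False,
    promising_backlog: bool = False,
    found_hub: bool = False,
) -> str:
    """Pick the follow-up action for the current pipeline state."""
    if mode == "direct":
        # List pages fan out to item links; documents are read as they are.
        if direct_url_kind == "list":
            return "follow_item_links"
        return "read_document_only"
    if mode == "domain" and found_hub:
        return "deepen_same_domain"
    if coverage_ok:
        return "synthesize"
    if promising_backlog:
        return "scrape_backlog"
    return "follow_up_search"


print(next_step("direct", direct_url_kind="list"))
print(next_step("search", promising_backlog=True))
print(next_step("search"))
```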

Search Backends

await run_web_research(query=..., models=..., search_backend="serper")
  • serper: Google-quality results with rich metadata (date, rank, People Also Ask, Knowledge Graph). Requires SERPER_API_KEY — Serper is generous with free-tier limits.

Additional backends can be added by the community — see SearchBackend in [search_backends.py](src/web_scout/search_backends.py).


Research Depth

# Standard (default): usually up to ~10 sources
await run_web_research(query=..., models=..., research_depth="standard")

# Deep: usually up to ~28 sources
await run_web_research(query=..., models=..., research_depth="deep")

| Parameter | Standard | Deep |
| --- | --- | --- |
| Max iterations | 2 | 3 |
| Search queries (first round) | 3 | 5 |
| Search queries (follow-up) | 2 | 4 |
| URLs scraped (first round) | 6 | 12 |
| URLs scraped (follow-up) | 4 | 8 |
| Hub deepening cap | 10 | 15 |
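
The "~10" and "~28" source figures follow from the table if every iteration after the first is treated as a follow-up round. The arithmetic below is a worked sketch under that assumption; `DEPTH_PROFILES` and `max_sources` are illustrative names, and hub deepening is not counted.

```python
# Values copied from the Standard / Deep table above.
DEPTH_PROFILES = {
    "standard": {"max_iterations": 2, "first_round_urls": 6, "follow_up_urls": 4},
    "deep": {"max_iterations": 3, "first_round_urls": 12, "follow_up_urls": 8},
}


def max_sources(depth: str) -> int:
    """Upper bound on scraped URLs, excluding hub deepening."""
    profile = DEPTH_PROFILES[depth]
    follow_up_rounds = profile["max_iterations"] - 1
    return profile["first_round_urls"] + follow_up_rounds * profile["follow_up_urls"]


print(max_sources("standard"))  # 6 + 1 * 4 = 10
print(max_sources("deep"))      # 12 + 2 * 8 = 28
```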

Caching

await run_web_research(
    query="climate adaptation finance in Kenya",
    models=models,
    cache=True,
)

When cache=True, web-scout-ai keeps a process-local in-memory cache of successful URL source artifacts:

  • lifetime: the current Python process only
  • scope: reused across multiple run_web_research(...) calls in that same process
  • cleared automatically when Python exits

What is cached:

  • successful query-agnostic page/document source content
  • successful image/scanned-PDF source payloads, which are then reprocessed per query

What is not cached:

  • query-specific extracted summaries
  • final synthesis
  • failed scrapes
  • interactive click-driven exploration results

This means the same URL can be reused across queries without being fetched again, while still producing different extracted summaries when the query changes.
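
The split between cached sources and uncached summaries can be sketched with a plain dictionary. This is an illustration of the behavior described above, not the package's cache: `_SOURCE_CACHE`, `get_source`, and `extract_for_query` are hypothetical names, and the fetch and extract callables are stubs.

```python
from typing import Callable, Dict

# Process-local cache: lives only as long as this Python process.
_SOURCE_CACHE: Dict[str, str] = {}


def get_source(url: str, fetch: Callable[[str], str]) -> str:
    """Return raw source content, fetching only on a cache miss."""
    if url not in _SOURCE_CACHE:
        content = fetch(url)  # if fetch raises, nothing is stored
        _SOURCE_CACHE[url] = content
    return _SOURCE_CACHE[url]


def extract_for_query(
    url: str,
    query: str,
    fetch: Callable[[str], str],
    extract: Callable[[str, str], str],
) -> str:
    """Extraction is recomputed per query; only the raw source is reused."""
    return extract(get_source(url, fetch), query)


fetch_calls = []


def fake_fetch(url: str) -> str:
    fetch_calls.append(url)
    return "Kenya faces drought and flood risk."


def fake_extract(text: str, query: str) -> str:
    return f"[{query}] {text}"


first = extract_for_query("https://example.org/report", "drought", fake_fetch, fake_extract)
second = extract_for_query("https://example.org/report", "flood", fake_fetch, fake_extract)
print(len(fetch_calls), first != second)
```

The URL is fetched once, yet the two queries still produce different summaries.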


Configuration

Models

Model IDs follow LiteLLM provider naming:

models = {
    # Required
    "web_researcher": "openai/gpt-4o-mini",
    "content_extractor": "gemini/gemini-2.0-flash",

    # Optional step-specific overrides (default: web_researcher)
    "query_generator": "openai/gpt-4o-mini",
    "coverage_evaluator": "openai/gpt-4o-mini",
    "synthesiser": "openai/gpt-4o-mini",

    # Optional fallback for scanned PDFs, image URLs, or empty JS pages
    "vision_fallback": "gemini/gemini-2.0-flash",
}
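
The fallback described in the comments above (step-specific overrides default to web_researcher) could be resolved along these lines. `resolve_model` is a hypothetical helper written for illustration; how the package resolves models internally is not documented here, and no default is assumed for vision_fallback.

```python
# Steps that the comments above say fall back to "web_researcher".
OPTIONAL_STEPS = ("query_generator", "coverage_evaluator", "synthesiser")


def resolve_model(models: dict, step: str) -> str:
    """Return the model ID for a step, falling back to web_researcher."""
    if step in models:
        return models[step]
    if step in OPTIONAL_STEPS:
        return models["web_researcher"]
    raise KeyError(f"no model configured for step {step!r}")


example_models = {
    "web_researcher": "openai/gpt-4o-mini",
    "content_extractor": "gemini/gemini-2.0-flash",
    "synthesiser": "openai/gpt-4o",  # illustrative override
}
print(resolve_model(example_models, "synthesiser"))
print(resolve_model(example_models, "query_generator"))
```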

Domain Control

# Restrict discovery to selected domains
await run_web_research(query=..., models=..., include_domains=["fao.org", "ipcc.ch"])

# Re-allow domains that are blocked by default
await run_web_research(query=..., models=..., allowed_domains=["reddit.com"])

By default, the scraper blocks common social and video platforms. allowed_domains lets you opt specific domains back in.
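
The interaction between the default block policy and allowed_domains can be sketched as a set difference followed by a hostname match. The block list below is a placeholder assumption (the actual built-in list is not shown on this page), and `is_blocked` is a hypothetical helper.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit

# Placeholder block list for illustration only; the real list may differ.
DEFAULT_BLOCKED = {"reddit.com", "youtube.com"}


def is_blocked(url: str, allowed_domains: Optional[Iterable[str]] = None) -> bool:
    """True if the URL's host is on the block list and not opted back in."""
    host = (urlsplit(url).hostname or "").lower()
    allowed = {domain.lower() for domain in (allowed_domains or [])}
    for domain in DEFAULT_BLOCKED - allowed:
        # Match the domain itself and any subdomain, but not look-alikes.
        if host == domain or host.endswith("." + domain):
            return True
    return False


print(is_blocked("https://www.reddit.com/r/climate"))
print(is_blocked("https://www.reddit.com/r/climate", allowed_domains=["reddit.com"]))
print(is_blocked("https://notreddit.com/"))
```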


Where It Fits Best

web-scout-ai is a strong fit when you need:

  • up-to-date answers grounded in real web sources
  • multi-source synthesis without building a full deep-research stack
  • a reusable research tool inside an agent workflow
  • better handling of report libraries, list pages, and mixed web/document sources

It is probably not the right tool if you only need simple search snippets or if you want a fully autonomous long-form research agent that decides everything itself.


Requirements

  • Python >=3.10
  • API key for at least one supported LLM provider
  • SERPER_API_KEY for the Serper search backend (generous free tier)
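
A small preflight check can catch missing requirements before the first call. This is a sketch, not a package feature: `preflight` is a hypothetical helper, and the provider variable names (OPENAI_API_KEY, GEMINI_API_KEY) are assumptions based on common LiteLLM conventions, so adjust them to the providers you actually use.

```python
import os
import sys
from typing import List, Mapping, Optional, Sequence, Tuple


def preflight(
    env: Optional[Mapping[str, str]] = None,
    provider_keys: Sequence[str] = ("OPENAI_API_KEY", "GEMINI_API_KEY"),
    version: Optional[Tuple[int, int]] = None,
) -> List[str]:
    """Return a list of unmet requirements; an empty list means ready."""
    env = os.environ if env is None else env
    version = tuple(sys.version_info[:2]) if version is None else version
    problems = []
    if version < (3, 10):
        problems.append("Python 3.10 or newer is required")
    if not env.get("SERPER_API_KEY"):
        problems.append("SERPER_API_KEY is not set")
    if not any(env.get(key) for key in provider_keys):
        problems.append("no LLM provider API key found")
    return problems


# Explicit inputs so the example does not depend on the local environment.
ok = preflight({"SERPER_API_KEY": "x", "OPENAI_API_KEY": "y"}, version=(3, 12))
missing = preflight({}, version=(3, 9))
print(ok)
print(missing)
```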
