Agentic web research tool — smarter than search, faster than deep research. Search, scrape, and synthesize web content using LLMs.

`web-scout-ai`

AI-powered web research in one async call.

pip install web-scout-ai
web-scout-setup

from web_scout import run_web_research

# Run inside an async function (or a notebook cell that supports top-level await).
result = await run_web_research("climate risk for agriculture in Kenya")
print(result.synthesis)

What Problem It Solves

Built-in web search tools in frameworks like the OpenAI Agents SDK return snippets — short excerpts from search results that the model has to reason from. They don't read the actual pages.

web-scout-ai goes deeper: it scrapes, converts, and extracts relevant content from real pages — static HTML, JS-rendered sites, PDFs, DOCX/PPTX/XLSX, and JSON endpoints. Legacy Office binaries such as .doc, .xls, and .ppt are detected and skipped explicitly. You also control exactly which sources get scraped, how deep the pipeline goes, and what counts as good enough coverage before synthesis.

No Tavily + crawl4ai + custom glue code. No open-ended agent you cannot control in production.

Two Real Use Cases

1. Climate and policy evidence retrieval

Query institutional sources and get a cited synthesis — not just links.

result = await run_web_research(
    "drought impact on smallholder farmers in sub-Saharan Africa",
    include_domains=["fao.org", "ipcc.ch", "worldbank.org"],
    cache=True,  # reuse successful URL source artifacts for this Python process
)

2. Rapid literature scanning

Point it at a report library or database page. It detects list pages, follows item links, and reads the actual documents.

result = await run_web_research(
    "sustainable land management technologies",
    direct_url="https://wocat.net/en/database/list/?type=technology&country=ke",
)

Quick Start

Install

pip install web-scout-ai
web-scout-setup   # installs Chromium for JS-rendered pages

First run

import asyncio
from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="What are the main threats to coral reefs worldwide?",
        models={"web_researcher": "openai/gpt-4o-mini", "content_extractor": "openai/gpt-4o-mini"},
        search_backend="serper",
        cache=True,
    )
    print(result.synthesis)
    for source in result.scraped:
        print(f"- {source.title or source.url}: {source.url}")

asyncio.run(main())

What You Get Back

class WebResearchResult(BaseModel):
    synthesis: str
    scraped: list[UrlEntry]
    scrape_failed: list[UrlEntry]
    blocked_by_policy: list[UrlEntry]
    source_http_error: list[UrlEntry]
    scraped_irrelevant: list[UrlEntry]
    bot_detected: list[UrlEntry]
    snippet_only: list[UrlEntry]
    queries: list[SearchQuery]

synthesis: final grounded answer with inline source citations
scraped: URLs successfully read, with extracted relevant content
scrape_failed: URLs attempted but could not be scraped
blocked_by_policy: URLs skipped because they match the built-in block policy
source_http_error: URLs that failed because the source returned HTTP/network errors
scraped_irrelevant: URLs that were fetched successfully but did not contain relevant content
bot_detected: URLs blocked by bot protection
snippet_only: search results kept only as snippets
queries: all search queries executed during the run

UrlEntry contains url, title, and content. SearchQuery contains query, num_results_returned, and domains_restricted.
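
A quick way to inspect a run is to count how many URLs landed in each bucket. The sketch below uses stand-in dataclasses (`UrlEntry`, `Result`) that mirror only the field names listed above, so it runs without the package installed. `bucket_counts` is a hypothetical helper, not part of web-scout-ai.

```python
from dataclasses import dataclass, field

# Stand-ins mirroring the documented field names only (assumption: the
# real models expose the same attribute names).
@dataclass
class UrlEntry:
    url: str
    title: str = ""
    content: str = ""

@dataclass
class Result:
    synthesis: str = ""
    scraped: list = field(default_factory=list)
    scrape_failed: list = field(default_factory=list)
    blocked_by_policy: list = field(default_factory=list)
    source_http_error: list = field(default_factory=list)
    scraped_irrelevant: list = field(default_factory=list)
    bot_detected: list = field(default_factory=list)
    snippet_only: list = field(default_factory=list)

# The seven URL buckets documented for WebResearchResult.
BUCKETS = (
    "scraped", "scrape_failed", "blocked_by_policy", "source_http_error",
    "scraped_irrelevant", "bot_detected", "snippet_only",
)

def bucket_counts(result) -> dict:
    """Return {bucket_name: number_of_urls} for every documented bucket."""
    return {name: len(getattr(result, name)) for name in BUCKETS}
```

Because the helper only reads attributes by name, it should also work on a real result object, provided the attribute names match the listing above.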

API At A Glance

result = await run_web_research(
    query="latest IPCC findings on sea level rise",
    models={                                         # optional, defaults to gemini-3-flash-preview
        "web_researcher": "openai/gpt-4o-mini",
        "content_extractor": "gemini/gemini-2.0-flash",
    },
    search_backend="serper",
    research_depth="standard",           # or "deep"
    include_domains=["ipcc.ch"],         # optional
    direct_url=None,                     # optional
    domain_expertise="climate science",  # optional
    allowed_domains=None,                # optional
    max_pdf_pages=50,                    # optional, default 50
    max_content_chars=30_000,            # optional, max chars fed to extractor per page, default 30_000
    cache=False,                         # optional, reuse successful source artifacts in this Python process
    coverage_criteria=None,              # optional, extra instructions for the coverage evaluator
)

How It Works

See the maintained flow doc: [docs/pipeline-flow.md](docs/pipeline-flow.md)

1. Generate targeted search queries.
2. Search the web with Serper.
3. Triage the best URLs across result sets.
4. Scrape and extract relevant content in parallel.
5. After each non-final search iteration, run the coverage evaluator to decide whether the evidence actually answers the question.
6. If coverage is still weak, either reuse promising backlog URLs or run follow-up searches.
7. Produce a grounded synthesis with inline citations.
8. Run a deterministic citation check before returning.
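
The control flow described above can be sketched as a plain loop. This is a simplified illustration, not the package's implementation: every callable is injected and hypothetical, scraping is sequential rather than parallel, and backlog reuse and the citation check are left out.

```python
from typing import Callable, Optional

def research_loop(
    question: str,
    generate_queries: Callable[[str, int], list],
    search: Callable[[str], list],
    scrape: Callable[[str], Optional[str]],
    coverage_ok: Callable[[str, list], bool],
    synthesize: Callable[[str, list], str],
    max_iterations: int = 2,
    urls_per_round: int = 6,
) -> str:
    evidence = []
    seen = set()
    for iteration in range(max_iterations):
        # Generate queries, search, and collect URLs not seen before.
        candidates = []
        for query in generate_queries(question, iteration):
            for url in search(query):
                if url not in seen:
                    seen.add(url)
                    candidates.append(url)
        # Scrape a bounded number of URLs; keep only non-empty extractions.
        for url in candidates[:urls_per_round]:
            text = scrape(url)
            if text:
                evidence.append(text)
        # Coverage is evaluated only after non-final iterations.
        is_final = iteration == max_iterations - 1
        if not is_final and coverage_ok(question, evidence):
            break
    return synthesize(question, evidence)
```

Injecting the steps as callables keeps the loop testable with stubs and makes the stopping rule visible: the loop ends either when coverage is judged sufficient or when the iteration budget runs out.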

Research Modes

# 1) Open web research
await run_web_research(query="...", models=models, search_backend="serper")

# 2) Domain-restricted research
await run_web_research(query="...", models=models, include_domains=["iucn.org", "wwf.org"])

# 3) Direct URL extraction (skip search)
await run_web_research(query="...", models=models, direct_url="https://example.org/report.pdf")

# 4) Direct URL list-page deepening
await run_web_research(query="...", models=models, direct_url="https://wocat.net/en/database/list/?type=technology&country=ke")

If the URL is a list, index, or database page, the pipeline detects it, collects relevant item links, follows them, and takes one pagination hop when present.

How URL Outcomes Are Classified

What happened	Result bucket	Meaning
Scrape and extraction succeeded	`scraped`	The URL produced usable extracted content
Search result was seen but never scraped	`snippet_only`	Only the search snippet is kept
URL matched a blocked domain policy	`blocked_by_policy`	Skipped before normal extraction
Source returned HTTP/network errors	`source_http_error`	The source failed, not the package logic
Bot protection or anti-automation page detected	`bot_detected`	The URL was reachable but blocked
Page loaded but content was not useful for the query	`scraped_irrelevant`	Fetch succeeded, relevance failed
Extraction failed for other reasons	`scrape_failed`	Generic scrape or extraction failure
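
The table can be read as a decision function. The sketch below is one possible encoding; the `Attempt` flags and the order in which they are checked are assumptions made for illustration, since the precedence between outcomes is not specified here.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    attempted: bool = True    # False: seen in search results, never scraped
    blocked: bool = False     # matched a blocked domain policy
    http_error: bool = False  # source returned HTTP/network errors
    bot_page: bool = False    # bot protection or anti-automation page
    fetched: bool = False     # page loaded and content was extracted
    relevant: bool = False    # extracted content was useful for the query

def bucket_for(a: Attempt) -> str:
    """Map one URL attempt to a result bucket (assumed check order)."""
    if a.blocked:
        return "blocked_by_policy"
    if not a.attempted:
        return "snippet_only"
    if a.http_error:
        return "source_http_error"
    if a.bot_page:
        return "bot_detected"
    if a.fetched and a.relevant:
        return "scraped"
    if a.fetched:
        return "scraped_irrelevant"
    return "scrape_failed"
```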

Follow-Up Rules

Situation	What the pipeline does next
`direct_url` is a list / index / database page	Extract ranked detail links, allow one next-page hop, then scrape selected follow-ups
`direct_url` is a document	Do not fan out into site chrome or navigation pages
Search mode completes a non-final iteration	Run coverage evaluation to decide whether current evidence is sufficient
Search mode has weak coverage but promising snippet-only URLs	Scrape backlog URLs before running new searches
Search mode has weak coverage and backlog looks weak	Generate follow-up search queries
Domain-restricted mode finds a hub page	Deepen within the same domain before broadening search
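
As a rough illustration, the rules in this table can be written as a single decision function. The mode names, flag names, and returned action labels below are all hypothetical, and the sketch assumes that a domain-restricted run without a hub page behaves like ordinary search mode.

```python
def next_step(mode, *, url_kind=None, final_iteration=False,
              coverage_ok=False, backlog_promising=False, hub_found=False):
    """Return a label for what the pipeline would do next (illustrative)."""
    if mode == "direct_url":
        # List/index/database pages fan out; documents do not.
        if url_kind == "list":
            return "follow_detail_links"
        return "scrape_document_only"
    if mode == "domain_restricted" and hub_found:
        # Deepen within the same domain before broadening search.
        return "deepen_same_domain"
    if final_iteration or coverage_ok:
        return "synthesize"
    if backlog_promising:
        # Prefer already-discovered snippet-only URLs over new searches.
        return "scrape_backlog"
    return "follow_up_search"
```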

Search Backends

await run_web_research(query=..., models=..., search_backend="serper")

serper: Google-quality results with rich metadata (date, rank, People Also Ask, Knowledge Graph). Requires SERPER_API_KEY — Serper is generous with free-tier limits.

Additional backends can be added by the community — see SearchBackend in [search_backends.py](src/web_scout/search_backends.py).

Research Depth

# Standard (default): usually up to ~10 sources
await run_web_research(query=..., models=..., research_depth="standard")

# Deep: usually up to ~28 sources
await run_web_research(query=..., models=..., research_depth="deep")

Parameter	Standard	Deep
Max iterations	2	3
Search queries (first round)	3	5
Search queries (follow-up)	2	4
URLs scraped (first round)	6	12
URLs scraped (follow-up)	4	8
Hub deepening cap	10	15
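
The "~10" and "~28" source figures appear to follow from this table if the first iteration uses the first-round scrape limit and every later iteration uses the follow-up limit. That reading is an assumption, and hub deepening is not counted. The arithmetic, with illustrative key names:

```python
# Values copied from the table above; the dictionary keys are illustrative.
DEPTH = {
    "standard": {"max_iterations": 2, "first_round_urls": 6, "follow_up_urls": 4},
    "deep": {"max_iterations": 3, "first_round_urls": 12, "follow_up_urls": 8},
}

def max_search_sources(depth: str) -> int:
    """Upper bound on scraped URLs: one first round plus follow-up rounds."""
    p = DEPTH[depth]
    follow_up_rounds = p["max_iterations"] - 1
    return p["first_round_urls"] + p["follow_up_urls"] * follow_up_rounds

print(max_search_sources("standard"))  # 6 + 4 * 1 = 10
print(max_search_sources("deep"))      # 12 + 8 * 2 = 28
```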

Caching

await run_web_research(
    query="climate adaptation finance in Kenya",
    models=models,
    cache=True,
)

When cache=True, web-scout-ai keeps a process-local in-memory cache of successful URL source artifacts:

lifetime: the current Python process only
scope: reused across multiple run_web_research(...) calls in that same process
cleared automatically when Python exits

What is cached:

successful query-agnostic page/document source content
successful image/scanned-PDF source payloads, which are then reprocessed per query

What is not cached:

query-specific extracted summaries
final synthesis
failed scrapes
interactive click-driven exploration results

This means the same URL can be reused across queries without being fetched again, while still producing different extracted summaries when the query changes.
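
A minimal sketch of that behavior, assuming a plain dictionary keyed by URL. The function names are hypothetical, and the keyword filter merely stands in for the LLM-based extraction step.

```python
from typing import Callable, Optional

# Process-local store: it lives only as long as this Python process.
_SOURCE_CACHE: dict = {}

def get_source(url: str, fetch: Callable[[str], Optional[str]],
               cache: bool = False) -> Optional[str]:
    """Return query-agnostic source content, reusing successful fetches."""
    if cache and url in _SOURCE_CACHE:
        return _SOURCE_CACHE[url]
    content = fetch(url)  # in this sketch, None signals a failed scrape
    if cache and content is not None:
        _SOURCE_CACHE[url] = content  # failed scrapes are never cached
    return content

def extract_for_query(content: str, keyword: str) -> list:
    """Stand-in for query-specific extraction; its output is not cached."""
    return [line for line in content.splitlines()
            if keyword.lower() in line.lower()]
```

Keeping extraction outside the cache is what lets one fetched page serve several different questions.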

Configuration

Models

Model IDs follow LiteLLM provider naming:

models = {
    # Required
    "web_researcher": "openai/gpt-4o-mini",
    "content_extractor": "gemini/gemini-2.0-flash",

    # Optional step-specific overrides (default: web_researcher)
    "query_generator": "openai/gpt-4o-mini",
    "coverage_evaluator": "openai/gpt-4o-mini",
    "synthesiser": "openai/gpt-4o-mini",

    # Optional fallback for scanned PDFs, image URLs, or empty JS pages
    "vision_fallback": "gemini/gemini-2.0-flash",
}
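
The fallback rule in the comments above (step-specific overrides default to `web_researcher`) could be resolved roughly as follows. `resolve_model` is a hypothetical helper, and the sketch assumes the dictionary always contains the `web_researcher` key.

```python
from typing import Optional

# Steps that fall back to the web_researcher model when not set.
OVERRIDABLE_STEPS = ("query_generator", "coverage_evaluator", "synthesiser")

def resolve_model(models: dict, step: str) -> Optional[str]:
    """Pick the model ID for a pipeline step (illustrative only)."""
    if step in models:
        return models[step]
    if step in OVERRIDABLE_STEPS:
        return models["web_researcher"]
    if step == "vision_fallback":
        return None  # optional, with no documented default
    raise KeyError(f"no model configured for step {step!r}")
```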

Domain Control

# Restrict discovery to selected domains
await run_web_research(query=..., models=..., include_domains=["fao.org", "ipcc.ch"])

# Re-allow domains that are blocked by default
await run_web_research(query=..., models=..., allowed_domains=["reddit.com"])

By default, the scraper blocks common social and video platforms. allowed_domains lets you opt specific domains back in.
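
One way to picture how the two parameters interact is the sketch below. The block list is purely illustrative (only `reddit.com` is implied by the example above), and the subdomain matching rule is an assumption.

```python
from urllib.parse import urlparse

# Illustrative block list only; the package's real list is not shown here.
DEFAULT_BLOCKED = {"reddit.com", "youtube.com"}

def host_matches(host: str, domain: str) -> bool:
    """True for the domain itself or any of its subdomains."""
    return host == domain or host.endswith("." + domain)

def is_allowed(url: str, include_domains=None, allowed_domains=None) -> bool:
    host = (urlparse(url).hostname or "").lower()
    # allowed_domains opts specific domains back in from the block list.
    blocked = DEFAULT_BLOCKED - set(allowed_domains or [])
    if any(host_matches(host, d) for d in blocked):
        return False
    # include_domains, when given, restricts discovery to those domains.
    if include_domains:
        return any(host_matches(host, d) for d in include_domains)
    return True
```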

Where It Fits Best

web-scout-ai is a strong fit when you need:

up-to-date answers grounded in real web sources
multi-source synthesis without building a full deep-research stack
a reusable research tool inside an agent workflow
better handling of report libraries, list pages, and mixed web/document sources

It is probably not the right tool if you only need simple search snippets or if you want a fully autonomous long-form research agent that decides everything itself.

Requirements

Python >=3.10
API key for at least one supported LLM provider
SERPER_API_KEY for the Serper search backend (generous free tier)
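
A small preflight check can catch missing requirements before the first call. `missing_requirements` is a hypothetical helper; it skips the LLM provider key because the variable name depends on the provider chosen.

```python
import os
import sys

def missing_requirements(env=None, version=None) -> list:
    """Return human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if tuple(version) < (3, 10):
        problems.append("Python >=3.10 is required")
    if not env.get("SERPER_API_KEY"):
        problems.append("SERPER_API_KEY is not set")
    return problems

if __name__ == "__main__":
    for problem in missing_requirements():
        print("missing:", problem)
```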

Release history

Version	Release date
1.3.1	Jun 19, 2026
1.3.0 (this version)	May 19, 2026
1.2.2	May 18, 2026
1.2.1	May 18, 2026
1.2.0	May 13, 2026
1.1.1	May 5, 2026
1.1.0	Apr 23, 2026
1.0.5	Apr 16, 2026
1.0.3	Apr 14, 2026
0.9.4	Apr 10, 2026
0.9.2	Mar 27, 2026
0.9.1	Mar 19, 2026
0.9.0	Mar 18, 2026

Download files

Source Distribution

web_scout_ai-1.3.0.tar.gz (68.0 kB)

Uploaded May 19, 2026 Source

Built Distribution

web_scout_ai-1.3.0-py3-none-any.whl (71.7 kB)

Uploaded May 19, 2026 Python 3

File details

Details for the file web_scout_ai-1.3.0.tar.gz.

File metadata

Download URL: web_scout_ai-1.3.0.tar.gz
Upload date: May 19, 2026
Size: 68.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.3.2 CPython/3.14.2 Darwin/25.3.0

File hashes

Hashes for web_scout_ai-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`3704ebda1fca40b625f1ae6f33948cecee71e167512e2e4c00df2e6060524a9b`
MD5	`ac55b4926e0fdee5d26e0f5196f5eae9`
BLAKE2b-256	`4532e21c5e570e8e62627fce8a5d4b05e7311c410f8a510dc75a3bab5d7a45cb`

File details

Details for the file web_scout_ai-1.3.0-py3-none-any.whl.

File metadata

Download URL: web_scout_ai-1.3.0-py3-none-any.whl
Upload date: May 19, 2026
Size: 71.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.3.2 CPython/3.14.2 Darwin/25.3.0

File hashes

Hashes for web_scout_ai-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c891899aeeba84299ab49d00a11732c3272e8626a5430d7ff59bfb6cb09fd975`
MD5	`c27917b8051ecf24e28e20f00483c73c`
BLAKE2b-256	`ceb6f2011b774c9ed6385ec46ee3570b8e329b7c94c85914f055d9aa9b3b301b`
