Agentic web research tool — smarter than search, faster than deep research. Search, scrape, and synthesize web content using LLMs.
The missing middle ground between basic search APIs and heavyweight deep research agents.
One async call that finds the right URLs, reads real pages and documents, and returns a grounded synthesis with sources.
## TL;DR
web-scout-ai is for teams that want better-than-snippets web research without the latency and cost profile of full deep-research stacks.
You get:
- Search -> scrape -> evaluate -> iterate -> synthesize in one deterministic pipeline
- Support for HTML, JS-rendered pages, PDFs, DOCX, PPTX, XLSX
- Structured output that drops directly into agent workflows
- Provider flexibility through LiteLLM (OpenAI, Anthropic, Gemini, Mistral, Groq, local, and more)
## Why People Switch To web-scout-ai
| Option | Typical output | Pain point |
|---|---|---|
| Search API only | snippets and links | not enough context to answer reliably |
| Single-page markdown tools | one page at a time | no discovery loop, no multi-source synthesis |
| Heavy deep-research agents | long reports | slower, more expensive, often overkill |
| web-scout-ai | sourced synthesis from real content | built for a practical speed + depth balance |
## What Makes It Stick
### 1) It reads sources, not snippets
The pipeline extracts substantial query-relevant content from each source, then synthesizes across them.
### 2) It handles real documents out of the box
- Static HTML via fast HTTP
- JS pages via Playwright
- PDF, DOCX, PPTX, XLSX via docling
- Scanned PDFs via vision-model fallback
### 3) It closes coverage gaps automatically
If first-pass sources are incomplete, it checks the existing backlog first, then runs targeted follow-up searches only when needed.
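The internals are not part of the public API, but the loop is roughly this shape. A hypothetical sketch; every name and batch size below is illustrative, not web-scout-ai's actual code:

```python
# Illustrative skeleton of the iterate step; names and batch sizes
# are assumptions, not the library's real internals.
def iterate(query, scraped, backlog, max_iterations,
            evaluate_coverage, search, scrape):
    for _ in range(max_iterations):
        gaps = evaluate_coverage(query, scraped)       # LLM judges completeness
        if not gaps:
            break                                      # coverage is good enough
        if backlog:
            batch, backlog = backlog[:4], backlog[4:]  # reuse discovered URLs first
        else:
            batch = search(gaps)                       # targeted follow-up search only when needed
        scraped += scrape(batch)
    return scraped
```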
### 4) It is agent-native by design
One async function (`run_web_research`), one typed result (`WebResearchResult`), zero framework lock-in.
## Install In 30 Seconds
```bash
pip install web-scout-ai
web-scout-setup
```

`web-scout-setup` installs the Chromium build that Playwright needs for JS-rendered pages.
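If you want to confirm the browser is in place before your first JS-heavy run, a quick standalone check with Playwright itself might look like this; `check_chromium` is just a local helper, not part of web-scout-ai:

```python
import asyncio

from playwright.async_api import async_playwright

async def check_chromium() -> None:
    # Launches and closes Chromium once; raises if web-scout-setup
    # (or `playwright install chromium`) has not run yet.
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        await browser.close()

asyncio.run(check_chromium())
```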
## First Run
```python
import asyncio

from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="What are the main threats to coral reefs worldwide?",
        models={
            "web_researcher": "gemini/gemini-2.0-flash",
            "content_extractor": "gemini/gemini-2.0-flash",
        },
    )
    print(result.synthesis)
    print("Sources:")
    for s in result.scraped:
        print(f"- {s.title}: {s.url}")

asyncio.run(main())
```
## API At A Glance
```python
result = await run_web_research(
    query="latest IPCC findings on sea level rise",
    models={
        "web_researcher": "openai/gpt-4o",
        "content_extractor": "gemini/gemini-2.0-flash",
    },
    search_backend="duckduckgo",         # or "serper"
    research_depth="standard",           # or "deep"
    include_domains=["ipcc.ch"],         # optional
    direct_url=None,                     # optional
    domain_expertise="climate science",  # optional
)
```
## Configuration
### Models

Model IDs follow LiteLLM provider naming:
```python
models = {
    # Required
    "web_researcher": "openai/gpt-4o",
    "content_extractor": "gemini/gemini-2.0-flash",

    # Optional step-specific overrides (default: web_researcher)
    "query_generator": "anthropic/claude-sonnet-4-20250514",
    "coverage_evaluator": "openai/gpt-4o-mini",
    "synthesiser": "anthropic/claude-sonnet-4-20250514",

    # Optional fallback for scanned PDFs / empty JS pages
    "vision_fallback": "gemini/gemini-2.0-flash",
}
```
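Because IDs pass straight through LiteLLM, local backends should work with the same dict shape. For example, via LiteLLM's Ollama prefix (the model name here is illustrative, and assumes a running Ollama server):

```python
# Untested sketch: fully local research through LiteLLM's "ollama/" prefix.
models = {
    "web_researcher": "ollama/llama3.1",
    "content_extractor": "ollama/llama3.1",
}
```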
### Environment variables
```bash
# Search backend (optional if using DuckDuckGo)
export SERPER_API_KEY="..."

# LLM providers (set what you use)
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."
export MISTRAL_API_KEY="..."
export GROQ_API_KEY="..."
```
### Research modes
```python
# 1) Open web research (default)
await run_web_research(query="latest IPCC findings on sea level rise", models=models)

# 2) Domain-restricted
await run_web_research(
    query="endemic species conservation programs",
    models=models,
    include_domains=["iucn.org", "wwf.org"],
)

# 3) Direct URL extraction (skip search)
await run_web_research(
    query="key findings from this report",
    models=models,
    direct_url="https://example.org/biodiversity-report.pdf",
)

# 4) Direct URL list-page deepening
await run_web_research(
    query="sustainable land management technologies in Kenya",
    models=models,
    direct_url="https://wocat.net/en/database/list/?type=technology&country=ke",
)
```
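Since `run_web_research` is a coroutine, independent questions can also be fanned out concurrently with plain `asyncio`. A minimal sketch:

```python
import asyncio

from web_scout import run_web_research

models = {
    "web_researcher": "gemini/gemini-2.0-flash",
    "content_extractor": "gemini/gemini-2.0-flash",
}

async def main():
    # Each call runs its own search -> scrape -> synthesize pipeline.
    reefs, mangroves = await asyncio.gather(
        run_web_research(query="main threats to coral reefs", models=models),
        run_web_research(query="mangrove restoration methods", models=models),
    )
    print(reefs.synthesis)
    print(mangroves.synthesis)

asyncio.run(main())
```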
### Search backends
```python
# Default: Serper (requires SERPER_API_KEY)
await run_web_research(query=..., models=..., search_backend="serper")

# Free: DuckDuckGo (no API key)
await run_web_research(query=..., models=..., search_backend="duckduckgo")
```
### Research depth
```python
# Standard (default): usually up to ~10 sources
await run_web_research(query=..., models=..., research_depth="standard")

# Deep: usually up to ~28 sources
await run_web_research(query=..., models=..., research_depth="deep")
```
| Parameter | Standard | Deep |
|---|---|---|
| Max iterations | 2 | 3 |
| Search queries (first round) | 3 | 5 |
| Search queries (follow-up) | 2 | 4 |
| URLs scraped (first round) | 6 | 12 |
| URLs scraped (follow-up) | 4 | 8 |
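The ~10 and ~28 ceilings follow directly from the table: one first round plus one follow-up round per extra iteration.

```python
# Worst-case scraped-URL count implied by the table above.
def max_urls(first_round: int, followup: int, iterations: int) -> int:
    return first_round + (iterations - 1) * followup

assert max_urls(6, 4, 2) == 10    # standard
assert max_urls(12, 8, 3) == 28   # deep
```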
## Pipeline Overview

Editable diagram: `pipeline-diagram.excalidraw`
```text
Query
  |
  +- Generate search queries (LLM)
  +- Search web (Serper or DuckDuckGo)
  +- Select best URLs
  +- Scrape and extract in parallel
  |    +- Static HTML
  |    +- JS/SPA via Playwright
  |    +- PDF/DOCX/PPTX/XLSX via docling
  |    +- Scanned PDFs via vision fallback
  +- Evaluate coverage (LLM)
  |    +- Scrape promising backlog URLs
  |    +- Or generate targeted follow-up queries
  +- Synthesize findings (LLM)
  |
  +- WebResearchResult
```
## Use As An Agent Tool
```python
from agents import Agent, function_tool

from web_scout import run_web_research

@function_tool
async def research(query: str) -> str:
    result = await run_web_research(
        query=query,
        models={
            "web_researcher": "gemini/gemini-2.0-flash",
            "content_extractor": "gemini/gemini-2.0-flash",
        },
        search_backend="duckduckgo",
    )
    sources = "\n".join(f"- {s.url}" for s in result.scraped)
    return f"{result.synthesis}\n\nSources:\n{sources}"

agent = Agent(
    name="researcher",
    model="gpt-4o",
    tools=[research],
    instructions="Use the research tool to answer with up-to-date web sources.",
)
```
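Assuming the OpenAI Agents SDK (which the imports above match), running the agent is then one call; `Runner.run_sync` and `final_output` are that SDK's API, not web-scout-ai's:

```python
from agents import Runner

# Blocking wrapper around the async run loop; the agent decides
# when to invoke the research tool.
result = Runner.run_sync(agent, "What are the main threats to coral reefs?")
print(result.final_output)
```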
## Output Schema
```python
from pydantic import BaseModel

class WebResearchResult(BaseModel):
    synthesis: str
    scraped: list[UrlEntry]
    scrape_failed: list[UrlEntry]
    snippet_only: list[UrlEntry]
    queries: list[SearchQuery]
```

`UrlEntry` contains `url`, `title`, and `content`. `SearchQuery` contains `query`, `num_results_returned`, and `domains_restricted`.
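Everything needed for an audit trail lives on the result object. For example, a small report formatter using only the fields above (a local sketch, not part of the library):

```python
def to_markdown(result) -> str:
    # Synthesis first, then a full source/audit trail.
    lines = [result.synthesis, "", "## Sources"]
    lines += [f"- [{s.title}]({s.url})" for s in result.scraped]
    if result.snippet_only:
        lines += ["", "## Snippet-only (search result text, page not scraped)"]
        lines += [f"- {s.url}" for s in result.snippet_only]
    if result.scrape_failed:
        lines += ["", "## Failed to scrape"]
        lines += [f"- {s.url}" for s in result.scrape_failed]
    return "\n".join(lines)
```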
## Brand Assets

- Full logo: `assets/web-scout-logo.svg`
- Square logo mark (avatar-safe): `assets/web-scout-logo-mark.svg`
- Social card preview: `assets/web-scout-social-card.svg`
## Requirements

- Python >= 3.10
- An API key for at least one supported LLM provider
- Optional: `SERPER_API_KEY` (or use DuckDuckGo)
## License
MIT