
Agentic web research tool — smarter than search, faster than deep research. Search, scrape, and synthesize web content using LLMs.

Project description

web-scout-ai

The missing middle ground between web search APIs and deep research agents.

Most AI tools give you one of two options: fast search APIs that return shallow snippets, or heavyweight research agents that take minutes and cost dollars per query. web-scout-ai fills the gap by giving you a streamlined pipeline that automatically gets the right URLs and automatically handles complex file types. It searches, scrapes, reads documents, evaluates coverage, and synthesizes findings into a sourced answer, all in a single async call that typically completes in 15-40 seconds.

from web_scout import run_web_research

result = await run_web_research(
    query="What regulations protect mangrove ecosystems in Southeast Asia?",
    models={
        "web_researcher": "gemini/gemini-3.0-flash-preview",
        "content_extractor": "gemini/gemini-3.0-flash-preview",
    },
)
print(result.synthesis)   # Coherent narrative with citations
print(result.scraped)     # Full extracted content from each source
print(result.queries)     # Every search query that was executed

Core Strengths

1. Automatically Gets the Right URLs

You don't need to manually feed it links. You pass a high-level question, and the tool:

  • Uses an LLM to generate targeted search engine queries.
  • Executes searches via Serper or DuckDuckGo.
  • Interleaves and ranks the resulting URLs.
  • Automatically evaluates whether the scraped content fully answers the query. If there are gaps, it first checks the unscraped backlog of search results already collected, scraping promising ones before running new searches. Only if the backlog looks unpromising does it generate new targeted queries (every query it ran is recorded in the result, as the snippet below shows).
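
For example, every generated query is recorded in the result, so you can see what the tool searched for across iterations (a minimal sketch; result comes from the call shown above):

for q in result.queries:
    print(q.query, "->", q.num_results_returned, "results")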

2. Automatically Handles Complex File Types

Most web scrapers break on PDFs or single-page applications. web-scout-ai seamlessly handles all of the following (see the example after the list):

  • Static HTML (fast HTTP fetches)
  • JS-rendered SPAs (headless Playwright browser)
  • Real Documents (PDFs, DOCX, PPTX, XLSX via docling)
  • Scanned PDFs (vision LLM fallback to extract text from screenshots)
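
All of these formats flow through the same call. For instance, a PDF report can be fed in directly (a minimal sketch; the URL is a placeholder and the optional vision_fallback model is only used when a PDF has no text layer):

result = await run_web_research(
    query="summarize the key findings of this report",
    models={
        "web_researcher": "gemini/gemini-2.0-flash",
        "content_extractor": "gemini/gemini-2.0-flash",
        "vision_fallback": "gemini/gemini-2.0-flash",  # optional: only used for scanned PDFs
    },
    direct_url="https://example.org/annual-report.pdf",
)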

3. Plug-and-Play Tool for Any Agent

Designed from the ground up to be called by other AI agents, not just as a standalone script.

  • One Function Call: A single run_web_research async function.
  • Structured Output: Returns a predictable Pydantic model (WebResearchResult).
  • Framework Agnostic: Works flawlessly with OpenAI Agents SDK, LangChain, LlamaIndex, or your custom agent loop.
  • Model Agnostic: Uses LiteLLM under the hood, so it works with OpenAI, Anthropic, Gemini, Groq, Mistral, or local models.

4. Multiple Content Extraction Methods

The tool doesn't just rely on search. It supports multiple ways to gather content:

  • Open Web Search: Queries search engines and scrapes the best results.
  • Domain-Restricted Search: Limits searches to specific websites (e.g., only iucn.org or un.org).
  • Direct URL Extraction: Skips the search step entirely and extracts and synthesizes content from a specific webpage or document link.

Why web-scout-ai?

The problem with existing tools

Tool                       | What you get                            | What's missing
Tavily / Exa               | Search snippets via proprietary API     | No actual page content. No synthesis. Vendor lock-in. Paid per query.
Jina Reader                | Single URL to markdown                  | No search. No multi-source reasoning. No synthesis.
Firecrawl                  | Single URL to markdown (paid SaaS)      | No search. No synthesis. Requires hosting or SaaS subscription.
ScrapeGraphAI              | LLM-driven single-page extraction       | No web search. No cross-source synthesis. Single page at a time.
GPT-Researcher             | Deep multi-agent reports (2000+ words)  | 1-3+ minutes per query. $0.05-0.10+ per report. Heavy LangChain dependency. Overkill for most questions.
LangChain/LlamaIndex tools | Building blocks you wire together       | No integrated pipeline. You build and maintain the glue code.

What web-scout-ai does differently

It actually reads the pages. Search APIs return 200-character snippets. web-scout-ai scrapes each page, extracts the relevant content with a dedicated LLM sub-agent, and returns ~5,000 characters of focused, query-relevant content per source — not just a snippet.

It handles real documents. PDFs, DOCX, PPTX, XLSX — not just HTML. Government reports, academic papers, UN documents — the kind of sources that matter for serious research but that most tools silently skip.

It closes the loop. Search → Scrape → Evaluate → Iterate → Synthesize. If the first round of sources doesn't fully answer the query, the evaluator first inspects the unscraped backlog of search results already collected. If any look promising for the missing information, they are scraped next — no new search round needed. Only if the backlog is unhelpful does it generate new targeted queries. Most tools stop after search.

It's deterministic. No unbounded agentic loops. No unpredictable costs. The pipeline has a fixed structure with circuit breakers at every stage. You know what it will do and what it will cost.

It works with any LLM. OpenAI, Anthropic, Google Gemini, Mistral, Groq, DeepSeek, Together, local models via Ollama — anything LiteLLM supports. No vendor lock-in. Mix and match providers across pipeline steps.

It's a single function call. Designed as a plug-and-play tool for AI agents, not a framework you need to learn. One function, one result type, zero configuration beyond model names.

How it works

An editable diagram of the full pipeline is available in pipeline-diagram.excalidraw. Open it at excalidraw.com or with the VS Code Excalidraw extension.

Query
 │
 ├─ Generate search queries (LLM)
 ├─ Search the web (Serper or DuckDuckGo)
 ├─ Select best URLs (interleaved from multiple queries)
 ├─ Scrape & extract in parallel
 │   ├─ Static HTML → fast HTTP fetch (no browser)
 │   ├─ JS/SPA pages → Playwright browser
 │   ├─ PDFs → docling (text layer, no OCR)
 │   ├─ DOCX/PPTX/XLSX → docling
 │   └─ Scanned PDFs → vision LLM fallback (screenshot → extract)
 ├─ Evaluate coverage (LLM) — are there gaps?
 │   ├─ If gaps + backlog looks promising → scrape backlog URLs (skip new search)
 │   └─ If gaps + backlog is unpromising → generate targeted queries → search & scrape again
 ├─ Synthesize findings (LLM)
 │
 └─ WebResearchResult
      ├─ synthesis: str (coherent answer)
      ├─ scraped: list[UrlEntry] (sources with full extracted content)
      ├─ scrape_failed: list[UrlEntry]
      ├─ snippet_only: list[UrlEntry]
      └─ queries: list[SearchQuery]

Every step has timeouts, circuit breakers, and deduplication. URL validation (HEAD + GET) skips dead links, paywalls, binary files, and blocked domains before any expensive processing starts.
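
Failed or skipped sources are not silently dropped; they show up in the result fields described under Output structure below. A quick check, assuming result comes from any of the examples in this README:

print(f"{len(result.scraped)} scraped, {len(result.scrape_failed)} failed, {len(result.snippet_only)} snippet-only")
for entry in result.scrape_failed:
    print("failed:", entry.url)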

Installation

pip install web-scout-ai
web-scout-setup

The first command installs all dependencies, including document extraction (PDF, DOCX, PPTX, XLSX via docling) and both search backends (Serper and DuckDuckGo). The second command installs the Chromium browser needed for scraping JS-rendered pages.

Quick start

import asyncio
from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="What are the main threats to coral reefs worldwide?",
        models={
            "web_researcher": "gemini/gemini-2.0-flash",
            "content_extractor": "gemini/gemini-2.0-flash",
        },
    )

    print(result.synthesis)

    for source in result.scraped:
        print(f"  - {source.title}: {source.url}")

asyncio.run(main())

Configuration

Models

Pass a models dict to configure which LLM handles each pipeline step. All model strings follow the LiteLLM naming convention:

models = {
    # Required
    "web_researcher": "openai/gpt-4o",              # query generation, coverage evaluation, synthesis
    "content_extractor": "gemini/gemini-2.0-flash",  # page content extraction sub-agent

    # Optional overrides (default to web_researcher)
    "query_generator": "anthropic/claude-sonnet-4-20250514",
    "coverage_evaluator": "openai/gpt-4o-mini",
    "synthesiser": "anthropic/claude-sonnet-4-20250514",

    # Optional: vision fallback for scanned PDFs / empty JS pages
    "vision_fallback": "gemini/gemini-2.0-flash",
}

You can mix providers — e.g. use a cheap fast model for extraction and a stronger model for synthesis.
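
For instance, a mixed-provider configuration might look like this (model names are illustrative; any LiteLLM identifier works):

models = {
    "web_researcher": "openai/gpt-4o-mini",                # query generation, evaluation
    "content_extractor": "gemini/gemini-2.0-flash",        # fast, cheap page extraction
    "synthesiser": "anthropic/claude-sonnet-4-20250514",   # stronger model for the final write-up
}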

Environment variables

Set the API key for your chosen provider(s):

# Search backend
export SERPER_API_KEY="..."          # for Serper (Google results) — or use DuckDuckGo (free, no key)

# LLM providers (set the ones you use)
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."
export MISTRAL_API_KEY="..."
export GROQ_API_KEY="..."
# ... any LiteLLM-supported provider
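
If you prefer not to export keys in the shell, they can also be set from Python before calling the tool (a minimal sketch):

import os

os.environ["GEMINI_API_KEY"] = "..."   # LLM provider key
os.environ["SERPER_API_KEY"] = "..."   # optional, only needed for the Serper backend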

Research modes

# 1. Open web search (default)
result = await run_web_research(
    query="latest IPCC findings on sea level rise",
    models=models,
)

# 2. Domain-restricted search — only search within specific sites
result = await run_web_research(
    query="endemic species conservation programs",
    models=models,
    include_domains=["iucn.org", "wwf.org"],
)

# 3. Direct URL extraction — skip search, extract a specific page or document
result = await run_web_research(
    query="key findings from this report",
    models=models,
    direct_url="https://example.org/biodiversity-report.pdf",
)

Search backends

# Serper (default) — Google-quality results, requires SERPER_API_KEY
result = await run_web_research(query=..., models=..., search_backend="serper")

# DuckDuckGo — zero config, no API key needed
result = await run_web_research(query=..., models=..., search_backend="duckduckgo")

Research depth

Control how thoroughly the tool searches and scrapes:

# Standard (default) — 2 iterations, up to ~10 sources
result = await run_web_research(query=..., models=..., research_depth="standard")

# Deep — 3 iterations, up to ~28 sources, more search queries per round
# Best for complex queries that need cross-referencing multiple technical sources
result = await run_web_research(query=..., models=..., research_depth="deep")

Parameter                    | Standard | Deep
Max iterations               | 2        | 3
Search queries (first round) | 3        | 5
Search queries (follow-up)   | 2        | 4
URLs scraped (first round)   | 6        | 12
URLs scraped (follow-up)     | 4        | 8

Domain expertise

Provide domain context to improve query generation and synthesis quality:

result = await run_web_research(
    query="red list status of Panthera tigris subspecies",
    models=models,
    domain_expertise="conservation biology and IUCN Red List assessments",
)

Use as an agent tool

web-scout-ai is designed to be called by AI agents. One function, structured output, async-native:

from agents import Agent, Runner, function_tool
from web_scout import run_web_research

@function_tool
async def research(query: str) -> str:
    """Search the web and return a synthesized answer with sources."""
    result = await run_web_research(
        query=query,
        models={
            "web_researcher": "gemini/gemini-2.0-flash",
            "content_extractor": "gemini/gemini-2.0-flash",
        },
        search_backend="duckduckgo",
    )
    sources = "\n".join(f"- {s.url}" for s in result.scraped)
    return f"{result.synthesis}\n\nSources:\n{sources}"

agent = Agent(
    name="researcher",
    model="gpt-4o",
    tools=[research],
    instructions="Use the research tool to answer questions with up-to-date web information.",
)
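
Running the agent is then the usual loop for your framework; with the OpenAI Agents SDK imported above, a minimal sketch looks like this:

import asyncio

async def ask():
    run = await Runner.run(agent, "What are the main threats to coral reefs worldwide?")
    print(run.final_output)

asyncio.run(ask())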

Works with any agent framework — OpenAI Agents SDK, LangChain, LlamaIndex, or your own. It's just an async function that returns a Pydantic model.

Output structure

run_web_research returns a WebResearchResult:

class WebResearchResult(BaseModel):
    synthesis: str                     # Coherent narrative answering the query
    scraped: list[UrlEntry]            # Sources with full extracted content (~5000 chars each)
    scrape_failed: list[UrlEntry]      # URLs where scraping failed
    snippet_only: list[UrlEntry]       # Search results not scraped (with snippets)
    queries: list[SearchQuery]         # All search queries executed

Each UrlEntry contains url, title, and content. Each SearchQuery contains query, num_results_returned, and domains_restricted.
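
For example, to walk through the extracted content and the executed queries from a finished run (result is any WebResearchResult):

for src in result.scraped:
    print(src.title, "-", src.url)
    print(src.content[:200], "...")   # first 200 characters of the extracted content

for q in result.queries:
    print(q.query, q.num_results_returned, q.domains_restricted)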

Requirements

  • Python >= 3.10
  • An API key for at least one LLM provider
  • (Optional) SERPER_API_KEY for Google-quality search — or use DuckDuckGo for free

License

MIT

Download files

Download the file for your platform.

Source Distribution

web_scout_ai-0.9.0.tar.gz (36.8 kB)

Uploaded Source

Built Distribution


web_scout_ai-0.9.0-py3-none-any.whl (37.3 kB)

Uploaded Python 3

File details

Details for the file web_scout_ai-0.9.0.tar.gz.

File metadata

  • Download URL: web_scout_ai-0.9.0.tar.gz
  • Upload date:
  • Size: 36.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.14.2 Darwin/25.3.0

File hashes

Hashes for web_scout_ai-0.9.0.tar.gz
Algorithm Hash digest
SHA256 63653ffb1830b5322b275ae986a944897ac4b8e73d7c13aa7c9ac7b5f6d36feb
MD5 24d3eaa56d756d5d668a26f0c3380e20
BLAKE2b-256 7c2df922102c94cd06745014418a6f57253513c9bfa6d633934929db0d39389d


File details

Details for the file web_scout_ai-0.9.0-py3-none-any.whl.

File metadata

  • Download URL: web_scout_ai-0.9.0-py3-none-any.whl
  • Upload date:
  • Size: 37.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.14.2 Darwin/25.3.0

File hashes

Hashes for web_scout_ai-0.9.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6d3e92e5854d8dae751718d0b367085352d17dd238f38ad2a1db75c9575fabb1
MD5 846597a02ea246f5b0a1ba0befac2994
BLAKE2b-256 2cc2d3875830edb6ff217c91808701beb2462b91589e1aabcd9ee535ce802904

