
Agentic web research tool — smarter than search, faster than deep research. Search, scrape, and synthesize web content using LLMs.

Project description

web-scout-ai

The missing middle ground between web search APIs and deep research agents.

Most AI tools give you one of two options: fast search APIs that return shallow snippets, or heavyweight research agents that take minutes and cost dollars per query. web-scout-ai fills the gap by giving you a streamlined pipeline that automatically gets the right URLs and automatically handles complex file types. It searches, scrapes, reads documents, evaluates coverage, and synthesizes findings into a sourced answer, all in a single async call that typically completes in 15-40 seconds.

from web_scout import run_web_research

result = await run_web_research(
    query="What regulations protect mangrove ecosystems in Southeast Asia?",
    models={
        "web_researcher": "gemini/gemini-3.0-flash-preview",
        "content_extractor": "gemini/gemini-3.0-flash-preview",
    },
)
print(result.synthesis)   # Coherent narrative with citations
print(result.scraped)     # Full extracted content from each source
print(result.queries)     # Every search query that was executed

Core Strengths

1. Automatically Gets the Right URLs

You don't need to manually feed it links. You pass a high-level question, and the tool:

  • Uses an LLM to generate targeted search engine queries.
  • Executes searches via Serper or DuckDuckGo.
  • Interleaves and ranks the resulting URLs.
  • Automatically evaluates whether the scraped content fully answers the query. If there are gaps, it first checks the unscraped backlog of search results already collected, scraping promising ones before running new searches. Only if the backlog looks unpromising does it generate new targeted queries (every query it ran is recorded in the result, as the snippet below shows).
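
For example, every generated query is recorded in the result, so you can see what the tool searched for across iterations (a minimal sketch; result comes from the call shown above):

for q in result.queries:
    print(q.query, "->", q.num_results_returned, "results")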

2. Automatically Handles Complex File Types

Most web scrapers break on PDFs or single-page applications. web-scout-ai seamlessly handles all of the following (see the example after the list):

  • Static HTML (fast HTTP fetches)
  • JS-rendered SPAs (headless Playwright browser)
  • Real Documents (PDFs, DOCX, PPTX, XLSX via docling)
  • Scanned PDFs (vision LLM fallback to extract text from screenshots)
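
All of these formats flow through the same call. For instance, a PDF report can be fed in directly (a minimal sketch; the URL is a placeholder and the optional vision_fallback model is only used when a PDF has no text layer):

result = await run_web_research(
    query="summarize the key findings of this report",
    models={
        "web_researcher": "gemini/gemini-2.0-flash",
        "content_extractor": "gemini/gemini-2.0-flash",
        "vision_fallback": "gemini/gemini-2.0-flash",  # optional: only used for scanned PDFs
    },
    direct_url="https://example.org/annual-report.pdf",
)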

3. Plug-and-Play Tool for Any Agent

Designed from the ground up to be called by other AI agents, not just as a standalone script.

  • One Function Call: A single run_web_research async function.
  • Structured Output: Returns a predictable Pydantic model (WebResearchResult).
  • Framework Agnostic: Works flawlessly with OpenAI Agents SDK, LangChain, LlamaIndex, or your custom agent loop.
  • Model Agnostic: Uses LiteLLM under the hood, so it works with OpenAI, Anthropic, Gemini, Groq, Mistral, or local models.

4. Multiple Content Extraction Methods

The tool doesn't just rely on search. It supports multiple ways to gather content:

  • Open Web Search: Queries search engines and scrapes the best results.
  • Domain-Restricted Search: Limits searches to specific websites (e.g., only iucn.org or un.org).
  • Direct URL Extraction: Skips the search step entirely and extracts and synthesizes content from a specific webpage or document link.

Why web-scout-ai?

The problem with existing tools

Tool                       | What you get                            | What's missing
Tavily / Exa               | Search snippets via proprietary API     | No actual page content. No synthesis. Vendor lock-in. Paid per query.
Jina Reader                | Single URL to markdown                  | No search. No multi-source reasoning. No synthesis.
Firecrawl                  | Single URL to markdown (paid SaaS)      | No search. No synthesis. Requires hosting or SaaS subscription.
ScrapeGraphAI              | LLM-driven single-page extraction       | No web search. No cross-source synthesis. Single page at a time.
GPT-Researcher             | Deep multi-agent reports (2000+ words)  | 1-3+ minutes per query. $0.05-0.10+ per report. Heavy LangChain dependency. Overkill for most questions.
LangChain/LlamaIndex tools | Building blocks you wire together       | No integrated pipeline. You build and maintain the glue code.

What web-scout-ai does differently

It actually reads the pages. Search APIs return 200-character snippets. web-scout-ai scrapes each page, extracts the relevant content with a dedicated LLM sub-agent, and returns ~5,000 characters of focused, query-relevant content per source — not just a snippet.

It handles real documents. PDFs, DOCX, PPTX, XLSX — not just HTML. Government reports, academic papers, UN documents — the kind of sources that matter for serious research but that most tools silently skip.

It closes the loop. Search → Scrape → Evaluate → Iterate → Synthesize. If the first round of sources doesn't fully answer the query, the evaluator first inspects the unscraped backlog of search results already collected. If any look promising for the missing information, they are scraped next — no new search round needed. Only if the backlog is unhelpful does it generate new targeted queries. Most tools stop after search.

It's deterministic. No unbounded agentic loops. No unpredictable costs. The pipeline has a fixed structure with circuit breakers at every stage. You know what it will do and what it will cost.

It works with any LLM. OpenAI, Anthropic, Google Gemini, Mistral, Groq, DeepSeek, Together, local models via Ollama — anything LiteLLM supports. No vendor lock-in. Mix and match providers across pipeline steps.

It's a single function call. Designed as a plug-and-play tool for AI agents, not a framework you need to learn. One function, one result type, zero configuration beyond model names.

How it works

An editable diagram of the full pipeline is available in pipeline-diagram.excalidraw. Open it at excalidraw.com or with the VS Code Excalidraw extension.

Query
 │
 ├─ Generate search queries (LLM)
 ├─ Search the web (Serper or DuckDuckGo)
 ├─ Select best URLs (interleaved from multiple queries)
 ├─ Scrape & extract in parallel
 │   ├─ Static HTML → fast HTTP fetch (no browser)
 │   ├─ JS/SPA pages → Playwright browser
 │   ├─ PDFs → docling (text layer, no OCR)
 │   ├─ DOCX/PPTX/XLSX → docling
 │   └─ Scanned PDFs → vision LLM fallback (screenshot → extract)
 ├─ Evaluate coverage (LLM) — are there gaps?
 │   ├─ If gaps + backlog looks promising → scrape backlog URLs (skip new search)
 │   └─ If gaps + backlog is unpromising → generate targeted queries → search & scrape again
 ├─ Synthesize findings (LLM)
 │
 └─ WebResearchResult
      ├─ synthesis: str (coherent answer)
      ├─ scraped: list[UrlEntry] (sources with full extracted content)
      ├─ scrape_failed: list[UrlEntry]
      ├─ snippet_only: list[UrlEntry]
      └─ queries: list[SearchQuery]

Every step has timeouts, circuit breakers, and deduplication. URL validation (HEAD + GET) skips dead links, paywalls, binary files, and blocked domains before any expensive processing starts.
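
Failed or skipped sources are not silently dropped; they show up in the result fields described under Output structure below. A quick check, assuming result comes from any of the examples in this README:

print(f"{len(result.scraped)} scraped, {len(result.scrape_failed)} failed, {len(result.snippet_only)} snippet-only")
for entry in result.scrape_failed:
    print("failed:", entry.url)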

Installation

pip install web-scout-ai
web-scout-setup

The first command installs all dependencies, including document extraction (PDF, DOCX, PPTX, XLSX via docling) and both search backends (Serper and DuckDuckGo). The second command installs the Chromium browser needed for scraping JS-rendered pages.

Quick start

import asyncio
from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="What are the main threats to coral reefs worldwide?",
        models={
            "web_researcher": "gemini/gemini-2.0-flash",
            "content_extractor": "gemini/gemini-2.0-flash",
        },
    )

    print(result.synthesis)

    for source in result.scraped:
        print(f"  - {source.title}: {source.url}")

asyncio.run(main())

Configuration

Models

Pass a models dict to configure which LLM handles each pipeline step. All model strings follow the LiteLLM naming convention:

models = {
    # Required
    "web_researcher": "openai/gpt-4o",              # query generation, coverage evaluation, synthesis
    "content_extractor": "gemini/gemini-2.0-flash",  # page content extraction sub-agent

    # Optional overrides (default to web_researcher)
    "query_generator": "anthropic/claude-sonnet-4-20250514",
    "coverage_evaluator": "openai/gpt-4o-mini",
    "synthesiser": "anthropic/claude-sonnet-4-20250514",

    # Optional: vision fallback for scanned PDFs / empty JS pages
    "vision_fallback": "gemini/gemini-2.0-flash",
}

You can mix providers — e.g. use a cheap fast model for extraction and a stronger model for synthesis.
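
For instance, a mixed-provider configuration might look like this (model names are illustrative; any LiteLLM identifier works):

models = {
    "web_researcher": "openai/gpt-4o-mini",                # query generation, evaluation
    "content_extractor": "gemini/gemini-2.0-flash",        # fast, cheap page extraction
    "synthesiser": "anthropic/claude-sonnet-4-20250514",   # stronger model for the final write-up
}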

Environment variables

Set the API key for your chosen provider(s):

# Search backend
export SERPER_API_KEY="..."          # for Serper (Google results) — or use DuckDuckGo (free, no key)

# LLM providers (set the ones you use)
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."
export MISTRAL_API_KEY="..."
export GROQ_API_KEY="..."
# ... any LiteLLM-supported provider
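
If you prefer not to export keys in the shell, they can also be set from Python before calling the tool (a minimal sketch):

import os

os.environ["GEMINI_API_KEY"] = "..."   # LLM provider key
os.environ["SERPER_API_KEY"] = "..."   # optional, only needed for the Serper backend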

Research modes

# 1. Open web search (default)
result = await run_web_research(
    query="latest IPCC findings on sea level rise",
    models=models,
)

# 2. Domain-restricted search — only search within specific sites
result = await run_web_research(
    query="endemic species conservation programs",
    models=models,
    include_domains=["iucn.org", "wwf.org"],
)

# 3. Direct URL extraction — skip search, extract a specific page or document
result = await run_web_research(
    query="key findings from this report",
    models=models,
    direct_url="https://example.org/biodiversity-report.pdf",
)

Search backends

# Serper (default) — Google-quality results, requires SERPER_API_KEY
result = await run_web_research(query=..., models=..., search_backend="serper")

# DuckDuckGo — zero config, no API key needed
result = await run_web_research(query=..., models=..., search_backend="duckduckgo")

Research depth

Control how thoroughly the tool searches and scrapes:

# Standard (default) — 2 iterations, up to ~10 sources
result = await run_web_research(query=..., models=..., research_depth="standard")

# Deep — 3 iterations, up to ~28 sources, more search queries per round
# Best for complex queries that need cross-referencing multiple technical sources
result = await run_web_research(query=..., models=..., research_depth="deep")

Parameter                    | Standard | Deep
Max iterations               | 2        | 3
Search queries (first round) | 3        | 5
Search queries (follow-up)   | 2        | 4
URLs scraped (first round)   | 6        | 12
URLs scraped (follow-up)     | 4        | 8

Domain expertise

Provide domain context to improve query generation and synthesis quality:

result = await run_web_research(
    query="red list status of Panthera tigris subspecies",
    models=models,
    domain_expertise="conservation biology and IUCN Red List assessments",
)

Use as an agent tool

web-scout-ai is designed to be called by AI agents. One function, structured output, async-native:

from agents import Agent, Runner, function_tool
from web_scout import run_web_research

@function_tool
async def research(query: str) -> str:
    """Search the web and return a synthesized answer with sources."""
    result = await run_web_research(
        query=query,
        models={
            "web_researcher": "gemini/gemini-2.0-flash",
            "content_extractor": "gemini/gemini-2.0-flash",
        },
        search_backend="duckduckgo",
    )
    sources = "\n".join(f"- {s.url}" for s in result.scraped)
    return f"{result.synthesis}\n\nSources:\n{sources}"

agent = Agent(
    name="researcher",
    model="gpt-4o",
    tools=[research],
    instructions="Use the research tool to answer questions with up-to-date web information.",
)
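
Running the agent is then the usual loop for your framework; with the OpenAI Agents SDK imported above, a minimal sketch looks like this:

import asyncio

async def ask():
    run = await Runner.run(agent, "What are the main threats to coral reefs worldwide?")
    print(run.final_output)

asyncio.run(ask())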

Works with any agent framework — OpenAI Agents SDK, LangChain, LlamaIndex, or your own. It's just an async function that returns a Pydantic model.

Output structure

run_web_research returns a WebResearchResult:

class WebResearchResult(BaseModel):
    synthesis: str                     # Coherent narrative answering the query
    scraped: list[UrlEntry]            # Sources with full extracted content (~5000 chars each)
    scrape_failed: list[UrlEntry]      # URLs where scraping failed
    snippet_only: list[UrlEntry]       # Search results not scraped (with snippets)
    queries: list[SearchQuery]         # All search queries executed

Each UrlEntry contains url, title, and content. Each SearchQuery contains query, num_results_returned, and domains_restricted.
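
For example, to walk through the extracted content and the executed queries from a finished run (result is any WebResearchResult):

for src in result.scraped:
    print(src.title, "-", src.url)
    print(src.content[:200], "...")   # first 200 characters of the extracted content

for q in result.queries:
    print(q.query, q.num_results_returned, q.domains_restricted)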

Requirements

  • Python >= 3.10
  • An API key for at least one LLM provider
  • (Optional) SERPER_API_KEY for Google-quality search — or use DuckDuckGo for free

License

MIT

Download files

Download the file for your platform.

Source Distribution

web_scout_ai-0.9.0.tar.gz (36.8 kB)

Uploaded Source

Built Distribution


web_scout_ai-0.9.0-py3-none-any.whl (37.3 kB)

Uploaded Python 3

File details

Details for the file web_scout_ai-0.9.0.tar.gz.

File metadata

  • Download URL: web_scout_ai-0.9.0.tar.gz
  • Upload date:
  • Size: 36.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.14.2 Darwin/25.3.0

File hashes

Hashes for web_scout_ai-0.9.0.tar.gz
Algorithm Hash digest
SHA256 63653ffb1830b5322b275ae986a944897ac4b8e73d7c13aa7c9ac7b5f6d36feb
MD5 24d3eaa56d756d5d668a26f0c3380e20
BLAKE2b-256 7c2df922102c94cd06745014418a6f57253513c9bfa6d633934929db0d39389d


File details

Details for the file web_scout_ai-0.9.0-py3-none-any.whl.

File metadata

  • Download URL: web_scout_ai-0.9.0-py3-none-any.whl
  • Upload date:
  • Size: 37.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.14.2 Darwin/25.3.0

File hashes

Hashes for web_scout_ai-0.9.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6d3e92e5854d8dae751718d0b367085352d17dd238f38ad2a1db75c9575fabb1
MD5 846597a02ea246f5b0a1ba0befac2994
BLAKE2b-256 2cc2d3875830edb6ff217c91808701beb2462b91589e1aabcd9ee535ce802904

