
Agentic web research tool — smarter than search, faster than deep research. Search, scrape, and synthesize web content using LLMs.

Project description

web-scout-ai

Grounded web research for agents and apps.
One async call to discover sources, read real pages and documents, close coverage gaps, and return a cited synthesis.

Why This Exists

Most web tools stop too early.

  • Search APIs give you snippets and links, not enough context to answer reliably.
  • Single-page scrapers can read one URL well, but they do not know what to read next.
  • Full deep-research agents often produce good work, but they can be slower, more expensive, and harder to control in production flows.

web-scout-ai is the middle path: a deterministic research pipeline that is stronger than search-only tooling and much lighter than open-ended research agents.

What It Actually Does

web-scout-ai does not just search and summarize. It runs a full research loop:

  1. Generate targeted search queries.
  2. Search the web with Serper or DuckDuckGo.
  3. Triage the best URLs across result sets.
  4. Scrape and extract relevant content in parallel.
  5. Evaluate whether the evidence actually answers the question.
  6. Reuse promising backlog URLs or run follow-up searches if coverage is still weak.
  7. Produce a grounded synthesis with inline citations.
  8. Run a deterministic citation check before returning the final answer.

That gives you a practical balance of depth, speed, and control in one function: run_web_research(...).
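
Because the whole loop lives behind a single awaitable, you can bound it with ordinary asyncio tooling when you need tighter control in production. A minimal sketch; the 120-second budget is arbitrary and the model names are the same placeholders used elsewhere in this README:

import asyncio
from web_scout import run_web_research

async def bounded_research(query: str, timeout_s: float = 120.0):
    # Cancel the whole research run if it exceeds the time budget.
    try:
        return await asyncio.wait_for(
            run_web_research(
                query=query,
                models={
                    "web_researcher": "openai/gpt-5.4-mini",
                    "content_extractor": "gemini/gemini-3-flash-preview",
                },
                search_backend="duckduckgo",
            ),
            timeout=timeout_s,
        )
    except asyncio.TimeoutError:
        return None  # caller decides how to degrade, e.g. fall back to plain search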

Why It Feels Different

It reads sources, not snippets

Each selected URL is scraped and converted into a substantial query-relevant extract before synthesis.

It handles messy real-world content

  • Static HTML via fast HTTP
  • JS-rendered pages via Playwright
  • JSON endpoints via structured extraction
  • Image URLs via optional vision extraction
  • PDF, DOCX, PPTX, XLSX via docling, including extensionless download URLs detected from response headers
  • Bot-protected PDFs (e.g. Akamai) via Playwright browser download fallback
  • Short metadata/catalogue pages retained for extractor inspection instead of being dropped as thin pages
  • Scanned PDFs and empty JS pages via optional vision fallback

It can deepen automatically

If a direct URL is actually a list or database page, web-scout-ai can detect that, follow relevant item links, and even take one pagination hop.

It is easy to plug into agents

You get one async entry point, typed output, provider flexibility via LiteLLM, and no framework lock-in.

Quick Start

pip install web-scout-ai
web-scout-setup

web-scout-setup installs the Chromium browser required for JS-rendered pages.

First Run

This example uses DuckDuckGo so it works without a search API key.

import asyncio
from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="What are the main threats to coral reefs worldwide?",
        models={
            "web_researcher": "openai/gpt-5.4-mini",
            "content_extractor": "gemini/gemini-3-flash-preview",
        },
        search_backend="duckduckgo",
    )

    print(result.synthesis)
    print("\nSources:")
    for source in result.scraped:
        print(f"- {source.title or source.url}: {source.url}")

asyncio.run(main())

What You Get Back

class WebResearchResult(BaseModel):
    synthesis: str
    scraped: list[UrlEntry]
    scrape_failed: list[UrlEntry]
    bot_detected: list[UrlEntry]
    snippet_only: list[UrlEntry]
    queries: list[SearchQuery]

  • synthesis: final grounded answer with inline source citations
  • scraped: URLs successfully scraped, with extracted relevant content
  • scrape_failed: URLs that were attempted but could not be scraped
  • bot_detected: URLs blocked by bot protection
  • snippet_only: search results kept only as snippets
  • queries: all search queries executed during the run

UrlEntry contains url, title, and content. SearchQuery contains query, num_results_returned, and domains_restricted.
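
Because every field is typed, downstream handling is plain attribute access. A small sketch that reports how each URL fared and collects failures for a retry queue; the retry logic itself is left to you:

import asyncio
from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="What are the main threats to coral reefs worldwide?",
        models={
            "web_researcher": "openai/gpt-5.4-mini",
            "content_extractor": "gemini/gemini-3-flash-preview",
        },
        search_backend="duckduckgo",
    )

    print(f"Scraped {len(result.scraped)} sources, "
          f"{len(result.snippet_only)} kept as snippets only.")

    for entry in result.bot_detected:
        print(f"Blocked by bot protection: {entry.url}")

    # scrape_failed entries still carry url and title, so they can feed a retry queue.
    retry_urls = [entry.url for entry in result.scrape_failed]
    print(f"{len(retry_urls)} URLs queued for retry")

    # queries records every search executed during the run.
    for q in result.queries:
        print(f"{q.query!r} -> {q.num_results_returned} results")

asyncio.run(main())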

API At A Glance

result = await run_web_research(
    query="latest IPCC findings on sea level rise",
    models={
        "web_researcher": "openai/gpt-5.4-mini",
        "content_extractor": "gemini/gemini-3-flash-preview",
    },
    search_backend="duckduckgo",         # or "serper"
    research_depth="standard",           # or "deep"
    include_domains=["ipcc.ch"],         # optional
    direct_url=None,                     # optional
    domain_expertise="climate science",  # optional
    allowed_domains=None,                # optional
    max_pdf_pages=50,                    # optional, default 50
)

Research Modes

# 1) Open web research
await run_web_research(
    query="latest IPCC findings on sea level rise",
    models=models,
    search_backend="duckduckgo",
)

# 2) Domain-restricted research
await run_web_research(
    query="endemic species conservation programs",
    models=models,
    include_domains=["iucn.org", "wwf.org"],
)

# 3) Direct URL extraction (skip search)
await run_web_research(
    query="key findings from this report",
    models=models,
    direct_url="https://example.org/biodiversity-report.pdf",
)

# 4) Direct URL list-page deepening
await run_web_research(
    query="sustainable land management technologies in Kenya",
    models=models,
    direct_url="https://wocat.net/en/database/list/?type=technology&country=ke",
)

Direct URL mode is more than single-page extraction

If the URL is a list, index, or database page, the pipeline can:

  • detect that it is a hub page
  • collect the most relevant item links
  • follow up to a depth-dependent cap of those links
  • take one "next page" hop when pagination is present

This is especially useful for catalog pages, result listings, and structured report libraries.
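
If you want to see what the deepening step pulled in, inspect the result as usual. The snippet below continues from the mode examples above (reusing the models dict) and assumes that followed item pages are returned in result.scraped like any other scraped source:

result = await run_web_research(
    query="sustainable land management technologies in Kenya",
    models=models,
    direct_url="https://wocat.net/en/database/list/?type=technology&country=ke",
)

# Assumption: item pages followed from the hub page show up alongside it.
for source in result.scraped:
    print(source.title or source.url)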

Search Backends

# Default: Serper (requires SERPER_API_KEY)
await run_web_research(query=..., models=..., search_backend="serper")

# Free: DuckDuckGo (no API key)
await run_web_research(query=..., models=..., search_backend="duckduckgo")

  • serper: Google-quality results with richer metadata
  • duckduckgo: zero-config and free, ideal for quick starts and lightweight usage

Research Depth

# Standard (default): usually up to ~10 sources
await run_web_research(query=..., models=..., research_depth="standard")

# Deep: usually up to ~28 sources
await run_web_research(query=..., models=..., research_depth="deep")

Parameter                      Standard   Deep
Max iterations                 2          3
Search queries (first round)   3          5
Search queries (follow-up)     2          4
URLs scraped (first round)     6          12
URLs scraped (follow-up)       4          8
Hub deepening cap              10         15
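
One practical pattern is to start at standard depth and escalate to deep only when coverage comes back thin. A sketch; the threshold of three scraped sources is arbitrary:

from web_scout import run_web_research

async def research_with_escalation(query: str, models: dict):
    result = await run_web_research(
        query=query,
        models=models,
        search_backend="duckduckgo",
        research_depth="standard",
    )
    # Arbitrary heuristic: too few scraped sources, so retry with deep settings.
    if len(result.scraped) < 3:
        result = await run_web_research(
            query=query,
            models=models,
            search_backend="duckduckgo",
            research_depth="deep",
        )
    return result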

Configuration

Models

Model IDs follow LiteLLM provider naming:

models = {
    # Required
    "web_researcher": "openai/gpt-5.4-mini",
    "content_extractor": "gemini/gemini-3-flash-preview",

    # Optional step-specific overrides (default: web_researcher)
    "query_generator": "openai/gpt-5.4-mini",
    "coverage_evaluator": "openai/gpt-5.4-mini",
    "synthesiser": "openai/gpt-5.4-mini",

    # Optional fallback for scanned PDFs, image URLs, or empty JS pages
    "vision_fallback": "gemini/gemini-3-flash-preview",
}

Environment Variables

# Search backend (optional if using DuckDuckGo)
export SERPER_API_KEY="..."

# LLM providers (set what you use)
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."
export MISTRAL_API_KEY="..."
export GROQ_API_KEY="..."
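
If you prefer not to export keys in the shell, setting them in-process before the first call works the same way; this is plain os.environ usage, nothing specific to web-scout-ai:

import os

# Equivalent to the shell exports above; set only the providers you actually use.
os.environ.setdefault("OPENAI_API_KEY", "...")
os.environ.setdefault("GEMINI_API_KEY", "...")
os.environ.setdefault("SERPER_API_KEY", "...")  # only needed for the Serper backend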

Domain Control

# Restrict discovery to selected domains
await run_web_research(
    query=...,
    models=...,
    include_domains=["fao.org", "ipcc.ch"],
)

# Re-allow domains that are blocked by default
await run_web_research(
    query=...,
    models=...,
    allowed_domains=["reddit.com"],
)

By default, the scraper blocks common social and video platforms. allowed_domains lets you opt specific domains back in when they are genuinely useful for the task.

Pipeline Overview

Editable diagram: pipeline-diagram.excalidraw

Query
 |
 +- Generate search queries (LLM)
 +- Search web (Serper or DuckDuckGo)
 +- Select best URLs across result sets
 +- Scrape and extract in parallel
 |   +- Static HTML
 |   +- JS/SPA via Playwright
 |   +- JSON endpoints via structured extraction
 |   +- Image URLs via vision extraction
 |   +- PDF/DOCX/PPTX/XLSX via docling
 |   +- Extensionless document downloads via content-type/content-disposition sniffing
 |   +- Bot-protected PDFs via Playwright download fallback
 |   +- Short metadata pages retained for linked-document follow-up
 |   +- Scanned PDFs via vision fallback
 +- Evaluate coverage (LLM)
 |   +- Reuse promising backlog URLs
 |   +- Or generate targeted follow-up searches
 +- Synthesize findings with citations (LLM)
 +- Run deterministic citation checks
 |
 +- WebResearchResult

Use As An Agent Tool

from agents import Agent, function_tool
from web_scout import run_web_research

@function_tool
async def research(query: str) -> str:
    result = await run_web_research(
        query=query,
        models={
            "web_researcher": "openai/gpt-5.4-mini",
            "content_extractor": "gemini/gemini-3-flash-preview",
        },
        search_backend="duckduckgo",
    )
    sources = "\n".join(f"- {s.url}" for s in result.scraped)
    return f"{result.synthesis}\n\nSources:\n{sources}"

agent = Agent(
    name="researcher",
    model="gpt-5.4-mini",
    tools=[research],
    instructions="Use the research tool to answer with up-to-date web sources.",
)
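
With the OpenAI Agents SDK (the source of the from agents import above), running the agent is one more call. A usage sketch, assuming that SDK:

from agents import Runner

async def ask(question: str) -> str:
    # Runner.run drives the agent loop; final_output is the agent's answer text.
    run_result = await Runner.run(agent, question)
    return run_result.final_output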

Where It Fits Best

web-scout-ai is a strong fit when you need:

  • up-to-date answers grounded in real web sources
  • multi-source synthesis without building a full deep-research stack
  • a reusable research tool inside an agent workflow
  • better handling of report libraries, list pages, and mixed web/document sources

It is probably not the right tool if you only need simple search snippets or if you want a fully autonomous long-form research agent that decides everything itself.

Requirements

  • Python >=3.10
  • API key for at least one supported LLM provider
  • Optional SERPER_API_KEY if you want the Serper backend

License

MIT

Download files

Download the file for your platform.

Source Distribution

web_scout_ai-1.0.3.tar.gz (47.5 kB)

Built Distribution

web_scout_ai-1.0.3-py3-none-any.whl (48.3 kB)

File details

Details for the file web_scout_ai-1.0.3.tar.gz.

File metadata

  • Download URL: web_scout_ai-1.0.3.tar.gz
  • Upload date:
  • Size: 47.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.14.2 Darwin/25.3.0

File hashes

Hashes for web_scout_ai-1.0.3.tar.gz
Algorithm Hash digest
SHA256 7d97a38dd6d7b041cd2dfccc5bf6ab9eb0a8d63d105545794349ceb8c01b6a7f
MD5 1010fcc0e3dead0b63977c35c38cfa9a
BLAKE2b-256 5cf3d4034f0fed8b42b180c3817c2d665bda783aea997c5f96eba86cc0cf4827

File details

Details for the file web_scout_ai-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: web_scout_ai-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 48.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.3.2 CPython/3.14.2 Darwin/25.3.0

File hashes

Hashes for web_scout_ai-1.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 34d55bf9a472f86e536a368c5e50cdae2101ff5ab922e1ebce0efde6d268c0bd
MD5 468524ee31f5d449295225992010d11c
BLAKE2b-256 a118204dd19eda7e5f0da8262319f1d67baae6e2eef2b89bc7a88d8b7b7b752c
