Agentic web research tool — smarter than search, faster than deep research. Search, scrape, and synthesize web content using LLMs.

web-scout-ai

AI-powered web research in one async call.

pip install web-scout-ai
web-scout-setup

from web_scout import run_web_research

result = await run_web_research("climate risk for agriculture in Kenya")
print(result.synthesis)

What Problem It Solves

Built-in web search tools in frameworks like the OpenAI Agents SDK return snippets — short excerpts from search results that the model has to reason from. They don't read the actual pages.

web-scout-ai goes deeper: it scrapes, converts, and extracts relevant content from real pages — static HTML, JS-rendered sites, PDFs, DOCX, and JSON endpoints. You also control exactly which sources get scraped, how deep the pipeline goes, and what counts as good enough coverage before synthesis.

No Tavily + crawl4ai + custom glue code. No open-ended agent you cannot control in production.


Real Use Cases

1. Climate and policy evidence retrieval

Query institutional sources and get a cited synthesis — not just links.

result = await run_web_research(
    "drought impact on smallholder farmers in sub-Saharan Africa",
    include_domains=["fao.org", "ipcc.ch", "worldbank.org"],
    cache=True,  # reuse successful URL source artifacts for this Python process
)

2. Rapid literature scanning

Point it at a report library or database page. It detects list pages, follows item links, and reads the actual documents.

result = await run_web_research(
    "sustainable land management technologies",
    direct_url="https://wocat.net/en/database/list/?type=technology&country=ke",
)

Quick Start

Install

pip install web-scout-ai
web-scout-setup   # installs Chromium for JS-rendered pages

First run

import asyncio
from web_scout import run_web_research

async def main():
    result = await run_web_research(
        query="What are the main threats to coral reefs worldwide?",
        models={"web_researcher": "openai/gpt-4o-mini", "content_extractor": "openai/gpt-4o-mini"},
        search_backend="serper",
        cache=True,
    )
    print(result.synthesis)
    for source in result.scraped:
        print(f"- {source.title or source.url}: {source.url}")

asyncio.run(main())

What You Get Back

class WebResearchResult(BaseModel):
    synthesis: str
    scraped: list[UrlEntry]
    scrape_failed: list[UrlEntry]
    blocked_by_policy: list[UrlEntry]
    source_http_error: list[UrlEntry]
    scraped_irrelevant: list[UrlEntry]
    bot_detected: list[UrlEntry]
    snippet_only: list[UrlEntry]
    queries: list[SearchQuery]

  • synthesis: final grounded answer with inline source citations
  • scraped: URLs successfully read, with extracted relevant content
  • scrape_failed: URLs attempted but could not be scraped
  • blocked_by_policy: URLs skipped because they match the built-in block policy
  • source_http_error: URLs that failed because the source returned HTTP/network errors
  • scraped_irrelevant: URLs that were fetched successfully but did not contain relevant content
  • bot_detected: URLs blocked by bot protection
  • snippet_only: search results kept only as snippets
  • queries: all search queries executed during the run

UrlEntry contains url, title, and content. SearchQuery contains query, num_results_returned, and domains_restricted.
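
To see at a glance how a run went, the buckets above can be tallied. This is a minimal sketch that assumes only the field names listed above; `summarize_outcomes` is a hypothetical helper, not part of the package, and the stand-in object below takes the place of a real `WebResearchResult` so the example runs without network access or API keys.

```python
from types import SimpleNamespace

# Bucket names taken from the WebResearchResult fields listed above.
URL_BUCKETS = (
    "scraped",
    "scrape_failed",
    "blocked_by_policy",
    "source_http_error",
    "scraped_irrelevant",
    "bot_detected",
    "snippet_only",
)


def summarize_outcomes(result) -> dict[str, int]:
    """Count the URLs in each outcome bucket (hypothetical helper)."""
    return {name: len(getattr(result, name, [])) for name in URL_BUCKETS}


# Stand-in for a real result; only two buckets are populated here.
fake_result = SimpleNamespace(
    scraped=[SimpleNamespace(url="https://example.org/a", title="A", content="...")],
    bot_detected=[SimpleNamespace(url="https://example.org/b", title=None, content="")],
)
counts = summarize_outcomes(fake_result)
print(counts)
```

With a real result, the same helper could be called as `summarize_outcomes(result)` right after `run_web_research(...)` returns.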


API At A Glance

result = await run_web_research(
    query="latest IPCC findings on sea level rise",
    models={
        "web_researcher": "openai/gpt-4o-mini",
        "content_extractor": "gemini/gemini-2.0-flash",
    },
    search_backend="serper",
    research_depth="standard",           # or "deep"
    include_domains=["ipcc.ch"],         # optional
    direct_url=None,                     # optional
    domain_expertise="climate science",  # optional
    allowed_domains=None,                # optional
    max_pdf_pages=50,                    # optional, default 50
    cache=False,                         # optional, reuse successful source artifacts in this Python process
)

How It Works

See the maintained flow doc: [docs/pipeline-flow.md](docs/pipeline-flow.md)

  1. Generate targeted search queries.
  2. Search the web with Serper.
  3. Triage the best URLs across result sets.
  4. Scrape and extract relevant content in parallel.
  5. After each non-final search iteration, run the coverage evaluator to decide whether the evidence actually answers the question.
  6. If coverage is still weak, either reuse promising backlog URLs or run follow-up searches.
  7. Produce a grounded synthesis with inline citations.
  8. Run a deterministic citation check before returning.
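
The control flow of steps 1 to 6 can be sketched as a small loop. This is an illustration under stated assumptions, not the package's implementation: `research_loop` and its three callables (`search`, `scrape`, `is_covered`) are hypothetical stand-ins, scraping runs sequentially here rather than in parallel, and synthesis and the citation check are left out.

```python
from typing import Callable, Optional


def research_loop(
    search: Callable[[int], list[str]],
    scrape: Callable[[str], Optional[str]],
    is_covered: Callable[[list[str]], bool],
    max_iterations: int = 2,
) -> tuple[list[str], int]:
    """Toy search/scrape/evaluate loop; returns (evidence, iterations run)."""
    evidence: list[str] = []
    iterations_run = 0
    for iteration in range(1, max_iterations + 1):
        iterations_run = iteration
        for url in search(iteration):
            text = scrape(url)
            if text:  # keep only successful extractions
                evidence.append(text)
        # Coverage is evaluated only after non-final iterations (step 5).
        if iteration < max_iterations and is_covered(evidence):
            break
    return evidence, iterations_run


# Stub inputs: two search rounds, one URL that fails to scrape.
pages = {1: ["u1", "u2"], 2: ["u3"]}
evidence, n = research_loop(
    search=lambda i: pages[i],
    scrape=lambda url: None if url == "u2" else f"text from {url}",
    is_covered=lambda ev: len(ev) >= 2,
)
print(evidence, n)
```

In this run the first round yields only one usable page, so the coverage check fails and a second round is executed.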

Research Modes

# 1) Open web research
await run_web_research(query="...", models=models, search_backend="serper")

# 2) Domain-restricted research
await run_web_research(query="...", models=models, include_domains=["iucn.org", "wwf.org"])

# 3) Direct URL extraction (skip search)
await run_web_research(query="...", models=models, direct_url="https://example.org/report.pdf")

# 4) Direct URL list-page deepening
await run_web_research(query="...", models=models, direct_url="https://wocat.net/en/database/list/?type=technology&country=ke")

If the URL is a list, index, or database page, the pipeline detects it, collects relevant item links, follows them, and takes one pagination hop when present.

How URL Outcomes Are Classified

| What happened | Result bucket | Meaning |
| --- | --- | --- |
| Scrape and extraction succeeded | scraped | The URL produced usable extracted content |
| Search result was seen but never scraped | snippet_only | Only the search snippet is kept |
| URL matched a blocked domain policy | blocked_by_policy | Skipped before normal extraction |
| Source returned HTTP/network errors | source_http_error | The source failed, not the package logic |
| Bot protection or anti-automation page detected | bot_detected | The URL was reachable but blocked |
| Page loaded but content was not useful for the query | scraped_irrelevant | Fetch succeeded, relevance failed |
| Extraction failed for other reasons | scrape_failed | Generic scrape or extraction failure |
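
The classification above can be read as an ordered series of checks. The sketch below is one plausible ordering, not the package's actual logic: `FetchOutcome` and `classify` are hypothetical names, and the precedence between checks is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FetchOutcome:
    """Hypothetical record of what happened to one URL (not a package type)."""
    attempted: bool = True
    policy_blocked: bool = False
    http_error: bool = False
    bot_page: bool = False
    extraction_ok: bool = True
    relevant: bool = True


def classify(outcome: FetchOutcome) -> str:
    """Map an outcome to one of the result bucket names listed above."""
    if outcome.policy_blocked:
        return "blocked_by_policy"
    if not outcome.attempted:
        return "snippet_only"
    if outcome.http_error:
        return "source_http_error"
    if outcome.bot_page:
        return "bot_detected"
    if not outcome.extraction_ok:
        return "scrape_failed"
    if not outcome.relevant:
        return "scraped_irrelevant"
    return "scraped"


print(classify(FetchOutcome()))
print(classify(FetchOutcome(bot_page=True)))
```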

Follow-Up Rules

| Situation | What the pipeline does next |
| --- | --- |
| direct_url is a list / index / database page | Extract ranked detail links, allow one next-page hop, then scrape selected follow-ups |
| direct_url is a document | Do not fan out into site chrome or navigation pages |
| Search mode completes a non-final iteration | Run coverage evaluation to decide whether current evidence is sufficient |
| Search mode has weak coverage but promising snippet-only URLs | Scrape backlog URLs before running new searches |
| Search mode has weak coverage and backlog looks weak | Generate follow-up search queries |
| Domain-restricted mode finds a hub page | Deepen within the same domain before broadening search |
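
For search mode, the three rules above reduce to a small decision function. This is a sketch only: `next_action`, its parameters, and the returned labels are hypothetical, and how the pipeline judges a backlog URL to be "promising" is not specified here.

```python
def next_action(coverage_ok: bool, promising_backlog: int, is_final_iteration: bool) -> str:
    """Pick the next step after a search iteration (illustrative labels)."""
    if is_final_iteration or coverage_ok:
        # Nothing more to gather: move on to synthesis.
        return "synthesise"
    if promising_backlog > 0:
        # Weak coverage, but unscraped snippet-only URLs look useful.
        return "scrape_backlog"
    # Weak coverage and nothing useful left in the backlog.
    return "follow_up_search"


print(next_action(coverage_ok=False, promising_backlog=3, is_final_iteration=False))
```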

Search Backends

await run_web_research(query=..., models=..., search_backend="serper")
  • serper: Google-quality results with rich metadata (date, rank, People Also Ask, Knowledge Graph). Requires SERPER_API_KEY — Serper is generous with free-tier limits.

Additional backends can be added by the community — see SearchBackend in [search_backends.py](src/web_scout/search_backends.py).


Research Depth

# Standard (default): usually up to ~10 sources
await run_web_research(query=..., models=..., research_depth="standard")

# Deep: usually up to ~28 sources
await run_web_research(query=..., models=..., research_depth="deep")

| Parameter | Standard | Deep |
| --- | --- | --- |
| Max iterations | 2 | 3 |
| Search queries (first round) | 3 | 5 |
| Search queries (follow-up) | 2 | 4 |
| URLs scraped (first round) | 6 | 12 |
| URLs scraped (follow-up) | 4 | 8 |
| Hub deepening cap | 10 | 15 |
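
The "~10" and "~28" source figures are consistent with the table if every iteration after the first is counted as a follow-up round. The arithmetic below is one reading of the table, not code from the package; `DEPTH_PROFILES` and `max_sources` are hypothetical names, and hub deepening is ignored.

```python
# Values copied from the table above.
DEPTH_PROFILES = {
    "standard": {"max_iterations": 2, "first_round_urls": 6, "follow_up_urls": 4},
    "deep": {"max_iterations": 3, "first_round_urls": 12, "follow_up_urls": 8},
}


def max_sources(depth: str) -> int:
    """Upper bound on scraped URLs: first round plus every follow-up round."""
    profile = DEPTH_PROFILES[depth]
    follow_up_rounds = profile["max_iterations"] - 1
    return profile["first_round_urls"] + follow_up_rounds * profile["follow_up_urls"]


print(max_sources("standard"))  # 6 + 1 * 4 = 10
print(max_sources("deep"))      # 12 + 2 * 8 = 28
```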

Caching

await run_web_research(
    query="climate adaptation finance in Kenya",
    models=models,
    cache=True,
)

When cache=True, web-scout-ai keeps a process-local in-memory cache of successful URL source artifacts:

  • lifetime: the current Python process only
  • scope: reused across multiple run_web_research(...) calls in that same process
  • cleared automatically when Python exits

What is cached:

  • successful query-agnostic page/document source content
  • successful image/scanned-PDF source payloads, which are then reprocessed per query

What is not cached:

  • query-specific extracted summaries
  • final synthesis
  • failed scrapes
  • interactive click-driven exploration results

This means the same URL can be reused across queries without being fetched again, while still producing different extracted summaries when the query changes.


Configuration

Models

Model IDs follow LiteLLM provider naming:

models = {
    # Required
    "web_researcher": "openai/gpt-4o-mini",
    "content_extractor": "gemini/gemini-2.0-flash",

    # Optional step-specific overrides (default: web_researcher)
    "query_generator": "openai/gpt-4o-mini",
    "coverage_evaluator": "openai/gpt-4o-mini",
    "synthesiser": "openai/gpt-4o-mini",

    # Optional fallback for scanned PDFs, image URLs, or empty JS pages
    "vision_fallback": "gemini/gemini-2.0-flash",
}
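
The defaulting rule in the comments above (optional step models fall back to web_researcher) can be sketched as follows. `resolve_models` is a hypothetical helper written for illustration; the package may resolve these keys differently, and `vision_fallback` is simply left unset here when it is not supplied.

```python
REQUIRED_KEYS = ("web_researcher", "content_extractor")
DEFAULT_TO_RESEARCHER = ("query_generator", "coverage_evaluator", "synthesiser")


def resolve_models(models: dict[str, str]) -> dict[str, str]:
    """Fill optional step models with the web_researcher model."""
    missing = [key for key in REQUIRED_KEYS if key not in models]
    if missing:
        raise ValueError(f"missing required model keys: {missing}")
    resolved = dict(models)  # do not mutate the caller's dict
    for step in DEFAULT_TO_RESEARCHER:
        resolved.setdefault(step, models["web_researcher"])
    return resolved


resolved = resolve_models(
    {
        "web_researcher": "openai/gpt-4o-mini",
        "content_extractor": "gemini/gemini-2.0-flash",
        "synthesiser": "openai/gpt-4o",  # explicit override is kept
    }
)
print(resolved)
```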

Domain Control

# Restrict discovery to selected domains
await run_web_research(query=..., models=..., include_domains=["fao.org", "ipcc.ch"])

# Re-allow domains that are blocked by default
await run_web_research(query=..., models=..., allowed_domains=["reddit.com"])

By default, the scraper blocks common social and video platforms. allowed_domains lets you opt specific domains back in.
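
How the two domain settings might interact can be sketched with a small filter. Everything here is illustrative: `DEFAULT_BLOCKED` is a made-up block list (the package's real defaults are not shown in this README), `is_url_permitted` is a hypothetical helper, and matching on the hostname suffix is an assumption.

```python
from urllib.parse import urlparse

# Hypothetical block list, for illustration only.
DEFAULT_BLOCKED = {"facebook.com", "youtube.com", "reddit.com"}


def _matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def is_url_permitted(url: str, include_domains=None, allowed_domains=None) -> bool:
    host = (urlparse(url).hostname or "").lower()
    # allowed_domains opts specific domains back in.
    blocked = DEFAULT_BLOCKED - set(allowed_domains or [])
    if any(_matches(host, domain) for domain in blocked):
        return False
    # include_domains, when given, restricts results to those domains.
    if include_domains:
        return any(_matches(host, domain) for domain in include_domains)
    return True


print(is_url_permitted("https://www.reddit.com/r/climate"))
print(is_url_permitted("https://www.reddit.com/r/climate", allowed_domains=["reddit.com"]))
```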


Where It Fits Best

web-scout-ai is a strong fit when you need:

  • up-to-date answers grounded in real web sources
  • multi-source synthesis without building a full deep-research stack
  • a reusable research tool inside an agent workflow
  • better handling of report libraries, list pages, and mixed web/document sources

It is probably not the right tool if you only need simple search snippets or if you want a fully autonomous long-form research agent that decides everything itself.


Requirements

  • Python >=3.10
  • API key for at least one supported LLM provider
  • SERPER_API_KEY for the Serper search backend (generous free tier)
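
A pre-flight check for the keys listed above can fail fast before any research call is made. This is a minimal sketch: `missing_env` is a hypothetical helper, and `OPENAI_API_KEY` is used only as an example of a provider key, on the assumption that an OpenAI model is configured.

```python
import os
from typing import Mapping, Optional, Sequence


def missing_env(required: Sequence[str], env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping so the result does not depend on the shell.
problems = missing_env(
    ["SERPER_API_KEY", "OPENAI_API_KEY"],
    env={"SERPER_API_KEY": "dummy-value"},
)
print(problems)
```

Calling `missing_env([...])` without the `env` argument checks the real process environment instead.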

Brand Assets

  • Full logo: [assets/web-scout-logo.svg](assets/web-scout-logo.svg)
  • Square logo mark (avatar-safe): [assets/web-scout-logo-mark.svg](assets/web-scout-logo-mark.svg)
  • Social card preview: [assets/web-scout-social-card.svg](assets/web-scout-social-card.svg)

License

MIT
