vouch
Curated AI search for agents. Bring your own sources, your own LLM. The self-hosted, agent-ready alternative to Tavily and Perplexity for trusted-source research.
vouch is a Python library that lets you register a list of trusted sites with metadata (category, description, tags), and then runs intelligent searches across them using any LLM — including local models via Ollama. The AI dynamically figures out how to search each site (no manual selectors), caches what it learns, and returns structured results ready to feed an agent.
from vouch import SearchEngine, Site
engine = SearchEngine(llm="ollama/qwen2.5:14b")
engine.add(Site("cvm.gov.br", category="regulação financeira BR"))
engine.add(Site("arxiv.org", category="papers acadêmicos", tags=["ml", "ai"]))
results = engine.search("LLM agents in 2026", depth=2)
for r in results.chunks:
print(f"[{r.source_url}] {r.title}\n{r.snippet}\n")
Table of contents
- Why vouch
- How it compares
- Installation
- Quick start
- Three ways to use
- Core concepts
- The Site model
- Search depth
- Smart Site Routing
- Selector cache
- Profile registry
- LLM configuration (BYOK)
- Using as an agent tool
- Optional features
- Configuration reference
- Architecture
- Cost and performance
- Limits and what's out of scope
- Roadmap
- Contributing
- References and prior art
What's new in 0.2
- CSS selector pinning — the AI now examines DOM context and emits reusable `{container, title, url, snippet, date, author}` selectors. Cached in SQLite; subsequent calls replay via `lxml.cssselect` — no LLM, milliseconds.
- Auto-escalation chain — http → browser → stealth. The engine learns the cheapest tier each site needs and persists it.
- Browser pool — one shared Chromium across all searches in a session. Eliminates the per-call launch overhead and the fresh-fork instability on Windows.
- Plugin model — `pip install vouch-adapter-arxiv` (or anything matching `vouch-adapter-*`, `vouch-router-*`, `vouch-profiles-*`) and the next `SearchEngine` picks it up automatically via Python entry points. See docs/plugins.md.
- Profile registry + auto-update — 24 curated profiles ship in the box (arxiv, github, huggingface, gov.uk, irs.gov, sede.agenciatributaria, jusbrasil, MDN, pypi, …). `vouch profiles update` pulls the latest community registry.
- Auto-DNS resolution — fixes wrong-host catalog entries automatically (`agenciatributaria.es` → `sede.agenciatributaria.gob.es`).
- Async API — `await engine.asearch(query, ...)` alongside the sync API.
- Result formatting — `result.to_markdown()`, `to_json()`.
- Smarter quality detector — flags megamenu shells, recursive search-page links, and no-results sentinels in PT/ES/EN.
- Probe crawl on `add()` — opt-in: the site profile is learned the moment you add it, not on the first user query.
- Package renamed from `curio` to `vouch` (the `curio` name on PyPI has belonged to David Beazley's async library since 2016).
See CHANGELOG.md for the full list.
Why vouch
The agentic AI ecosystem has two kinds of search tools:
- Open-web search (Tavily, Exa, Perplexica, Firecrawl `/search`): great for broad questions, but you can't tell them "only use these sources I trust."
- Browser agents (Stagehand, Skyvern, browser-use): great at navigating one site at a time, but they don't know which site to navigate to and don't keep a registry of your trusted sources.
vouch fills the gap in between: a persistent catalog of sites you trust, with an LLM that reads your descriptions to route each query to the right sources, a browser layer that learns how to search each site once and caches the selectors, and a clean tool interface for CrewAI / LangChain / PydanticAI / MCP.
It's what you'd build if you wanted Tavily, but for your own list of sources, using your own LLM, without paying anyone.
How it compares
| | vouch | Tavily / Exa | Perplexica | browser-use | Stagehand | Kagi Lenses |
|---|---|---|---|---|---|---|
| Curated source list with metadata | ✅ | ⚠️ per call | ❌ | ❌ | ❌ | ✅ |
| LLM routing by description | ✅ | ❌ | ❌ | ❌ | ❌ | ⚠️ |
| Dynamic search-bar discovery | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ |
| Cached selectors after first visit | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| Adjustable depth (0–3) | ✅ | ⚠️ | ⚠️ | ❌ | ❌ | ❌ |
| BYOK + local LLM (Ollama) | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ |
| Self-hosted Python `pip install` | ✅ | ❌ | ❌ TS | ✅ | ⚠️ TS-first | ❌ |
| First-class agent tool | ✅ | ✅ | ❌ | ⚠️ | ⚠️ | ❌ |
Installation
Minimal install (HTTP-only, no browser):
pip install vouch
Most users want the browser layer too:
pip install "vouch[browser]"
vouch install-browser # downloads Chromium for Playwright
Optional extras:
pip install "vouch[browser,stealth]" # patchright for harder sites
pip install "vouch[browser,ocr]" # Tesseract CAPTCHA tier (CPU, no GPU)
pip install "vouch[browser,vision]" # image handling for vision-LLM CAPTCHA
pip install "vouch[browser,monitor]" # APScheduler for change tracking
pip install "vouch[browser,server]" # FastAPI dashboard
pip install "vouch[mcp]" # MCP server (Claude Desktop, Cursor)
pip install "vouch[all]" # everything above
pip install "vouch[crewai]" # framework integrations (separate)
pip install "vouch[langchain]"
pip install "vouch[pydantic-ai]"
Requirements: Python 3.10+. Linux, macOS, or Windows. ~500MB disk after install-browser.
Quick start
30 seconds with Ollama
# 1. Make sure Ollama is running with a capable model
ollama pull qwen2.5:14b
# 2. Install vouch
pip install "vouch[browser]"
vouch install-browser
# 3. One-liner search
python -c "from vouch import search; print(search('arxiv attention transformers', sites=['arxiv.org']))"
30 seconds with a paid API
export ANTHROPIC_API_KEY=sk-ant-...
pip install "vouch[browser]"
vouch install-browser
python -c "from vouch import search; print(search('FDA drug approvals 2025', sites=['fda.gov'], llm='anthropic/claude-haiku-4-5'))"
Three ways to use
vouch has three progressive levels of API. Pick the one that fits your case — they all coexist and use the same engine underneath.
Level 1 — One-shot function
For quick scripts or trying it out:
from vouch import search
results = search(
query="regulamentação fundos 2026",
sites=["cvm.gov.br", "valor.globo.com"],
llm="ollama/qwen2.5:14b",
depth=1,
)
No catalog, no metadata, no persistence. Ephemeral. Good for python -c "...".
Level 2 — Programmatic catalog
The typical workflow. Build a catalog in code:
from vouch import SearchEngine, Site
engine = SearchEngine(
llm="ollama/qwen2.5:14b",
cache_dir="~/.vouch", # selectors and metadata persist here
)
engine.add(Site(
url="cvm.gov.br",
category="regulação financeira BR",
description="Comissão de Valores Mobiliários — regulamentos, deliberações, consultas públicas",
tags=["gov", "financeiro", "regulação"],
))
engine.add(Site(
url="arxiv.org",
category="papers acadêmicos",
description="Preprints em CS, math, physics, biology",
tags=["research", "ml", "ai"],
))
# All metadata fields are optional. The LLM router uses whatever is provided.
engine.add(Site("news.ycombinator.com"))
# Search with automatic routing
results = engine.search("transformer architectures", depth=2)
The catalog persists in SQLite (~/.vouch/catalog.db). You can engine.list_sites(), engine.remove("arxiv.org"), engine.update("arxiv.org", tags=[...]). (engine.list() still works as a deprecated alias through v0.x.)
Level 3 — YAML-declared catalog
For projects you want to version-control and share:
# sites.yaml
defaults:
llm: ollama/qwen2.5:14b
depth: 1
humanize: true
sites:
- url: cvm.gov.br
category: regulação financeira BR
description: Comissão de Valores Mobiliários
tags: [gov, financeiro]
- url: arxiv.org
category: papers acadêmicos
description: Preprints em CS, math, physics
tags: [research, ml, ai]
- url: news.ycombinator.com
# no metadata — that's fine, LLM will infer from URL
- url: site-protegido.com
behavior: stealth # use patchright for this site
rate_limit: "1/30s" # be extra polite
from vouch import SearchEngine
engine = SearchEngine.from_yaml("sites.yaml")
results = engine.search("LLM regulation")
Same engine, same SQLite cache, just a different way to declare the catalog. Good for CI, sharing with teammates, or templating per-project.
Core concepts
vouch is built around four primitives. Understanding them helps you customize what you need:
┌──────────────────────────────────────────────────────────────┐
│ SearchEngine │
│ the orchestrator — you mostly talk to this │
└──────────────────────────────────────────────────────────────┘
│
├─── Catalog ──────── persistent registry of Sites
│
├─── Router ────────── picks relevant Sites for a query
│ (uses Site metadata + LLM)
│
├─── SiteAdapter ───── per-site search executor
│ ├─ discovers selectors (LLM, first time)
│ ├─ caches selectors (SQLite, every time after)
│ └─ executes search + extraction
│
└─── Aggregator ────── ranks, dedups, returns SearchResult
Each is a Protocol — you can replace any of them. Want a custom router that uses a vector DB instead of LLM? Implement Router. Want to plug a paid scraping service? Implement SiteAdapter.
The Site model
from vouch import Site
Site(
url: str, # required: "cvm.gov.br" or "https://cvm.gov.br"
category: str | None = None, # optional: "regulação financeira"
description: str | None = None, # optional: free-text, fed to router LLM
tags: list[str] | None = None, # optional: ["gov", "financeiro"]
# Behavioral overrides (all optional)
behavior: str = "natural", # "natural" | "stealth" | "external"
rate_limit: str | None = None, # "10/min", "1/30s", etc
requires_login: bool = False, # tells engine to handle auth flow
search_url_template: str | None = None, # bypass discovery: "/search?q={query}"
# Custom hooks (advanced)
pre_search: Callable | None = None,
post_extract: Callable | None = None,
)
How the optional metadata is used
- Category and description are concatenated and fed to the router LLM when there are too many sites to search them all. The LLM reads your descriptions and picks the most relevant subset for the query. With no description, the router falls back to the URL and any cached info.
- Tags can be used as filters (`engine.search(query, only_tags=["ml"])`) and as routing hints.
- Behavior controls the browser strategy (see Optional features).
- `search_url_template` is the escape hatch: if you already know how the site's search works (`https://github.com/search?q={query}`), provide it and vouch skips discovery entirely.
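Filling such a template is plain string substitution with URL quoting, no LLM involved. A minimal sketch (the exact quoting vouch applies is an assumption):

```python
from urllib.parse import quote_plus

def fill_template(base: str, template: str, query: str) -> str:
    """Expand a '/search?q={query}' style template into a full URL."""
    return "https://" + base + template.format(query=quote_plus(query))

url = fill_template("github.com", "/search?q={query}", "llm agents 2026")
print(url)  # → https://github.com/search?q=llm+agents+2026
```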
Search depth
The depth parameter is the central knob for cost vs. quality:
| Depth | Behavior | Latency | Cost (typical) |
|---|---|---|---|
| 0 | Fetch homepage + extract recent content. No search interaction. | ~2s/site | ~$0 |
| 1 | Discover & execute search, return result snippets. | ~5–10s/site | $0.005–0.02 |
| 2 | Enter top-N results, extract full content. | ~20–40s/site | $0.05–0.15 |
| 3 | Follow internal links from results, deep crawl. | ~1–3min/site | $0.20–0.60 |
# Quick check — just see what's on the homepages
engine.search("breaking news", depth=0)
# Standard — get search results
engine.search("LLM benchmarks", depth=1)
# Research mode — full content extraction
engine.search("GPT-5 architecture details", depth=2)
# Deep dive — agent navigates internal links
engine.search("complete history of CVM rule 175", depth=3)
Costs above assume Claude Haiku or Gemini Flash. With Ollama: $0 for everything except hardware/electricity.
Smart Site Routing
When your catalog grows beyond ~5 sites, searching all of them for every query wastes time and tokens. vouch uses an LLM to route: given a query and your catalog metadata, pick the top-K most relevant sites.
engine = SearchEngine(
llm="ollama/qwen2.5:14b",
router_top_k=3, # search top 3 sites per query
router_strategy="llm", # "llm" | "embedding" | "all" | "tags"
router_explain=True, # log why each site was chosen
)
Strategies
- `llm` (default): single LLM call reads catalog + query, returns ranked sites. ~200 tokens, ~$0.0001 with Haiku, free with Ollama. Most accurate.
- `embedding`: pre-embeds catalog descriptions; routes via cosine similarity. ~10ms, no LLM call. Cheaper at scale, slightly less accurate. Uses `sentence-transformers` if installed.
- `all`: searches every site every time. Honest fallback for small catalogs (≤5 sites).
- `tags`: filter by tag match. Useful when the query domain is explicit.
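The intuition behind the `embedding` strategy can be shown without a model: score each catalog description against the query and keep the best match. Here a bag-of-words cosine stands in for real sentence embeddings:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

catalog = {
    "arxiv.org": "academic papers machine learning ai research",
    "cvm.gov.br": "brazilian financial regulation government rules",
}
query = "machine learning research papers"
qv = Counter(query.split())
ranked = sorted(catalog, key=lambda s: -cosine(qv, Counter(catalog[s].split())))
print(ranked[0])  # → arxiv.org
```

Real embeddings replace the `Counter` vectors with dense model outputs, but the ranking step is the same cosine-and-sort.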
# Examples of router behavior
engine.search("Resolução CVM 175")
# → routes to: cvm.gov.br, valor.globo.com (skips arxiv, hacker news)
engine.search("attention is all you need paper")
# → routes to: arxiv.org (skips financial sites)
engine.search("python async best practices")
# → routes to: news.ycombinator.com (skips arxiv if no description match)
Inspect what the router did:
results = engine.search("LLM eval methods")
print(results.routing_decisions)
# [
# RouteDecision(site="arxiv.org", score=0.92, reason="academic ML papers"),
# RouteDecision(site="news.ycombinator.com", score=0.61, reason="developer discussions"),
# ]
Selector cache
The first time vouch visits a site, it spends an LLM call to figure out the search interface (where's the input? where's the submit? how do results render?). It caches what it learns in SQLite. Subsequent searches on the same site use Playwright directly, with no LLM in the loop — milliseconds, $0.
The cache is keyed by (domain, dom_fingerprint). If the site redesigns and the cached selectors fail, vouch invalidates the entry and re-discovers automatically.
# First search on cvm.gov.br: ~8s, $0.01 (LLM discovery)
# Second search on cvm.gov.br: ~2s, $0 (cached)
# After 1000 searches: still ~2s, still $0
engine.cache_stats()
# {
# "cvm.gov.br": {"discovered": "2026-05-09", "hits": 47, "fails": 0},
# "arxiv.org": {"discovered": "2026-05-10", "hits": 12, "fails": 1},
# }
# Force re-discovery for a site
engine.invalidate_cache("cvm.gov.br")
The cache lives at ~/.vouch/selectors.db — a tiny SQLite file you can ship in version control if you want a head-start for teammates.
Profile registry
vouch ships with curated profiles for popular sites — preconfigured Site objects with the right search_url_template, tags, and category. You don't have to figure out arxiv's search URL or how to phrase HuggingFace's description.
from vouch import SearchEngine, get_profile, list_profiles
engine = SearchEngine(llm="ollama/qwen2.5:14b")
# List what's bundled
list_profiles()
# → ['arxiv.org', 'bbc.com', 'crates.io', 'developer.mozilla.org',
# 'docs.python.org', 'elpais.com', 'en.wikipedia.org', 'es.wikipedia.org',
# 'folha.uol.com.br', 'github.com', 'gov.uk', 'huggingface.co', 'irs.gov',
# 'jusbrasil.com.br', 'news.ycombinator.com', 'npmjs.com',
# 'paperswithcode.com', 'pkg.go.dev', 'planalto.gov.br', 'pt.wikipedia.org',
# 'pypi.org', 'reddit.com', 'sede.agenciatributaria.gob.es',
# 'stackoverflow.com']
# Use a profile directly
engine.add(get_profile("arxiv.org"))
engine.add(get_profile("github.com"))
engine.add(get_profile("jusbrasil.com.br")) # for BR legal/tax queries
results = engine.search("LLM benchmarks 2026")
Or via CLI:
vouch profiles list # show bundled profiles
vouch profiles show arxiv.org # see one in detail
vouch profiles import arxiv.org,github.com,huggingface.co
vouch profiles import all # add everything to your local catalog
The profile YAML lives at vouch/profiles/builtin.yaml. Community contributions via PR welcome — see the new_profile issue template.
A separate community-maintained registry (vouch-profiles repo) is on the roadmap; teams can publish their domain-specific profiles there and run vouch profiles update to pull them. The update command needs internet access — it falls back silently to the bundled profiles on any network failure, so air-gapped installs keep working.
LLM configuration (BYOK)
vouch uses LiteLLM underneath, which means any of 100+ providers work with the same syntax.
Local (Ollama, LM Studio)
Recommended models:
- `ollama/qwen2.5:14b` — best quality/size balance, runs in 16GB RAM
- `ollama/qwen2.5:7b` — faster, lower hardware requirements
- `ollama/qwen2.5-vl:7b` — multimodal, needed for vision features
- `ollama/llama3.2:8b` — alternative
engine = SearchEngine(llm="ollama/qwen2.5:14b")
# Or split: cheap model for routing, capable model for extraction
engine = SearchEngine(
llm="ollama/qwen2.5:14b", # discovery + extraction
router_llm="ollama/qwen2.5:7b", # routing decisions
)
API providers
# Anthropic
engine = SearchEngine(llm="anthropic/claude-haiku-4-5")
# requires ANTHROPIC_API_KEY env var
# OpenAI
engine = SearchEngine(llm="openai/gpt-4.1-mini")
# Google
engine = SearchEngine(llm="gemini/gemini-2.5-flash")
# Mix providers
engine = SearchEngine(
llm="anthropic/claude-sonnet-4-6",
router_llm="gemini/gemini-2.5-flash-lite", # cheaper routing
vision_llm="anthropic/claude-haiku-4-5", # for CAPTCHA / visual extraction
)
Pass keys explicitly
engine = SearchEngine(
llm="anthropic/claude-haiku-4-5",
api_keys={
"anthropic": "sk-ant-...",
"openai": "sk-...",
},
)
Fallback chain
engine = SearchEngine(
llm=["anthropic/claude-haiku-4-5", "openai/gpt-4.1-mini", "ollama/qwen2.5:14b"],
)
# Tries each in order if previous fails
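The fallback semantics amount to a try-in-order loop. A self-contained sketch with a fake completion call standing in for LiteLLM:

```python
def fake_completion(model: str, prompt: str) -> str:
    """Stand-in for a LiteLLM call; simulates one provider being down."""
    if model.startswith("anthropic/"):
        raise RuntimeError("rate limited")
    return f"{model} answered"

def call_with_fallback(models: list[str], prompt: str) -> str:
    """Try each model in order; raise only if every one fails."""
    errors = []
    for model in models:
        try:
            return fake_completion(model, prompt)
        except RuntimeError as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))

answer = call_with_fallback(["anthropic/claude-haiku-4-5", "ollama/qwen2.5:14b"], "hi")
print(answer)  # → ollama/qwen2.5:14b answered
```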
Using as an agent tool
vouch's primary audience is agentic frameworks. First-class adapters:
CrewAI
from crewai import Agent, Task, Crew
from vouch.integrations.crewai import VouchSearchTool
search_tool = VouchSearchTool(catalog="sites.yaml", default_depth=2)
researcher = Agent(
role="Financial Regulation Researcher",
goal="Find recent CVM rules on fund management",
tools=[search_tool],
llm="anthropic/claude-sonnet-4-6",
)
LangChain / LangGraph
from langchain.agents import AgentExecutor
from vouch.integrations.langchain import VouchSearchTool
tools = [VouchSearchTool.from_yaml("sites.yaml")]
agent = AgentExecutor(tools=tools, ...)
PydanticAI
from pydantic_ai import Agent
from vouch.integrations.pydantic_ai import vouch_tool
agent = Agent("anthropic:claude-sonnet-4-6", tools=[vouch_tool("sites.yaml")])
MCP server (Claude Desktop, Cursor, Cline)
vouch mcp serve --catalog sites.yaml
Configure in claude_desktop_config.json:
{
"mcpServers": {
"vouch": {
"command": "vouch",
"args": ["mcp", "serve", "--catalog", "/path/to/sites.yaml"]
}
}
}
Now Claude Desktop can call search_curated_sources(query, depth) directly.
Optional features
Human-like behavior
Reduces bot-detection signals on sites with simple behavioral analytics. Does not bypass Cloudflare/DataDome/PerimeterX — see limits.
engine = SearchEngine(
llm="ollama/qwen2.5:14b",
humanize=True, # default: lognormal pauses, variable typing speed
typing_speed="natural", # "fast" | "natural" | "slow"
business_hours_only=False, # restrict to user-local business hours
)
What it does:
- Lognormal pause distribution between actions (60–250ms typing, 0.5–3s reading)
- Random typo injection with backspace correction (~2% rate)
- Optional business-hours-only execution (`business_hours_only=True`)
- Single, recent Chrome user agent applied per session (no rotation)
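The lognormal pauses are easy to picture: delays cluster around a human-ish median with a long tail of occasional slow keystrokes. A sketch with assumed parameters (vouch's actual distribution constants are not documented here):

```python
import random

def typing_pauses(text: str, mu: float = -2.2, sigma: float = 0.5) -> list[float]:
    """One lognormal inter-key delay in seconds per character typed."""
    return [random.lognormvariate(mu, sigma) for _ in text]

random.seed(7)  # deterministic demo
pauses = typing_pauses("llm agents")
median_ms = sorted(pauses)[len(pauses) // 2] * 1000
print(f"median inter-key delay ≈ {median_ms:.0f} ms")  # clusters near exp(mu) ≈ 110 ms
```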
Stealth mode (per site)
Uses Patchright, a drop-in replacement for Playwright. Helps with mild Cloudflare Bot Fight Mode and fingerprint checks. Does not defeat invisible Turnstile or DataDome.
engine.add(Site("portal-protegido.com", behavior="stealth"))
# OR globally:
engine = SearchEngine(default_behavior="stealth")
CAPTCHA assist (lightweight by default)
Optional, opt-in, multi-backend. Two tiers used in order from cheapest to heaviest:
1. Tesseract OCR — for distorted-text CAPTCHAs (Receita, CVM, Jusbrasil, agencia tributaria, gov.uk). CPU-only, ~50-200 ms per image, $0, no GPU, no model download.
pip install "vouch[ocr]"
# plus the system binary:
# Ubuntu/Debian: sudo apt install tesseract-ocr tesseract-ocr-por tesseract-ocr-spa
# macOS: brew install tesseract tesseract-lang
# Windows: https://github.com/UB-Mannheim/tesseract/wiki
That's already enough for the common text-CAPTCHA case — no vision_llm needed.
2. Vision LLM — upgrade for image-grid CAPTCHAs ("select all traffic lights").
engine = SearchEngine(
llm="ollama/qwen2.5:14b",
vision_llm="ollama/qwen2.5vl:7b", # ~5 GB, enables image-grid solving
captcha_min_confidence=0.7,
captcha_max_attempts=2,
)
The CaptchaSolver chains them automatically: Tesseract first → vision LLM on low confidence → return best. result.solver tells you which one produced the answer.
Supported types: text OCR, reCAPTCHA v2 image grid, hCaptcha image grid (~50–80% success rate).
Not supported: reCAPTCHA v3 (invisible), Cloudflare Turnstile, Arkose 3D puzzles. These return status="blocked".
# When solver fails or type is unsupported:
result.status # "blocked"
result.reason # "captcha_unsupported"
result.suggestion # "Use the official API: https://..."
Change monitoring
Watch sites for changes, get notified.
from vouch import Monitor
monitor = Monitor(engine)
monitor.watch(
site="cvm.gov.br",
query="resoluções recentes",
interval="1h",
notify_via="email", # or "slack", "discord", "webhook"
)
monitor.start()
Uses 3-tier change detection: hash → diff → embedding similarity. Avoids LLM calls when content is unchanged, keeping monitoring cheap (~$1–3/month for 100 sites hourly with Gemini Flash-Lite).
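The first two tiers of that chain can be sketched with the standard library: a content hash short-circuits identical pages, then a diff ratio filters near-duplicates. The embedding tier is omitted here and the threshold is an assumed value:

```python
import difflib
import hashlib

def changed(old: str, new: str, threshold: float = 0.95) -> bool:
    """Tier 1: hash equality. Tier 2: similarity ratio. True means 'notify'."""
    if hashlib.sha256(old.encode()).digest() == hashlib.sha256(new.encode()).digest():
        return False  # byte-identical, no further work needed
    ratio = difflib.SequenceMatcher(None, old, new).ratio()
    return ratio < threshold  # tiny edits (boilerplate churn) are ignored

page = "Resolução CVM 175 ..." * 20
print(changed(page, page))                        # → False (tier 1)
print(changed(page, "totally different content")) # → True  (tier 2)
```

Only pages that pass both cheap tiers would ever reach an LLM, which is what keeps the monthly cost low.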
FastAPI dashboard
vouch serve --port 8080
# → web UI at http://localhost:8080
# → manage catalog, view cache, run searches, see costs
Configuration reference
SearchEngine(...)
SearchEngine(
# LLM
llm: str | list[str] = "ollama/qwen2.5:14b",
router_llm: str | None = None, # defaults to llm
vision_llm: str | None = None, # image-grid CAPTCHA tier disabled if None
api_keys: dict | None = None,
# Routing
router_strategy: str = "llm", # "llm" | "embedding" | "all" | "tags"
router_top_k: int = 3,
router_explain: bool = False,
# Search
default_depth: int = 1,
parallel_sites: int = 3, # how many sites to search concurrently
# Browser
default_behavior: str = "natural", # "natural" | "stealth" | "external"
humanize: bool = True,
typing_speed: str = "natural",
headless: bool = True,
# Caching
cache_dir: str = "~/.vouch",
cache_ttl_days: int = 30,
# Politeness
respect_robots_txt: bool = True,
default_rate_limit: str = "2/min",
user_agent: str | None = None, # defaults to recent Chrome
# Captcha solver tuning
captcha_min_confidence: float = 0.7,
captcha_max_attempts: int = 2,
# Misc
verbose: bool = False,
)
engine.search(...)
engine.search(
query: str,
depth: int | None = None, # overrides default_depth
sites: list[str] | None = None, # restrict to specific sites
only_tags: list[str] | None = None,
exclude_tags: list[str] | None = None,
max_results: int = 10,
timeout: float = 60.0,
) -> SearchResult
SearchResult
SearchResult(
query: str,
chunks: list[Chunk],
sites_searched: list[str],
routing_decisions: list[RouteDecision],
duration_ms: float,
cost_estimate_usd: float,
tokens_used: dict[str, int], # {"input": ..., "output": ...}
cache_stats: dict, # {"hits": ..., "misses": ...}
)
Chunk(
source_url: str,
site: str, # "cvm.gov.br"
site_category: str | None,
title: str,
snippet: str, # ~200 chars
content: str | None, # full markdown if depth >= 2
relevance_score: float, # 0–1
extracted_at: datetime,
metadata: dict, # extra fields per adapter
)
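To show what rendering a result typically amounts to, here is a consumer-side sketch using a local stand-in dataclass rather than vouch's own `Chunk`:

```python
from dataclasses import dataclass

@dataclass
class Chunk:  # local stand-in mirroring the fields listed above
    source_url: str
    title: str
    snippet: str

def to_markdown(chunks: list[Chunk]) -> str:
    """Render chunks as one markdown section per result."""
    return "\n\n".join(f"### [{c.title}]({c.source_url})\n{c.snippet}" for c in chunks)

md = to_markdown([Chunk("https://arxiv.org/abs/1706.03762",
                        "Attention Is All You Need",
                        "We propose the Transformer...")])
print(md)
```

This is the shape of output an agent can drop straight into a context window: one heading-link per source plus its snippet.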
Architecture
┌──────────────────────────────────────────────────────────────────────┐
│ SearchEngine │
└──────────────────────────────────────────────────────────────────────┘
│
┌──────────────────┴──────────────────┐
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Catalog │ │ Router │
│ (SQLite) │ ─── metadata ────▶ │ (LLM/emb) │
└──────────────┘ └──────────────┘
│
│ ranked sites
▼
┌────────────────────┐
│ parallel fetch │
└────────────────────┘
│
┌───────────────────────────────┼───────────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ SiteAdapter │ │ SiteAdapter │ │ SiteAdapter │
│ cvm.gov.br │ │ arxiv.org │ │ news.yc.com │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ selector │ │ │ │ selector │ │ │ │ selector │ │
│ │ cache │ │ │ │ cache │ │ │ │ cache │ │
│ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │
│ │ MISS │ │ │ HIT │ │ │ HIT │
│ ▼ │ │ ▼ │ │ ▼ │
│ discover │ │ playwright │ │ playwright │
│ via LLM │ │ replay │ │ replay │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└──────────────────────────────┴──────────────────────────────┘
│
▼
┌────────────────────┐
│ Aggregator/Rank │
└────────────────────┘
│
▼
SearchResult
Module layout
vouch/
├── __init__.py # public API: search, SearchEngine, Site
├── engine.py # SearchEngine orchestrator
├── catalog.py # Catalog (SQLite-backed Site registry)
├── models.py # Site, Chunk, SearchResult, RouteDecision
├── config.py # SearchEngine config dataclass
├── exceptions.py # VouchError + back-compat CurioError alias
├── dns_resolver.py # auto-fixes wrong-host catalog entries
├── plugins.py # entry_points-based plugin discovery
├── cli.py # typer-based CLI (vouch ...)
├── server.py # optional: FastAPI dashboard
├── _lang.py # language detection helpers
├── _llm.py # LiteLLM wrapper (sync + async)
├── router/
│ ├── base.py # Router Protocol
│ ├── llm_router.py # LLM-based routing (default)
│ ├── embedding_router.py# cosine-similarity routing
│ ├── tag_router.py # tag-match routing
│ └── all_router.py # "search every site" fallback
├── adapters/
│ ├── base.py # SiteAdapter Protocol
│ ├── http.py # depth=0, fast HTTP+Trafilatura
│ ├── browser.py # depth=1+, Playwright
│ ├── browser_pool.py # shared Chromium pool across searches
│ └── stealth.py # patchright wrapper
├── discovery/
│ ├── search_bar.py # LLM-powered selector discovery
│ ├── cache.py # SQLite selector cache
│ ├── humanize.py # lognormal pauses, typing, business hours
│ └── probe.py # opt-in profile probe on add()
├── extraction/
│ ├── trafilatura.py # HTML → markdown
│ ├── pdf.py # PDF handling (pypdf)
│ ├── llm.py # structured extraction via Pydantic
│ ├── llm_extract.py # quality detector + LLM-driven extraction
│ └── css_selectors.py # cached-selector replay (lxml.cssselect)
├── captcha/ # optional: Tesseract OCR + vision_llm CAPTCHA
│ ├── solver.py
│ └── tesseract.py
├── monitor/ # optional: change tracking + notifications
│ ├── watcher.py
│ └── notify.py
├── profiles/ # 24 curated site profiles + registry
│ ├── registry.py
│ ├── update.py
│ └── builtin.yaml
├── integrations/
│ ├── _common.py
│ ├── crewai.py # VouchSearchTool (CrewAI)
│ ├── langchain.py # VouchSearchTool (LangChain)
│ ├── pydantic_ai.py # vouch_tool (PydanticAI)
│ └── mcp.py # MCP server (Claude Desktop, Cursor)
Key dependencies
- playwright — browser automation
- litellm — multi-LLM, BYOK, fallback chains
- pydantic v2 — models everywhere
- sqlalchemy + sqlite — catalog and cache
- trafilatura — fast HTML→markdown extraction
- typer — CLI
- apscheduler — optional, for monitor
- fastapi + uvicorn — optional, for dashboard
- patchright — optional, for stealth
- sentence-transformers — optional, for embedding router
No CrewAI/LangChain/etc as hard deps. They're integration extras only.
Cost and performance
Typical numbers from internal runs; they depend heavily on LLM, network, site, and cache state, so treat them as rough orders of magnitude, not guarantees. Reproducible benchmark scripts live under examples/ if you want to validate on your stack.
Per-search costs
Typical query, depth=1, 3 sites routed, all cache hits:
| LLM | Routing | Discovery | Extraction | Total |
|---|---|---|---|---|
| Ollama qwen2.5:14b | $0 | $0 (cached) | $0 | $0 |
| Gemini Flash-Lite | $0.0001 | $0 (cached) | $0.001 | ~$0.001 |
| Claude Haiku 4.5 | $0.0005 | $0 (cached) | $0.005 | ~$0.005 |
| GPT-4.1-mini | $0.0003 | $0 (cached) | $0.003 | ~$0.003 |
First search on a new site adds ~$0.01–0.05 for discovery (one-time per site).
Latency
- HTTP-only fetch (depth=0): ~1–3s/site
- Cached browser search (depth=1): ~2–5s/site
- First-time discovery (depth=1): ~8–15s/site
- Full content extraction (depth=2): ~15–40s/site
Searches across multiple sites run in parallel (controlled by parallel_sites).
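That fan-out maps naturally onto an asyncio semaphore. A hedged sketch of the pattern (site names and delays are made up):

```python
import asyncio

async def search_site(site: str, sem: asyncio.Semaphore) -> str:
    async with sem:                # at most parallel_sites requests in flight
        await asyncio.sleep(0.01)  # stand-in for browser/HTTP work
        return f"{site}: ok"

async def fan_out(sites: list[str], parallel_sites: int = 3) -> list[str]:
    sem = asyncio.Semaphore(parallel_sites)
    return await asyncio.gather(*(search_site(s, sem) for s in sites))

results = asyncio.run(fan_out(["cvm.gov.br", "arxiv.org", "news.ycombinator.com"]))
print(results)  # → ['cvm.gov.br: ok', 'arxiv.org: ok', 'news.ycombinator.com: ok']
```

`asyncio.gather` preserves input order, so results line up with the routed site list even though the work overlaps.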
Hardware requirements
- Pure HTTP mode: any machine with Python
- Browser mode: 4GB RAM minimum, 8GB recommended
- Local LLM mode: depends on model. Qwen2.5:14b needs ~16GB RAM (CPU) or ~10GB VRAM
- Vision mode: Qwen2.5-VL 7B needs ~8GB VRAM
Limits and what's out of scope
vouch is opinionated about what it tries to do and what it doesn't.
What's in scope:
- Public sites that respond well to standard web requests
- Sites where you want to discover search behavior dynamically
- Sites you've explicitly added to your catalog
- Polite, rate-limited, robots.txt-respecting access
- Educational, research, regulatory, professional use cases
What's explicitly out of scope:
- Bypassing Cloudflare's modern bot protection (invisible Turnstile, aggressive Bot Fight Mode). vouch includes humanize and stealth options that help with mild detection, but no open-source library reliably defeats current Cloudflare. For those cases, plug in a commercial bypass service (Browserbase, Bright Data, ZenRows) via the `behavior="external"` site option.
- Mass scraping — vouch is built for "search across my list of trusted sources," not "scrape the entire web."
- CAPTCHA solving as a sales pitch. The optional vision-LLM CAPTCHA assist is best-effort and only works on simple visual challenges. It won't (and shouldn't) work on adversarial challenges.
- Logged-in personal accounts (LinkedIn, Instagram, etc.) — even when technically possible, this is usually a ToS violation. vouch supports authentication for sites where it's legitimate (gov portals, enterprise dashboards you own), not for this.
This isn't a limitation we apologize for — it's the position. A library that promises to bypass everything breaks every week and isn't sustainable as open-source.
Roadmap
Shipped (see CHANGELOG.md for the full list):
- v0.1 — Levels 1-3 API, HTTP + browser adapters, LLM router, selector cache, BYOK via LiteLLM, CrewAI + LangChain integrations, CLI.
- v0.2 — embedding/tag/all routers, vision-LLM + Tesseract CAPTCHA assist, MCP server, PydanticAI integration, browser pool, plugin model via entry_points, profile registry + auto-update, async API, FastAPI dashboard, stealth mode (patchright), change monitor, auto-escalation ladder, auto-DNS resolution, quality detector, CSS selector pinning, Accept-Language header.
Coming next (no committed dates — direction shaped by feedback and real use cases):
- Auth flow helpers for sites that legitimately need a login (cookie persistence, refresh-token rotation)
- Distributed search workers (multi-machine fanout for large catalogs)
- Built-in vector store for chunk reuse across queries
- Docling-backed PDF extraction
- Workflow chains (search → reason → search → extract)
- Standalone community profile registry repo (`vouch-profiles`)
- Per-site adapter packages on PyPI (`vouch-adapter-*`) as the plugin ecosystem matures
Contributing
See CONTRIBUTING.md. Quick start for dev:
git clone https://github.com/evertonsjjj/vouch
cd vouch
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,all]"
playwright install chromium
pytest
Issues, PRs, ideas welcome. Especially:
- New `SiteAdapter` implementations (site-specific optimizations)
- New `Router` strategies
- Integrations with other agent frameworks
- Translations of docs
References and prior art
vouch stands on the shoulders of these projects. Read their docs — most of them solve adjacent problems brilliantly.
Direct inspiration
- Tavily & Exa — defined the "search API for agents" category. vouch is "Tavily, but for your own sources."
- Kagi Lenses & Brave Goggles — proved curated-source search is a real product. vouch brings it to OSS.
- Skyvern — pioneered "discover-then-cache" for browser automation. vouch applies the same pattern to search specifically.
- Stagehand — the `act()`/`observe()` API inspired the discovery layer.
Building blocks used
- Playwright — browser automation
- browser-use — DOM-to-LLM serialization patterns
- Crawl4AI — extraction strategies, JsonCssExtractionStrategy
- LiteLLM — multi-provider LLM client
- Trafilatura — HTML to clean text
- Patchright — stealth Playwright
- Camoufox — for serious anti-fingerprinting
- semantic-router — embedding-based routing patterns
- APScheduler — for the optional monitor module
- Apprise — 100+ notification channels for monitor
Adjacent projects worth knowing
- Perplexica — open-source Perplexity clone (open-web)
- gpt-researcher — research agent (open-web)
- local-deep-research — local research bot (engine-based, not site-based)
- changedetection.io — site change tracking (no LLM extraction)
- bdambrosio/llmsearch — small precedent for source-curated LLM search
- Firecrawl — `/search` endpoint backed by Google/Bing
- ScrapeGraphAI — LLM-driven scraping
- SearXNG — federated meta-search engine
Reading
- Stagehand caching internals: https://www.browserbase.com/blog/stagehand-caching
- Skyvern adaptive caching: https://github.com/Skyvern-AI/skyvern (recent PRs)
- Anti-detection landscape 2026: https://github.com/pim97/anti-detect-browser-tools-tech-comparison
- LLM Gateway comparison: https://github.com/BerriAI/litellm
License
MIT — see LICENSE.
Acknowledgments
Built by @evertonsjjj. Early-stage open source — looking for real-world feedback, contributors, and battle-tested use cases. The gap this fills (curated-source search for agents) is one I needed myself, but I'd rather hear what's wrong with it from people using it than guess in isolation.
If vouch helps you, a ⭐ on GitHub goes a long way.