websearchmcp
Multi-engine web search and content extraction, exposed as an MCP server
Part of the MCP AI Suite.
Features
- Multi-engine search with priority fallback: SearXNG (self-hosted) -> DuckDuckGo -> Mojeek -> Brave
- Parallel search + Reciprocal Rank Fusion (optional) -- query all engines at once and fuse for better coverage
- Cross-encoder reranking (optional) -- reorder results by relevance to the query, no API key
- Deep rerank (optional) -- re-score the top candidates on their actual page content, not just snippets
- Search + answer (search_with_answer) -- ranked sources + a synthesized answer via a bring-your-own-LLM callback (the agent-facing surface of commercial search APIs, at zero search cost)
- Passage trimming -- return only the query-relevant passages of a page, not the whole thing (≈35% fewer tokens for the downstream LLM, with no answer loss in our benchmark)
- Content extraction via trafilatura (optional, state-of-the-art boilerplate removal) with regex/BeautifulSoup fallback
- Playwright browser_fetch for JavaScript-rendered pages with full DOM access and screenshots
- Search + fetch caching -- in-memory TTL caches avoid re-querying engines and re-downloading pages
- Per-engine circuit breaker -- stops retrying failed engines for a cooldown period
- Per-engine rate limiter -- max N requests per minute per engine
- In-memory TTL cache for search results with configurable expiry
- CAPTCHA detection -- auto-detects bot challenges and suggests browser_fetch fallback
- Result deduplication based on normalized URL domain+path
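To make the fusion and deduplication bullets concrete, here is a minimal sketch of Reciprocal Rank Fusion keyed on a normalized domain+path. It is not the library's code: `normalize_url` and `rrf_fuse` are hypothetical helpers, the constant `k=60` is the value commonly used for RRF, and stripping `www.` and the query string is an assumption about what "normalized" means here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to lowercase host + path (assumed dedup key)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop scheme, query string, fragment, and any trailing slash.
    return host + parts.path.rstrip("/")


def rrf_fuse(rankings, k=60):
    """Fuse several ranked URL lists; each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    first_seen = {}
    for ranking in rankings:
        seen_in_this_list = set()
        for rank, url in enumerate(ranking, start=1):
            key = normalize_url(url)
            if key in seen_in_this_list:
                continue  # count each page once per engine
            seen_in_this_list.add(key)
            scores[key] += 1.0 / (k + rank)
            first_seen.setdefault(key, url)
    # Highest fused score first; ties keep first-seen order (stable sort).
    ordered = sorted(scores, key=lambda key: -scores[key])
    return [first_seen[key] for key in ordered]


duckduckgo = ["https://a.example/x", "https://b.example/y", "https://c.example/z"]
brave = ["https://www.b.example/y/", "https://d.example/w", "https://a.example/x?utm=1"]
fused = rrf_fuse([duckduckgo, brave])
```

Pages returned by more than one engine accumulate score from each list, so they tend to rise above pages that only one engine found.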
Installation
pip install mcpaisuite-websearchmcp
# Optional extras:
pip install mcpaisuite-websearchmcp[bs4] # BeautifulSoup for better content extraction
pip install mcpaisuite-websearchmcp[browser] # Playwright for JS-rendered pages
pip install mcpaisuite-websearchmcp[rerank] # fastembed cross-encoder for relevance reranking
pip install mcpaisuite-websearchmcp[extract] # trafilatura for high-quality content extraction
pip install mcpaisuite-websearchmcp[dev] # Development tools
Note: BeautifulSoup (beautifulsoup4 + lxml) is optional. Without it, websearchmcp uses a built-in regex extractor that works for most pages. Install the bs4 extra for higher-quality extraction on complex HTML.
Quick Start
import asyncio
from websearchmcp import WebSearchFactory

async def main():
    pipeline = WebSearchFactory.from_env()
    results = await pipeline.search("Python 3.13 new features", max_results=5)
    for r in results:
        print(f"{r.title}: {r.url}")

asyncio.run(main())
MCP Server
websearchmcp-server
Robust backend: SearXNG (recommended)
By default websearchmcp scrapes DuckDuckGo/Mojeek/Brave HTML, which is fragile
(CAPTCHA, 403s, parser breakage). For a reliable, key-free, self-hosted backend,
run SearXNG — a metasearch engine with a clean JSON API.
When SEARXNG_URL is set, it's used as Priority 1, with scraping as fallback.
cd deploy/searxng # docker-compose.yml + settings.yml provided
docker compose up -d
export SEARXNG_URL=http://localhost:8080
Already running SearXNG? Its JSON API is off by default, so websearchmcp's format=json request gets a 403. Verify with curl "http://localhost:8080/search?q=test&format=json"; if the response is not JSON, add search.formats: [html, json] (and server.limiter: false) to your settings.yml and restart. See deploy/searxng/ for a ready-made config.
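The same check can be scripted in Python. The sketch below uses only the standard library; the helper names are hypothetical, and treating a top-level `results` key as the sign of a SearXNG JSON response is an assumption about the payload shape.

```python
import json
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import urlopen


def build_probe_url(base_url: str, query: str = "test") -> str:
    """Build the same URL the curl command above requests."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def looks_like_searxng_json(body: bytes) -> bool:
    """True if the body parses as a JSON object with a 'results' key (assumed shape)."""
    try:
        data = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    return isinstance(data, dict) and "results" in data


def searxng_json_enabled(base_url: str, timeout: float = 5.0) -> bool:
    """Probe the instance; a 403 or an HTML page both count as 'not enabled'."""
    try:
        with urlopen(build_probe_url(base_url), timeout=timeout) as resp:
            return looks_like_searxng_json(resp.read())
    except (URLError, OSError):  # HTTPError (e.g. 403) is a URLError subclass
        return False
```

Calling `searxng_json_enabled("http://localhost:8080")` before exporting SEARXNG_URL gives a quick yes/no without reading the raw response.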
Configuration
| Variable | Default | Description |
|---|---|---|
| SEARXNG_URL | -- | Base URL for self-hosted SearXNG instance |
| WEBSEARCH_ENGINES | duckduckgo,mojeek,brave | Comma-separated engine list |
| WEBSEARCH_MAX_LENGTH | 8000 | Max content length for extraction |
| WEBSEARCH_RERANK | false | Enable cross-encoder result reranking (needs [rerank]) |
| WEBSEARCH_RERANK_MODEL | Xenova/ms-marco-MiniLM-L-6-v2 | Reranker model override |
| WEBSEARCH_TRAFILATURA | true | Prefer trafilatura extraction when installed (needs [extract]) |
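As an illustration of how these variables and defaults fit together, the sketch below reads them into a small settings object. `WebSearchSettings`, `load_settings`, and the set of strings accepted as "true" are hypothetical; `WebSearchFactory.from_env()` may parse the values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class WebSearchSettings:
    """Hypothetical container mirroring the configuration table."""
    searxng_url: Optional[str]
    engines: Tuple[str, ...]
    max_length: int
    rerank: bool
    rerank_model: str
    trafilatura: bool


def _as_bool(value: str) -> bool:
    # Assumed truthy spellings; the real parser may accept a different set.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> WebSearchSettings:
    """Apply the documented defaults to whatever the environment provides."""
    engines = env.get("WEBSEARCH_ENGINES", "duckduckgo,mojeek,brave")
    return WebSearchSettings(
        searxng_url=env.get("SEARXNG_URL") or None,
        engines=tuple(e.strip() for e in engines.split(",") if e.strip()),
        max_length=int(env.get("WEBSEARCH_MAX_LENGTH", "8000")),
        rerank=_as_bool(env.get("WEBSEARCH_RERANK", "false")),
        rerank_model=env.get("WEBSEARCH_RERANK_MODEL", "Xenova/ms-marco-MiniLM-L-6-v2"),
        trafilatura=_as_bool(env.get("WEBSEARCH_TRAFILATURA", "true")),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check a configuration before starting the server.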
API Reference
WebSearchPipeline
Priority-based search pipeline with cache, circuit breaker, and deduplication.
await pipeline.search(query, max_results=10, rerank=None,
                      deep_rerank=False, deep_rerank_k=5, parallel=False) -> list[SearchResult]
await pipeline.fetch(url, max_length=8000) -> FetchResult
await pipeline.browser_fetch(url, timeout_ms=30000, wait_until="networkidle",
                             screenshot=False) -> FetchResult
await pipeline.search_with_answer(query, max_results=5, answer_fn=None,
                                  fetch_content=False, rerank=None,
                                  trim_passages=True, passages_per_source=3) -> AnswerResult
Reranking & answers (bring-your-own-LLM)
from websearchmcp import WebSearchFactory

pipeline = WebSearchFactory.create(enable_rerank=True)  # cross-encoder relevance

# Reranked results (most relevant first), no LLM needed:
results = await pipeline.search("capital of australia", rerank=True)

# Search + synthesized answer: you supply the LLM, we supply ranked+grounded sources.
def answer_fn(query, sources):  # sources: [{title, url, snippet, content?}]
    ctx = "\n".join(f"[{i+1}] {s['title']}: {s.get('content', s['snippet'])}"
                    for i, s in enumerate(sources))
    return my_llm(f"Answer with citations.\nQ: {query}\nSources:\n{ctx}")

res = await pipeline.search_with_answer("capital of australia", answer_fn=answer_fn,
                                        fetch_content=True)
print(res.answer)       # "The capital of Australia is Canberra [1][3]..."
print(res.synthesized)  # True (LLM); False = extractive snippet fallback
Honest scope: websearchmcp aggregates free engines (no proprietary index), so raw result quality/freshness depends on those engines. What this layer adds is the agent-facing surface (relevance reranking + cited answer synthesis) plus a focus on token economy: trafilatura extraction and passage trimming mean the downstream LLM reads only the relevant text (≈35% fewer tokens with no answer loss in benchmarks/quality_bench.py), at zero search-API cost and no key/lock-in. It is not a drop-in replacement for a paid search index; it's the open, self-hosted alternative.
Cost / token economy
The cost of agentic search is mostly the tokens your LLM ingests. websearchmcp minimizes that:
- trafilatura extraction strips menus/ads/cookie banners → less boilerplate per page.
- reranking lets you return top-3 instead of top-10 and still have the answer.
- passage trimming returns only the query-relevant passages of a page.
- fetch + search caches avoid paying twice for the same page/query.
Run python benchmarks/quality_bench.py to measure relevance@3 and the token saving
on live queries (no LLM calls, so the benchmark itself is free).
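To show what passage trimming means in practice, here is a toy version that scores paragraphs by word overlap with the query. This is a stand-in technique for illustration, not the scoring websearchmcp actually uses (which this page does not describe); `trim_passages` here is a hypothetical free function, unrelated to the `trim_passages=` flag of `search_with_answer`.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def trim_passages(text: str, query: str, max_passages: int = 3) -> list:
    """Keep up to max_passages paragraphs that share words with the query."""
    passages = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    query_tokens = _tokens(query)
    scored = [(len(query_tokens & _tokens(p)), i, p) for i, p in enumerate(passages)]
    # Best overlap first, earlier paragraph wins ties; drop zero-overlap text.
    best = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    kept = best[:max_passages]
    # Return the survivors in their original page order.
    return [p for _, _, p in sorted(kept, key=lambda s: s[1])]


page = (
    "Cookie banner text.\n\n"
    "Canberra is the capital of Australia.\n\n"
    "Subscribe to our newsletter.\n\n"
    "Australia moved its capital to Canberra in 1927."
)
relevant = trim_passages(page, "capital of australia")
```

Only the two paragraphs that mention the query terms survive, which is where the token saving for the downstream LLM comes from.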
WebSearchFactory
WebSearchFactory.from_env() # Build from environment variables
WebSearchFactory.create(searxng_url=..., engines=...) # Explicit config
Architecture
WebSearchPipeline implements a priority-based search strategy: SearXNG (if configured) is tried first as a reliable self-hosted option, then the pipeline rotates through DuckDuckGo, Mojeek, and Brave engines. Each engine has its own circuit breaker and rate limiter. Results are deduplicated by URL and cached with a TTL. Content extraction uses WebExtractor (regex-based or BeautifulSoup) to convert raw HTML into clean text suitable for LLM consumption.
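The priority fallback and per-engine circuit breaker described above can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: `CircuitBreaker`, `search_with_fallback`, the failure threshold, and the cooldown are all hypothetical, and the real pipeline is async and also applies rate limiting and caching.

```python
import time


class CircuitBreaker:
    """Skip an engine for `cooldown` seconds after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, cooldown=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker and allow a fresh attempt.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()


def search_with_fallback(query, engines, breakers):
    """Try (name, callable) pairs in priority order; return the first non-empty result."""
    for name, engine in engines:
        breaker = breakers[name]
        if not breaker.allow():
            continue  # engine is cooling down
        try:
            results = engine(query)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        if results:
            return name, results
    return None, []
```

Injecting the clock keeps the breaker testable without sleeping; in production the default `time.monotonic` would be used.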
Testing
pip install -e ".[dev]"
pytest tests/ -v
License
AGPL-3.0 — see LICENSE.
Open source for individuals and open-source projects. For commercial use in closed-source products, a commercial license is available — contact gaeldev@gmail.com.