websearchmcp

Multi-engine web search and content extraction, exposed as an MCP server

Part of the MCP AI Suite.

Features

  • Multi-engine search with priority fallback: SearXNG (self-hosted) -> DuckDuckGo -> Mojeek -> Brave
  • Parallel search + Reciprocal Rank Fusion (optional) -- query all engines at once and fuse for better coverage
  • Cross-encoder reranking (optional) -- reorder results by relevance to the query, no API key
  • Deep rerank (optional) -- re-score the top candidates on their actual page content, not just snippets
  • Search + answer (search_with_answer) -- ranked sources + a synthesized answer via a bring-your-own-LLM callback (the agent-facing surface of commercial search APIs, at zero search cost)
  • Passage trimming -- return only the query-relevant passages of a page, not the whole thing (≈35% fewer tokens for the downstream LLM, with no answer loss in our benchmark)
  • Content extraction via trafilatura (optional, state-of-the-art boilerplate removal) with regex/BeautifulSoup fallback
  • Playwright browser_fetch for JavaScript-rendered pages with full DOM access and screenshots
  • Search + fetch caching -- in-memory TTL caches avoid re-querying engines and re-downloading pages (search-result expiry is configurable)
  • Per-engine circuit breaker -- stops retrying failed engines for a cooldown period
  • Per-engine rate limiter -- max N requests per minute per engine
  • CAPTCHA detection -- auto-detects bot challenges and suggests browser_fetch fallback
  • Result deduplication based on normalized URL domain+path
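
The parallel mode listed above fuses per-engine rankings with Reciprocal Rank Fusion. As a rough illustration of the technique (not the library's actual code), each URL can be scored by summing 1 / (k + rank) over every engine that returned it. The function name and the constant k = 60 are assumptions; 60 is a commonly used default for RRF, and the value websearchmcp uses is not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked URL lists into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            # A URL near the top of many lists accumulates the highest score.
            scores[url] += 1.0 / (k + rank)
    # Sort by score descending; break ties by URL so the output is deterministic.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Made-up per-engine rankings, purely for illustration.
engine_results = {
    "duckduckgo": ["a.example", "b.example", "c.example"],
    "mojeek": ["b.example", "a.example", "d.example"],
    "brave": ["b.example", "c.example", "a.example"],
}
fused = reciprocal_rank_fusion(engine_results.values())
print(fused)
```

Because only ranks are used, engines with incomparable relevance scores can still be combined, which is why RRF suits a multi-engine setup.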

Installation

pip install mcpaisuite-websearchmcp
# Optional extras:
# Quote the extras so shells such as zsh do not treat the brackets as a glob.
pip install "mcpaisuite-websearchmcp[bs4]"       # BeautifulSoup for better content extraction
pip install "mcpaisuite-websearchmcp[browser]"   # Playwright for JS-rendered pages
pip install "mcpaisuite-websearchmcp[rerank]"    # fastembed cross-encoder for relevance reranking
pip install "mcpaisuite-websearchmcp[extract]"   # trafilatura for high-quality content extraction
pip install "mcpaisuite-websearchmcp[dev]"       # Development tools

Note: BeautifulSoup (beautifulsoup4 + lxml) is optional. Without it, websearchmcp uses a built-in regex extractor that works for most pages. Install the bs4 extra for higher-quality extraction on complex HTML.

Quick Start

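import asyncio

from websearchmcp import WebSearchFactory


async def main():
    # Build the pipeline from environment variables (see Configuration).
    pipeline = WebSearchFactory.from_env()
    results = await pipeline.search("Python 3.13 new features", max_results=5)
    for r in results:
        print(f"{r.title}: {r.url}")


# search() is a coroutine, so it has to run inside an event loop.
asyncio.run(main())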

MCP Server

websearchmcp-server

Robust backend: SearXNG (recommended)

By default websearchmcp scrapes DuckDuckGo/Mojeek/Brave HTML, which is fragile (CAPTCHA, 403s, parser breakage). For a reliable, key-free, self-hosted backend, run SearXNG — a metasearch engine with a clean JSON API. When SEARXNG_URL is set, it's used as Priority 1, with scraping as fallback.

cd deploy/searxng       # docker-compose.yml + settings.yml provided
docker compose up -d
export SEARXNG_URL=http://localhost:8080

Already running SearXNG? Its JSON API is off by default — websearchmcp's format=json request then 403s. Verify with curl "http://localhost:8080/search?q=test&format=json"; if it's not JSON, add search.formats: [html, json] (and server.limiter: false) to your settings.yml and restart. See deploy/searxng/ for a ready-made config.
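
The curl check above can also be scripted. The sketch below only builds the request URL and classifies a response you have already fetched with any HTTP client; the helper names are hypothetical, and the only facts taken from this document are the /search path and the format=json parameter.

```python
import json
from urllib.parse import urlencode


def searxng_json_url(base_url, query):
    """Build the /search URL with format=json, mirroring the curl command."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def is_json_response(status, body):
    """Return True when the instance answered 200 with a parseable JSON object."""
    if status != 200:
        return False  # e.g. the 403 returned while the JSON format is disabled
    try:
        return isinstance(json.loads(body), dict)
    except ValueError:
        return False  # an HTML page came back instead of JSON


print(searxng_json_url("http://localhost:8080", "test"))
```

If is_json_response returns False for your instance, the settings.yml change described above is the likely fix.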

Configuration

Variable                 Default                        Description
SEARXNG_URL              --                             Base URL for self-hosted SearXNG instance
WEBSEARCH_ENGINES        duckduckgo,mojeek,brave        Comma-separated engine list
WEBSEARCH_MAX_LENGTH     8000                           Max content length for extraction
WEBSEARCH_RERANK         false                          Enable cross-encoder result reranking (needs [rerank])
WEBSEARCH_RERANK_MODEL   Xenova/ms-marco-MiniLM-L-6-v2  Reranker model override
WEBSEARCH_TRAFILATURA    true                           Prefer trafilatura extraction when installed (needs [extract])
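
To make the defaults in the table concrete, here is a minimal sketch of how such variables could be parsed. WebSearchSettings, load_settings, and the accepted boolean spellings are hypothetical and are not the library's own code; WebSearchFactory.from_env() is the supported entry point.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class WebSearchSettings:
    """Hypothetical settings container; not a class from websearchmcp."""
    searxng_url: Optional[str]
    engines: List[str]
    max_length: int
    rerank: bool
    rerank_model: str
    prefer_trafilatura: bool


def _flag(value: str) -> bool:
    # Assumption: these spellings count as true; the library may differ.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> WebSearchSettings:
    """Read the documented variables, applying the defaults from the table."""
    engines = env.get("WEBSEARCH_ENGINES", "duckduckgo,mojeek,brave")
    return WebSearchSettings(
        searxng_url=env.get("SEARXNG_URL") or None,
        engines=[e.strip() for e in engines.split(",") if e.strip()],
        max_length=int(env.get("WEBSEARCH_MAX_LENGTH", "8000")),
        rerank=_flag(env.get("WEBSEARCH_RERANK", "false")),
        rerank_model=env.get("WEBSEARCH_RERANK_MODEL", "Xenova/ms-marco-MiniLM-L-6-v2"),
        prefer_trafilatura=_flag(env.get("WEBSEARCH_TRAFILATURA", "true")),
    )


print(load_settings({"WEBSEARCH_ENGINES": "brave, mojeek"}))
```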

API Reference

WebSearchPipeline

Priority-based search pipeline with cache, circuit breaker, and deduplication.

await pipeline.search(query, max_results=10, rerank=None,
                      deep_rerank=False, deep_rerank_k=5, parallel=False) -> list[SearchResult]
await pipeline.fetch(url, max_length=8000) -> FetchResult
await pipeline.browser_fetch(url, timeout_ms=30000, wait_until="networkidle",
                             screenshot=False) -> FetchResult
await pipeline.search_with_answer(query, max_results=5, answer_fn=None,
                                  fetch_content=False, rerank=None,
                                  trim_passages=True, passages_per_source=3) -> AnswerResult
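
The deduplication the pipeline applies is described as keyed on the normalized URL domain and path. One plausible way to do that is sketched below; the exact normalization rules here (lowercasing the host, dropping a leading "www.", ignoring the scheme, query string, and trailing slash) are assumptions, not the library's documented behavior.

```python
from urllib.parse import urlsplit


def dedup_key(url):
    """Reduce a URL to 'domain/path' so trivial variants compare equal."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return host + path


def deduplicate(urls):
    """Keep the first URL seen for each key, preserving the input order."""
    seen = set()
    unique = []
    for url in urls:
        key = dedup_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


urls = [
    "https://www.Example.com/docs/",
    "http://example.com/docs?utm=1",
    "https://example.com/blog",
]
print(deduplicate(urls))
```

Keeping the first occurrence matters when results arrive in priority order, since the higher-priority engine's entry is the one that survives.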

Reranking & answers (bring-your-own-LLM)

from websearchmcp import WebSearchFactory

pipeline = WebSearchFactory.create(enable_rerank=True)  # cross-encoder relevance

# Reranked results (most relevant first), no LLM needed:
results = await pipeline.search("capital of australia", rerank=True)

# Search + synthesized answer: you supply the LLM, we supply ranked+grounded sources.
# my_llm below is a placeholder for your own LLM call; it is not part of websearchmcp.
def answer_fn(query, sources):           # sources: [{title, url, snippet, content?}]
    ctx = "\n".join(f"[{i+1}] {s['title']}: {s.get('content', s['snippet'])}"
                    for i, s in enumerate(sources))
    return my_llm(f"Answer with citations.\nQ: {query}\nSources:\n{ctx}")

res = await pipeline.search_with_answer("capital of australia", answer_fn=answer_fn,
                                        fetch_content=True)
print(res.answer)       # "The capital of Australia is Canberra [1][3]..."
print(res.synthesized)  # True (LLM); False = extractive snippet fallback

Honest scope: websearchmcp aggregates free engines (no proprietary index), so raw result quality/freshness depends on those engines. What this layer adds is the agent-facing surface — relevance reranking + cited answer synthesis — plus a focus on token economy: trafilatura extraction and passage trimming mean the downstream LLM reads only the relevant text (≈35% fewer tokens with no answer loss in benchmarks/quality_bench.py), at zero search-API cost and no key/lock-in. It is not a drop-in replacement for a paid search index; it's the open, self-hosted alternative.

Cost / token economy

The cost of agentic search is mostly the tokens your LLM ingests. websearchmcp minimizes that:

  • trafilatura extraction strips menus/ads/cookie banners → less boilerplate per page.
  • reranking lets you return top-3 instead of top-10 and still have the answer.
  • passage trimming returns only the query-relevant passages of a page.
  • fetch + search caches avoid paying twice for the same page/query.

Run python benchmarks/quality_bench.py to measure relevance@3 and the token saving on live queries (no LLM calls, so the benchmark itself is free).
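
To give a feel for passage trimming, the sketch below splits a page into paragraphs and keeps the ones that share the most terms with the query. Term overlap is a deliberately simple stand-in chosen for illustration; how websearchmcp actually scores passages is not described here, and the function name and max_passages parameter are hypothetical.

```python
import re


def _terms(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def trim_passages(query, page_text, max_passages=3):
    """Return up to max_passages paragraphs that overlap with the query."""
    query_terms = _terms(query)
    passages = [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]
    scored = [(len(query_terms & _terms(p)), i, p) for i, p in enumerate(passages)]
    # Drop passages with no overlap, then keep the highest-scoring ones.
    relevant = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    top = relevant[:max_passages]
    # Restore the original page order so the kept text still reads naturally.
    return [p for _, _, p in sorted(top, key=lambda s: s[1])]


page = (
    "Canberra is the capital of Australia.\n\n"
    "Subscribe to our newsletter.\n\n"
    "Sydney is the largest city in Australia."
)
print(trim_passages("capital of australia", page, max_passages=1))
```

Even this crude scorer drops the newsletter paragraph, which is the kind of text that costs tokens without helping the answer.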

WebSearchFactory

WebSearchFactory.from_env()                          # Build from environment variables
WebSearchFactory.create(searxng_url=..., engines=...) # Explicit config

Architecture

WebSearchPipeline implements a priority-based search strategy: SearXNG (if configured) is tried first as a reliable self-hosted option, then the pipeline falls back through DuckDuckGo, Mojeek, and Brave in that order. Each engine has its own circuit breaker and rate limiter. Results are deduplicated by normalized URL (domain + path) and cached with a TTL. Content extraction prefers trafilatura when it is installed and enabled, with WebExtractor (regex-based or BeautifulSoup) as the fallback, to convert raw HTML into clean text suitable for LLM consumption.
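
The per-engine circuit breaker mentioned above can be pictured with a small sketch like the following. The class, its failure threshold, and its cooldown length are assumptions made for illustration; the library's real thresholds and internals are not documented here.

```python
import time


class CircuitBreaker:
    """Skip an engine for a cooldown period after repeated failures."""

    def __init__(self, failure_threshold=3, cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock          # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None       # None means the breaker is closed

    def allow(self):
        """Return True if the engine may be queried right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and permit a fresh attempt.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


# One breaker per engine, matching the per-engine design described above.
breakers = {name: CircuitBreaker() for name in ("duckduckgo", "mojeek", "brave")}
print(breakers["mojeek"].allow())
```

A pipeline would call allow() before querying an engine and move on to the next engine in priority order whenever it returns False.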

Testing

pip install -e ".[dev]"
pytest tests/ -v

License

AGPL-3.0 — see LICENSE.

Open source for individuals and open-source projects. For commercial use in closed-source products, a commercial license is available — contact gaeldev@gmail.com.
