Multi-engine web search and content extraction, exposed as an MCP server

Maintainers

gashel01

Project description

websearchmcp

Part of the MCP AI Suite.

Features

Multi-engine search with priority fallback: SearXNG (self-hosted) -> DuckDuckGo -> Mojeek -> Brave
Parallel search + Reciprocal Rank Fusion (optional) -- query all engines at once and fuse for better coverage
Cross-encoder reranking (optional) -- reorder results by relevance to the query, no API key
Deep rerank (optional) -- re-score the top candidates on their actual page content, not just snippets
Search + answer (search_with_answer) -- ranked sources + a synthesized answer via a bring-your-own-LLM callback (the agent-facing surface of commercial search APIs, at zero search cost)
Passage trimming -- return only the query-relevant passages of a page, not the whole thing (≈35% fewer tokens for the downstream LLM, with no answer loss in our benchmark)
Content extraction via trafilatura (optional, state-of-the-art boilerplate removal) with regex/BeautifulSoup fallback
Playwright browser_fetch for JavaScript-rendered pages with full DOM access and screenshots
Search + fetch caching -- in-memory TTL caches avoid re-querying engines and re-downloading pages; search-result expiry is configurable
Per-engine circuit breaker -- stops retrying failed engines for a cooldown period
Per-engine rate limiter -- max N requests per minute per engine
CAPTCHA detection -- auto-detects bot challenges and suggests browser_fetch fallback
Result deduplication based on normalized URL domain+path
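
The fusion and deduplication features above can be illustrated with a short, generic sketch. It is not websearchmcp's internal code: the function names normalize_url and rrf_fuse, the exact normalization rules, and the constant k=60 (the value commonly used for Reciprocal Rank Fusion) are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to host (without 'www.') plus path, ignoring scheme,
    query string, fragment, and a trailing slash. Hypothetical rules."""
    parts = urlsplit(url)
    host = (parts.hostname or "").removeprefix("www.")
    return f"{host}{parts.path.rstrip('/')}"


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[str]:
    """Fuse per-engine ranked URL lists with Reciprocal Rank Fusion.

    Each URL scores sum(1 / (k + rank)) over the engines that returned it,
    with rank starting at 1. URLs that normalize to the same key are merged.
    """
    scores: dict[str, float] = defaultdict(float)
    first_seen: dict[str, str] = {}
    for urls in rankings.values():
        seen_here: set[str] = set()
        for rank, url in enumerate(urls, start=1):
            key = normalize_url(url)
            if key in seen_here:
                continue  # count each page once per engine
            seen_here.add(key)
            scores[key] += 1.0 / (k + rank)
            first_seen.setdefault(key, url)
    ordered = sorted(scores, key=lambda key: scores[key], reverse=True)
    return [first_seen[key] for key in ordered]
```

A page returned by several engines accumulates score from each list, so it tends to rise above pages that only one engine found, even if no single engine ranked it first.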

Installation

pip install mcpaisuite-websearchmcp
# Optional extras (quoted so shells such as zsh do not treat the brackets as a glob):
pip install "mcpaisuite-websearchmcp[bs4]"       # BeautifulSoup for better content extraction
pip install "mcpaisuite-websearchmcp[browser]"   # Playwright for JS-rendered pages
pip install "mcpaisuite-websearchmcp[rerank]"    # fastembed cross-encoder for relevance reranking
pip install "mcpaisuite-websearchmcp[extract]"   # trafilatura for high-quality content extraction
pip install "mcpaisuite-websearchmcp[dev]"       # Development tools

Note: BeautifulSoup (beautifulsoup4 + lxml) is optional. Without it, websearchmcp uses a built-in regex extractor that works for most pages. Install the bs4 extra for higher-quality extraction on complex HTML.
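
To make the optional-dependency fallback concrete, here is a rough sketch of the pattern: use BeautifulSoup when it can be imported, otherwise strip tags with regular expressions. The names regex_extract and extract_text are hypothetical, and the package's built-in extractor is likely more thorough than this.

```python
import html
import re

_DROP_BLOCKS = re.compile(r"(?is)<(script|style|noscript)\b.*?</\1\s*>")
_TAGS = re.compile(r"(?s)<[^>]+>")


def regex_extract(page: str) -> str:
    """Rough tag stripper: drop script/style blocks, remove remaining tags,
    unescape entities, and collapse whitespace."""
    text = _DROP_BLOCKS.sub(" ", page)
    text = _TAGS.sub(" ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def extract_text(page: str) -> str:
    """Use BeautifulSoup when it is installed, else the regex fallback."""
    try:
        from bs4 import BeautifulSoup  # available after installing the bs4 extra
    except ImportError:
        return regex_extract(page)
    soup = BeautifulSoup(page, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove non-content elements before reading text
    return " ".join(soup.get_text(separator=" ").split())
```

Importing inside the function keeps the module usable when the extra is not installed, which is the behavior the note above describes.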

Quick Start

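import asyncio

from websearchmcp import WebSearchFactory

async def main():
    # search() is a coroutine, so it has to be awaited inside an async function.
    pipeline = WebSearchFactory.from_env()
    results = await pipeline.search("Python 3.13 new features", max_results=5)
    for r in results:
        print(f"{r.title}: {r.url}")

asyncio.run(main())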

MCP Server

websearchmcp-server

Robust backend: SearXNG (recommended)

By default websearchmcp scrapes DuckDuckGo/Mojeek/Brave HTML, which is fragile (CAPTCHA, 403s, parser breakage). For a reliable, key-free, self-hosted backend, run SearXNG — a metasearch engine with a clean JSON API. When SEARXNG_URL is set, it's used as Priority 1, with scraping as fallback.

cd deploy/searxng       # docker-compose.yml + settings.yml provided
docker compose up -d
export SEARXNG_URL=http://localhost:8080

Already running SearXNG? Its JSON API is off by default, so websearchmcp's format=json requests are rejected with a 403. Verify with curl "http://localhost:8080/search?q=test&format=json"; if the response is not JSON, add search.formats: [html, json] (and server.limiter: false) to your settings.yml and restart. See deploy/searxng/ for a ready-made config.
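
The same check can be scripted. The sketch below uses only the standard library; the helper names searxng_search_url and json_api_enabled are hypothetical, and it assumes that a working SearXNG JSON response is an object with a "results" key.

```python
import json
import urllib.request
from urllib.parse import urlencode


def searxng_search_url(base_url: str, query: str) -> str:
    """Build the JSON search URL for a SearXNG instance."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def json_api_enabled(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the instance answers a test query with JSON."""
    url = searxng_search_url(base_url, "test")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers HTTP errors such as 403, refused connections, and
        # timeouts; ValueError covers bodies that are not valid JSON.
        return False
    return isinstance(payload, dict) and "results" in payload
```

Calling json_api_enabled("http://localhost:8080") before exporting SEARXNG_URL gives a quick yes/no answer without reading curl output by hand.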

Configuration

Variable	Default	Description
`SEARXNG_URL`	--	Base URL for self-hosted SearXNG instance
`WEBSEARCH_ENGINES`	`duckduckgo,mojeek,brave`	Comma-separated engine list
`WEBSEARCH_MAX_LENGTH`	`8000`	Max content length for extraction
`WEBSEARCH_RERANK`	`false`	Enable cross-encoder result reranking (needs `[rerank]`)
`WEBSEARCH_RERANK_MODEL`	`Xenova/ms-marco-MiniLM-L-6-v2`	Reranker model override
`WEBSEARCH_TRAFILATURA`	`true`	Prefer trafilatura extraction when installed (needs `[extract]`)
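
For orientation, the table can be read as the following settings loader. This is only a sketch: WebSearchSettings and load_settings are hypothetical names, not part of websearchmcp, and the set of strings accepted as "true" is an assumption. The variable names and defaults are taken from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WebSearchSettings:  # hypothetical container, not a websearchmcp class
    searxng_url: Optional[str]
    engines: tuple[str, ...]
    max_length: int
    rerank: bool
    rerank_model: str
    trafilatura: bool


def _flag(value: str) -> bool:
    # Assumed truthy spellings; the package may accept a different set.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> WebSearchSettings:
    """Read the documented variables, falling back to the documented defaults."""
    engines = env.get("WEBSEARCH_ENGINES", "duckduckgo,mojeek,brave")
    return WebSearchSettings(
        searxng_url=env.get("SEARXNG_URL") or None,
        engines=tuple(e.strip() for e in engines.split(",") if e.strip()),
        max_length=int(env.get("WEBSEARCH_MAX_LENGTH", "8000")),
        rerank=_flag(env.get("WEBSEARCH_RERANK", "false")),
        rerank_model=env.get("WEBSEARCH_RERANK_MODEL", "Xenova/ms-marco-MiniLM-L-6-v2"),
        trafilatura=_flag(env.get("WEBSEARCH_TRAFILATURA", "true")),
    )
```

Passing a plain dict instead of os.environ makes it easy to try different combinations without touching the real environment.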

API Reference

WebSearchPipeline

Priority-based search pipeline with cache, circuit breaker, and deduplication.

await pipeline.search(query, max_results=10, rerank=None,
                      deep_rerank=False, deep_rerank_k=5, parallel=False) -> list[SearchResult]
await pipeline.fetch(url, max_length=8000) -> FetchResult
await pipeline.browser_fetch(url, timeout_ms=30000, wait_until="networkidle",
                             screenshot=False) -> FetchResult
await pipeline.search_with_answer(query, max_results=5, answer_fn=None,
                                  fetch_content=False, rerank=None,
                                  trim_passages=True, passages_per_source=3) -> AnswerResult

Reranking & answers (bring-your-own-LLM)

import asyncio

from websearchmcp import WebSearchFactory

# Search + synthesized answer: you supply the LLM, we supply ranked+grounded sources.
# my_llm is a placeholder for your own LLM call (prompt in, answer text out).
def answer_fn(query, sources):           # sources: [{title, url, snippet, content?}]
    ctx = "\n".join(f"[{i+1}] {s['title']}: {s.get('content', s['snippet'])}"
                    for i, s in enumerate(sources))
    return my_llm(f"Answer with citations.\nQ: {query}\nSources:\n{ctx}")

async def main():
    pipeline = WebSearchFactory.create(enable_rerank=True)  # cross-encoder relevance

    # Reranked results (most relevant first), no LLM needed:
    results = await pipeline.search("capital of australia", rerank=True)

    res = await pipeline.search_with_answer("capital of australia", answer_fn=answer_fn,
                                            fetch_content=True)
    print(res.answer)       # e.g. "The capital of Australia is Canberra [1][3]..."
    print(res.synthesized)  # True (LLM); False = extractive snippet fallback

asyncio.run(main())

Honest scope: websearchmcp aggregates free engines (no proprietary index), so raw result quality/freshness depends on those engines. What this layer adds is the agent-facing surface — relevance reranking + cited answer synthesis — plus a focus on token economy: trafilatura extraction and passage trimming mean the downstream LLM reads only the relevant text (≈35% fewer tokens with no answer loss in benchmarks/quality_bench.py), at zero search-API cost and no key/lock-in. It is not a drop-in replacement for a paid search index; it's the open, self-hosted alternative.

Cost / token economy

The cost of agentic search is mostly the tokens your LLM ingests. websearchmcp minimizes that:

trafilatura extraction strips menus/ads/cookie banners → less boilerplate per page.
reranking lets you return top-3 instead of top-10 and still have the answer.
passage trimming returns only the query-relevant passages of a page.
fetch + search caches avoid paying twice for the same page/query.
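
As an illustration of the passage-trimming idea, the sketch below keeps the paragraphs that share the most words with the query. Plain term overlap is a stand-in chosen for brevity; how websearchmcp actually scores passages is not described here, and the function name trim_passages is hypothetical. Only the parameter name passages_per_source is borrowed from the search_with_answer signature.

```python
import re


def _terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def trim_passages(text: str, query: str, passages_per_source: int = 3) -> list[str]:
    """Keep the paragraphs sharing the most terms with the query, returned in
    original document order; paragraphs with no overlap are dropped."""
    query_terms = _terms(query)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    scored = [(len(query_terms & _terms(p)), i) for i, p in enumerate(paragraphs)]
    # Highest overlap first; ties go to the earlier paragraph.
    best = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    keep = sorted(i for _, i in best[:passages_per_source])
    return [paragraphs[i] for i in keep]
```

Returning the kept passages in document order, rather than score order, preserves the page's own flow for the downstream LLM.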

Run python benchmarks/quality_bench.py to measure relevance@3 and the token saving on live queries (no LLM calls, so the benchmark itself is free).

WebSearchFactory

WebSearchFactory.from_env()                          # Build from environment variables
WebSearchFactory.create(searxng_url=..., engines=...) # Explicit config

Architecture

WebSearchPipeline implements a priority-based search strategy: SearXNG (if configured) is tried first as a reliable self-hosted option, then the pipeline falls back through DuckDuckGo, Mojeek, and Brave in order. Each engine has its own circuit breaker and rate limiter. Results are deduplicated by URL and cached with a TTL. Content extraction converts raw HTML into clean text suitable for LLM consumption, using trafilatura when it is installed and otherwise WebExtractor (regex-based or BeautifulSoup).
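
The interplay between priority fallback and the per-engine circuit breaker can be sketched as follows. This is a simplified, synchronous illustration rather than the pipeline's actual (async) code: CircuitBreaker, search_with_fallback, the failure threshold, and the cooldown length are all hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Skip an engine for `cooldown` seconds after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3, cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker and allow a fresh attempt.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def search_with_fallback(query, engines, breakers):
    """Try (name, search_fn) pairs in priority order and return
    (engine_name, results) from the first engine that answers."""
    for name, search in engines:
        breaker = breakers[name]
        if not breaker.allow():
            continue  # engine is cooling down after repeated failures
        try:
            results = search(query)
        except Exception:
            breaker.record(ok=False)
            continue
        breaker.record(ok=True)
        if results:
            return name, results
    return None, []
```

Injecting the clock keeps the breaker testable without sleeping, and an engine that keeps failing costs at most `threshold` attempts before it is skipped until the cooldown ends.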

Testing

pip install -e ".[dev]"
pytest tests/ -v

License

Apache-2.0 — see LICENSE.

Open source for individuals and open-source projects. For commercial use in closed-source products, a commercial license is available — contact contact@mcpaisuite.com.

Release history

1.1.0 (this version) -- Jun 22, 2026
1.0.5 -- Jun 19, 2026
1.0.4 -- Jun 18, 2026
1.0.3 -- Jun 16, 2026

Download files

Source Distribution

mcpaisuite_websearchmcp-1.1.0.tar.gz (39.2 kB), uploaded Jun 22, 2026, Source

Built Distribution

mcpaisuite_websearchmcp-1.1.0-py3-none-any.whl (47.1 kB), uploaded Jun 22, 2026, Python 3

File details

Details for the file mcpaisuite_websearchmcp-1.1.0.tar.gz.

File metadata

Download URL: mcpaisuite_websearchmcp-1.1.0.tar.gz
Upload date: Jun 22, 2026
Size: 39.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mcpaisuite_websearchmcp-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`08595f651b6cdf850d66e3ee9d846f503ada52dbc6a1c87cb39b9c0b0b0f1c89`
MD5	`3028ee1c03c1e0955e9301224115df3f`
BLAKE2b-256	`e3b2edd3deeb13da76aaba70883d6b960581f8b9c8ec366debaccf3358d6fcb5`

Provenance

The following attestation bundles were made for mcpaisuite_websearchmcp-1.1.0.tar.gz:

Publisher: release.yml on gashel01/websearchmcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mcpaisuite_websearchmcp-1.1.0.tar.gz
- Subject digest: 08595f651b6cdf850d66e3ee9d846f503ada52dbc6a1c87cb39b9c0b0b0f1c89
- Sigstore transparency entry: 1912308141
- Sigstore integration time: Jun 22, 2026
Source repository:
- Permalink: gashel01/websearchmcp@255520c3759e30f58cecacd683d13998c5f593f5
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/gashel01
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@255520c3759e30f58cecacd683d13998c5f593f5
- Trigger Event: push

File details

Details for the file mcpaisuite_websearchmcp-1.1.0-py3-none-any.whl.

File metadata

Download URL: mcpaisuite_websearchmcp-1.1.0-py3-none-any.whl
Upload date: Jun 22, 2026
Size: 47.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for mcpaisuite_websearchmcp-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1a50827d544e06eed4a637bfa24e16b550454e1afcbe4629ef97ad38c5d05f59`
MD5	`00551b802a0016042292508ee3cc703e`
BLAKE2b-256	`1634434a638e78bb99011b67248499a5d6df2a5c709650719f40a48cdb1c6dcf`

Provenance

The following attestation bundles were made for mcpaisuite_websearchmcp-1.1.0-py3-none-any.whl:

Publisher: release.yml on gashel01/websearchmcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mcpaisuite_websearchmcp-1.1.0-py3-none-any.whl
- Subject digest: 1a50827d544e06eed4a637bfa24e16b550454e1afcbe4629ef97ad38c5d05f59
- Sigstore transparency entry: 1912308229
- Sigstore integration time: Jun 22, 2026
Source repository:
- Permalink: gashel01/websearchmcp@255520c3759e30f58cecacd683d13998c5f593f5
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/gashel01
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@255520c3759e30f58cecacd683d13998c5f593f5
- Trigger Event: push
