Shared web infrastructure: search, scraping, HTTP security, browsers
Project description
web-core
Shared web infrastructure for search, scraping, HTTP security, and stealth browsers -- powers wet-mcp and downstream apps.
Sister projects from n24q02m (click to expand)
| Project | Tagline | Tag |
|---|---|---|
| better-code-review-graph | Knowledge graph for token-efficient code reviews -- semantic search and call-... | MCP |
| better-email-mcp | IMAP/SMTP email for AI agents -- read, send, organize folders, and manage att... | MCP |
| better-godot-mcp | Composite MCP server for Godot Engine -- 17 composite tools for AI-assisted g... | MCP |
| better-notion-mcp | Markdown-first Notion for AI agents -- pages, databases, blocks, and comments... | MCP |
| better-telegram-mcp | Telegram for AI agents -- messages, chats, media, and contacts across both bo... | MCP |
| claude-plugins | Claude Code plugin marketplace for the n24q02m MCP servers -- install web sea... | Marketplace |
| imagine-mcp | Image and video understanding + generation for AI agents -- across Gemini, Op... | MCP |
| jules-task-archiver | Chrome Extension for bulk operations on Jules tasks via batchexecute API -- a... | Tooling |
| mcp-core | Shared foundation for building MCP servers -- Streamable HTTP transport, OAut... | MCP |
| mnemo-mcp | Persistent AI memory with hybrid search and embedded sync. Open, free, unlimi... | MCP |
| qwen3-embed | Lightweight Qwen3 text embedding and reranking via ONNX Runtime and GGUF | Library |
| skret | Secrets without the server. | CLI |
| tacet | TACET: a self-distilling neuro-symbolic cascade that amortises LLM cost in kn... | Tooling |
| web-core | Shared web infrastructure package for search, scraping, HTTP security, and st... | Library |
| wet-mcp | Open-source MCP server for AI agents: web search, content extraction, and lib... | MCP |
Table of contents
Shared web infrastructure package: SearXNG search, multi-strategy scraping (basic, TLS spoof, Patchright stealth, Cloudflare CAPTCHA), SSRF-safe HTTP client, and stealth browser primitives. Used by wet-mcp and downstream applications.
Site-specific selectors moved to consumer applications. This package provides generic infrastructure only. Consumers bring their own per-domain selectors via the WEB_CORE_DOMAIN_COOKIES env-var pattern documented below.
Installation
# From PyPI
uv add n24q02m-web-core
# Or pin to v2.x (current stable line)
uv add "n24q02m-web-core>=2.0.0"
Quick Usage
SearXNG Search
from web_core.search import ensure_searxng, shutdown_searxng
from web_core.search.client import search
# Start/reuse a SearXNG instance (cross-process singleton)
url = await ensure_searxng()
# Search with retry, deduplication, and domain filtering
results = await search(
searxng_url=url,
query="Python async patterns",
max_results=10,
include_domains=["docs.python.org"],
)
for r in results:
print(f"{r.title}: {r.url}")
# Clean shutdown
await shutdown_searxng()
Multi-Strategy Scraping
from web_core.scraper import ScrapingAgent
from web_core.scraper.strategies import BasicHTTPStrategy, TLSSpoofStrategy
# Initialize agent with desired strategies
# Note: Some strategies (e.g., HeadlessStrategy, PatchrightStrategy)
# require optional dependencies like crawl4ai or patchright.
agent = ScrapingAgent(strategies={
"basic": BasicHTTPStrategy(),
"tls": TLSSpoofStrategy(),
})
# Scrape with automatic strategy escalation
content = await agent.scrape("https://example.com/article")
SSRF-Safe HTTP Client
from web_core.http import safe_httpx_client, is_safe_url
# Validate URL before use
assert is_safe_url("https://example.com") # True
assert not is_safe_url("http://localhost") # False (SSRF blocked)
# Create client with automatic SSRF protection + DNS pinning
async with safe_httpx_client() as client:
resp = await client.get("https://example.com")
URL Utilities
from web_core.http import normalize_url, strip_tracking_params, is_valid_domain
# Normalize for deduplication (lowercase, strip www/tracking/fragment)
normalize_url("https://WWW.Example.COM/page?utm_source=x#section")
# => "https://example.com/page"
# Validate domain names (prevents search operator injection)
is_valid_domain("example.com") # True
is_valid_domain("localhost") # False
Architecture
src/web_core/
__init__.py -- Public API re-exports
py.typed -- PEP 561 type stub marker
http/ -- Layer 1: SSRF-safe HTTP primitives
client.py -- safe_httpx_client, DNS pinning, IP validation
url.py -- normalize_url, strip_tracking_params, is_valid_domain
search/ -- Layer 2: SearXNG search engine
client.py -- search() with retry, dedup, domain filtering
models.py -- SearchResult, SearchError dataclasses
runner.py -- Cross-process SearXNG singleton manager
scraper/ -- Layer 2: Multi-strategy scraping agent
agent.py -- ScrapingAgent (LangGraph state machine)
base.py -- BaseStrategy ABC, ScrapingResult
cache.py -- StrategyCache (per-domain performance tracking)
state.py -- ScrapingState TypedDict, ScrapingError
strategies/ -- Concrete strategy implementations
api_direct.py -- API endpoint detection and direct fetch
basic_http.py -- Simple httpx GET with SSRF protection
captcha.py -- CapSolver-backed captcha bypass
headless.py -- Crawl4AI headless browser rendering
tls_spoof.py -- curl_cffi TLS fingerprint spoofing
browsers/ -- Layer 2: Stealth browser abstraction
protocol.py -- BrowserProvider Protocol (structural typing)
patchright.py -- Patchright (undetected Playwright) provider
Key Design Decisions
- SSRF protection: All outbound HTTP goes through
safe_httpx_clientwith DNS pinning to prevent DNS rebinding attacks. - Strategy escalation: The scraping agent tries strategies in cache-recommended order, validates responses, and automatically escalates on failure.
- Cross-process SearXNG: A file-lock singleton ensures exactly one SearXNG instance runs across all Python processes.
- Structural typing:
BrowserProviderusesProtocolso implementations don't need inheritance.
Development
Prerequisites
Setup
git clone git@github.com:n24q02m/web-core.git
cd web-core
uv sync --all-extras
pre-commit install
Commands
# Via mise
mise run setup # uv sync --all-extras
mise run lint # ruff check + ruff format --check
mise run test # pytest with coverage
mise run fix # auto-fix lint + format
# Direct
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
uv run ty check src/
uv run pytest --cov -q
Tests
asyncio_mode = "auto"-- no@pytest.mark.asyncioneeded- Coverage threshold: 95% (enforced in pyproject.toml)
- Test files mirror source module structure under
tests/
License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file n24q02m_web_core-2.3.0b2.tar.gz.
File metadata
- Download URL: n24q02m_web_core-2.3.0b2.tar.gz
- Upload date:
- Size: 246.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.22 {"installer":{"name":"uv","version":"0.11.22","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
273befba0369ac1ed7cfb5dcb7eadaf025322cf9fda0c156a8b1c096866760ef
|
|
| MD5 |
c2d6acd1978d1c8dbb4d50865fc31003
|
|
| BLAKE2b-256 |
555ca2a6137ae9995474db1f903cba03b8e3cbd9eee4ea2a86478cdf76911208
|
File details
Details for the file n24q02m_web_core-2.3.0b2-py3-none-any.whl.
File metadata
- Download URL: n24q02m_web_core-2.3.0b2-py3-none-any.whl
- Upload date:
- Size: 67.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.22 {"installer":{"name":"uv","version":"0.11.22","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9e0c3f94ed00d5b21623bb914447994cae83861ba44a4ea5672ec3aa59f740a4
|
|
| MD5 |
1a4605ec32d7d0868458669a9b605bb3
|
|
| BLAKE2b-256 |
b06b7d12b7c7f61c597d0e867bae3021105b39a1fe19b977767e86b90fcba490
|