Shared web infrastructure: search, scraping, HTTP security, browsers
web-core
Shared web infrastructure for search, scraping, HTTP security, and stealth browsers -- powers wet-mcp and downstream apps.
Sister projects from n24q02m
| Project | Tagline | Tag |
|---|---|---|
| better-code-review-graph | Knowledge graph for token-efficient code reviews -- semantic search and call-... | MCP |
| better-email-mcp | IMAP/SMTP email for AI agents -- read, send, organize folders, and manage att... | MCP |
| better-godot-mcp | Composite MCP server for Godot Engine -- 17 composite tools for AI-assisted g... | MCP |
| better-notion-mcp | Markdown-first Notion for AI agents -- pages, databases, blocks, and comments... | MCP |
| better-telegram-mcp | Telegram for AI agents -- messages, chats, media, and contacts across both bo... | MCP |
| claude-plugins | Claude Code plugin marketplace for the n24q02m MCP servers -- install web sea... | Marketplace |
| imagine-mcp | Image and video understanding + generation for AI agents -- across Gemini, Op... | MCP |
| jules-task-archiver | Chrome Extension for bulk operations on Jules tasks via batchexecute API -- a... | Tooling |
| mcp-core | Shared foundation for building MCP servers -- Streamable HTTP transport, OAut... | MCP |
| mnemo-mcp | Persistent AI memory with hybrid search and embedded sync. Open, free, unlimi... | MCP |
| qwen3-embed | Lightweight Qwen3 text embedding and reranking via ONNX Runtime and GGUF | Library |
| skret | Secrets without the server. | CLI |
| tacet | TACET: a self-distilling neuro-symbolic cascade that amortises LLM cost in kn... | Tooling |
| web-core | Shared web infrastructure package for search, scraping, HTTP security, and st... | Library |
| wet-mcp | Open-source MCP server for AI agents: web search, content extraction, and lib... | MCP |
Shared web infrastructure package providing:
- SearXNG search -- cross-process singleton runner plus a retry/dedup/domain-filtering client.
- Multi-strategy scraping -- a LangGraph agent that escalates across API-direct, basic HTTP, TLS-fingerprint spoofing, headless rendering, remote rendering, and CAPTCHA-solving strategies.
- SSRF-safe HTTP client -- DNS-pinned `httpx` client plus URL normalization and domain validation helpers.
- Stealth + remote browsers -- a Patchright (undetected Playwright) provider, plus remote render clients (Cloudflare Browser Rendering, self-host browserless) for slim containers that offload JS rendering.
- robots.txt compliance -- per-domain cached `robots.txt` checks before fetching.
- LLM selector inference -- optional, env-key-gated CSS-selector inference when built-in selectors fail.
- External API adapters -- typed, SSRF-safe clients for Google Drive and MangaDex.
Used by wet-mcp and downstream applications.
Site-specific selectors live in consumer applications. This package provides generic infrastructure only. Consumers supply per-domain cookies and selectors via the environment variables in the Configuration section.
Installation
# From PyPI
uv add n24q02m-web-core
# Or require v2.0.0 or newer (v2.x is the current stable line)
uv add "n24q02m-web-core>=2.0.0"
Quick Usage
SearXNG Search
from web_core.search import ensure_searxng, shutdown_searxng
from web_core.search.client import search
# Start/reuse a SearXNG instance (cross-process singleton)
url = await ensure_searxng()
# Search with retry, deduplication, and domain filtering
results = await search(
    searxng_url=url,
    query="Python async patterns",
    max_results=10,
    include_domains=["docs.python.org"],
)
for r in results:
    print(f"{r.title}: {r.url}")
# Clean shutdown
await shutdown_searxng()
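The deduplication and domain filtering mentioned above can be pictured with a small standalone sketch. This is not web-core's implementation: `Hit` and `dedupe_and_filter` are hypothetical names, and the rules used here (exact host or subdomain match, URLs compared without fragment or trailing slash) are assumptions made for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a search result (not web_core's SearchResult)."""
    title: str
    url: str


def _dedup_key(url: str) -> str:
    # Assumption: two URLs count as duplicates if they differ only by
    # fragment, host case, or a trailing slash.
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{host}{path}?{parts.query}"


def dedupe_and_filter(hits, include_domains=None):
    """Keep the first occurrence of each URL, restricted to allowed domains."""
    allowed = [d.lower() for d in (include_domains or [])]
    seen = set()
    kept = []
    for hit in hits:
        host = urlsplit(hit.url).hostname or ""
        # An empty allow-list means "no domain restriction".
        if allowed and not any(host == d or host.endswith("." + d) for d in allowed):
            continue
        key = _dedup_key(hit.url)
        if key in seen:
            continue
        seen.add(key)
        kept.append(hit)
    return kept


hits = [
    Hit("A", "https://docs.python.org/3/library/asyncio.html"),
    Hit("A again", "https://docs.python.org/3/library/asyncio.html#top"),
    Hit("B", "https://example.com/x"),
]
filtered = dedupe_and_filter(hits, include_domains=["python.org"])
```

Matching on the parsed hostname rather than on a substring of the URL avoids accepting a host such as `python.org.evil.example`.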
Multi-Strategy Scraping
from web_core.scraper import ScrapingAgent
from web_core.scraper.strategies import BasicHTTPStrategy, TLSSpoofStrategy
# Initialize the agent with the strategies you want, in escalation order.
# All scraping dependencies (crawl4ai, patchright, curl-cffi, capsolver)
# ship as core dependencies, so every built-in strategy is importable.
agent = ScrapingAgent(strategies={
    "basic": BasicHTTPStrategy(),
    "tls": TLSSpoofStrategy(),
})
# Scrape with automatic strategy escalation
content = await agent.scrape("https://example.com/article")
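To make "automatic strategy escalation" concrete, here is a minimal sketch of the general pattern: try each strategy in order, validate the response, and move on when it fails. It is an illustration only, not the LangGraph state machine that `ScrapingAgent` uses; the `Strategy` type, `looks_valid` check, and fake strategies are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical strategy type: an async callable returning HTML, or None on failure.
Strategy = Callable[[str], Awaitable[Optional[str]]]


def looks_valid(html: Optional[str]) -> bool:
    # Assumption: a response is "valid" if it is non-empty and not trivially short.
    return bool(html) and len(html.strip()) >= 20


async def scrape_with_escalation(url: str, strategies: dict[str, Strategy]) -> tuple[str, str]:
    """Try each strategy in insertion order; return (strategy_name, content)."""
    errors: dict[str, str] = {}
    for name, strategy in strategies.items():
        try:
            html = await strategy(url)
        except Exception as exc:  # a failing strategy triggers escalation
            errors[name] = repr(exc)
            continue
        if looks_valid(html):
            return name, html
        errors[name] = "response failed validation"
    raise RuntimeError(f"all strategies failed for {url}: {errors}")


async def fake_basic(url: str) -> Optional[str]:
    return "<html></html>"  # too short, so validation fails


async def fake_tls(url: str) -> Optional[str]:
    return "<html><body>" + "content " * 10 + "</body></html>"


winner, page = asyncio.run(
    scrape_with_escalation(
        "https://example.com/article",
        {"basic": fake_basic, "tls": fake_tls},
    )
)
```

Here the cheap strategy returns an empty shell, so the loop escalates to the second one and reports which strategy succeeded.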
SSRF-Safe HTTP Client
from web_core.http import safe_httpx_client, is_safe_url
# Validate URL before use
assert is_safe_url("https://example.com") # True
assert not is_safe_url("http://localhost") # False (SSRF blocked)
# Create client with automatic SSRF protection + DNS pinning
async with safe_httpx_client() as client:
    resp = await client.get("https://example.com")
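The idea behind SSRF checks with DNS pinning can be sketched using only the standard library. This is an assumption-laden illustration of the technique, not the code inside `safe_httpx_client`: `is_public_ip`, `system_resolver`, and `resolve_and_pin` are hypothetical helpers, and the exact set of blocked ranges in web-core may differ.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def is_public_ip(ip_text: str) -> bool:
    """Return True only for addresses that are not internal or special-purpose."""
    ip = ipaddress.ip_address(ip_text)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def system_resolver(host: str) -> list[str]:
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # sockaddr[0] is the IP address as text.
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def resolve_and_pin(
    host: str,
    resolver: Callable[[str], Iterable[str]] = system_resolver,
) -> str:
    """Resolve once, reject the host if any address is internal, return the IP to use."""
    addresses = list(resolver(host))
    if not addresses:
        raise ValueError(f"{host} did not resolve")
    for addr in addresses:
        if not is_public_ip(addr):
            raise ValueError(f"{host} resolves to blocked address {addr}")
    return addresses[0]
```

The returned IP is the one the connection should actually use, while the original hostname is kept for the `Host` header and TLS. Because the name is not resolved a second time, an attacker cannot answer the validation lookup with a public address and the connection lookup with an internal one.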
URL Utilities
from web_core.http import normalize_url, strip_tracking_params, is_valid_domain
# Normalize for deduplication (lowercase, strip www/tracking/fragment)
normalize_url("https://WWW.Example.COM/page?utm_source=x#section")
# => "https://example.com/page"
# Validate domain names (prevents search operator injection)
is_valid_domain("example.com") # True
is_valid_domain("localhost") # False
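For readers curious what this kind of normalization and validation involves, the following standard-library sketch reproduces the behavior shown above. The function names, the tracking-parameter list, and the domain regex are assumptions for illustration; web-core's own rules may be broader or stricter.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: an illustrative (not exhaustive) set of tracking parameters.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}

# At least two dot-separated labels made of letters, digits, and hyphens.
DOMAIN_RE = re.compile(
    r"^(?=.{1,253}$)(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$"
)


def normalize_url_sketch(url: str) -> str:
    """Lowercase scheme and host, drop 'www.', tracking params, and the fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").removeprefix("www.")  # hostname is already lowercased
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]
    return urlunsplit((parts.scheme.lower(), host, parts.path, urlencode(query), ""))


def is_valid_domain_sketch(domain: str) -> bool:
    """Accept plain dotted domain names; reject anything with spaces or operators."""
    return DOMAIN_RE.fullmatch(domain.lower()) is not None
```

Rejecting any value that is not a plain dotted name is what keeps a string such as `example.com OR site:evil.com` from reaching a search query as an operator.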
Configuration
All configuration is read from environment variables. Every variable is optional: omitting one either falls back to the default behavior or disables the optional feature it controls. No variable is required to import or use the package.
Search
| Variable | Used by | Purpose |
|---|---|---|
| `SEARXNG_URL` | `search.runner` | Use an already-running SearXNG instance instead of starting a managed one. |
| `SEARXNG_USER` | `search.runner` | User the managed SearXNG container runs as (default `nobody`). |
Scraping
| Variable | Used by | Purpose |
|---|---|---|
| `WEB_CORE_DOMAIN_COOKIES` | `scraper.selector_inference` | JSON object `{"domain": {"cookie": "value"}}` of per-domain cookies (e.g. age-gate tokens). Keeps secrets out of source. |
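As an illustration of the expected shape, the value can be produced with `json.dumps` and read back defensively. The cookie names and values below are made up, and `load_domain_cookies` is a hypothetical helper: how web-core itself treats malformed input is not documented here.

```python
import json
import os

# Hypothetical cookie names and values, for illustration only.
cookies = {
    "example.com": {"age_verified": "1"},
    "forum.example.org": {"session_hint": "abc123"},
}

# Serialize to the {"domain": {"cookie": "value"}} shape and export it.
os.environ["WEB_CORE_DOMAIN_COOKIES"] = json.dumps(cookies)


def load_domain_cookies(env=os.environ) -> dict[str, dict[str, str]]:
    """Parse the variable; missing or malformed input yields an empty mapping."""
    raw = env.get("WEB_CORE_DOMAIN_COOKIES", "")
    if not raw:
        return {}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    # Keep only well-formed {"domain": {...}} entries and coerce values to str.
    return {
        str(domain): {str(name): str(value) for name, value in jar.items()}
        for domain, jar in data.items()
        if isinstance(jar, dict)
    }
```

In a deployment the variable would normally be set by the shell, a container definition, or a secrets manager rather than from Python.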
LLM selector inference (optional)
When built-in selectors fail to extract content, the scraper can ask an LLM to infer
CSS selectors. A provider is auto-detected from whichever key is present; if none is
set, inference is skipped silently. Consumers may also inject a custom llm_caller.
| Variable | Provider / purpose |
|---|---|
| `GEMINI_API_KEY` / `GOOGLE_API_KEY` | Google Gemini |
| `OPENAI_API_KEY` | OpenAI |
| `ANTHROPIC_API_KEY` | Anthropic |
| `XAI_API_KEY` | xAI |
| `WEB_CORE_LLM_MODEL` | Override the per-provider default model. |
| `GOOGLE_CLOUD_PROJECT` / `GOOGLE_CLOUD_LOCATION` | Route Gemini through Vertex AI instead of the public API. |
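Key-based auto-detection of this kind usually amounts to checking the environment in a fixed order. The sketch below shows the pattern with a hypothetical `detect_provider` helper. The precedence used here (the order of the table rows) is an assumption, as are the provider labels it returns.

```python
import os
from typing import Mapping, Optional

# Assumption: keys are checked in the order the table lists them.
PROVIDER_KEYS = [
    ("gemini", ("GEMINI_API_KEY", "GOOGLE_API_KEY")),
    ("openai", ("OPENAI_API_KEY",)),
    ("anthropic", ("ANTHROPIC_API_KEY",)),
    ("xai", ("XAI_API_KEY",)),
]


def detect_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first provider with a non-empty key, or None to skip inference."""
    for provider, names in PROVIDER_KEYS:
        if any(env.get(name) for name in names):
            return provider
    return None
```

Returning `None` instead of raising matches the documented behavior that inference is skipped silently when no key is set.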
Remote render backends (optional)
Credentials are passed as constructor arguments to the render clients; consumers typically source them from their own config/env:
- `CFBrowserRenderingClient(account_id, api_token)` -- Cloudflare Browser Rendering.
- `BrowserlessClient(base_url, token=...)` -- a self-hosted browserless `/content` endpoint.
Architecture
src/web_core/
    __init__.py                 -- Public API re-exports
    py.typed                    -- PEP 561 type stub marker
    http/                       -- Layer 1: SSRF-safe HTTP primitives
        client.py               -- safe_httpx_client, DNS pinning, IP validation, browser SSRF setup
        url.py                  -- normalize_url, strip_tracking_params, is_valid_domain, extract_domain
    search/                     -- Layer 2: SearXNG search engine
        client.py               -- search() with retry, dedup, domain filtering
        models.py               -- SearchResult, SearchError dataclasses
        runner.py               -- Cross-process SearXNG singleton manager
    scraper/                    -- Layer 2: Multi-strategy scraping agent
        agent.py                -- ScrapingAgent (LangGraph state machine)
        base.py                 -- BaseStrategy ABC, ScrapingResult
        cache.py                -- StrategyCache (per-domain performance tracking)
        robots.py               -- RobotsCache (per-domain robots.txt compliance)
        selector_inference.py   -- LLM-based CSS selector inference + domain cookie loading
        state.py                -- ScrapingState TypedDict, ScrapingError
        utils.py                -- Shared scraping helpers
        strategies/             -- Concrete strategy implementations
            api_direct.py       -- API endpoint detection and direct fetch
            basic_http.py       -- Simple httpx GET with SSRF protection
            captcha.py          -- CapSolver-backed captcha bypass
            headless.py         -- Crawl4AI headless browser rendering
            patchright_browser.py -- Patchright stealth-browser rendering
            remote_render.py    -- RemoteRenderStrategy over a RenderClient (CF / browserless)
            tls_spoof.py        -- curl_cffi TLS fingerprint spoofing
    browsers/                   -- Layer 2: Browser + remote render clients
        protocol.py             -- BrowserProvider Protocol (structural typing)
        patchright.py           -- Patchright (undetected Playwright) provider
        browserless.py          -- BrowserlessClient (self-host /content render client)
        cf_rendering.py         -- CFBrowserRenderingClient (Cloudflare Browser Rendering)
    adapters/                   -- Layer 2: External API adapters (typed, SSRF-safe)
        google_drive.py         -- Google Drive folder/file fetch
        mangadex.py             -- MangaDex client (manga, chapters, images)
Key Design Decisions
- SSRF protection: All outbound HTTP goes through `safe_httpx_client` with DNS pinning to prevent DNS rebinding attacks.
- Strategy escalation: The scraping agent tries strategies in cache-recommended order, validates responses, and automatically escalates on failure (including past under-rendered JS shells to a render backend).
- Cross-process SearXNG: A file-lock singleton ensures exactly one SearXNG instance runs across all Python processes.
- Structural typing: `BrowserProvider` and `RenderClient` use `Protocol` so implementations don't need inheritance.
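The structural-typing decision can be illustrated with `typing.Protocol`. The protocol below is a hypothetical stand-in: the real `RenderClient` may declare different methods, and the single `render(url)` method is an assumption chosen to keep the example small.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class RenderClientSketch(Protocol):
    """Hypothetical protocol; the real RenderClient's methods may differ."""

    async def render(self, url: str) -> str: ...


class FakeRenderer:
    # No base class: having a method of the right shape is enough.
    async def render(self, url: str) -> str:
        return f"<html><body>rendered {url}</body></html>"


async def fetch_rendered(client: RenderClientSketch, url: str) -> str:
    # Accepts any object that structurally satisfies the protocol.
    return await client.render(url)


html = asyncio.run(fetch_rendered(FakeRenderer(), "https://example.com"))
```

A consumer can therefore plug in its own render backend, or a test double, without importing or subclassing anything from the package.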
Development
Prerequisites
Setup
git clone git@github.com:n24q02m/web-core.git
cd web-core
uv sync --all-extras
pre-commit install
Commands
# Via mise
mise run setup # uv sync --all-extras
mise run lint # ruff check + ruff format --check
mise run test # pytest with coverage
mise run fix # auto-fix lint + format
# Direct
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
uv run ty check src/
uv run pytest --cov -q
Tests
- `asyncio_mode = "auto"` -- no `@pytest.mark.asyncio` needed
- Coverage threshold: 95% (enforced in pyproject.toml)
- Test files mirror source module structure under `tests/`
License