
Shared web infrastructure: search, scraping, HTTP security, browsers


web-core

Shared web infrastructure for search, scraping, HTTP security, and stealth browsers -- powers wet-mcp and downstream apps.

Sister projects from n24q02m
Project Tagline Tag
better-code-review-graph Knowledge graph for token-efficient code reviews -- semantic search and call-... MCP
better-email-mcp IMAP/SMTP email for AI agents -- read, send, organize folders, and manage att... MCP
better-godot-mcp Composite MCP server for Godot Engine -- 17 composite tools for AI-assisted g... MCP
better-notion-mcp Markdown-first Notion for AI agents -- pages, databases, blocks, and comments... MCP
better-telegram-mcp Telegram for AI agents -- messages, chats, media, and contacts across both bo... MCP
claude-plugins Claude Code plugin marketplace for the n24q02m MCP servers -- install web sea... Marketplace
imagine-mcp Image and video understanding + generation for AI agents -- across Gemini, Op... MCP
jules-task-archiver Chrome Extension for bulk operations on Jules tasks via batchexecute API -- a... Tooling
mcp-core Shared foundation for building MCP servers -- Streamable HTTP transport, OAut... MCP
mnemo-mcp Persistent AI memory with hybrid search and embedded sync. Open, free, unlimi... MCP
qwen3-embed Lightweight Qwen3 text embedding and reranking via ONNX Runtime and GGUF Library
skret Secrets without the server. CLI
tacet TACET: a self-distilling neuro-symbolic cascade that amortises LLM cost in kn... Tooling
web-core Shared web infrastructure package for search, scraping, HTTP security, and st... Library
wet-mcp Open-source MCP server for AI agents: web search, content extraction, and lib... MCP


Shared web infrastructure package: SearXNG search, multi-strategy scraping (basic, TLS spoof, Patchright stealth, Cloudflare CAPTCHA), SSRF-safe HTTP client, and stealth browser primitives. Used by wet-mcp and downstream applications.

Site-specific selectors have moved to consumer applications; this package provides generic infrastructure only. Consumers supply their own per-domain configuration via the WEB_CORE_DOMAIN_COOKIES environment variable pattern.

Installation

# From PyPI
uv add n24q02m-web-core

# Or pin to v2.x (current stable line)
uv add "n24q02m-web-core>=2.0.0"

Quick Usage

SearXNG Search

import asyncio

from web_core.search import ensure_searxng, shutdown_searxng
from web_core.search.client import search


async def main() -> None:
    # Start/reuse a SearXNG instance (cross-process singleton)
    url = await ensure_searxng()
    try:
        # Search with retry, deduplication, and domain filtering
        results = await search(
            searxng_url=url,
            query="Python async patterns",
            max_results=10,
            include_domains=["docs.python.org"],
        )
        for r in results:
            print(f"{r.title}: {r.url}")
    finally:
        # Clean shutdown, even if the search raises
        await shutdown_searxng()


asyncio.run(main())
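
The "cross-process singleton" mentioned in the example above relies on a lock that every Python process can see. The following is a minimal sketch of that general idea using an atomically created lock file. It is not the code in runner.py: the function names try_acquire and release, the lock path, and the choice of O_EXCL are all assumptions made for illustration, and stale locks left by crashed processes are not handled.

```python
import os
import tempfile
from pathlib import Path


def try_acquire(lock_path: Path) -> bool:
    """Try to become the single owner; return False if someone else holds the lock."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one caller succeeds.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        # Record the owner's PID so other processes can inspect it.
        handle.write(str(os.getpid()))
    return True


def release(lock_path: Path) -> None:
    """Give up ownership so another process can start the service."""
    lock_path.unlink(missing_ok=True)


if __name__ == "__main__":
    # Hypothetical lock location, used only for this demonstration.
    lock = Path(tempfile.mkdtemp()) / "searxng.lock"
    print(try_acquire(lock))  # first caller owns the instance
    print(try_acquire(lock))  # second caller should reuse the running instance
    release(lock)
```

A process that fails to acquire the lock would reuse the already running instance instead of starting a second one.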

Multi-Strategy Scraping

import asyncio

from web_core.scraper import ScrapingAgent
from web_core.scraper.strategies import BasicHTTPStrategy, TLSSpoofStrategy


async def main() -> None:
    # Initialize agent with desired strategies
    # Note: Some strategies (e.g., HeadlessStrategy, PatchrightStrategy)
    # require optional dependencies like crawl4ai or patchright.
    agent = ScrapingAgent(strategies={
        "basic": BasicHTTPStrategy(),
        "tls": TLSSpoofStrategy(),
    })

    # Scrape with automatic strategy escalation
    content = await agent.scrape("https://example.com/article")
    print(content)


asyncio.run(main())

SSRF-Safe HTTP Client

import asyncio

from web_core.http import safe_httpx_client, is_safe_url

# Validate URL before use
assert is_safe_url("https://example.com")  # True
assert not is_safe_url("http://localhost")  # False (SSRF blocked)


async def main() -> None:
    # Create client with automatic SSRF protection + DNS pinning
    async with safe_httpx_client() as client:
        resp = await client.get("https://example.com")
        print(resp.status_code)


asyncio.run(main())
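
To make "SSRF protection + DNS pinning" more concrete, here is a minimal sketch of the underlying idea using only the standard library: resolve the hostname once, reject any address that is not publicly routable, and then connect to the validated IP rather than resolving again. The names is_public_ip, system_resolver, and pin_host are hypothetical and are not part of web_core; the package's actual checks may differ.

```python
import ipaddress
import socket
from collections.abc import Callable


def is_public_ip(ip: str) -> bool:
    """Return True only for addresses that are safe to reach from a server."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def system_resolver(host: str) -> list[str]:
    """Resolve a hostname to IP strings using the OS resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def pin_host(host: str, resolve: Callable[[str], list[str]] = system_resolver) -> str:
    """Resolve once, validate every answer, and return the IP to connect to.

    Reusing the returned IP for the actual connection is what prevents a
    second DNS lookup from returning a different (internal) address.
    """
    ips = resolve(host)
    if not ips:
        raise ValueError(f"{host} did not resolve")
    unsafe = [ip for ip in ips if not is_public_ip(ip)]
    if unsafe:
        raise ValueError(f"{host} resolves to blocked address(es): {unsafe}")
    return ips[0]
```

Rejecting the whole hostname when any single answer is unsafe is a deliberately strict choice in this sketch; it avoids mixed public/internal record sets being used to slip past the check.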

URL Utilities

from web_core.http import normalize_url, strip_tracking_params, is_valid_domain

# Normalize for deduplication (lowercase, strip www/tracking/fragment)
normalize_url("https://WWW.Example.COM/page?utm_source=x#section")
# => "https://example.com/page"

# Validate domain names (prevents search operator injection)
is_valid_domain("example.com")   # True
is_valid_domain("localhost")     # False
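
For readers curious what normalization for deduplication involves, the sketch below reproduces the behaviour described in the comment above (lowercase host, strip www, tracking parameters, and fragment) with urllib.parse. It is an illustration only: normalize_url_sketch and the TRACKING_* constants are hypothetical, the list of tracking keys is an assumption, and edge cases such as userinfo or IPv6 hosts are ignored.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of tracking parameters; the real list may be longer.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url_sketch(url: str) -> str:
    """Return a canonical form of `url` suitable for duplicate detection."""
    parts = urlsplit(url)

    # `hostname` is already lowercased by urlsplit.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    if parts.port is not None:
        host = f"{host}:{parts.port}"

    # Keep only query parameters that are not tracking parameters.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]

    # An empty final component drops the fragment.
    return urlunsplit((parts.scheme, host, parts.path, urlencode(query), ""))


print(normalize_url_sketch("https://WWW.Example.COM/page?utm_source=x#section"))
# => https://example.com/page
```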

Architecture

src/web_core/
  __init__.py              -- Public API re-exports
  py.typed                 -- PEP 561 marker (package ships inline type hints)
  http/                    -- Layer 1: SSRF-safe HTTP primitives
    client.py              -- safe_httpx_client, DNS pinning, IP validation
    url.py                 -- normalize_url, strip_tracking_params, is_valid_domain
  search/                  -- Layer 2: SearXNG search engine
    client.py              -- search() with retry, dedup, domain filtering
    models.py              -- SearchResult, SearchError dataclasses
    runner.py              -- Cross-process SearXNG singleton manager
  scraper/                 -- Layer 2: Multi-strategy scraping agent
    agent.py               -- ScrapingAgent (LangGraph state machine)
    base.py                -- BaseStrategy ABC, ScrapingResult
    cache.py               -- StrategyCache (per-domain performance tracking)
    state.py               -- ScrapingState TypedDict, ScrapingError
    strategies/            -- Concrete strategy implementations
      api_direct.py        -- API endpoint detection and direct fetch
      basic_http.py        -- Simple httpx GET with SSRF protection
      captcha.py           -- CapSolver-backed captcha bypass
      headless.py          -- Crawl4AI headless browser rendering
      tls_spoof.py         -- curl_cffi TLS fingerprint spoofing
  browsers/                -- Layer 2: Stealth browser abstraction
    protocol.py            -- BrowserProvider Protocol (structural typing)
    patchright.py          -- Patchright (undetected Playwright) provider

Key Design Decisions

  • SSRF protection: All outbound HTTP goes through safe_httpx_client with DNS pinning to prevent DNS rebinding attacks.
  • Strategy escalation: The scraping agent tries strategies in cache-recommended order, validates responses, and automatically escalates on failure.
  • Cross-process SearXNG: A file-lock singleton ensures exactly one SearXNG instance runs across all Python processes.
  • Structural typing: BrowserProvider uses Protocol so implementations don't need inheritance.

Development

Prerequisites

  • Python 3.13
  • uv
  • mise (optional, for task shortcuts)

Setup

git clone git@github.com:n24q02m/web-core.git
cd web-core
uv sync --all-extras
pre-commit install

Commands

# Via mise
mise run setup     # uv sync --all-extras
mise run lint      # ruff check + ruff format --check
mise run test      # pytest with coverage
mise run fix       # auto-fix lint + format

# Direct
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
uv run ty check src/
uv run pytest --cov -q

Tests

  • asyncio_mode = "auto" -- no @pytest.mark.asyncio needed
  • Coverage threshold: 95% (enforced in pyproject.toml)
  • Test files mirror source module structure under tests/
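
With asyncio_mode = "auto", a coroutine test function is collected and awaited by pytest-asyncio without any decorator. The sketch below shows what such a test might look like; fetch_title is a hypothetical helper invented for this example and does not exist in web_core.

```python
import asyncio


async def fetch_title(url: str) -> str:
    """Hypothetical async helper standing in for real I/O."""
    await asyncio.sleep(0)  # yield to the event loop once
    return f"title for {url}"


# No @pytest.mark.asyncio decorator: auto mode treats async tests as asyncio tests.
async def test_fetch_title_includes_url() -> None:
    title = await fetch_title("https://example.com")
    assert "example.com" in title
```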

License

MIT
