Shared web infrastructure: search, scraping, HTTP security, browsers

Maintainers

n24q02m

web-core

Shared web infrastructure for search, scraping, HTTP security, and stealth browsers -- powers wet-mcp and downstream apps.

Sister projects from n24q02m

Project	Tagline	Tag
better-code-review-graph	Knowledge graph for token-efficient code reviews -- semantic search and call-...	MCP
better-email-mcp	IMAP/SMTP email for AI agents -- read, send, organize folders, and manage att...	MCP
better-godot-mcp	Composite MCP server for Godot Engine -- 17 composite tools for AI-assisted g...	MCP
better-notion-mcp	Markdown-first Notion for AI agents -- pages, databases, blocks, and comments...	MCP
better-telegram-mcp	Telegram for AI agents -- messages, chats, media, and contacts across both bo...	MCP
claude-plugins	Claude Code plugin marketplace for the n24q02m MCP servers -- install web sea...	Marketplace
imagine-mcp	Image and video understanding + generation for AI agents -- across Gemini, Op...	MCP
jules-task-archiver	Chrome Extension for bulk operations on Jules tasks via batchexecute API -- a...	Tooling
mcp-core	Shared foundation for building MCP servers -- Streamable HTTP transport, OAut...	MCP
mnemo-mcp	Persistent AI memory with hybrid search and embedded sync. Open, free, unlimi...	MCP
qwen3-embed	Lightweight Qwen3 text embedding and reranking via ONNX Runtime and GGUF	Library
skret	Secrets without the server.	CLI
tacet	TACET: a self-distilling neuro-symbolic cascade that amortises LLM cost in kn...	Tooling
web-core	Shared web infrastructure package for search, scraping, HTTP security, and st...	Library
wet-mcp	Open-source MCP server for AI agents: web search, content extraction, and lib...	MCP

Installation
Quick Usage
Configuration
Architecture
Development
License

Shared web infrastructure package providing:

SearXNG search -- cross-process singleton runner plus a retry/dedup/domain-filtering client.
Multi-strategy scraping -- a LangGraph agent that escalates across API-direct, basic HTTP, TLS-fingerprint spoofing, headless rendering, remote rendering, and CAPTCHA-solving strategies.
SSRF-safe HTTP client -- DNS-pinned httpx client plus URL normalization and domain validation helpers.
Stealth + remote browsers -- a Patchright (undetected Playwright) provider, plus remote render clients (Cloudflare Browser Rendering, self-host browserless) for slim containers that offload JS rendering.
robots.txt compliance -- per-domain cached robots.txt checks before fetching.
LLM selector inference -- optional, env-key-gated CSS-selector inference when built-in selectors fail.
External API adapters -- typed, SSRF-safe clients for Google Drive and MangaDex.

Used by wet-mcp and downstream applications.

Site-specific selectors live in consumer applications. This package provides generic infrastructure only. Consumers supply per-domain cookies and selectors via the environment variables in the Configuration section.
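
The robots.txt item in the feature list describes per-domain cached checks before fetching. The sketch below shows one way such a cache could work using only the standard library. `RobotsCacheSketch` and the injected `fetch_robots_txt` callable are hypothetical names for illustration; they are not the package's `RobotsCache` API.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCacheSketch:
    """Hypothetical per-origin cache of parsed robots.txt rules."""

    def __init__(self, fetch_robots_txt):
        # fetch_robots_txt(origin) -> str is injected so the sketch can run
        # offline; a real checker would download <origin>/robots.txt.
        self._fetch = fetch_robots_txt
        self._parsers = {}

    def allowed(self, url, user_agent="*"):
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(origin)
        if parser is None:
            # First request for this origin: fetch and parse once, then reuse.
            parser = RobotFileParser()
            parser.parse(self._fetch(origin).splitlines())
            self._parsers[origin] = parser
        return parser.can_fetch(user_agent, url)
```

Keying the cache on scheme plus host means each origin's robots.txt is fetched once per process, which is the main point of caching here.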

Installation

# From PyPI
uv add n24q02m-web-core

# Or pin to v2.x (current stable line)
uv add "n24q02m-web-core>=2.0.0"

Quick Usage

SearXNG Search

import asyncio

from web_core.search import ensure_searxng, shutdown_searxng
from web_core.search.client import search


async def main() -> None:
    # Start/reuse a SearXNG instance (cross-process singleton)
    url = await ensure_searxng()
    try:
        # Search with retry, deduplication, and domain filtering
        results = await search(
            searxng_url=url,
            query="Python async patterns",
            max_results=10,
            include_domains=["docs.python.org"],
        )
        for r in results:
            print(f"{r.title}: {r.url}")
    finally:
        # Clean shutdown, even if the search raises
        await shutdown_searxng()


asyncio.run(main())
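
To make "deduplication and domain filtering" concrete, here is a minimal standalone sketch of that post-processing step. It is an illustration under stated assumptions, not the package's implementation: `Hit`, `filter_and_dedup`, and the choice of dedup key (host, path, query) are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a search result (not web_core's SearchResult)."""
    title: str
    url: str


def _host(url):
    # Lowercase the host and drop a leading "www." so variants compare equal.
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def filter_and_dedup(hits, include_domains=None, max_results=10):
    allowed = {d.lower() for d in include_domains or []}
    seen = set()
    kept = []
    for hit in hits:
        host = _host(hit.url)
        # Accept exact matches and subdomains of an allowed domain.
        if allowed and not any(host == d or host.endswith("." + d) for d in allowed):
            continue
        parts = urlsplit(hit.url)
        # Assumed dedup key: fragments and trailing slashes are ignored.
        key = (host, parts.path.rstrip("/"), parts.query)
        if key in seen:
            continue
        seen.add(key)
        kept.append(hit)
        if len(kept) >= max_results:
            break
    return kept
```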

Multi-Strategy Scraping

import asyncio

from web_core.scraper import ScrapingAgent
from web_core.scraper.strategies import BasicHTTPStrategy, TLSSpoofStrategy


async def main() -> None:
    # Initialize the agent with the strategies you want, in escalation order.
    # All scraping dependencies (crawl4ai, patchright, curl-cffi, capsolver)
    # ship as core dependencies, so every built-in strategy is importable.
    agent = ScrapingAgent(strategies={
        "basic": BasicHTTPStrategy(),
        "tls": TLSSpoofStrategy(),
    })

    # Scrape with automatic strategy escalation
    content = await agent.scrape("https://example.com/article")


asyncio.run(main())
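
The idea of escalation can be sketched without any scraping libraries: try a cheap strategy first, validate what comes back, and move to a heavier one when the attempt fails or the page looks under-rendered. Everything below is a hypothetical stand-in (the three fake strategies, `looks_complete`, and its 200-character threshold); the real agent is a LangGraph state machine and will differ.

```python
import asyncio


class StrategyFailed(Exception):
    """Raised by a hypothetical strategy that could not fetch usable content."""


async def basic_http(url):
    # Stand-in for a cheap strategy that gets blocked.
    raise StrategyFailed("403 from origin")


async def js_shell(url):
    # Stand-in for a strategy that returns an under-rendered JS shell.
    return "<html><body><div id='root'></div></body></html>"


async def rendered(url):
    # Stand-in for a heavier strategy that returns real content.
    return "<html><body><article>" + "Full article text. " * 20 + "</article></body></html>"


def looks_complete(html, min_chars=200):
    # Crude validation rule (an assumption): very short pages are treated as shells.
    return len(html) >= min_chars


async def scrape_with_escalation(url, strategies):
    """Try each (name, coroutine function) in order; return the first valid result."""
    errors = []
    for name, strategy in strategies:
        try:
            html = await strategy(url)
        except StrategyFailed as exc:
            errors.append(f"{name}: {exc}")
            continue
        if looks_complete(html):
            return name, html
        errors.append(f"{name}: response failed validation")
    raise StrategyFailed("; ".join(errors))
```

Collecting one error per attempt keeps the final exception useful for debugging which strategies were tried and why each was rejected.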

SSRF-Safe HTTP Client

import asyncio

from web_core.http import safe_httpx_client, is_safe_url

# Validate URL before use
assert is_safe_url("https://example.com")  # True
assert not is_safe_url("http://localhost")  # False (SSRF blocked)


async def main() -> None:
    # Create client with automatic SSRF protection + DNS pinning
    async with safe_httpx_client() as client:
        resp = await client.get("https://example.com")
        print(resp.status_code)


asyncio.run(main())
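
DNS pinning is easiest to understand as "resolve once, validate every answer, then connect to the validated IP", so a second lookup cannot return a different (internal) address. The following is a minimal standard-library sketch of that idea, not the package's client: `is_public_ip` and `resolve_and_pin` are hypothetical helpers, and the resolver is injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket


def is_public_ip(address):
    """Return True only for globally routable, non-multicast addresses."""
    ip = ipaddress.ip_address(address)
    return ip.is_global and not ip.is_multicast


def resolve_and_pin(host, port=443, resolver=socket.getaddrinfo):
    """Resolve once, reject unsafe answers, and return the IP to connect to."""
    infos = resolver(host, port, type=socket.SOCK_STREAM)
    # Each getaddrinfo entry ends with a sockaddr tuple whose first item is the IP.
    addresses = [info[4][0] for info in infos]
    if not addresses:
        raise ValueError(f"{host} did not resolve")
    for address in addresses:
        # Reject the host if ANY answer is internal, not just the first one.
        if not is_public_ip(address):
            raise ValueError(f"{host} resolves to non-public address {address}")
    return addresses[0]
```

A caller would then open the connection to the returned IP while still sending the original hostname for TLS and the `Host` header.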

URL Utilities

from web_core.http import normalize_url, strip_tracking_params, is_valid_domain

# Normalize for deduplication (lowercase, strip www/tracking/fragment)
normalize_url("https://WWW.Example.COM/page?utm_source=x#section")
# => "https://example.com/page"

# Validate domain names (prevents search operator injection)
is_valid_domain("example.com")   # True
is_valid_domain("localhost")     # False
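
For readers curious what such normalization involves, here is a rough standard-library equivalent. It is a sketch under assumptions rather than the package's code: `normalize_url_sketch` is a hypothetical name, and the tracking-parameter list is illustrative because the README does not show the real one.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the package's actual list is not shown here.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def normalize_url_sketch(url):
    parts = urlsplit(url)
    # .hostname is already lowercased; userinfo is dropped in this sketch.
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    query = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_NAMES and not name.startswith(TRACKING_PREFIXES)
    ]
    # Passing an empty fragment drops "#section".
    return urlunsplit((parts.scheme.lower(), host, parts.path, urlencode(query), ""))
```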

Configuration

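All configuration is read from environment variables. Every variable is optional: none is required to import or use the package. Omitting one either falls back to the default behavior (for example, a managed SearXNG instance when `SEARXNG_URL` is unset) or disables the optional feature it controls.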

Search

Variable	Used by	Purpose
`SEARXNG_URL`	`search.runner`	Use an already-running SearXNG instance instead of starting a managed one.
`SEARXNG_USER`	`search.runner`	User the managed SearXNG container runs as (default `nobody`).

Scraping

Variable	Used by	Purpose
`WEB_CORE_DOMAIN_COOKIES`	`scraper.selector_inference`	JSON object `{"domain": {"cookie": "value"}}` of per-domain cookies (e.g. age-gate tokens). Keeps secrets out of source.
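
As an illustration of the expected shape, the sketch below parses `WEB_CORE_DOMAIN_COOKIES` into a nested dictionary. `load_domain_cookies` is a hypothetical helper, and treating malformed input as "no cookies" is an assumption; the package's own loader may behave differently.

```python
import json
import os


def load_domain_cookies(env=os.environ):
    """Parse WEB_CORE_DOMAIN_COOKIES into {domain: {cookie_name: value}}."""
    raw = env.get("WEB_CORE_DOMAIN_COOKIES", "")
    if not raw.strip():
        return {}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Assumed behavior: malformed JSON means "no cookies configured".
        return {}
    if not isinstance(data, dict):
        return {}
    return {
        str(domain): {str(k): str(v) for k, v in cookies.items()}
        for domain, cookies in data.items()
        if isinstance(cookies, dict)
    }
```

A consumer could set the variable as, for example, `WEB_CORE_DOMAIN_COOKIES='{"example.com": {"age_ok": "1"}}'` (domain and cookie name here are placeholders).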

LLM selector inference (optional)

When built-in selectors fail to extract content, the scraper can ask an LLM to infer CSS selectors. A provider is auto-detected from whichever key is present; if none is set, inference is skipped silently. Consumers may also inject a custom llm_caller.

Variable	Provider / effect
`GEMINI_API_KEY` / `GOOGLE_API_KEY`	Google Gemini
`OPENAI_API_KEY`	OpenAI
`ANTHROPIC_API_KEY`	Anthropic
`XAI_API_KEY`	xAI
`WEB_CORE_LLM_MODEL`	Override the per-provider default model.
`GOOGLE_CLOUD_PROJECT` / `GOOGLE_CLOUD_LOCATION`	Route Gemini through Vertex AI instead of the public API.
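
Auto-detection of this kind usually amounts to checking the keys in a fixed order and returning the first match. The sketch below shows that pattern; `detect_provider` is a hypothetical function, and the precedence order is an assumption, since the table lists the variables but not which one wins when several are set.

```python
import os

# Precedence order is an assumption; the README only lists the variables.
PROVIDER_KEYS = [
    ("GEMINI_API_KEY", "gemini"),
    ("GOOGLE_API_KEY", "gemini"),
    ("OPENAI_API_KEY", "openai"),
    ("ANTHROPIC_API_KEY", "anthropic"),
    ("XAI_API_KEY", "xai"),
]


def detect_provider(env=os.environ):
    """Return (provider, model_override) or None when no key is configured."""
    for variable, provider in PROVIDER_KEYS:
        if env.get(variable):
            # A None override means "use the provider's default model".
            return provider, env.get("WEB_CORE_LLM_MODEL")
    # No key present: the caller skips inference.
    return None
```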

Remote render backends (optional)

Credentials are passed as constructor arguments to the render clients; consumers typically source them from their own config/env:

CFBrowserRenderingClient(account_id, api_token) -- Cloudflare Browser Rendering.
BrowserlessClient(base_url, token=...) -- a self-hosted browserless /content endpoint.

Architecture

src/web_core/
  __init__.py              -- Public API re-exports
  py.typed                 -- PEP 561 marker (package ships inline type hints)
  http/                    -- Layer 1: SSRF-safe HTTP primitives
    client.py              -- safe_httpx_client, DNS pinning, IP validation, browser SSRF setup
    url.py                 -- normalize_url, strip_tracking_params, is_valid_domain, extract_domain
  search/                  -- Layer 2: SearXNG search engine
    client.py              -- search() with retry, dedup, domain filtering
    models.py              -- SearchResult, SearchError dataclasses
    runner.py              -- Cross-process SearXNG singleton manager
  scraper/                 -- Layer 2: Multi-strategy scraping agent
    agent.py               -- ScrapingAgent (LangGraph state machine)
    base.py                -- BaseStrategy ABC, ScrapingResult
    cache.py               -- StrategyCache (per-domain performance tracking)
    robots.py              -- RobotsCache (per-domain robots.txt compliance)
    selector_inference.py  -- LLM-based CSS selector inference + domain cookie loading
    state.py               -- ScrapingState TypedDict, ScrapingError
    utils.py               -- Shared scraping helpers
    strategies/            -- Concrete strategy implementations
      api_direct.py        -- API endpoint detection and direct fetch
      basic_http.py        -- Simple httpx GET with SSRF protection
      captcha.py           -- CapSolver-backed captcha bypass
      headless.py          -- Crawl4AI headless browser rendering
      patchright_browser.py -- Patchright stealth-browser rendering
      remote_render.py     -- RemoteRenderStrategy over a RenderClient (CF / browserless)
      tls_spoof.py         -- curl_cffi TLS fingerprint spoofing
  browsers/                -- Layer 2: Browser + remote render clients
    protocol.py            -- BrowserProvider Protocol (structural typing)
    patchright.py          -- Patchright (undetected Playwright) provider
    browserless.py         -- BrowserlessClient (self-host /content render client)
    cf_rendering.py        -- CFBrowserRenderingClient (Cloudflare Browser Rendering)
  adapters/                -- Layer 2: External API adapters (typed, SSRF-safe)
    google_drive.py        -- Google Drive folder/file fetch
    mangadex.py            -- MangaDex client (manga, chapters, images)

Key Design Decisions

SSRF protection: All outbound HTTP goes through safe_httpx_client with DNS pinning to prevent DNS rebinding attacks.
Strategy escalation: The scraping agent tries strategies in cache-recommended order, validates each response, and automatically escalates on failure. Under-rendered JS shells count as failures and are escalated to a render backend.
Cross-process SearXNG: A file-lock singleton ensures exactly one SearXNG instance runs across all Python processes.
Structural typing: BrowserProvider and RenderClient use Protocol so implementations don't need inheritance.
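
The structural-typing point can be shown in a few lines: a class satisfies a `Protocol` by having the right methods, with no base class involved. The sketch below is illustrative only. `RenderClientSketch` and its single `render` method are assumptions; the real `RenderClient` and `BrowserProvider` protocols may define different members.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class RenderClientSketch(Protocol):
    """Hypothetical shape of a render client; the real RenderClient may differ."""

    async def render(self, url: str) -> str: ...


class FakeRenderer:
    # No inheritance from the protocol: having a matching method is enough.
    async def render(self, url: str) -> str:
        return f"<html><body>rendered {url}</body></html>"


async def fetch_rendered(client: RenderClientSketch, url: str) -> str:
    # Accepts any object with a compatible render() coroutine.
    return await client.render(url)
```

This is what lets a consumer plug in its own backend, or a test double, without importing anything from the package's class hierarchy.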

Development

Prerequisites

Python 3.13
uv
mise (optional, for task shortcuts)

Setup

git clone git@github.com:n24q02m/web-core.git
cd web-core
uv sync --all-extras
pre-commit install

Commands

# Via mise
mise run setup     # uv sync --all-extras
mise run lint      # ruff check + ruff format --check
mise run test      # pytest with coverage
mise run fix       # auto-fix lint + format

# Direct
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
uv run ty check src/
uv run pytest --cov -q

Tests

asyncio_mode = "auto" -- no @pytest.mark.asyncio needed
Coverage threshold: 95% (enforced in pyproject.toml)
Test files mirror source module structure under tests/

License

MIT

Project details

Release history

Version	Released	Notes
2.3.0	Jul 1, 2026	this version
2.3.0b2	Jun 19, 2026	pre-release
2.3.0b1	Jun 19, 2026	pre-release
2.2.2b3	Jun 10, 2026	pre-release
2.2.2b2	Jun 10, 2026	pre-release
2.2.2b1	Jun 10, 2026	pre-release
2.2.1	Jun 9, 2026	
2.2.1b1	Jun 9, 2026	pre-release
2.2.0	Jun 7, 2026	
2.2.0b1	Jun 7, 2026	pre-release
2.1.1	May 28, 2026	
2.1.0	May 26, 2026	
2.1.0b3	May 26, 2026	pre-release
2.1.0b2	May 24, 2026	pre-release
2.1.0b1	May 24, 2026	pre-release
2.0.1	May 9, 2026	
2.0.1b1	May 9, 2026	pre-release
2.0.0	May 9, 2026	
2.0.0b1	May 9, 2026	pre-release
1.3.12	May 6, 2026	
1.3.12b1	May 6, 2026	pre-release
1.3.11	May 5, 2026	
1.3.10	May 5, 2026	
1.3.10b2	May 5, 2026	pre-release
1.3.10b1	Apr 30, 2026	pre-release
1.3.9	Apr 29, 2026	
1.3.8	Apr 28, 2026	
1.3.7	Apr 27, 2026	
1.3.6	Apr 24, 2026	
1.3.5	Apr 22, 2026	
1.3.4	Apr 22, 2026	
1.3.3	Apr 22, 2026	
1.3.2	Apr 22, 2026	
1.3.1	Apr 21, 2026	
1.3.0	Apr 21, 2026	
1.2.0	Apr 17, 2026	
1.1.1b7	Apr 6, 2026	pre-release
1.1.1b6	Apr 6, 2026	pre-release
1.1.1b5	Apr 6, 2026	pre-release
1.1.1b4	Apr 6, 2026	pre-release
1.1.1b3	Apr 6, 2026	pre-release
1.1.1b2	Apr 6, 2026	pre-release
1.1.1b1	Apr 6, 2026	pre-release
1.1.0	Apr 6, 2026	
1.1.0b5	Apr 6, 2026	pre-release
1.1.0b4	Apr 6, 2026	pre-release
1.1.0b3	Apr 6, 2026	pre-release
1.1.0b2	Apr 6, 2026	pre-release
1.1.0b1	Apr 5, 2026	pre-release
1.0.1	Mar 31, 2026	

Download files

Source Distribution

n24q02m_web_core-2.3.0.tar.gz (247.7 kB)

Uploaded Jul 1, 2026 Source

Built Distribution

n24q02m_web_core-2.3.0-py3-none-any.whl (69.5 kB)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file n24q02m_web_core-2.3.0.tar.gz.

File metadata

Download URL: n24q02m_web_core-2.3.0.tar.gz
Upload date: Jul 1, 2026
Size: 247.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.26 (`uv publish`, run in CI on Ubuntu 24.04)

File hashes

Hashes for n24q02m_web_core-2.3.0.tar.gz
Algorithm	Hash digest
SHA256	`6b379ef2ce443cb3db4045c3196bdd56ddc4dff6d8e17e4ae963ddf9a165a7f1`
MD5	`5a5e447e30c863051efb17272afecd7a`
BLAKE2b-256	`e65a2adc7426f39e2d099b4594d3bb00554b7d5b8becda6042f055b15465ac32`

File details

Details for the file n24q02m_web_core-2.3.0-py3-none-any.whl.

File metadata

Download URL: n24q02m_web_core-2.3.0-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 69.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.26 (`uv publish`, run in CI on Ubuntu 24.04)

File hashes

Hashes for n24q02m_web_core-2.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d0b1ef930cfe807601d735fa20011814d068e85dc8a11c7a76dc3329f2ad1455`
MD5	`8464225ae66668d5e4c08b81bd915da9`
BLAKE2b-256	`f66aa73dd4e2dea51364f7c8cf1a75e28792a3de3920c9c299e31178a9bfd4e7`
