Shared web infrastructure: search, scraping, HTTP security, browsers
web-core
Shared web infrastructure package for search, scraping, HTTP security, and stealth browsers. Used by knowledge-core and downstream applications.
Installation
# From PyPI
uv add n24q02m-web-core
# Or pin to a specific version
uv add "n24q02m-web-core>=1.0.0"
Quick Usage
SearXNG Search
from web_core.search import ensure_searxng, shutdown_searxng
from web_core.search.client import search
# Note: these snippets use top-level await, so run them inside an async
# function (for example via asyncio.run()) or an async-aware REPL.
# Start/reuse a SearXNG instance (cross-process singleton)
url = await ensure_searxng()
# Search with retry, deduplication, and domain filtering
results = await search(
    searxng_url=url,
    query="Python async patterns",
    max_results=10,
    include_domains=["docs.python.org"],
)
for r in results:
    print(f"{r.title}: {r.url}")
# Clean shutdown
await shutdown_searxng()
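To illustrate what a cross-process singleton guarded by a file lock can look like, here is a minimal sketch. It is not web-core's runner: the function name `ensure_instance`, the file names, and the `start` callback are hypothetical, and it assumes a Unix system where `fcntl.flock` is available.

```python
import fcntl  # Unix-only; an assumption of this sketch
import json
import tempfile
from pathlib import Path
from typing import Callable


def ensure_instance(state_dir: Path, start: Callable[[], str]) -> str:
    """Return the shared instance URL, starting the service only if nobody has."""
    state_dir.mkdir(parents=True, exist_ok=True)
    state_path = state_dir / "instance.json"  # hypothetical file name
    with open(state_dir / "instance.lock", "w") as lock_file:
        # Blocks until any other process holding the lock releases it,
        # so the check-then-start step below runs in one process at a time.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if state_path.exists():
                # Another caller already started the service: reuse its URL.
                return json.loads(state_path.read_text())["url"]
            url = start()
            state_path.write_text(json.dumps({"url": url}))
            return url
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


# Demo with a fake starter that records how often it is invoked.
starts: list[str] = []


def fake_start() -> str:
    starts.append("started")
    return "http://127.0.0.1:8888"


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_instance(Path(tmp), fake_start)
    second = ensure_instance(Path(tmp), fake_start)
```

The sketch does not check that a recorded instance is still alive; a real manager would also need a health check and cleanup of stale state.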
Multi-Strategy Scraping
from web_core.scraper import ScrapingAgent
from web_core.scraper.strategies import BasicHTTPStrategy, TLSSpoofStrategy
# Initialize agent with desired strategies
# Note: Some strategies (e.g., HeadlessStrategy, PatchrightStrategy)
# require optional dependencies like crawl4ai or patchright.
agent = ScrapingAgent(strategies={
    "basic": BasicHTTPStrategy(),
    "tls": TLSSpoofStrategy(),
})
# Scrape with automatic strategy escalation
content = await agent.scrape("https://example.com/article")
SSRF-Safe HTTP Client
from web_core.http import safe_httpx_client, is_safe_url
# Validate URL before use
assert is_safe_url("https://example.com") # True
assert not is_safe_url("http://localhost") # False (SSRF blocked)
# Create client with automatic SSRF protection + DNS pinning
async with safe_httpx_client() as client:
    resp = await client.get("https://example.com")
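The idea behind SSRF checks with DNS pinning can be sketched with the standard library alone. This is an illustration under stated assumptions, not web-core's implementation: `is_public_ip` and `resolve_and_pin` are hypothetical names, and the injectable `resolver` parameter exists only so the example can run without network access.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """Reject loopback, private, link-local, and other non-routable addresses."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_and_pin(url: str, resolver=socket.getaddrinfo) -> tuple[str, str]:
    """Resolve the host once, validate every address, return (hostname, pinned_ip)."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"} or not parts.hostname:
        raise ValueError(f"unsupported URL: {url!r}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = resolver(parts.hostname, port, type=socket.SOCK_STREAM)
    addresses = [info[4][0] for info in infos]
    # Every resolved address must be public; one internal record is enough to block.
    if not addresses or not all(is_public_ip(a) for a in addresses):
        raise ValueError(f"blocked non-public destination for {parts.hostname!r}")
    # The caller connects to this exact IP, so a later DNS answer cannot
    # redirect the request to an internal address (DNS rebinding).
    return parts.hostname, addresses[0]


def fake_resolver(host, port, **kwargs):
    """Offline stand-in for socket.getaddrinfo used by the demo below."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("93.184.216.34", port))]


pinned = resolve_and_pin("https://example.com/page", resolver=fake_resolver)
```

Pinning matters because validating a hostname and then letting the HTTP client resolve it again leaves a window in which the DNS answer can change.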
URL Utilities
from web_core.http import normalize_url, strip_tracking_params, is_valid_domain
# Normalize for deduplication (lowercase, strip www/tracking/fragment)
normalize_url("https://WWW.Example.COM/page?utm_source=x#section")
# => "https://example.com/page"
# Validate domain names (prevents search operator injection)
is_valid_domain("example.com") # True
is_valid_domain("localhost") # False
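As a rough picture of what these helpers do, the following sketch reproduces the documented behaviour with `urllib.parse` and a regular expression. The `_sketch` function names and the list of tracking parameters are assumptions made for illustration; the library's actual rules may differ.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the real list in web-core is not shown in this README.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def normalize_url_sketch(url: str) -> str:
    """Lowercase the host, strip 'www.', tracking parameters, and the fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port:
        host = f"{host}:{parts.port}"
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit((parts.scheme, host, parts.path, urlencode(query), ""))


def is_valid_domain_sketch(domain: str) -> bool:
    """Accept only dotted hostnames, so search operators cannot be smuggled in."""
    labels = domain.lower().split(".")
    return (
        len(labels) >= 2
        and len(domain) <= 253
        and all(_LABEL.fullmatch(label) for label in labels)
    )
```

Rejecting anything that is not a plain dotted hostname means a value such as `example.com OR site:evil.com` cannot reach the search query as an extra operator.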
Architecture
src/web_core/
  __init__.py          -- Public API re-exports
  py.typed             -- PEP 561 type stub marker
  http/                -- Layer 1: SSRF-safe HTTP primitives
    client.py          -- safe_httpx_client, DNS pinning, IP validation
    url.py             -- normalize_url, strip_tracking_params, is_valid_domain
  search/              -- Layer 2: SearXNG search engine
    client.py          -- search() with retry, dedup, domain filtering
    models.py          -- SearchResult, SearchError dataclasses
    runner.py          -- Cross-process SearXNG singleton manager
  scraper/             -- Layer 2: Multi-strategy scraping agent
    agent.py           -- ScrapingAgent (LangGraph state machine)
    base.py            -- BaseStrategy ABC, ScrapingResult
    cache.py           -- StrategyCache (per-domain performance tracking)
    state.py           -- ScrapingState TypedDict, ScrapingError
    strategies/        -- Concrete strategy implementations
      api_direct.py    -- API endpoint detection and direct fetch
      basic_http.py    -- Simple httpx GET with SSRF protection
      captcha.py       -- CapSolver-backed captcha bypass
      headless.py      -- Crawl4AI headless browser rendering
      tls_spoof.py     -- curl_cffi TLS fingerprint spoofing
  browsers/            -- Layer 2: Stealth browser abstraction
    protocol.py        -- BrowserProvider Protocol (structural typing)
    patchright.py      -- Patchright (undetected Playwright) provider
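The structural typing mentioned for `browsers/protocol.py` can be illustrated with `typing.Protocol`. The method name `fetch_html` and the classes below are hypothetical and chosen only for this sketch; the real `BrowserProvider` interface is not shown in this README.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class BrowserProviderSketch(Protocol):
    """Anything with a matching fetch_html coroutine satisfies this protocol."""

    async def fetch_html(self, url: str) -> str: ...


class FakeBrowser:
    """Does not inherit from the protocol; it only matches its shape."""

    async def fetch_html(self, url: str) -> str:
        return f"<html><body>rendered {url}</body></html>"


async def render(provider: BrowserProviderSketch, url: str) -> str:
    # Type checkers accept FakeBrowser here because its methods match.
    return await provider.fetch_html(url)


html = asyncio.run(render(FakeBrowser(), "https://example.com"))
```

With `runtime_checkable`, `isinstance` only verifies that the method exists, not its signature, so static type checking remains the main safety net.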
Key Design Decisions
- SSRF protection: All outbound HTTP goes through safe_httpx_client with DNS pinning to prevent DNS rebinding attacks.
- Strategy escalation: The scraping agent tries strategies in cache-recommended order, validates responses, and automatically escalates on failure.
- Cross-process SearXNG: A file-lock singleton ensures exactly one SearXNG instance runs across all Python processes.
- Structural typing: BrowserProvider uses Protocol so implementations don't need inheritance.
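The strategy escalation described above can be pictured as a loop over strategies ordered by past success on a domain. The sketch below uses a plain loop rather than the LangGraph state machine the package uses, and every name in it (`StrategyCacheSketch`, `scrape_with_escalation`, the `min_length` validation rule) is an assumption made for illustration.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable
from urllib.parse import urlsplit

Strategy = Callable[[str], Awaitable[str]]


class StrategyCacheSketch:
    """Counts successes per (domain, strategy) to recommend an order."""

    def __init__(self) -> None:
        self._wins: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def order(self, domain: str, names: list[str]) -> list[str]:
        # Stable sort: most successful first, ties keep registration order.
        return sorted(names, key=lambda name: -self._wins[domain][name])

    def record_success(self, domain: str, name: str) -> None:
        self._wins[domain][name] += 1


async def scrape_with_escalation(
    url: str,
    strategies: dict[str, Strategy],
    cache: StrategyCacheSketch,
    min_length: int = 20,  # hypothetical validation rule
) -> tuple[str, str]:
    """Try strategies in recommended order; return (strategy_name, content)."""
    domain = urlsplit(url).hostname or ""
    errors: list[str] = []
    for name in cache.order(domain, list(strategies)):
        try:
            content = await strategies[name](url)
        except Exception as exc:  # a failed strategy triggers escalation
            errors.append(f"{name}: {exc}")
            continue
        if len(content) >= min_length:  # validate before accepting
            cache.record_success(domain, name)
            return name, content
        errors.append(f"{name}: response failed validation")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


calls: list[str] = []


async def basic(url: str) -> str:
    calls.append("basic")
    return ""  # simulates a blocked or empty response


async def tls(url: str) -> str:
    calls.append("tls")
    return "<article>real page content here</article>"


async def demo() -> tuple[str, str]:
    cache = StrategyCacheSketch()
    strategies = {"basic": basic, "tls": tls}
    first = await scrape_with_escalation("https://example.com/a", strategies, cache)
    second = await scrape_with_escalation("https://example.com/b", strategies, cache)
    return first[0], second[0]


winners = asyncio.run(demo())
```

In the demo, the first request escalates from `basic` to `tls`; the second request on the same domain starts with `tls` because the cache recorded its earlier success.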
Development
Prerequisites
Setup
git clone git@github.com:n24q02m/web-core.git
cd web-core
uv sync --all-extras
pre-commit install
Commands
# Via mise
mise run setup # uv sync --all-extras
mise run lint # ruff check + ruff format --check
mise run test # pytest with coverage
mise run fix # auto-fix lint + format
# Direct
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
uv run ty check src/
uv run pytest --cov -q
Tests
- asyncio_mode = "auto", so no @pytest.mark.asyncio marker is needed
- Coverage threshold: 95% (enforced in pyproject.toml)
- Test files mirror source module structure under tests/
License