Shared web infrastructure: search, scraping, HTTP security, browsers
web-core
Shared web infrastructure package for search, scraping, HTTP security, and stealth browsers. Used by knowledge-core and downstream applications.
Installation
This is a private package installed via git+ssh (not published to PyPI):
# Pin to a specific version tag
uv add git+ssh://git@github.com/n24q02m/web-core.git@v0.1.0
# Or latest main
uv add git+ssh://git@github.com/n24q02m/web-core.git
Quick Usage
SearXNG Search
from web_core.search import ensure_searxng, shutdown_searxng
from web_core.search.client import search
# Start/reuse a SearXNG instance (cross-process singleton)
url = await ensure_searxng()
# Search with retry, deduplication, and domain filtering
results = await search(
    searxng_url=url,
    query="Python async patterns",
    max_results=10,
    include_domains=["docs.python.org"],
)
for r in results:
    print(f"{r.title}: {r.url}")
# Clean shutdown
await shutdown_searxng()
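The "cross-process singleton" idea can be illustrated with a lock file that only one process manages to create. This is a minimal sketch, not web-core's runner: the names `try_acquire` and `release` and the lock location are hypothetical, the real manager may use a different locking primitive, and stale-lock recovery is left out.

```python
import os
import tempfile
from pathlib import Path


def try_acquire(lock_path: Path) -> bool:
    """Atomically create the lock file; return False if it already exists."""
    try:
        # O_CREAT | O_EXCL fails if the file exists, so only one caller wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def release(lock_path: Path) -> None:
    """Remove the lock file so another process can take over."""
    lock_path.unlink(missing_ok=True)


lock = Path(tempfile.mkdtemp()) / "searxng.lock"  # hypothetical lock location
first = try_acquire(lock)   # this caller would start the instance
second = try_acquire(lock)  # a later caller would reuse the running instance
release(lock)
third = try_acquire(lock)   # the lock is free again after release
release(lock)
print(first, second, third)
```

A caller that fails to acquire the lock would connect to the instance that is already running instead of starting a second one.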
Multi-Strategy Scraping
from web_core.scraper import ScrapingAgent, StrategyRegistry
# Create registry with all available strategies
registry = StrategyRegistry.create_default()
agent = ScrapingAgent(strategies={
    name: registry.get(name)
    for name in registry.list_strategies()
})
# Scrape with automatic strategy escalation
content = await agent.scrape("https://example.com/article")
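To make "automatic strategy escalation" concrete, here is a small sketch of the general pattern: try strategies in order, validate each response, and move on when one fails. It is not web-core's agent (described below as a LangGraph state machine); `scrape_with_escalation`, the `Strategy` type, the three fake strategies, and the non-empty-body check are all hypothetical.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical strategy type: takes a URL, returns page text or raises.
Strategy = Callable[[str], Awaitable[str]]


async def scrape_with_escalation(
    url: str, strategies: dict[str, Strategy]
) -> tuple[str, str]:
    """Return (strategy_name, content) from the first strategy that passes validation."""
    errors: list[str] = []
    for name, fetch in strategies.items():
        try:
            content = await fetch(url)
        except Exception as exc:  # a real agent would catch narrower errors
            errors.append(f"{name}: {exc}")
            continue
        if content.strip():  # illustrative validation: body must be non-empty
            return name, content
        errors.append(f"{name}: empty response")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


async def _blocked(url: str) -> str:
    raise ConnectionError("403 Forbidden")


async def _empty(url: str) -> str:
    return "   "


async def _rendered(url: str) -> str:
    return "<html>article body</html>"


result = asyncio.run(
    scrape_with_escalation(
        "https://example.com/article",
        {"basic_http": _blocked, "tls_spoof": _empty, "headless": _rendered},
    )
)
print(result)
```

In this run the first strategy raises, the second returns a blank body, and the third succeeds, so the result names the third strategy.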
SSRF-Safe HTTP Client
from web_core.http import safe_httpx_client, is_safe_url
# Validate URL before use
assert is_safe_url("https://example.com") # True
assert not is_safe_url("http://localhost") # False (SSRF blocked)
# Create client with automatic SSRF protection + DNS pinning
async with safe_httpx_client() as client:
    resp = await client.get("https://example.com")
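The address check behind SSRF protection can be sketched with the standard library alone. The function `is_public_ip` below is a hypothetical illustration, not web-core's validator; a DNS-pinning client would typically apply such a check to every resolved address and then connect to that exact address, so a second DNS lookup cannot swap in an internal one.

```python
import ipaddress


def is_public_ip(address: str) -> bool:
    """Return True only if the address is not in an internal or special range."""
    ip = ipaddress.ip_address(address)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


for candidate in ["8.8.8.8", "127.0.0.1", "10.0.0.5", "169.254.169.254", "::1"]:
    print(candidate, is_public_ip(candidate))
```

Only the first address in the loop passes; loopback, private, and link-local addresses (including the common cloud metadata address) are rejected.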
URL Utilities
from web_core.http import normalize_url, strip_tracking_params, is_valid_domain
# Normalize for deduplication (lowercase, strip www/tracking/fragment)
normalize_url("https://WWW.Example.COM/page?utm_source=x#section")
# => "https://example.com/page"
# Validate domain names (prevents search operator injection)
is_valid_domain("example.com") # True
is_valid_domain("localhost") # False
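For readers curious how this kind of normalization can work, the sketch below reproduces the documented example with urllib.parse. It is an assumption-laden illustration, not web-core's code: `normalize_url_sketch` and the tracking-parameter lists are hypothetical, and edge cases such as userinfo and IPv6 hosts are ignored.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative subset; the real list of tracking parameters is not given here.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url_sketch(url: str) -> str:
    """Lowercase scheme and host, drop a leading www., tracking params, and the fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    # An empty final component discards the fragment.
    return urlunsplit((parts.scheme.lower(), host, parts.path, urlencode(query), ""))


print(normalize_url_sketch("https://WWW.Example.COM/page?utm_source=x#section"))
print(normalize_url_sketch("https://example.com/a?id=7&utm_medium=email"))
```

Two URLs that differ only in case, a www. prefix, tracking parameters, or fragment map to the same string, which is what makes the result usable as a deduplication key.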
Architecture
src/web_core/
  __init__.py          -- Public API re-exports
  py.typed             -- PEP 561 marker (package ships type information)
  http/                -- Layer 1: SSRF-safe HTTP primitives
    client.py          -- safe_httpx_client, DNS pinning, IP validation
    url.py             -- normalize_url, strip_tracking_params, is_valid_domain
  search/              -- Layer 2: SearXNG search engine
    client.py          -- search() with retry, dedup, domain filtering
    models.py          -- SearchResult, SearchError dataclasses
    runner.py          -- Cross-process SearXNG singleton manager
  scraper/             -- Layer 2: Multi-strategy scraping agent
    agent.py           -- ScrapingAgent (LangGraph state machine)
    base.py            -- BaseStrategy ABC, ScrapingResult
    cache.py           -- StrategyCache (per-domain performance tracking)
    registry.py        -- StrategyRegistry with lazy-loaded strategies
    state.py           -- ScrapingState TypedDict, ScrapingError
    strategies/        -- Concrete strategy implementations
      api_direct.py    -- API endpoint detection and direct fetch
      basic_http.py    -- Simple httpx GET with SSRF protection
      captcha.py       -- CapSolver-backed captcha bypass
      headless.py      -- Crawl4AI headless browser rendering
      tls_spoof.py     -- curl_cffi TLS fingerprint spoofing
  browsers/            -- Layer 2: Stealth browser abstraction
    protocol.py        -- BrowserProvider Protocol (structural typing)
    patchright.py      -- Patchright (undetected Playwright) provider
Key Design Decisions
- SSRF protection: All outbound HTTP goes through safe_httpx_client with DNS pinning to prevent DNS rebinding attacks.
- Strategy escalation: The scraping agent tries strategies in cache-recommended order, validates responses, and automatically escalates on failure.
- Cross-process SearXNG: A file-lock singleton ensures exactly one SearXNG instance runs across all Python processes.
- Structural typing: BrowserProvider uses Protocol so implementations don't need inheritance.
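The structural-typing point can be shown with typing.Protocol from the standard library. The sketch below is hypothetical: `BrowserProviderSketch` and its single `fetch_html` method are stand-ins, since the real BrowserProvider's methods are not listed in this README.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class BrowserProviderSketch(Protocol):
    """Hypothetical stand-in for a browser provider interface."""

    async def fetch_html(self, url: str) -> str: ...


class FakeBrowser:
    """Satisfies the protocol by shape alone; it does not inherit from it."""

    async def fetch_html(self, url: str) -> str:
        return f"<html>{url}</html>"


class NotABrowser:
    """Has no fetch_html method, so it does not match the protocol."""


# runtime_checkable isinstance only checks that the method exists,
# not its signature; static type checkers verify the full shape.
ok = isinstance(FakeBrowser(), BrowserProviderSketch)
bad = isinstance(NotABrowser(), BrowserProviderSketch)
print(ok, bad)
```

Because matching is by shape, a new provider can live in a separate package without importing the protocol class at all.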
Development
Prerequisites
Setup
git clone git@github.com:n24q02m/web-core.git
cd web-core
uv sync --all-extras
pre-commit install
Commands
# Via mise
mise run setup # uv sync --all-extras
mise run lint # ruff check + ruff format --check
mise run test # pytest with coverage
mise run fix # auto-fix lint + format
# Direct
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
uv run ty check src/
uv run pytest --cov -q
Tests
- asyncio_mode = "auto" -- no @pytest.mark.asyncio needed
- Coverage threshold: 95% (enforced in pyproject.toml)
- Test files mirror source module structure under tests/
License