DomVault extraction runtime with session identity, residential proxy orchestration, challenge auto-solving, and warm session caching.
DomVault 1.0.0
DomVault is a production-grade extraction runtime for protected, modern websites. It captures page evidence, preserves session identity, routes through residential proxy leases, solves supported challenge families in-place, and emits a structured manifest for downstream reconstruction workflows.
This package is the extraction boundary only. It does not ship discovery, ranking, vault curation, or code-generation systems. Its job is to get into the page cleanly, extract high-fidelity artifacts, and explain exactly how it did so.
Unified Core Advantage
DomVault ships as a Unified Core runtime: one install with browser-capable anti-bot extraction ready out of the box.
What this means:
- Browser support is not split into optional extras for Playwright, Camoufox, Patchright, or Scrapling.
- Protected-target posture is available immediately after install.
- Escalation between HTTP and browser transports is policy-gated and fully observable.
Competitive position:
- DomVault intentionally accepts a heavier runtime footprint to maximize first-run success on protected targets.
- Tuning happens through runtime policy knobs, not through post-install dependency surgery.
Architecture
DomVault 1.0.0 is packaged as a standard src/domvault Python library with deterministic runtime contracts.
Key architectural points:
- Canonical Layout: src/domvault is the packaging and import root.
- Runtime Discipline: Browser backends start only when selected by routing or explicit fallback policy.
- Dependency Hygiene: Runtime dependencies stay in core packaging; tooling stays in dev channels.
- Python 3.12+ Compatibility: Metadata and typing are aligned for modern Python versions.
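The runtime-discipline point above (backends start only when routing or fallback policy selects them, and every escalation is observable) can be illustrated with a minimal sketch. Everything here is hypothetical: EscalationPolicy, EscalationTrace, capture_with_escalation, and the backend order are illustrative names, not DomVault internals.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# A fetcher takes a URL and returns HTML, or raises RuntimeError when blocked.
Fetcher = Callable[[str], str]


@dataclass
class EscalationPolicy:
    """Hypothetical policy object; field names are illustrative only."""
    allow_browser_fallback: bool = True
    backend_order: tuple = ("http", "camoufox", "patchright")


@dataclass
class EscalationTrace:
    """Collects (backend, outcome) pairs so every decision stays observable."""
    attempts: list = field(default_factory=list)


def capture_with_escalation(
    url: str,
    backends: Dict[str, Fetcher],
    policy: EscalationPolicy,
    trace: EscalationTrace,
) -> Optional[str]:
    """Try backends in policy order; a backend is only called when selected."""
    for name in policy.backend_order:
        if name != "http" and not policy.allow_browser_fallback:
            trace.attempts.append((name, "skipped_by_policy"))
            continue
        fetch = backends.get(name)
        if fetch is None:
            trace.attempts.append((name, "not_configured"))
            continue
        try:
            html = fetch(url)
        except RuntimeError as exc:
            trace.attempts.append((name, f"failed: {exc}"))
            continue
        trace.attempts.append((name, "ok"))
        return html
    return None
```

The trace is the point of the sketch: a caller can see which backends were skipped by policy and which were tried and failed, rather than only the final result.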
DomVault Features
Session Identity
Each run is bound to a typed SessionIdentity that carries:
- transport profile
- browser profile
- locale and timezone
- viewport
- cookie jar
- solver history
- proxy lease affinity
DomVault keeps browser and HTTP transport aligned to the same logical identity unless policy explicitly rotates it.
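As a rough illustration of what a typed identity with these fields might look like, here is a sketch using a frozen dataclass. IdentitySketch and rotate are hypothetical stand-ins; the real SessionIdentity may use different field names, types, and rotation semantics.

```python
from dataclasses import dataclass, field, replace
from typing import Optional, Tuple
from uuid import uuid4


@dataclass(frozen=True)
class IdentitySketch:
    """Hypothetical stand-in for a typed session identity (not the real class)."""
    transport_profile: str
    browser_profile: str
    locale: str = "en-US"
    timezone: str = "UTC"
    viewport: Tuple[int, int] = (1280, 800)
    cookies: Tuple[Tuple[str, str], ...] = ()
    solver_history: Tuple[str, ...] = ()
    proxy_lease_id: Optional[str] = None
    identity_id: str = field(default_factory=lambda: uuid4().hex)


def rotate(identity: IdentitySketch) -> IdentitySketch:
    """Explicit rotation: new id, cleared cookies, solver history, and lease.

    Which fields a rotation resets is an assumption made for this sketch.
    """
    return replace(
        identity,
        identity_id=uuid4().hex,
        cookies=(),
        solver_history=(),
        proxy_lease_id=None,
    )
```

Because the dataclass is frozen, both transports can hold the same object without one of them mutating it; a new identity only appears when rotate is called deliberately.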
Residential Proxy Orchestrator
Protected-mode traffic is routed through a leased proxy identity with:
- sticky session semantics
- leak-control policy enforcement
- DNS-via-proxy preference
- WebRTC blocking for browser paths
- lease scoring, challenge penalties, and ejection thresholds
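Lease scoring with challenge penalties and an ejection threshold could work roughly as sketched below. The constants, the LeaseScore class, and the record_outcome function are assumptions chosen for illustration; DomVault's actual scoring model is not documented here.

```python
from dataclasses import dataclass

# Assumed values for the sketch, not DomVault defaults.
SUCCESS_REWARD = 0.05
CHALLENGE_PENALTY = 0.25
EJECTION_THRESHOLD = 0.4


@dataclass
class LeaseScore:
    """Hypothetical per-lease health record."""
    lease_id: str
    score: float = 1.0
    ejected: bool = False


def record_outcome(lease: LeaseScore, challenged: bool) -> LeaseScore:
    """Penalize a lease that triggered a challenge, reward a clean request."""
    if challenged:
        lease.score -= CHALLENGE_PENALTY
    else:
        lease.score = min(1.0, lease.score + SUCCESS_REWARD)
    # Once the score drops below the threshold the lease is taken out of rotation.
    if lease.score < EJECTION_THRESHOLD:
        lease.ejected = True
    return lease
```

With these assumed numbers, a fresh lease survives two challenged requests and is ejected on the third.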
Challenge Auto-Solving
DomVault detects and routes supported challenge families with explicit policy:
- Cloudflare Turnstile: solve in place when configured
- DataDome: solve in place and reapply session cookies
- Kasada: rotate identity instead of pretending a token flow is safe
Solver providers are environment-driven and provider-ordered. The runtime records which provider was attempted, which one succeeded, and when identity rotation was required instead.
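The routing policy and provider ordering described above can be sketched as a small dispatcher. The ROUTING table mirrors the three families listed; route_challenge, the record layout, and the "unresolved" outcome when every provider fails are assumptions, and the solver callables stand in for real provider clients.

```python
from typing import Callable, Dict, List, Optional

# Family -> action, mirroring the policy listed above.
ROUTING = {"turnstile": "solve", "datadome": "solve", "kasada": "rotate"}

# A solver takes a challenge family and returns a token, or None on failure.
Solver = Callable[[str], Optional[str]]


def route_challenge(
    family: str, providers: Dict[str, Solver], order: List[str]
) -> dict:
    """Return a record of what was attempted and what the outcome was."""
    record = {"family": family, "attempted": [], "succeeded": None, "action": None}
    action = ROUTING.get(family)
    if action == "rotate":
        record["action"] = "rotate"  # no solver is called for this family
        return record
    if action != "solve":
        record["action"] = "unsupported"
        return record
    for name in order:
        solver = providers.get(name)
        if solver is None:
            continue  # provider listed in the order but not configured
        record["attempted"].append(name)
        if solver(family) is not None:
            record["succeeded"] = name
            record["action"] = "solved"
            return record
    record["action"] = "unresolved"  # assumption: caller decides what happens next
    return record
```

The returned record carries the same three facts the runtime is said to log: which providers were attempted, which one succeeded, and whether rotation was chosen instead.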
Warm Session Caching
Successful sessions are stored domain-by-domain with:
- browser cookies
- local storage values
- proxy affinity hints
- prior challenge outcomes
- preferred backend/provider history
When a repeat domain is captured, DomVault attempts to restore a warm identity before starting cold. Poisoned or stale sessions are excluded automatically.
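The warm-before-cold lookup can be sketched as follows. WarmRecordSketch and restore_warm are hypothetical; the real WarmSessionRecord and SessionStore may store more fields and apply different staleness rules. The age limit corresponds loosely to a setting such as DOMVAULT_SESSION_STORE_MAX_AGE_HOURS.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WarmRecordSketch:
    """Hypothetical warm-session entry for one domain."""
    domain: str
    cookies: Dict[str, str]
    saved_at: float  # Unix timestamp of the successful session
    poisoned: bool = False


def restore_warm(
    store: Dict[str, WarmRecordSketch],
    domain: str,
    max_age_hours: float,
    now: Optional[float] = None,
) -> Optional[WarmRecordSketch]:
    """Return a reusable record, or None to signal a cold start."""
    now = time.time() if now is None else now
    record = store.get(domain)
    if record is None or record.poisoned:
        return None  # never reuse a missing or poisoned session
    if now - record.saved_at > max_age_hours * 3600:
        return None  # stale: older than the configured maximum age
    return record
```

Returning None for every exclusion case keeps the caller simple: anything other than a record means start cold.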
Explainable Extraction Provenance
Every capture emits provenance describing:
- transport backend used
- session identity id
- transport profile id
- proxy lease behavior
- challenge routing decisions
- solver provider results
- fallback and degradation reasons
The output is intended to be auditable, not magical.
Installation
Install the runtime package:
pip install .
For development:
pip install -e .
Runtime Tuning Knobs
Unified Core keeps browser capabilities installed by default. You can tune runtime cost and behavior with policy controls:
- DOMVAULT_ENABLE_CAMOUFOX_FALLBACK: enable or disable Camoufox fallback attempts.
- DOMVAULT_ENABLE_PATCHRIGHT_FALLBACK: enable or disable Patchright compatibility fallback.
- DOMVAULT_CAMOUFOX_TIMEOUT_MS and DOMVAULT_PATCHRIGHT_TIMEOUT_MS: bound browser escalation cost.
- DOMVAULT_BACKEND_CAMOUFOX_HEADLESS and DOMVAULT_BACKEND_CAMOUFOX_VIRTUAL_DISPLAY: control execution mode.
- DOMVAULT_TRANSPORT_PROXY_PROTECTED_MODE and related proxy settings: harden protected-target routing.
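Knobs like these are typically read as booleans and integers from the process environment. The helpers below are a generic sketch of that pattern; env_bool and env_int are hypothetical names, and the accepted boolean spellings and the fallback defaults are assumptions rather than DomVault's documented parsing rules.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings; the runtime may accept a different set.
_TRUTHY = {"1", "true", "yes", "on"}


def env_bool(name: str, default: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag, falling back to default when the variable is unset."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


def env_int(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer setting; raises ValueError on a non-numeric value."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    return int(raw)
```

For example, env_bool("DOMVAULT_ENABLE_CAMOUFOX_FALLBACK", True) would report whether the Camoufox fallback is switched on under these assumed parsing rules.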
Public API
DomVault exposes a stable top-level API. Consumers should not need to import from internal modules for normal usage.
from domvault import CaptureResult, SessionIdentity, extract

result: CaptureResult = extract(
    "https://example.com",
    selector="main",
    output_dir="_scraped_raw/example",
)
print(result.capture_status)
print(result.manifest_path)
print(result.target_profile)
Async usage:
import asyncio
from domvault import extract_async

async def main():
    result = await extract_async(
        "https://example.com",
        selector="main",
        output_dir="_scraped_raw/example-async",
    )
    print(result.capture_status)

asyncio.run(main())
Primary public exports:
- extract
- extract_async
- CaptureResult
- DomVaultManifest
- SessionIdentity
- SessionStore
- WarmSessionRecord
- RuntimeConfig
- ProxyOrchestrator
Output Model
The runtime writes a manifest and artifact bundle under the selected output directory. Typical outputs include:
- manifest.json
- structured-extraction.json
- page HTML
- DOM snapshots
- computed styles
- hydration state
- shadow DOM coverage
- frame tree coverage
- anti-bot signals
- animation and token mapping artifacts
The CaptureResult returned by the API gives you the high-value runtime summary while
the manifest preserves the deeper artifact references.
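Since the manifest is written as JSON, downstream tooling can inspect it with the standard library alone. The sketch below makes no assumption about the manifest schema beyond it being a JSON object; load_manifest is a hypothetical helper, not part of the DomVault API.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def load_manifest(manifest_path: Union[str, Path]) -> Dict[str, Any]:
    """Load a manifest file and check that its top level is a JSON object."""
    path = Path(manifest_path)
    with path.open(encoding="utf-8") as fh:
        manifest = json.load(fh)
    if not isinstance(manifest, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return manifest
```

A typical use would be load_manifest(result.manifest_path) followed by sorted(manifest) to list the top-level keys before relying on any of them.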
Environment Configuration
DomVault is configured through environment variables. Important groups include:
Identity
- DOMVAULT_IDENTITY_STORAGE_ROOT
- DOMVAULT_IDENTITY_DEFAULT_LOCALE
- DOMVAULT_IDENTITY_DEFAULT_TIMEZONE
- DOMVAULT_IDENTITY_DEFAULT_ACCEPT_LANGUAGE
Transport And Proxy
- DOMVAULT_TRANSPORT_HTTP_IMPERSONATION
- DOMVAULT_TRANSPORT_PROXY_PROVIDER
- DOMVAULT_TRANSPORT_PROXY_URL
- DOMVAULT_TRANSPORT_PROXY_COUNTRY
- DOMVAULT_TRANSPORT_PROXY_REQUIRE_LEASE
- DOMVAULT_TRANSPORT_PROXY_BLOCK_WEBRTC
Challenge Solvers
- DOMVAULT_CHALLENGE_SOLVER_PROVIDER_ORDER
- DOMVAULT_CAPSOLVER_API_KEY
- DOMVAULT_2CAPTCHA_API_KEY
- DOMVAULT_CHALLENGE_SOLVER_TURNSTILE_ENABLED
- DOMVAULT_CHALLENGE_SOLVER_DATADOME_ENABLED
- DOMVAULT_CHALLENGE_SOLVER_KASADA_ENABLED
Warm Session Store
- DOMVAULT_SESSION_STORE_ROOT
- DOMVAULT_SESSION_STORE_ENABLED
- DOMVAULT_SESSION_STORE_MAX_AGE_HOURS
Operational Notes
- Protected captures are expected to run with a real proxy strategy.
- Solver credentials must be injected through environment variables.
- Warm cache reuse is domain-scoped and identity-scoped.
- Unresolved or poisoned sessions are not silently reused.
- The package is strict-typed and validated with mypy, ruff, and pytest.
Crawl4AI Worker
The isolated Crawl4AI worker is intentionally not bundled into the main runtime
environment because of dependency constraints around lxml. If you need the offline
worker, install requirements-crawl4ai.txt into a separate virtual environment and
set DOMVAULT_CRAWL4AI_PYTHON to that interpreter path.
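Before relying on the separate environment, it can help to confirm that DOMVAULT_CRAWL4AI_PYTHON points at a working interpreter. The check below is a sketch using only the standard library; resolve_worker_python and worker_python_version are hypothetical helpers, and how the runtime actually launches the worker is not shown here.

```python
import os
import subprocess
from typing import Mapping, Optional


def resolve_worker_python(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured interpreter path, or raise if it is unusable."""
    env = os.environ if env is None else env
    path = env.get("DOMVAULT_CRAWL4AI_PYTHON")
    if not path:
        raise RuntimeError("DOMVAULT_CRAWL4AI_PYTHON is not set")
    if not os.path.isfile(path) or not os.access(path, os.X_OK):
        raise RuntimeError(f"{path} is not an executable file")
    return path


def worker_python_version(python_path: str) -> str:
    """Ask the interpreter for its major.minor version in a child process."""
    completed = subprocess.run(
        [python_path, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return completed.stdout.strip()
```

Running the version probe in a child process keeps the check isolated from the main runtime environment, which is the reason the worker lives in its own virtual environment in the first place.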
CLI
The package also exposes a CLI entrypoint:
domvault clone https://example.com --selector main --output _scraped_raw/example
The CLI is a thin wrapper around the same extraction pipeline used by the Python API.
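To make the "thin wrapper" idea concrete, here is a sketch of how a command shaped like the one above could map onto the extract signature with argparse. It is not DomVault's actual CLI code: build_parser and run are hypothetical, only the clone subcommand and the --selector and --output options come from the example above, and their defaults are assumptions.

```python
import argparse
from typing import Any, Callable, List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts: domvault clone URL [--selector S] [--output DIR]."""
    parser = argparse.ArgumentParser(prog="domvault")
    subcommands = parser.add_subparsers(dest="command", required=True)
    clone = subcommands.add_parser("clone", help="capture a page")
    clone.add_argument("url")
    clone.add_argument("--selector", default=None)  # default is an assumption
    clone.add_argument("--output", default=None)    # default is an assumption
    return parser


def run(argv: List[str], extract_fn: Callable[..., Any]) -> Any:
    """Parse argv and forward the values to an extract-style callable."""
    args = build_parser().parse_args(argv)
    return extract_fn(args.url, selector=args.selector, output_dir=args.output)
```

Passing the extraction callable in as a parameter keeps the wrapper free of its own logic, so the CLI and the Python API cannot drift apart in behavior.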
Release Standard
DomVault 1.0.0 is packaged with:
- a typed public API
- exact dependency pins
- manifest-first extraction outputs
- identity-aware challenge handling
- warm-session persistence for repeat domains
This package is meant for deterministic, explainable extraction under real production pressure, not just best-effort scraping.
File details
Details for the file domvault-1.0.1.tar.gz.
File metadata
- Download URL: domvault-1.0.1.tar.gz
- Upload date:
- Size: 136.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 12fbc7a6b7bec4f3aa155d74f8f612f44a234dcf6ea8d5087bde08cd4110585d |
| MD5 | 10f2ab6010258a11aca25dc3c77e91cc |
| BLAKE2b-256 | 1231c069e12c9a1c44d7ea642cb481e0685a054fbcb870f8db1b34578e4d053a |
File details
Details for the file domvault-1.0.1-py3-none-any.whl.
File metadata
- Download URL: domvault-1.0.1-py3-none-any.whl
- Upload date:
- Size: 164.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8ca2c82ca5b986fe641c248344ae7fbb79b7f0a3eee6c3fc96ff7122222232a1 |
| MD5 | 4390f13647b5feb74b9fd3a08f3efe41 |
| BLAKE2b-256 | 65327cb82f538535fcfa64d5342c04716e69cc5bd9e3f53c60ee0fc58e16099d |
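A downloaded distribution file can be checked against the published SHA256 digest with the standard library. The helpers below are a generic sketch; sha256_of and verify_download are illustrative names, and the local file path is whatever you saved the download as.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Union[str, Path], expected_hex: str) -> bool:
    """Compare the file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, verify_download("domvault-1.0.1.tar.gz", "12fbc7a6b7bec4f3aa155d74f8f612f44a234dcf6ea8d5087bde08cd4110585d") should return True for an intact copy of the source distribution.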