DomVault extraction runtime with session identity, residential proxy orchestration, challenge auto-solving, and warm session caching.

Project description

DomVault 1.0.0

DomVault is a production-grade extraction runtime for protected, modern websites. It captures page evidence, preserves session identity, routes through residential proxy leases, solves supported challenge families in-place, and emits a structured manifest for downstream reconstruction workflows.

This package is the extraction boundary only. It does not ship discovery, ranking, vault curation, or code-generation systems. Its job is to get into the page cleanly, extract high-fidelity artifacts, and explain exactly how it did so.

Unified Core Advantage

DomVault ships as a Unified Core runtime: one install with browser-capable anti-bot extraction ready out of the box.

What this means:

  • No separate browser extras to install for Playwright, Camoufox, Patchright, or Scrapling.
  • Protected-target posture is available immediately after install.
  • Escalation between HTTP and browser transports is policy-gated and fully observable.

Competitive position:

  • DomVault intentionally accepts a heavier runtime footprint to maximize first-run success on protected targets.
  • Tuning happens through runtime policy knobs, not through post-install dependency surgery.

Architecture

DomVault 1.0.0 is packaged as a standard src/domvault Python library with deterministic runtime contracts.

Key architectural points:

  • Canonical Layout: src/domvault is the packaging and import root.
  • Runtime Discipline: Browser backends start only when selected by routing or explicit fallback policy.
  • Dependency Hygiene: Runtime dependencies stay in core packaging; tooling stays in dev channels.
  • Python 3.12+ Compatibility: Package metadata and type annotations target Python 3.12 and later.

DomVault Features

Session Identity

Each run is bound to a typed SessionIdentity that carries:

  • transport profile
  • browser profile
  • locale and timezone
  • viewport
  • cookie jar
  • solver history
  • proxy lease affinity

DomVault keeps browser and HTTP transport aligned to the same logical identity unless policy explicitly rotates it.

Residential Proxy Orchestrator

Protected-mode traffic is routed through a leased proxy identity with:

  • sticky session semantics
  • leak-control policy enforcement
  • DNS-via-proxy preference
  • WebRTC blocking for browser paths
  • lease scoring, challenge penalties, and ejection thresholds
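
To make lease scoring, challenge penalties, and ejection thresholds concrete, here is a minimal sketch under assumed numbers. LeaseSketch, the penalty and reward values, and the threshold are all invented for illustration; DomVault's actual scoring model is not documented here.

```python
from dataclasses import dataclass

# Assumed tuning values, chosen only to make the arithmetic easy to follow.
CHALLENGE_PENALTY = 0.25
SUCCESS_REWARD = 0.05
EJECTION_THRESHOLD = 0.4


@dataclass
class LeaseSketch:
    """Hypothetical proxy lease with a health score between 0.0 and 1.0."""

    lease_id: str
    score: float = 1.0
    ejected: bool = False

    def record(self, challenged: bool) -> None:
        """Lower the score on a challenge, raise it slightly on a clean request."""
        if challenged:
            self.score = max(0.0, self.score - CHALLENGE_PENALTY)
        else:
            self.score = min(1.0, self.score + SUCCESS_REWARD)
        # Once a lease falls below the threshold it stays ejected.
        if self.score < EJECTION_THRESHOLD:
            self.ejected = True


lease = LeaseSketch("lease-1")
for challenged in (True, False, True, True):
    lease.record(challenged)
```

With these assumed values the score moves 1.0, 0.75, 0.80, 0.55, then roughly 0.30, at which point the lease crosses the threshold and is ejected.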

Challenge Auto-Solving

DomVault detects and routes supported challenge families with explicit policy:

  • Cloudflare Turnstile: solve in place when configured
  • DataDome: solve in place and reapply session cookies
  • Kasada: rotate identity instead of pretending a token flow is safe

Solver providers are environment-driven and provider-ordered. The runtime records which provider was attempted, which one succeeded, and when identity rotation was required instead.
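
The routing policy above can be sketched as a small lookup table. This is an illustration, not DomVault's implementation: the Action enum, the route function, the comma-separated format assumed for DOMVAULT_CHALLENGE_SOLVER_PROVIDER_ORDER, and the choice to rotate identity for unknown families or when no provider is configured are all assumptions.

```python
import os
from enum import Enum
from typing import Mapping


class Action(Enum):
    SOLVE_IN_PLACE = "solve_in_place"
    SOLVE_AND_REAPPLY_COOKIES = "solve_and_reapply_cookies"
    ROTATE_IDENTITY = "rotate_identity"


# Mirrors the three documented challenge families.
ROUTES = {
    "turnstile": Action.SOLVE_IN_PLACE,
    "datadome": Action.SOLVE_AND_REAPPLY_COOKIES,
    "kasada": Action.ROTATE_IDENTITY,
}


def provider_order(env: Mapping[str, str] = os.environ) -> list[str]:
    """Parse the provider order, assuming a comma-separated value."""
    raw = env.get("DOMVAULT_CHALLENGE_SOLVER_PROVIDER_ORDER", "")
    return [name.strip().lower() for name in raw.split(",") if name.strip()]


def route(family: str, providers: list[str]) -> Action:
    """Pick an action; fall back to rotation when solving is not possible."""
    action = ROUTES.get(family.lower(), Action.ROTATE_IDENTITY)
    if action is not Action.ROTATE_IDENTITY and not providers:
        return Action.ROTATE_IDENTITY
    return action


providers = provider_order(
    {"DOMVAULT_CHALLENGE_SOLVER_PROVIDER_ORDER": "capsolver, 2captcha"}
)
```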

Warm Session Caching

Successful sessions are stored domain-by-domain with:

  • browser cookies
  • local storage values
  • proxy affinity hints
  • prior challenge outcomes
  • preferred backend/provider history

When a repeat domain is captured, DomVault attempts to restore a warm identity before starting cold. Poisoned or stale sessions are excluded automatically.
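
One way to picture the warm-restore decision is a filter over stored records that skips poisoned entries and anything older than a maximum age. The sketch below assumes a hypothetical WarmRecordSketch type and a pick_warm helper; it does not reflect the real WarmSessionRecord or SessionStore interfaces.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class WarmRecordSketch:
    """Hypothetical stored session; the real record carries more fields."""

    domain: str
    saved_at: datetime
    poisoned: bool = False


def pick_warm(records, domain, max_age_hours, now=None):
    """Return the newest usable record for a domain, or None to start cold."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    usable = [
        r for r in records
        if r.domain == domain and not r.poisoned and r.saved_at >= cutoff
    ]
    return max(usable, key=lambda r: r.saved_at, default=None)


# Fixed clock so the example is deterministic.
now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
records = [
    WarmRecordSketch("example.com", now - timedelta(hours=1)),
    WarmRecordSketch("example.com", now - timedelta(minutes=5), poisoned=True),
    WarmRecordSketch("example.com", now - timedelta(hours=48)),
    WarmRecordSketch("other.test", now - timedelta(minutes=1)),
]
chosen = pick_warm(records, "example.com", max_age_hours=24, now=now)
```

Here the newest record is poisoned and the oldest is stale, so the one-hour-old record is the only candidate left for example.com.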

Explainable Extraction Provenance

Every capture emits provenance describing:

  • transport backend used
  • session identity id
  • transport profile id
  • proxy lease behavior
  • challenge routing decisions
  • solver provider results
  • fallback and degradation reasons

The output is intended to be auditable, not magical.
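
If you audit captures downstream, a simple completeness check over the provenance record can be useful. The key names below are hypothetical, derived only from the list above; the actual provenance schema may use different names and nesting.

```python
# Hypothetical key names, one per documented provenance category.
REQUIRED_PROVENANCE_KEYS = (
    "transport_backend",
    "session_identity_id",
    "transport_profile_id",
    "proxy_lease",
    "challenge_routing",
    "solver_results",
    "fallback_reasons",
)


def missing_provenance(provenance: dict) -> list[str]:
    """List the expected provenance keys that a record does not contain."""
    return [key for key in REQUIRED_PROVENANCE_KEYS if key not in provenance]


# Illustrative record with two categories left out.
partial = {
    "transport_backend": "http",
    "session_identity_id": "abc123",
    "transport_profile_id": "chrome-desktop",
    "proxy_lease": {},
    "challenge_routing": [],
}
gaps = missing_provenance(partial)
```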

Installation

Install the runtime package from the root of a source checkout:

pip install .

For development, install in editable mode:

pip install -e .

Runtime Tuning Knobs

Unified Core keeps browser capabilities installed by default. You can tune runtime cost and behavior with policy controls:

  • DOMVAULT_ENABLE_CAMOUFOX_FALLBACK: enable or disable Camoufox fallback attempts.
  • DOMVAULT_ENABLE_PATCHRIGHT_FALLBACK: enable or disable Patchright compatibility fallback.
  • DOMVAULT_CAMOUFOX_TIMEOUT_MS and DOMVAULT_PATCHRIGHT_TIMEOUT_MS: bound browser escalation cost.
  • DOMVAULT_BACKEND_CAMOUFOX_HEADLESS and DOMVAULT_BACKEND_CAMOUFOX_VIRTUAL_DISPLAY: control execution mode.
  • DOMVAULT_TRANSPORT_PROXY_PROTECTED_MODE and related proxy settings: harden protected-target routing.
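
The sketch below shows how these knobs could gate an escalation ladder from HTTP to browser backends. It is an assumption-heavy illustration: the accepted boolean spellings, the default timeout of 30000 ms, the enabled-by-default behavior, and the Camoufox-before-Patchright ordering are not taken from DomVault itself.

```python
import os
from typing import Mapping

# Assumed spellings for "enabled"; the real parser may differ.
TRUTHY = {"1", "true", "yes", "on"}


def flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Read a boolean environment flag, falling back to a default when unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def escalation_ladder(env: Mapping[str, str] = os.environ) -> list[tuple]:
    """Return (backend, timeout_ms) steps in the order they would be tried."""
    ladder: list[tuple] = [("http", None)]
    if flag(env, "DOMVAULT_ENABLE_CAMOUFOX_FALLBACK", True):
        timeout = int(env.get("DOMVAULT_CAMOUFOX_TIMEOUT_MS", "30000"))
        ladder.append(("camoufox", timeout))
    if flag(env, "DOMVAULT_ENABLE_PATCHRIGHT_FALLBACK", True):
        timeout = int(env.get("DOMVAULT_PATCHRIGHT_TIMEOUT_MS", "30000"))
        ladder.append(("patchright", timeout))
    return ladder


ladder = escalation_ladder(
    {
        "DOMVAULT_ENABLE_PATCHRIGHT_FALLBACK": "false",
        "DOMVAULT_CAMOUFOX_TIMEOUT_MS": "15000",
    }
)
```

In this example Patchright is disabled and Camoufox is capped at 15 seconds, so the ladder has two steps: plain HTTP first, then Camoufox.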

Public API

DomVault exposes a stable top-level API. Consumers should not need to import from internal modules for normal usage.

from domvault import CaptureResult, extract

result: CaptureResult = extract(
    "https://example.com",
    selector="main",
    output_dir="_scraped_raw/example",
)

print(result.capture_status)
print(result.manifest_path)
print(result.target_profile)

Async usage:

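import asyncio

from domvault import extract_async


async def main():
    # await is only valid inside a coroutine, so wrap the call.
    return await extract_async(
        "https://example.com",
        selector="main",
        output_dir="_scraped_raw/example-async",
    )


result = asyncio.run(main())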

Primary public exports:

  • extract
  • extract_async
  • CaptureResult
  • DomVaultManifest
  • SessionIdentity
  • SessionStore
  • WarmSessionRecord
  • RuntimeConfig
  • ProxyOrchestrator

Output Model

The runtime writes a manifest and artifact bundle under the selected output directory. Typical outputs include:

  • manifest.json
  • structured-extraction.json
  • page HTML
  • DOM snapshots
  • computed styles
  • hydration state
  • shadow DOM coverage
  • frame tree coverage
  • anti-bot signals
  • animation and token mapping artifacts

The CaptureResult returned by the API carries the runtime summary, while the manifest preserves references to the full set of artifacts.
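
For downstream tooling, the bundle can be inspected with the standard library alone. The sketch below assumes that manifest.json sits at the top of the output directory and contains a JSON object; the helper names are hypothetical, and the demo writes a placeholder manifest to a temporary directory rather than reading a real capture.

```python
import json
import tempfile
from pathlib import Path


def load_manifest(output_dir: str) -> dict:
    """Load manifest.json from an output directory (location is assumed)."""
    path = Path(output_dir) / "manifest.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def list_artifacts(output_dir: str) -> list[str]:
    """Return every file under the output directory as a sorted relative path."""
    root = Path(output_dir)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Demo against a throwaway directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "manifest.json").write_text(
        json.dumps({"example_key": "example_value"}), encoding="utf-8"
    )
    manifest = load_manifest(tmp)
    artifacts = list_artifacts(tmp)
```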

Environment Configuration

DomVault is configured through environment variables. Important groups include:

Identity

  • DOMVAULT_IDENTITY_STORAGE_ROOT
  • DOMVAULT_IDENTITY_DEFAULT_LOCALE
  • DOMVAULT_IDENTITY_DEFAULT_TIMEZONE
  • DOMVAULT_IDENTITY_DEFAULT_ACCEPT_LANGUAGE

Transport And Proxy

  • DOMVAULT_TRANSPORT_HTTP_IMPERSONATION
  • DOMVAULT_TRANSPORT_PROXY_PROVIDER
  • DOMVAULT_TRANSPORT_PROXY_URL
  • DOMVAULT_TRANSPORT_PROXY_COUNTRY
  • DOMVAULT_TRANSPORT_PROXY_REQUIRE_LEASE
  • DOMVAULT_TRANSPORT_PROXY_BLOCK_WEBRTC

Challenge Solvers

  • DOMVAULT_CHALLENGE_SOLVER_PROVIDER_ORDER
  • DOMVAULT_CAPSOLVER_API_KEY
  • DOMVAULT_2CAPTCHA_API_KEY
  • DOMVAULT_CHALLENGE_SOLVER_TURNSTILE_ENABLED
  • DOMVAULT_CHALLENGE_SOLVER_DATADOME_ENABLED
  • DOMVAULT_CHALLENGE_SOLVER_KASADA_ENABLED

Warm Session Store

  • DOMVAULT_SESSION_STORE_ROOT
  • DOMVAULT_SESSION_STORE_ENABLED
  • DOMVAULT_SESSION_STORE_MAX_AGE_HOURS
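
When debugging a deployment it can help to dump the DomVault-related environment grouped by prefix, with credentials masked. The following is a standalone sketch, not part of the package: the grouping function and the "OTHER" bucket are hypothetical, and the sample values are placeholders rather than documented formats.

```python
import os
from typing import Mapping

# Prefix groups taken from the variable names listed above.
GROUPS = ("IDENTITY", "TRANSPORT", "CHALLENGE_SOLVER", "SESSION_STORE")


def grouped_settings(env: Mapping[str, str] = os.environ) -> dict[str, dict[str, str]]:
    """Group DOMVAULT_* variables by prefix and mask anything ending in _API_KEY."""
    groups: dict[str, dict[str, str]] = {}
    for name, value in env.items():
        if not name.startswith("DOMVAULT_"):
            continue
        suffix = name[len("DOMVAULT_"):]
        group = next((g for g in GROUPS if suffix.startswith(g + "_")), "OTHER")
        shown = "***" if name.endswith("_API_KEY") else value
        groups.setdefault(group, {})[name] = shown
    return groups


# Placeholder values for illustration only.
sample = {
    "DOMVAULT_SESSION_STORE_ENABLED": "1",
    "DOMVAULT_CAPSOLVER_API_KEY": "secret",
    "DOMVAULT_TRANSPORT_PROXY_COUNTRY": "US",
    "PATH": "/usr/bin",
}
summary = grouped_settings(sample)
```

Note that the solver API keys do not share the CHALLENGE_SOLVER prefix, so in this sketch they land in the "OTHER" bucket.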

Operational Notes

  • Protected captures are expected to run with a real proxy strategy.
  • Solver credentials must be injected through environment variables.
  • Warm cache reuse is domain-scoped and identity-scoped.
  • Unresolved or poisoned sessions are not silently reused.
  • The package is strictly typed and validated with mypy, ruff, and pytest.

Crawl4AI Worker

The isolated Crawl4AI worker is intentionally not bundled into the main runtime environment because of dependency constraints around lxml. If you need the offline worker, install requirements-crawl4ai.txt into a separate virtual environment and set DOMVAULT_CRAWL4AI_PYTHON to that interpreter path.
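
Before relying on the separate environment, it may be worth checking that the configured interpreter exists and can import the module you expect. The helpers below are a hypothetical pre-flight check built on the standard library; they are not part of DomVault, and the module name you pass in is up to you.

```python
import os
import subprocess
import sys
from typing import Mapping


def worker_python(env: Mapping[str, str] = os.environ) -> str:
    """Return the interpreter path from DOMVAULT_CRAWL4AI_PYTHON or raise."""
    path = env.get("DOMVAULT_CRAWL4AI_PYTHON")
    if not path:
        raise RuntimeError("DOMVAULT_CRAWL4AI_PYTHON is not set")
    return path


def can_import(python_path: str, module: str) -> bool:
    """Ask the given interpreter whether it can locate a top-level module."""
    code = (
        "import importlib.util, sys; "
        "sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)"
    )
    completed = subprocess.run([python_path, "-c", code, module])
    return completed.returncode == 0


# Demo: point the variable at the current interpreter and probe a stdlib module.
demo_python = worker_python({"DOMVAULT_CRAWL4AI_PYTHON": sys.executable})
has_json = can_import(demo_python, "json")
```

In a real setup you would call can_import with the worker interpreter path and the module installed from requirements-crawl4ai.txt.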

CLI

The package also exposes a CLI entrypoint:

domvault clone https://example.com --selector main --output _scraped_raw/example

The CLI is a thin wrapper around the same extraction pipeline used by the Python API.

Release Standard

DomVault 1.0.0 is packaged with:

  • a typed public API
  • exact dependency pins
  • manifest-first extraction outputs
  • identity-aware challenge handling
  • warm-session persistence for repeat domains

This package is meant for deterministic, explainable extraction under real production pressure, not just best-effort scraping.

Download files

Source Distribution

domvault-1.0.1.tar.gz (136.8 kB)

Uploaded Source

Built Distribution

domvault-1.0.1-py3-none-any.whl (164.9 kB)

Uploaded Python 3

File details

Details for the file domvault-1.0.1.tar.gz.

File metadata

  • Download URL: domvault-1.0.1.tar.gz
  • Upload date:
  • Size: 136.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for domvault-1.0.1.tar.gz
Algorithm Hash digest
SHA256 12fbc7a6b7bec4f3aa155d74f8f612f44a234dcf6ea8d5087bde08cd4110585d
MD5 10f2ab6010258a11aca25dc3c77e91cc
BLAKE2b-256 1231c069e12c9a1c44d7ea642cb481e0685a054fbcb870f8db1b34578e4d053a

File details

Details for the file domvault-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: domvault-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 164.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for domvault-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8ca2c82ca5b986fe641c248344ae7fbb79b7f0a3eee6c3fc96ff7122222232a1
MD5 4390f13647b5feb74b9fd3a08f3efe41
BLAKE2b-256 65327cb82f538535fcfa64d5342c04716e69cc5bd9e3f53c60ee0fc58e16099d
