DomVault extraction runtime with session identity, residential proxy orchestration, challenge auto-solving, and warm session caching.

DomVault 3.0

DomVault is a production-grade extraction runtime for protected, modern websites. It captures page evidence, preserves session identity, routes through residential proxy leases, solves supported challenge families in-place, and emits a structured manifest for downstream reconstruction workflows.

This package is the extraction boundary only. It does not ship discovery, ranking, vault curation, or code-generation systems. Its job is to get into the page cleanly, extract high-fidelity artifacts, and explain exactly how it did so.

Unified Core Advantage

DomVault ships as a Unified Core runtime: one install with browser-capable anti-bot extraction ready out of the box.

What this means:

  • No separate install extras for Playwright, Camoufox, Patchright, or Scrapling; the browser backends ship with the core install.
  • Protected-target posture is available immediately after install.
  • Escalation between HTTP and browser transports is policy-gated and fully observable.

Competitive position:

  • DomVault intentionally accepts a heavier runtime footprint to maximize first-run success on protected targets.
  • Tuning happens through runtime policy knobs, not through post-install dependency surgery.

Architecture

DomVault 3.0 is packaged as a standard Python library using the src layout (src/domvault), with deterministic runtime contracts.

Key architectural points:

  • Canonical Layout: src/domvault is the packaging and import root.
  • Runtime Discipline: Browser backends start only when selected by routing or explicit fallback policy.
  • Dependency Hygiene: Runtime dependencies stay in core packaging; tooling stays in dev channels.
  • Python 3.12+ Compatibility: Package metadata and type annotations target Python 3.12 and later.

DomVault Features

Session Identity

Each run is bound to a typed SessionIdentity that carries:

  • transport profile
  • browser profile
  • locale and timezone
  • viewport
  • cookie jar
  • solver history
  • proxy lease affinity

DomVault keeps browser and HTTP transport aligned to the same logical identity unless policy explicitly rotates it.

Residential Proxy Orchestrator

Protected-mode traffic is routed through a leased proxy identity with:

  • sticky session semantics
  • leak-control policy enforcement
  • DNS-via-proxy preference
  • WebRTC blocking for browser paths
  • lease scoring, challenge penalties, and ejection thresholds
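
One way to picture lease scoring with challenge penalties and an ejection threshold is a bounded health score per lease. The sketch below is an assumption about the general shape of such a mechanism; the class name LeaseScore and the reward, penalty, and threshold values are invented for illustration and do not come from DomVault.

```python
from dataclasses import dataclass


@dataclass
class LeaseScore:
    """Hypothetical health score for one proxy lease, clamped to [0.0, 1.0]."""

    lease_id: str
    score: float = 1.0

    def record_success(self, reward: float = 0.05) -> None:
        # A clean capture slowly restores trust in the lease.
        self.score = min(1.0, self.score + reward)

    def record_challenge(self, penalty: float = 0.3) -> None:
        # Hitting a challenge costs more than one success earns back.
        self.score = max(0.0, self.score - penalty)

    def should_eject(self, threshold: float = 0.5) -> bool:
        # Below the threshold the lease is dropped from rotation.
        return self.score < threshold


lease = LeaseScore("lease-a")
lease.record_challenge()
ejected_after_one = lease.should_eject()
lease.record_challenge()
ejected_after_two = lease.should_eject()
```

With these illustrative numbers a single challenge is tolerated, while two in a row push the lease under the threshold.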

Challenge Auto-Solving

DomVault detects and routes supported challenge families with explicit policy:

  • Cloudflare Turnstile: solve in place when configured
  • DataDome: solve in place and reapply session cookies
  • Kasada: rotate identity instead of pretending a token flow is safe

Solver providers are configured through environment variables and tried in the configured order (see DOMVAULT_CHALLENGE_SOLVER_PROVIDER_ORDER). The runtime records which provider was attempted, which one succeeded, and when identity rotation was required instead.
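
The per-family policy above can be pictured as a small routing table. The following is a minimal sketch, not DomVault's implementation: the Action enum, the ROUTES mapping, and the route_challenge function are hypothetical names, and falling back to identity rotation for disabled or unknown families is an assumption made for the example.

```python
from enum import Enum


class Action(Enum):
    SOLVE_IN_PLACE = "solve_in_place"
    SOLVE_AND_REAPPLY_COOKIES = "solve_and_reapply_cookies"
    ROTATE_IDENTITY = "rotate_identity"


# Mirrors the three documented families; keys are illustrative labels.
ROUTES = {
    "turnstile": Action.SOLVE_IN_PLACE,
    "datadome": Action.SOLVE_AND_REAPPLY_COOKIES,
    "kasada": Action.ROTATE_IDENTITY,
}


def route_challenge(family: str, enabled: set[str]) -> Action:
    """Pick an action for a detected challenge family.

    Assumption: disabled or unknown families rotate identity rather
    than attempting a solve.
    """
    key = family.lower()
    if key not in enabled:
        return Action.ROTATE_IDENTITY
    return ROUTES.get(key, Action.ROTATE_IDENTITY)


decision = route_challenge("DataDome", {"turnstile", "datadome"})
```

Making the decision an explicit enum value, rather than an implicit branch, is what lets it be written into provenance afterwards.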

Warm Session Caching

Successful sessions are stored domain-by-domain with:

  • browser cookies
  • local storage values
  • proxy affinity hints
  • prior challenge outcomes
  • preferred backend/provider history

When a repeat domain is captured, DomVault attempts to restore a warm identity before starting cold. Poisoned or stale sessions are excluded automatically.

Explainable Extraction Provenance

Every capture emits provenance describing:

  • transport backend used
  • session identity id
  • transport profile id
  • proxy lease behavior
  • challenge routing decisions
  • solver provider results
  • fallback and degradation reasons

The output is intended to be auditable, not magical.

Installation

Install the runtime package from the root of a source checkout:

pip install .

For development, use an editable install:

pip install -e .

Runtime Tuning Knobs

Unified Core keeps browser capabilities installed by default. You can tune runtime cost and behavior with policy controls:

  • DOMVAULT_ENABLE_CAMOUFOX_FALLBACK: enable or disable Camoufox fallback attempts.
  • DOMVAULT_ENABLE_PATCHRIGHT_FALLBACK: enable or disable Patchright compatibility fallback.
  • DOMVAULT_CAMOUFOX_TIMEOUT_MS and DOMVAULT_PATCHRIGHT_TIMEOUT_MS: bound browser escalation cost.
  • DOMVAULT_BACKEND_CAMOUFOX_HEADLESS and DOMVAULT_BACKEND_CAMOUFOX_VIRTUAL_DISPLAY: control execution mode.
  • DOMVAULT_TRANSPORT_PROXY_PROTECTED_MODE and related proxy settings: harden protected-target routing.
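
Because these knobs are environment variables, one convenient pattern is to scope them to a single run and restore the previous values afterwards. The helper below is a generic sketch, not part of DomVault; env_overrides is a hypothetical name, the values "false" and "15000" are assumed formats, and the approach only helps if the runtime reads its environment at call time, which is also an assumption.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def env_overrides(**values: str) -> Iterator[None]:
    """Temporarily set environment variables, restoring old values on exit."""
    previous = {name: os.environ.get(name) for name in values}
    os.environ.update(values)
    try:
        yield
    finally:
        for name, old in previous.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old


# Value formats below are assumptions, not documented DomVault syntax.
with env_overrides(
    DOMVAULT_ENABLE_PATCHRIGHT_FALLBACK="false",
    DOMVAULT_CAMOUFOX_TIMEOUT_MS="15000",
):
    timeout_inside = os.environ["DOMVAULT_CAMOUFOX_TIMEOUT_MS"]
```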

Public API

DomVault exposes a stable top-level API. Consumers should not need to import from internal modules for normal usage.

from domvault import CaptureResult, SessionIdentity, extract

result: CaptureResult = extract(
    "https://example.com",
    selector="main",
    output_dir="_scraped_raw/example",
)

print(result.capture_status)
print(result.manifest_path)
print(result.target_profile)

Async usage:

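import asyncio
from domvault import extract_async

async def main() -> None:
    # await is only valid inside a coroutine, so wrap the call
    result = await extract_async(
        "https://example.com",
        selector="main",
        output_dir="_scraped_raw/example-async",
    )
    print(result.capture_status)

asyncio.run(main())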

Primary public exports:

  • extract
  • extract_async
  • CaptureResult
  • DomVaultManifest
  • SessionIdentity
  • SessionStore
  • WarmSessionRecord
  • RuntimeConfig
  • ProxyOrchestrator

Output Model

The runtime writes a manifest and artifact bundle under the selected output directory. Typical outputs include:

  • manifest.json
  • structured-extraction.json
  • page HTML
  • DOM snapshots
  • computed styles
  • hydration state
  • shadow DOM coverage
  • frame tree coverage
  • anti-bot signals
  • animation and token mapping artifacts

The CaptureResult returned by the API gives you the high-value runtime summary while the manifest preserves the deeper artifact references.
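
Since the manifest is a JSON file on disk, it can be inspected with the standard library alone. The manifest schema is not described here, so the sketch below only lists top-level keys; the manifest_keys helper and the sample content written to the temporary file are made up for the example.

```python
import json
import tempfile
from pathlib import Path


def manifest_keys(manifest_path: Path) -> list[str]:
    """Return the sorted top-level keys of a manifest file."""
    data = json.loads(manifest_path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(data)


# Demonstration with a made-up manifest; real manifests will differ.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "manifest.json"
    sample.write_text(
        json.dumps({"capture_status": "ok", "artifacts": []}),
        encoding="utf-8",
    )
    keys = manifest_keys(sample)
```

In practice the path would come from result.manifest_path, as printed in the API example.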

Environment Configuration

DomVault is configured through environment variables. Important groups include:

Identity

  • DOMVAULT_IDENTITY_STORAGE_ROOT
  • DOMVAULT_IDENTITY_DEFAULT_LOCALE
  • DOMVAULT_IDENTITY_DEFAULT_TIMEZONE
  • DOMVAULT_IDENTITY_DEFAULT_ACCEPT_LANGUAGE

Transport And Proxy

  • DOMVAULT_TRANSPORT_HTTP_IMPERSONATION
  • DOMVAULT_TRANSPORT_PROXY_PROVIDER
  • DOMVAULT_TRANSPORT_PROXY_URL
  • DOMVAULT_TRANSPORT_PROXY_COUNTRY
  • DOMVAULT_TRANSPORT_PROXY_REQUIRE_LEASE
  • DOMVAULT_TRANSPORT_PROXY_BLOCK_WEBRTC

Challenge Solvers

  • DOMVAULT_CHALLENGE_SOLVER_PROVIDER_ORDER
  • DOMVAULT_CAPSOLVER_API_KEY
  • DOMVAULT_2CAPTCHA_API_KEY
  • DOMVAULT_CHALLENGE_SOLVER_TURNSTILE_ENABLED
  • DOMVAULT_CHALLENGE_SOLVER_DATADOME_ENABLED
  • DOMVAULT_CHALLENGE_SOLVER_KASADA_ENABLED
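
To make the provider-order setting concrete, the sketch below resolves which solver providers are usable from a mapping of environment values. It assumes the order is a comma-separated list and that the provider labels are "capsolver" and "2captcha"; neither the format nor the labels are confirmed here, and configured_providers is a hypothetical helper.

```python
import os
from collections.abc import Mapping

# Assumed mapping from provider label to its API-key variable.
PROVIDER_KEY_VARS = {
    "capsolver": "DOMVAULT_CAPSOLVER_API_KEY",
    "2captcha": "DOMVAULT_2CAPTCHA_API_KEY",
}


def configured_providers(env: Mapping[str, str]) -> list[str]:
    """Return providers in the requested order, skipping any without a key."""
    raw = env.get("DOMVAULT_CHALLENGE_SOLVER_PROVIDER_ORDER", "")
    order = [name.strip().lower() for name in raw.split(",") if name.strip()]
    return [
        name
        for name in order
        if name in PROVIDER_KEY_VARS and env.get(PROVIDER_KEY_VARS[name])
    ]


# Reads the real process environment; the result depends on what is set.
active = configured_providers(os.environ)
```

Passing the environment in as a mapping keeps the function easy to test with a plain dictionary.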

Warm Session Store

  • DOMVAULT_SESSION_STORE_ROOT
  • DOMVAULT_SESSION_STORE_ENABLED
  • DOMVAULT_SESSION_STORE_MAX_AGE_HOURS

Operational Notes

  • Protected captures are expected to run with a real proxy strategy.
  • Solver credentials must be injected through environment variables.
  • Warm cache reuse is domain-scoped and identity-scoped.
  • Unresolved or poisoned sessions are not silently reused.
  • The package is strictly typed and validated with mypy, ruff, and pytest.

Crawl4AI Worker

The isolated Crawl4AI worker is intentionally not bundled into the main runtime environment because of dependency constraints around lxml. If you need the offline worker, install requirements-crawl4ai.txt into a separate virtual environment and set DOMVAULT_CRAWL4AI_PYTHON to that interpreter path.

CLI

The package also exposes a CLI entrypoint:

domvault clone https://example.com --selector main --output _scraped_raw/example

The CLI is a thin wrapper around the same extraction pipeline used by the Python API.

Release Standard

DomVault 3.0 is packaged with:

  • a typed public API
  • exact dependency pins
  • manifest-first extraction outputs
  • identity-aware challenge handling
  • warm-session persistence for repeat domains

This package is meant for deterministic, explainable extraction under real production pressure, not just best-effort scraping.
