Safe, data-driven URL adapter engine for structured content extraction

Scraplet DSL

Scraplet DSL is a safe, data-driven adapter engine for resolving URLs into structured outputs without executing arbitrary code.

It is designed as a reusable package that can be embedded in other applications (news readers, crawlers, ETL tools, feed resolvers).

Features

  • URL trigger matching by domain and path regex
  • Deterministic step pipeline with strict schema validation
  • HTML extraction with CSS selectors (lxml + cssselect)
  • Regex extraction and replacement operations
  • JSON extraction with dot-path and array wildcard support
  • Date helper steps (datetime, parse_date)
  • Pluggable HTTP fetcher and HTML parser interfaces for testability
  • Lazy HTML parser construction: ScriptEngine() does not import lxml/cssselect until a select step actually runs

Step Types

  • fetch: GET URL into variables (save_body_as, optional save_status_as, optional save_final_url_as, optional retry, optional retry_backoff, optional timeout)
  • select: CSS select text/html/attribute from HTML; mode selects first (default) or all matches
  • regex: search/findall over variable content, with optional numeric or named capture-group selection and optional flags (i/m/s)
  • replace: regex replace in an input variable, with optional flags (i/m/s)
  • assign: assign literal or templated value
  • assert: fail if variable is missing/empty
  • set_url: rewrite input_url (and derived input_host/input_path/input_query) for subsequent steps
  • datetime: write current datetime with custom format
  • parse_date: parse human dates into normalized format, with optional month_names for locale-specific month names (bypasses dateparser)
  • json: parse JSON and extract by path (items[0].name, items[*].name); scalar values (int/float/bool) are preserved, and invalid array access on non-lists resolves to ""
  • output: produce final output map

URL rewriting

set_url and replace (when output = "input_url") keep the derived URL parts in sync. After rewriting input_url, the variables input_host, input_path, and input_query are re-derived from the new URL, so later steps and output templates see consistent values.
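As a rough illustration of how the derived variables relate to input_url, the sketch below splits a URL with the standard library. It is not the library's implementation: the helper name derive_url_parts is hypothetical, and whether input_host includes a port is an assumption (this sketch drops it).

```python
from urllib.parse import urlsplit


def derive_url_parts(input_url: str) -> dict[str, str]:
    """Hypothetical helper: recompute the derived variables from a URL."""
    parts = urlsplit(input_url)
    return {
        "input_url": input_url,
        "input_host": parts.hostname or "",  # assumption: host without port
        "input_path": parts.path,
        "input_query": parts.query,
    }


variables = derive_url_parts("https://example.com/news?c=1")
# After a rewrite, recomputing from the new URL keeps all four values consistent.
variables = derive_url_parts("https://example.com/news/rss?c=1")
print(variables["input_path"])  # /news/rss
```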

set_url is URL-aware:

  • A value containing ${input_url} is treated as path manipulation: the prefix/suffix around the reference is appended to the base path, the base query string is preserved, and the fragment is dropped. ${input_url}/rss against https://example.com/news?c=1 becomes https://example.com/news/rss?c=1. archive${input_url}/rss becomes https://example.com/archive/news/rss?c=1.
  • A value with a scheme (e.g. https://other.example/feed) is treated as an absolute override.
  • A relative value without ${input_url} (e.g. /feed) is joined onto the current input_url via urljoin, with the base's query and fragment stripped.
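The three rules above can be approximated with urllib.parse. This is a minimal sketch under stated assumptions, not the engine's code: rewrite_url and PLACEHOLDER are hypothetical names, and edge cases (such as a suffix without a leading slash) may be handled differently by the real step.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

PLACEHOLDER = "${input_url}"


def rewrite_url(value: str, input_url: str) -> str:
    """Hypothetical helper mirroring the three set_url rules described above."""
    base = urlsplit(input_url)

    if PLACEHOLDER in value:
        # Path manipulation: wrap the base path with the prefix and suffix,
        # keep the base query string, drop the fragment.
        prefix, _, suffix = value.partition(PLACEHOLDER)
        path = base.path
        if prefix:
            path = "/" + prefix.strip("/") + path
        if suffix:
            path = path.rstrip("/") + suffix
        return urlunsplit((base.scheme, base.netloc, path, base.query, ""))

    if urlsplit(value).scheme:
        # Absolute override: the value already names a full URL.
        return value

    # Relative value: join onto the base with its query and fragment removed.
    stripped = urlunsplit((base.scheme, base.netloc, base.path, "", ""))
    return urljoin(stripped, value)


base_url = "https://example.com/news?c=1"
print(rewrite_url("${input_url}/rss", base_url))         # https://example.com/news/rss?c=1
print(rewrite_url("archive${input_url}/rss", base_url))  # https://example.com/archive/news/rss?c=1
print(rewrite_url("/feed", base_url))                    # https://example.com/feed
```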

Regex safety

Adapter regexes are compiled with a conservative safety guard.

  • regex.pattern, replace.pattern, and url_triggers.path_patterns reject nested repeated subpatterns such as ^(a+)+$, which are a common source of catastrophic backtracking in Python's re engine. Possessive quantifiers (a++) and atomic groups ((?>...)) suppress backtracking and are accepted.
  • The guard is enforced during adapter load for bundle-defined adapters, and on first execution for step instances created directly in Python.
  • This is a heuristic safe-subset check, not a complete ReDoS analyser: it targets the nested-quantifier shape and does not catch every backtracking bomb (e.g. ambiguous alternation like (a|a)+). Keep adapter regexes simple and treat third-party-sourced adapters as untrusted input unless you review them.
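To make the "nested-quantifier shape" concrete, here is a deliberately simple textual check. It is an illustrative assumption, not the guard shipped in regex_utils.py: looks_nested_repeat is a hypothetical name, and unlike the guard described above this sketch does not special-case possessive quantifiers or atomic groups, and it can misfire on escaped characters such as \+.

```python
import re

# Hypothetical heuristic: a group with no nested parentheses that contains an
# unbounded quantifier (+ or *) and is itself followed by +, * or {.
_NESTED_REPEAT = re.compile(r"\([^()]*[+*][^()]*\)[+*{]")


def looks_nested_repeat(pattern: str) -> bool:
    """Return True when the pattern text contains a (x+)+ style shape."""
    return _NESTED_REPEAT.search(pattern) is not None


print(looks_nested_repeat(r"^(a+)+$"))      # True: quantified group, quantified again
print(looks_nested_repeat(r"^/news/\d+$"))  # False: single flat quantifier
print(looks_nested_repeat(r"(a|a)+"))       # False: ambiguous alternation slips through
```

The last line shows why a shape-based check is only a safe-subset filter: the alternation case is still a backtracking risk but has no nested quantifier to detect.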

JSON extraction typing

json step results preserve the underlying JSON scalar types. A path like items[*].id returns a list of int/float/bool values as they appear in the source, not their stringified form. Nested arrays are preserved as nested lists.

When a path applies [index] or [*] to a non-list value, the step returns an empty string rather than serializing the object at that path.
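The typing rules can be illustrated with a small standalone path walker. This is a sketch only: extract and its tokenizer are hypothetical, and returning "" for a missing key or an out-of-range index is an assumption made here for simplicity, not documented behavior.

```python
import json
import re

# Tokens are either a key name or a bracketed index / wildcard.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\*|\d+)\]")


def extract(data, path):
    """Hypothetical walker for paths like items[0].name or items[*].id."""
    values = [data]
    wildcard = False
    for key, index in _TOKEN.findall(path):
        next_values = []
        for value in values:
            if key:
                if not isinstance(value, dict) or key not in value:
                    return ""  # assumption: missing keys resolve to ""
                next_values.append(value[key])
            elif not isinstance(value, list):
                return ""  # [index] or [*] applied to a non-list
            elif index == "*":
                wildcard = True
                next_values.extend(value)
            elif int(index) < len(value):
                next_values.append(value[int(index)])
            else:
                return ""  # assumption: out-of-range index resolves to ""
        values = next_values
    # A wildcard path yields a list; otherwise the single value is returned as-is.
    return values if wildcard else values[0]


doc = json.loads('{"items": [{"id": 1, "ok": true}, {"id": 2.5, "ok": false}], "meta": {"n": 2}}')
print(extract(doc, "items[*].id"))  # [1, 2.5]
print(extract(doc, "items[0].ok"))  # True
print(repr(extract(doc, "meta[0]")))  # ''
```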

Datetime step timezone

The datetime step emits timestamps from datetime.now(datetime.UTC), i.e. a tz-aware UTC value. This avoids the cross-timezone footgun of the previous naive datetime.now(), which silently used the host's local time. To include the UTC offset or zone name in the formatted string, use the %z or %Z directive: for example, format = "%Y-%m-%dT%H:%M:%S%z" produces a trailing +0000 for UTC.
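The difference between naive and tz-aware values is easy to see with the standard library alone. The snippet below is a plain datetime illustration rather than the step's code; it uses timezone.utc, which is the same object that datetime.UTC aliases on Python 3.11+.

```python
from datetime import datetime, timezone

FORMAT = "%Y-%m-%dT%H:%M:%S%z"

# A tz-aware UTC value renders its offset through %z.
aware = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(aware.strftime(FORMAT))  # 2024-01-02T03:04:05+0000

# A naive value has no offset, so %z renders as an empty string.
naive = datetime(2024, 1, 2, 3, 4, 5)
print(naive.strftime(FORMAT))  # 2024-01-02T03:04:05

# The current time as a tz-aware UTC value.
now_utc = datetime.now(timezone.utc)
```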

Installation

Scraplet DSL supports Python 3.11, 3.12, and 3.13. The 3.11 floor is intentional: the loader uses the stdlib tomllib module. CI runs format, lint, tests, build, and installed-wheel smoke checks on all supported Python versions.

From PyPI

pip install scraplet-dsl

This installs the library with the default httpx-based HTTP fetcher and the lxml-based HTML parser. Both are runtime dependencies, so a fresh pip install is enough to start resolving URLs.

Local editable install (for development)

git clone <your-fork-or-mirror-url>
cd Scraplet-DSL
pip install -e .

If you use Poetry, poetry install works the same way.

For Library Users

Once installed, the minimal end-to-end flow is:

from scraplet_dsl import ScriptEngine, load_adapter_bundle
from scraplet_dsl.engine import select_adapter

url = "https://example.com/news/123"  # the URL to resolve
bundle = load_adapter_bundle(...)  # see "Adapter Bundle Example" below
adapter = select_adapter(bundle.adapters, url)
if adapter is None:
    raise RuntimeError("No adapter matched the URL")

result = ScriptEngine().resolve(adapter, url)
print(result.output)

The bundle is a plain Python dict (or a TOML file loaded the same way); its schema is described in Adapter Schema.

Adapter Bundle Example

A runnable end-to-end example, defining one adapter in a Python dict bundle and resolving a URL through it:

from scraplet_dsl import load_adapter_bundle, ScriptEngine
from scraplet_dsl.engine import select_adapter

bundle = load_adapter_bundle(
    {
        "schema_version": 1,
        "adapters": [
            {
                "name": "example_article",
                "priority": 10,
                "url_triggers": {"domains": ["example.com"], "path_patterns": [r"^/news/"]},
                "steps": [
                    {"type": "fetch", "url_var": "input_url", "save_body_as": "html", "retry": 2, "retry_backoff": 0.5, "timeout": 20},
                    {"type": "select", "html_var": "html", "selector": "h1", "output": "title"},
                    {"type": "output", "output": {"title": "${title}"}},
                ],
            }
        ],
    }
)

url = "https://example.com/news/123"
adapter = select_adapter(bundle.adapters, url)
if adapter is None:
    raise RuntimeError("No adapter matched the URL")

result = ScriptEngine().resolve(adapter, url)
print(result.output)

Adapter Schema (bundle mode)

Top-level keys:

  • schema_version: must be integer 1 (true/false are rejected)
  • adapters: list of adapter definitions
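The note that true/false are rejected matters because Python's bool is a subclass of int and True == 1. A validator along these lines would make the distinction; is_valid_schema_version is a hypothetical name, not the loader's function.

```python
def is_valid_schema_version(value: object) -> bool:
    """Hypothetical check: accept only the integer 1, never a bool."""
    # isinstance(True, int) is True, so compare the exact type instead.
    return type(value) is int and value == 1


print(is_valid_schema_version(1))     # True
print(is_valid_schema_version(True))  # False
print(is_valid_schema_version(1.0))   # False
```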

Adapter keys:

  • name: unique adapter name (duplicates are rejected at load time)
  • priority: integer priority; lower value wins when multiple adapters match
  • url_triggers.domains: non-empty list of domains
  • url_triggers.path_patterns: optional list of valid regex path filters (compiled during load and rejected if they use blocked nested-repeat forms)
  • steps: ordered list of step tables
  • headers: optional table of HTTP headers attached to every fetch the adapter issues (validated as a str -> str map; values override the engine-level fetcher headers)

Selected numeric validation rules:

  • fetch.retry: integer >= 0
  • fetch.retry_backoff: number >= 0 (defaults to 0.5; retries sleep retry_backoff * attempt_number seconds)
  • fetch.timeout: number > 0; when omitted, the active fetcher's default is used (HttpxFetcher defaults to 20.0 seconds). This caps a single HTTP request and is separate from the adapter-wide resolution deadline (see Resolution Deadline).
  • replace.count: integer >= 0
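The retry sleep rule (retry_backoff * attempt_number) gives a linearly growing schedule. The helper below computes it under the assumption that attempt_number starts at 1 for the first retry; backoff_delays is a hypothetical name used only for this illustration.

```python
def backoff_delays(retry: int, retry_backoff: float = 0.5) -> list[float]:
    """Hypothetical helper: sleep durations before each retry."""
    if isinstance(retry, bool) or not isinstance(retry, int) or retry < 0:
        raise ValueError("retry must be an integer >= 0")
    if retry_backoff < 0:
        raise ValueError("retry_backoff must be >= 0")
    # Assumption: the first retry is attempt number 1.
    return [retry_backoff * attempt for attempt in range(1, retry + 1)]


print(backoff_delays(2, 0.5))  # [0.5, 1.0]
print(backoff_delays(0))       # []
```

With retry = 2 and retry_backoff = 0.5, as in the bundle example, a fetch that keeps failing would sleep 1.5 seconds in total under this reading, on top of the per-request timeouts.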

Error Model

  • ScrapletError: base class for all errors below
  • ScriptValidationError: invalid schema or step declaration
  • ExecutionError: runtime step failure
  • MissingDependencyError: optional dependency missing at runtime

Runtime errors raised through ScriptEngine.resolve include the adapter and step context.

Resolution Deadline

ScriptEngine.resolve accepts an optional timeout= keyword argument that caps the total wall-clock budget of a single resolve call:

result = engine.resolve(adapter, url, timeout=10.0)

When timeout is set, an internal monotonic deadline is propagated through ExecutionContext:

  • The engine checks the deadline before each step, so a runaway adapter aborts in bounded time with an ExecutionError referencing the step that was skipped.
  • FetchStep checks the deadline before each fetch attempt and refuses to start another attempt if no budget is left for its retry backoff. The retry-backoff sleep is also capped to the remaining budget, so a slow or wedged fetch cannot exhaust the budget while sleeping between retries.
  • A step that does not respect the deadline (e.g. a misbehaving custom step) is still bounded: subsequent steps and retries will see the deadline already exceeded and abort.

timeout must be > 0 when given. The default (None) preserves the prior behavior — no deadline is propagated and ExecutionContext.deadline stays unset.
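The deadline behavior described above (check before each step, cap sleeps to the remaining budget) can be modeled with a monotonic clock. The Deadline class here is a hypothetical stand-in, not ExecutionContext; the injectable clock exists only so the example is deterministic.

```python
import time


class Deadline:
    """Hypothetical wall-clock budget based on a monotonic clock."""

    def __init__(self, timeout=None, clock=time.monotonic):
        if timeout is not None and timeout <= 0:
            raise ValueError("timeout must be > 0 when given")
        self._clock = clock
        # None means no deadline is propagated at all.
        self._expires_at = None if timeout is None else clock() + timeout

    def remaining(self):
        """Seconds left, or None when no deadline is set."""
        if self._expires_at is None:
            return None
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        left = self.remaining()
        return left is not None and left <= 0.0

    def cap(self, seconds):
        """Shorten a planned sleep so it never outlasts the budget."""
        left = self.remaining()
        return seconds if left is None else min(seconds, left)


fake_now = [100.0]
deadline = Deadline(timeout=10.0, clock=lambda: fake_now[0])
fake_now[0] = 109.75            # 0.25 s of budget left
print(deadline.cap(0.5))        # 0.25
fake_now[0] = 111.0
print(deadline.expired())       # True
```

A monotonic clock is the natural choice for this kind of budget because it is unaffected by system clock adjustments during a resolve call.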

Network Hardening (HttpxFetcher)

The default HttpxFetcher enforces a small security policy on every request:

  • Scheme allowlist. Only http:// and https:// URLs are accepted. Unsupported schemes (ftp://, file://, ...) are rejected before the client opens a socket. The same check is applied to every redirect Location.
  • Domain allowlist. When allow_domains=... is configured, the host of the initial URL and every redirect target is validated against the allowlist before the request is issued. Disallowed hosts are never contacted.
  • Private-network blocking. Private, loopback, link-local, multicast, reserved, and unspecified addresses are allowed by default for backward compatibility. Set block_private_networks=True to reject IP literals and hostnames that resolve to those address ranges before any request is issued.
  • Redirect cap. Up to max_redirects (default 5) redirects are followed. Beyond that, the fetch fails with httpx.HTTPError("too many redirects").
  • Streaming size cap. Requests are issued in streaming mode, and response bodies are read in chunks via iter_bytes(). The fetch is aborted as soon as more than max_bytes (default 2_000_000) bytes are accumulated, so oversized responses cannot be fully buffered into memory.
  • UTF-8 only. Bodies that fail UTF-8 decoding are rejected rather than silently producing mojibake.
  • HTTP status codes are data, not errors. A non-2xx response (404, 410, 500, ...) is returned as a normal FetchResult carrying its status, body, and headers. Only transport-level and policy failures (unsupported scheme, disallowed host, redirect cap, oversized body, invalid UTF-8, timeout/connect error) raise. Use fetch.save_status_as plus an assert or output template if an adapter needs to branch on the status.

These guarantees apply to the final response (after redirects) as well as every intermediate hop.
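The streaming size cap and the strict UTF-8 rule can be sketched independently of any HTTP client. read_capped and BodyTooLarge are hypothetical names; in a real fetcher the chunks would come from something like httpx's Response.iter_bytes(), while here any iterable of bytes is accepted so the example runs offline.

```python
from collections.abc import Iterable


class BodyTooLarge(Exception):
    """Hypothetical error for a response that exceeds the size cap."""


def read_capped(chunks: Iterable[bytes], max_bytes: int = 2_000_000) -> str:
    """Accumulate chunks, abort once the cap is exceeded, then decode strictly."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            # Stop reading immediately; the rest of the body is never buffered.
            raise BodyTooLarge(f"body exceeds {max_bytes} bytes")
    # Strict decoding raises UnicodeDecodeError instead of producing mojibake.
    return bytes(buffer).decode("utf-8")


print(read_capped([b"he", b"llo"], max_bytes=10))  # hello
```

Because the check runs inside the loop, at most one chunk beyond max_bytes is ever held in memory.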

Important: without allow_domains=..., HttpxFetcher allows any valid http:// or https:// host. If adapters or input URLs can come from untrusted third parties, configure allow_domains and consider enabling block_private_networks=True:

from scraplet_dsl.http import HttpxFetcher

fetcher = HttpxFetcher(
    allow_domains=("example.com",),
    block_private_networks=True,
)

Private-network blocking resolves hostnames before each request and redirect. This is a useful SSRF guard, but it is not a complete network sandbox; deploy network-level egress controls for high-risk untrusted adapter execution.
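The address categories listed for block_private_networks map directly onto properties of the stdlib ipaddress module. The sketch below classifies IP literals and applies a scheme and domain allowlist; both function names are hypothetical, hostname resolution is deliberately left out, and exact host matching is an assumption.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_address(literal: str) -> bool:
    """Hypothetical check: True for private, loopback, and similar ranges."""
    ip = ipaddress.ip_address(literal)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_allowed_url(url: str, allow_domains: tuple[str, ...]) -> bool:
    """Hypothetical check: scheme allowlist plus exact-match domain allowlist."""
    parts = urlsplit(url)
    return parts.scheme in ALLOWED_SCHEMES and (parts.hostname or "") in allow_domains


print(is_blocked_address("127.0.0.1"))        # True
print(is_blocked_address("169.254.169.254"))  # True
print(is_blocked_address("8.8.8.8"))          # False
print(is_allowed_url("ftp://example.com/x", ("example.com",)))  # False
```

A real guard would also resolve hostnames and re-run both checks on every redirect target, as described above.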

If a fetch step omits timeout, the active fetcher's default is used. The default HttpxFetcher uses 20.0 seconds per request. This caps an individual HTTP request and is distinct from the adapter-wide resolution deadline passed to ScriptEngine.resolve(timeout=...) (see Resolution Deadline).

Development

poetry install
poetry run pytest

License

Scraplet DSL is distributed under the Apache License 2.0. See LICENSE for details.

External contributions are accepted under the contribution terms in CONTRIBUTING.md.

Compatibility Policy

  • scraplet-dsl is pre-1.0 and may take breaking package/API changes in minor releases when that helps the project move faster
  • Adapter-schema breaking changes must be deliberate and must bump schema_version
  • User-visible changes and breakages should be recorded in CHANGELOG.md

Project Layout

  • src/scraplet_dsl/engine.py: adapter matching and execution
  • src/scraplet_dsl/loader.py: schema parsing and validation
  • src/scraplet_dsl/steps.py: step implementations
  • src/scraplet_dsl/http.py: fetcher protocol and default httpx fetcher
  • src/scraplet_dsl/html.py: parser protocol and lxml implementation
  • src/scraplet_dsl/variables.py: variable store and template helpers
  • src/scraplet_dsl/types.py: shared types (Value, Variables, ResolutionResult)
  • src/scraplet_dsl/errors.py: error hierarchy
  • src/scraplet_dsl/regex_utils.py: regex flag parsing and the nested-repeat safety guard
