Safe, data-driven URL adapter engine for structured content extraction

Scraplet DSL

Scraplet DSL is a safe, data-driven adapter engine for resolving URLs into structured outputs without executing arbitrary code.

It is designed as a reusable package that can be embedded in other applications (news readers, crawlers, ETL tools, feed resolvers).

Features

URL trigger matching by domain and path regex
Deterministic step pipeline with strict schema validation
HTML extraction with CSS selectors (lxml + cssselect)
Regex extraction and replacement operations
JSON extraction with dot-path and array wildcard support
Date helper steps (datetime, parse_date)
Pluggable HTTP fetcher and HTML parser interfaces for testability
Lazy HTML parser construction: ScriptEngine() does not import lxml/cssselect until a select step actually runs

Step Types

fetch: GET URL into variables (save_body_as, optional save_status_as, optional save_final_url_as, optional retry, optional retry_backoff, optional timeout)
select: CSS select text/html/attribute from HTML; mode selects first (default) or all matches
regex: search/findall over variable content, with optional numeric or named capture-group selection and optional flags (i/m/s)
replace: regex replace in an input variable, with optional flags (i/m/s)
assign: assign literal or templated value
assert: fail if variable is missing/empty
set_url: rewrite input_url (and derived input_host/input_path/input_query) for subsequent steps
datetime: write current datetime with custom format
parse_date: parse human dates into normalized format, with optional month_names for locale-specific month names (bypasses dateparser)
json: parse JSON and extract by path (items[0].name, items[*].name); scalar values (int/float/bool) are preserved, and invalid array access on non-lists resolves to ""
output: produce final output map

URL rewriting

set_url and replace (when output = "input_url") keep the derived URL parts in sync. After rewriting input_url, the variables input_host, input_path, and input_query are re-derived from the new URL, so later steps and output templates see consistent values.

set_url is URL-aware:

A value containing ${input_url} is treated as path manipulation: any text before the reference is prepended to the base path, any text after it is appended, the base query string is preserved, and the fragment is dropped. For example, against https://example.com/news?c=1, ${input_url}/rss becomes https://example.com/news/rss?c=1 and archive${input_url}/rss becomes https://example.com/archive/news/rss?c=1.
A value with a scheme (e.g. https://other.example/feed) is treated as an absolute override.
A relative value without ${input_url} (e.g. /feed) is joined onto the current input_url via urljoin, with the base's query and fragment stripped.
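
The three rules above can be sketched with the standard library alone. This is an illustration of the documented behavior, not the package's implementation: rewrite_url and derive_parts are hypothetical names, and deriving input_host from the URL's hostname (rather than the full netloc) is an assumption.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

PLACEHOLDER = "${input_url}"


def rewrite_url(value: str, input_url: str) -> str:
    """Hypothetical helper mirroring the three set_url rules described above."""
    base = urlsplit(input_url)
    if PLACEHOLDER in value:
        # Path manipulation: text before the reference is prepended to the
        # base path, text after it is appended; query kept, fragment dropped.
        prefix, _, suffix = value.partition(PLACEHOLDER)
        path = base.path
        if prefix:
            path = "/" + prefix.strip("/") + path
        return urlunsplit((base.scheme, base.netloc, path + suffix, base.query, ""))
    if urlsplit(value).scheme:
        # Absolute override: the value replaces input_url outright.
        return value
    # Relative value: join onto the base with its query and fragment stripped.
    bare_base = urlunsplit((base.scheme, base.netloc, base.path, "", ""))
    return urljoin(bare_base, value)


def derive_parts(url: str) -> dict[str, str]:
    """Re-derive the host/path/query variables from a rewritten URL."""
    parts = urlsplit(url)
    return {
        "input_url": url,
        "input_host": parts.hostname or "",  # assumption: host without port
        "input_path": parts.path,
        "input_query": parts.query,
    }


rewritten = rewrite_url("${input_url}/rss", "https://example.com/news?c=1#top")
print(derive_parts(rewritten))
```

Re-deriving the parts from the rewritten URL in one place is what keeps later steps from seeing a stale input_path or input_query.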

Regex safety

Adapter regexes are compiled with a conservative safety guard.

regex.pattern, replace.pattern, and url_triggers.path_patterns reject nested repeated subpatterns such as ^(a+)+$, which are a common source of catastrophic backtracking in Python's re engine. Possessive quantifiers (a++) and atomic groups ((?>...)) suppress backtracking and are accepted.
The guard is enforced during adapter load for bundle-defined adapters, and on first execution for step instances created directly in Python.
This is a heuristic safe-subset check, not a complete ReDoS analyser: it targets the nested-quantifier shape and does not catch every backtracking bomb (e.g. ambiguous alternation like (a|a)+). Keep adapter regexes simple and treat third-party-sourced adapters as untrusted input unless you review them.
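
To make the shape of the check concrete, here is a minimal sketch of a nested-repeat scanner. It is not the code in regex_utils.py: has_nested_repeat and compile_guarded are hypothetical names, and the scanner only looks at the + and * quantifiers (it ignores {m,n} bounds and some character-class edge cases).

```python
import re


def has_nested_repeat(pattern: str) -> bool:
    """Return True if a repeated group itself contains a backtracking repeat."""
    stack: list[list[bool]] = []  # per open group: [is_atomic, contains_repeat]
    in_class = False
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\":  # escaped character: skip it and the character it escapes
            i += 2
            continue
        if in_class:  # inside [...] nothing is a group or a quantifier
            in_class = ch != "]"
            i += 1
            continue
        if ch == "[":
            in_class = True
        elif ch == "(":
            stack.append([pattern.startswith("(?>", i), False])
        elif ch == ")" and stack:
            atomic, inner = stack.pop()
            repeated = pattern[i + 1 : i + 2] in ("+", "*")
            possessive = repeated and pattern[i + 2 : i + 3] == "+"
            backtracks = inner and not atomic  # atomic groups cannot be re-entered
            if possessive:
                backtracks = False  # possessive repeat gives nothing back
            elif repeated:
                if backtracks:
                    return True  # nested repeat such as (a+)+
                backtracks = True  # the group as a whole now repeats
            if backtracks and stack:
                stack[-1][1] = True  # the enclosing group inherits the repeat
            if repeated:
                i += 2 if possessive else 1  # skip the quantifier characters
        elif ch in "+*":
            if pattern[i + 1 : i + 2] == "+":
                i += 1  # possessive quantifier such as a++: skip the marker
            elif stack:
                stack[-1][1] = True
        i += 1
    return False


def compile_guarded(pattern: str, flags: int = 0) -> re.Pattern[str]:
    """Compile a pattern only if it passes the heuristic guard."""
    if has_nested_repeat(pattern):
        raise ValueError(f"nested repeat rejected: {pattern!r}")
    return re.compile(pattern, flags)


for candidate in (r"^(a+)+$", r"^(a++)+$", r"(?>a+)+", r"(a|a)+"):
    print(candidate, has_nested_repeat(candidate))
```

The last candidate, (a|a)+, passes this scanner even though it can backtrack badly, which is the kind of gap the paragraph above warns about.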

JSON extraction typing

json step results preserve the underlying JSON scalar types. A path like items[*].id returns a list of int/float/bool values as they appear in the source, not their stringified form. Nested arrays are preserved as nested lists.

When a path applies [index] or [*] to a non-list value, the step returns an empty string rather than serializing the object at that path.
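
A minimal sketch of path extraction with these typing rules, written against plain parsed JSON. The extract function is hypothetical and is not the json step's implementation; returning "" for a missing key or an out-of-range index is an assumption, since the text above only specifies the non-list case.

```python
import json
import re

# One token is either a key name or a bracketed index / wildcard.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+|\*)\]")


def extract(document, path: str):
    """Walk a dot path with [index] and [*] support, keeping JSON types."""
    values = [document]
    fanned_out = False  # becomes True once a [*] wildcard has been applied
    for match in _TOKEN.finditer(path):
        key, index = match.group(1), match.group(2)
        next_values = []
        for value in values:
            if key is not None:
                if not isinstance(value, dict) or key not in value:
                    return ""  # assumption: a missing key also resolves to ""
                next_values.append(value[key])
            elif not isinstance(value, list):
                return ""  # [index] or [*] applied to a non-list value
            elif index == "*":
                next_values.extend(value)
                fanned_out = True
            else:
                position = int(index)
                if position >= len(value):
                    return ""  # assumption: out-of-range index resolves to ""
                next_values.append(value[position])
        values = next_values
    return values if fanned_out else values[0]


data = json.loads(
    '{"items": [{"id": 1, "ok": true}, {"id": 2.5, "ok": false}],'
    ' "grid": [[1, 2], [3]], "meta": {"count": 2}}'
)
print(extract(data, "items[*].id"))
print(extract(data, "meta[0]"))
```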

Datetime step timezone

The datetime step emits timestamps from datetime.datetime.now(datetime.UTC), i.e. a tz-aware UTC value. This avoids the cross-timezone correctness footgun of the previous naive datetime.now(), which silently used the host's local time. To include the UTC offset in the formatted string, use the %z/%Z directives: for example, format = "%Y-%m-%dT%H:%M:%S%z" produces a trailing +0000 for UTC.
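
The difference between the aware and naive values is easy to see with the standard library. The snippet below uses timezone.utc, which is the object that datetime.UTC aliases on Python 3.11+; the variable names are only illustrative.

```python
from datetime import datetime, timezone

fmt = "%Y-%m-%dT%H:%M:%S%z"

aware = datetime.now(timezone.utc)  # tz-aware UTC value
naive = datetime.now()              # naive value in the host's local time

print(aware.strftime(fmt))          # the offset is rendered by %z
print(repr(naive.strftime("%z")))   # a naive value has no offset to render
```

With a naive value, %z expands to an empty string, so the formatted timestamp carries no hint of which timezone produced it.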

Installation

Scraplet DSL supports Python 3.11, 3.12, and 3.13. The 3.11 floor is intentional: the loader uses the stdlib tomllib module. CI runs format, lint, tests, build, and installed-wheel smoke checks on all supported Python versions.

From PyPI

pip install scraplet-dsl

This installs the library with the default httpx-based HTTP fetcher and the lxml-based HTML parser. Both are runtime dependencies, so a fresh pip install is enough to start resolving URLs.

Local editable install (for development)

git clone <your-fork-or-mirror-url>
cd Scraplet-DSL
pip install -e .

If you use Poetry, poetry install works the same way.

For Library Users

Once installed, the minimal end-to-end flow is:

from scraplet_dsl import ScriptEngine, load_adapter_bundle
from scraplet_dsl.engine import select_adapter

url = "https://example.com/news/123"  # the URL to resolve

bundle = load_adapter_bundle(...)  # see "Adapter Bundle Example" below
adapter = select_adapter(bundle.adapters, url)
if adapter is None:
    raise RuntimeError("No adapter matched the URL")

result = ScriptEngine().resolve(adapter, url)
print(result.output)

The bundle is a plain Python dict (or a TOML file loaded the same way); its schema is described under "Adapter Schema (bundle mode)".

Adapter Bundle Example

A runnable end-to-end example, defining one adapter in a Python dict bundle and resolving a URL through it:

from scraplet_dsl import load_adapter_bundle, ScriptEngine
from scraplet_dsl.engine import select_adapter

bundle = load_adapter_bundle(
    {
        "schema_version": 1,
        "adapters": [
            {
                "name": "example_article",
                "priority": 10,
                "url_triggers": {"domains": ["example.com"], "path_patterns": [r"^/news/"]},
                "steps": [
                    {"type": "fetch", "url_var": "input_url", "save_body_as": "html", "retry": 2, "retry_backoff": 0.5, "timeout": 20},
                    {"type": "select", "html_var": "html", "selector": "h1", "output": "title"},
                    {"type": "output", "output": {"title": "${title}"}},
                ],
            }
        ],
    }
)

url = "https://example.com/news/123"
adapter = select_adapter(bundle.adapters, url)
if adapter is None:
    raise RuntimeError("No adapter matched the URL")

result = ScriptEngine().resolve(adapter, url)
print(result.output)

Adapter Schema (bundle mode)

Top-level keys:

schema_version: must be integer 1 (true/false are rejected)
adapters: list of adapter definitions

Adapter keys:

name: unique adapter name (duplicates are rejected at load time)
priority: integer priority; lower value wins when multiple adapters match
url_triggers.domains: non-empty list of domains
url_triggers.path_patterns: optional list of valid regex path filters (compiled during load and rejected if they use blocked nested-repeat forms)
steps: ordered list of step tables
headers: optional table of HTTP headers attached to every fetch the adapter issues (validated as a str -> str map; values override the engine-level fetcher headers)

Selected numeric validation rules:

fetch.retry: integer >= 0
fetch.retry_backoff: number >= 0 (defaults to 0.5; retries sleep retry_backoff * attempt_number seconds)
fetch.timeout: number > 0; when omitted, the active fetcher's default is used (HttpxFetcher defaults to 20.0 seconds). This caps a single HTTP request and is separate from the adapter-wide resolution deadline (see Resolution Deadline).
replace.count: integer >= 0
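
The rejection of true/false for schema_version matters because bool is a subclass of int in Python, so True == 1 would otherwise pass. A minimal sketch of checks in this spirit follows; the function names are hypothetical, the errors are plain ValueError rather than the package's ScriptValidationError, and treating bools as invalid for the fetch fields and defaulting retry to 0 are assumptions.

```python
def is_int(value) -> bool:
    """True for real integers only; bool is excluded explicitly."""
    return isinstance(value, int) and not isinstance(value, bool)


def is_number(value) -> bool:
    """True for int or float, again excluding bool."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_schema_version(value) -> None:
    if not is_int(value) or value != 1:
        raise ValueError(f"schema_version must be integer 1, got {value!r}")


def check_fetch_step(step: dict) -> None:
    retry = step.get("retry", 0)  # assumption: retry defaults to 0
    if not is_int(retry) or retry < 0:
        raise ValueError("fetch.retry must be an integer >= 0")
    backoff = step.get("retry_backoff", 0.5)
    if not is_number(backoff) or backoff < 0:
        raise ValueError("fetch.retry_backoff must be a number >= 0")
    timeout = step.get("timeout")  # None means: use the fetcher's default
    if timeout is not None and (not is_number(timeout) or timeout <= 0):
        raise ValueError("fetch.timeout must be a number > 0")


def retry_sleep(retry_backoff: float, attempt_number: int) -> float:
    """Sleep before the next retry: retry_backoff * attempt_number seconds."""
    return retry_backoff * attempt_number


check_schema_version(1)
check_fetch_step({"type": "fetch", "retry": 2, "retry_backoff": 0.5, "timeout": 20})
print([retry_sleep(0.5, attempt) for attempt in (1, 2, 3)])
```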

Error Model

ScrapletError: base class for all errors below
ScriptValidationError: invalid schema or step declaration
ExecutionError: runtime step failure
MissingDependencyError: optional dependency missing at runtime

Runtime errors raised through ScriptEngine.resolve include the adapter and step context.
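
Because every error shares one base class, callers can handle specific failures first and fall back to the base. The classes below are local stand-ins that only mirror the documented names so the snippet runs without the package; that each subclass derives directly from ScrapletError is an assumption, and describe is a hypothetical helper.

```python
class ScrapletError(Exception):
    """Stand-in for the package's base error class."""


class ScriptValidationError(ScrapletError):
    """Stand-in: invalid schema or step declaration."""


class ExecutionError(ScrapletError):
    """Stand-in: runtime step failure."""


class MissingDependencyError(ScrapletError):
    """Stand-in: optional dependency missing at runtime."""


def describe(exc: Exception) -> str:
    """Map an exception to a coarse handling decision, most specific first."""
    if isinstance(exc, ScriptValidationError):
        return "adapter definition is invalid"
    if isinstance(exc, MissingDependencyError):
        return "install the missing dependency"
    if isinstance(exc, ExecutionError):
        return "a step failed at runtime"
    if isinstance(exc, ScrapletError):
        return "other scraplet error"
    return "unrelated error"


try:
    raise ExecutionError("select step produced no match")
except ScrapletError as exc:  # the base class catches every error above
    print(describe(exc))
```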

Resolution Deadline

ScriptEngine.resolve accepts an optional timeout= keyword argument that caps the total wall-clock budget of a single resolve call:

result = engine.resolve(adapter, url, timeout=10.0)

When timeout is set, an internal monotonic deadline is propagated through ExecutionContext:

The engine checks the deadline before each step, so a runaway adapter aborts in bounded time with an ExecutionError referencing the step that was skipped.
FetchStep checks the deadline before each fetch attempt and refuses to start another attempt if no budget is left for its retry backoff. The retry-backoff sleep is also capped to the remaining budget, so a slow or wedged fetch cannot exhaust the budget while sleeping between retries.
A step that does not respect the deadline (e.g. a misbehaving custom step) is still bounded: subsequent steps and retries will see the deadline already exceeded and abort.

timeout must be > 0 when given. The default (None) preserves the prior behavior: no deadline is propagated and ExecutionContext.deadline stays unset.
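
The deadline handling around retries can be sketched as follows. This is a simplified illustration, not FetchStep itself: fetch_with_retries and DeadlineExceeded are hypothetical names, OSError stands in for a transport failure, and the clock and sleep parameters are injected only so the behavior can be exercised without real waiting.

```python
import time


class DeadlineExceeded(RuntimeError):
    """Stand-in for the ExecutionError raised when the budget is exhausted."""


def fetch_with_retries(fetch_once, *, retry, retry_backoff=0.5, deadline=None,
                       clock=time.monotonic, sleep=time.sleep):
    """Call fetch_once up to retry + 1 times under an optional monotonic deadline."""
    for attempt in range(1, retry + 2):
        # Check the deadline before each fetch attempt.
        if deadline is not None and clock() >= deadline:
            raise DeadlineExceeded(f"deadline exceeded before attempt {attempt}")
        try:
            return fetch_once()
        except OSError:
            if attempt > retry:
                raise  # retries exhausted: surface the last failure
        delay = retry_backoff * attempt
        if deadline is not None:
            remaining = deadline - clock()
            if remaining <= 0:
                raise DeadlineExceeded("no budget left for the retry backoff")
            delay = min(delay, remaining)  # never sleep past the deadline
        sleep(delay)


# A resolve(timeout=10.0) call would translate into a deadline like this one.
deadline = time.monotonic() + 10.0
print(fetch_with_retries(lambda: "body", retry=2, deadline=deadline))
```

Using a monotonic clock keeps the budget immune to wall-clock adjustments while a resolve call is in flight.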

Network Hardening (HttpxFetcher)

The default HttpxFetcher enforces a small security policy on every request:

Scheme allowlist. Only http:// and https:// URLs are accepted. Unsupported schemes (ftp://, file://, ...) are rejected before the client opens a socket. The same check is applied to every redirect Location.
Domain allowlist. When allow_domains=... is configured, the host of the initial URL and every redirect target is validated against the allowlist before the request is issued. Disallowed hosts are never contacted.
Private-network blocking. Private, loopback, link-local, multicast, reserved, and unspecified addresses are allowed by default for backward compatibility. Set block_private_networks=True to reject IP literals and hostnames that resolve to those address ranges before any request is issued.
Redirect cap. Up to max_redirects (default 5) redirects are followed. Beyond that, the fetch fails with httpx.HTTPError("too many redirects").
Streaming size cap. Requests are issued in streaming mode, and response bodies are read in chunks via iter_bytes(). The fetch is aborted as soon as more than max_bytes (default 2_000_000) bytes are accumulated, so oversized responses cannot be fully buffered into memory.
UTF-8 only. Bodies that fail UTF-8 decoding are rejected rather than silently producing mojibake.
HTTP status codes are data, not errors. A non-2xx response (404, 410, 500, ...) is returned as a normal FetchResult carrying its status, body, and headers. Only transport-level and policy failures (unsupported scheme, disallowed host, redirect cap, oversized body, invalid UTF-8, timeout/connect error) raise. Use fetch.save_status_as plus an assert or output template if an adapter needs to branch on the status.

These guarantees apply to the final response (after redirects) as well as every intermediate hop.
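
The streaming size cap and the UTF-8 rule can be illustrated without any HTTP client. In this sketch read_capped and BodyTooLarge are hypothetical names, and the chunk iterable stands in for what a streaming response's iter_bytes() would yield.

```python
from typing import Iterable


class BodyTooLarge(ValueError):
    """Stand-in for the policy failure raised on an oversized body."""


def read_capped(chunks: Iterable[bytes], max_bytes: int = 2_000_000) -> str:
    """Accumulate chunks, aborting as soon as more than max_bytes have arrived."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            # Stop consuming the stream: the rest of the body is never read.
            raise BodyTooLarge(f"response body exceeds {max_bytes} bytes")
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError instead of
    # being replaced with placeholder characters.
    return buffer.decode("utf-8")


print(read_capped([b"hello, ", b"world"], max_bytes=64))
```

Checking the size after each chunk means memory use is bounded by max_bytes plus one chunk, regardless of what Content-Length claims.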

Important: without allow_domains=..., HttpxFetcher allows any valid http:// or https:// host. If adapters or input URLs can come from untrusted third parties, configure allow_domains and consider enabling block_private_networks=True:

from scraplet_dsl.http import HttpxFetcher

fetcher = HttpxFetcher(
    allow_domains=("example.com",),
    block_private_networks=True,
)

Private-network blocking resolves hostnames before each request and redirect. This is a useful SSRF guard, but it is not a complete network sandbox; deploy network-level egress controls for high-risk untrusted adapter execution.
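
A rough sketch of the kind of per-URL policy check described in this section, using only the standard library. It is not HttpxFetcher's code: check_url and PolicyError are hypothetical names, the exact-match domain comparison is an assumption, and hostname resolution is left to an injected resolve callable so the example performs no DNS lookups.

```python
import ipaddress
from urllib.parse import urlsplit


class PolicyError(ValueError):
    """Stand-in for a policy failure raised before any request is issued."""


def is_blocked_address(text: str) -> bool:
    """True for private, loopback, link-local, multicast, reserved, unspecified."""
    ip = ipaddress.ip_address(text)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def check_url(url: str, *, allow_domains=None, block_private_networks=False,
              resolve=None) -> None:
    """Validate one URL (initial or redirect target) against the policy."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise PolicyError(f"unsupported scheme: {parts.scheme!r}")
    host = (parts.hostname or "").lower()
    if not host:
        raise PolicyError("URL has no host")
    if allow_domains is not None and host not in allow_domains:
        raise PolicyError(f"host not in allowlist: {host}")  # assumption: exact match
    if block_private_networks:
        try:
            addresses = [str(ipaddress.ip_address(host))]  # host is an IP literal
        except ValueError:
            # Hostname: resolve via the injected callable (hypothetical hook).
            addresses = list(resolve(host)) if resolve else []
        for address in addresses:
            if is_blocked_address(address):
                raise PolicyError(f"address in blocked range: {address}")


check_url("https://example.com/news", allow_domains=("example.com",))
for candidate in ("file:///etc/passwd", "http://127.0.0.1/", "http://[::1]/admin"):
    try:
        check_url(candidate, block_private_networks=True)
    except PolicyError as exc:
        print(candidate, "->", exc)
```

Running the same check on every redirect Location, not just the initial URL, is what prevents a permitted host from bouncing the fetcher to an internal address.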

If a fetch step omits timeout, the active fetcher's default is used. The default HttpxFetcher uses 20.0 seconds per request. This caps an individual HTTP request and is distinct from the adapter-wide resolution deadline passed to ScriptEngine.resolve(timeout=...) (see Resolution Deadline).

Development

poetry install
poetry run pytest

License

Scraplet DSL is distributed under the Apache License 2.0. See LICENSE for details.

External contributions are accepted under the contribution terms in CONTRIBUTING.md.

Compatibility Policy

scraplet-dsl is pre-1.0 and may take breaking package/API changes in minor releases when that helps the project move faster
Adapter-schema breaking changes must be deliberate and must bump schema_version
User-visible changes and breakages should be recorded in CHANGELOG.md

Project Layout

src/scraplet_dsl/engine.py: adapter matching and execution
src/scraplet_dsl/loader.py: schema parsing and validation
src/scraplet_dsl/steps.py: step implementations
src/scraplet_dsl/http.py: fetcher protocol and default httpx fetcher
src/scraplet_dsl/html.py: parser protocol and lxml implementation
src/scraplet_dsl/variables.py: variable store and template helpers
src/scraplet_dsl/types.py: shared types (Value, Variables, ResolutionResult)
src/scraplet_dsl/errors.py: error hierarchy
src/scraplet_dsl/regex_utils.py: regex flag parsing and the nested-repeat safety guard

Project details

Version 0.1.0, released Jul 1, 2026

