Safe, data-driven URL adapter engine for structured content extraction
Scraplet DSL
Scraplet DSL is a safe, data-driven adapter engine for resolving URLs into structured outputs without executing arbitrary code.
It is designed as a reusable package that can be embedded in other applications (news readers, crawlers, ETL tools, feed resolvers).
Features
- URL trigger matching by domain and path regex
- Deterministic step pipeline with strict schema validation
- HTML extraction with CSS selectors (lxml + cssselect)
- Regex extraction and replacement operations
- JSON extraction with dot-path and array wildcard support
- Date helper steps (datetime, parse_date)
- Pluggable HTTP fetcher and HTML parser interfaces for testability
- Lazy HTML parser construction: ScriptEngine() does not import lxml/cssselect until a select step actually runs
Step Types
- fetch: GET URL into variables (save_body_as, optional save_status_as, optional save_final_url_as, optional retry, optional retry_backoff, optional timeout)
- select: CSS select text/html/attribute from HTML; mode selects first (default) or all matches
- regex: search/findall over variable content, with optional numeric or named capture-group selection and optional flags (i/m/s)
- replace: regex replace in an input variable, with optional flags (i/m/s)
- assign: assign literal or templated value
- assert: fail if variable is missing/empty
- set_url: rewrite input_url (and derived input_host/input_path/input_query) for subsequent steps
- datetime: write current datetime with custom format
- parse_date: parse human dates into normalized format, with optional month_names for locale-specific month names (bypasses dateparser)
- json: parse JSON and extract by path (items[0].name, items[*].name); scalar values (int/float/bool) are preserved, and invalid array access on non-lists resolves to ""
- output: produce final output map
URL rewriting
set_url and replace (when output = "input_url") keep the derived URL parts
in sync. After rewriting input_url, the variables input_host, input_path,
and input_query are re-derived from the new URL, so later steps and output
templates see consistent values.
set_url is URL-aware:
- A value containing ${input_url} is treated as path manipulation: the prefix and suffix around the reference are wrapped around the base path, the base query string is preserved, and the fragment is dropped. ${input_url}/rss against https://example.com/news?c=1 becomes https://example.com/news/rss?c=1. archive${input_url}/rss becomes https://example.com/archive/news/rss?c=1.
- A value with a scheme (e.g. https://other.example/feed) is treated as an absolute override.
- A relative value without ${input_url} (e.g. /feed) is joined onto the current input_url via urljoin, with the base's query and fragment stripped.
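The three set_url cases above can be sketched with the standard library. This is an illustration only, not the library's implementation: rewrite_url and PLACEHOLDER are hypothetical names, and details such as how slashes in the prefix are normalised are assumptions.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

PLACEHOLDER = "${input_url}"  # hypothetical constant for this sketch


def rewrite_url(base: str, value: str) -> str:
    """Hypothetical helper mirroring the three set_url cases described above."""
    parts = urlsplit(base)
    if PLACEHOLDER in value:
        # Path manipulation: wrap the base path, keep the query, drop the fragment.
        prefix, _, suffix = value.partition(PLACEHOLDER)
        path = parts.path
        if prefix:
            path = "/" + prefix.strip("/") + path
        return urlunsplit((parts.scheme, parts.netloc, path + suffix, parts.query, ""))
    if urlsplit(value).scheme:
        # Absolute override: the value replaces the URL outright.
        return value
    # Relative value: join against the base with its query and fragment stripped.
    bare_base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return urljoin(bare_base, value)


base = "https://example.com/news?c=1#top"
suffix_only = rewrite_url(base, "${input_url}/rss")
with_prefix = rewrite_url(base, "archive${input_url}/rss")
absolute = rewrite_url(base, "https://other.example/feed")
relative = rewrite_url(base, "/feed")
print(suffix_only, with_prefix, absolute, relative, sep="\n")
```

The first two results reproduce the examples given in the list; the last two show the absolute-override and urljoin cases.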
Regex safety
Adapter regexes are compiled with a conservative safety guard.
- regex.pattern, replace.pattern, and url_triggers.path_patterns reject nested repeated subpatterns such as ^(a+)+$, which are a common source of catastrophic backtracking in Python's re engine. Possessive quantifiers (a++) and atomic groups ((?>...)) suppress backtracking and are accepted.
- The guard is enforced during adapter load for bundle-defined adapters, and on first execution for step instances created directly in Python.
- This is a heuristic safe-subset check, not a complete ReDoS analyser: it targets the nested-quantifier shape and does not catch every backtracking bomb (e.g. ambiguous alternation like (a|a)+). Keep adapter regexes simple and treat third-party-sourced adapters as untrusted input unless you review them.
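To make the "nested-quantifier shape" concrete, here is a deliberately crude textual heuristic. It is not the guard shipped in regex_utils.py; has_nested_repeat is a hypothetical name, and the sketch ignores {m,n} quantifiers, character classes, and nested groups.

```python
import re

# Hypothetical heuristic: a non-atomic group that contains a backtracking
# + or * and is itself followed by + or *. The lookarounds skip possessive
# quantifiers (a++) and escaped literals (\+).
_NESTED_REPEAT = re.compile(
    r"\((?!\?>)[^()]*(?<![+*?\\])[+*](?!\+)[^()]*\)[+*]"
)


def has_nested_repeat(pattern: str) -> bool:
    """Return True when the pattern contains a repeated group with a repeat inside."""
    return _NESTED_REPEAT.search(pattern) is not None


nested = has_nested_repeat(r"^(a+)+$")          # classic backtracking shape
possessive = has_nested_repeat(r"(a++)+")       # possessive inner quantifier
atomic = has_nested_repeat(r"(?>a+)+")          # atomic group
alternation = has_nested_repeat(r"(a|a)+")      # ambiguous alternation, missed
print(nested, possessive, atomic, alternation)
```

Like the guard described above, this sketch does not flag the ambiguous-alternation case, which is why reviewing third-party adapters still matters.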
JSON extraction typing
json step results preserve the underlying JSON scalar types. A path like
items[*].id returns a list of int/float/bool values as they appear in
the source, not their stringified form. Nested arrays are preserved as nested
lists.
When a path applies [index] or [*] to a non-list value, the step returns an
empty string rather than serializing the object at that path.
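The typing rules above can be illustrated with a small path walker. This is a sketch under assumptions, not the library's json step: extract and _walk are hypothetical names, and returning "" for a missing key or an out-of-range index is an assumption made here for symmetry with the non-list rule.

```python
import json
import re

# Tokens are either a key name or a bracketed index / wildcard.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+|\*)\]")


def extract(data, path):
    """Hypothetical dot-path extractor supporting [index] and [*]."""
    tokens = [(m.group(1), m.group(2)) for m in _TOKEN.finditer(path)]
    return _walk(data, tokens)


def _walk(node, tokens):
    if not tokens:
        return node  # scalars come back with their JSON type intact
    (key, index), rest = tokens[0], tokens[1:]
    if key is not None:
        if not isinstance(node, dict) or key not in node:
            return ""  # assumption: missing keys resolve to ""
        return _walk(node[key], rest)
    if not isinstance(node, list):
        return ""  # [index] or [*] applied to a non-list
    if index == "*":
        return [_walk(item, rest) for item in node]
    position = int(index)
    return _walk(node[position], rest) if position < len(node) else ""


doc = json.loads(
    '{"items": [{"id": 1, "ok": true}, {"id": 2.5, "ok": false}],'
    ' "meta": {"name": "x"}}'
)
ids = extract(doc, "items[*].id")
first_ok = extract(doc, "items[0].ok")
bad_index = extract(doc, "meta[0]")
print(ids, first_ok, repr(bad_index))
```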
Datetime step timezone
The datetime step emits timestamps from datetime.now(datetime.UTC), i.e.
a tz-aware UTC value. This avoids the cross-timezone correctness footgun of
the previous naive datetime.now(), which silently used the host's local time.
To include the UTC offset or zone name in the formatted string, use the
%z/%Z directives; e.g. format = "%Y-%m-%dT%H:%M:%S%z" produces a
trailing +0000 for UTC.
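The formatting behaviour can be checked with the standard library alone. The sketch below uses timezone.utc, which datetime.UTC aliases on Python 3.11+; the variable names are illustrative.

```python
from datetime import datetime, timezone

# tz-aware "now" in UTC, the same kind of value the paragraph above describes.
now = datetime.now(timezone.utc)

stamp = now.strftime("%Y-%m-%dT%H:%M:%S%z")  # %z renders the numeric offset
zone = now.strftime("%Z")                    # %Z renders the zone name

print(stamp)
print(zone)
```

A naive datetime.now() formatted with the same string would render %z as an empty string, which is one quick way to spot a naive value.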
Installation
Scraplet DSL supports Python 3.11, 3.12, and 3.13. The 3.11 floor is
intentional: the loader uses the stdlib tomllib module. CI runs format,
lint, tests, build, and installed-wheel smoke checks on all supported Python
versions.
From PyPI
    pip install scraplet-dsl
This installs the library with the default httpx-based HTTP fetcher and the
lxml-based HTML parser. Both are runtime dependencies, so a fresh
pip install is enough to start resolving URLs.
Local editable install (for development)
    git clone <your-fork-or-mirror-url>
    cd Scraplet-DSL
    pip install -e .
If you use Poetry, poetry install works the same way.
For Library Users
Once installed, the minimal end-to-end flow is:
    from scraplet_dsl import ScriptEngine, load_adapter_bundle
    from scraplet_dsl.engine import select_adapter

    url = "https://example.com/news/123"  # the URL to resolve
    bundle = load_adapter_bundle(...)  # see "Adapter Bundle Example" below

    # Pick the adapter whose triggers match the URL, if any.
    adapter = select_adapter(bundle.adapters, url)
    if adapter is None:
        raise RuntimeError("No adapter matched the URL")

    result = ScriptEngine().resolve(adapter, url)
    print(result.output)
The bundle is a plain Python dict (or a TOML file loaded the same way); its schema is described in Adapter Schema.
Adapter Bundle Example
A runnable end-to-end example, defining one adapter in a Python dict bundle and resolving a URL through it:
    from scraplet_dsl import load_adapter_bundle, ScriptEngine
    from scraplet_dsl.engine import select_adapter

    bundle = load_adapter_bundle(
        {
            "schema_version": 1,
            "adapters": [
                {
                    "name": "example_article",
                    "priority": 10,
                    "url_triggers": {"domains": ["example.com"], "path_patterns": [r"^/news/"]},
                    "steps": [
                        {"type": "fetch", "url_var": "input_url", "save_body_as": "html", "retry": 2, "retry_backoff": 0.5, "timeout": 20},
                        {"type": "select", "html_var": "html", "selector": "h1", "output": "title"},
                        {"type": "output", "output": {"title": "${title}"}},
                    ],
                }
            ],
        }
    )

    url = "https://example.com/news/123"
    adapter = select_adapter(bundle.adapters, url)
    if adapter is None:
        raise RuntimeError("No adapter matched the URL")

    result = ScriptEngine().resolve(adapter, url)
    print(result.output)
Adapter Schema (bundle mode)
Top-level keys:
- schema_version: must be integer 1 (true/false are rejected)
- adapters: list of adapter definitions
Adapter keys:
- name: unique adapter name (duplicates are rejected at load time)
- priority: integer priority; lower value wins when multiple adapters match
- url_triggers.domains: non-empty list of domains
- url_triggers.path_patterns: optional list of valid regex path filters (compiled during load and rejected if they use blocked nested-repeat forms)
- steps: ordered list of step tables
- headers: optional table of HTTP headers attached to every fetch the adapter issues (validated as a str -> str map; values override the engine-level fetcher headers)
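How the trigger and priority keys interact can be sketched over plain dicts. This is not the library's select_adapter: pick_adapter is a hypothetical name, and both the exact-host comparison and the use of re.search on the path are assumptions.

```python
import re
from urllib.parse import urlsplit


def pick_adapter(adapters, url):
    """Hypothetical matcher: filter by domain and path, then take the lowest priority."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    matches = []
    for adapter in adapters:
        triggers = adapter["url_triggers"]
        if host not in triggers["domains"]:
            continue  # assumption: exact host match only
        patterns = triggers.get("path_patterns")
        if patterns and not any(re.search(p, parts.path) for p in patterns):
            continue
        matches.append(adapter)
    # Lower priority value wins; None when nothing matched.
    return min(matches, key=lambda a: a["priority"], default=None)


adapters = [
    {"name": "fallback", "priority": 20,
     "url_triggers": {"domains": ["example.com"]}},
    {"name": "news", "priority": 10,
     "url_triggers": {"domains": ["example.com"], "path_patterns": [r"^/news/"]}},
]
news_hit = pick_adapter(adapters, "https://example.com/news/123")
about_hit = pick_adapter(adapters, "https://example.com/about")
no_hit = pick_adapter(adapters, "https://other.example/news/1")
print(news_hit["name"], about_hit["name"], no_hit)
```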
Selected numeric validation rules:
- fetch.retry: integer >= 0
- fetch.retry_backoff: number >= 0 (defaults to 0.5; retries sleep retry_backoff * attempt_number seconds)
- fetch.timeout: number > 0; when omitted, the active fetcher's default is used (HttpxFetcher defaults to 20.0 seconds). This caps a single HTTP request and is separate from the adapter-wide resolution deadline (see Resolution Deadline).
- replace.count: integer >= 0
Error Model
- ScrapletError: base class for all errors below
- ScriptValidationError: invalid schema or step declaration
- ExecutionError: runtime step failure
- MissingDependencyError: optional dependency missing at runtime
Runtime errors raised through ScriptEngine.resolve include adapter and step context.
Resolution Deadline
ScriptEngine.resolve accepts an optional timeout= keyword argument that
caps the total wall-clock budget of a single resolve call:
    result = engine.resolve(adapter, url, timeout=10.0)
When timeout is set, an internal monotonic deadline is propagated through
ExecutionContext:
- The engine checks the deadline before each step, so a runaway adapter aborts in bounded time with an ExecutionError referencing the step that was skipped.
- FetchStep checks the deadline before each fetch attempt and refuses to start another attempt if no budget is left for its retry backoff. The retry-backoff sleep is also capped to the remaining budget, so a slow or wedged fetch cannot exhaust the budget while sleeping between retries.
- A step that does not respect the deadline (e.g. a misbehaving custom step) is still bounded: subsequent steps and retries will see the deadline already exceeded and abort.
timeout must be > 0 when given. The default (None) preserves the prior
behavior: no deadline is propagated and ExecutionContext.deadline stays
unset.
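The interplay between retries, backoff, and a monotonic deadline can be sketched as follows. This is not FetchStep itself: run_with_retries and DeadlineExceeded are hypothetical names, attempt numbering starting at 1 is an assumption, and the clock and sleep functions are injected only so the example runs instantly.

```python
import time


class DeadlineExceeded(RuntimeError):
    """Stand-in for the library's ExecutionError in this sketch."""


def run_with_retries(attempt, retry, retry_backoff, deadline=None,
                     clock=time.monotonic, sleep=time.sleep):
    """Call attempt() up to retry + 1 times, never sleeping past the deadline."""
    for attempt_number in range(1, retry + 2):
        if deadline is not None and clock() >= deadline:
            raise DeadlineExceeded(f"no budget left before attempt {attempt_number}")
        try:
            return attempt()
        except OSError:
            if attempt_number > retry:
                raise  # retries exhausted, surface the transport error
        pause = retry_backoff * attempt_number
        if deadline is not None:
            pause = min(pause, max(0.0, deadline - clock()))  # cap to remaining budget
        sleep(pause)


# Fake clock so the demonstration is deterministic and does not really sleep.
now = [0.0]
sleeps = []


def fake_clock():
    return now[0]


def fake_sleep(seconds):
    sleeps.append(seconds)
    now[0] += seconds


def always_fails():
    raise OSError("connection refused")


try:
    run_with_retries(always_fails, retry=2, retry_backoff=0.5, deadline=1.25,
                     clock=fake_clock, sleep=fake_sleep)
    outcome = "finished"
except DeadlineExceeded:
    outcome = "deadline"
print(outcome, sleeps)
```

With a 1.25 second budget the second backoff is shortened from 1.0 to 0.75 seconds, and the third attempt is never started.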
Network Hardening (HttpxFetcher)
The default HttpxFetcher enforces a small security policy on every request:
- Scheme allowlist. Only http:// and https:// URLs are accepted. Unsupported schemes (ftp://, file://, ...) are rejected before the client opens a socket. The same check is applied to every redirect Location.
- Domain allowlist. When allow_domains=... is configured, the host of the initial URL and every redirect target is validated against the allowlist before the request is issued. Disallowed hosts are never contacted.
- Private-network blocking. Private, loopback, link-local, multicast, reserved, and unspecified addresses are allowed by default for backward compatibility. Set block_private_networks=True to reject IP literals and hostnames that resolve to those address ranges before any request is issued.
- Redirect cap. Up to max_redirects (default 5) redirects are followed. Beyond that, the fetch fails with httpx.HTTPError("too many redirects").
- Streaming size cap. Requests are issued in streaming mode, and response bodies are read in chunks via iter_bytes(). The fetch is aborted as soon as more than max_bytes (default 2_000_000) bytes are accumulated, so oversized responses cannot be fully buffered into memory.
- UTF-8 only. Bodies that fail UTF-8 decoding are rejected rather than silently producing mojibake.
- HTTP status codes are data, not errors. A non-2xx response (404, 410, 500, ...) is returned as a normal FetchResult carrying its status, body, and headers. Only transport-level and policy failures (unsupported scheme, disallowed host, redirect cap, oversized body, invalid UTF-8, timeout/connect error) raise. Use fetch.save_status_as plus an assert or output template if an adapter needs to branch on the status.
These guarantees apply to the final response (after redirects) as well as every intermediate hop.
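The streaming size cap and the UTF-8 rule from the list above can be sketched over any iterable of byte chunks, which is what a streaming response's iter_bytes() yields. read_capped and BodyTooLarge are hypothetical names; the real fetcher's exception types are not shown in this README.

```python
class BodyTooLarge(Exception):
    """Hypothetical error for this sketch."""


def read_capped(chunks, max_bytes=2_000_000):
    """Accumulate chunks, aborting as soon as the running total exceeds max_bytes."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            raise BodyTooLarge(f"body exceeded {max_bytes} bytes")
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError instead of
    # being replaced with placeholder characters.
    return buffer.decode("utf-8")


consumed = []


def five_chunks():
    for number in range(5):
        consumed.append(number)
        yield b"x" * 4


small = read_capped([b"he", b"llo"], max_bytes=10)
try:
    read_capped(five_chunks(), max_bytes=10)
    aborted = False
except BodyTooLarge:
    aborted = True
print(small, aborted, consumed)
```

Because the check runs inside the loop, the generator is abandoned after the third chunk and the remaining chunks are never read.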
Important: without allow_domains=..., HttpxFetcher allows any valid
http:// or https:// host. If adapters or input URLs can come from untrusted
third parties, configure allow_domains and consider enabling
block_private_networks=True:
    from scraplet_dsl.http import HttpxFetcher

    fetcher = HttpxFetcher(
        allow_domains=("example.com",),
        block_private_networks=True,
    )
Private-network blocking resolves hostnames before each request and redirect. This is a useful SSRF guard, but it is not a complete network sandbox; deploy network-level egress controls for high-risk untrusted adapter execution.
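The address classes named under private-network blocking map directly onto properties of the standard ipaddress module. The sketch below classifies IP literals only; is_blocked_address is a hypothetical name, and the DNS resolution step that the fetcher performs for hostnames is not shown.

```python
import ipaddress


def is_blocked_address(address: str) -> bool:
    """Return True for private, loopback, link-local, multicast, reserved, or unspecified IPs."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


checks = {
    address: is_blocked_address(address)
    for address in ("127.0.0.1", "10.0.0.1", "169.254.1.1", "224.0.0.1",
                    "0.0.0.0", "::1", "8.8.8.8")
}
for address, blocked in checks.items():
    print(address, blocked)
```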
If a fetch step omits timeout, the active fetcher's default is used. The
default HttpxFetcher uses 20.0 seconds per request. This caps an individual
HTTP request and is distinct from the adapter-wide resolution deadline passed
to ScriptEngine.resolve(timeout=...) (see Resolution Deadline).
Development
    poetry install
    poetry run pytest
License
Scraplet DSL is distributed under the Apache License 2.0. See LICENSE for details.
External contributions are accepted under the contribution terms in CONTRIBUTING.md.
Compatibility Policy
- scraplet-dsl is pre-1.0 and may take breaking package/API changes in minor releases when that helps the project move faster
- Adapter-schema breaking changes must be deliberate and must bump schema_version
- User-visible changes and breakages should be recorded in CHANGELOG.md
Project Layout
- src/scraplet_dsl/engine.py: adapter matching and execution
- src/scraplet_dsl/loader.py: schema parsing and validation
- src/scraplet_dsl/steps.py: step implementations
- src/scraplet_dsl/http.py: fetcher protocol and default httpx fetcher
- src/scraplet_dsl/html.py: parser protocol and lxml implementation
- src/scraplet_dsl/variables.py: variable store and template helpers
- src/scraplet_dsl/types.py: shared types (Value, Variables, ResolutionResult)
- src/scraplet_dsl/errors.py: error hierarchy
- src/scraplet_dsl/regex_utils.py: regex flag parsing and the nested-repeat safety guard