
scrapling-schema

Schema-driven HTML extractor. Define extraction specs in Python (with full IDE type hints) or YAML, and get structured JSON out.

Install

pip install scrapling-schema

Requirements

  • Python >= 3.10
  • scrapling >= 0.4
  • PyYAML >= 6.0

Python API

Python type spec (recommended)

from scrapling_schema import Schema, Field, Options, Clear, RegexSub

spec = Schema(
    options=Options(clear=Clear(remove_tags=["script", "style"])),
    fields={
        "products": Field(
            css=".product",
            type="array<object>",
            fields={
                "sku":   Field(css="SELF", type="string", attr="data-sku"),
                "name":  Field(css=".name", type="string"),
                "url":   Field(css="a.link", type="string", attr="href"),
                "price": Field(css=".price", type="number", transform=[
                    RegexSub(pattern=r"[^0-9.]+", repl=""),
                ]),
                "tags":  Field(css=".tags li", type="array<string>"),
            },
        )
    },
)

result = spec.extract(html)
json_schema = spec.json_schema(title="Products")

Boolean fields (type: "boolean")

Boolean output is derived from the field's type, not from a transform. The extractor coerces common truthy/falsy values:

  • truthy: "true", "t", "yes", "y", "on", "1" (case-insensitive; surrounding whitespace is ignored)
  • falsy: "false", "f", "no", "n", "off", "0"
  • numbers: 1 → True, 0 → False (other numbers become None)

Python example:

from scrapling_schema import Schema, Field

html = "<span class='in-stock'> yes </span>"
spec = Schema(fields={"in_stock": Field(css=".in-stock", type="boolean")})

data = spec.extract(html)
assert data["in_stock"] is True

If you want invalid/missing values to fail fast, set nullable=False:

from scrapling_schema import Schema, Field, ValidationError

html = "<span class='in-stock'> maybe </span>"
spec = Schema(fields={"in_stock": Field(css=".in-stock", type="boolean", nullable=False)})

try:
    spec.extract(html)
except ValidationError:
    pass

YAML spec

options:
  clear:
    remove_tags: ["script", "style"]

fields:
  products:
    css: ".product"
    type: "array<object>"
    fields:
      sku:
        css: "SELF"
        type: "string"
        attr: "data-sku"
      name:
        css: ".name"
        type: "string"
      price:
        css: ".price"
        type: "number"
        transform:
          - regex_sub: { pattern: "[^0-9.]+", repl: "" }
  in_stock:
    css: ".in-stock"
    type: "boolean"

Run the YAML spec from Python:

from scrapling_schema import extract_from_yaml

result = extract_from_yaml(html, yaml_spec)

CLI

scrapling-schema --spec spec.yml --html-file page.html   # extract and print structured JSON
scrapling-schema --spec spec.yml --schema                # print the spec's JSON Schema

Field reference

param         type      description
------------  --------  ----------------------------------------------------------------------
css           str       CSS selector. Use "SELF" to select the context node itself.
attr          str       Extract an attribute value (or a special value such as "innerHTML").
type          str       Output type: "string", "number", "integer", "boolean", "object", "array<string>", or "array<object>".
nullable      bool      If false, missing values raise ValidationError.
defaultValue  any       Fallback value used when the extracted value is empty.
fields        dict      Nested fields (for object / array<object>).
transform     list      Transform pipeline (see below).
callback      callable  Field-level post-processing hook (Python API only).
outputSchema  dict      Override the JSON Schema for this field (useful when callback changes the output type/shape).
required      bool      If true, raise ValidationError when the value is empty.

Notes:

  • type is required for every field.
  • Arrays must use type: "array<...>" (no items: and no list:).
  • attr supports special values:
    • "innerHTML": extract HTML string from the selected node.
    • "ownText": extract direct text for the selected node (excludes descendant text).
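
For illustration, the two special attr values can sit side by side in a dict spec (the .desc selector and the sample HTML in the comments are hypothetical):

```python
# Hypothetical field specs illustrating the special attr values.
# For <div class="desc">Hello <b>world</b></div>:
#   innerHTML -> 'Hello <b>world</b>'   (HTML string of the node)
#   ownText   -> 'Hello'                (direct text only, no descendant text)
spec = {
    "fields": {
        "description_html": {
            "css": ".desc",          # assumed selector for illustration
            "type": "string",
            "attr": "innerHTML",
        },
        "description_text": {
            "css": ".desc",
            "type": "string",
            "attr": "ownText",
        },
    }
}
```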

Transform reference

transform shorthand      description
-----------------------  ----------------------------------------------------------
RegexSub(pattern, repl)  Regex substitution.
Split(delimiter)         Split a string into array items (requires type: "array<...>").

Notes:

  • String outputs are stripped automatically (no transform needed).
  • Use field-level defaultValue for fallbacks (defaults are not supported inside transforms).
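
Conceptually, a Split step on an "array<string>" field, combined with the automatic stripping, behaves like this (an illustrative sketch, not the library's code):

```python
def split_and_strip(raw: str, delimiter: str = ",") -> list[str]:
    """Sketch of Split(delimiter) plus the automatic per-item stripping."""
    return [part.strip() for part in raw.split(delimiter)]

split_and_strip("red , green ,blue")  # → ["red", "green", "blue"]
```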

When to use transform vs callback

Both are meant for post-processing, but they work at different levels and have different ergonomics.

Use transform for value-centric pipelines

Good fit when you want a predictable, reusable pipeline on a single extracted value (e.g., regex cleanup, split).

Order of operations (scalar fields):

  1. Extract raw string
  2. Apply transform pipeline
  3. Apply type coercion (number/integer/boolean)
  4. Apply callback (if any)

Example: remove currency symbols before coercing to number:

from scrapling_schema import Schema, Field, RegexSub

spec = Schema(
    fields={
        "price": Field(
            css=".price",
            type="number",
            transform=[RegexSub(pattern=r"[^0-9.]+", repl="")],
        )
    }
)
data = spec.extract(html)

Use callback for whole-field logic (filtering, sorting, aggregation)

callback receives the final extracted value for the field:

  • scalar field → the scalar value (str|int|float|bool|None)
  • array<...> field → the whole list
  • object field → the whole dict

This is a better fit for list-level operations or aggregations.

When callback changes the output type/shape

callback is an arbitrary Python function, so the library cannot reliably infer a JSON Schema for its return value. If your callback changes the type/shape (e.g., list → object, object → string), set outputSchema on the field to keep spec.json_schema() in sync with the actual output.
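
For instance, a callback that collapses an array<string> field into one comma-separated string could declare its new shape via outputSchema (a sketch; the selector and helper name are hypothetical):

```python
# Hypothetical example: the callback changes list[str] -> str,
# so outputSchema declares the new shape for spec.json_schema().
def join_tags(tags: list[str]) -> str:
    return ", ".join(tags)

spec = {
    "fields": {
        "tags": {
            "css": ".tags li",                   # assumed selector
            "type": "array<string>",
            "callback": join_tags,               # receives the whole list
            "outputSchema": {"type": "string"},  # keeps json_schema() in sync
        }
    }
}
```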

Example: filter a list of objects (keep only items you care about):

from scrapling_schema import Schema, Field

def keep_only_a(items: list[dict]) -> list[dict]:
    return [item for item in items if "A" in item["name"]]

spec = Schema(
    fields={
        "products": Field(
            css=".item",
            type="array<object>",
            callback=keep_only_a,
            fields={
                "name": Field(css=".name", type="string"),
            },
        )
    }
)
data = spec.extract(html)

array<object> special case: transform is per-item

For type: "array<object>", transform is applied to each extracted object (each list element). If a transform returns None, the item is dropped from the list.

from scrapling_schema import extract

def drop_product_a(item: dict) -> dict | None:
    return None if item.get("name") == "Product A" else item

spec = {
    "fields": {
        "products": {
            "css": ".item",
            "type": "array<object>",
            "transform": [drop_product_a],
            "fields": {"name": {"css": ".name", "type": "string"}},
        }
    }
}
data = extract(html, spec)

YAML note

YAML specs support only the built-in transform steps (e.g., regex_sub, split). Python callables (transform: [my_fn] / callback: my_fn) are only supported via the Python API (typed Schema/Field or a Python dict spec), not via YAML text.

Testing

Install the dev dependencies (in a virtualenv) and run the test suite:

python -m pip install -e ".[dev]"
python -m pytest

Run a single test file:

python -m pytest tests/test_extractor.py

License

MIT
