
scrapling-schema

Schema-driven HTML extractor. Define extraction specs in Python (with full IDE type hints) or YAML, and get structured JSON out.

Install

pip install scrapling-schema

Requirements

  • Python >= 3.10
  • scrapling >= 0.4
  • PyYAML >= 6.0

Python API

Python type spec (recommended)

from scrapling_schema import Schema, Field, Options, Clear, RegexSub

spec = Schema(
    options=Options(clear=Clear(remove_tags=["script", "style"])),
    fields={
        "products": Field(
            css=".product",
            type="array<object>",
            fields={
                "sku":   Field(css="SELF", type="string", attr="data-sku"),
                "name":  Field(css=".name", type="string"),
                "url":   Field(css="a.link", type="string", attr="href"),
                "price": Field(css=".price", type="number", transform=[
                    RegexSub(pattern=r"[^0-9.]+", repl=""),
                ]),
                "tags":  Field(css=".tags li", type="array<string>"),
            },
        )
    },
)

result = spec.extract(html)
json_schema = spec.json_schema(title="Products")

Boolean fields (type: "boolean")

Boolean output is derived from the field's type, not from a transform. The extractor coerces common truthy/falsy values:

  • truthy: "true", "t", "yes", "y", "on", "1" (case-insensitive; surrounding whitespace is ignored)
  • falsy: "false", "f", "no", "n", "off", "0"
  • numbers: 1 → True, 0 → False (other numbers become None)

Python example:

from scrapling_schema import Schema, Field

html = "<span class='in-stock'> yes </span>"
spec = Schema(fields={"in_stock": Field(css=".in-stock", type="boolean")})

data = spec.extract(html)
assert data["in_stock"] is True

If you want invalid/missing values to fail fast, set nullable=False:

from scrapling_schema import Schema, Field, ValidationError

html = "<span class='in-stock'> maybe </span>"
spec = Schema(fields={"in_stock": Field(css=".in-stock", type="boolean", nullable=False)})

try:
    spec.extract(html)
except ValidationError:
    pass

YAML spec

options:
  clear:
    remove_tags: ["script", "style"]

fields:
  products:
    css: ".product"
    type: "array<object>"
    fields:
      sku:
        css: "SELF"
        type: "string"
        attr: "data-sku"
      name:
        css: ".name"
        type: "string"
      price:
        css: ".price"
        type: "number"
        transform:
          - regex_sub: { pattern: "[^0-9.]+", repl: "" }
  in_stock:
    css: ".in-stock"
    type: "boolean"

Run the YAML spec from Python:

from scrapling_schema import extract_from_yaml

result = extract_from_yaml(html, yaml_spec)

CLI

scrapling-schema --spec spec.yml --html-file page.html   # extract and print structured JSON
scrapling-schema --spec spec.yml --schema                # print the spec's JSON Schema

Field reference

param         type      description
------------  --------  ----------------------------------------------------------------------
css           str       CSS selector. Use "SELF" to select the context node itself.
attr          str       Extract an attribute value (or a special value such as "innerHTML").
type          str       Output type: "string", "number", "integer", "boolean", "object", "array<string>", or "array<object>".
nullable      bool      If false, missing values raise ValidationError.
defaultValue  any       Fallback value used when the extracted value is empty.
fields        dict      Nested fields (for object / array<object>).
transform     list      Transform pipeline (see below).
callback      callable  Field-level post-processing hook (Python API only).
outputSchema  dict      Override the JSON Schema for this field (useful when callback changes the output type/shape).
required      bool      If true, raise ValidationError when the value is empty.

Notes:

  • type is required for every field.
  • Arrays must use type: "array<...>" (no items: and no list:).
  • attr supports special values:
    • "innerHTML": extract HTML string from the selected node.
    • "ownText": extract direct text for the selected node (excludes descendant text).
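
For illustration, the two special attr values can sit side by side in a dict spec (the .desc selector and the sample HTML in the comments are hypothetical):

```python
# Hypothetical field specs illustrating the special attr values.
# For <div class="desc">Hello <b>world</b></div>:
#   innerHTML -> 'Hello <b>world</b>'   (HTML string of the node)
#   ownText   -> 'Hello'                (direct text only, no descendant text)
spec = {
    "fields": {
        "description_html": {
            "css": ".desc",          # assumed selector for illustration
            "type": "string",
            "attr": "innerHTML",
        },
        "description_text": {
            "css": ".desc",
            "type": "string",
            "attr": "ownText",
        },
    }
}
```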

Transform reference

transform shorthand      description
-----------------------  ----------------------------------------------------------
RegexSub(pattern, repl)  Regex substitution.
Split(delimiter)         Split a string into array items (requires type: "array<...>").

Notes:

  • String outputs are stripped automatically (no transform needed).
  • Use field-level defaultValue for fallbacks (defaults are not supported inside transforms).
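
Conceptually, a Split step on an "array<string>" field, combined with the automatic stripping, behaves like this (an illustrative sketch, not the library's code):

```python
def split_and_strip(raw: str, delimiter: str = ",") -> list[str]:
    """Sketch of Split(delimiter) plus the automatic per-item stripping."""
    return [part.strip() for part in raw.split(delimiter)]

split_and_strip("red , green ,blue")  # → ["red", "green", "blue"]
```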

When to use transform vs callback

Both are meant for post-processing, but they work at different levels and have different ergonomics.

Use transform for value-centric pipelines

Good fit when you want a predictable, reusable pipeline on a single extracted value (e.g., regex cleanup, split).

Order of operations (scalar fields):

  1. Extract raw string
  2. Apply transform pipeline
  3. Apply type coercion (number/integer/boolean)
  4. Apply callback (if any)

Example: remove currency symbols before coercing to number:

from scrapling_schema import Schema, Field, RegexSub

spec = Schema(
    fields={
        "price": Field(
            css=".price",
            type="number",
            transform=[RegexSub(pattern=r"[^0-9.]+", repl="")],
        )
    }
)
data = spec.extract(html)

Use callback for whole-field logic (filtering, sorting, aggregation)

callback receives the final extracted value for the field:

  • scalar field → the scalar value (str|int|float|bool|None)
  • array<...> field → the whole list
  • object field → the whole dict

This is a better fit for list-level operations or aggregations.

When callback changes the output type/shape

callback is an arbitrary Python function, so the library cannot reliably infer a JSON Schema for its return value. If your callback changes the type/shape (e.g., list → object, object → string), set outputSchema on the field to keep spec.json_schema() in sync with the actual output.
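
For instance, a callback that collapses an array<string> field into one comma-separated string could declare its new shape via outputSchema (a sketch; the selector and helper name are hypothetical):

```python
# Hypothetical example: the callback changes list[str] -> str,
# so outputSchema declares the new shape for spec.json_schema().
def join_tags(tags: list[str]) -> str:
    return ", ".join(tags)

spec = {
    "fields": {
        "tags": {
            "css": ".tags li",                   # assumed selector
            "type": "array<string>",
            "callback": join_tags,               # receives the whole list
            "outputSchema": {"type": "string"},  # keeps json_schema() in sync
        }
    }
}
```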

Example: filter a list of objects (keep only items you care about):

from scrapling_schema import Schema, Field

def keep_only_a(items: list[dict]) -> list[dict]:
    return [item for item in items if "A" in item["name"]]

spec = Schema(
    fields={
        "products": Field(
            css=".item",
            type="array<object>",
            callback=keep_only_a,
            fields={
                "name": Field(css=".name", type="string"),
            },
        )
    }
)
data = spec.extract(html)

array<object> special case: transform is per-item

For type: "array<object>", transform is applied to each extracted object (each list element). If a transform returns None, the item is dropped from the list.

from scrapling_schema import extract

def drop_product_a(item: dict) -> dict | None:
    return None if item.get("name") == "Product A" else item

spec = {
    "fields": {
        "products": {
            "css": ".item",
            "type": "array<object>",
            "transform": [drop_product_a],
            "fields": {"name": {"css": ".name", "type": "string"}},
        }
    }
}
data = extract(html, spec)

YAML note

YAML specs support only the built-in transform steps (e.g., regex_sub, split). Python callables (transform: [my_fn] / callback: my_fn) are only supported via the Python API (typed Schema/Field or a Python dict spec), not via YAML text.

Testing

Install the dev dependencies (in a virtualenv) and run the test suite:

python -m pip install -e ".[dev]"
python -m pytest

Run a single test file:

python -m pytest tests/test_extractor.py

License

MIT
