
scrapling-schema

Schema-driven HTML extractor. Define extraction specs in Python (with full IDE type hints) or YAML, and get structured JSON out.

Install

pip install scrapling-schema

Requirements

  • Python >= 3.10
  • scrapling >= 0.4
  • PyYAML >= 6.0

Python API

Python type spec (recommended)

from scrapling_schema import Schema, Field, Options, Clear, RegexSub

spec = Schema(
    options=Options(clear=Clear(remove_tags=["script", "style"])),
    fields={
        "products": Field(
            css=".product",
            type="array<object>",
            fields={
                "sku":   Field(css="SELF", type="string", attr="data-sku"),
                "name":  Field(css=".name", type="string"),
                "url":   Field(css="a.link", type="string", attr="href"),
                "price": Field(css=".price", type="number", transform=[
                    RegexSub(pattern=r"[^0-9.]+", repl=""),
                ]),
                "tags":  Field(css=".tags li", type="array<string>"),
            },
        )
    },
)

result = spec.extract(html)
json_schema = spec.json_schema(title="Products")
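The result is plain JSON-compatible data. Below is a hand-built illustration of the shape the "products" spec above targets (an assumed example with invented values, not output captured from scrapling-schema):

```python
import json

# Assumed example of the structured output for the spec above:
# array<object> becomes a list of dicts, array<string> a list of
# strings, and a type="number" field comes back as a float.
result = {
    "products": [
        {
            "sku": "A-100",
            "name": "Widget",
            "url": "/products/a-100",
            "price": 19.99,
            "tags": ["red", "small"],
        }
    ]
}
print(json.dumps(result, indent=2))
```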

YAML spec

options:
  clear:
    remove_tags: ["script", "style"]

fields:
  products:
    css: ".product"
    type: "array<object>"
    fields:
      sku:
        css: "SELF"
        type: "string"
        attr: "data-sku"
      name:
        css: ".name"
        type: "string"
      price:
        css: ".price"
        type: "number"
        transform:
          - regex_sub: { pattern: "[^0-9.]+", repl: "" }
Run a YAML spec from Python:

from scrapling_schema import extract_from_yaml

result = extract_from_yaml(html, yaml_spec)

CLI

Extract from a saved HTML file, or print the spec's JSON schema:

scrapling-schema --spec spec.yml --html-file page.html
scrapling-schema --spec spec.yml --schema

Field reference

| param | type | description |
| --- | --- | --- |
| css | str | CSS selector. Use "SELF" to select the context node itself. |
| attr | str | Extract an attribute value instead of text (or a special value such as "innerHTML"). |
| type | str | Output type: "string", "number", "object", or an array form such as "array<string>" / "array<object>". Required. |
| nullable | bool | If false, missing values raise ValidationError. |
| defaultValue | any | Fallback value used when the extracted value is empty. |
| fields | dict | Nested fields (for object / array<object>). |
| transform | list | Transform pipeline (see below). |
| required | bool | Raise ValidationError if the value is empty. |

Notes:

  • type is required for every field.
  • Arrays must use type: "array<...>" (no items: and no list:).
  • attr supports special values:
    • "innerHTML": extract HTML string from the selected node.
    • "ownText": extract direct text for the selected node (excludes descendant text).
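To make the "ownText" distinction concrete, here is a minimal stdlib sketch (html.parser, not scrapling's implementation) comparing full text extraction with direct-text-only extraction:

```python
from html.parser import HTMLParser

# Sketch only: collects a node's full text vs. its direct ("own") text
# for the snippet <div>Price: <b>$10</b> today</div>.
class OwnTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.own_text = []   # text directly inside the outer <div>
        self.all_text = []   # text from the <div> and all descendants

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        self.all_text.append(data)
        if self.depth == 1:          # depth 1 = direct child of <div>
            self.own_text.append(data)

p = OwnTextParser()
p.feed("<div>Price: <b>$10</b> today</div>")
print("".join(p.all_text))   # "Price: $10 today"  (default text extraction)
print("".join(p.own_text))   # "Price:  today"     (ownText: descendant text excluded)
```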

Transform reference

| transform shorthand | description |
| --- | --- |
| RegexSub(pattern, repl) | Regex substitution. |
| Split(delimiter) | Split a string into array items (requires type: "array<...>"). |

Notes:

  • String outputs are stripped automatically (no transform needed).
  • Use field-level defaultValue for fallbacks (defaults are not supported inside transforms).
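The two transforms above can be sketched with plain stdlib calls (this mirrors the example specs; it is not scrapling-schema's internal code):

```python
import re

# RegexSub(pattern=r"[^0-9.]+", repl="") on a price string, before the
# field's type="number" coercion (plain re here, not library internals).
raw = "$1,299.00"
cleaned = re.sub(r"[^0-9.]+", "", raw)
price = float(cleaned)
print(cleaned, price)        # 1299.00 1299.0

# Split(delimiter=",") followed by the automatic stripping of string outputs.
tags = [t.strip() for t in "red, blue".split(",")]
print(tags)                  # ['red', 'blue']
```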

Testing

Install the dev dependencies (in a virtualenv) and run the test suite:

python -m pip install -e ".[dev]"
python -m pytest

Run a single test file:

python -m pytest tests/test_extractor.py

License

MIT

Project details

Built distribution: scrapling_schema-1.1.1-py3-none-any.whl (13.0 kB)
Provenance

Attestation bundles for scrapling_schema-1.1.1-py3-none-any.whl were published by publish.yml on aimscrape/scrapling-schema.