scrapling-schema

Schema-driven HTML extractor. Define extraction specs in Python (with full IDE type hints) or YAML, and get structured JSON out.

Install

pip install git+https://github.com/aimscrape/scrapling-schema.git

Requirements

  • Python >= 3.10
  • scrapling >= 0.4
  • PyYAML >= 6.0

Python API

Python type spec (recommended)

from scrapling_schema import Schema, Field, Options, Clear, RegexSub

spec = Schema(
    options=Options(clear=Clear(remove_tags=["script", "style"])),
    fields={
        "products": Field(
            css=".product",
            list=True,
            fields={
                "sku":   Field(css="SELF", attr="data-sku"),
                "name":  Field(css=".name", text=True, transform=["strip"]),
                "url":   Field(css="a.link", attr="href"),
                "price": Field(css=".price", text=True, transform=[
                    RegexSub(pattern=r"[^0-9.]+"),
                    "to_float",
                ]),
                "tags":  Field(css=".tags li", list=True, text=True, transform=["strip"]),
            },
        )
    },
)

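# `html` holds the HTML source of the page to extract from.
result = spec.extract(html)

# JSON Schema for the extracted result.
json_schema = spec.json_schema(title="Products")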

YAML spec

options:
  clear:
    remove_tags: ["script", "style"]

fields:
  products:
    css: ".product"
    list: true
    fields:
      sku:
        css: "SELF"
        attr: "data-sku"
      name:
        css: ".name"
        text: true
        transform: ["strip"]
      price:
        css: ".price"
        text: true
        transform:
          - regex_sub: { pattern: "[^0-9.]+", repl: "" }
          - to_float

Run the YAML spec from Python:

from scrapling_schema import extract_from_yaml

result = extract_from_yaml(html, yaml_spec)

CLI

scrapling-schema --spec spec.yml --html-file page.html
scrapling-schema --spec spec.yml --schema
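
To call the CLI from a script, a thin wrapper along these lines may help. It uses only the flags shown above; the helper names `build_command` and `run_cli` are hypothetical, and the assumption that the CLI writes its result to stdout is not confirmed by this README.

```python
import subprocess


def build_command(spec_path, html_path=None, schema=False):
    """Build the argv list for the scrapling-schema CLI (hypothetical helper)."""
    cmd = ["scrapling-schema", "--spec", spec_path]
    if schema:
        # Schema mode: only the spec is needed.
        cmd.append("--schema")
    elif html_path is not None:
        # Extraction mode: run the spec against a saved HTML file.
        cmd += ["--html-file", html_path]
    return cmd


def run_cli(spec_path, html_path):
    """Run an extraction and return whatever the CLI printed.

    Assumption: the CLI prints its output to stdout.
    """
    completed = subprocess.run(
        build_command(spec_path, html_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.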

Field reference

param      type  description
css        str   CSS selector. Use "SELF" to select the context node itself
text       bool  Extract text content
attr       str   Extract an attribute value
html       bool  Extract outer HTML
list       bool  Return a list of matched nodes
fields     dict  Nested fields (object or list of objects)
transform  list  Transform pipeline (see below)
required   bool  Raise ValidationError if value is empty
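
How these parameters combine is easiest to see on a toy model. The sketch below is not scrapling-schema's implementation: it uses only the standard library, plain dicts in place of Field objects, and ElementTree paths in place of CSS selectors (a hypothetical `path` key). It illustrates three ideas from the table: "SELF" selects the context node, `list` returns every match, and nested `fields` are evaluated relative to each matched node.

```python
import xml.etree.ElementTree as ET

# Toy spec: dicts standing in for Field objects (illustrative only).
SPEC = {
    "products": {
        "path": ".//div[@class='product']",
        "list": True,
        "fields": {
            "sku": {"path": "SELF", "attr": "data-sku"},
            "name": {"path": ".//span[@class='name']"},
        },
    }
}


def select(node, path):
    """Return matching nodes; "SELF" means the context node itself."""
    return [node] if path == "SELF" else node.findall(path)


def read(node, field):
    """Turn one matched node into a value."""
    if "fields" in field:
        # Nested fields are resolved relative to the matched node.
        return {k: extract_field(node, f) for k, f in field["fields"].items()}
    if "attr" in field:
        return node.get(field["attr"])
    return "".join(node.itertext())  # text content


def extract_field(node, field):
    values = [read(match, field) for match in select(node, field["path"])]
    if field.get("list"):
        return values  # every match
    return values[0] if values else None  # first match only


def extract(root, spec):
    return {k: extract_field(root, f) for k, f in spec.items()}


DOC = """<body>
  <div class="product" data-sku="A1"><span class="name">Widget</span></div>
  <div class="product" data-sku="B2"><span class="name">Gadget</span></div>
</body>"""

toy_result = extract(ET.fromstring(DOC), SPEC)
# toy_result == {"products": [{"sku": "A1", "name": "Widget"},
#                             {"sku": "B2", "name": "Gadget"}]}
```

Because `products` sets `list`, the nested fields produce one object per matched node, which is the "list of objects" case in the `fields` row.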

Transform reference

transform / shorthand     description
"strip"                   Strip whitespace
"to_int"                  Convert to integer
"to_float"                Convert to float
RegexSub(pattern, repl)   Regex substitution
Split(delimiter)          Split string into list (requires list=True)
Default(value)            Fallback value when result is empty
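
A transform pipeline is a list of steps applied in order, each receiving the previous step's output. The sketch below mimics the `price` pipeline from the examples above (strip, regex substitution, float conversion) with plain standard-library callables. It is an illustration of the idea, not the library's code, and `apply_transforms` is a hypothetical name.

```python
import re


def apply_transforms(value, steps):
    """Feed the value through each step in order (hypothetical helper)."""
    for step in steps:
        value = step(value)
    return value


# Stand-ins for "strip", RegexSub(pattern=r"[^0-9.]+", repl="") and "to_float".
price_steps = [
    str.strip,
    lambda s: re.sub(r"[^0-9.]+", "", s),
    float,
]

price = apply_transforms("  $1,299.50 USD ", price_steps)
# "  $1,299.50 USD " -> "$1,299.50 USD" -> "1299.50" -> 1299.5
```

Order matters: converting to float before the regex substitution would fail on the currency symbol and thousands separator.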

Testing

Install the dev dependencies (in a virtualenv) and run the test suite:

python -m pip install -e ".[dev]"
python -m pytest

Run a single test file:

python -m pytest tests/test_extractor.py

License

MIT
