YAML-driven HTML extractor powered by selectolax.

selectolax-tree

Parse HTML into structured data using a YAML spec.

Powered by AimScrape.

Chinese documentation: README-ZH.md

Versioned docs:

  • docs/0.1/USAGE.md (applies to 0.1.x)
  • Changelog: CHANGELOG.md

Dependencies

This tool is built on top of:

  • selectolax (HTML parsing + CSS selectors)
  • PyYAML (YAML spec parsing)

They are declared as install dependencies, so pip will install them automatically.

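Install (recommended: venv)

From a checkout of the repository, create a virtual environment and install the package in editable mode with the dev extra: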

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e ".[dev]"

Install via pip

From GitHub

python -m pip install "git+https://github.com/aimscrape/selectolax-tree.git"

From PyPI

python -m pip install selectolax-tree
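A quick way to confirm the install worked is to check that the package and its two dependencies can be found by the interpreter. The sketch below uses only the standard library. `check_modules` is a hypothetical helper for illustration, not part of selectolax-tree, and `yaml` is assumed here as the import name that PyYAML provides.

```python
from importlib.util import find_spec


def check_modules(names):
    """Return {module_name: bool}, True when the top-level module can be located."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Import names: selectolax_tree (this package), selectolax, yaml (PyYAML).
    report = check_modules(["selectolax_tree", "selectolax", "yaml"])
    for name, found in report.items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

Run it with the same interpreter you installed into (for example, the one inside `.venv`), since a different interpreter will not see the packages.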

YAML Spec (minimal)

fields:
  title:
    css: "h1"
    text: true

  link_hrefs:
    css: "a"
    list: true
    attr: "href"

  items:
    css: ".item"
    list: true
    fields:
      name: { css: ".name", text: true }
      url:  { css: "a", attr: "href" }
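To make the spec easier to read, here is one plausible interpretation of it written directly against selectolax. This is a sketch of what the fields appear to ask for (`css` picks nodes, `text` or `attr` picks the value, `list` collects all matches, nested `fields` apply inside each match). It is not the library's implementation, and the exact output shape of selectolax-tree may differ. `SAMPLE_HTML` and `extract_by_hand` are hypothetical names.

```python
SAMPLE_HTML = """
<h1>Catalog</h1>
<div class="item"><span class="name">Alpha</span><a href="/a">more</a></div>
<div class="item"><span class="name">Beta</span><a href="/b">more</a></div>
"""


def extract_by_hand(html: str) -> dict:
    # Imported here so the module still loads where selectolax is absent.
    from selectolax.parser import HTMLParser

    tree = HTMLParser(html)

    # title: first match of "h1", as text.
    title_node = tree.css_first("h1")

    # items: every ".item" match, each with its own nested fields.
    items = []
    for item in tree.css(".item"):
        name_node = item.css_first(".name")
        link_node = item.css_first("a")
        items.append({
            "name": name_node.text(strip=True) if name_node else None,
            "url": link_node.attributes.get("href") if link_node else None,
        })

    return {
        "title": title_node.text(strip=True) if title_node else None,
        # link_hrefs: the href attribute of every "a" match.
        "link_hrefs": [a.attributes.get("href") for a in tree.css("a")],
        "items": items,
    }
```

The YAML spec expresses the same selection declaratively, so changing a selector means editing the spec rather than code like this.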

Python usage

from selectolax_tree import extract_from_yaml

# html is the page's HTML as a string; yaml_spec_str is the YAML spec text.
data = extract_from_yaml(html, yaml_spec_str)
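A self-contained variant of the call above, with the HTML and spec inlined, might look like the following. It is a minimal sketch: the argument order follows the README call, while `SPEC`, `HTML`, and `run_example` are hypothetical names, and the shape of the returned value is not documented here, so the sketch only returns it.

```python
# A trimmed version of the minimal spec, kept as a YAML string.
SPEC = """
fields:
  title:
    css: "h1"
    text: true
  items:
    css: ".item"
    list: true
    fields:
      name: { css: ".name", text: true }
      url:  { css: "a", attr: "href" }
"""

# A tiny page that the selectors in SPEC can match.
HTML = """
<h1>Catalog</h1>
<div class="item"><span class="name">Alpha</span><a href="/a">more</a></div>
"""


def run_example():
    # Imported inside the function so the constants above can be reused
    # even where selectolax-tree is not installed.
    from selectolax_tree import extract_from_yaml

    # HTML first, spec string second, as in the README call.
    return extract_from_yaml(HTML, SPEC)
```

With the package installed, `print(run_example())` shows what the extractor produces for this input.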

CLI

selectolax-tree --spec spec.yml --html-file page.html

Examples

Runnable scenario examples live in example/:

selectolax-tree --spec example/article/spec.yml --html-file example/article/page.html
selectolax-tree --spec example/product_list/spec.yml --html-file example/product_list/page.html
selectolax-tree --spec example/profiles/spec.yml --html-file example/profiles/page.html

Releasing (maintainers)

See RELEASING.md for PyPI publishing and GitHub Actions release automation.
