Daft native extension for HTML processing — parsing, extraction, and transformation operators

daft-html

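A native Daft extension for HTML processing. It is built on the Rust crates scraper and html5ever, which follow the HTML5 parsing algorithm, so malformed real-world markup is recovered the way a browser would recover it rather than rejected.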

Installation

pip install daft-html

Requires Python ≥ 3.10 and daft ≥ 0.4.

Quick start

import daft
import daft_html
from daft_html import html_to_text, html_get_title, html_extract_links
from daft import col

sess = daft.Session()
sess.load_extension(daft_html)

with sess:
    df = daft.from_pydict({
        "html": [
            "<html><head><title>Hello</title></head><body><p>World</p></body></html>",
            "<p>Just <b>text</b> here. <a href='https://example.com'>link</a></p>",
        ]
    })

    result = df.select(
        html_to_text(col("html")).alias("text"),
        html_get_title(col("html")).alias("title"),
        html_extract_links(col("html")).alias("links"),
    ).collect()

    # +-----------------+---------+---------------------------+
    # | text            | title   | links                     |
    # +-----------------+---------+---------------------------+
    # | World           | Hello   | []                        |
    # | Just text here. | None    | [https://example.com]     |
    # +-----------------+---------+---------------------------+

Operators

Document-level

| Function | Signature | Description |
|---|---|---|
| html_to_text(expr) | String → String | Extract plain text, discard all tags |
| html_get_title(expr) | String → String | Extract <title> text |
| html_text_ratio(expr) | String → Float64 | Ratio of visible text chars to raw HTML bytes |
| html_extract_meta(expr, name) | (String, str) → String | content value of <meta name="…"> or <meta property="…"> |
| html_extract_links(expr) | String → List[String] | All <a href="…"> URLs |
| html_extract_tables(expr) | String → List[String] | <table> elements as Markdown strings |
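
To make the html_text_ratio definition concrete, here is a minimal pure-Python sketch of one way to compute "visible text chars over raw HTML bytes". It is not the extension's implementation: the names text_ratio_sketch and _VisibleText are hypothetical, and it assumes that "visible" means text outside <script> and <style> and that bytes are counted in UTF-8. The real operator may treat whitespace, <title>, or hidden elements differently.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>.

    Hypothetical helper for illustration only.
    """

    HIDDEN = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a hidden element.
        if self._hidden_depth == 0:
            self.chunks.append(data)


def text_ratio_sketch(html: str) -> float:
    """Visible text characters divided by raw UTF-8 byte length (assumed definition)."""
    raw_bytes = len(html.encode("utf-8"))
    if raw_bytes == 0:
        return 0.0
    parser = _VisibleText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    visible_chars = sum(len(chunk) for chunk in parser.chunks)
    return visible_chars / raw_bytes


# "World" is 5 visible characters out of 12 raw bytes.
print(text_ratio_sketch("<p>World</p>"))
```

A low ratio usually indicates markup-heavy or script-heavy pages, which is why a column like this is handy as a filter before more expensive extraction.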

CSS-selector

| Function | Signature | Description |
|---|---|---|
| html_extract_text(expr, selector) | (String, str) → String | Inner text of first matching element |
| html_get_attribute(expr, selector, attr) | (String, str, str) → String | Attribute value of first matching element |
| html_has_element(expr, selector) | (String, str) → Bool | True if selector matches at least one element |
| html_count_elements(expr, selector) | (String, str) → Int64 | Count of elements matching selector |
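
The four operators differ mainly in what they do with the set of matches: two read from the first match, one tests for any match, and one counts all matches. The sketch below illustrates those semantics in pure Python. It is an illustration, not the extension's code: every name in it is hypothetical, it only understands bare tag-name selectors such as li (not full CSS), and returning None when nothing matches is an assumption.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot contain text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _TagCollector(HTMLParser):
    """Records attributes and inner text for every element with one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.matches = []  # one {"attrs": dict, "text": list} per match, in document order
        self._open = []    # indexes of matching elements that are still open

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.matches.append({"attrs": dict(attrs), "text": []})
            if tag not in VOID_TAGS:
                self._open.append(len(self.matches) - 1)

    def handle_endtag(self, tag):
        if tag == self.tag and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every matching element that currently encloses it.
        for index in self._open:
            self.matches[index]["text"].append(data)


def _collect(html, tag):
    parser = _TagCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.matches


def extract_text_sketch(html, tag):
    """Inner text of the first match, or None (assumed) when nothing matches."""
    matches = _collect(html, tag)
    return "".join(matches[0]["text"]) if matches else None


def get_attribute_sketch(html, tag, attr):
    """Attribute value of the first match, or None (assumed) when absent."""
    matches = _collect(html, tag)
    return matches[0]["attrs"].get(attr) if matches else None


def count_elements_sketch(html, tag):
    return len(_collect(html, tag))


def has_element_sketch(html, tag):
    return count_elements_sketch(html, tag) > 0


doc = "<ul><li class='a'>one</li><li>two <b>bold</b></li></ul>"
print(extract_text_sketch(doc, "li"), count_elements_sketch(doc, "li"))
```

Because the text and attribute operators only look at the first match, a more specific selector is the way to reach a later element; the count and has operators are the ones to use when the number of matches is what matters.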

Development

Requires uv and a Rust nightly toolchain (see rust-toolchain.toml).

make build   # cargo build (debug) + pip install -e .
make test    # pytest tests/ -v

License

Apache-2.0
