Daft native extension for HTML processing — parsing, extraction, and transformation operators
daft-html
A native Daft extension for HTML processing. Built on the Rust crates scraper and html5ever, so it parses malformed, real-world HTML with the same error recovery a browser applies.
Installation
pip install daft-html
Requires Python ≥ 3.10 and daft ≥ 0.4.
Quick start
import daft
import daft_html
from daft_html import html_to_text, html_get_title, html_extract_links
from daft import col

sess = daft.Session()
sess.load_extension(daft_html)

with sess:
    df = daft.from_pydict({
        "html": [
            "<html><head><title>Hello</title></head><body><p>World</p></body></html>",
            "<p>Just <b>text</b> here. <a href='https://example.com'>link</a></p>",
        ]
    })
    result = df.select(
        html_to_text(col("html")).alias("text"),
        html_get_title(col("html")).alias("title"),
        html_extract_links(col("html")).alias("links"),
    ).collect()
    # +-----------------+---------+---------------------------+
    # | text            | title   | links                     |
    # +-----------------+---------+---------------------------+
    # | World           | Hello   | []                        |
    # | Just text here. | None    | [https://example.com]     |
    # +-----------------+---------+---------------------------+
Operators
Document-level
| Function | Signature | Description |
|---|---|---|
| `html_to_text(expr)` | String → String | Extract plain text, discard all tags |
| `html_get_title(expr)` | String → String | Extract `<title>` text |
| `html_text_ratio(expr)` | String → Float64 | Ratio of visible text chars to raw HTML bytes |
| `html_extract_meta(expr, name)` | (String, str) → String | `<meta name\|property="…">` content value |
| `html_extract_links(expr)` | String → List[String] | All `<a href="…">` URLs |
| `html_extract_tables(expr)` | String → List[String] | `<table>` elements as Markdown strings |
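To make the `html_text_ratio` definition concrete, here is a minimal pure-Python sketch of "visible text chars divided by raw HTML bytes", using only the standard library. It is a reference model, not the extension's implementation: the function name `text_ratio`, the decision to treat `<script>` and `<style>` content as non-visible, the UTF-8 byte count, and the `0.0` result for empty input are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}  # assumption: treated as non-visible

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def text_ratio(html: str) -> float:
    """Visible text characters divided by the raw HTML size in UTF-8 bytes."""
    raw_bytes = len(html.encode("utf-8"))
    if raw_bytes == 0:
        return 0.0  # assumption: empty input yields 0.0
    parser = _VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return len("".join(parser.parts)) / raw_bytes


# "Hello world" is 11 visible characters in a 25-byte document.
print(text_ratio("<p>Hello <b>world</b></p>"))  # 11 / 25 = 0.44
```

A low ratio usually signals markup-heavy pages such as navigation shells or script-laden templates, which is why a column like this is handy as a filter before text extraction.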
CSS-selector
| Function | Signature | Description |
|---|---|---|
| `html_extract_text(expr, selector)` | (String, str) → String | Inner text of first matching element |
| `html_get_attribute(expr, selector, attr)` | (String, str, str) → String | Attribute value of first matching element |
| `html_has_element(expr, selector)` | (String, str) → Bool | True if selector matches at least one element |
| `html_count_elements(expr, selector)` | (String, str) → Int64 | Count of elements matching selector |
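The difference between "first matching element" and "count of elements" can be illustrated with a small standard-library model. This is only a sketch of the semantics described in the table, not the extension's code: the helper names are hypothetical, the matcher understands just three simple selector forms (`tag`, `.class`, `#id`) rather than full CSS, and returning `None` when nothing matches is an assumption.

```python
from html.parser import HTMLParser


class _Elements(HTMLParser):
    """Records (tag, attrs) for every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


def _matches(selector, tag, attrs):
    """Simplified matcher: supports only `tag`, `.class`, and `#id`."""
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    return tag == selector.lower()


def _select(html, selector):
    """Attribute dicts of all matching elements, in document order."""
    parser = _Elements()
    parser.feed(html)
    parser.close()
    return [a for t, a in parser.elements if _matches(selector, t, a)]


def count_elements(html, selector):
    return len(_select(html, selector))


def has_element(html, selector):
    return count_elements(html, selector) > 0


def get_attribute(html, selector, attr):
    """Attribute of the first match; None if nothing matches (assumption)."""
    matches = _select(html, selector)
    return matches[0].get(attr) if matches else None


page = (
    '<ul><li class="item first"><a href="/a">A</a></li>'
    '<li class="item"><a href="/b">B</a></li></ul>'
)
print(count_elements(page, ".item"))      # 2
print(has_element(page, "table"))         # False
print(get_attribute(page, "a", "href"))   # /a
```

Only the first `<a>` contributes to `get_attribute`, while `count_elements` sees both list items; the real operators accept full CSS selectors, which this model does not attempt.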
Development
Requires uv and a Rust nightly toolchain (see rust-toolchain.toml).
make build # cargo build (debug) + pip install -e .
make test # pytest tests/ -v
License
Apache-2.0