LLM-driven extraction from raw HTML and website screenshots, preserving spatial context with optional validation.
extracthero
Extract accurate, structured facts from messy real-world content — raw HTML, screenshots, PDFs, JSON blobs or plain text — with almost zero compromise.
Why extracthero?
| Pain-point | extracthero’s answer |
|---|---|
| DOM spaghetti (ads, nav bars, JS widgets) pollutes extraction. | DomReducer reduces the most-common HTML tags into a compact, linear corpus, stripping layout noise and script cruft while keeping the text you care about. |
| HTML→Markdown conversions drop dynamic/JS-rendered elements. | DomReducer’s tag-level reduction keeps content that markdown pass-throughs often lose. |
| LLM prompts that just say “extract price” are brittle. | extracthero asks you to fill an ItemToExtract dataclass with the field’s name, desc, and optional text_rules, so the LLM sees the full context for each field instead of a bare label. |
| One-shot LLM calls are hard to debug and expensive. | Two-phase pipeline: FilterHero isolates the minimal fragment; ParseHero turns it into JSON. Fail fast and retry only the phase that broke. |
| Post-hoc validation is messy. | Regex/type guards live inside each ItemToExtract; a failed field flips success=False, so you can retry or send to manual review. |
Key ideas
1 Schema-first extraction

    from extracthero import ItemToExtract

    price = ItemToExtract(
        name="price",
        desc="currency-prefixed current product price",
        regex_validator=r"€\d+\.\d{2}",
        text_rules=[
            "Ignore crossed-out promotional prices.",
            "Return the live price only.",
        ],
        example="€49.99",
    )

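To illustrate why a schema carries more context than a bare "extract price" instruction, here is a minimal sketch of how a field description could be rendered into prompt text. `FieldSpec` and `render_prompt` are hypothetical names invented for this example; they mirror the `name`, `desc`, `text_rules`, and `example` fields shown above but are not extracthero's own classes or prompt format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldSpec:
    """Hypothetical stand-in for ItemToExtract; not the library's class."""
    name: str
    desc: str
    text_rules: List[str] = field(default_factory=list)
    example: Optional[str] = None


def render_prompt(spec: FieldSpec) -> str:
    """Turn one field spec into a prompt fragment an LLM could follow."""
    lines = [f"Field: {spec.name}", f"Description: {spec.desc}"]
    for rule in spec.text_rules:
        lines.append(f"Rule: {rule}")
    if spec.example is not None:
        lines.append(f"Example: {spec.example}")
    return "\n".join(lines)


price = FieldSpec(
    name="price",
    desc="currency-prefixed current product price",
    text_rules=[
        "Ignore crossed-out promotional prices.",
        "Return the live price only.",
    ],
    example="€49.99",
)
print(render_prompt(price))
```

Each rule becomes an explicit line in the prompt, so the model is told what to ignore as well as what to return.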
2 DomReducer > HTML→Markdown
- Works directly on the DOM tree.
- Removes scripts, ads, banners; keeps relevant tags.
- Shrinks a 40 kB e-commerce page to <3 kB of clean, LLM-ready text.
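The idea of tag-level reduction can be sketched with the standard library alone. This is not DomReducer's implementation; it is a simplified illustration that assumes a fixed set of "noise" tags (`SKIP_TAGS`, chosen for this example) and keeps only the visible text outside them.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these tags carry layout noise, not content.
SKIP_TAGS = {"script", "style", "nav", "footer", "noscript"}


class TextReducer(HTMLParser):
    """Collect text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def reduce_to_text(html: str) -> str:
    """Return one kept text chunk per line."""
    parser = TextReducer()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><script>track()</script></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<h1>Wireless Keyboard</h1><span class='price'>€49.99</span>"
    "</body></html>"
)
print(reduce_to_text(sample))
```

The script body and the navigation link are dropped, while the title and the price survive as two short lines.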
3 Two-phase pipeline
Raw input ──▶ FilterHero (shrinks & isolates) ──▶ ParseHero (JSON) ──▶ dict + metrics
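The control flow of a two-phase pipeline, where only the failing phase is retried, might look like the sketch below. Everything here is hypothetical: `PhaseResult`, `run_phase`, and the two stand-in phase functions are invented for illustration and use plain string handling where the real FilterHero and ParseHero would call an LLM.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class PhaseResult:
    success: bool
    content: Any = None
    error: Optional[str] = None


def run_phase(fn: Callable[[Any], Any], payload: Any, retries: int = 2) -> PhaseResult:
    """Run one phase, retrying only this phase when it raises."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return PhaseResult(True, fn(payload))
        except Exception as exc:
            last_error = str(exc)
    return PhaseResult(False, error=last_error)


def filter_phase(text: str) -> str:
    # Stand-in for the filtering LLM call: keep lines that mention "price".
    kept = [line for line in text.splitlines() if "price" in line.lower()]
    if not kept:
        raise ValueError("nothing relevant found")
    return "\n".join(kept)


def parse_phase(fragment: str) -> dict:
    # Stand-in for the parsing LLM call: split "Key: value" into a dict.
    key, _, value = fragment.partition(":")
    return {key.strip().lower(): value.strip()}


def extract(text: str) -> PhaseResult:
    filtered = run_phase(filter_phase, text)
    if not filtered.success:
        return filtered  # fail fast: no parse call is spent on empty input
    return run_phase(parse_phase, filtered.content)


result = extract("Free shipping\nPrice: €49.99\nSubscribe to our newsletter")
print(result.success, result.content)
```

Because each phase returns its own result, a caller can tell whether filtering or parsing failed and retry just that step.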
Features
- Multi-modal input – raw HTML, JSON, Python dicts, screenshots (vision LLM in roadmap).
- Spatial context – layout coordinates stored so an LLM “sees” element proximity.
- LLM-agnostic – default wrapper targets OpenAI; swap in any .filter_via_llm/.parse_via_llm service.
- Per-field validation – regex, required/optional, custom lambdas.
- Usage metering – token counts & cost returned with every operation.
- Opt-in strictness – force LLM even for dicts (enforce_llm_based_*) or skip HTML reduction (reduce_html=False).
Installation
pip install extracthero
Quick-start

    from extracthero import Extractor, ItemToExtract

    # Load the page to extract from (assumes the file is UTF-8 encoded).
    with open("product-page.html", encoding="utf-8") as f:
        html = f.read()

    # Describe each field: name, description, optional validator and example.
    fields = [
        ItemToExtract(name="title", desc="product title", example="Wireless Keyboard"),
        ItemToExtract(
            name="price",
            desc="currency-prefixed price",
            regex_validator=r"€\d+\.\d{2}",
            example="€49.99",
        ),
    ]

    hero = Extractor()
    result = hero.extract(html, fields, text_type="html")

    print("✅ success:", result.success)
    print(result.parse_op.content)

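Since a failed field sets `success` to `False`, calling code will usually want to branch on it. The sketch below shows one way to do that, using only the two attributes the quick-start reads (`result.success` and `result.parse_op.content`). The `route` helper and the `SimpleNamespace` stubs are hypothetical and stand in for a real extraction result so the example runs without an LLM.

```python
from types import SimpleNamespace


def route(result, review_queue):
    """Return parsed content on success; otherwise queue the result for review."""
    if result.success:
        return result.parse_op.content
    review_queue.append(result)
    return None


# Hypothetical stubs exposing only the attributes the quick-start reads.
ok = SimpleNamespace(
    success=True,
    parse_op=SimpleNamespace(content={"title": "Wireless Keyboard", "price": "€49.99"}),
)
failed = SimpleNamespace(success=False, parse_op=SimpleNamespace(content=None))

review_queue = []
print(route(ok, review_queue))
print(route(failed, review_queue))
print(len(review_queue))
```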
Typical HTML workflow
- Scrape or load the raw HTML.
- DomReducer trims it to a minimal fragment but keeps required tags.
- FilterHero sees only that reduced text, calling the LLM once (or per-field) to keep the lines that mention title, price, SKU, etc.
- ParseHero builds a schema-driven prompt and emits strict JSON.
- Regex guard – invalid prices ("129.50") are rejected for lacking “€”.
- ExtractOp bundles both steps plus token/cost metrics for budgeting.
Roadmap
| Status | Feature |
|---|---|
| ✅ | Sync FilterHero & ParseHero |
| 🟡 | Async heroes for high-throughput pipelines |
| 🟡 | Built-in key:value fallback parser |
| 🟡 | Vision-LLM screenshot mode |
| 🟡 | Pydantic schema-driven auto-prompts & auto-regex |