LLM-driven extraction from raw HTML and website screenshots, preserving spatial context with optional validation.

extracthero

Extract accurate, structured facts from messy real-world content — raw HTML, screenshots, PDFs, JSON blobs or plain text — with almost zero compromise.


Why extracthero?

| Pain-point | extracthero’s answer |
| --- | --- |
| DOM spaghetti (ads, nav bars, JS widgets) pollutes extraction. | DomReducer reduces the most-common HTML tags into a compact, linear corpus, stripping layout noise and script cruft while keeping the text you care about. |
| HTML→Markdown conversions drop dynamic/JS-rendered elements. | DomReducer’s tag-level reduction keeps content that markdown pass-throughs often lose. |
| LLM prompts that just say “extract price” are brittle. | Extracthero asks you to fill an ItemToExtract dataclass that includes the field’s name, desc, and optional text_rules, so the LLM knows the full context and returns sniper-accurate results. |
| One-shot LLM calls are hard to debug and expensive. | Two-phase pipeline: FilterHero isolates the minimal fragment; ParseHero turns it into JSON. Fail fast and retry only the phase that broke. |
| Post-hoc validation is messy. | Regex/type guards live inside each ItemToExtract; a failed field flips success=False, so you can retry or send to manual review. |

Key ideas

1 Schema-first extraction

from extracthero import ItemToExtract

price = ItemToExtract(
    name="price",
    desc="currency-prefixed current product price",
    regex_validator=r"€\d+\.\d{2}",
    text_rules=[
        "Ignore crossed-out promotional prices.",
        "Return the live price only."
    ],
    example="€49.99"
)
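
To illustrate why a schema object gives the LLM more context than a bare "extract price" instruction, the sketch below renders a field spec into prompt text. `FieldSpec` and `render_prompt` are hypothetical stand-ins written for this example: they mirror the attribute names shown above, but they are not extracthero's internals, and the prompt layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldSpec:
    """Hypothetical stand-in mirroring the ItemToExtract attributes shown above."""
    name: str
    desc: str
    regex_validator: Optional[str] = None
    text_rules: List[str] = field(default_factory=list)
    example: Optional[str] = None


def render_prompt(spec: FieldSpec) -> str:
    """Turn one field spec into a block of prompt text (assumed layout)."""
    lines = [f"Field: {spec.name}", f"Description: {spec.desc}"]
    for rule in spec.text_rules:
        lines.append(f"Rule: {rule}")
    if spec.example is not None:
        lines.append(f"Example: {spec.example}")
    return "\n".join(lines)


price = FieldSpec(
    name="price",
    desc="currency-prefixed current product price",
    text_rules=[
        "Ignore crossed-out promotional prices.",
        "Return the live price only.",
    ],
    example="€49.99",
)
prompt = render_prompt(price)
print(prompt)
```

Every rule and the example travel with the field, so the prompt can be regenerated from the schema rather than hand-edited.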

2 DomReducer > HTML→Markdown

  • Works directly on the DOM tree.
  • Removes scripts, ads, banners; keeps relevant tags.
  • Shrinks a 40 kB e-commerce page to <3 kB of clean, LLM-ready text.
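
The general idea of tag-level reduction can be sketched with the standard library alone. This is not DomReducer's implementation: `TextReducer`, `reduce_html`, and the `SKIP_TAGS` set are hypothetical names chosen for illustration, and the real reducer may keep or drop different tags.

```python
from html.parser import HTMLParser

# Illustrative choice of noisy tags; an assumption, not DomReducer's actual list.
SKIP_TAGS = {"script", "style", "noscript", "nav", "footer", "iframe"}


class TextReducer(HTMLParser):
    """Collect visible text while ignoring everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.lines.append(text)


def reduce_html(html: str) -> str:
    """Return one line of text per surviving text node."""
    parser = TextReducer()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


sample = """
<html><head><script>alert('ad');</script><style>p{color:red}</style></head>
<body><nav>Home | Deals</nav>
<h1>Wireless Keyboard</h1><p class="price">€49.99</p>
<footer>© Shop</footer></body></html>
"""
reduced = reduce_html(sample)
print(reduced)
```

Only the title and price lines survive here, which is the kind of compact corpus the later phases are meant to receive.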

3 Two-phase pipeline

Raw input  ──▶  FilterHero  (shrinks & isolates)  ──▶  ParseHero  (JSON) ──▶  dict + metrics
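
A minimal sketch of the fail-fast, retry-per-phase idea follows. The names `PhaseResult`, `run_phase`, and `two_phase` are hypothetical, and the filter and parse functions are offline stand-ins for LLM calls; extracthero's own classes and return types may look different.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class PhaseResult:
    success: bool
    content: Any = None
    error: Optional[str] = None


def run_phase(fn: Callable[[Any], Any], payload: Any, retries: int = 1) -> PhaseResult:
    """Run one phase, retrying only this phase when it raises."""
    error = None
    for _ in range(retries + 1):
        try:
            return PhaseResult(True, content=fn(payload))
        except Exception as exc:  # a real pipeline would catch narrower errors
            error = str(exc)
    return PhaseResult(False, error=error)


def two_phase(raw, filter_fn, parse_fn) -> Tuple[PhaseResult, Optional[PhaseResult]]:
    filtered = run_phase(filter_fn, raw)
    if not filtered.success:
        return filtered, None  # fail fast: skip parsing entirely
    return filtered, run_phase(parse_fn, filtered.content)


calls = {"filter": 0, "parse": 0}


def fake_filter(text):
    # Stand-in for the filter phase: keep only lines mentioning "price".
    calls["filter"] += 1
    return [ln for ln in text.splitlines() if "price" in ln.lower()]


def flaky_parse(lines):
    # Stand-in for the parse phase: fails once, then succeeds.
    calls["parse"] += 1
    if calls["parse"] == 1:
        raise ValueError("malformed JSON from model")
    return {"price": lines[0].split(":", 1)[1].strip()}


filtered, parsed = two_phase("Title: Keyboard\nPrice: €49.99", fake_filter, flaky_parse)
print(calls, parsed.content)
```

The filter stand-in runs once even though parsing needed a second attempt, which is the cost argument for splitting the work into two phases.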

Features

  • Multi-modal input – raw HTML, JSON, Python dicts, screenshots (vision LLM in roadmap).
  • Spatial context – layout coordinates stored so an LLM “sees” element proximity.
  • LLM-agnostic – default wrapper targets OpenAI; swap in any .filter_via_llm / .parse_via_llm service.
  • Per-field validation – regex, required/optional, custom lambdas.
  • Usage metering – token counts & cost returned with every operation.
  • Opt-in strictness – force LLM even for dicts (enforce_llm_based_*) or skip HTML reduction (reduce_html=False).
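
One way to picture the LLM-agnostic point is a small structural interface. The method names `filter_via_llm` and `parse_via_llm` come from the list above; their parameters and return types below are assumptions, and `LLMService` and `EchoService` are hypothetical names, so check the package source before relying on this shape.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMService(Protocol):
    # Signatures are assumed for illustration only.
    def filter_via_llm(self, corpus: str, instructions: str) -> str: ...

    def parse_via_llm(self, corpus: str, instructions: str) -> dict: ...


class EchoService:
    """Offline stand-in that makes no network calls."""

    def filter_via_llm(self, corpus: str, instructions: str) -> str:
        return corpus

    def parse_via_llm(self, corpus: str, instructions: str) -> dict:
        return {"raw": corpus}


service = EchoService()
print(isinstance(service, LLMService))
```

Because the check is structural, any object exposing both methods would satisfy it without inheriting from a library base class.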

Installation

pip install extracthero

Quick-start

from extracthero import Extractor, ItemToExtract

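# Read the page with an explicit encoding and close the file when done.
with open("product-page.html", encoding="utf-8") as f:
    html = f.read()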

fields = [
    ItemToExtract(name="title", desc="product title", example="Wireless Keyboard"),
    ItemToExtract(
        name="price",
        desc="currency-prefixed price",
        regex_validator=r"€\d+\.\d{2}",
        example="€49.99"
    ),
]

hero   = Extractor()
result = hero.extract(html, fields, text_type="html")

print("✅ success:", result.success)
print(result.parse_op.content)

Typical HTML workflow

  1. Scrape or load the raw HTML.
  2. DomReducer trims it to a minimal fragment but keeps required tags.
  3. FilterHero sees only that reduced text, calling the LLM once (or per-field) to keep the lines that mention title, price, SKU, etc.
  4. ParseHero builds a schema-driven prompt and emits strict JSON.
  5. Regex guard – invalid prices ("129.50") are rejected for lacking “€”.
  6. ExtractOp bundles both steps plus token/cost metrics for budgeting.
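
The regex guard in step 5 can be sketched as a per-field check over the parsed dict. The `validate` helper and `VALIDATORS` mapping are hypothetical, and using `re.fullmatch` (rather than a partial search) is an assumption about how strictly the pattern is applied.

```python
import re
from typing import Dict, Tuple

# Field name -> regex, reusing the quick-start's price pattern.
VALIDATORS = {"price": r"€\d+\.\d{2}"}


def validate(parsed: Dict[str, str], validators: Dict[str, str]) -> Tuple[bool, Dict[str, str]]:
    """Return (success, errors); any failing field makes success False."""
    errors: Dict[str, str] = {}
    for name, pattern in validators.items():
        value = parsed.get(name)
        if value is None:
            errors[name] = "missing"
        elif not re.fullmatch(pattern, value):
            errors[name] = f"{value!r} does not match {pattern!r}"
    return (not errors, errors)


ok, _ = validate({"price": "€49.99"}, VALIDATORS)
bad, errors = validate({"price": "129.50"}, VALIDATORS)
print(ok, bad, errors)
```

A failed field leaves a per-field error message behind, which is what makes a targeted retry or a manual-review queue practical.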

Roadmap

| Status | Feature |
| --- | --- |
|  | Sync FilterHero & ParseHero |
| 🟡 | Async heroes for high-throughput pipelines |
| 🟡 | Built-in key:value fallback parser |
| 🟡 | Vision-LLM screenshot mode |
| 🟡 | Pydantic schema-driven auto-prompts & auto-regex |
