LLM-driven extraction from raw HTML and website screenshots, preserving spatial context with optional validation.

extracthero

Extract accurate, structured facts from messy real-world content — raw HTML, screenshots, PDFs, JSON blobs or plain text — with almost zero compromise.

--

Why extracthero?

Pain point: DOM spaghetti (ads, nav bars, JS widgets) pollutes extraction, and Markdown converters drop dynamic, JS-rendered elements.
extracthero's answer: A rule-based DomReducer strips HTML tags that carry no content. It is tailored to preserve structural data such as tables, and it typically shrinks the input by about 20%. Markdown conversion is too lossy to trust in production and often discards useful data.

Pain point: Finding a needle in a haystack is a common problem. An overloaded LLM can hallucinate or emit unstructured output that breaks production.
extracthero's answer: Extraction runs in two phases: context-aware filtering first, then parsing of the filtered data. Because the LLM sees less data, its attention mechanism works better and the results are more accurate.

Pain point: LLM prompts that just say "extract price" are brittle, because real-world extraction logic is more complex and depends on other variables.
extracthero's answer: extracthero asks you to fill in WhatToRetain specifications that include the field's name, desc, and optional text_rules, so the LLM has the full context and returns precise results.

Pain point: Real-world source data comes in different formats (JSON, strings, dicts, HTML), and each needs a different optimization strategy.
extracthero's answer: extracthero handles each format on its own terms. For JSON input it takes a fast path when the requested keys can be read directly. If the keys cannot be found, fallback mechanisms can route the input to LLM processing.

Pain point: Post-hoc validation is messy.
extracthero's answer: Regex and type guards live inside each WhatToRetain. A failed field flips success=False, so you can retry or send the item to manual review.
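The JSON fast path with an LLM fallback described above can be sketched roughly as follows. This is a minimal illustration of the idea, not extracthero's implementation: find_key, extract_fields, and fake_llm are hypothetical names, and the fallback is a stub standing in for a real LLM call.

```python
import json
from typing import Any, Callable, Dict, List, Optional


def find_key(data: Any, key: str) -> Optional[Any]:
    """Depth-first search for `key` in nested dicts and lists; None if absent."""
    if isinstance(data, dict):
        if key in data:
            return data[key]
        children = list(data.values())
    elif isinstance(data, list):
        children = data
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def extract_fields(
    source: Any,
    names: List[str],
    llm_fallback: Callable[[Any, List[str]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Fast path: read keys directly. Only the misses go to the fallback."""
    data = json.loads(source) if isinstance(source, str) else source
    result = {name: find_key(data, name) for name in names}
    # A key whose value is None is treated as missing in this sketch.
    missing = [name for name, value in result.items() if value is None]
    if missing:
        result.update(llm_fallback(data, missing))
    return result


def fake_llm(data: Any, missing: List[str]) -> Dict[str, Any]:
    """Hypothetical stand-in for an LLM call; returns a marker per field."""
    return {name: "from-llm" for name in missing}


blob = '{"product": {"title": "Lamp", "offer": {"price": "€49.99"}}}'
print(extract_fields(blob, ["price", "brand"], fake_llm))
# {'price': '€49.99', 'brand': 'from-llm'}
```

Sending only the missing fields to the fallback keeps the expensive step small, which is the same motivation as the two-phase design.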

Key ideas

1. Schema-first extraction

from extracthero import WhatToRetain

price_spec = WhatToRetain(
    name="price",
    desc="currency-prefixed current product price",
    regex_validator=r"€\d+\.\d{2}",
    text_rules=[
        "Ignore crossed-out promotional prices",
        "Return the live price only"
    ],
    example="€49.99"
)
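To make the validation idea concrete, here is a small sketch of how a per-field regex guard could flip an overall success flag. It uses only the standard library. FieldSpec, validate, and check_all are hypothetical stand-ins, not extracthero's API, and whether the library requires a full match or a partial match is an assumption (full match is used here).

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FieldSpec:
    # Hypothetical stand-in for WhatToRetain, holding only what this sketch needs.
    name: str
    regex_validator: Optional[str] = None


def validate(spec: FieldSpec, value: Optional[str]) -> bool:
    """A value passes if it exists and fully matches the field's regex, if any."""
    if value is None:
        return False
    if spec.regex_validator is None:
        return True
    return re.fullmatch(spec.regex_validator, value) is not None


def check_all(specs: List[FieldSpec], extracted: Dict[str, str]) -> Dict[str, object]:
    """Any failed field flips success to False and is listed for retry or review."""
    failed = [s.name for s in specs if not validate(s, extracted.get(s.name))]
    return {"success": not failed, "failed_fields": failed}


price = FieldSpec(name="price", regex_validator=r"€\d+\.\d{2}")
print(check_all([price], {"price": "€49.99"}))
# {'success': True, 'failed_fields': []}
print(check_all([price], {"price": "49,99 EUR"}))
# {'success': False, 'failed_fields': ['price']}
```

Keeping the guard next to the field definition means the caller only has to inspect one flag before deciding to retry.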
