LLM-driven extraction from raw HTML and website screenshots, preserving spatial context with optional validation.

These details have not been verified by PyPI

Project description

extracthero

Extract accurate, structured facts from messy real-world content (raw HTML, screenshots, PDFs, JSON blobs, or plain text) with almost zero compromise.

Why extracthero?

Pain point	extracthero's answer
DOM spaghetti (ads, nav bars, JS widgets) pollutes extraction. Markdown converters drop dynamic/JS-rendered elements.	A rule-based DomReducer strips non-content HTML tags while preserving structural data such as tables. This typically shrinks the input by about 20%. Markdown conversion is too lossy to trust in production and often discards useful data.
Needle-in-a-haystack extraction is a common problem. An overloaded LLM can hallucinate or emit unstructured output that breaks production.	Extraction runs in two phases: context-aware filtering first, then parsing of the filtered data. Because the LLM processes less data, its attention is more focused and the results are more accurate.
LLM prompts that just say "extract price" are brittle, because real-world extraction logic is more complex and depends on other variables.	extracthero asks you to fill in `WhatToRetain` specifications that include the field's `name`, `desc`, and optional `text_rules`, so the LLM has the full context and returns precise results.
Real-world source data comes in different formats (JSON, strings, dicts, HTML), and each needs a different optimization strategy.	extracthero handles each format separately. For JSON input, it takes a fast path when the requested keys can be read directly. If that does not find what you need, fallback mechanisms can route the input to LLM processing.
Post-hoc validation is messy.	Regex/type guards live inside each `WhatToRetain`; a failed field flips `success=False`, so you can retry or send the item to manual review.
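
The JSON fast path with an LLM fallback described in the table can be pictured roughly as follows. This is a minimal sketch and not extracthero's implementation: `extract_from_json`, `llm_fallback`, and `fake_llm` are hypothetical names introduced only for illustration, and the fallback here is a plain function standing in for a real LLM call.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


def extract_from_json(
    data: Dict[str, Any],
    wanted_keys: Iterable[str],
    llm_fallback: Optional[Callable[[Dict[str, Any], List[str]], Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: read keys directly, defer the rest to a fallback."""
    wanted = list(wanted_keys)
    # Fast path: keys that exist at the top level are read without any LLM call.
    found = {key: data[key] for key in wanted if key in data}
    missing = [key for key in wanted if key not in data]
    if missing and llm_fallback is not None:
        # Only the unresolved keys are handed to the slower fallback.
        found.update(llm_fallback(data, missing))
    return found


def fake_llm(data: Dict[str, Any], missing: List[str]) -> Dict[str, Any]:
    """Stand-in for an LLM call (assumption): returns None for each missing key."""
    return {key: None for key in missing}


result = extract_from_json(
    {"title": "Desk lamp", "price": "€49.99"},
    ["price", "sku"],
    fake_llm,
)
print(result)  # {'price': '€49.99', 'sku': None}
```

Keeping the fallback as an injected callable means the direct lookup can be tested without any model access, and the expensive call only happens for keys the fast path could not resolve.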

Key ideas

1. Schema-first extraction

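from extracthero import WhatToRetain

# One spec per field: the name, description, and rules give the LLM its context.
price_spec = WhatToRetain(
    name="price",
    desc="currency-prefixed current product price",
    # Validation guard: a value that fails this pattern marks the field as failed.
    regex_validator=r"€\d+\.\d{2}",
    text_rules=[
        "Ignore crossed-out promotional prices",
        "Return the live price only",
    ],
    example="€49.99",
)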
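
To make the effect of a regex guard concrete, here is a minimal sketch of how a failed field could flip a `success` flag. It is an illustration under stated assumptions, not extracthero's code: `FieldSpec` and `validate` are hypothetical stand-ins, and the choice of `re.fullmatch` (rather than a partial match) is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FieldSpec:
    """Hypothetical stand-in for a field spec; not extracthero's WhatToRetain."""
    name: str
    regex_validator: Optional[str] = None


def validate(values: Dict[str, str], specs: List[FieldSpec]) -> Dict[str, object]:
    """Check each extracted value against its spec and collect failures."""
    errors: Dict[str, str] = {}
    for spec in specs:
        value = values.get(spec.name)
        if value is None:
            errors[spec.name] = "missing"
        elif spec.regex_validator and not re.fullmatch(spec.regex_validator, value):
            # Assumption: the whole value must match the pattern.
            errors[spec.name] = "regex mismatch"
    # Any failed field makes the overall result unsuccessful.
    return {"success": not errors, "errors": errors}


price = FieldSpec(name="price", regex_validator=r"€\d+\.\d{2}")
ok = validate({"price": "€49.99"}, [price])
bad = validate({"price": "49,99 EUR"}, [price])
print(ok)   # {'success': True, 'errors': {}}
print(bad)  # {'success': False, 'errors': {'price': 'regex mismatch'}}
```

A caller can then branch on `success` to retry the extraction or queue the item for manual review.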

Project details

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Release history

- 0.1.8 (Sep 18, 2025)
- 0.1.7 (Sep 17, 2025)
- 0.1.6 (Sep 15, 2025)
- 0.1.5.1 (Sep 11, 2025)
- 0.1.5 (Sep 11, 2025)
- 0.1.4 (Sep 3, 2025)
- 0.1.3 (Jul 14, 2025)
- 0.1.2 (Jul 14, 2025), this version
- 0.1.1 (Jul 8, 2025)
- 0.0.9 (Jul 2, 2025)
- 0.0.8 (Jul 2, 2025)
- 0.0.7 (Jul 2, 2025)
- 0.0.6 (Jun 18, 2025)
- 0.0.5 (Jun 13, 2025)
- 0.0.4 (Jun 13, 2025)
- 0.0.3 (Jun 12, 2025)
- 0.0.1 (Jun 11, 2025)

Download files

Source Distribution

extracthero-0.1.2.tar.gz (35.9 kB)

Uploaded Jul 14, 2025 Source

Built Distribution

extracthero-0.1.2-py3-none-any.whl (41.4 kB)

Uploaded Jul 14, 2025 Python 3

File details

Details for the file extracthero-0.1.2.tar.gz.

File metadata

Download URL: extracthero-0.1.2.tar.gz
Upload date: Jul 14, 2025
Size: 35.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for extracthero-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`93313aea0d18ba11440c97e26950846e5da3de47010181025af4f0fc278d42f0`
MD5	`88cdd958362ece82c60fdf7de558aaef`
BLAKE2b-256	`c34bc7c07d8633482ce8fde7dafe5a4fbdef405ad8c1ce86335213458f949407`

File details

Details for the file extracthero-0.1.2-py3-none-any.whl.

File metadata

Download URL: extracthero-0.1.2-py3-none-any.whl
Upload date: Jul 14, 2025
Size: 41.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for extracthero-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8baaa859bafa4e5949a9f0d77c7ac4b19f5aa854ad3b0bf8d4e0f44332b9739f`
MD5	`daf8256d64c9630312e03de1e4b1ece4`
BLAKE2b-256	`807337aaec8e68b883c37d4bd7d491aa6acabcdf8c67eab2ec813608dd1a6c6a`
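
A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib`. The sketch below is one way to do it, assuming the source distribution was saved in the current directory under its published file name; `sha256_of` and `matches` are helper names chosen here for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Assumed local file name; the digest is the SHA256 value listed for the sdist.
ARCHIVE = Path("extracthero-0.1.2.tar.gz")
EXPECTED = "93313aea0d18ba11440c97e26950846e5da3de47010181025af4f0fc278d42f0"

if ARCHIVE.exists():
    print("digest matches:", matches(ARCHIVE, EXPECTED))
```

pip can enforce the same check at install time through hash-pinned requirements files (`--require-hashes`).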
