LLM-driven extraction from raw HTML and website screenshots, preserving spatial context with optional validation.
extracthero
Extract accurate, structured facts from messy real-world content — raw HTML, screenshots, PDFs, JSON blobs or plain text — with almost zero compromise.
--
Why extracthero?
| Pain-point | extracthero's answer |
|---|---|
| DOM spaghetti (ads, nav bars, JS widgets) pollutes extraction. Markdown converters drop dynamic/JS-rendered elements. | A rule-based DomReducer strips non-content HTML tags while preserving structural data such as tables. This typically reduces input size by about 20%. Markdown conversion is too lossy to trust in production and often discards useful data. |
| Needle-in-a-haystack inputs are common. An overloaded LLM can hallucinate or emit unstructured output that breaks production. | Extraction runs in two phases: context-aware filtering first, then parsing of the filtered data. Because the LLM processes less data, its attention is better focused and the results are more accurate. |
| LLM prompts that just say "extract price" are brittle, because real-world extraction logic is more complex and depends on other variables. | extracthero asks you to write WhatToRetain specifications containing the field's name, desc, and optional text_rules, so the LLM has the full context and returns precise results. |
| Real-world source data arrives in different formats (JSON, strings, dicts, HTML), and each needs a different optimization strategy. | extracthero handles each format separately. For JSON input, it takes a fast path when the requested keys can be read directly; if they cannot be found, fallback mechanisms can route the input to LLM processing. |
| Post-hoc validation is messy. | Regex/type guards live inside each WhatToRetain; a failed field flips success=False, so you can retry or send to manual review. |
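The fast-path idea in the table can be illustrated with a short sketch. It uses only the standard library and is not extracthero's actual implementation: `find_key`, `extract_with_fallback`, and the `llm_fallback` callable are hypothetical names introduced here, and the fallback is stubbed so the example runs without a model.

```python
from typing import Any, Callable, Dict, List, Optional


def find_key(data: Any, key: str) -> Optional[Any]:
    """Depth-first search for `key` in nested dicts/lists.

    Returns None when the key is absent. A key whose value is None
    is treated as absent too, which is a simplification of this sketch.
    """
    if isinstance(data, dict):
        if data.get(key) is not None:
            return data[key]
        children = list(data.values())
    elif isinstance(data, list):
        children = data
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def extract_with_fallback(
    data: Any,
    fields: List[str],
    llm_fallback: Callable[[Any, List[str]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Read fields directly when possible; send only the misses to the fallback."""
    result: Dict[str, Any] = {}
    missing: List[str] = []
    for name in fields:
        value = find_key(data, name)
        if value is None:
            missing.append(name)
        else:
            result[name] = value
    if missing:
        # Hypothetical hook: in a real pipeline this would call an LLM.
        result.update(llm_fallback(data, missing))
    return result


def stub_llm(data: Any, missing: List[str]) -> Dict[str, Any]:
    """Stand-in for LLM processing; returns None for every missing field."""
    return {name: None for name in missing}


payload = {"product": {"title": "Desk lamp", "offers": [{"price": "€49.99"}]}}
print(extract_with_fallback(payload, ["title", "price", "sku"], stub_llm))
# {'title': 'Desk lamp', 'price': '€49.99', 'sku': None}
```

Only `sku` reaches the fallback here, so the expensive path sees a single field rather than the whole request.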
Key ideas
1. Schema-first extraction
from extracthero import WhatToRetain

price_spec = WhatToRetain(
    name="price",
    desc="currency-prefixed current product price",
    regex_validator=r"€\d+\.\d{2}",
    text_rules=[
        "Ignore crossed-out promotional prices",
        "Return the live price only",
    ],
    example="€49.99",
)
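To show how a per-field regex guard could flip an overall success flag, here is a minimal standard-library sketch. It does not use extracthero itself: `FieldSpec` is a hypothetical stand-in for WhatToRetain, `validate` is an illustrative helper, and the use of `re.fullmatch` (rather than a partial search) is an assumption about how strict the check should be.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class FieldSpec:
    """Hypothetical stand-in for a WhatToRetain spec (name + regex only)."""
    name: str
    regex_validator: Optional[str] = None


def validate(
    specs: List[FieldSpec], extracted: Dict[str, Any]
) -> Tuple[bool, Dict[str, str]]:
    """Return (success, errors); any failed field makes success False."""
    errors: Dict[str, str] = {}
    for spec in specs:
        value = extracted.get(spec.name)
        if value is None:
            errors[spec.name] = "missing"
        elif spec.regex_validator and not re.fullmatch(
            spec.regex_validator, str(value)
        ):
            errors[spec.name] = f"{value!r} does not match the validator"
    return (not errors, errors)


specs = [FieldSpec(name="price", regex_validator=r"€\d+\.\d{2}")]

ok, errors = validate(specs, {"price": "€49.99"})
print(ok, errors)  # True {}

ok, errors = validate(specs, {"price": "49.99 EUR"})
print(ok)  # False
```

A caller can branch on the flag to retry the extraction or route the record to manual review, as the table above describes.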