LLM-driven extraction from raw HTML and website screenshots, preserving spatial context with optional validation.
# extracthero
extracthero uses an LLM to extract structured information from raw HTML, page screenshots, or both together, keeping layout context and optionally validating the result.
## Features
- **Multi-modal input**: ingest raw HTML strings, screenshot image files, or paired HTML+image inputs
- **Spatial context preservation**: maintain element positions to extract data that depends on layout or visual proximity
- **LLM-powered extraction**: plug in your favorite LLM (e.g. OpenAI, Anthropic Claude) for flexible schema-driven parsing
- **Optional validation**: enable built-in rules or custom validators to enforce types, ranges and required fields
- **Extensible pipeline**: hook into pre- and post-processing steps
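
The hook mechanism behind the extensible pipeline is not documented here, so the following is only a minimal sketch of the general idea: pre-hooks transform the input before extraction and post-hooks transform the result afterwards. Every name in it (`run_pipeline`, `strip_scripts`, `trim_values`, `fake_extract`) is hypothetical and not part of the extracthero API.

```python
import re
from typing import Callable, Dict, List

# Hypothetical type aliases for illustration; not part of the extracthero API.
PreHook = Callable[[str], str]
PostHook = Callable[[Dict[str, str]], Dict[str, str]]


def run_pipeline(
    raw: str,
    extract: Callable[[str], Dict[str, str]],
    pre_hooks: List[PreHook],
    post_hooks: List[PostHook],
) -> Dict[str, str]:
    """Apply pre-hooks to the input, extract, then apply post-hooks to the result."""
    for hook in pre_hooks:
        raw = hook(raw)
    result = extract(raw)
    for hook in post_hooks:
        result = hook(result)
    return result


def strip_scripts(html: str) -> str:
    """Pre-hook: drop <script> blocks so they never reach the extractor."""
    return re.sub(r"<script\b.*?</script>", "", html, flags=re.DOTALL | re.IGNORECASE)


def trim_values(fields: Dict[str, str]) -> Dict[str, str]:
    """Post-hook: strip surrounding whitespace from every extracted value."""
    return {key: value.strip() for key, value in fields.items()}


def fake_extract(html: str) -> Dict[str, str]:
    """Stand-in for the LLM call; echoes the cleaned input so the hooks are visible."""
    return {"body": html}


cleaned = run_pipeline(
    "  <p>Widget</p><script>track()</script>  ",
    fake_extract,
    pre_hooks=[strip_scripts],
    post_hooks=[trim_values],
)
print(cleaned)
```

Keeping hooks as plain callables means they can be tested in isolation and reordered without touching the extraction step.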
## Installation
```bash
pip install extracthero
```
## Quickstart
```python
from extracthero import Extractor, ExtractConfig

# 1. Initialize with your LLM backend and extraction prompt
config = ExtractConfig(
    llm_backend="openai",
    prompt_template="Extract product titles and prices from the page",
    validation_enabled=True,
)
extractor = Extractor(config)

# 2a. Extract from raw HTML
html = "<html>…</html>"
result_html = extractor.extract_from_html(html)
print(result_html)

# 2b. Extract from a screenshot
result_img = extractor.extract_from_screenshot("page.png")
print(result_img)

# 2c. Extract using both HTML + screenshot (preserve layout)
result_combo = extractor.extract_from_both(html, "page.png")
print(result_combo)
```
## Configuration
| Option | Type | Default | Description |
|---|---|---|---|
| `llm_backend` | `str` | `"openai"` | Which LLM to use (`"openai"`, `"anthropic"`, etc.) |
| `prompt_template` | `str` | `None` | Template guiding the LLM's extraction instructions |
| `validation_enabled` | `bool` | `False` | Turn on built-in schema and rule validation |
| `ocr_engine` | `str` | `"tesseract"` | OCR engine for screenshot text extraction |
| `spatial_threshold` | `float` | `0.5` | Minimum layout-overlap ratio to consider two elements related |

```python
from extracthero import ExtractConfig

config = ExtractConfig(
    llm_backend="anthropic",
    validation_enabled=False,
    spatial_threshold=0.7,
)
```
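
This README does not say how the layout-overlap ratio behind `spatial_threshold` is computed. One common definition is intersection-over-union of two bounding boxes, and the sketch below assumes that definition purely for illustration. `Box`, `overlap_ratio`, and `are_related` are hypothetical helpers, not extracthero functions.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixels (hypothetical helper, not extracthero API)."""
    left: float
    top: float
    right: float
    bottom: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box.right - box.left) * max(0.0, box.bottom - box.top)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes, in the range [0, 1]."""
    intersection = Box(
        max(a.left, b.left),
        max(a.top, b.top),
        min(a.right, b.right),
        min(a.bottom, b.bottom),
    )
    inter_area = area(intersection)
    union_area = area(a) + area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def are_related(a: Box, b: Box, spatial_threshold: float = 0.5) -> bool:
    """Treat two elements as related when their overlap meets the threshold."""
    return overlap_ratio(a, b) >= spatial_threshold


card = Box(0, 0, 100, 100)       # a product card
price = Box(0, 0, 100, 80)       # a price block inside the card
footer = Box(0, 500, 100, 600)   # an unrelated block far down the page

print(are_related(card, price, spatial_threshold=0.7))
print(are_related(card, footer, spatial_threshold=0.7))
```

Under this assumed definition, raising the threshold from 0.5 to 0.7 makes the grouping stricter: fewer element pairs qualify as related.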
## Validation Logic
When `validation_enabled=True`, extracted fields are checked against:
- Type rules (e.g. string, number, date)
- Range checks (e.g. price ≥ 0, date within the last year)
- Presence (required vs. optional fields)
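To customize or extend validation, subclass `FieldRule` and attach it to the config:
```python
from extracthero.validation import Validator, FieldRule


# Custom rule: price must be a number below 1,000,000
class PriceRule(FieldRule):
    def validate(self, value):
        return isinstance(value, (int, float)) and value < 1_000_000


# `config` is an ExtractConfig instance created as shown in Configuration
config.custom_validators = {
    "price": PriceRule()
}
```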
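
To make the three kinds of check concrete (type, range, presence), here is a standalone sketch in plain Python. It does not use or mirror extracthero internals: `FieldSpec` and `validate_record` are hypothetical names, and the error messages are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Type, Union


@dataclass
class FieldSpec:
    """Hypothetical field description; not an extracthero class."""
    types: Union[Type, Tuple[Type, ...]]
    required: bool = True
    minimum: Optional[float] = None
    maximum: Optional[float] = None


def validate_record(record: Dict[str, Any], schema: Dict[str, FieldSpec]) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    errors: List[str] = []
    for name, spec in schema.items():
        # Presence check: a missing field is an error only when it is required.
        if name not in record or record[name] is None:
            if spec.required:
                errors.append(f"{name}: missing required field")
            continue
        value = record[name]
        # Type check: skip range checks when the type is already wrong.
        if not isinstance(value, spec.types):
            errors.append(f"{name}: unexpected type {type(value).__name__}")
            continue
        # Range checks: applied only when a bound is configured.
        if spec.minimum is not None and value < spec.minimum:
            errors.append(f"{name}: {value} is below minimum {spec.minimum}")
        if spec.maximum is not None and value > spec.maximum:
            errors.append(f"{name}: {value} is above maximum {spec.maximum}")
    return errors


schema = {
    "title": FieldSpec(types=str),
    "price": FieldSpec(types=(int, float), minimum=0, maximum=1_000_000),
    "sku": FieldSpec(types=str, required=False),
}

good = validate_record({"title": "Widget", "price": 19.99}, schema)
bad = validate_record({"price": -5}, schema)
print(good)
print(bad)
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in a record at once.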
## CLI Interface
```bash
# Extract from HTML file
extracthero html input.html --schema product_schema.json --output out.json

# Extract from screenshot
extracthero image page.png --prompt "Get titles" --output out.json

# Combined mode
extracthero combo input.html page.png --output out.json
```
Use `extracthero --help` for full options and flags.
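
If you drive the CLI from a script, building the argument list in one place avoids quoting mistakes. The sketch below only assembles commands in the shapes shown above; `build_command` is a hypothetical helper, and whether every flag is accepted in every mode is an assumption, so check `extracthero --help` before relying on it.

```python
import shlex
from typing import List, Optional

# Subcommands taken from the CLI examples in this README.
MODES = {"html", "image", "combo"}


def build_command(
    mode: str,
    inputs: List[str],
    output: str,
    schema: Optional[str] = None,
    prompt: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for the extracthero CLI (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["extracthero", mode, *inputs]
    if schema is not None:
        cmd += ["--schema", schema]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    cmd += ["--output", output]
    return cmd


cmd = build_command("image", ["page.png"], "out.json", prompt="Get titles")
# shlex.join renders the list as a shell-safe string for logging.
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the list form to `subprocess.run` sidesteps shell quoting entirely, which matters for prompts that contain spaces or quotes.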
## Examples
- E-commerce product scraper
- News article metadata extraction
- Invoice data capture with layout

See the `examples/` folder for ready-to-run demos.
## Contributing
- Fork the repo
- Create a feature branch (`git checkout -b feat/my-feature`)
- Implement & test
- Submit a pull request

Please follow our code style guide and write tests for new features.