extracthero

LLM-driven extraction from raw HTML and website screenshots, preserving spatial context with optional validation.

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent

# extracthero

extracthero uses an LLM to extract structured information from raw HTML, website screenshots, or both, while preserving the page's spatial context.


## Features

- **Multi-modal input**: ingest raw HTML strings, screenshot image files, or paired HTML+image inputs  
- **Spatial context preservation**: maintain element positions to extract data that depends on layout or visual proximity  
- **LLM-powered extraction**: plug in your favorite LLM (e.g. OpenAI, Anthropic Claude) for flexible schema-driven parsing  
- **Optional validation**: enable built-in rules or custom validators to enforce types, ranges and required fields  
- **Extensible pipeline**: hook into pre- and post-processing steps
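
The pre-processing idea in the last bullet can be illustrated without any extracthero API. The sketch below is a standalone helper built on the standard library only. The names `NoiseStripper` and `strip_noise` are hypothetical, and how such a function would be registered as a hook is not documented here. It drops `<script>` and `<style>` elements from raw HTML so that less irrelevant text reaches the LLM.

```python
from html import escape
from html.parser import HTMLParser


class NoiseStripper(HTMLParser):
    """Re-emit HTML while dropping <script> and <style> elements.

    Comments and the doctype are dropped too, because no handler re-emits them.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()  # default convert_charrefs=True: entities arrive decoded
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            # Keep the start tag exactly as it appeared in the input.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if tag not in self.SKIPPED and self._skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Re-escape text, since the parser already decoded entities.
            self.parts.append(escape(data, quote=False))


def strip_noise(html_text: str) -> str:
    """Return html_text without script and style elements."""
    parser = NoiseStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

The cleaned string can then be passed wherever a raw HTML string is expected.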

## Installation

```bash
pip install extracthero
```

## Quickstart

```python
from extracthero import Extractor, ExtractConfig

# 1. Initialize with your LLM backend and schema
config = ExtractConfig(
    llm_backend="openai",
    prompt_template="Extract product titles and prices from the page",
    validation_enabled=True
)
extractor = Extractor(config)

# 2a. Extract from raw HTML
html = "<html>…</html>"
result_html = extractor.extract_from_html(html)
print(result_html)

# 2b. Extract from a screenshot
result_img = extractor.extract_from_screenshot("page.png")
print(result_img)

# 2c. Extract using both HTML + screenshot (preserve layout)
result_combo = extractor.extract_from_both(html, "page.png")
print(result_combo)
```

## Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| `llm_backend` | `str` | `"openai"` | Which LLM to use (`"openai"`, `"anthropic"`, etc.) |
| `prompt_template` | `str` | `None` | Template guiding the LLM’s extraction instructions |
| `validation_enabled` | `bool` | `False` | Turn on built-in schema and rule validation |
| `ocr_engine` | `str` | `"tesseract"` | OCR engine for screenshot text extraction |
| `spatial_threshold` | `float` | `0.5` | Minimum layout-overlap ratio to consider two elements related |

```python
from extracthero import ExtractConfig

config = ExtractConfig(
    llm_backend="anthropic",
    validation_enabled=False,
    spatial_threshold=0.7
)
```
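
This page does not define how the layout-overlap ratio behind `spatial_threshold` is computed. One plausible reading, sketched here as an assumption and not as extracthero's implementation, is the intersection-over-union of two axis-aligned bounding boxes. The `Box` type and the `overlap_ratio` and `related` functions are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float


def area(box: Box) -> float:
    # Degenerate or inverted boxes count as empty.
    return max(0.0, box.right - box.left) * max(0.0, box.bottom - box.top)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by union area, in the range [0, 1]."""
    intersection = Box(
        max(a.left, b.left),
        max(a.top, b.top),
        min(a.right, b.right),
        min(a.bottom, b.bottom),
    )
    inter_area = area(intersection)
    union_area = area(a) + area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def related(a: Box, b: Box, spatial_threshold: float = 0.5) -> bool:
    """Treat two elements as related when their overlap meets the threshold."""
    return overlap_ratio(a, b) >= spatial_threshold
```

Under this reading, raising the threshold from 0.5 to 0.7 makes the pairing stricter: two boxes where one covers exactly half of the other are related at 0.5 but not at 0.7.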

## Validation Logic

When `validation_enabled=True`, extracted fields are checked against:

- Type rules (e.g. string, number, date)
- Range checks (e.g. price ≥ 0, date within the last year)
- Presence (required vs. optional fields)
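
To make the three kinds of checks concrete, here is a minimal standalone sketch. It does not use extracthero's validator. The `RULES` layout and the `check_record` function are assumptions chosen for illustration.

```python
# Hypothetical rule table: one entry per field.
RULES = {
    "title": {"type": str, "required": True},
    "price": {"type": (int, float), "required": True, "min": 0},
    "rating": {"type": (int, float), "required": False, "min": 0, "max": 5},
}


def check_record(record, rules=RULES):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, rule in rules.items():
        value = record.get(field)
        # Presence: a missing optional field is fine, a missing required one is not.
        if value is None:
            if rule.get("required", False):
                problems.append(f"{field}: missing required field")
            continue
        # Type rule: stop here, since range checks need a comparable value.
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: wrong type {type(value).__name__}")
            continue
        # Range checks.
        if "min" in rule and value < rule["min"]:
            problems.append(f"{field}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{field}: above maximum {rule['max']}")
    return problems
```

Returning a list of problems instead of a single boolean lets the caller report every failing field at once.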

Customize or extend:

```python
from extracthero.validation import Validator, FieldRule

# custom rule: price must be < 1 000 000
class PriceRule(FieldRule):
    def validate(self, value):
        return isinstance(value, (int, float)) and value < 1_000_000

config.custom_validators = {
    "price": PriceRule()
}
```
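
One detail worth knowing about the rule above: in Python, `bool` is a subclass of `int`, so `isinstance(True, (int, float))` is true and `True` would pass as a price. The sketch below shows a stricter check as a plain function. It is independent of extracthero's `FieldRule` base class, whose exact interface is not shown here, and the name `is_valid_price` is hypothetical.

```python
def is_valid_price(value, upper=1_000_000):
    """Accept ints and floats below `upper`, but reject booleans."""
    if isinstance(value, bool):
        # bool is a subclass of int, so it must be excluded explicitly.
        return False
    return isinstance(value, (int, float)) and value < upper
```

The same two lines could go inside a `validate` method if the stricter behavior is wanted there.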

## CLI Interface

```bash
# Extract from HTML file
extracthero html input.html --schema product_schema.json --output out.json

# Extract from screenshot
extracthero image page.png --prompt "Get titles" --output out.json

# Combined mode
extracthero combo input.html page.png --output out.json
```

Use `extracthero --help` for full options and flags.

## Examples

- E-commerce product scraper
- News article metadata extraction
- Invoice data capture with layout

See the `examples/` folder for ready-to-run demos.

## Contributing

1. Fork the repo
2. Create a feature branch (`git checkout -b feat/my-feature`)
3. Implement & test
4. Submit a pull request

Please follow our code style guide and write tests for new features.


Release history

| Version | Release date |
|---|---|
| 0.1.8 | Sep 18, 2025 |
| 0.1.7 | Sep 17, 2025 |
| 0.1.6 | Sep 15, 2025 |
| 0.1.5.1 | Sep 11, 2025 |
| 0.1.5 | Sep 11, 2025 |
| 0.1.4 | Sep 3, 2025 |
| 0.1.3 | Jul 14, 2025 |
| 0.1.2 | Jul 14, 2025 |
| 0.1.1 | Jul 8, 2025 |
| 0.0.9 | Jul 2, 2025 |
| 0.0.8 | Jul 2, 2025 |
| 0.0.7 | Jul 2, 2025 |
| 0.0.6 | Jun 18, 2025 |
| 0.0.5 | Jun 13, 2025 |
| 0.0.4 | Jun 13, 2025 |
| 0.0.3 | Jun 12, 2025 |
| 0.0.1 (this version) | Jun 11, 2025 |

Download files

Source Distribution

extracthero-0.0.1.tar.gz (3.1 kB)

Uploaded Jun 11, 2025 Source

Built Distribution

extracthero-0.0.1-py3-none-any.whl (2.9 kB)

Uploaded Jun 11, 2025 Python 3

File details

Details for the file extracthero-0.0.1.tar.gz.

File metadata

Download URL: extracthero-0.0.1.tar.gz
Upload date: Jun 11, 2025
Size: 3.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for extracthero-0.0.1.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bbcae2ebf4bb2fbe5fb0f3602f1fe8b79f88871a67531d4cc5b6048f96efb18a` |
| MD5 | `bdb4acbe900aa33c25b0add033dd5a0c` |
| BLAKE2b-256 | `5facac7f76e16a8f8e8bb4b2514c21e0d38b9694390522d45d2721b9beaaf239` |
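
To check a downloaded file against the SHA256 digest listed above, a short standard-library sketch is enough. The function names `sha256_of` and `matches` are hypothetical, and the file is assumed to sit in the current working directory under its published name.

```python
import hashlib

# SHA256 digest published for extracthero-0.0.1.tar.gz.
EXPECTED = "bbcae2ebf4bb2fbe5fb0f3602f1fe8b79f88871a67531d4cc5b6048f96efb18a"


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected=EXPECTED):
    """Compare the file's digest with the expected one, ignoring hex case."""
    return sha256_of(path) == expected.lower()
```

Calling `matches("extracthero-0.0.1.tar.gz")` on an intact download of the source distribution should return `True`. A mismatch means the file differs from the one published.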


File details

Details for the file extracthero-0.0.1-py3-none-any.whl.

File metadata

Download URL: extracthero-0.0.1-py3-none-any.whl
Upload date: Jun 11, 2025
Size: 2.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.23

File hashes

Hashes for extracthero-0.0.1-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | `336e3a5e34fbfb4107df3bb90008d4bd46c30ca1f7f7cc4efdb10f13f4bf4274` |
| MD5 | `ffc00d0c5f19d8799d5d99b388860fb0` |
| BLAKE2b-256 | `bf74fc174690d861d28efa25f6ddc32578eb88fde29634286eec2044bb174f57` |

