
LLM-driven extraction from raw HTML and website screenshots, preserving spatial context with optional validation.

# extracthero

extracthero uses an LLM to extract structured information from raw HTML, page screenshots, or both together, keeping the layout context that plain-text scraping loses.


## Features

- **Multi-modal input**: ingest raw HTML strings, screenshot image files, or paired HTML+image inputs  
- **Spatial context preservation**: maintain element positions to extract data that depends on layout or visual proximity  
- **LLM-powered extraction**: plug in your favorite LLM (e.g. OpenAI, Anthropic Claude) for flexible schema-driven parsing  
- **Optional validation**: enable built-in rules or custom validators to enforce types, ranges and required fields  
- **Extensible pipeline**: hook into pre- and post-processing steps

## Installation

```bash
pip install extracthero
```

## Quickstart

```python
from extracthero import Extractor, ExtractConfig

# 1. Initialize with your LLM backend and extraction prompt
config = ExtractConfig(
    llm_backend="openai",
    prompt_template="Extract product titles and prices from the page",
    validation_enabled=True
)
extractor = Extractor(config)

# 2a. Extract from raw HTML
html = "<html>…</html>"
result_html = extractor.extract_from_html(html)
print(result_html)

# 2b. Extract from a screenshot
result_img = extractor.extract_from_screenshot("page.png")
print(result_img)

# 2c. Extract using both HTML + screenshot (preserve layout)
result_combo = extractor.extract_from_both(html, "page.png")
print(result_combo)
```

## Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| `llm_backend` | str | `"openai"` | Which LLM to use (`"openai"`, `"anthropic"`, etc.) |
| `prompt_template` | str | `None` | Template guiding the LLM's extraction instructions |
| `validation_enabled` | bool | `False` | Turn on built-in schema and rule validation |
| `ocr_engine` | str | `"tesseract"` | OCR engine for screenshot text extraction |
| `spatial_threshold` | float | `0.5` | Minimum layout-overlap ratio to consider two elements related |

```python
from extracthero import ExtractConfig

config = ExtractConfig(
    llm_backend="anthropic",
    validation_enabled=False,
    spatial_threshold=0.7
)
```
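The description of `spatial_threshold` does not say how the layout-overlap ratio is computed. As a rough illustration only, the sketch below assumes intersection-over-union of two axis-aligned bounding boxes. The `Box` type and the `overlap_ratio` and `are_related` helpers are hypothetical names, not part of extracthero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def area(self) -> float:
        # Inverted or empty boxes count as zero area
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes, in the range [0, 1]."""
    inter = Box(
        max(a.left, b.left), max(a.top, b.top),
        min(a.right, b.right), min(a.bottom, b.bottom),
    ).area
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def are_related(a: Box, b: Box, spatial_threshold: float = 0.5) -> bool:
    """Assumed rule: two elements are related when their overlap meets the threshold."""
    return overlap_ratio(a, b) >= spatial_threshold
```

Under this assumption, raising `spatial_threshold` from 0.5 to 0.7 makes the grouping stricter: fewer element pairs count as related.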

## Validation Logic

When `validation_enabled=True`, extracted fields are checked against:

- Type rules (e.g. string, number, date)
- Range checks (e.g. price ≥ 0, date within the last year)
- Presence (required vs. optional fields)
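A minimal sketch of what these three kinds of checks could look like in plain Python. The `SPEC` layout and the `check_record` function are illustrative assumptions, not extracthero's actual validator.

```python
from datetime import date, timedelta


def within_last_year(d: date) -> bool:
    """Range check: the date falls within the 365 days up to today."""
    today = date.today()
    return today - timedelta(days=365) <= d <= today


# Hypothetical spec: field name -> (expected type, required?, range check or None)
SPEC = {
    "title": (str, True, None),
    "price": ((int, float), True, lambda v: v >= 0),
    "published": (date, False, within_last_year),
}


def check_record(record: dict, spec: dict = SPEC) -> list:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for name, (expected_type, required, in_range) in spec.items():
        value = record.get(name)
        if value is None:
            # Presence: only required fields produce an error when missing
            if required:
                problems.append(f"{name}: missing required field")
            continue
        if not isinstance(value, expected_type):
            problems.append(f"{name}: wrong type {type(value).__name__}")
            continue
        if in_range is not None and not in_range(value):
            problems.append(f"{name}: out of range")
    return problems
```

Returning a list of problems rather than raising on the first failure lets a caller report every invalid field in one pass.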

Customize or extend:

```python
from extracthero.validation import Validator, FieldRule

# Custom rule: price must be a number below 1,000,000
class PriceRule(FieldRule):
    def validate(self, value):
        return isinstance(value, (int, float)) and value < 1_000_000

# Register the rule for the "price" field on an existing ExtractConfig
config.custom_validators = {
    "price": PriceRule()
}
```
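How `custom_validators` is consumed is not shown here. The sketch below uses a stand-in `FieldRule` base class (assumed to expose only `validate`) and a hypothetical `apply_rules` helper to show one way a mapping of field names to rules could be run over an extracted record.

```python
class FieldRule:
    """Stand-in for extracthero.validation.FieldRule (assumed interface)."""

    def validate(self, value) -> bool:
        raise NotImplementedError


class PriceRule(FieldRule):
    def validate(self, value) -> bool:
        # bool is a subclass of int, so exclude it explicitly
        return (
            isinstance(value, (int, float))
            and not isinstance(value, bool)
            and value < 1_000_000
        )


def apply_rules(record: dict, rules: dict) -> dict:
    """Map each field that has a rule to True (passed) or False (failed)."""
    return {name: rule.validate(record.get(name)) for name, rule in rules.items()}


custom_validators = {"price": PriceRule()}
print(apply_rules({"title": "Mug", "price": 12.5}, custom_validators))
```

A field that is missing from the record is passed to the rule as `None`, so `PriceRule` reports it as failed rather than raising.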

## CLI Interface

```bash
# Extract from HTML file
extracthero html input.html --schema product_schema.json --output out.json

# Extract from screenshot
extracthero image page.png --prompt "Get titles" --output out.json

# Combined mode
extracthero combo input.html page.png --output out.json
```

Use `extracthero --help` for full options and flags.
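For readers curious how the three subcommands above could be wired, here is a minimal `argparse` sketch. It mirrors only the arguments that appear in the examples and assumes every flag is accepted by every subcommand; the real CLI may differ, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="extracthero")
    sub = parser.add_subparsers(dest="mode", required=True)

    # Flags shared by every subcommand (an assumption of this sketch)
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--schema", help="path to a JSON schema file")
    common.add_argument("--prompt", help="free-text extraction instruction")
    common.add_argument("--output", help="where to write the JSON result")

    html = sub.add_parser("html", parents=[common])
    html.add_argument("html_path")

    image = sub.add_parser("image", parents=[common])
    image.add_argument("image_path")

    combo = sub.add_parser("combo", parents=[common])
    combo.add_argument("html_path")
    combo.add_argument("image_path")
    return parser


args = build_parser().parse_args(
    ["combo", "input.html", "page.png", "--output", "out.json"]
)
print(args.mode, args.html_path, args.image_path, args.output)
```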

## Examples

1. E-commerce product scraper
2. News article metadata extraction
3. Invoice data capture with layout

See the `examples/` folder for ready-to-run demos.

## Contributing

1. Fork the repo
2. Create a feature branch (`git checkout -b feat/my-feature`)
3. Implement & test
4. Submit a pull request

Please follow our code style guide and write tests for new features.

## Download files

- Source distribution: `extracthero-0.0.1.tar.gz` (3.1 kB)
- Built distribution: `extracthero-0.0.1-py3-none-any.whl` (2.9 kB, Python 3)

## File details: extracthero-0.0.1.tar.gz

- Download URL: extracthero-0.0.1.tar.gz
- Upload date:
- Size: 3.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.23

Hashes for extracthero-0.0.1.tar.gz:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `bbcae2ebf4bb2fbe5fb0f3602f1fe8b79f88871a67531d4cc5b6048f96efb18a` |
| MD5 | `bdb4acbe900aa33c25b0add033dd5a0c` |
| BLAKE2b-256 | `5facac7f76e16a8f8e8bb4b2514c21e0d38b9694390522d45d2721b9beaaf239` |


## File details: extracthero-0.0.1-py3-none-any.whl

- Download URL: extracthero-0.0.1-py3-none-any.whl
- Upload date:
- Size: 2.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.23

Hashes for extracthero-0.0.1-py3-none-any.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `336e3a5e34fbfb4107df3bb90008d4bd46c30ca1f7f7cc4efdb10f13f4bf4274` |
| MD5 | `ffc00d0c5f19d8799d5d99b388860fb0` |
| BLAKE2b-256 | `bf74fc174690d861d28efa25f6ddc32578eb88fde29634286eec2044bb174f57` |
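
The SHA256 digests listed above can be checked against a downloaded file with Python's standard `hashlib` module. This is a generic sketch; the `sha256_of` and `matches` helper names are illustrative and not part of extracthero.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hex SHA256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """True when the file's digest equals the published one (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Example call, assuming the wheel has been downloaded to the working directory:
# matches(
#     "extracthero-0.0.1-py3-none-any.whl",
#     "336e3a5e34fbfb4107df3bb90008d4bd46c30ca1f7f7cc4efdb10f13f4bf4274",
# )
```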

