
LLM-driven extraction from raw HTML and website screenshots, preserving spatial context with optional validation.


# extracthero

extracthero uses an LLM to extract structured information from raw HTML and website screenshots, preserving spatial context and optionally validating the result.


## Features

- **Multi-modal input**: ingest raw HTML strings, screenshot image files, or paired HTML+image inputs  
- **Spatial context preservation**: maintain element positions to extract data that depends on layout or visual proximity  
- **LLM-powered extraction**: plug in your favorite LLM (e.g. OpenAI, Anthropic Claude) for flexible schema-driven parsing  
- **Optional validation**: enable built-in rules or custom validators to enforce types, ranges and required fields  
- **Extensible pipeline**: hook into pre- and post-processing steps

## Installation

```bash
pip install extracthero
```

## Quickstart

```python
from extracthero import Extractor, ExtractConfig

# 1. Initialize with your LLM backend and schema
config = ExtractConfig(
    llm_backend="openai",
    prompt_template="Extract product titles and prices from the page",
    validation_enabled=True
)
extractor = Extractor(config)

# 2a. Extract from raw HTML
html = "<html>…</html>"
result_html = extractor.extract_from_html(html)
print(result_html)

# 2b. Extract from a screenshot
result_img = extractor.extract_from_screenshot("page.png")
print(result_img)

# 2c. Extract using both HTML + screenshot (preserve layout)
result_combo = extractor.extract_from_both(html, "page.png")
print(result_combo)
```

## Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| `llm_backend` | str | `"openai"` | Which LLM to use (`"openai"`, `"anthropic"`, etc.) |
| `prompt_template` | str | `None` | Template guiding the LLM's extraction instructions |
| `validation_enabled` | bool | `False` | Turn on built-in schema and rule validation |
| `ocr_engine` | str | `"tesseract"` | OCR engine for screenshot text extraction |
| `spatial_threshold` | float | `0.5` | Minimum layout-overlap ratio to consider two elements related |

```python
from extracthero import ExtractConfig

config = ExtractConfig(
    llm_backend="anthropic",
    validation_enabled=False,
    spatial_threshold=0.7
)
```
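To give a feel for what a layout-overlap ratio could mean, here is a minimal, self-contained sketch. It does not use extracthero and is not taken from its source: the `Box` type, the `overlap_ratio` and `are_related` helpers, and the choice to define the ratio as intersection area divided by the smaller box's area are all assumptions made for illustration. The library may compute `spatial_threshold` differently.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixels (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the smaller box's area (assumed definition)."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not overlap
    smaller = min(a.area, b.area)
    return (width * height) / smaller if smaller > 0 else 0.0


def are_related(a: Box, b: Box, spatial_threshold: float = 0.5) -> bool:
    """Treat two elements as related when their overlap reaches the threshold."""
    return overlap_ratio(a, b) >= spatial_threshold


title = Box(0, 0, 100, 20)
price = Box(50, 0, 150, 20)
print(overlap_ratio(title, price))                        # 0.5
print(are_related(title, price))                          # True
print(are_related(title, price, spatial_threshold=0.7))   # False
```

Under this assumed definition, raising the threshold from 0.5 to 0.7 makes the pairing stricter: the two boxes above share half of their area, so they count as related at the default but not at 0.7.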

## Validation Logic

When `validation_enabled=True`, extracted fields are checked against:

- Type rules (e.g. string, number, date)
- Range checks (e.g. price ≥ 0, date within the last year)
- Presence (required vs. optional fields)

Customize or extend:

```python
from extracthero.validation import Validator, FieldRule

# custom rule: price must be a number below 1,000,000
class PriceRule(FieldRule):
    def validate(self, value):
        return isinstance(value, (int, float)) and value < 1_000_000

# `config` is the ExtractConfig instance created earlier
config.custom_validators = {
    "price": PriceRule()
}
```
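The three kinds of checks listed above (type, range, presence) can be illustrated with a small standalone sketch. This is not extracthero's validator: `FieldSpec`, `validate_record`, and the example field names are hypothetical, and the code only shows one plausible way such checks fit together.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Any, Callable, Dict, List, Optional


@dataclass
class FieldSpec:
    """Hypothetical per-field rule: type, presence, and an optional range check."""
    expected_type: Any                               # type or tuple of types for isinstance()
    required: bool = True
    in_range: Optional[Callable[[Any], bool]] = None


def validate_record(record: Dict[str, Any], specs: Dict[str, FieldSpec]) -> List[str]:
    """Return a list of human-readable errors; an empty list means the record passed."""
    errors: List[str] = []
    for name, spec in specs.items():
        value = record.get(name)
        if value is None:
            # Presence check: only required fields produce an error when absent.
            if spec.required:
                errors.append(f"{name}: missing required field")
            continue
        if not isinstance(value, spec.expected_type):
            errors.append(f"{name}: wrong type {type(value).__name__}")
            continue
        if spec.in_range is not None and not spec.in_range(value):
            errors.append(f"{name}: value {value!r} out of range")
    return errors


def within_last_year(d: date) -> bool:
    today = date.today()
    return today - timedelta(days=365) <= d <= today


SPECS = {
    "title": FieldSpec(str),
    "price": FieldSpec((int, float), in_range=lambda v: v >= 0),
    "published": FieldSpec(date, required=False, in_range=within_last_year),
}

print(validate_record({"title": "Desk lamp", "price": 19.99}, SPECS))  # []
print(validate_record({"title": "Desk lamp", "price": -5}, SPECS))
# ['price: value -5 out of range']
```

Returning a list of errors instead of raising on the first failure lets a caller report every problem in a record at once; whether extracthero does the same is not stated in this README.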

## CLI Interface

```bash
# Extract from HTML file
extracthero html input.html --schema product_schema.json --output out.json

# Extract from screenshot
extracthero image page.png --prompt "Get titles" --output out.json

# Combined mode
extracthero combo input.html page.png --output out.json
```

Use `extracthero --help` for full options and flags.

## Examples

1. E-commerce product scraper
2. News article metadata extraction
3. Invoice data capture with layout

See the `examples/` folder for ready-to-run demos.

## Contributing

1. Fork the repo
2. Create a feature branch (`git checkout -b feat/my-feature`)
3. Implement and test your changes
4. Submit a pull request

Please follow our code style guide and write tests for new features.
