Turn any webpage into structured data using LLMs

Project description

LLM Scraper (Python)

LLM Scraper (Python) is a library for extracting structured data from any webpage using LLMs. It is a Python port of the TypeScript LLM Scraper.

Important: This is a Python implementation of the original TypeScript LLM Scraper library, providing the same functionality with Python-native APIs and both sync and async support.

Tip: Under the hood, it uses structured output generation to convert pages to structured data. You can find more about this approach here.

Features

Dual API Support: Both synchronous and asynchronous operations
OpenAI Integration: Built-in support for OpenAI GPT models with structured outputs
Extensible: Protocol-based design allows custom LLM providers
Schema Flexibility: Supports both Pydantic models and JSON Schema
Type Safety: Full type-safety with Python type hints
Playwright Integration: Built on the robust Playwright framework
Multiple Formats: Six content processing modes, including image support
Code Generation: Generate reusable JavaScript extraction code
Error Handling: Comprehensive validation and error reporting

Supported Content Formats:

html - Pre-processed HTML (cleaned, scripts/styles removed)
raw_html - Raw HTML (no processing)
markdown - HTML converted to markdown
text - Extracted readable text (using Readability.js)
image - Page screenshot for multi-modal models
custom - User-defined extraction function
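
To give a feel for what pre-processing removes before page content reaches the model, here is a minimal standard-library sketch that drops `<script>` and `<style>` contents and keeps the visible text. It is not the library's implementation (the text mode above uses Readability.js); `ScriptStyleStripper` and `visible_text` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class ScriptStyleStripper(HTMLParser):
    """Collects page text while skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks: list = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def visible_text(html: str) -> str:
    """Return the text of an HTML document without script/style contents."""
    parser = ScriptStyleStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<body><script>track()</script><h1>Title</h1><p>Body text</p></body>"
    print(visible_text(page))
```

Smaller, cleaner input generally means fewer tokens sent to the model, which is why the processed formats are usually preferable to raw_html.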

Make sure to give the original project a star! ⭐️

Getting Started

Installation

Install the package and dependencies:

pip install llm_scraper_py

Install Playwright browsers:

playwright install

Quick Setup

import os
from llm_scraper_py import LLMScraper, OpenAIModel

# Initialize OpenAI model
llm = OpenAIModel(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY")  # or pass directly
)

# Create scraper instance
scraper = LLMScraper(llm)

Examples

Async Example (Recommended)

Extract top stories from Hacker News using async/await:

import asyncio
from playwright.async_api import async_playwright
from pydantic import BaseModel, Field
from typing import List
from llm_scraper_py import LLMScraper, OpenAIModel

# Define the data structure using Pydantic
class Story(BaseModel):
    title: str
    points: int
    by: str
    comments_url: str = Field(alias="commentsURL")

class HackerNewsData(BaseModel):
    top: List[Story] = Field(
        max_length=5,
        description="Top 5 stories on Hacker News"
    )

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Initialize LLM and scraper
        llm = OpenAIModel(model="gpt-4o")
        scraper = LLMScraper(llm)

        # Navigate and scrape
        await page.goto("https://news.ycombinator.com")
        result = await scraper.arun(page, HackerNewsData, {"format": "html"})

        # Display results
        print("Top Stories:")
        # arun returns a dictionary (see Response Format), so index by key
        for story in result["data"]["top"]:
            print(f"- {story['title']} ({story['points']} points by {story['by']})")

        await browser.close()

asyncio.run(main())

Sync Example

For simpler use cases, use the synchronous API:

from typing import Optional

from playwright.sync_api import sync_playwright
from pydantic import BaseModel
from llm_scraper_py import LLMScraper, OpenAIModel

class ArticleData(BaseModel):
    title: str
    content: str
    author: Optional[str] = None  # not every article lists an author

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Initialize LLM and scraper
        llm = OpenAIModel(model="gpt-4o-mini")
        scraper = LLMScraper(llm)

        # Navigate and scrape
        page.goto("https://example-blog.com/article")
        result = scraper.run(page, ArticleData, {"format": "text"})

        print(f"Title: {result['data']['title']}")
        print(f"Author: {result['data']['author']}")

        browser.close()

main()

Schema Options

Using JSON Schema

You can use JSON Schema instead of Pydantic models:

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "availability": {"type": "boolean"},
        "features": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 5
        }
    },
    "required": ["title", "price"]
}

# Works with both sync and async
result = await scraper.arun(page, schema, {"format": "html"})
# or
result = scraper.run(page, schema, {"format": "html"})
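
As a rough illustration of what validating extracted data against such a schema involves, the sketch below checks required fields, basic types, and `maxItems` using only the standard library. It covers a small subset of JSON Schema and is not how the library validates results (it lists `jsonschema` as a dependency); `check_against_schema` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Same schema as above, repeated so the example is self-contained
PRODUCT_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "availability": {"type": "boolean"},
        "features": {"type": "array", "items": {"type": "string"}, "maxItems": 5},
    },
    "required": ["title", "price"],
}

# Mapping from JSON Schema type names to Python types (subset only)
_PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_against_schema(data: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: List[str] = []
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in data:
            continue
        value = data[name]
        type_name = rules.get("type")
        expected = _PYTHON_TYPES.get(type_name)
        if expected is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly for "number"
        is_bool_as_number = type_name == "number" and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_number:
            problems.append(f"{name}: expected {type_name}")
            continue
        if type_name == "array" and len(value) > rules.get("maxItems", len(value)):
            problems.append(f"{name}: more than {rules['maxItems']} items")
    return problems


if __name__ == "__main__":
    print(check_against_schema({"title": "Lamp", "price": 19.99}, PRODUCT_SCHEMA))
```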

Pydantic Models (Recommended)

Pydantic provides better type safety and validation:

from pydantic import BaseModel, Field
from typing import List, Optional

class Product(BaseModel):
    title: str
    price: float = Field(gt=0, description="Price must be positive")
    availability: bool
    features: List[str] = Field(max_length=5)
    description: Optional[str] = None

result = await scraper.arun(page, Product)

Content Format Options

The scraper supports different content processing formats:

# HTML (default) - cleaned HTML with scripts/styles removed
result = await scraper.arun(page, schema, {"format": "html"})

# Raw HTML - unprocessed HTML content
result = await scraper.arun(page, schema, {"format": "raw_html"})

# Markdown - HTML converted to markdown format
result = await scraper.arun(page, schema, {"format": "markdown"})

# Text - extracted readable text using Readability.js
result = await scraper.arun(page, schema, {"format": "text"})

# Image - page screenshot for multi-modal models
result = await scraper.arun(page, schema, {"format": "image"})

# Custom - user-defined extraction function
def extract_custom_data(page):
    return page.locator(".main-content").inner_text()

result = await scraper.arun(page, schema, {
    "format": "custom",
    "formatFunction": extract_custom_data
})

Code Generation

Generate reusable JavaScript code for data extraction:

from pydantic import BaseModel

class ProductInfo(BaseModel):
    name: str
    price: float
    rating: float

# Generate extraction code (async)
# Generation methods return a dictionary (see Response Format)
result = await scraper.agenerate(page, ProductInfo)
generated_code = result["code"]

# Or synchronous
result = scraper.generate(page, ProductInfo)
generated_code = result["code"]

# Execute the generated code on any similar page
extracted_data = await page.evaluate(generated_code)

# Validate and use the data
product = ProductInfo.model_validate(extracted_data)
print(f"Product: {product.name}, Price: ${product.price}")

The generated code is a self-contained JavaScript function that can be reused across similar pages without additional LLM calls.
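
One way to take advantage of this is to cache the generated code per site, so the LLM is called once per host rather than once per page. The sketch below is an assumption about how you might organize that, not part of the library: `GeneratedCodeCache` is a hypothetical class, and the `generate` callable stands in for something like a wrapper around `scraper.generate(...)` that returns the code string.

```python
from typing import Callable, Dict
from urllib.parse import urlparse


class GeneratedCodeCache:
    """Caches generated extraction code keyed by the URL's host."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate          # called only on a cache miss
        self._by_host: Dict[str, str] = {}
        self.llm_calls = 0                 # how many times generate() ran

    def code_for(self, url: str) -> str:
        host = urlparse(url).netloc
        if host not in self._by_host:
            self.llm_calls += 1
            self._by_host[host] = self._generate(url)
        return self._by_host[host]


if __name__ == "__main__":
    # Stand-in generator; a real one would call the scraper's generate method
    cache = GeneratedCodeCache(lambda url: "() => ({ name: document.title })")
    cache.code_for("https://shop.example.com/p/1")
    cache.code_for("https://shop.example.com/p/2")
    print(cache.llm_calls)
```

Keying by host assumes all pages on a site share a layout; a finer key (for example host plus the first path segment) may fit sites with several page templates.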

Advanced Configuration

LLM Options

Customize the LLM behavior with detailed options:

from llm_scraper_py import ScraperLLMOptions

options = ScraperLLMOptions(
    format="html",
    prompt="Extract the data carefully and accurately",
    temperature=0.1,        # Lower = more deterministic
    maxTokens=2000,         # Response length limit
    topP=0.9,              # Nucleus sampling
    mode="json"            # Response format hint
)

result = await scraper.arun(page, schema, options)

Generation Options

For code generation, use specialized options:

from llm_scraper_py import ScraperGenerateOptions

gen_options = ScraperGenerateOptions(
    format="html",
    prompt="Generate efficient extraction code",
    temperature=0.2
)

result = await scraper.agenerate(page, schema, gen_options)

Error Handling

Wrap scraping calls in try/except to handle schema validation failures, page timeouts, and configuration errors separately:

from llm_scraper_py import LLMScraper, OpenAIModel
from pydantic import ValidationError
# Alias Playwright's TimeoutError so it does not shadow the built-in TimeoutError
from playwright.async_api import TimeoutError as PlaywrightTimeoutError

try:
    llm = OpenAIModel(model="gpt-4o")
    scraper = LLMScraper(llm)

    result = await scraper.arun(page, schema, {"format": "html"})

except ValidationError as e:
    print(f"Schema validation failed: {e}")
except PlaywrightTimeoutError:
    print("Page load timeout")
except ValueError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
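
Transient failures such as timeouts are often worth retrying before giving up. The following is a minimal sketch of retrying an async operation with exponential backoff; `with_retries` is a hypothetical helper, not a library function. By default it retries on the built-in `TimeoutError`; to retry on Playwright timeouts you would pass Playwright's timeout class in `retry_on`.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def with_retries(
    operation: Callable[[], Awaitable[T]],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Run operation(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    async def demo() -> str:
        # Stand-in for something like: await scraper.arun(page, schema)
        return "scraped"

    print(asyncio.run(with_retries(demo)))
```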

Custom LLM Providers

Implement custom LLM providers using the LanguageModel protocol:

from llm_scraper_py import LanguageModel, LLMScraper
from typing import Dict, Any, Optional, AsyncGenerator
from pydantic import BaseModel

class CustomLLMProvider:
    """Example custom LLM provider implementation"""

    def __init__(self, api_key: str, base_url: str):
        self.api_key = api_key
        self.base_url = base_url

    # Sync methods
    def generate_json(
        self,
        messages: list[dict],
        schema: BaseModel,
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
        mode: Optional[str] = None,
    ) -> Dict[str, Any]:
        # Implement your JSON generation logic
        # Return structured data matching the schema
        pass

    def generate_text(
        self,
        messages: list[dict],
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> str:
        # Implement your text generation logic
        pass

    # Async methods
    async def agenerate_json(
        self,
        messages: list[dict],
        schema: BaseModel,
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
        mode: Optional[str] = None,
    ) -> Dict[str, Any]:
        # Async version of generate_json
        pass

    async def agenerate_text(
        self,
        messages: list[dict],
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> str:
        # Async version of generate_text
        pass

    # Streaming methods (optional)
    def stream_json(self, *args, **kwargs):
        raise NotImplementedError("Streaming not supported")

    async def astream_json(self, *args, **kwargs):
        raise NotImplementedError("Streaming not supported")

# Use your custom provider
custom_llm = CustomLLMProvider(api_key="your-key", base_url="https://api.example.com")
scraper = LLMScraper(custom_llm)
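
For unit tests it can help to have a provider that never touches the network. The sketch below mirrors the method names and parameters from the skeleton above and returns canned responses; `CannedLLMProvider` is a hypothetical test double, and whether `LLMScraper` accepts it unchanged is an assumption based on the protocol description here.

```python
import asyncio
import copy
from typing import Any, Dict, List, Optional


class CannedLLMProvider:
    """Hypothetical test double: returns fixed responses and makes no network calls."""

    def __init__(self, json_response: Dict[str, Any], text_response: str = "") -> None:
        self._json_response = json_response
        self._text_response = text_response
        self.calls: List[str] = []  # records which methods were invoked, in order

    def generate_json(self, messages: List[dict], schema: Any,
                      temperature: Optional[float] = None,
                      max_tokens: Optional[int] = None,
                      top_p: Optional[float] = None,
                      mode: Optional[str] = None) -> Dict[str, Any]:
        self.calls.append("generate_json")
        # Return a copy so callers cannot mutate the canned response
        return copy.deepcopy(self._json_response)

    def generate_text(self, messages: List[dict],
                      temperature: Optional[float] = None,
                      max_tokens: Optional[int] = None,
                      top_p: Optional[float] = None) -> str:
        self.calls.append("generate_text")
        return self._text_response

    async def agenerate_json(self, messages: List[dict], schema: Any,
                             temperature: Optional[float] = None,
                             max_tokens: Optional[int] = None,
                             top_p: Optional[float] = None,
                             mode: Optional[str] = None) -> Dict[str, Any]:
        self.calls.append("agenerate_json")
        return copy.deepcopy(self._json_response)

    async def agenerate_text(self, messages: List[dict],
                             temperature: Optional[float] = None,
                             max_tokens: Optional[int] = None,
                             top_p: Optional[float] = None) -> str:
        self.calls.append("agenerate_text")
        return self._text_response


if __name__ == "__main__":
    fake = CannedLLMProvider({"title": "Hello", "price": 9.5})
    print(fake.generate_json([], schema=None))
    print(asyncio.run(fake.agenerate_json([], schema=None)))
```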

API Reference

LLMScraper Methods

Async Methods (Recommended):

arun(page, schema, options=None) - Extract structured data asynchronously
agenerate(page, schema, options=None) - Generate extraction code asynchronously
astream(page, schema, options=None) - Stream partial results (not implemented)

Sync Methods:

run(page, schema, options=None) - Extract structured data synchronously
generate(page, schema, options=None) - Generate extraction code synchronously
stream(page, schema, options=None) - Stream partial results (not implemented)

Response Format

All extraction methods return a dictionary with:

{
    "data": {...},      # Extracted data matching your schema
    "url": "https://..." # Source page URL
}

Generation methods return:

{
    "code": "...",      # Generated JavaScript code
    "url": "https://..." # Source page URL
}
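
If you use a type checker, the two shapes above can be written down as `TypedDict` definitions. These are illustrative annotations based only on the keys listed here, not types exported by the library; `ExtractionResult`, `GenerationResult`, and `require_keys` are hypothetical names.

```python
from typing import Any, Dict, TypedDict


class ExtractionResult(TypedDict):
    data: Dict[str, Any]  # extracted data matching your schema
    url: str              # source page URL


class GenerationResult(TypedDict):
    code: str             # generated JavaScript code
    url: str              # source page URL


def require_keys(result: Dict[str, Any], *keys: str) -> Dict[str, Any]:
    """Fail early with a clear message if a result lacks an expected key."""
    missing = [key for key in keys if key not in result]
    if missing:
        raise KeyError(f"result is missing keys: {missing}")
    return result


if __name__ == "__main__":
    example: ExtractionResult = {"data": {"title": "Hello"}, "url": "https://example.com"}
    print(require_keys(example, "data", "url")["data"]["title"])
```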

Installation & Dependencies

pip install llm_scraper_py

Core Dependencies:

playwright - Web automation and browser control
pydantic - Data validation and serialization
openai - OpenAI API client (for built-in OpenAI support)
jsonschema - JSON Schema validation

Sync vs Async Usage

When to Use Async (Recommended)

Use async methods for:

Better performance with multiple concurrent scraping tasks
Integration with async web frameworks (FastAPI, aiohttp)
Non-blocking operations in async applications

import asyncio
from playwright.async_api import async_playwright

# Assumes `scraper` and `schema` are defined as in the examples above
async def scrape_multiple_pages(urls):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        tasks = []

        for url in urls:
            page = await browser.new_page()
            await page.goto(url)
            task = scraper.arun(page, schema)
            tasks.append(task)

        results = await asyncio.gather(*tasks)
        await browser.close()
        return results
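
With many URLs, launching every task at once can open more pages and LLM requests than you want in flight. A common pattern is to cap concurrency with `asyncio.Semaphore`. The sketch below is library-agnostic: `gather_bounded` is a hypothetical helper, and `scrape_one` stands in for a coroutine that opens a page and calls `scraper.arun`.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def gather_bounded(
    urls: Sequence[str],
    scrape_one: Callable[[str], Awaitable[T]],
    limit: int = 3,
) -> List[T]:
    """Run scrape_one for every URL, with at most `limit` running at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> T:
        async with semaphore:  # wait here if `limit` tasks are already running
            return await scrape_one(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(guarded(url) for url in urls))


if __name__ == "__main__":
    async def fake_scrape(url: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for page load plus LLM call
        return f"scraped {url}"

    print(asyncio.run(gather_bounded(["a", "b", "c", "d"], fake_scrape, limit=2)))
```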

When to Use Sync

Use sync methods for:

Simple scripts and one-off tasks
Integration with sync codebases
Learning and prototyping

from playwright.sync_api import sync_playwright

# Assumes `scraper` and `schema` are defined as in the examples above
def scrape_single_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        result = scraper.run(page, schema)
        browser.close()
        return result

Performance Tips

Reuse browser instances when scraping multiple pages
Use async methods for concurrent operations
Choose appropriate content formats - text is fastest, image is slowest
Set reasonable token limits to control costs and response times
Use code generation for repeated scraping of similar pages

Contributing

We welcome contributions! This project is a Python port of the original TypeScript LLM Scraper by mishushakov.

Ways to contribute:

Report bugs and request features via GitHub issues
Submit pull requests for improvements
Add support for new LLM providers
Improve documentation and examples
Write tests for edge cases

License

This project follows the same license as the original LLM Scraper project.

Project details

This version

0.4.0

Aug 25, 2025

Download files

Source Distribution

llm_scraper_py-0.4.0.tar.gz (26.7 kB view details)

Uploaded Aug 25, 2025 Source

Built Distribution

llm_scraper_py-0.4.0-py3-none-any.whl (30.9 kB view details)

Uploaded Aug 25, 2025 Python 3

File details

Details for the file llm_scraper_py-0.4.0.tar.gz.

File metadata

Download URL: llm_scraper_py-0.4.0.tar.gz
Upload date: Aug 25, 2025
Size: 26.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for llm_scraper_py-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`300ac9dbed58a386268d48bcc426c71688d64852b4be13a84a0cbd85263c0f2f`
MD5	`a9f94ece2455f74ccbf5ea50dc475024`
BLAKE2b-256	`31b2ffe775c9b00125ad72e4883c665fee89bfcbf9bd4d2076c2fd6886eb5a03`

File details

Details for the file llm_scraper_py-0.4.0-py3-none-any.whl.

File metadata

Download URL: llm_scraper_py-0.4.0-py3-none-any.whl
Upload date: Aug 25, 2025
Size: 30.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for llm_scraper_py-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`77fba59a383f3ecfb34ce3fd5393045a9e0942776346847d81c65eaf2d4ca2ee`
MD5	`f5355d0f51de0dba3fa17905257e41c7`
BLAKE2b-256	`454401f18a18dff9e579d4f25c4488deaaa84cb16392e2cfa24337b93133f945`

llm-scraper-py 0.4.0
