Turn any webpage into structured data using LLMs
LLM Scraper (Python)
LLM Scraper Python is a Python library that allows you to extract structured data from any webpage using LLMs. This is a Python port of the popular TypeScript LLM Scraper.
[!IMPORTANT] This is a Python implementation of the original TypeScript LLM Scraper library, providing the same powerful functionality with Python-native APIs and both sync/async support.
[!TIP] Under the hood, it uses structured output generation to convert pages to structured data. You can find more about this approach here.
Features
- Dual API Support: Both synchronous and asynchronous operations
- OpenAI Integration: Built-in support for OpenAI GPT models with structured outputs
- Extensible: Protocol-based design allows custom LLM providers
- Schema Flexibility: Supports both Pydantic models and JSON Schema
- Type Safety: Full type-safety with Python type hints
- Playwright Integration: Built on the robust Playwright framework
- Multiple Formats: 6 content processing modes including image support
- Code Generation: Generate reusable JavaScript extraction code
- Error Handling: Comprehensive validation and error reporting
Supported Content Formats:
- html - Pre-processed HTML (cleaned, scripts/styles removed)
- raw_html - Raw HTML (no processing)
- markdown - HTML converted to markdown
- text - Extracted readable text (using Readability.js)
- image - Page screenshot for multi-modal models
- custom - User-defined extraction function
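To make the difference between html and raw_html concrete, here is a minimal standard-library sketch of what "scripts/styles removed" can mean. It is not the library's implementation; the class and function names (`ScriptStyleStripper`, `strip_scripts_and_styles`) are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser
from typing import List


class ScriptStyleStripper(HTMLParser):
    """Re-emit HTML while dropping <script>/<style> elements and comments."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        # Keep entities such as &amp; untouched so the output stays valid HTML.
        super().__init__(convert_charrefs=False)
        self.parts: List[str] = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag not in self.SKIPPED and self._skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._skip_depth == 0:
            self.parts.append(f"&#{name};")


def strip_scripts_and_styles(html: str) -> str:
    """Return the markup with script/style elements and comments removed."""
    parser = ScriptStyleStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

A smaller input usually means fewer tokens sent to the model, which is one reason a cleaned format can be preferable to raw_html.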
Make sure to give the original project a star! ⭐️
Getting Started
Installation
- Install the package and dependencies:
pip install llm_scraper_py
- Install Playwright browsers:
playwright install
Quick Setup
import os
from llm_scraper_py import LLMScraper, OpenAIModel
# Initialize OpenAI model
llm = OpenAIModel(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY"),  # or pass the key directly
)
# Create scraper instance
scraper = LLMScraper(llm)
Examples
Async Example (Recommended)
Extract top stories from Hacker News using async/await:
import asyncio
from playwright.async_api import async_playwright
from pydantic import BaseModel, Field
from typing import List
from llm_scraper_py import LLMScraper, OpenAIModel
# Define the data structure using Pydantic
class Story(BaseModel):
    title: str
    points: int
    by: str
    comments_url: str = Field(alias="commentsURL")


class HackerNewsData(BaseModel):
    top: List[Story] = Field(
        max_length=5,
        description="Top 5 stories on Hacker News",
    )


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Initialize LLM and scraper
        llm = OpenAIModel(model="gpt-4o")
        scraper = LLMScraper(llm)

        # Navigate and scrape
        await page.goto("https://news.ycombinator.com")
        result = await scraper.arun(page, HackerNewsData, {"format": "html"})

        # Display results (the result is a dictionary, see Response Format)
        print("Top Stories:")
        for story in result["data"]["top"]:
            print(f"- {story['title']} ({story['points']} points by {story['by']})")

        await browser.close()


asyncio.run(main())
Sync Example
For simpler use cases, use the synchronous API:
from typing import Optional

from playwright.sync_api import sync_playwright
from pydantic import BaseModel
from llm_scraper_py import LLMScraper, OpenAIModel


class ArticleData(BaseModel):
    title: str
    content: str
    author: Optional[str] = None  # optional field, so the type must allow None


def main():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Initialize LLM and scraper
        llm = OpenAIModel(model="gpt-4o-mini")
        scraper = LLMScraper(llm)

        # Navigate and scrape
        page.goto("https://example-blog.com/article")
        result = scraper.run(page, ArticleData, {"format": "text"})

        print(f"Title: {result['data']['title']}")
        print(f"Author: {result['data']['author']}")

        browser.close()


main()
Schema Options
Using JSON Schema
You can use JSON Schema instead of Pydantic models:
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "availability": {"type": "boolean"},
        "features": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 5,
        },
    },
    "required": ["title", "price"],
}

# Works with both sync and async
result = await scraper.arun(page, schema, {"format": "html"})
# or
result = scraper.run(page, schema, {"format": "html"})
Pydantic Models (Recommended)
Pydantic provides better type safety and validation:
from pydantic import BaseModel, Field
from typing import List, Optional
class Product(BaseModel):
    title: str
    price: float = Field(gt=0, description="Price must be positive")
    availability: bool
    features: List[str] = Field(max_length=5)
    description: Optional[str] = None


result = await scraper.arun(page, Product)
Content Format Options
The scraper supports different content processing formats:
# HTML (default) - cleaned HTML with scripts/styles removed
result = await scraper.arun(page, schema, {"format": "html"})
# Raw HTML - unprocessed HTML content
result = await scraper.arun(page, schema, {"format": "raw_html"})
# Markdown - HTML converted to markdown format
result = await scraper.arun(page, schema, {"format": "markdown"})
# Text - extracted readable text using Readability.js
result = await scraper.arun(page, schema, {"format": "text"})
# Image - page screenshot for multi-modal models
result = await scraper.arun(page, schema, {"format": "image"})
# Custom - user-defined extraction function
def extract_custom_data(page):
    return page.locator(".main-content").inner_text()


result = await scraper.arun(page, schema, {
    "format": "custom",
    "formatFunction": extract_custom_data,
})
Code Generation
Generate reusable JavaScript code for data extraction:
from pydantic import BaseModel
class ProductInfo(BaseModel):
    name: str
    price: float
    rating: float


# Generate extraction code (async)
result = await scraper.agenerate(page, ProductInfo)
generated_code = result["code"]

# Or synchronous
result = scraper.generate(page, ProductInfo)
generated_code = result["code"]

# Execute the generated code on any similar page
extracted_data = await page.evaluate(generated_code)

# Validate and use the data
product = ProductInfo.model_validate(extracted_data)
print(f"Product: {product.name}, Price: ${product.price}")
The generated code is a self-contained JavaScript function that can be reused across similar pages without additional LLM calls.
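Because the generated code can be reused, one way to avoid repeated LLM calls is to cache it per site. The sketch below assumes that pages on the same hostname share a layout, which will not always hold. `ExtractionCodeCache` and the `generate_code` callable are hypothetical; in practice the callable would wrap something like `scraper.generate(...)` on a loaded page.

```python
from typing import Callable, Dict
from urllib.parse import urlparse


class ExtractionCodeCache:
    """Cache generated extraction code, keyed by hostname."""

    def __init__(self, generate_code: Callable[[str], str]) -> None:
        # generate_code is a hypothetical stand-in for the LLM-backed
        # generation step; it receives a URL and returns JavaScript source.
        self._generate_code = generate_code
        self._by_host: Dict[str, str] = {}

    def get(self, url: str) -> str:
        """Return cached code for the URL's host, generating it on first use."""
        host = urlparse(url).netloc.lower()
        if host not in self._by_host:
            self._by_host[host] = self._generate_code(url)
        return self._by_host[host]
```

The returned string can then be passed to `page.evaluate(...)` as in the example above. If a site changes its markup, the cached entry needs to be regenerated.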
Advanced Configuration
LLM Options
Customize the LLM behavior with detailed options:
from llm_scraper_py import ScraperLLMOptions
options = ScraperLLMOptions(
    format="html",
    prompt="Extract the data carefully and accurately",
    temperature=0.1,  # Lower = more deterministic
    maxTokens=2000,   # Response length limit
    topP=0.9,         # Nucleus sampling
    mode="json",      # Response format hint
)

result = await scraper.arun(page, schema, options)
Generation Options
For code generation, use specialized options:
from llm_scraper_py import ScraperGenerateOptions
gen_options = ScraperGenerateOptions(
    format="html",
    prompt="Generate efficient extraction code",
    temperature=0.2,
)

result = await scraper.agenerate(page, schema, gen_options)
Error Handling
The library provides comprehensive error handling:
from llm_scraper_py import LLMScraper, OpenAIModel
from pydantic import ValidationError
# Alias the import so it does not shadow Python's built-in TimeoutError
from playwright.async_api import TimeoutError as PlaywrightTimeoutError

try:
    llm = OpenAIModel(model="gpt-4o")
    scraper = LLMScraper(llm)
    result = await scraper.arun(page, schema, {"format": "html"})
except ValidationError as e:
    print(f"Schema validation failed: {e}")
except PlaywrightTimeoutError:
    print("Page load timeout")
except ValueError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Custom LLM Providers
Implement custom LLM providers using the LanguageModel protocol:
from llm_scraper_py import LanguageModel, LLMScraper
from typing import Dict, Any, Optional, AsyncGenerator
from pydantic import BaseModel
class CustomLLMProvider:
    """Example custom LLM provider implementation"""

    def __init__(self, api_key: str, base_url: str):
        self.api_key = api_key
        self.base_url = base_url

    # Sync methods
    def generate_json(
        self,
        messages: list[dict],
        schema: BaseModel,
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
        mode: Optional[str] = None,
    ) -> Dict[str, Any]:
        # Implement your JSON generation logic
        # Return structured data matching the schema
        pass

    def generate_text(
        self,
        messages: list[dict],
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> str:
        # Implement your text generation logic
        pass

    # Async methods
    async def agenerate_json(
        self,
        messages: list[dict],
        schema: BaseModel,
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
        mode: Optional[str] = None,
    ) -> Dict[str, Any]:
        # Async version of generate_json
        pass

    async def agenerate_text(
        self,
        messages: list[dict],
        temperature: Optional[float] = None,
        max_tokens: Optional[int] = None,
        top_p: Optional[float] = None,
    ) -> str:
        # Async version of generate_text
        pass

    # Streaming methods (optional)
    def stream_json(self, *args, **kwargs):
        raise NotImplementedError("Streaming not supported")

    async def astream_json(self, *args, **kwargs):
        raise NotImplementedError("Streaming not supported")


# Use your custom provider
custom_llm = CustomLLMProvider(api_key="your-key", base_url="https://api.example.com")
scraper = LLMScraper(custom_llm)
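Before wiring a provider into the scraper, it can help to check that it exposes the four required methods with the right sync/async flavor. The sketch below is not part of the library: `check_provider` and `CannedProvider` are hypothetical helpers, and the method list is taken from the example above rather than from the actual `LanguageModel` protocol definition.

```python
import inspect
from typing import Any, Dict, List

# Method names taken from the custom provider example; the real protocol
# may require more (or fewer) members.
REQUIRED_SYNC = ("generate_json", "generate_text")
REQUIRED_ASYNC = ("agenerate_json", "agenerate_text")


def check_provider(provider: object) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: List[str] = []
    for name in REQUIRED_SYNC:
        method = getattr(provider, name, None)
        if not callable(method):
            problems.append(f"missing method: {name}")
        elif inspect.iscoroutinefunction(method):
            problems.append(f"{name} should be a regular (sync) method")
    for name in REQUIRED_ASYNC:
        method = getattr(provider, name, None)
        if not callable(method):
            problems.append(f"missing method: {name}")
        elif not inspect.iscoroutinefunction(method):
            problems.append(f"{name} should be declared with async def")
    return problems


class CannedProvider:
    """Offline stand-in that returns a fixed payload instead of calling an LLM."""

    def __init__(self, payload: Dict[str, Any], text: str = "") -> None:
        self.payload = payload
        self.text = text

    def generate_json(self, messages, schema, **options) -> Dict[str, Any]:
        return dict(self.payload)

    def generate_text(self, messages, **options) -> str:
        return self.text

    async def agenerate_json(self, messages, schema, **options) -> Dict[str, Any]:
        return self.generate_json(messages, schema, **options)

    async def agenerate_text(self, messages, **options) -> str:
        return self.generate_text(messages, **options)
```

A canned provider like this is mainly useful in unit tests, where calling a real model would be slow and non-deterministic.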
API Reference
LLMScraper Methods
Async Methods (Recommended):
- arun(page, schema, options=None) - Extract structured data asynchronously
- agenerate(page, schema, options=None) - Generate extraction code asynchronously
- astream(page, schema, options=None) - Stream partial results (not implemented)
Sync Methods:
- run(page, schema, options=None) - Extract structured data synchronously
- generate(page, schema, options=None) - Generate extraction code synchronously
- stream(page, schema, options=None) - Stream partial results (not implemented)
Response Format
All extraction methods return a dictionary with:
{
    "data": {...},        # Extracted data matching your schema
    "url": "https://..."  # Source page URL
}
Generation methods return:
{
    "code": "...",        # Generated JavaScript code
    "url": "https://..."  # Source page URL
}
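If you use a type checker, the two response shapes above can be described with `TypedDict`. This is only a sketch based on the keys listed here: `ExtractionResult`, `GenerationResult`, and `summarize` are hypothetical names, and the library may expose its own result types.

```python
from typing import Any, Dict, TypedDict


class ExtractionResult(TypedDict):
    """Shape returned by run/arun, as documented above."""
    data: Dict[str, Any]
    url: str


class GenerationResult(TypedDict):
    """Shape returned by generate/agenerate, as documented above."""
    code: str
    url: str


def summarize(result: ExtractionResult) -> str:
    """Build a one-line summary: the source URL and the extracted field names."""
    fields = ", ".join(sorted(result["data"]))
    return f"{result['url']}: {fields}"
```

Annotating your own helper functions this way lets a type checker flag a misspelled key such as `result["date"]`.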
Installation & Dependencies
pip install llm_scraper_py
Core Dependencies:
- playwright - Web automation and browser control
- pydantic - Data validation and serialization
- openai - OpenAI API client (for built-in OpenAI support)
- jsonschema - JSON Schema validation
Sync vs Async Usage
When to Use Async (Recommended)
Use async methods for:
- Better performance with multiple concurrent scraping tasks
- Integration with async web frameworks (FastAPI, aiohttp)
- Non-blocking operations in async applications
import asyncio
from playwright.async_api import async_playwright
async def scrape_multiple_pages(urls):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        tasks = []
        for url in urls:
            page = await browser.new_page()
            await page.goto(url)
            task = scraper.arun(page, schema)
            tasks.append(task)
        results = await asyncio.gather(*tasks)
        await browser.close()
        return results
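With many URLs, launching every task at once can open a large number of pages and hit LLM rate limits. A common way to cap this is an `asyncio.Semaphore`. The helper below is a generic sketch, not a library feature: `gather_bounded` is a hypothetical name, and each job is assumed to be a zero-argument callable returning an awaitable (for example, a function that opens a page, navigates, and calls `scraper.arun`).

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def gather_bounded(
    jobs: List[Callable[[], Awaitable[T]]],
    limit: int = 3,
) -> List[T]:
    """Run all jobs, with at most `limit` running at the same time.

    Results are returned in the same order as `jobs`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run(job: Callable[[], Awaitable[T]]) -> T:
        # The job only starts once a semaphore slot is free.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(run(job) for job in jobs))
```

Passing callables instead of already-created coroutines matters here: the page is only opened once a slot is available, so the limit applies to browser pages as well as LLM calls.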
When to Use Sync
Use sync methods for:
- Simple scripts and one-off tasks
- Integration with sync codebases
- Learning and prototyping
from playwright.sync_api import sync_playwright
def scrape_single_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        result = scraper.run(page, schema)
        browser.close()
        return result
Performance Tips
- Reuse browser instances when scraping multiple pages
- Use async methods for concurrent operations
- Choose appropriate content formats - text is fastest, image is slowest
- Set reasonable token limits to control costs and response times
- Use code generation for repeated scraping of similar pages
Contributing
We welcome contributions! This project is a Python port of the original TypeScript LLM Scraper by mishushakov.
Ways to contribute:
- Report bugs and request features via GitHub issues
- Submit pull requests for improvements
- Add support for new LLM providers
- Improve documentation and examples
- Write tests for edge cases
License
This project follows the same license as the original LLM Scraper project.
File details
Details for the file llm_scraper_py-0.4.0.tar.gz.
File metadata
- Download URL: llm_scraper_py-0.4.0.tar.gz
- Upload date:
- Size: 26.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 300ac9dbed58a386268d48bcc426c71688d64852b4be13a84a0cbd85263c0f2f |
| MD5 | a9f94ece2455f74ccbf5ea50dc475024 |
| BLAKE2b-256 | 31b2ffe775c9b00125ad72e4883c665fee89bfcbf9bd4d2076c2fd6886eb5a03 |
File details
Details for the file llm_scraper_py-0.4.0-py3-none-any.whl.
File metadata
- Download URL: llm_scraper_py-0.4.0-py3-none-any.whl
- Upload date:
- Size: 30.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 77fba59a383f3ecfb34ce3fd5393045a9e0942776346847d81c65eaf2d4ca2ee |
| MD5 | f5355d0f51de0dba3fa17905257e41c7 |
| BLAKE2b-256 | 454401f18a18dff9e579d4f25c4488deaaa84cb16392e2cfa24337b93133f945 |