🤖 LLM WebExtract
Turn any website into structured data using the power of AI
Ever wanted to extract meaningful information from websites but got tired of parsing HTML and dealing with messy data? That's exactly why I built this tool. It combines modern web scraping with Large Language Models to intelligently extract structured information from any webpage.
🎯 What does this actually do?
Instead of writing complex parsing rules for every website, this tool:
- Scrapes the webpage using Playwright (handles modern JavaScript sites)
- Feeds the content to an LLM (local via Ollama, or cloud via OpenAI/Anthropic)
- Gets back structured data - topics, entities, summaries, key facts, and more
Think of it as having an AI assistant that reads web pages and summarizes them for you.
🚀 Getting Started
Installation
```shell
pip install llm-webextract
playwright install chromium
```
Want to use OpenAI or Anthropic instead of local models?
```shell
pip install "llm-webextract[openai]"     # For GPT models
pip install "llm-webextract[anthropic]"  # For Claude models
pip install "llm-webextract[all]"        # Everything
```

(The quotes keep shells like zsh from expanding the square brackets.)
Quick Examples
Command Line (easiest way to start):
```shell
# Extract content from any URL
llm-webextract extract "https://news.ycombinator.com"

# Pretty formatted output
llm-webextract extract "https://example.com" --format pretty

# Test your setup
llm-webextract test
```
Python Code:
```python
import webextract

# Simple one-liner (requires Ollama running locally)
result = webextract.quick_extract("https://news.bbc.co.uk")
print(f"Summary: {result.summary}")
print(f"Key topics: {result.topics}")

# Or use a cloud provider
result = webextract.extract_with_openai(
    "https://techcrunch.com",
    api_key="sk-your-key-here",
)
```
🛠 Configuration Options
Using Different LLM Providers
Local with Ollama (default):
```python
from webextract import WebExtractor, ConfigBuilder

extractor = WebExtractor(
    ConfigBuilder()
    .with_model("llama3:8b")  # or any model you have
    .build()
)
```
OpenAI GPT:
```python
extractor = WebExtractor(
    ConfigBuilder()
    .with_openai(api_key="sk-...", model="gpt-4")
    .build()
)
```
Anthropic Claude:
```python
extractor = WebExtractor(
    ConfigBuilder()
    .with_anthropic(api_key="sk-ant-...", model="claude-3-sonnet-20240229")
    .build()
)
```
Pre-built Profiles
I've included some ready-to-use configurations for common scenarios:
```python
from webextract import WebExtractor, ConfigProfiles

# For news articles
news_extractor = WebExtractor(ConfigProfiles.news_scraping())

# For research papers
research_extractor = WebExtractor(ConfigProfiles.research_papers())

# For e-commerce sites
shop_extractor = WebExtractor(ConfigProfiles.ecommerce())
```
📊 What You Get Back
The LLM analyzes the content and returns structured data like:
- Summary - Clean, concise overview
- Topics - Main themes and subjects
- Entities - People, companies, locations mentioned
- Key Facts - Important information and takeaways
- Sentiment - Overall tone (positive/negative/neutral)
- Category - Content classification
- Important Dates - Key dates found in the content
Example output:
```json
{
  "summary": "Article discusses the latest developments in AI technology...",
  "topics": ["artificial intelligence", "machine learning", "tech industry"],
  "entities": ["OpenAI", "San Francisco", "Sam Altman"],
  "sentiment": "positive",
  "key_facts": ["New model released", "Performance improvements", "Beta testing"],
  "category": "technology",
  "confidence_score": 0.92
}
```
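Because the payload is plain JSON, filtering results downstream takes only a few lines. A minimal sketch using the field names from the example above (the raw string here is just that example, abridged):

```python
import json

# The example payload from above, as it might arrive from the extractor
raw = """{
  "summary": "Article discusses the latest developments in AI technology...",
  "topics": ["artificial intelligence", "machine learning", "tech industry"],
  "sentiment": "positive",
  "confidence_score": 0.92
}"""

result = json.loads(raw)

# Keep only extractions the model is reasonably confident about
if result["confidence_score"] >= 0.8:
    print(", ".join(result["topics"]))
```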
⚙️ Environment Setup
You can configure defaults using environment variables:
```shell
export WEBEXTRACT_MODEL="llama3:8b"
export WEBEXTRACT_LLM_PROVIDER="ollama"
export WEBEXTRACT_REQUEST_TIMEOUT="45"
export WEBEXTRACT_MAX_CONTENT="8000"
```
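Settings like these are typically resolved with a simple environment lookup plus a built-in fallback. A sketch of that pattern — the variable names come from the list above, but the helper and the fallback values are illustrative, not the library's actual internals:

```python
import os

def env_setting(name: str, default: str) -> str:
    """Return WEBEXTRACT_<name> from the environment, or a fallback."""
    return os.environ.get(f"WEBEXTRACT_{name}", default)

os.environ["WEBEXTRACT_MODEL"] = "llama3:8b"          # as the export above would set
model = env_setting("MODEL", "llama3:latest")          # found in the environment
timeout = int(env_setting("REQUEST_TIMEOUT", "30"))    # unset here, falls back to 30
print(model, timeout)
```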
🏗 How It Works
- Modern Web Scraping - Uses Playwright to handle JavaScript, SPAs, and modern websites
- Smart Content Processing - Strips ads and navigation so analysis focuses on the main content
- LLM Analysis - Feeds clean content to your chosen LLM for intelligent extraction
- Structured Output - Returns consistent, structured data you can actually use
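Step 2 is where most scrapers earn their keep. As a toy illustration of boilerplate stripping with only the standard library — the real pipeline uses Beautiful Soup and its own heuristics, so treat this purely as a sketch of the idea:

```python
from html.parser import HTMLParser

class MainContentFilter(HTMLParser):
    """Keep visible text, skipping anything inside chrome-like elements."""
    SKIP = {"nav", "script", "style", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped elements
        self.chunks = []    # surviving text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean(html: str) -> str:
    f = MainContentFilter()
    f.feed(html)
    return " ".join(f.chunks)

print(clean("<nav>Menu</nav><article>Main story here.</article><footer>Ads</footer>"))
# → Main story here.
```

Only after this kind of pruning does the content go to the LLM, which keeps prompts short and answers focused.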
🤔 Why I Built This
I was tired of:
- Writing custom scrapers for every website
- Dealing with HTML parsing edge cases
- Manually extracting insights from content
- Working with inconsistent data formats
This tool solves all of that by letting the LLM do the heavy lifting of understanding and structuring content.
🛡 Requirements
- Python 3.8+
- One of:
- Ollama running locally (free, private)
- OpenAI API key (paid, powerful)
- Anthropic API key (paid, great reasoning)
🔧 Advanced Usage
Custom extraction prompts:
```shell
llm-webextract extract "https://example.com" \
  --prompt "Focus on extracting pricing and contact information"
```
Batch processing:
```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

for url in urls:
    result = extractor.extract(url)
    print(url, result.summary)  # process each result as needed
```
Error handling:
```python
from webextract import ExtractionError  # adjust the import path to your version

try:
    result = extractor.extract("https://problematic-site.com")
except ExtractionError as e:
    print(f"Failed to extract: {e}")
```
🤝 Contributing
Found a bug? Want to add a feature? PRs are welcome!
For Contributors:
- 📖 Read our Development Guide for commit conventions, versioning, and release processes
- 🐛 Report bugs by opening an issue with detailed reproduction steps
- 💡 Suggest features by opening a discussion or issue
- 🔧 Submit PRs following our coding standards and commit message format
Quick Start for Contributors:
```shell
# Fork and clone the repo
git clone https://github.com/yourusername/llm-scraper.git
cd llm-scraper

# Install in development mode
pip install -e ".[dev]"

# Run tests and quality checks
python -m pytest && python -m black --check . && python -m flake8 --config .flake8
```
- Fork the repo
- Create a feature branch
- Make your changes
- Add tests if possible
- Submit a PR
📄 License
MIT License - feel free to use this in your projects!
🙏 Thanks
Built with some amazing tools:
- Ollama - Local LLM inference
- Playwright - Modern web scraping
- Beautiful Soup - HTML parsing
- Pydantic - Data validation
- Typer - CLI framework
Got questions? Open an issue - I'm happy to help!
Find this useful? Give it a ⭐ - it really helps!
File details
Details for the file llm_webextract-1.1.1.tar.gz.
File metadata
- Download URL: llm_webextract-1.1.1.tar.gz
- Size: 37.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | ee1c7a8d6a21e8c452677a5d05e62e18360080622bec31243233bd0e8655cc0c |
| MD5 | 8da3d17acf8adad31042083b332291b3 |
| BLAKE2b-256 | 492b197aa79535e6ca8a5e0fc946b2e68e22d9ee01908d8369be5dea5db1e146 |
File details
Details for the file llm_webextract-1.1.1-py3-none-any.whl.
File metadata
- Download URL: llm_webextract-1.1.1-py3-none-any.whl
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | bd55995e05afdcc1cb3ed400626a918c2e5196a3111d30e4c8a9f5d7e0cca8d2 |
| MD5 | 0c3feeecfcfe704211860454f89ba651 |
| BLAKE2b-256 | 8375c650a793bc61bf7d31b6e36a11d58b333b345f4d9ff51c38f253f00f95ea |