
🤖 LLM WebExtract

AI-Powered Web Content Extraction - Turn any website into structured data using Large Language Models


Ever wanted to extract meaningful information from websites but got tired of parsing HTML and dealing with messy data? LLM WebExtract combines modern web scraping with Large Language Models to intelligently extract structured information from any webpage.

🎯 What Does This Do?

Instead of writing complex parsing rules for every website, this tool:

  1. 🌐 Scrapes webpages using Playwright (handles modern JavaScript sites)
  2. 🧠 Feeds content to AI (local via Ollama, or cloud via OpenAI/Anthropic)
  3. 📊 Returns structured data - topics, entities, summaries, key facts, and more

Think of it as having an AI assistant that reads web pages and summarizes them for you.

โญ Key Features

  • 🔄 Multi-Provider Support: Works with Ollama (local), OpenAI, and Anthropic
  • 🚀 Modern Web Scraping: Handles JavaScript-heavy sites with Playwright
  • 📋 Pre-built Profiles: Ready-made configurations for news, research, and e-commerce
  • 🛡️ Robust Error Handling: Specific exceptions for different failure types
  • ⚡ Batch Processing: Extract from multiple URLs concurrently
  • 🎛️ Flexible Configuration: Environment variables, custom prompts, schemas
  • 💾 Smart Caching: Avoid re-processing the same URLs
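Conceptually, the "smart caching" feature boils down to a URL-keyed memo around the expensive scrape-plus-LLM step. The sketch below illustrates the idea only (it is not the library's actual implementation; `fake_extract` is a hypothetical stand-in for real extraction):

```python
# Illustrative sketch of URL-keyed caching: wrap an extract(url) callable
# so repeated URLs skip the scraping/LLM work entirely.
from functools import wraps


def cache_by_url(extract_fn):
    """Memoize an extract(url) callable on its URL argument."""
    cache = {}

    @wraps(extract_fn)
    def wrapper(url):
        if url not in cache:  # only do the expensive work once per URL
            cache[url] = extract_fn(url)
        return cache[url]

    return wrapper


calls = []


@cache_by_url
def fake_extract(url):
    calls.append(url)  # stand-in for scraping + LLM processing
    return {"url": url, "summary": "..."}


fake_extract("https://example.com")
fake_extract("https://example.com")  # second call is served from the cache
```

A real cache would also want an expiry policy so stale pages are eventually re-fetched.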

🚀 Quick Start

Installation

# Basic installation
pip install llm-webextract
playwright install chromium

# With cloud providers
pip install "llm-webextract[openai]"     # For GPT models
pip install "llm-webextract[anthropic]"  # For Claude models
pip install "llm-webextract[all]"        # Everything

30-Second Example

# Command line (requires local Ollama)
llm-webextract extract "https://news.ycombinator.com"

# Test your setup
llm-webextract test

# Python - Local Ollama
import webextract

result = webextract.quick_extract("https://techcrunch.com")
print(f"Summary: {result.summary}")
print(f"Topics: {result.topics}")

🛠️ Configuration & Usage

Provider Setup

๐Ÿ  Local with Ollama (Free & Private)

from webextract import WebExtractor, ConfigBuilder

extractor = WebExtractor(
    ConfigBuilder()
    .with_ollama("llama3.2")  # or any model you have
    .build()
)

result = extractor.extract("https://example.com")

โ˜๏ธ OpenAI GPT

extractor = WebExtractor(
    ConfigBuilder()
    .with_openai(api_key="sk-...", model="gpt-4o-mini")
    .build()
)

🧠 Anthropic Claude

extractor = WebExtractor(
    ConfigBuilder()
    .with_anthropic(api_key="sk-ant-...", model="claude-3-5-sonnet-20241022")
    .build()
)

Pre-built Profiles

from webextract import ConfigProfiles, WebExtractor

# Optimized for different content types
news_extractor = WebExtractor(ConfigProfiles.news_scraping())
research_extractor = WebExtractor(ConfigProfiles.research_papers())
shop_extractor = WebExtractor(ConfigProfiles.ecommerce())

Environment Variables

Set defaults to avoid repeating configuration:

export WEBEXTRACT_LLM_PROVIDER="openai"
export WEBEXTRACT_MODEL="gpt-4o-mini"
export WEBEXTRACT_API_KEY="sk-your-key"
export WEBEXTRACT_MAX_CONTENT="8000"
export WEBEXTRACT_REQUEST_TIMEOUT="45"
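Configuration loaders for prefixed environment variables like these usually follow a simple get-with-fallback pattern. A minimal sketch of that pattern (the library's real config loading may differ in details):

```python
# Sketch of reading WEBEXTRACT_* environment defaults with fallbacks.
import os

# Normally set in your shell; set here only to make the example runnable.
os.environ["WEBEXTRACT_LLM_PROVIDER"] = "openai"


def env_setting(name, default):
    """Look up WEBEXTRACT_<name>, falling back to a default."""
    return os.environ.get(f"WEBEXTRACT_{name}", default)


provider = env_setting("LLM_PROVIDER", "ollama")          # set above -> "openai"
timeout = int(env_setting("REQUEST_TIMEOUT", "30"))       # unset -> default 30
print(provider, timeout)
```

Explicit `ConfigBuilder` arguments would typically override these environment defaults.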

📊 What You Get Back

The AI analyzes content and returns structured data:

{
  "summary": "Article discusses the latest developments in AI technology...",
  "topics": ["artificial intelligence", "machine learning", "tech industry"],
  "entities": {
    "people": ["Sam Altman", "Satya Nadella"],
    "organizations": ["OpenAI", "Microsoft", "Google"],
    "locations": ["San Francisco", "Silicon Valley"]
  },
  "sentiment": "positive",
  "key_facts": [
    "New model shows 40% improvement in reasoning",
    "Beta testing starts next month",
    "Open source version planned for 2024"
  ],
  "category": "technology",
  "important_dates": ["2024-03-15", "Q2 2024"],
  "statistics": ["40% improvement", "$10B investment"],
  "confidence": 0.89
}
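Because the result is plain JSON, downstream code can treat it as an ordinary dict. For example, flattening the entity lists and filtering on the confidence score (a trimmed copy of the result above):

```python
# Consume the structured result as a plain dict.
import json

result_json = '''{
  "summary": "Article discusses the latest developments in AI technology...",
  "topics": ["artificial intelligence", "machine learning", "tech industry"],
  "entities": {
    "people": ["Sam Altman", "Satya Nadella"],
    "organizations": ["OpenAI", "Microsoft", "Google"],
    "locations": ["San Francisco", "Silicon Valley"]
  },
  "confidence": 0.89
}'''

data = json.loads(result_json)

# Flatten every entity category into one list of names.
all_entities = [name for names in data["entities"].values() for name in names]
print(len(all_entities))  # 7 entities across people/organizations/locations

# A typical quality gate: skip results the model was unsure about.
assert data["confidence"] >= 0.8
```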

🔧 Advanced Usage

Custom Extraction Schema

schema = {
    "product_name": "Extract the main product name",
    "price": "Extract the current price",
    "rating": "Extract average rating (number only)",
    "reviews_count": "Extract total number of reviews",
    "key_features": "List main product features"
}

result = extractor.extract_with_custom_schema(
    "https://amazon.com/product/...", 
    schema
)
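With custom schemas it pays to sanity-check that every requested field actually came back non-empty. A small illustrative helper (`missing_fields` is not part of the library API):

```python
# Verify a custom-schema result covers every field the schema asked for.
schema = {
    "product_name": "Extract the main product name",
    "price": "Extract the current price",
}


def missing_fields(schema, data):
    """Return schema keys that are absent or empty in the extracted data."""
    return [key for key in schema if not data.get(key)]


extracted = {"product_name": "Widget Pro", "price": ""}
print(missing_fields(schema, extracted))  # ['price']
```

Fields that come back empty are good candidates for a retry with a more specific prompt.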

Batch Processing

urls = [
    "https://techcrunch.com/article1",
    "https://venturebeat.com/article2", 
    "https://theverge.com/article3"
]

results = extractor.extract_batch(urls, max_workers=3)
for result in results:
    if result and result.is_successful:
        print(f"{result.url}: {result.get_summary()}")
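Conceptually, batch extraction with `max_workers` maps onto a thread pool. A sketch of that shape (not the library's internals; `fake_extract` is a hypothetical stand-in for `extractor.extract`):

```python
# Concurrency sketch: fan N URLs out across a bounded worker pool.
from concurrent.futures import ThreadPoolExecutor


def fake_extract(url):
    return {"url": url, "ok": True}  # stand-in for a real extraction


urls = ["https://a.example", "https://b.example", "https://c.example"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map preserves the input order of the URLs in its results.
    results = list(pool.map(fake_extract, urls))

print([r["url"] for r in results])
```

Threads suit this workload because each extraction is dominated by network and LLM wait time, not CPU.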

Error Handling

from webextract import (
    WebExtractor, 
    ExtractionError, 
    ScrapingError, 
    LLMError,
    AuthenticationError
)

try:
    result = extractor.extract("https://problematic-site.com")
except AuthenticationError:
    print("Invalid API key")
except ScrapingError as e:
    print(f"Failed to scrape website: {e}")
except LLMError as e:
    print(f"AI processing failed: {e}")
except ExtractionError as e:
    print(f"General extraction error: {e}")
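Transient failures (timeouts, rate limits) are often worth retrying with backoff before giving up. A generic wrapper sketch, not built into the library; in real use you would catch the specific exceptions above (e.g. `ScrapingError`, `LLMError`) rather than bare `Exception`:

```python
# Retry a flaky extraction with exponential backoff.
import time


def extract_with_retry(extract_fn, url, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return extract_fn(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller see the error
            time.sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ...


failures = {"count": 0}


def flaky_extract(url):
    if failures["count"] < 2:  # simulate two transient failures, then success
        failures["count"] += 1
        raise TimeoutError("simulated transient failure")
    return {"url": url}


result = extract_with_retry(flaky_extract, "https://example.com")
```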

Custom Prompts

config = (ConfigBuilder()
    .with_openai("sk-...", "gpt-4")
    .with_custom_prompt("""
        Focus on extracting:
        1. Financial metrics and numbers
        2. Company performance indicators  
        3. Market trends and predictions
        4. Executive quotes and statements
    """)
    .build())

๐Ÿ—๏ธ How It Works

graph LR
    A[URL] --> B[Playwright Scraper]
    B --> C[Content Cleaning]
    C --> D[LLM Processing]
    D --> E[Structured Data]
    
    B --> F[JavaScript Handling]
    C --> G[Ad/Nav Removal]
    D --> H[JSON Validation]
    E --> I[Confidence Scoring]

  1. Modern Web Scraping: Playwright handles JavaScript, SPAs, and modern websites
  2. Intelligent Content Processing: Removes ads, navigation, focuses on main content
  3. AI Analysis: Your chosen LLM extracts structured information
  4. Quality Assurance: Validates output format and calculates confidence scores
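Step 2 amounts to dropping non-content markup before the LLM ever sees the page. A toy illustration of that idea with the standard-library HTML parser (the library's real cleaning is more sophisticated):

```python
# Minimal content cleaner: keep text, skip script/style/nav/header/footer.
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


parser = MainTextExtractor()
parser.feed("<nav>Menu</nav><p>Real article text.</p><script>x()</script>")
print(" ".join(parser.chunks))  # Real article text.
```

Trimming boilerplate this way also keeps token counts (and LLM cost) down.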

🛡️ Requirements

  • Python 3.8+
  • One of:
    • Ollama running locally (free, private)
    • OpenAI API key (paid, powerful)
    • Anthropic API key (paid, great reasoning)

Installing Ollama (Recommended for beginners)

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull llama3.2

# Start the service  
ollama serve

🎯 Use Cases

  • 📰 News Monitoring: Extract key information from news articles
  • 🔬 Research: Process academic papers and technical documents
  • 🛒 E-commerce: Monitor product prices, reviews, specifications
  • 📈 Market Research: Analyze competitor websites and industry trends
  • 📋 Content Curation: Summarize and categorize web content
  • 🤖 AI Training: Generate structured datasets from web content

🧪 Testing Your Setup

# Test connection and model availability
llm-webextract test

# Test with a specific URL
llm-webextract extract "https://example.com" --format pretty

# Check available providers
python -c "
from webextract.core.llm_factory import get_available_providers
import json
print(json.dumps(get_available_providers(), indent=2))
"

๐Ÿค Contributing

We welcome contributions! Here's how to get started:

For Contributors

  • 📖 Read our Development Guide for commit conventions and processes
  • 🐛 Report bugs by opening an issue with detailed reproduction steps
  • 💡 Suggest features through GitHub discussions
  • 🔧 Submit PRs following our coding standards

Quick Start for Development

# Fork and clone
git clone https://github.com/HimashaHerath/webextract.git
cd webextract

# Install in development mode
pip install -e ".[dev]"

# Run tests and quality checks
python -m pytest
python -m black --check .
python -m flake8 --config .flake8

๐Ÿ” Troubleshooting

Common Issues

"Model not available"

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Pull the model if missing
ollama pull llama3.2

"Connection refused"

  • Ensure Ollama is running: ollama serve
  • Check firewall settings
  • Verify the base URL in configuration

"Rate limit exceeded"

  • Add delays between requests
  • Use batch processing with lower concurrency
  • Check your API plan limits
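"Add delays between requests" in practice is a simple throttle that spaces out calls. A sketch (`extract_throttled` is illustrative; the lambda stands in for `extractor.extract`):

```python
# Space out sequential extractions to stay under provider rate limits.
import time


def extract_throttled(extract_fn, urls, delay=0.01):
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to wait before the very first request
            time.sleep(delay)
        results.append(extract_fn(url))
    return results


seen = []
results = extract_throttled(
    lambda u: seen.append(u) or u.upper(),  # record the call, return a value
    ["https://a.example", "https://b.example"],
)
```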

"Content too short"

  • Site might be blocking scrapers
  • Try different user agents
  • Check if site requires JavaScript (we handle this)

📄 License

MIT License - feel free to use this in your projects!

๐Ÿ™ Acknowledgments

Built with these amazing tools:

📞 Support


Got questions? Open an issue - I'm happy to help!
Find this useful? Give it a โญ - it really helps!
