# WebExtract
AI-powered web content extraction using Large Language Models. Extract structured information from any webpage with the power of local or cloud-based LLMs.
## What does it do?

Transform any webpage into structured data:

- **Smart Scraping** - Uses Playwright for reliable scraping of modern websites
- **AI Processing** - Leverages LLMs (Ollama, OpenAI, Anthropic) for intelligent content analysis
- **Structured Output** - Extracts topics, entities, sentiment, summaries, and key information
- **Configurable** - Flexible configuration for different use cases and LLM providers
Perfect for researchers, developers, and anyone who needs to extract meaningful information from web content.
## Quick Start

### Installation

```bash
# Install the package
pip install webextract

# For specific LLM providers (optional)
pip install webextract[openai]     # For OpenAI GPT models
pip install webextract[anthropic]  # For Anthropic Claude models
pip install webextract[all]        # For all providers

# Install browser dependencies
playwright install chromium
```
### Basic Usage

```python
import webextract

# Simple extraction with defaults (requires Ollama)
result = webextract.quick_extract("https://example.com")
print(result.structured_info)

# With OpenAI
result = webextract.extract_with_openai(
    "https://news.bbc.co.uk",
    api_key="sk-...",
)

# With Anthropic
result = webextract.extract_with_anthropic(
    "https://example.com",
    api_key="sk-ant-...",
)
```
### Command Line Interface

```bash
# Extract with default settings
webextract extract "https://example.com"

# Pretty formatted output
webextract extract "https://example.com" --format pretty

# Custom model and prompt
webextract extract "https://example.com" \
    --model llama3:8b \
    --prompt "Focus on extracting contact information and key facts"

# Test your setup
webextract test
```
## Features

- **Modern Web Scraping** - Uses Playwright for reliable scraping of modern websites, including SPAs and JavaScript-heavy sites
- **Robust & Reliable** - Handles errors gracefully, retries failed requests, and works around anti-bot measures
- **Smart Extraction** - Uses your local LLM to understand content and extract meaningful information
- **Fast & Efficient** - Optimized for speed with intelligent content processing and browser automation
- **Beautiful Output** - Clean JSON or rich terminal formatting
- **Highly Configurable** - Customize everything from timeouts to extraction prompts
- **Built-in Monitoring** - Confidence scores and performance metrics included
## Usage Examples

### Python API

```python
from webextract import WebExtractor, ConfigBuilder, ConfigProfiles

# Method 1: Simple usage
extractor = WebExtractor()
result = extractor.extract("https://example.com")

# Method 2: Custom configuration
config = (ConfigBuilder()
          .with_model("llama3:8b")
          .with_custom_prompt("Extract key facts and figures")
          .with_timeout(60)
          .build())
extractor = WebExtractor(config)
result = extractor.extract("https://example.com")

# Method 3: Use pre-built profiles
news_extractor = WebExtractor(ConfigProfiles.news_scraping())
research_extractor = WebExtractor(ConfigProfiles.research_papers())
ecommerce_extractor = WebExtractor(ConfigProfiles.ecommerce())

# Method 4: Different LLM providers
openai_config = (ConfigBuilder()
                 .with_openai(api_key="sk-...", model="gpt-4")
                 .build())
anthropic_config = (ConfigBuilder()
                    .with_anthropic(api_key="sk-ant-...", model="claude-3-sonnet-20240229")
                    .build())
```
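The chained `ConfigBuilder` calls above follow the classic fluent builder pattern. As a design note, here is a minimal, hypothetical sketch of how such a builder can be structured in plain Python; the names and defaults are illustrative assumptions, not the library's actual implementation:

```python
class ExtractConfig:
    """Immutable-ish bundle of extraction settings."""
    def __init__(self, model, prompt, timeout):
        self.model = model
        self.prompt = prompt
        self.timeout = timeout


class ConfigBuilder:
    """Fluent builder: each with_* method mutates state and returns self,
    so calls can be chained before a final build()."""
    def __init__(self):
        self._model = "llama3:8b"  # assumed default
        self._prompt = None
        self._timeout = 30         # assumed default, in seconds

    def with_model(self, model):
        self._model = model
        return self

    def with_custom_prompt(self, prompt):
        self._prompt = prompt
        return self

    def with_timeout(self, seconds):
        self._timeout = seconds
        return self

    def build(self):
        return ExtractConfig(self._model, self._prompt, self._timeout)


config = (ConfigBuilder()
          .with_model("llama3:8b")
          .with_custom_prompt("Extract key facts and figures")
          .with_timeout(60)
          .build())
print(config.timeout)  # 60
```

Returning `self` from every setter is what makes the parenthesized chain in the examples above read as a single expression.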
### Command Line Usage

```bash
# Basic extraction
webextract extract "https://example.com"

# Save to file with pretty formatting
webextract extract "https://example.com" \
    --format pretty \
    --output results.json

# Custom model and settings
webextract extract "https://example.com" \
    --model llama3:8b \
    --max-content 8000 \
    --prompt "Focus on extracting technical information"

# Test connection
webextract test

# Show version
webextract version
```
### Environment Configuration

```bash
# Set via environment variables
export WEBEXTRACT_MODEL="llama3:8b"
export WEBEXTRACT_REQUEST_TIMEOUT="45"
export WEBEXTRACT_MAX_CONTENT="8000"
export WEBEXTRACT_LLM_PROVIDER="ollama"
export WEBEXTRACT_LLM_BASE_URL="http://localhost:11434"
```
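A minimal sketch of how settings like these might be resolved from the environment with fallbacks; the variable names come from the list above, but the default values and the `load_settings` helper are assumptions for illustration, not the library's actual loader:

```python
import os


def load_settings():
    """Read WebExtract-style settings from the environment, with fallbacks."""
    return {
        "model": os.environ.get("WEBEXTRACT_MODEL", "llama3:8b"),
        "request_timeout": int(os.environ.get("WEBEXTRACT_REQUEST_TIMEOUT", "45")),
        "max_content": int(os.environ.get("WEBEXTRACT_MAX_CONTENT", "8000")),
        "llm_provider": os.environ.get("WEBEXTRACT_LLM_PROVIDER", "ollama"),
        "llm_base_url": os.environ.get("WEBEXTRACT_LLM_BASE_URL",
                                       "http://localhost:11434"),
    }


os.environ["WEBEXTRACT_MODEL"] = "gemma3:27b"
settings = load_settings()
print(settings["model"])  # gemma3:27b
```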
## Configuration

You can also customize the behavior using these environment variables:

```bash
export OLLAMA_BASE_URL="http://localhost:11434"
export DEFAULT_MODEL="gemma3:27b"
export REQUEST_TIMEOUT="30"
export MAX_CONTENT_LENGTH="5000"
export REQUEST_DELAY="1.0"
```

Or modify `config/settings.py` directly.
## What Gets Extracted?

The LLM analyzes web content and extracts:

- **Topics & Themes** - Main subjects discussed
- **Entities** - People, organizations, and locations mentioned
- **Key Points** - Important takeaways and facts
- **Sentiment** - Overall tone (positive/negative/neutral)
- **Summary** - Concise overview of the content
- **Metadata** - Title, description, important links
- **Category** - Content classification
- **Important Dates** - Key dates mentioned in the content
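To make the fields above concrete, here is a hypothetical example of what a structured result for a news article might look like. The exact key names and nesting returned by the library may differ; this is only an illustration of the kinds of fields listed above:

```python
import json

# Hypothetical structured output for a single extracted article
structured_info = {
    "summary": "Short overview of the page content.",
    "topics": ["web scraping", "large language models"],
    "entities": {
        "people": ["Jane Doe"],
        "organizations": ["Example Corp"],
        "locations": ["London"],
    },
    "key_points": ["Main takeaway one", "Main takeaway two"],
    "sentiment": "neutral",
    "category": "technology",
    "important_dates": ["2024-02-29"],
}

print(json.dumps(structured_info, indent=2))
```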
## Project Structure

```
webextract/
├── src/
│   ├── models.py          # Data structures
│   ├── scraper.py         # Playwright-based web scraping
│   ├── llm_client.py      # Ollama integration
│   └── extractor.py       # Main coordination
├── config/
│   └── settings.py        # Configuration
├── examples/
│   └── basic_usage.py     # Code examples
├── main.py                # CLI interface
└── requirements.txt       # Dependencies
```
## Technical Highlights

- **Browser Automation**: Uses Playwright for reliable, modern web scraping
- **Dynamic Content**: Handles JavaScript-rendered content and SPAs
- **Smart Rate Limiting**: Respects website resources with configurable delays
- **Error Recovery**: Comprehensive retry logic with exponential backoff
- **Resource Management**: Proper browser lifecycle management
- **Anti-Detection**: Rotates user agents and uses realistic browser behavior
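The retry-with-exponential-backoff behavior described above can be sketched as a small generic helper. This is an illustration of the technique, not the library's internal code:

```python
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=0.1):
    """Call fn(); on failure, wait base_delay * 2**attempt and retry.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...


# Demo: a function that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # ok
```

Doubling the delay on each attempt keeps transient failures cheap to retry while avoiding hammering a struggling server.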
## Contributing

Found a bug? Have an idea? Contributions are welcome!

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Built with Ollama for local LLM processing
- Uses Playwright for modern web scraping
- HTML parsing with Beautiful Soup
- CLI powered by Typer and Rich

If this tool helps you, consider giving it a star!
## File details

Details for the file `llm_webextract-1.0.0.tar.gz`.

### File metadata

- Download URL: llm_webextract-1.0.0.tar.gz
- Upload date:
- Size: 26.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9d6efc00680df9a59f094c2df75e8d3efce81417b64c31a0cca8e77931d85dd7` |
| MD5 | `e5821d654c862e5e706eec2af941cd47` |
| BLAKE2b-256 | `748f933e1adbcbbdbbc4565c05a1739cf6fd7ce9d6768c62b91f0fcd7ede9f37` |
Details for the file `llm_webextract-1.0.0-py3-none-any.whl`.

### File metadata

- Download URL: llm_webextract-1.0.0-py3-none-any.whl
- Upload date:
- Size: 23.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `b04ae1c76fed98003782a2d983afc23c15c3c7ce094d76914c8f6ff6a376bd02` |
| MD5 | `3a36142ed1f8d94e44af7fc7ec077f14` |
| BLAKE2b-256 | `07738f22ec2ae1eadd4b36de19cb231ad4b90484188371d4dddb59e309370df1` |