
๐ŸŒ WebExtract

AI-powered web content extraction using Large Language Models. Extract structured information from any webpage with the power of local or cloud-based LLMs.

✨ What does it do?

Transform any webpage into structured data:

  1. ๐ŸŒ Smart Scraping - Uses Playwright for reliable scraping of modern websites
  2. ๐Ÿค– AI Processing - Leverages LLMs (Ollama, OpenAI, Anthropic) for intelligent content analysis
  3. ๐Ÿ“Š Structured Output - Extracts topics, entities, sentiment, summaries, and key information
  4. ๐ŸŽฏ Configurable - Flexible configuration for different use cases and LLM providers

Perfect for researchers, developers, and anyone who needs to extract meaningful information from web content.

🚀 Quick Start

Installation

# Install the package
pip install webextract

# For specific LLM providers (optional)
pip install webextract[openai]    # For OpenAI GPT models
pip install webextract[anthropic] # For Anthropic Claude models  
pip install webextract[all]       # For all providers

# Install browser dependencies
playwright install chromium

Basic Usage

import webextract

# Simple extraction with defaults (requires Ollama)
result = webextract.quick_extract("https://example.com")
print(result.structured_info)

# With OpenAI
result = webextract.extract_with_openai(
    "https://news.bbc.co.uk", 
    api_key="sk-..."
)

# With Anthropic  
result = webextract.extract_with_anthropic(
    "https://example.com",
    api_key="sk-ant-..."
)

Command Line Interface

# Extract with default settings
webextract extract "https://example.com"

# Pretty formatted output
webextract extract "https://example.com" --format pretty

# Custom model and prompt
webextract extract "https://example.com" \
  --model llama3:8b \
  --prompt "Focus on extracting contact information and key facts"

# Test your setup
webextract test

💡 Features

๐ŸŒ Modern Web Scraping - Uses Playwright for reliable scraping of modern websites, including SPAs and JavaScript-heavy sites

๐Ÿ›ก๏ธ Robust & Reliable - Handles errors gracefully, retries failed requests, and works with anti-bot measures

๐Ÿง  Smart Extraction - Uses your local LLM to understand content and extract meaningful information

โšก Fast & Efficient - Optimized for speed with intelligent content processing and browser automation

๐ŸŽจ Beautiful Output - Clean JSON or rich terminal formatting

๐Ÿ”ง Highly Configurable - Customize everything from timeouts to extraction prompts

๐Ÿ“Š Built-in Monitoring - Confidence scores and performance metrics included

🎯 Usage Examples

Python API

from webextract import WebExtractor, ConfigBuilder, ConfigProfiles

# Method 1: Simple usage
extractor = WebExtractor()
result = extractor.extract("https://example.com")

# Method 2: Custom configuration  
config = (ConfigBuilder()
          .with_model("llama3:8b")
          .with_custom_prompt("Extract key facts and figures")
          .with_timeout(60)
          .build())

extractor = WebExtractor(config)
result = extractor.extract("https://example.com")

# Method 3: Use pre-built profiles
news_extractor = WebExtractor(ConfigProfiles.news_scraping())
research_extractor = WebExtractor(ConfigProfiles.research_papers())
ecommerce_extractor = WebExtractor(ConfigProfiles.ecommerce())

# Method 4: Different LLM providers
openai_config = (ConfigBuilder()
                 .with_openai(api_key="sk-...", model="gpt-4")
                 .build())

anthropic_config = (ConfigBuilder()
                   .with_anthropic(api_key="sk-ant-...", model="claude-3-sonnet-20240229")
                   .build())

Command Line Usage

# Basic extraction
webextract extract "https://example.com"

# Save to file with pretty formatting
webextract extract "https://example.com" \
  --format pretty \
  --output results.json

# Custom model and settings
webextract extract "https://example.com" \
  --model llama3:8b \
  --max-content 8000 \
  --prompt "Focus on extracting technical information"

# Test connection
webextract test

# Show version
webextract version

Environment Configuration

# Set via environment variables
export WEBEXTRACT_MODEL="llama3:8b"
export WEBEXTRACT_REQUEST_TIMEOUT="45"
export WEBEXTRACT_MAX_CONTENT="8000"
export WEBEXTRACT_LLM_PROVIDER="ollama"
export WEBEXTRACT_LLM_BASE_URL="http://localhost:11434"

🛠 Configuration

You can customize the behavior using environment variables:

export OLLAMA_BASE_URL="http://localhost:11434"
export DEFAULT_MODEL="gemma3:27b"
export REQUEST_TIMEOUT="30"
export MAX_CONTENT_LENGTH="5000"
export REQUEST_DELAY="1.0"

Or modify config/settings.py directly.
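
As a rough sketch of how such environment overrides are typically consumed (the variable names and defaults come from the list above; the `Settings` class and `load_settings` helper are illustrative, not the library's actual settings code):

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    """Illustrative settings object mirroring the variables above."""
    base_url: str
    model: str
    request_timeout: int
    max_content_length: int
    request_delay: float


def load_settings() -> Settings:
    # Each value falls back to its documented default when the variable is unset.
    env = os.environ
    return Settings(
        base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        model=env.get("DEFAULT_MODEL", "gemma3:27b"),
        request_timeout=int(env.get("REQUEST_TIMEOUT", "30")),
        max_content_length=int(env.get("MAX_CONTENT_LENGTH", "5000")),
        request_delay=float(env.get("REQUEST_DELAY", "1.0")),
    )
```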

📋 What Gets Extracted?

The LLM analyzes web content and extracts:

  • Topics & Themes - Main subjects discussed
  • Entities - People, organizations, locations mentioned
  • Key Points - Important takeaways and facts
  • Sentiment - Overall tone (positive/negative/neutral)
  • Summary - Concise overview of the content
  • Metadata - Title, description, important links
  • Category - Content classification
  • Important Dates - Key dates mentioned in the content
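
Putting the list above together, `result.structured_info` plausibly looks something like the dictionary below. The exact field names and nesting are defined by the library; this example is an assumption for illustration, not the guaranteed schema:

```python
# Hypothetical shape of `result.structured_info` (field names are
# illustrative assumptions based on the documented extraction targets).
example_structured_info = {
    "topics": ["open source", "web scraping"],
    "entities": {
        "people": ["Jane Doe"],
        "organizations": ["Example Corp"],
        "locations": ["Berlin"],
    },
    "key_points": ["Example Corp released a new scraping tool."],
    "sentiment": "neutral",  # positive / negative / neutral
    "summary": "A short overview of the page content.",
    "category": "technology",
    "important_dates": ["2024-01-15"],
}
```

Downstream code would then consume it like any other dict, e.g. `example_structured_info["sentiment"]`.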

๐Ÿ— Project Structure

webextract/
├── src/
│   ├── models.py          # Data structures
│   ├── scraper.py         # Playwright-based web scraping
│   ├── llm_client.py      # Ollama integration
│   └── extractor.py       # Main coordination
├── config/
│   └── settings.py        # Configuration
├── examples/
│   └── basic_usage.py     # Code examples
├── main.py                # CLI interface
└── requirements.txt       # Dependencies

🚀 Technical Highlights

  • Browser Automation: Uses Playwright for reliable, modern web scraping
  • Dynamic Content: Handles JavaScript-rendered content and SPAs
  • Smart Rate Limiting: Respects website resources with configurable delays
  • Error Recovery: Comprehensive retry logic with exponential backoff
  • Resource Management: Proper browser lifecycle management
  • Anti-Detection: Rotates user agents and uses realistic browser behavior
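
The retry-with-exponential-backoff behaviour described above can be approximated in a few lines of plain Python (a generic sketch, not webextract's actual internals; `fetch_with_retry` is a hypothetical helper):

```python
import time


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0):
    """Call `fetch()`, sleeping base_delay * 2**attempt between failed tries."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The doubling delay gives a transiently failing site time to recover while keeping the first retry fast.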

๐Ÿค Contributing

Found a bug? Have an idea? Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments


โญ If this tool helps you, consider giving it a star!
