
๐ŸŒ WebExtract

AI-powered web content extraction using Large Language Models. Extract structured information from any webpage with the power of local or cloud-based LLMs.

✨ What does it do?

Transform any webpage into structured data:

  1. ๐ŸŒ Smart Scraping - Uses Playwright for reliable scraping of modern websites
  2. ๐Ÿค– AI Processing - Leverages LLMs (Ollama, OpenAI, Anthropic) for intelligent content analysis
  3. ๐Ÿ“Š Structured Output - Extracts topics, entities, sentiment, summaries, and key information
  4. ๐ŸŽฏ Configurable - Flexible configuration for different use cases and LLM providers

Perfect for researchers, developers, and anyone who needs to extract meaningful information from web content.

🚀 Quick Start

Installation

# Install the package
pip install webextract

# For specific LLM providers (optional)
pip install webextract[openai]    # For OpenAI GPT models
pip install webextract[anthropic] # For Anthropic Claude models  
pip install webextract[all]       # For all providers

# Install browser dependencies
playwright install chromium

Basic Usage

import webextract

# Simple extraction with defaults (requires Ollama)
result = webextract.quick_extract("https://example.com")
print(result.structured_info)

# With OpenAI
result = webextract.extract_with_openai(
    "https://news.bbc.co.uk", 
    api_key="sk-..."
)

# With Anthropic  
result = webextract.extract_with_anthropic(
    "https://example.com",
    api_key="sk-ant-..."
)

Command Line Interface

# Extract with default settings
webextract extract "https://example.com"

# Pretty formatted output
webextract extract "https://example.com" --format pretty

# Custom model and prompt
webextract extract "https://example.com" \
  --model llama3:8b \
  --prompt "Focus on extracting contact information and key facts"

# Test your setup
webextract test

💡 Features

๐ŸŒ Modern Web Scraping - Uses Playwright for reliable scraping of modern websites, including SPAs and JavaScript-heavy sites

๐Ÿ›ก๏ธ Robust & Reliable - Handles errors gracefully, retries failed requests, and works with anti-bot measures

๐Ÿง  Smart Extraction - Uses your local LLM to understand content and extract meaningful information

โšก Fast & Efficient - Optimized for speed with intelligent content processing and browser automation

๐ŸŽจ Beautiful Output - Clean JSON or rich terminal formatting

๐Ÿ”ง Highly Configurable - Customize everything from timeouts to extraction prompts

๐Ÿ“Š Built-in Monitoring - Confidence scores and performance metrics included

🎯 Usage Examples

Python API

from webextract import WebExtractor, ConfigBuilder, ConfigProfiles

# Method 1: Simple usage
extractor = WebExtractor()
result = extractor.extract("https://example.com")

# Method 2: Custom configuration  
config = (ConfigBuilder()
          .with_model("llama3:8b")
          .with_custom_prompt("Extract key facts and figures")
          .with_timeout(60)
          .build())

extractor = WebExtractor(config)
result = extractor.extract("https://example.com")

# Method 3: Use pre-built profiles
news_extractor = WebExtractor(ConfigProfiles.news_scraping())
research_extractor = WebExtractor(ConfigProfiles.research_papers())
ecommerce_extractor = WebExtractor(ConfigProfiles.ecommerce())

# Method 4: Different LLM providers
openai_config = (ConfigBuilder()
                 .with_openai(api_key="sk-...", model="gpt-4")
                 .build())

anthropic_config = (ConfigBuilder()
                   .with_anthropic(api_key="sk-ant-...", model="claude-3-sonnet-20240229")
                   .build())

Command Line Usage

# Basic extraction
webextract extract "https://example.com"

# Save to file with pretty formatting
webextract extract "https://example.com" \
  --format pretty \
  --output results.json

# Custom model and settings
webextract extract "https://example.com" \
  --model llama3:8b \
  --max-content 8000 \
  --prompt "Focus on extracting technical information"

# Test connection
webextract test

# Show version
webextract version

Environment Configuration

# Set via environment variables
export WEBEXTRACT_MODEL="llama3:8b"
export WEBEXTRACT_REQUEST_TIMEOUT="45"
export WEBEXTRACT_MAX_CONTENT="8000"
export WEBEXTRACT_LLM_PROVIDER="ollama"
export WEBEXTRACT_LLM_BASE_URL="http://localhost:11434"

🛠 Configuration

You can customize the behavior using environment variables:

export OLLAMA_BASE_URL="http://localhost:11434"
export DEFAULT_MODEL="gemma3:27b"
export REQUEST_TIMEOUT="30"
export MAX_CONTENT_LENGTH="5000"
export REQUEST_DELAY="1.0"

Or modify config/settings.py directly.
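
As a rough sketch of how such environment overrides are typically consumed (the variable names and defaults come from the list above; the `Settings` class and `load_settings` helper are illustrative, not the library's actual settings code):

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    """Illustrative settings object mirroring the variables above."""
    base_url: str
    model: str
    request_timeout: int
    max_content_length: int
    request_delay: float


def load_settings() -> Settings:
    # Each value falls back to its documented default when the variable is unset.
    env = os.environ
    return Settings(
        base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        model=env.get("DEFAULT_MODEL", "gemma3:27b"),
        request_timeout=int(env.get("REQUEST_TIMEOUT", "30")),
        max_content_length=int(env.get("MAX_CONTENT_LENGTH", "5000")),
        request_delay=float(env.get("REQUEST_DELAY", "1.0")),
    )
```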

📋 What Gets Extracted?

The LLM analyzes web content and extracts:

  • Topics & Themes - Main subjects discussed
  • Entities - People, organizations, locations mentioned
  • Key Points - Important takeaways and facts
  • Sentiment - Overall tone (positive/negative/neutral)
  • Summary - Concise overview of the content
  • Metadata - Title, description, important links
  • Category - Content classification
  • Important Dates - Key dates mentioned in the content
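
Putting the list above together, `result.structured_info` plausibly looks something like the dictionary below. The exact field names and nesting are defined by the library; this example is an assumption for illustration, not the guaranteed schema:

```python
# Hypothetical shape of `result.structured_info` (field names are
# illustrative assumptions based on the documented extraction targets).
example_structured_info = {
    "topics": ["open source", "web scraping"],
    "entities": {
        "people": ["Jane Doe"],
        "organizations": ["Example Corp"],
        "locations": ["Berlin"],
    },
    "key_points": ["Example Corp released a new scraping tool."],
    "sentiment": "neutral",  # positive / negative / neutral
    "summary": "A short overview of the page content.",
    "category": "technology",
    "important_dates": ["2024-01-15"],
}
```

Downstream code would then consume it like any other dict, e.g. `example_structured_info["sentiment"]`.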

๐Ÿ— Project Structure

webextract/
├── src/
│   ├── models.py          # Data structures
│   ├── scraper.py         # Playwright-based web scraping
│   ├── llm_client.py      # Ollama integration
│   └── extractor.py       # Main coordination
├── config/
│   └── settings.py        # Configuration
├── examples/
│   └── basic_usage.py     # Code examples
├── main.py                # CLI interface
└── requirements.txt       # Dependencies

🚀 Technical Highlights

  • Browser Automation: Uses Playwright for reliable, modern web scraping
  • Dynamic Content: Handles JavaScript-rendered content and SPAs
  • Smart Rate Limiting: Respects website resources with configurable delays
  • Error Recovery: Comprehensive retry logic with exponential backoff
  • Resource Management: Proper browser lifecycle management
  • Anti-Detection: Rotates user agents and uses realistic browser behavior
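
The retry-with-exponential-backoff behaviour described above can be approximated in a few lines of plain Python (a generic sketch, not webextract's actual internals; `fetch_with_retry` is a hypothetical helper):

```python
import time


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0):
    """Call `fetch()`, sleeping base_delay * 2**attempt between failed tries."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The doubling delay gives a transiently failing site time to recover while keeping the first retry fast.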

๐Ÿค Contributing

Found a bug? Have an idea? Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments


โญ If this tool helps you, consider giving it a star!
