AI-powered web content extraction with Large Language Models

Project description

🤖 LLM WebExtract

Turn any website into structured data using the power of AI

Ever wanted to extract meaningful information from websites but got tired of parsing HTML and dealing with messy data? That's exactly why I built this tool. It combines modern web scraping with Large Language Models to intelligently extract structured information from any webpage.

🎯 What does this actually do?

Instead of writing complex parsing rules for every website, this tool:

  1. Scrapes the webpage using Playwright (handles modern JavaScript sites)
  2. Feeds the content to an LLM (local via Ollama, or cloud via OpenAI/Anthropic)
  3. Gets back structured data - topics, entities, summaries, key facts, and more

Think of it as having an AI assistant that reads web pages and summarizes them for you.

🚀 Getting Started

Installation

pip install llm-webextract
playwright install chromium

Want to use OpenAI or Anthropic instead of local models?

pip install llm-webextract[openai]     # For GPT models
pip install llm-webextract[anthropic]  # For Claude models
pip install llm-webextract[all]        # Everything

Quick Examples

Command Line (easiest way to start):

# Extract content from any URL
llm-webextract extract "https://news.ycombinator.com"

# Pretty formatted output
llm-webextract extract "https://example.com" --format pretty

# Test your setup
llm-webextract test

Python Code:

import webextract

# Simple one-liner (requires Ollama running locally)
result = webextract.quick_extract("https://news.bbc.co.uk")
print(f"Summary: {result.summary}")
print(f"Key topics: {result.topics}")

# Or use cloud providers
result = webextract.extract_with_openai(
    "https://techcrunch.com", 
    api_key="sk-your-key-here"
)

🛠 Configuration Options

Using Different LLM Providers

Local with Ollama (default):

from webextract import WebExtractor, ConfigBuilder

extractor = WebExtractor(
    ConfigBuilder()
    .with_model("llama3:8b")  # or any model you have
    .build()
)

OpenAI GPT:

extractor = WebExtractor(
    ConfigBuilder()
    .with_openai(api_key="sk-...", model="gpt-4")
    .build()
)

Anthropic Claude:

extractor = WebExtractor(
    ConfigBuilder()
    .with_anthropic(api_key="sk-ant-...", model="claude-3-sonnet-20240229")
    .build()
)

Pre-built Profiles

I've included some ready-to-use configurations for common scenarios:

from webextract import ConfigProfiles

# For news articles
news_extractor = WebExtractor(ConfigProfiles.news_scraping())

# For research papers  
research_extractor = WebExtractor(ConfigProfiles.research_papers())

# For e-commerce sites
shop_extractor = WebExtractor(ConfigProfiles.ecommerce())

📊 What You Get Back

The LLM analyzes the content and returns structured data like:

  • Summary - Clean, concise overview
  • Topics - Main themes and subjects
  • Entities - People, companies, locations mentioned
  • Key Facts - Important information and takeaways
  • Sentiment - Overall tone (positive/negative/neutral)
  • Category - Content classification
  • Important Dates - Key dates found in the content

Example output:

{
  "summary": "Article discusses the latest developments in AI technology...",
  "topics": ["artificial intelligence", "machine learning", "tech industry"],
  "entities": ["OpenAI", "San Francisco", "Sam Altman"],
  "sentiment": "positive",
  "key_facts": ["New model released", "Performance improvements", "Beta testing"],
  "category": "technology",
  "confidence_score": 0.92
}
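Since the output is plain JSON, it is easy to consume with nothing but the standard library. A minimal sketch (the `Extraction` dataclass below is illustrative, not part of the library; field names are taken from the example above):

```python
import json
from dataclasses import dataclass, field

@dataclass
class Extraction:
    """Illustrative container mirroring the example output fields."""
    summary: str = ""
    topics: list = field(default_factory=list)
    entities: list = field(default_factory=list)
    sentiment: str = "neutral"
    key_facts: list = field(default_factory=list)
    category: str = ""
    confidence_score: float = 0.0

raw = '''{
  "summary": "Article discusses the latest developments in AI technology...",
  "topics": ["artificial intelligence", "machine learning", "tech industry"],
  "sentiment": "positive",
  "confidence_score": 0.92
}'''

# Missing keys (entities, key_facts, category) fall back to the defaults
data = Extraction(**json.loads(raw))
print(data.sentiment, data.confidence_score)  # positive 0.92
```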

⚙️ Environment Setup

You can configure defaults using environment variables:

export WEBEXTRACT_MODEL="llama3:8b"
export WEBEXTRACT_LLM_PROVIDER="ollama"
export WEBEXTRACT_REQUEST_TIMEOUT="45"
export WEBEXTRACT_MAX_CONTENT="8000"
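For reference, reading these variables follows the usual pattern: explicit configuration beats environment, which beats defaults. A stdlib sketch (variable names come from the list above; the fallback defaults here are illustrative, not the library's actual defaults):

```python
import os

def env_config():
    """Collect WEBEXTRACT_* settings, falling back to illustrative defaults."""
    return {
        "model": os.environ.get("WEBEXTRACT_MODEL", "llama3:8b"),
        "provider": os.environ.get("WEBEXTRACT_LLM_PROVIDER", "ollama"),
        "timeout": int(os.environ.get("WEBEXTRACT_REQUEST_TIMEOUT", "30")),
        "max_content": int(os.environ.get("WEBEXTRACT_MAX_CONTENT", "5000")),
    }
```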

🏗 How It Works

  1. Modern Web Scraping - Uses Playwright to handle JavaScript, SPAs, and modern websites
  2. Smart Content Processing - Removes ads, navigation, and focuses on main content
  3. LLM Analysis - Feeds clean content to your chosen LLM for intelligent extraction
  4. Structured Output - Returns consistent, structured data you can actually use
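To make step 2 concrete, the "remove ads and navigation" idea can be approximated with just the standard library. This is an illustrative sketch, not the library's actual cleaning logic: it drops `script`, `style`, and navigation subtrees and keeps the remaining visible text:

```python
from html.parser import HTMLParser

class MainContentFilter(HTMLParser):
    """Strip boilerplate tags and collect the visible text that remains."""
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside skipped subtrees
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean(html: str) -> str:
    f = MainContentFilter()
    f.feed(html)
    return " ".join(f.chunks)

print(clean("<nav>Menu</nav><article>Real content</article><script>x()</script>"))
# Real content
```

In practice the cleaned text is what gets sent to the LLM, which keeps prompts short and focused on the main content.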

🤔 Why I Built This

I was tired of:

  • Writing custom scrapers for every website
  • Dealing with HTML parsing edge cases
  • Manually extracting insights from content
  • Working with inconsistent data formats

This tool solves all of that by letting the LLM do the heavy lifting of understanding and structuring content.

🛡 Requirements

  • Python 3.8+
  • One of:
    • Ollama running locally (free, private)
    • OpenAI API key (paid, powerful)
    • Anthropic API key (paid, great reasoning)

🔧 Advanced Usage

Custom extraction prompts:

llm-webextract extract "https://example.com" \
  --prompt "Focus on extracting pricing and contact information"

Batch processing:

urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
for url in urls:
    result = extractor.extract(url)
    # Process each result
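Since each URL is independent, batch jobs can also run concurrently. A sketch with `concurrent.futures`; the `extract_one` stub below stands in for a real `extractor.extract` call:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_one(url):
    """Stand-in for extractor.extract(url); returns a (url, result) pair."""
    return url, f"result for {url}"

urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(extract_one, urls))

print(len(results))  # 3
```

Keep `max_workers` modest: every worker drives its own page load and LLM call, so a large pool mostly shifts the bottleneck rather than removing it.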

Error handling:

from webextract import ExtractionError  # assuming ExtractionError is exported at the top level

try:
    result = extractor.extract("https://problematic-site.com")
except ExtractionError as e:
    print(f"Failed to extract: {e}")

🤝 Contributing

Found a bug? Want to add a feature? PRs are welcome!

  1. Fork the repo
  2. Create a feature branch
  3. Make your changes
  4. Add tests if possible
  5. Submit a PR

📄 License

MIT License - feel free to use this in your projects!

🙏 Thanks

Built on some amazing tools: Playwright for browser automation, plus Ollama, OpenAI, and Anthropic on the LLM side.

Got questions? Open an issue - I'm happy to help!

Find this useful? Give it a ⭐ - it really helps!

Download files

Source distribution: llm_webextract-1.0.3.tar.gz (23.4 kB)

Built distribution: llm_webextract-1.0.3-py3-none-any.whl (24.9 kB, Python 3)

File details: llm_webextract-1.0.3.tar.gz

  • Size: 23.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

Hashes:

Algorithm    Hash digest
SHA256       68ffb29aa256b57dfe9c4924ace4ad814a9f308c9406c00180f4a7e31a46e19a
MD5          aeeb628542ec5b079e8bbee346d915b6
BLAKE2b-256  e4070fab6e627acd466088f9467a3aea4a9c7bc218062a469ebd608728d1cce6

File details: llm_webextract-1.0.3-py3-none-any.whl

  • Size: 24.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

Hashes:

Algorithm    Hash digest
SHA256       423176510da9ef5e53b3f53dde110ab47b35fa13e9f19c0849ac569b315feb8b
MD5          bbda8a62072e1fab62cfedd46605c3d2
BLAKE2b-256  1937f4686f5d31a47806af5179c60a321afeefa224b5ad6abf920f174d84da15
