🤖 LLM WebExtract
Turn any website into structured data using the power of AI
Ever wanted to extract meaningful information from websites but got tired of parsing HTML and dealing with messy data? That's exactly why I built this tool. It combines modern web scraping with Large Language Models to intelligently extract structured information from any webpage.
🎯 What does this actually do?
Instead of writing complex parsing rules for every website, this tool:
- Scrapes the webpage using Playwright (handles modern JavaScript sites)
- Feeds the content to an LLM (local via Ollama, or cloud via OpenAI/Anthropic)
- Gets back structured data - topics, entities, summaries, key facts, and more
Think of it as having an AI assistant that reads web pages and summarizes them for you.
🚀 Getting Started
Installation
```shell
pip install llm-webextract
playwright install chromium
```
Want to use OpenAI or Anthropic instead of local models?
```shell
pip install "llm-webextract[openai]"     # For GPT models
pip install "llm-webextract[anthropic]"  # For Claude models
pip install "llm-webextract[all]"        # Everything
```

(The quotes keep shells like zsh from expanding the square brackets.)
Quick Examples
Command Line (easiest way to start):
```shell
# Extract content from any URL
llm-webextract extract "https://news.ycombinator.com"

# Pretty formatted output
llm-webextract extract "https://example.com" --format pretty

# Test your setup
llm-webextract test
```
Python Code:
```python
import webextract

# Simple one-liner (requires Ollama running locally)
result = webextract.quick_extract("https://news.bbc.co.uk")
print(f"Summary: {result.summary}")
print(f"Key topics: {result.topics}")

# Or use a cloud provider
result = webextract.extract_with_openai(
    "https://techcrunch.com",
    api_key="sk-your-key-here",
)
```
🛠 Configuration Options
Using Different LLM Providers
Local with Ollama (default):
```python
from webextract import WebExtractor, ConfigBuilder

extractor = WebExtractor(
    ConfigBuilder()
    .with_model("llama3:8b")  # or any model you have
    .build()
)
```
OpenAI GPT:
```python
extractor = WebExtractor(
    ConfigBuilder()
    .with_openai(api_key="sk-...", model="gpt-4")
    .build()
)
```
Anthropic Claude:
```python
extractor = WebExtractor(
    ConfigBuilder()
    .with_anthropic(api_key="sk-ant-...", model="claude-3-sonnet-20240229")
    .build()
)
```
Pre-built Profiles
I've included some ready-to-use configurations for common scenarios:
```python
from webextract import WebExtractor, ConfigProfiles

# For news articles
news_extractor = WebExtractor(ConfigProfiles.news_scraping())

# For research papers
research_extractor = WebExtractor(ConfigProfiles.research_papers())

# For e-commerce sites
shop_extractor = WebExtractor(ConfigProfiles.ecommerce())
```
📊 What You Get Back
The LLM analyzes the content and returns structured data like:
- Summary - Clean, concise overview
- Topics - Main themes and subjects
- Entities - People, companies, locations mentioned
- Key Facts - Important information and takeaways
- Sentiment - Overall tone (positive/negative/neutral)
- Category - Content classification
- Important Dates - Key dates found in the content
Example output:
```json
{
  "summary": "Article discusses the latest developments in AI technology...",
  "topics": ["artificial intelligence", "machine learning", "tech industry"],
  "entities": ["OpenAI", "San Francisco", "Sam Altman"],
  "sentiment": "positive",
  "key_facts": ["New model released", "Performance improvements", "Beta testing"],
  "category": "technology",
  "confidence_score": 0.92
}
```
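Because the payload is plain JSON, filtering results downstream takes only a few lines. A minimal sketch using the field names from the example above (the raw string here is just that example, abridged):

```python
import json

# The example payload from above, as it might arrive from the extractor
raw = """{
  "summary": "Article discusses the latest developments in AI technology...",
  "topics": ["artificial intelligence", "machine learning", "tech industry"],
  "sentiment": "positive",
  "confidence_score": 0.92
}"""

result = json.loads(raw)

# Keep only extractions the model is reasonably confident about
if result["confidence_score"] >= 0.8:
    print(", ".join(result["topics"]))
```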
⚙️ Environment Setup
You can configure defaults using environment variables:
```shell
export WEBEXTRACT_MODEL="llama3:8b"
export WEBEXTRACT_LLM_PROVIDER="ollama"
export WEBEXTRACT_REQUEST_TIMEOUT="45"
export WEBEXTRACT_MAX_CONTENT="8000"
```
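Settings like these are typically resolved with a simple environment lookup plus a built-in fallback. A sketch of that pattern — the variable names come from the list above, but the helper and the fallback values are illustrative, not the library's actual internals:

```python
import os

def env_setting(name: str, default: str) -> str:
    """Return WEBEXTRACT_<name> from the environment, or a fallback."""
    return os.environ.get(f"WEBEXTRACT_{name}", default)

os.environ["WEBEXTRACT_MODEL"] = "llama3:8b"          # as the export above would set
model = env_setting("MODEL", "llama3:latest")          # found in the environment
timeout = int(env_setting("REQUEST_TIMEOUT", "30"))    # unset here, falls back to 30
print(model, timeout)
```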
🏗 How It Works
- Modern Web Scraping - Uses Playwright to handle JavaScript, SPAs, and modern websites
- Smart Content Processing - Strips ads and navigation so analysis focuses on the main content
- LLM Analysis - Feeds clean content to your chosen LLM for intelligent extraction
- Structured Output - Returns consistent, structured data you can actually use
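Step 2 is where most scrapers earn their keep. As a toy illustration of boilerplate stripping with only the standard library — the real pipeline uses Beautiful Soup and its own heuristics, so treat this purely as a sketch of the idea:

```python
from html.parser import HTMLParser

class MainContentFilter(HTMLParser):
    """Keep visible text, skipping anything inside chrome-like elements."""
    SKIP = {"nav", "script", "style", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped elements
        self.chunks = []    # surviving text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def clean(html: str) -> str:
    f = MainContentFilter()
    f.feed(html)
    return " ".join(f.chunks)

print(clean("<nav>Menu</nav><article>Main story here.</article><footer>Ads</footer>"))
# → Main story here.
```

Only after this kind of pruning does the content go to the LLM, which keeps prompts short and answers focused.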
🤔 Why I Built This
I was tired of:
- Writing custom scrapers for every website
- Dealing with HTML parsing edge cases
- Manually extracting insights from content
- Working with inconsistent data formats
This tool solves all of that by letting the LLM do the heavy lifting of understanding and structuring content.
🛡 Requirements
- Python 3.8+
- One of:
- Ollama running locally (free, private)
- OpenAI API key (paid, powerful)
- Anthropic API key (paid, great reasoning)
🔧 Advanced Usage
Custom extraction prompts:
```shell
llm-webextract extract "https://example.com" \
  --prompt "Focus on extracting pricing and contact information"
```
Batch processing:
```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

for url in urls:
    result = extractor.extract(url)
    print(url, result.summary)  # process each result as needed
```
Error handling:
```python
from webextract import ExtractionError  # adjust the import path to your version

try:
    result = extractor.extract("https://problematic-site.com")
except ExtractionError as e:
    print(f"Failed to extract: {e}")
```
🤝 Contributing
Found a bug? Want to add a feature? PRs are welcome!
For Contributors:
- 📖 Read our Development Guide for commit conventions, versioning, and release processes
- 🐛 Report bugs by opening an issue with detailed reproduction steps
- 💡 Suggest features by opening a discussion or issue
- 🔧 Submit PRs following our coding standards and commit message format
Quick Start for Contributors:
```shell
# Fork and clone the repo
git clone https://github.com/yourusername/llm-scraper.git
cd llm-scraper

# Install in development mode
pip install -e ".[dev]"

# Run tests and quality checks
python -m pytest && python -m black --check . && python -m flake8 --config .flake8
```
- Fork the repo
- Create a feature branch
- Make your changes
- Add tests if possible
- Submit a PR
📄 License
MIT License - feel free to use this in your projects!
🙏 Thanks
Built with some amazing tools:
- Ollama - Local LLM inference
- Playwright - Modern web scraping
- Beautiful Soup - HTML parsing
- Pydantic - Data validation
- Typer - CLI framework
Got questions? Open an issue - I'm happy to help!
Find this useful? Give it a ⭐ - it really helps!
File details
Details for the file llm_webextract-1.1.1.tar.gz.
File metadata
- Download URL: llm_webextract-1.1.1.tar.gz
- Size: 37.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | ee1c7a8d6a21e8c452677a5d05e62e18360080622bec31243233bd0e8655cc0c |
| MD5 | 8da3d17acf8adad31042083b332291b3 |
| BLAKE2b-256 | 492b197aa79535e6ca8a5e0fc946b2e68e22d9ee01908d8369be5dea5db1e146 |
File details
Details for the file llm_webextract-1.1.1-py3-none-any.whl.
File metadata
- Download URL: llm_webextract-1.1.1-py3-none-any.whl
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | bd55995e05afdcc1cb3ed400626a918c2e5196a3111d30e4c8a9f5d7e0cca8d2 |
| MD5 | 0c3feeecfcfe704211860454f89ba651 |
| BLAKE2b-256 | 8375c650a793bc61bf7d31b6e36a11d58b333b345f4d9ff51c38f253f00f95ea |