A comprehensive web scraping and content summarization library with explicit AI provider and model selection (Gemini, OpenAI, OpenRouter, DeepSeek)

ScraperSage

A comprehensive web scraping and content summarization library that combines Google/DuckDuckGo search with web scraping and AI-powered summarization using multiple providers: Gemini, OpenAI, OpenRouter, and DeepSeek.

⚠️ Model specification is now required. There are no default models, so every scraper must be created with an explicit provider and model.

🚀 Features

Multi-Engine Search: Combines Google (via Serper API) and DuckDuckGo search results
Advanced Web Scraping: Uses Playwright for robust, JavaScript-enabled web scraping
Multiple AI Providers: Support for Gemini, OpenAI, OpenRouter, and DeepSeek
Explicit Model Selection: You must specify both provider and model; there are no defaults
Dynamic Model Support: Use any model supported by your chosen provider
Parallel Processing: Concurrent scraping and summarization for improved performance
Retry Mechanisms: Built-in retry logic for reliable operations
Structured Output: Clean JSON output format for easy integration
Error Handling: Comprehensive error handling and graceful degradation
Configurable Parameters: Flexible configuration for different use cases
Real-time Processing: Live status updates during processing
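
The README does not show how the built-in retry logic works. As a rough caller-side illustration of the general idea, the sketch below retries a callable with exponential backoff. `with_retries` and its parameters are hypothetical names chosen for this example; they are not part of ScraperSage.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Hypothetical helper, not part of ScraperSage. Waits base_delay,
    2 * base_delay, 4 * base_delay, ... between failed tries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

Any zero-argument callable can be wrapped this way, for example a lambda that calls `scraper.run(...)`. Whether a failed run raises or only reports a non-success status is not stated here, so treat that usage as an assumption.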

🤖 Supported AI Providers & Example Models

Important: You must specify both provider and model - there are no default models.

Gemini (Google)

gemini-1.5-flash - Fast and efficient
gemini-1.5-pro - Most capable model
gemini-1.0-pro - Original Gemini model
Any other Gemini models as they become available

OpenAI

gpt-4o-mini - Faster and cost-effective
gpt-4o - Latest and most capable
gpt-4-turbo - High performance
gpt-3.5-turbo - Cost-effective option
Any other OpenAI models as they become available

OpenRouter

openai/gpt-4o-mini - GPT-4o mini via OpenRouter
anthropic/claude-3.5-sonnet - Anthropic's latest
anthropic/claude-3-haiku - Fast Anthropic model
meta-llama/llama-3.1-8b-instruct - Meta's Llama
Any other models available on OpenRouter
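
The OpenRouter examples above all use a `vendor/model` form, unlike the bare model names used with the other providers. A small check like the following can catch a bare name before it reaches the provider. `split_openrouter_id` is a hypothetical helper, and it assumes every OpenRouter id contains a single leading vendor prefix.

```python
def split_openrouter_id(model_id):
    """Split 'vendor/model' into (vendor, model).

    Hypothetical helper; assumes the id format seen in the examples above.
    """
    vendor, sep, name = model_id.partition("/")
    if not (sep and vendor and name):
        raise ValueError(
            f"Expected an id like 'openai/gpt-4o-mini', got {model_id!r}"
        )
    return vendor, name
```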

DeepSeek

deepseek-chat - General purpose
deepseek-coder - Optimized for code
Any other DeepSeek models as they become available

📦 Installation

From PyPI (Recommended)

pip install ScraperSage

Install Playwright Browsers (Required)

playwright install chromium

🔑 API Keys Setup

You need API keys for:

Serper API (for Google Search)
Your chosen AI provider:
- Gemini: Google AI Studio
- OpenAI: OpenAI Platform
- OpenRouter: OpenRouter
- DeepSeek: DeepSeek Platform

Set Environment Variables

# Required for search
export SERPER_API_KEY="your_serper_api_key"

# Choose your AI provider (set one)
export GEMINI_API_KEY="your_gemini_key"
export OPENAI_API_KEY="your_openai_key" 
export OPENROUTER_API_KEY="your_openrouter_key"
export DEEPSEEK_API_KEY="your_deepseek_key"
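
Before creating a scraper, it can help to confirm that the variables above are actually set. The following is a minimal sketch using only the standard library. The variable names come from this section, while `PROVIDER_ENV_VARS` and `missing_keys` are hypothetical names introduced for the example.

```python
import os

# Environment variable names as listed in this section.
PROVIDER_ENV_VARS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def missing_keys(provider, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    if provider not in PROVIDER_ENV_VARS:
        raise ValueError(f"Unknown provider: {provider!r}")
    required = ["SERPER_API_KEY", PROVIDER_ENV_VARS[provider]]
    return [name for name in required if not env.get(name)]
```

An empty list from `missing_keys("openai")` means both the search key and the provider key are present in the current environment.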

📚 Usage Guide

Basic Usage - Provider and Model Required

from ScraperSage import scrape_and_summarize

# ✅ CORRECT: Specify both provider and model
scraper = scrape_and_summarize(provider="gemini", model="gemini-1.5-flash")
result = scraper.run({"query": "AI trends 2024"})

# ✅ CORRECT: Using OpenAI
scraper = scrape_and_summarize(provider="openai", model="gpt-4o-mini")
result = scraper.run({"query": "AI trends 2024"})

# ❌ INCORRECT: This will raise an error
# scraper = scrape_and_summarize()  # Missing provider and model
# scraper = scrape_and_summarize(provider="openai")  # Missing model

Get Available Models

from ScraperSage import get_available_models, get_supported_providers

# See all supported providers
providers = get_supported_providers()
print(f"Providers: {providers}")

# Get example models for each provider
for provider in providers:
    models = get_available_models(provider)
    print(f"\n{provider.upper()} example models:")
    for model_id, description in models.items():
        print(f"  - {model_id}: {description}")

Advanced Configuration

from ScraperSage import scrape_and_summarize

# All run parameters; the model is chosen explicitly below
params = {
    "query": "machine learning in healthcare",
    "max_results": 8,
    "max_urls": 12,
    "save_to_file": True
}

# Provider/model pairs to try, in order of preference
providers_to_try = [
    {"provider": "gemini", "model": "gemini-1.5-pro"},
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "openrouter", "model": "anthropic/claude-3.5-sonnet"},
    {"provider": "deepseek", "model": "deepseek-chat"}
]

for config in providers_to_try:
    label = f"{config['provider']}/{config['model']}"
    try:
        scraper = scrape_and_summarize(**config)
        result = scraper.run(params)
    except Exception as e:
        print(f"❌ {label} failed: {e}")
        continue
    if result["status"] == "success":
        print(f"✅ {label} worked!")
        break
    # A non-success status falls through to the next pair
    print(f"⚠️ {label} returned status {result['status']!r}")

⚙️ Configuration Parameters

Constructor Parameters (provider and model Required)

Parameter	Type	Required	Description
`provider`	str	✅ YES	AI provider: gemini, openai, openrouter, deepseek
`model`	str	✅ YES	Specific model name supported by the provider
`serper_api_key`	str	Optional	Serper API key (uses env var if not provided)
`provider_api_key`	str	Optional	AI provider API key (uses env var if not provided)
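
When keys come from somewhere other than the environment, such as a secrets store, they can be passed through the two optional parameters in the table. The sketch below builds the keyword arguments and leaves out any key that was not supplied, so the documented environment-variable fallback still applies. `constructor_kwargs` is a hypothetical helper; only the four parameter names are taken from the table.

```python
def constructor_kwargs(provider, model, serper_api_key=None, provider_api_key=None):
    """Build constructor keyword arguments, omitting keys that were not given.

    Hypothetical helper; parameter names follow the table above.
    """
    if not provider or not model:
        raise ValueError("provider and model are both required")
    kwargs = {"provider": provider, "model": model}
    if serper_api_key:
        kwargs["serper_api_key"] = serper_api_key
    if provider_api_key:
        kwargs["provider_api_key"] = provider_api_key
    return kwargs
```

The result is meant to be unpacked into the constructor, as in `scrape_and_summarize(**constructor_kwargs("openai", "gpt-4o-mini", provider_api_key=my_key))`, where `my_key` stands for a value you have loaded yourself.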

Run Parameters

Parameter	Type	Default	Description
`query`	str	Required	The search query to process
`max_results`	int	5	Maximum search results per engine (1-20)
`max_urls`	int	8	Maximum URLs to scrape (1-50)
`save_to_file`	bool	False	Save results to timestamped JSON file
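
The defaults and ranges in the table can be checked before a run is started. The following is a minimal sketch of such a check. `build_run_params`, `DEFAULTS`, and `LIMITS` are hypothetical names, and how the library itself treats out-of-range values is not stated in this README.

```python
# Defaults and ranges copied from the Run Parameters table.
DEFAULTS = {"max_results": 5, "max_urls": 8, "save_to_file": False}
LIMITS = {"max_results": (1, 20), "max_urls": (1, 50)}


def build_run_params(query, **overrides):
    """Return a complete run-parameter dict, or raise ValueError."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    params = {"query": query, **DEFAULTS, **overrides}
    for name, (low, high) in LIMITS.items():
        value = params[name]
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"{name} must be an integer between {low} and {high}")
    return params
```

The returned dict has the same shape as the `params` passed to `scraper.run(...)` elsewhere in this guide.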

🚨 Error Handling

Common Errors and Solutions

from ScraperSage import scrape_and_summarize, get_available_models

# ❌ Missing provider
try:
    scraper = scrape_and_summarize(model="gpt-4o")
except ValueError as e:
    print(f"Error: {e}")
    # Shows: Provider is required. Please specify one of: ['gemini', 'openai', 'openrouter', 'deepseek']

# ❌ Missing model
try:
    scraper = scrape_and_summarize(provider="openai")
except ValueError as e:
    print(f"Error: {e}")
    # Shows: Model is required for openai. Example models: ['gpt-4o-mini', 'gpt-4o', 'gpt-4-turbo']

# ✅ Get help with model selection
models = get_available_models("openai")
print(f"Available OpenAI models: {list(models.keys())}")

# ✅ Correct usage
scraper = scrape_and_summarize(provider="openai", model="gpt-4o-mini")

Safe Model Selection Helper

from ScraperSage import scrape_and_summarize

def safe_create_scraper(provider, model_preferences):
    """Try models in order of preference and return the first scraper that initializes."""
    for model in model_preferences:
        try:
            scraper = scrape_and_summarize(provider=provider, model=model)
        except Exception as e:
            print(f"❌ {provider}/{model} failed: {e}")
            continue
        print(f"✅ Successfully initialized {provider} with {model}")
        return scraper

    raise ValueError(f"No working models found for {provider}")

# Usage
openai_preferences = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
scraper = safe_create_scraper("openai", openai_preferences)

💡 Model Selection Strategy

Recommended Approach

from ScraperSage import scrape_and_summarize

def create_scraper_with_fallback(provider_preferences):
    """Create scraper with provider/model fallbacks."""
    
    for provider_config in provider_preferences:
        provider = provider_config["provider"]
        models = provider_config["models"]
        
        print(f"🔍 Trying {provider}...")
        for model in models:
            try:
                scraper = scrape_and_summarize(provider=provider, model=model)
                print(f"✅ Success: {provider}/{model}")
                return scraper
            except Exception as e:
                print(f"❌ Failed: {provider}/{model} - {str(e)[:50]}...")
                continue
    
    raise ValueError("No working provider/model combinations found")

# Define your preferences
preferences = [
    {
        "provider": "openai",
        "models": ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
    },
    {
        "provider": "gemini", 
        "models": ["gemini-1.5-pro", "gemini-1.5-flash"]
    },
    {
        "provider": "openrouter",
        "models": ["anthropic/claude-3.5-sonnet", "openai/gpt-4o-mini"]
    }
]

scraper = create_scraper_with_fallback(preferences)
result = scraper.run({"query": "your search query"})

📊 Benefits of Explicit Model Selection

✅ Advantages

No surprises: You always know which model is being used
Cost control: Explicitly choose cost-effective models
Performance predictability: Know exactly what capabilities you're getting
Future-proof: New models don't change existing behavior
Debugging: Easier to identify model-specific issues
Transparency: Clear model usage in logs and results

📈 Best Practices

Always specify both provider and model
Use get_available_models() to see examples
Implement fallback strategies for reliability
Test models with small queries first
Monitor costs when using premium models
Keep model preferences in configuration files
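
As one way to keep model preferences in a configuration file, the sketch below loads a JSON file holding the same list-of-dicts shape used by `preferences` in the Recommended Approach example. The file format and the name `load_preferences` are assumptions made for this illustration; the library does not prescribe a format.

```python
import json


def load_preferences(path):
    """Load and validate a JSON list of {"provider": str, "models": [str, ...]}.

    Hypothetical helper; the file layout is an assumption, not a library format.
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("Preferences file must contain a JSON list")
    for entry in data:
        if not isinstance(entry, dict) or not isinstance(entry.get("provider"), str):
            raise ValueError(f"Entry needs a 'provider' string: {entry!r}")
        models = entry.get("models")
        if not isinstance(models, list) or not models:
            raise ValueError(f"Entry needs a non-empty 'models' list: {entry!r}")
        if not all(isinstance(m, str) for m in models):
            raise ValueError(f"Model names must be strings: {entry!r}")
    return data
```

The returned list can be passed directly to a fallback function such as `create_scraper_with_fallback`.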

🔄 Changelog

v1.2.0 (Latest)

✅ BREAKING CHANGE: Removed default models - provider and model are now required
✅ ENHANCED: Explicit error messages when provider/model missing
✅ IMPROVED: Better model validation and error handling
✅ ADDED: Helper functions for model selection
✅ UPDATED: Documentation with explicit usage examples

v1.1.0

✅ Multiple AI provider support (Gemini, OpenAI, OpenRouter, DeepSeek)
✅ Dynamic model support
✅ Provider comparison capabilities

🤝 Contributing

Areas where you can help:

🔧 Add support for more AI providers
🎯 Improve model validation and discovery
📊 Add model performance benchmarking
🧪 Expand test coverage for various models
📚 Add more model selection examples

📄 License

MIT License - see the LICENSE file for details.

Made with ❤️ by AkilLabs

Now requires explicit provider and model selection for better control!

📦 Available on PyPI: https://pypi.org/project/ScraperSage/

Release history

1.2.2 - Sep 26, 2025
1.2.1 - Sep 26, 2025
1.2.0 - Sep 26, 2025 (this version)
1.0.0 - Sep 25, 2025

