A comprehensive web scraping and content summarization library with explicit AI provider and model selection (Gemini, OpenAI, OpenRouter, DeepSeek)

ScraperSage

A comprehensive web scraping and content summarization library that combines Google/DuckDuckGo search with web scraping and AI-powered summarization using multiple providers: Gemini, OpenAI, OpenRouter, and DeepSeek.

⚠️ Model specification is now required: there are no default models, so you must choose a provider and a model explicitly.

🚀 Features

  • Multi-Engine Search: Combines Google (via Serper API) and DuckDuckGo search results
  • Advanced Web Scraping: Uses Playwright for robust, JavaScript-enabled web scraping
  • Multiple AI Providers: Support for Gemini, OpenAI, OpenRouter, and DeepSeek
  • Explicit Model Selection: Must specify both provider and model - no defaults
  • Dynamic Model Support: Use any model supported by your chosen provider
  • Parallel Processing: Concurrent scraping and summarization for improved performance
  • Retry Mechanisms: Built-in retry logic for reliable operations
  • Structured Output: Clean JSON output format for easy integration
  • Error Handling: Comprehensive error handling and graceful degradation
  • Configurable Parameters: Flexible configuration for different use cases
  • Real-time Processing: Live status updates during processing
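The internals of the built-in retry logic are not documented in this README. As an illustration of the general pattern only, here is a minimal sketch of retrying a call with exponential backoff. The name `call_with_retry` and its parameters are hypothetical and are not part of the ScraperSage API.

```python
import time


def call_with_retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on an exception, wait and retry, up to `attempts` calls in total.

    Hypothetical helper for illustration; not part of ScraperSage.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles after each failure: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` argument is injectable so the helper can be exercised in tests without actually waiting.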

🤖 Supported AI Providers & Example Models

Important: You must specify both provider and model - there are no default models.

Gemini (Google)

  • gemini-1.5-flash - Fast and efficient
  • gemini-1.5-pro - Most capable model
  • gemini-1.0-pro - Original Gemini model
  • Any other Gemini models as they become available

OpenAI

  • gpt-4o-mini - Faster and cost-effective
  • gpt-4o - Latest and most capable
  • gpt-4-turbo - High performance
  • gpt-3.5-turbo - Cost-effective option
  • Any other OpenAI models as they become available

OpenRouter

  • openai/gpt-4o-mini - GPT-4o mini via OpenRouter
  • anthropic/claude-3.5-sonnet - Anthropic's latest
  • anthropic/claude-3-haiku - Fast Anthropic model
  • meta-llama/llama-3.1-8b-instruct - Meta's Llama
  • Any other models available on OpenRouter

DeepSeek

  • deepseek-chat - General purpose
  • deepseek-coder - Optimized for code
  • Any other DeepSeek models as they become available

📦 Installation

From PyPI (Recommended)

pip install ScraperSage

Install Playwright Browsers (Required)

playwright install chromium

🔑 API Keys Setup

You need API keys for:

  1. Serper API (for Google Search)
  2. Your chosen AI provider (Gemini, OpenAI, OpenRouter, or DeepSeek)

Set Environment Variables

# Required for search
export SERPER_API_KEY="your_serper_api_key"

# Choose your AI provider (set one)
export GEMINI_API_KEY="your_gemini_key"
export OPENAI_API_KEY="your_openai_key" 
export OPENROUTER_API_KEY="your_openrouter_key"
export DEEPSEEK_API_KEY="your_deepseek_key"
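Before constructing a scraper, it can help to check which keys are actually set. The following is a minimal sketch and not part of ScraperSage: the names `PROVIDER_ENV_VARS`, `configured_providers`, and `search_key_present` are hypothetical, and only the environment variable names come from the list above.

```python
import os

# Environment variable names as listed above. The mapping itself is an
# illustrative assumption, not a ScraperSage API.
PROVIDER_ENV_VARS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        provider
        for provider, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]


def search_key_present(env=None):
    """Return True if SERPER_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("SERPER_API_KEY", "").strip())


if __name__ == "__main__":
    print("Search key set:", search_key_present())
    print("Providers with keys:", configured_providers())
```

Both functions accept an optional mapping so they can be checked against a plain dict instead of the real environment.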

📚 Usage Guide

Basic Usage - Provider and Model Required

from ScraperSage import scrape_and_summarize

# ✅ CORRECT: Specify both provider and model
scraper = scrape_and_summarize(provider="gemini", model="gemini-1.5-flash")
result = scraper.run({"query": "AI trends 2024"})

# ✅ CORRECT: Using OpenAI
scraper = scrape_and_summarize(provider="openai", model="gpt-4o-mini")
result = scraper.run({"query": "AI trends 2024"})

# ❌ INCORRECT: This will raise an error
# scraper = scrape_and_summarize()  # Missing provider and model
# scraper = scrape_and_summarize(provider="openai")  # Missing model

Get Available Models

from ScraperSage import get_available_models, get_supported_providers

# See all supported providers
providers = get_supported_providers()
print(f"Providers: {providers}")

# Get example models for each provider
for provider in providers:
    models = get_available_models(provider)
    print(f"\n{provider.upper()} example models:")
    for model_id, description in models.items():
        print(f"  - {model_id}: {description}")

Advanced Configuration

from ScraperSage import scrape_and_summarize

# All parameters with explicit model
params = {
    "query": "machine learning in healthcare",
    "max_results": 8,
    "max_urls": 12,
    "save_to_file": True
}

# Try different providers/models
providers_to_try = [
    {"provider": "gemini", "model": "gemini-1.5-pro"},
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "openrouter", "model": "anthropic/claude-3.5-sonnet"},
    {"provider": "deepseek", "model": "deepseek-chat"}
]

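for config in providers_to_try:
    try:
        scraper = scrape_and_summarize(**config)
        result = scraper.run(params)
        if result["status"] == "success":
            print(f"✅ {config['provider']} with {config['model']} worked!")
            break
        # A non-success status raises no exception, so report it before trying the next option
        print(f"⚠️ {config['provider']}/{config['model']} returned status {result['status']!r}")
    except Exception as e:
        # Initialization or run failure: report it and fall through to the next option
        print(f"❌ {config['provider']}/{config['model']} failed: {e}")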

⚙️ Configuration Parameters

Constructor Parameters (provider and model Required)

| Parameter | Type | Required | Description |
|---|---|---|---|
| provider | str | YES | AI provider: gemini, openai, openrouter, deepseek |
| model | str | YES | Specific model name supported by the provider |
| serper_api_key | str | Optional | Serper API key (uses env var if not provided) |
| provider_api_key | str | Optional | AI provider API key (uses env var if not provided) |

Run Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | Required | The search query to process |
| max_results | int | 5 | Maximum search results per engine (1-20) |
| max_urls | int | 8 | Maximum URLs to scrape (1-50) |
| save_to_file | bool | False | Save results to timestamped JSON file |
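The documented defaults and ranges can be checked before calling `run()`. This is a minimal sketch under stated assumptions: `build_run_params` is a hypothetical helper, not a ScraperSage function, and this README does not say whether the library validates these ranges itself.

```python
def build_run_params(query, max_results=5, max_urls=8, save_to_file=False):
    """Validate run parameters against the documented ranges and return a dict.

    Hypothetical helper; defaults and ranges are taken from the table above.
    """
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    if not 1 <= max_results <= 20:
        raise ValueError("max_results must be between 1 and 20")
    if not 1 <= max_urls <= 50:
        raise ValueError("max_urls must be between 1 and 50")
    return {
        "query": query.strip(),
        "max_results": max_results,
        "max_urls": max_urls,
        "save_to_file": bool(save_to_file),
    }
```

The returned dict has the same shape as the `params` dict passed to `scraper.run()` in the usage examples.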

🚨 Error Handling

Common Errors and Solutions

from ScraperSage import scrape_and_summarize, get_available_models

# ❌ Missing provider
try:
    scraper = scrape_and_summarize(model="gpt-4o")
except ValueError as e:
    print(f"Error: {e}")
    # Shows: Provider is required. Please specify one of: ['gemini', 'openai', 'openrouter', 'deepseek']

# ❌ Missing model
try:
    scraper = scrape_and_summarize(provider="openai")
except ValueError as e:
    print(f"Error: {e}")
    # Shows: Model is required for openai. Example models: ['gpt-4o-mini', 'gpt-4o', 'gpt-4-turbo']

# ✅ Get help with model selection
models = get_available_models("openai")
print(f"Available OpenAI models: {list(models.keys())}")

# ✅ Correct usage
scraper = scrape_and_summarize(provider="openai", model="gpt-4o-mini")

Safe Model Selection Helper

from ScraperSage import scrape_and_summarize

def safe_create_scraper(provider, model_preferences):
    """Try models in order of preference."""
    for model in model_preferences:
        try:
            scraper = scrape_and_summarize(provider=provider, model=model)
            print(f"✅ Successfully initialized {provider} with {model}")
            return scraper
        except Exception as e:
            print(f"❌ {provider}/{model} failed: {e}")
            continue
    
    raise ValueError(f"No working models found for {provider}")

# Usage
openai_preferences = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
scraper = safe_create_scraper("openai", openai_preferences)

💡 Model Selection Strategy

Recommended Approach

from ScraperSage import scrape_and_summarize, get_available_models

def create_scraper_with_fallback(provider_preferences):
    """Create scraper with provider/model fallbacks."""
    
    for provider_config in provider_preferences:
        provider = provider_config["provider"]
        models = provider_config["models"]
        
        print(f"🔍 Trying {provider}...")
        for model in models:
            try:
                scraper = scrape_and_summarize(provider=provider, model=model)
                print(f"✅ Success: {provider}/{model}")
                return scraper
            except Exception as e:
                print(f"❌ Failed: {provider}/{model} - {str(e)[:50]}...")
                continue
    
    raise ValueError("No working provider/model combinations found")

# Define your preferences
preferences = [
    {
        "provider": "openai",
        "models": ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
    },
    {
        "provider": "gemini", 
        "models": ["gemini-1.5-pro", "gemini-1.5-flash"]
    },
    {
        "provider": "openrouter",
        "models": ["anthropic/claude-3.5-sonnet", "openai/gpt-4o-mini"]
    }
]

scraper = create_scraper_with_fallback(preferences)
result = scraper.run({"query": "your search query"})

📊 Benefits of Explicit Model Selection

✅ Advantages

  • No surprises: You always know which model is being used
  • Cost control: Explicitly choose cost-effective models
  • Performance predictability: Know exactly what capabilities you're getting
  • Future-proof: New models don't change existing behavior
  • Debugging: Easier to identify model-specific issues
  • Transparency: Clear model usage in logs and results

📈 Best Practices

  1. Always specify both provider and model
  2. Use get_available_models() to see examples
  3. Implement fallback strategies for reliability
  4. Test models with small queries first
  5. Monitor costs when using premium models
  6. Keep model preferences in configuration files
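For the last practice, one option is to store the preference list as JSON and validate it on load. This is a minimal sketch under stated assumptions: `load_preferences` and the JSON layout are hypothetical, chosen to match the `preferences` list used with `create_scraper_with_fallback` in the Model Selection Strategy section.

```python
import json

# Example file contents; in practice this text would be read from a config file.
EXAMPLE_CONFIG = """
[
  {"provider": "openai", "models": ["gpt-4o", "gpt-4o-mini"]},
  {"provider": "gemini", "models": ["gemini-1.5-pro", "gemini-1.5-flash"]}
]
"""


def load_preferences(text):
    """Parse a JSON list of {"provider": str, "models": [str, ...]} entries.

    Hypothetical helper; not part of ScraperSage.
    """
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("preferences must be a JSON list")
    preferences = []
    for entry in data:
        if not isinstance(entry, dict):
            raise ValueError("each entry must be a JSON object")
        provider = entry.get("provider")
        models = entry.get("models")
        if not isinstance(provider, str) or not provider:
            raise ValueError("each entry needs a non-empty 'provider' string")
        if (
            not isinstance(models, list)
            or not models
            or not all(isinstance(m, str) and m for m in models)
        ):
            raise ValueError(f"'models' for {provider} must be a non-empty list of strings")
        preferences.append({"provider": provider, "models": list(models)})
    return preferences


if __name__ == "__main__":
    for pref in load_preferences(EXAMPLE_CONFIG):
        print(pref["provider"], "->", ", ".join(pref["models"]))
```

Keeping the list outside the code means a model can be swapped without a code change, while the validation catches a malformed file before any API call is made.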

🔄 Changelog

v1.2.0 (Latest)

  • BREAKING CHANGE: Removed default models - provider and model are now required
  • ENHANCED: Explicit error messages when provider/model missing
  • IMPROVED: Better model validation and error handling
  • ADDED: Helper functions for model selection
  • UPDATED: Documentation with explicit usage examples

v1.1.0

  • ✅ Multiple AI provider support (Gemini, OpenAI, OpenRouter, DeepSeek)
  • ✅ Dynamic model support
  • ✅ Provider comparison capabilities

🤝 Contributing

Areas where you can help:

  • 🔧 Add support for more AI providers
  • 🎯 Improve model validation and discovery
  • 📊 Add model performance benchmarking
  • 🧪 Expand test coverage for various models
  • 📚 Add more model selection examples

📄 License

MIT License - see the LICENSE file for details.


Made with ❤️ by AkilLabs

Now requires explicit provider and model selection for better control!

📦 Available on PyPI: https://pypi.org/project/ScraperSage/
