A comprehensive web scraping and content summarization library with explicit AI provider and model selection (Gemini, OpenAI, OpenRouter, DeepSeek)
ScraperSage
A comprehensive web scraping and content summarization library that combines Google/DuckDuckGo search with web scraping and AI-powered summarization using multiple providers: Gemini, OpenAI, OpenRouter, and DeepSeek.
⚠️ Model specification is now required: there are no default models, so every scraper must be created with an explicit provider and model.
🚀 Features
- Multi-Engine Search: Combines Google (via Serper API) and DuckDuckGo search results
- Advanced Web Scraping: Uses Playwright for robust, JavaScript-enabled web scraping
- Multiple AI Providers: Support for Gemini, OpenAI, OpenRouter, and DeepSeek
- Explicit Model Selection: Must specify both provider and model - no defaults
- Dynamic Model Support: Use any model supported by your chosen provider
- Parallel Processing: Concurrent scraping and summarization for improved performance
- Retry Mechanisms: Built-in retry logic for reliable operations
- Structured Output: Clean JSON output format for easy integration
- Error Handling: Comprehensive error handling and graceful degradation
- Configurable Parameters: Flexible configuration for different use cases
- Real-time Processing: Live status updates during processing
🤖 Supported AI Providers & Example Models
Important: You must specify both provider and model - there are no default models.
Gemini (Google)
- gemini-1.5-flash: Fast and efficient
- gemini-1.5-pro: Most capable model
- gemini-1.0-pro: Original Gemini model
- Any other Gemini models as they become available
OpenAI
- gpt-4o-mini: Faster and cost-effective
- gpt-4o: Latest and most capable
- gpt-4-turbo: High performance
- gpt-3.5-turbo: Cost-effective option
- Any other OpenAI models as they become available
OpenRouter
- openai/gpt-4o-mini: GPT-4o mini via OpenRouter
- anthropic/claude-3.5-sonnet: Anthropic's latest
- anthropic/claude-3-haiku: Fast Anthropic model
- meta-llama/llama-3.1-8b-instruct: Meta's Llama
- Any other models available on OpenRouter
DeepSeek
- deepseek-chat: General purpose
- deepseek-coder: Optimized for code
- Any other DeepSeek models as they become available
📦 Installation
From PyPI (Recommended)
pip install ScraperSage
Install Playwright Browsers (Required)
playwright install chromium
🔑 API Keys Setup
You need API keys for:
- Serper API (for Google Search)
- Your chosen AI provider:
  - Gemini: Google AI Studio
  - OpenAI: OpenAI Platform
  - OpenRouter: OpenRouter
  - DeepSeek: DeepSeek Platform
Set Environment Variables
# Required for search
export SERPER_API_KEY="your_serper_api_key"
# Choose your AI provider (set one)
export GEMINI_API_KEY="your_gemini_key"
export OPENAI_API_KEY="your_openai_key"
export OPENROUTER_API_KEY="your_openrouter_key"
export DEEPSEEK_API_KEY="your_deepseek_key"
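Before constructing a scraper, it can help to check which keys are actually set. The sketch below is not part of ScraperSage: `configured_providers` is a hypothetical helper, and it assumes the variable names listed above are the ones the library reads.

```python
import os

# Environment variable names as listed in this README.
SEARCH_KEY = "SERPER_API_KEY"
PROVIDER_KEYS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key is set and non-empty.

    Hypothetical helper, not a ScraperSage function. Raises RuntimeError
    if the Serper key is missing, since search depends on it.
    """
    env = os.environ if env is None else env
    if not env.get(SEARCH_KEY):
        raise RuntimeError(f"{SEARCH_KEY} is not set")
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]


if __name__ == "__main__":
    try:
        print("Configured providers:", configured_providers())
    except RuntimeError as exc:
        print("Configuration problem:", exc)
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.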
📚 Usage Guide
Basic Usage - Provider and Model Required
from ScraperSage import scrape_and_summarize
# ✅ CORRECT: Specify both provider and model
scraper = scrape_and_summarize(provider="gemini", model="gemini-1.5-flash")
result = scraper.run({"query": "AI trends 2024"})
# ✅ CORRECT: Using OpenAI
scraper = scrape_and_summarize(provider="openai", model="gpt-4o-mini")
result = scraper.run({"query": "AI trends 2024"})
# ❌ INCORRECT: This will raise an error
# scraper = scrape_and_summarize() # Missing provider and model
# scraper = scrape_and_summarize(provider="openai") # Missing model
Get Available Models
from ScraperSage import get_available_models, get_supported_providers

# See all supported providers
providers = get_supported_providers()
print(f"Providers: {providers}")

# Get example models for each provider
for provider in providers:
    models = get_available_models(provider)
    print(f"\n{provider.upper()} example models:")
    for model_id, description in models.items():
        print(f"  - {model_id}: {description}")
Advanced Configuration
from ScraperSage import scrape_and_summarize

# All run parameters, set explicitly
params = {
    "query": "machine learning in healthcare",
    "max_results": 8,
    "max_urls": 12,
    "save_to_file": True
}

# Try different providers/models in order; stop at the first success
providers_to_try = [
    {"provider": "gemini", "model": "gemini-1.5-pro"},
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "openrouter", "model": "anthropic/claude-3.5-sonnet"},
    {"provider": "deepseek", "model": "deepseek-chat"}
]

for config in providers_to_try:
    try:
        scraper = scrape_and_summarize(**config)
        result = scraper.run(params)
        if result["status"] == "success":
            print(f"✅ {config['provider']} with {config['model']} worked!")
            break
    except Exception as e:
        # Initialization or run failed; move on to the next configuration
        print(f"❌ {config['provider']}/{config['model']} failed: {e}")
        continue
⚙️ Configuration Parameters
Constructor Parameters (provider and model are required)
| Parameter | Type | Required | Description |
|---|---|---|---|
| provider | str | ✅ YES | AI provider: gemini, openai, openrouter, deepseek |
| model | str | ✅ YES | Specific model name supported by the provider |
| serper_api_key | str | Optional | Serper API key (uses env var if not provided) |
| provider_api_key | str | Optional | AI provider API key (uses env var if not provided) |
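Per the table, API keys can be passed to the constructor instead of being read from the environment. The following is a minimal sketch of assembling those arguments; `build_constructor_kwargs` is a hypothetical helper, and only the four parameter names come from the table.

```python
def build_constructor_kwargs(provider, model, serper_api_key=None, provider_api_key=None):
    """Assemble keyword arguments for scrape_and_summarize.

    Hypothetical helper. Optional keys are omitted when None so that the
    library can fall back to environment variables, as the table describes.
    """
    if not provider or not model:
        raise ValueError("provider and model are both required")
    kwargs = {"provider": provider, "model": model}
    if serper_api_key is not None:
        kwargs["serper_api_key"] = serper_api_key
    if provider_api_key is not None:
        kwargs["provider_api_key"] = provider_api_key
    return kwargs


kwargs = build_constructor_kwargs(
    "openai", "gpt-4o-mini", provider_api_key="your_openai_key"
)
print(kwargs)

# With ScraperSage installed, the call would then be:
# from ScraperSage import scrape_and_summarize
# scraper = scrape_and_summarize(**kwargs)
```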
Run Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | Required | The search query to process |
| max_results | int | 5 | Maximum search results per engine (1-20) |
| max_urls | int | 8 | Maximum URLs to scrape (1-50) |
| save_to_file | bool | False | Save results to timestamped JSON file |
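The defaults and ranges in the table can be checked before a run is started. This is a sketch only: `build_run_params` is a hypothetical helper, and ScraperSage itself may validate these values differently or not at all.

```python
# Defaults and ranges copied from the Run Parameters table.
DEFAULTS = {"max_results": 5, "max_urls": 8, "save_to_file": False}
RANGES = {"max_results": (1, 20), "max_urls": (1, 50)}


def build_run_params(query, **overrides):
    """Return a params dict for scraper.run(), checked against the table.

    Hypothetical helper, not part of ScraperSage.
    """
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {"query": query.strip(), **DEFAULTS, **overrides}
    for name, (low, high) in RANGES.items():
        value = params[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an int")
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    if not isinstance(params["save_to_file"], bool):
        raise ValueError("save_to_file must be a bool")
    return params


params = build_run_params("machine learning in healthcare", max_results=8, max_urls=12)
print(params)
```

The resulting dict has the same shape as the `params` passed to `scraper.run()` in the usage examples.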
🚨 Error Handling
Common Errors and Solutions
from ScraperSage import scrape_and_summarize, get_available_models

# ❌ Missing provider
try:
    scraper = scrape_and_summarize(model="gpt-4o")
except ValueError as e:
    print(f"Error: {e}")
    # Shows: Provider is required. Please specify one of: ['gemini', 'openai', 'openrouter', 'deepseek']

# ❌ Missing model
try:
    scraper = scrape_and_summarize(provider="openai")
except ValueError as e:
    print(f"Error: {e}")
    # Shows: Model is required for openai. Example models: ['gpt-4o-mini', 'gpt-4o', 'gpt-4-turbo']

# ✅ Get help with model selection
models = get_available_models("openai")
print(f"Available OpenAI models: {list(models.keys())}")

# ✅ Correct usage
scraper = scrape_and_summarize(provider="openai", model="gpt-4o-mini")
Safe Model Selection Helper
def safe_create_scraper(provider, model_preferences):
    """Try models in order of preference."""
    for model in model_preferences:
        try:
            scraper = scrape_and_summarize(provider=provider, model=model)
            print(f"✅ Successfully initialized {provider} with {model}")
            return scraper
        except Exception as e:
            print(f"❌ {provider}/{model} failed: {e}")
            continue
    raise ValueError(f"No working models found for {provider}")

# Usage
openai_preferences = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
scraper = safe_create_scraper("openai", openai_preferences)
💡 Model Selection Strategy
Recommended Approach
from ScraperSage import scrape_and_summarize, get_available_models

def create_scraper_with_fallback(provider_preferences):
    """Create scraper with provider/model fallbacks."""
    for provider_config in provider_preferences:
        provider = provider_config["provider"]
        models = provider_config["models"]
        print(f"🔍 Trying {provider}...")
        for model in models:
            try:
                scraper = scrape_and_summarize(provider=provider, model=model)
                print(f"✅ Success: {provider}/{model}")
                return scraper
            except Exception as e:
                print(f"❌ Failed: {provider}/{model} - {str(e)[:50]}...")
                continue
    raise ValueError("No working provider/model combinations found")

# Define your preferences
preferences = [
    {
        "provider": "openai",
        "models": ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
    },
    {
        "provider": "gemini",
        "models": ["gemini-1.5-pro", "gemini-1.5-flash"]
    },
    {
        "provider": "openrouter",
        "models": ["anthropic/claude-3.5-sonnet", "openai/gpt-4o-mini"]
    }
]

scraper = create_scraper_with_fallback(preferences)
result = scraper.run({"query": "your search query"})
📊 Benefits of Explicit Model Selection
✅ Advantages
- No surprises: You always know which model is being used
- Cost control: Explicitly choose cost-effective models
- Performance predictability: Know exactly what capabilities you're getting
- Future-proof: New models don't change existing behavior
- Debugging: Easier to identify model-specific issues
- Transparency: Clear model usage in logs and results
📈 Best Practices
- Always specify both provider and model
- Use get_available_models() to see examples
- Implement fallback strategies for reliability
- Test models with small queries first
- Monitor costs when using premium models
- Keep model preferences in configuration files
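One way to keep model preferences in a configuration file is a small JSON document in the same shape as the `preferences` list from the Model Selection Strategy section. The sketch below assumes that shape; `load_preferences` is a hypothetical helper, not a ScraperSage function, and the JSON is inlined here only so the example is self-contained.

```python
import json


def load_preferences(text):
    """Parse provider/model preferences from a JSON string.

    Hypothetical helper. Expects a list of objects, each with a
    "provider" string and a non-empty "models" list.
    """
    data = json.loads(text)
    if not isinstance(data, list) or not data:
        raise ValueError("expected a non-empty list of provider entries")
    for entry in data:
        if not isinstance(entry, dict):
            raise ValueError("each entry must be a JSON object")
        provider = entry.get("provider")
        models = entry.get("models")
        if not isinstance(provider, str) or not provider:
            raise ValueError("each entry needs a 'provider' string")
        if not isinstance(models, list) or not models:
            raise ValueError(f"{provider}: 'models' must be a non-empty list")
    return data


# In practice this text would be read from a file you maintain.
CONFIG_TEXT = """
[
  {"provider": "openai", "models": ["gpt-4o", "gpt-4o-mini"]},
  {"provider": "gemini", "models": ["gemini-1.5-flash"]}
]
"""

preferences = load_preferences(CONFIG_TEXT)
for entry in preferences:
    print(entry["provider"], "->", ", ".join(entry["models"]))
```

The returned list can be passed directly to a fallback function such as `create_scraper_with_fallback`.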
🔄 Changelog
v1.2.0 (Latest)
- ✅ BREAKING CHANGE: Removed default models - provider and model are now required
- ✅ ENHANCED: Explicit error messages when provider/model missing
- ✅ IMPROVED: Better model validation and error handling
- ✅ ADDED: Helper functions for model selection
- ✅ UPDATED: Documentation with explicit usage examples
v1.1.0
- ✅ Multiple AI provider support (Gemini, OpenAI, OpenRouter, DeepSeek)
- ✅ Dynamic model support
- ✅ Provider comparison capabilities
🤝 Contributing
Areas where you can help:
- 🔧 Add support for more AI providers
- 🎯 Improve model validation and discovery
- 📊 Add model performance benchmarking
- 🧪 Expand test coverage for various models
- 📚 Add more model selection examples
📄 License
MIT License - see the LICENSE file for details.
Made with ❤️ by AkilLabs
Now requires explicit provider and model selection for better control!
📦 Available on PyPI: https://pypi.org/project/ScraperSage/