A comprehensive web scraping and content summarization library with AI-powered features
Scrape and Summarize
A comprehensive web scraping and content summarization library that combines Google/DuckDuckGo search with web scraping and AI-powered summarization using Google Gemini.
Features
- Multi-Engine Search: Combines Google (via Serper API) and DuckDuckGo search results
- Advanced Web Scraping: Uses Playwright for robust, JavaScript-enabled web scraping
- AI-Powered Summarization: Leverages Google Gemini AI for intelligent content summarization
- Parallel Processing: Concurrent scraping and summarization for improved performance
- Retry Mechanisms: Built-in retry logic for reliable operations
- Structured Output: Clean JSON output format for easy integration
- Error Handling: Comprehensive error handling and graceful degradation
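Combining results from two search engines implies merging the two URL lists and dropping duplicates before scraping. The library's internal logic is not shown in this README, so the following is only a minimal sketch of one way this could work. The names `normalize_url` and `merge_results`, and the normalization rule itself, are assumptions made for illustration and are not part of the ScraperSage API.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Build a comparison key: lowercase scheme and host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_results(google_urls, duckduckgo_urls, max_urls=8):
    """Merge two URL lists in order, keeping the first occurrence of each page.

    Hypothetical helper: max_urls mirrors the documented default of 8.
    """
    seen = set()
    merged = []
    for url in list(google_urls) + list(duckduckgo_urls):
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            merged.append(url)
    return merged[:max_urls]
```

Keeping the first occurrence preserves the ranking of whichever engine is listed first, which is a design choice of this sketch rather than documented behavior.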
Installation
From Source (Development)
- Clone or download this repository
- Navigate to the project directory
- Install the package in development mode:
pip install -e .
Install Dependencies
pip install -r requirements.txt
Install Playwright Browsers (Required)
playwright install chromium
Quick Start
import os
import json
from ScraperSage import scrape_and_summarize
# Set your API keys
os.environ["SERPER_API_KEY"] = "your_serper_api_key"
os.environ["GEMINI_API_KEY"] = "your_gemini_api_key"
# Initialize scrape_and_summarize
scraper = scrape_and_summarize()
# Define search parameters
params = {
    "query": "AI in healthcare",
    "max_results": 5,
    "save_to_file": False
}
# Run the scraper
result = scraper.run(params)
# Print results
print(json.dumps(result, indent=2))
API Keys Setup
You need two API keys to use this library:
1. Serper API Key (for Google Search)
- Visit Serper.dev
- Sign up for a free account
- Get your API key from the dashboard
- Set it as the environment variable SERPER_API_KEY
2. Google Gemini API Key
- Visit Google AI Studio
- Create a new API key
- Set it as the environment variable GEMINI_API_KEY
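Before initializing the scraper, it can help to confirm that both environment variables are actually set. The sketch below uses only the standard library. The helper name `missing_api_keys` is hypothetical and is not provided by ScraperSage.

```python
import os

# The two environment variable names documented above.
REQUIRED_KEYS = ("SERPER_API_KEY", "GEMINI_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required API keys that are unset or blank.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]
```

Calling `missing_api_keys()` with no arguments checks the current process environment. An empty list means both keys are present.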
Usage Examples
Basic Usage
from ScraperSage import scrape_and_summarize
import os
# Initialize with API keys
scraper = scrape_and_summarize(
    serper_api_key="your_serper_key",
    gemini_api_key="your_gemini_key"
)
# Basic search
params = {
    "query": "machine learning trends 2024"
}
result = scraper.run(params)
Advanced Configuration
params = {
    "query": "climate change solutions",
    "max_results": 8,      # Maximum search results per engine (default: 5)
    "max_urls": 10,        # Maximum URLs to scrape (default: 8)
    "save_to_file": True   # Save results to JSON file (default: False)
}
result = scraper.run(params)
Error Handling
try:
    scraper = scrape_and_summarize()
    result = scraper.run({"query": "your search query"})
    if result["status"] == "success":
        print(f"Found {result['successfully_scraped']} sources")
        print(f"Summary: {result['overall_summary']}")
    else:
        print(f"Error: {result['message']}")
except ValueError as e:
    print(f"API key error: {e}")
Output Format
The library returns a structured JSON object with the following format:
{
  "status": "success",
  "query": "your search query",
  "timestamp": "2024-01-01 12:00:00",
  "total_sources_found": 10,
  "successfully_scraped": 8,
  "sources": [
    {
      "url": "https://example.com",
      "title": "Page Title",
      "content_preview": "First 200 characters...",
      "individual_summary": "AI-generated summary of this source",
      "scraped": true
    }
  ],
  "failed_sources": [
    {
      "url": "https://failed-example.com",
      "scraped": false
    }
  ],
  "overall_summary": "Comprehensive AI-generated summary of all sources",
  "metadata": {
    "google_results_count": 5,
    "duckduckgo_results_count": 5,
    "total_unique_urls": 10,
    "processing_time": "Real-time processing completed"
  }
}
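As an illustration of consuming this structure, the sketch below turns a result dictionary into a short plain-text report. It relies only on the keys shown above, plus the `message` key used in the Error Handling example. The function name `summarize_result` is hypothetical and is not part of the library.

```python
def summarize_result(result: dict) -> str:
    """Build a short plain-text report from a result dictionary."""
    if result.get("status") != "success":
        return f"Run failed: {result.get('message', 'unknown error')}"
    lines = [
        f"Query: {result['query']}",
        f"Scraped {result['successfully_scraped']} of {result['total_sources_found']} sources",
    ]
    # One line per successfully scraped source.
    for source in result.get("sources", []):
        lines.append(f"- {source['title']} ({source['url']})")
    # Failed sources carry only a URL and a scraped flag.
    for failed in result.get("failed_sources", []):
        lines.append(f"- FAILED: {failed['url']}")
    return "\n".join(lines)
```

Passing the return value of `scraper.run(params)` to this function should work as long as the result follows the format documented here.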
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | Required | The search query to process |
| max_results | int | 5 | Maximum number of search results per search engine |
| max_urls | int | 8 | Maximum number of URLs to scrape |
| save_to_file | bool | False | Whether to save results to a JSON file |
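To make the defaults concrete, here is a minimal sketch of a helper that fills in the documented defaults and rejects obviously invalid values before calling `run`. The helper `build_params` and its validation rules are assumptions for illustration. The README does not say whether the library performs similar checks itself.

```python
# Defaults taken from the parameter table above.
DEFAULTS = {"max_results": 5, "max_urls": 8, "save_to_file": False}


def build_params(query, **overrides):
    """Return a params dict with documented defaults filled in (hypothetical helper)."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {"query": query.strip(), **DEFAULTS, **overrides}
    for name in ("max_results", "max_urls"):
        value = params[name]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer")
    if not isinstance(params["save_to_file"], bool):
        raise ValueError("save_to_file must be a bool")
    return params
```

For example, `build_params("climate change solutions", max_urls=10)` yields a dictionary that can be passed to `scraper.run`.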
Requirements
- Python 3.8+
- Internet connection
- Valid Serper API key
- Valid Google Gemini API key
Dependencies
- requests: HTTP requests
- duckduckgo-search: DuckDuckGo search integration
- playwright: Web scraping with browser automation
- google-generativeai: Google Gemini AI integration
- beautifulsoup4: HTML parsing
- tenacity: Retry mechanisms
Error Handling
The library includes comprehensive error handling:
- API Key Validation: Checks for required API keys on initialization
- Network Retry Logic: Automatic retries for failed network requests
- Graceful Degradation: Continues processing even if some sources fail
- Timeout Management: Proper timeouts for web scraping operations
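The retry behavior listed above can be pictured with a small standard-library sketch. The library lists tenacity as a dependency for retries. The function below is not how ScraperSage implements them. It is a hypothetical `retry_with_backoff` helper that shows the general idea of retrying a failing call with exponentially growing delays.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**n and retry.

    At most `attempts` calls are made. The last error is re-raised if all fail.
    `sleep` is injectable so the delays can be observed without waiting.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

In real use, `func` would wrap a single network request, and the caught exception type would usually be narrowed to network errors.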
Performance Considerations
- Uses ThreadPoolExecutor for concurrent scraping
- Limits content size per URL to prevent memory issues
- Implements exponential backoff for retries
- Configurable worker limits for parallel processing
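The points above can be sketched with `concurrent.futures`. This is an illustrative example under stated assumptions, not the library's code: `scrape_all`, the `fetch` callable, and the `max_workers` and `max_chars` defaults are all hypothetical. It shows concurrent fetching with a worker limit, a per-URL content cap, and failures collected separately instead of aborting the run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, fetch, max_workers=4, max_chars=5000):
    """Run fetch(url) concurrently and return (sources, failed_sources).

    fetch is any callable that returns page text or raises on failure.
    """
    sources, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                text = future.result()
            except Exception:
                # One bad URL does not stop the others.
                failed.append({"url": url, "scraped": False})
            else:
                # Truncate content to bound memory use per URL.
                sources.append({"url": url, "content": text[:max_chars], "scraped": True})
    return sources, failed
```

Because `as_completed` yields futures as they finish, the order of `sources` follows completion order rather than input order.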
Development
Project Structure
ScraperSage/
├── __init__.py
├── scraper_sage.py
├── setup.py
├── requirements.txt
├── README.md
└── example_usage.py
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
For issues, questions, or contributions, please visit the project repository or contact the maintainers.
Changelog
v1.0.0
- Initial release
- Multi-engine search support (Google + DuckDuckGo)
- Playwright-based web scraping
- Google Gemini AI summarization
- Structured JSON output
- Comprehensive error handling