A comprehensive web scraping and content summarization library with AI-powered features

Scrape and Summarize

A comprehensive web scraping and content summarization library that combines Google/DuckDuckGo search with web scraping and AI-powered summarization using Google Gemini.

Features

  • Multi-Engine Search: Combines Google (via Serper API) and DuckDuckGo search results
  • Advanced Web Scraping: Uses Playwright for robust, JavaScript-enabled web scraping
  • AI-Powered Summarization: Leverages Google Gemini AI for intelligent content summarization
  • Parallel Processing: Concurrent scraping and summarization for improved performance
  • Retry Mechanisms: Built-in retry logic for reliable operations
  • Structured Output: Clean JSON output format for easy integration
  • Error Handling: Comprehensive error handling and graceful degradation
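
To illustrate what combining results from two search engines can involve, here is a minimal sketch of merging two URL lists and dropping duplicates. It is not the library's actual implementation: the function names `normalize` and `merge_results`, and the choice to treat `http`/`https` and trailing-slash variants as the same page, are assumptions made for this example.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a comparison key (hypothetical rule: ignore scheme,
    host case, and a trailing slash on the path)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def merge_results(google_urls, duckduckgo_urls, max_urls=8):
    """Merge two URL lists, keep first occurrences, and cap the total."""
    seen = set()
    merged = []
    for url in list(google_urls) + list(duckduckgo_urls):
        key = normalize(url)
        if key and key not in seen:
            seen.add(key)
            merged.append(url)
    return merged[:max_urls]


unique_urls = merge_results(
    ["https://Example.com/a/", "https://example.com/b"],
    ["http://example.com/a", "https://example.com/c"],
)
print(unique_urls)
```

Listing the Google results first means they win when both engines return the same page; that ordering is a choice of this sketch only.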

Installation

From Source (Development)

  1. Clone or download this repository
  2. Navigate to the project directory
  3. Install the package in development mode:
pip install -e .

Install Dependencies

pip install -r requirements.txt

Install Playwright Browsers (Required)

playwright install chromium

Quick Start

import os
import json
from ScraperSage import scrape_and_summarize

# Set your API keys
os.environ["SERPER_API_KEY"] = "your_serper_api_key"
os.environ["GEMINI_API_KEY"] = "your_gemini_api_key"

# Create the scraper; with no arguments it picks up the keys set above
scraper = scrape_and_summarize()

# Define search parameters
params = {
    "query": "AI in healthcare",
    "max_results": 5,
    "save_to_file": False
}

# Run the scraper
result = scraper.run(params)

# Print results
print(json.dumps(result, indent=2))

API Keys Setup

You need two API keys to use this library:

1. Serper API Key (for Google Search)

  • Visit Serper.dev
  • Sign up for a free account
  • Get your API key from the dashboard
  • Set as environment variable: SERPER_API_KEY

2. Google Gemini API Key

  • Visit Google AI Studio
  • Create a new API key
  • Set as environment variable: GEMINI_API_KEY
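
Before creating the scraper, it can help to confirm that both environment variables are actually set. The following is a small sketch using only the standard library; the helper name `missing_api_keys` is hypothetical and not part of ScraperSage.

```python
import os

# The two variable names documented above.
REQUIRED_KEYS = ("SERPER_API_KEY", "GEMINI_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Both API keys are set.")
```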

Usage Examples

Basic Usage

from ScraperSage import scrape_and_summarize

# Pass the API keys directly instead of setting environment variables
scraper = scrape_and_summarize(
    serper_api_key="your_serper_key",
    gemini_api_key="your_gemini_key"
)

# Basic search
params = {
    "query": "machine learning trends 2024"
}

result = scraper.run(params)

Advanced Configuration

# Reuses the scraper instance created in Basic Usage
params = {
    "query": "climate change solutions",
    "max_results": 8,        # Maximum search results per engine (default: 5)
    "max_urls": 10,          # Maximum URLs to scrape (default: 8)
    "save_to_file": True     # Save results to JSON file (default: False)
}

result = scraper.run(params)
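
If you leave save_to_file at False and prefer to control where the output goes, the returned result can be written with the standard library. This is a sketch only: `save_result` and the file name `scrape_result.json` are hypothetical and say nothing about the file name or location the library itself uses when save_to_file is True. The sample dictionary stands in for a real result.

```python
import json
import tempfile
from pathlib import Path


def save_result(result, directory, filename="scrape_result.json"):
    """Write a result dictionary to directory/filename and return the path."""
    path = Path(directory) / filename
    path.write_text(
        json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return path


# Stand-in for the dictionary returned by scraper.run(params).
sample = {"status": "success", "query": "climate change solutions", "sources": []}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_result(sample, tmp)
    # Read the file back to confirm it round-trips.
    loaded = json.loads(saved.read_text(encoding="utf-8"))

print(loaded["query"])
```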

Error Handling

try:
    scraper = scrape_and_summarize()
    result = scraper.run({"query": "your search query"})
    
    if result["status"] == "success":
        print(f"Found {result['successfully_scraped']} sources")
        print(f"Summary: {result['overall_summary']}")
    else:
        print(f"Error: {result['message']}")
        
except ValueError as e:
    print(f"API key error: {e}")

Output Format

The run method returns a dictionary that serializes to JSON in the following format:

{
  "status": "success",
  "query": "your search query",
  "timestamp": "2024-01-01 12:00:00",
  "total_sources_found": 10,
  "successfully_scraped": 8,
  "sources": [
    {
      "url": "https://example.com",
      "title": "Page Title",
      "content_preview": "First 200 characters...",
      "individual_summary": "AI-generated summary of this source",
      "scraped": true
    }
  ],
  "failed_sources": [
    {
      "url": "https://failed-example.com",
      "scraped": false
    }
  ],
  "overall_summary": "Comprehensive AI-generated summary of all sources",
  "metadata": {
    "google_results_count": 5,
    "duckduckgo_results_count": 5,
    "total_unique_urls": 10,
    "processing_time": "Real-time processing completed"
  }
}
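
As an example of consuming this structure, the sketch below turns a result into a short plain-text report. The field names come from the format above and the "message" key from the Error Handling example; the function name `summarize_result` and the sample data are made up for illustration.

```python
def summarize_result(result):
    """Build a short plain-text report from a result dictionary."""
    if result.get("status") != "success":
        return f"Run failed: {result.get('message', 'no message provided')}"

    lines = [
        f"Query: {result['query']}",
        f"Scraped {result['successfully_scraped']} of "
        f"{result['total_sources_found']} sources",
    ]
    # List every source that was scraped, then every one that failed.
    for source in result.get("sources", []):
        lines.append(f"- {source['title']} ({source['url']})")
    for failed in result.get("failed_sources", []):
        lines.append(f"- FAILED: {failed['url']}")
    return "\n".join(lines)


# Sample data shaped like the documented output.
sample = {
    "status": "success",
    "query": "your search query",
    "total_sources_found": 10,
    "successfully_scraped": 8,
    "sources": [{"url": "https://example.com", "title": "Page Title"}],
    "failed_sources": [{"url": "https://failed-example.com"}],
}

report = summarize_result(sample)
print(report)
```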

Parameters

Parameter      Type   Default    Description
query          str    Required   The search query to process
max_results    int    5          Maximum number of search results per search engine
max_urls       int    8          Maximum number of URLs to scrape
save_to_file   bool   False      Whether to save results to a JSON file
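
The table above can be mirrored in a small helper that fills in the documented defaults and rejects misspelled keys before calling run. This is a sketch under stated assumptions: `build_params` and `DEFAULTS` are hypothetical names, and the library may validate its parameters differently.

```python
# Defaults copied from the parameter table.
DEFAULTS = {"max_results": 5, "max_urls": 8, "save_to_file": False}


def build_params(query, **overrides):
    """Return a params dictionary with documented defaults applied."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    return {"query": query.strip(), **DEFAULTS, **overrides}


params = build_params("climate change solutions", max_urls=10)
print(params)
```

The resulting dictionary can then be passed to scraper.run(params) as in the usage examples.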

Requirements

  • Python 3.8+
  • Internet connection
  • Valid Serper API key
  • Valid Google Gemini API key

Dependencies

  • requests - HTTP requests
  • duckduckgo-search - DuckDuckGo search integration
  • playwright - Web scraping with browser automation
  • google-generativeai - Google Gemini AI integration
  • beautifulsoup4 - HTML parsing
  • tenacity - Retry mechanisms

Error Handling

The library includes comprehensive error handling:

  • API Key Validation: Checks for required API keys on initialization
  • Network Retry Logic: Automatic retries for failed network requests
  • Graceful Degradation: Continues processing even if some sources fail
  • Timeout Management: Proper timeouts for web scraping operations

Performance Considerations

  • Uses ThreadPoolExecutor for concurrent scraping
  • Limits content size per URL to prevent memory issues
  • Implements exponential backoff for retries
  • Configurable worker limits for parallel processing
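
The points above can be sketched with the standard library alone. This is not the library's code: ScraperSage lists tenacity for retries, while this example hand-rolls the backoff loop so that it runs without extra dependencies. The names `with_backoff`, `fetch_all`, and `fake_fetch`, and the values for `max_workers`, `attempts`, and `max_chars`, are all assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_backoff(func, attempts=3, base_delay=0.01):
    """Call func(), retrying on any exception with doubling delays."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the error
            time.sleep(base_delay * (2 ** attempt))


def fetch_all(urls, fetch, max_workers=4, max_chars=5000):
    """Fetch URLs concurrently; return (successes, failures) dictionaries."""

    def safe_fetch(url):
        try:
            body = with_backoff(lambda: fetch(url))
            return url, body[:max_chars], None  # cap content size per URL
        except Exception as exc:
            return url, None, exc  # record the failure and keep going

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(safe_fetch, urls))

    successes = {url: body for url, body, err in outcomes if err is None}
    failures = {url: str(err) for url, body, err in outcomes if err is not None}
    return successes, failures


def fake_fetch(url):
    """Stand-in for a real page fetch so the example runs offline."""
    if "bad" in url:
        raise ConnectionError("unreachable")
    return f"content of {url}"


ok, failed = fetch_all(
    ["https://example.com/a", "https://example.com/bad"], fake_fetch
)
print(ok)
print(failed)
```

Catching the exception inside each worker is what lets one failing URL be reported without aborting the rest of the batch.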

Development

Project Structure

ScraperSage/
├── __init__.py
├── scraper_sage.py
├── setup.py
├── requirements.txt
├── README.md
└── example_usage.py

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For issues, questions, or contributions, please visit the project repository or contact the maintainers.

Changelog

v1.0.0

  • Initial release
  • Multi-engine search support (Google + DuckDuckGo)
  • Playwright-based web scraping
  • Google Gemini AI summarization
  • Structured JSON output
  • Comprehensive error handling
