A comprehensive web scraping and content summarization library with AI-powered features

Scrape and Summarize

ScraperSage combines Google (via the Serper API) and DuckDuckGo search with Playwright-based web scraping and AI-powered summarization using Google Gemini.

Features

Multi-Engine Search: Combines Google (via Serper API) and DuckDuckGo search results
Advanced Web Scraping: Uses Playwright for robust, JavaScript-enabled web scraping
AI-Powered Summarization: Leverages Google Gemini AI for intelligent content summarization
Parallel Processing: Concurrent scraping and summarization for improved performance
Retry Mechanisms: Built-in retry logic for reliable operations
Structured Output: Clean JSON output format for easy integration
Error Handling: Comprehensive error handling and graceful degradation
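
Combining results from two engines implies merging the lists and dropping duplicate URLs before scraping. The sketch below shows one way that step could work. It assumes each search result is a dictionary with a `url` key; the names `normalize_url` and `merge_results` and the normalization rule are illustrative, not taken from the library.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Build a comparison key: lowercase scheme and host, no trailing slash, no fragment."""
    parts = urlsplit(url.strip())
    key = f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"
    if parts.query:
        key += f"?{parts.query}"
    return key


def merge_results(google, duckduckgo, max_urls=8):
    """Merge two result lists, keeping the first occurrence of each URL (hypothetical helper)."""
    seen = set()
    merged = []
    for item in list(google) + list(duckduckgo):
        if len(merged) >= max_urls:
            break  # stop once the scrape budget is reached
        key = normalize_url(item["url"])
        if key in seen:
            continue  # the same page was already returned by the other engine
        seen.add(key)
        merged.append(item)
    return merged
```

Google results are listed first here, so they win when both engines return the same page; that ordering is a choice made for the sketch.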

Installation

From Source (Development)

Clone or download this repository
Navigate to the project directory
Install the package in development mode:

pip install -e .

Install Dependencies

pip install -r requirements.txt

Install Playwright Browsers (Required)

playwright install chromium

Quick Start

import os
import json
from ScraperSage import scrape_and_summarize

# Set your API keys
os.environ["SERPER_API_KEY"] = "your_serper_api_key"
os.environ["GEMINI_API_KEY"] = "your_gemini_api_key"

# Initialize scrape_and_summarize
scraper = scrape_and_summarize()

# Define search parameters
params = {
    "query": "AI in healthcare",
    "max_results": 5,
    "save_to_file": False
}

# Run the scraper
result = scraper.run(params)

# Print results
print(json.dumps(result, indent=2))

API Keys Setup

You need two API keys to use this library:

1. Serper API Key (for Google Search)

Visit Serper.dev
Sign up for a free account
Get your API key from the dashboard
Set as environment variable: SERPER_API_KEY

2. Google Gemini API Key

Visit Google AI Studio
Create a new API key
Set as environment variable: GEMINI_API_KEY
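
Before constructing the scraper, it can help to confirm that both environment variables are actually set. A minimal sketch follows; the helper name `missing_api_keys` is hypothetical and not part of ScraperSage.

```python
import os

# Environment variable names taken from the setup steps above
REQUIRED_KEYS = ("SERPER_API_KEY", "GEMINI_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]
```

Calling `missing_api_keys()` with no argument inspects the current process environment; an empty list means both keys are present.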

Usage Examples

Basic Usage

from ScraperSage import scrape_and_summarize

# Pass the API keys directly instead of setting environment variables
scraper = scrape_and_summarize(
    serper_api_key="your_serper_key",
    gemini_api_key="your_gemini_key"
)

# Basic search
params = {
    "query": "machine learning trends 2024"
}

result = scraper.run(params)

Advanced Configuration

# Reuses the scraper instance created under Basic Usage
params = {
    "query": "climate change solutions",
    "max_results": 8,        # Maximum search results per engine (default: 5)
    "max_urls": 10,          # Maximum URLs to scrape (default: 8)
    "save_to_file": True     # Save results to JSON file (default: False)
}

result = scraper.run(params)
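
This README does not state where `save_to_file` writes its output or how the file is named. If you need control over that, one option is to leave `save_to_file` off and write the returned dictionary yourself. The helper below is a sketch; `save_result`, its `prefix` argument, and the file naming scheme are assumptions made for illustration.

```python
import json
from pathlib import Path


def save_result(result, directory=".", prefix="scrape"):
    """Write a result dictionary to <directory>/<prefix>_<timestamp>.json and return the path."""
    # Turn "2024-01-01 12:00:00" into a filename-safe "2024-01-01_12-00-00"
    stamp = result.get("timestamp", "").replace(" ", "_").replace(":", "-") or "result"
    path = Path(directory) / f"{prefix}_{stamp}.json"
    path.write_text(json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```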

Error Handling

try:
    # Raises ValueError if a required API key is missing
    scraper = scrape_and_summarize()
    result = scraper.run({"query": "your search query"})

    if result["status"] == "success":
        print(f"Found {result['successfully_scraped']} sources")
        print(f"Summary: {result['overall_summary']}")
    else:
        print(f"Error: {result['message']}")

except ValueError as e:
    print(f"API key error: {e}")

Output Format

run() returns a dictionary that serializes to the following JSON structure:

{
  "status": "success",
  "query": "your search query",
  "timestamp": "2024-01-01 12:00:00",
  "total_sources_found": 10,
  "successfully_scraped": 8,
  "sources": [
    {
      "url": "https://example.com",
      "title": "Page Title",
      "content_preview": "First 200 characters...",
      "individual_summary": "AI-generated summary of this source",
      "scraped": true
    }
  ],
  "failed_sources": [
    {
      "url": "https://failed-example.com",
      "scraped": false
    }
  ],
  "overall_summary": "Comprehensive AI-generated summary of all sources",
  "metadata": {
    "google_results_count": 5,
    "duckduckgo_results_count": 5,
    "total_unique_urls": 10,
    "processing_time": "Real-time processing completed"
  }
}
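
To show how the fields above fit together, here is a small sketch that turns a result dictionary into a plain-text report. The function name `describe_result` is hypothetical. It relies only on the keys shown in the sample output, plus the `message` key used in the Error Handling example for non-success results.

```python
def describe_result(result: dict) -> str:
    """Build a short plain-text report from a result dictionary."""
    if result.get("status") != "success":
        return f"Error: {result.get('message', 'unknown error')}"

    lines = [
        f"Query: {result['query']}",
        f"Scraped {result['successfully_scraped']} of {result['total_sources_found']} sources",
    ]
    # One line per successfully scraped source
    for source in result.get("sources", []):
        lines.append(f"- {source['title']} ({source['url']})")
    # Failed sources only carry a URL and a scraped flag
    failed = [entry["url"] for entry in result.get("failed_sources", [])]
    if failed:
        lines.append("Failed: " + ", ".join(failed))
    lines.append("Summary: " + result["overall_summary"])
    return "\n".join(lines)
```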

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | str | Required | The search query to process |
| `max_results` | int | 5 | Maximum number of search results per search engine |
| `max_urls` | int | 8 | Maximum number of URLs to scrape |
| `save_to_file` | bool | False | Whether to save results to a JSON file |
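
The table can be mirrored in a small caller-side helper that fills in the documented defaults and rejects obviously invalid values before a request is made. This is a sketch: `build_params` and its validation rules are assumptions, and the library may or may not perform similar checks internally.

```python
# Defaults copied from the parameter table
DEFAULTS = {"max_results": 5, "max_urls": 8, "save_to_file": False}


def build_params(query, **overrides):
    """Return a params dictionary with defaults applied (hypothetical helper)."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    params = {"query": query.strip(), **DEFAULTS, **overrides}
    for name in ("max_results", "max_urls"):
        value = params[name]
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer")
    if not isinstance(params["save_to_file"], bool):
        raise ValueError("save_to_file must be a boolean")
    return params
```

The returned dictionary has the same shape as the `params` examples above, so it can be passed to `scraper.run(...)` unchanged.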

Requirements

Python 3.8+
Internet connection
Valid Serper API key
Valid Google Gemini API key

Dependencies

requests - HTTP requests
duckduckgo-search - DuckDuckGo search integration
playwright - Web scraping with browser automation
google-generativeai - Google Gemini AI integration
beautifulsoup4 - HTML parsing
tenacity - Retry mechanisms

Error Handling

The library includes comprehensive error handling:

API Key Validation: Checks for required API keys on initialization
Network Retry Logic: Automatic retries for failed network requests
Graceful Degradation: Continues processing even if some sources fail
Timeout Management: Proper timeouts for web scraping operations
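
Graceful degradation can be pictured as a loop that records a failed source and moves on instead of aborting the whole run. The sketch below is not the library's implementation: `scrape_all` and the injected `scrape_one` callable are hypothetical, and the 200-character preview length follows the `content_preview` field in the sample output.

```python
def scrape_all(urls, scrape_one):
    """Scrape each URL, collecting failures instead of raising (hypothetical helper)."""
    sources, failed_sources = [], []
    for url in urls:
        try:
            content = scrape_one(url)
        except Exception:
            # A timeout or network error on one page should not stop the others
            failed_sources.append({"url": url, "scraped": False})
            continue
        sources.append({"url": url, "content_preview": content[:200], "scraped": True})
    return sources, failed_sources
```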

Performance Considerations

Uses ThreadPoolExecutor for concurrent scraping
Limits content size per URL to prevent memory issues
Implements exponential backoff for retries
Configurable worker limits for parallel processing
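
The points above can be sketched with the standard library alone. The library lists tenacity as its retry dependency; this example deliberately uses a hand-written backoff loop instead so that it runs without extra packages. The names `with_backoff` and `fetch_all`, the worker count, the delay values, and the 5000-character limit are all assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_backoff(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)


def fetch_all(urls, fetch, max_workers=4, max_chars=5000):
    """Fetch URLs concurrently and truncate each result to max_chars."""
    def task(url):
        return with_backoff(lambda: fetch(url))[:max_chars]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls
        return dict(zip(urls, pool.map(task, urls)))
```

Passing `sleep` as a parameter keeps the retry helper easy to test, since a test can substitute a function that records delays instead of waiting.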

Development

Project Structure

ScraperSage/
├── __init__.py
├── scraper_sage.py
├── setup.py
├── requirements.txt
├── README.md
└── example_usage.py

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For issues, questions, or contributions, please visit the project repository or contact the maintainers.

Changelog

v1.0.0

Initial release
Multi-engine search support (Google + DuckDuckGo)
Playwright-based web scraping
Google Gemini AI summarization
Structured JSON output
Comprehensive error handling

Release history

1.2.2 - Sep 26, 2025
1.2.1 - Sep 26, 2025
1.2.0 - Sep 26, 2025
1.0.0 - Sep 25, 2025 (this version)

