ScrapeGen

AI-driven web scraping framework

ScrapeGen 🚀 is a powerful Python library that combines web scraping with AI-driven data extraction to collect and structure information from websites efficiently. It leverages Google's Gemini models for intelligent data processing and provides a flexible, configurable framework for web scraping operations.

✨ Features

  • 🤖 AI-Powered Data Extraction: Utilizes Google's Gemini models for intelligent parsing.
  • ⚙️ Configurable Web Scraping: Supports depth control and flexible extraction rules.
  • 📊 Structured Data Modeling: Uses Pydantic for well-defined data structures.
  • 🛡️ Robust Error Handling: Implements retry mechanisms and detailed error reporting.
  • 🔧 Customizable Scraping Configurations: Adjust settings dynamically based on needs.
  • 🌐 Comprehensive URL Handling: Supports both relative and absolute URLs.
  • 📦 Modular Architecture: Ensures clear separation of concerns for maintainability.

📥 Installation

pip install scrapegen

📌 Requirements

  • Python 3.7+
  • Google API Key (for Gemini models)
  • Required Python packages:
    • requests
    • beautifulsoup4
    • langchain
    • langchain-google-genai
    • pydantic
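
The Gemini models require a Google API key (used in the Quick Start below). A common convention, not something ScrapeGen itself mandates, is to keep the key in an environment variable and read it at startup:

import os

# GOOGLE_API_KEY is an illustrative variable name; export it in your
# shell before running, e.g. export GOOGLE_API_KEY="..."
api_key = os.environ["GOOGLE_API_KEY"]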

🚀 Quick Start

from scrapegen import ScrapeGen, CompanyInfo, CompaniesInfo

# Initialize ScrapeGen with your Google API key
scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")

# Define the target URL
url = "https://example.com"

# Scrape and extract company information
companies_data = scraper.scrape(url, CompaniesInfo)

# Display extracted data
for company in companies_data.companies:
    print(f"🏢 Company Name: {company.company_name}")
    print(f"📄 Description: {company.company_description}")

⚙️ Configuration

🔹 ScrapeConfig Options

from scrapegen import ScrapeConfig

config = ScrapeConfig(
    max_pages=20,      # Max pages to scrape per depth level
    max_subpages=2,    # Max subpages to scrape per page
    max_depth=1,       # Max depth to follow links
    timeout=30,        # Request timeout in seconds
    retries=3,         # Number of retry attempts
    user_agent="Mozilla/5.0 (compatible; ScrapeGen/1.0)",
    headers=None       # Additional HTTP headers
)

🔄 Updating Configuration

scraper = ScrapeGen(api_key="your-api-key", model="gemini-1.5-pro", config=config)

# Dynamically update configuration
scraper.update_config(max_pages=30, timeout=45)

📌 Custom Data Models

Define Pydantic models to structure extracted data:

from pydantic import BaseModel
from typing import Optional, List

class CustomDataModel(BaseModel):
    title: str
    description: Optional[str] = None  # default so the field is truly optional
    date: str
    tags: List[str]

class CustomDataCollection(BaseModel):
    items: List[CustomDataModel]

# Scrape using the custom model
data = scraper.scrape(url, CustomDataCollection)

🤖 Supported Gemini Models

  • gemini-1.5-flash-8b
  • gemini-1.5-pro
  • gemini-2.0-flash-exp
  • gemini-1.5-flash
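
Any of these identifiers can be passed as the model argument from the Quick Start, e.g. trading accuracy for speed and cost:

# gemini-1.5-flash is a lighter, faster alternative to gemini-1.5-pro
scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-flash")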

⚠️ Error Handling

ScrapeGen provides specific exception classes for detailed error handling:

  • ❗ ScrapeGenError: Base exception class.
  • ⚙️ ConfigurationError: Errors related to scraper configuration.
  • 🕷️ ScrapingError: Issues encountered during web scraping.
  • 🔍 ExtractionError: Problems with AI-driven data extraction.

Example usage:

from scrapegen import ScrapeGenError, ConfigurationError, ScrapingError, ExtractionError

try:
    data = scraper.scrape(url, CustomDataCollection)
except ConfigurationError as e:
    print(f"⚙️ Configuration error: {e}")
except ScrapingError as e:
    print(f"🕷️ Scraping error: {e}")
except ExtractionError as e:
    print(f"🔍 Extraction error: {e}")
except ScrapeGenError as e:
    print(f"❗ Other ScrapeGen error: {e}")

🏗️ Architecture

ScrapeGen follows a modular design for scalability and maintainability; a standalone sketch of how these pieces can fit together follows the list:

  1. 🕷️ WebsiteScraper: Handles core web scraping logic.
  2. 📑 InfoExtractorAi: Performs AI-driven content extraction.
  3. 🤖 LlmManager: Manages interactions with language models.
  4. 🔗 UrlParser: Parses and normalizes URLs.
  5. 📥 ContentExtractor: Extracts structured data from HTML elements.
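
To make the division of labor concrete, here is a minimal, self-contained sketch of the same pipeline shape built directly on requests and beautifulsoup4 (both already dependencies). The class names mirror ScrapeGen's components, but the bodies are illustrative stand-ins rather than the library's actual implementation:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

class UrlParser:
    """Normalizes relative hrefs against a base URL (stand-in)."""
    @staticmethod
    def normalize(base_url: str, href: str) -> str:
        return urljoin(base_url, href)

class ContentExtractor:
    """Pulls visible text out of raw HTML (stand-in)."""
    @staticmethod
    def extract(html: str) -> str:
        return BeautifulSoup(html, "html.parser").get_text("\n", strip=True)

class WebsiteScraper:
    """Fetches a page and hands its content downstream (stand-in)."""
    def __init__(self, timeout: int = 30):
        self.timeout = timeout

    def fetch(self, url: str) -> str:
        response = requests.get(url, timeout=self.timeout)
        response.raise_for_status()
        return ContentExtractor.extract(response.text)

# InfoExtractorAi and LlmManager would then send this text, plus the
# target Pydantic schema, to a Gemini model; that step is omitted here
# because it depends on the library's internals.
print(WebsiteScraper().fetch("https://example.com")[:200])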

✅ Best Practices

1️⃣ Rate Limiting

  • ⏳ Use delays between requests.
  • 📜 Respect robots.txt guidelines (a sketch follows this list).
  • ⚖️ Configure max_pages and max_depth responsibly.
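
A minimal way to honor the first two points with only the standard library, assuming you drive scraper.scrape one URL at a time (nothing below is ScrapeGen API):

import time
import urllib.robotparser

# Ask the site's robots.txt whether our (illustrative) user agent may fetch.
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

url = "https://example.com/companies"
if robots.can_fetch("ScrapeGen/1.0", url):
    # companies_data = scraper.scrape(url, CompaniesInfo)
    time.sleep(2)  # crude politeness delay before the next request
else:
    print("Disallowed by robots.txt; skipping.")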

2️⃣ Error Handling

  • 🔄 Wrap scraping operations in try-except blocks.
  • 📋 Implement proper logging for debugging (see the sketch below).
  • 🔁 Handle network timeouts and retries effectively.
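
A sketch combining the logging and retry advice; it wraps the scrape call from the Quick Start and complements, rather than replaces, ScrapeGen's own retries setting:

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scrapegen-job")

def scrape_with_logging(scraper, url, schema, attempts=3):
    """Log each failure and back off exponentially between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return scraper.scrape(url, schema)
        except Exception:
            # In real code, catch ScrapingError / ExtractionError instead.
            logger.exception("Attempt %d/%d failed for %s", attempt, attempts, url)
            time.sleep(2 ** attempt)
    return None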

3️⃣ Resource Management

  • 🖥️ Monitor memory usage for large-scale operations.
  • 📚 Implement pagination for large datasets.
  • ⏱️ Adjust timeout settings based on expected response times.

🤝 Contributing

Contributions are welcome! 🎉 Feel free to submit a Pull Request to improve ScrapeGen.
