ScrapeGen

AI-driven web scraping framework

ScrapeGen 🚀 is a powerful Python library that combines web scraping with AI-driven data extraction to collect and structure information from websites efficiently. It leverages Google's Gemini models for intelligent data processing and provides a flexible, configurable framework for web scraping operations.

✨ Features

  • 🤖 AI-Powered Data Extraction: Utilizes Google's Gemini models for intelligent parsing
  • ⚙️ Configurable Web Scraping: Supports depth control and flexible extraction rules
  • 📊 Structured Data Modeling: Uses Pydantic for well-defined data structures
  • 🛡️ Robust Error Handling: Implements retry mechanisms and detailed error reporting
  • 🔧 Customizable Scraping Configurations: Adjust settings dynamically based on needs
  • 🌐 Comprehensive URL Handling: Supports both relative and absolute URLs
  • 📦 Modular Architecture: Ensures clear separation of concerns for maintainability

📥 Installation

pip install scrapegen

📌 Requirements

  • Python 3.7+
  • Google API Key (for Gemini models); see the snippet after this list for loading it from the environment
  • Required Python packages:
    • requests
    • beautifulsoup4
    • langchain
    • langchain-google-genai
    • pydantic
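
The examples below pass the API key directly for brevity. Keeping the key out of source code is safer; a minimal sketch, assuming the key is stored in an environment variable (the name GOOGLE_API_KEY is a convention, not a ScrapeGen requirement):

import os

from scrapegen import ScrapeGen

# Read the key from the environment instead of hard-coding it.
api_key = os.environ["GOOGLE_API_KEY"]
scraper = ScrapeGen(api_key=api_key, model="gemini-1.5-pro")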

🚀 Quick Start with Custom Prompts

from scrapegen import ScrapeGen, CompanyInfo, CompaniesInfo

# Initialize ScrapeGen with your Google API key
scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")

# Define target URL and custom prompt
url = "https://example.com"
custom_prompt = """
Analyze the website content and extract:
- Company name
- Core technologies
- Industry focus areas
- Key product features
"""

# Scrape with custom prompt and model
companies_data = scraper.scrape(
    url=url,
    prompt=custom_prompt,
    base_model=CompaniesInfo
)

# Display extracted data
for company in companies_data.companies:
    print(f"🏢 {company.company_name}")
    print(f"🔧 Technologies: {', '.join(company.core_technologies)}")
    print(f"📈 Focus Areas: {', '.join(company.industry_focus)}")

⚙️ Configuration

🔹 ScrapeConfig Options

from scrapegen import ScrapeConfig

config = ScrapeConfig(
    max_pages=20,      # Max pages to scrape per depth level
    max_subpages=2,    # Max subpages to scrape per page
    max_depth=1,       # Max depth to follow links
    timeout=30,        # Request timeout in seconds
    retries=3,         # Number of retry attempts
    user_agent="Mozilla/5.0 (compatible; ScrapeGen/1.0)",
    headers=None       # Additional HTTP headers
)

🔄 Updating Configuration

scraper = ScrapeGen(api_key="your-api-key", model="gemini-1.5-pro", config=config, verbose=False)

# Dynamically update configuration
scraper.update_config(max_pages=30, timeout=45)

📌 Custom Data Models

Define Pydantic models to structure extracted data:

from pydantic import BaseModel
from typing import Optional, List

class CustomDataModel(BaseModel):
    title: str
    description: Optional[str] = None
    date: str
    tags: List[str]

class CustomDataCollection(BaseModel):
    items: List[CustomDataModel]

# Scrape using the custom model
data = scraper.scrape(url, CustomDataCollection)
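
The returned object is an instance of the collection model, so the extracted records are plain attributes:

for item in data.items:
    print(f"{item.title} ({item.date}): {', '.join(item.tags)}")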

🤖 Supported Gemini Models

  • gemini-1.5-flash-8b
  • gemini-1.5-pro
  • gemini-2.0-flash-exp
  • gemini-1.5-flash
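
Any of these identifiers can be passed as the model argument shown in the Quick Start; switching models is a one-line change:

from scrapegen import ScrapeGen

# Same constructor as before; only the model name differs.
fast_scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-flash")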

🆕 Custom Prompt Engineering Guide

1️⃣ Basic Prompt Structure

basic_prompt = """
Extract the following details from the content:
- Company name
- Founding year
- Headquarters location
- Main products/services
"""

2๏ธโƒฃ Tech-Focused Extraction

tech_prompt = """
Identify and categorize technologies mentioned in the content:
1. AI/ML Technologies
2. Cloud Infrastructure
3. Data Analytics Tools
4. Cybersecurity Solutions
Include version numbers and implementation details where available.
"""

3๏ธโƒฃ Multi-Level Extraction

multi_level_prompt = """
Perform hierarchical extraction:
1. Company Overview:
   - Name
   - Mission statement
   - Key executives
2. Technical Capabilities:
   - Core technologies
   - Development stack
   - Infrastructure
3. Market Position:
   - Competitors
   - Market share
   - Growth metrics
"""

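A hierarchical prompt like this pairs naturally with a nested Pydantic model. A sketch, assuming the field names below (they are illustrative, not part of ScrapeGen):

from typing import List, Optional

from pydantic import BaseModel

# Nested models mirroring the three levels of multi_level_prompt.
class CompanyOverview(BaseModel):
    name: str
    mission_statement: Optional[str] = None
    key_executives: List[str] = []

class TechnicalCapabilities(BaseModel):
    core_technologies: List[str] = []
    development_stack: List[str] = []
    infrastructure: List[str] = []

class MarketPosition(BaseModel):
    competitors: List[str] = []
    market_share: Optional[str] = None
    growth_metrics: List[str] = []

class CompanyProfile(BaseModel):
    overview: CompanyOverview
    technical_capabilities: TechnicalCapabilities
    market_position: MarketPosition

profile = scraper.scrape(url=url, prompt=multi_level_prompt, base_model=CompanyProfile)
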
📌 Specialized Prompt Examples

🔍 Competitive Analysis Prompt

competitor_prompt = """
Identify and compare key competitors:
- Competitor names
- Feature differentiators
- Pricing models
- Market positioning
Output as a comparison table with relative strengths.
"""

🌱 Sustainability-Focused Prompt

green_prompt = """
Extract environmental sustainability information:
1. Green initiatives
2. Carbon reduction targets
3. Eco-friendly technologies
4. Sustainability certifications
5. Renewable energy usage
Prioritize quantitative metrics and timelines.
"""

💡 Innovation Tracking Prompt

innovation_prompt = """
Analyze R&D activities and innovations:
- Recent patents filed
- Research partnerships
- New product launches (last 24 months)
- Technology breakthroughs
- Investment in R&D (% of revenue)
"""

🛠️ Prompt Optimization Tips

  1. Be Specific: Clearly define required fields and formats

    "Format output as JSON with 'company_name', 'employees', 'revenue' keys"
    
  2. Add Context:

    "Analyze content from CEO interviews for strategic priorities"
    
  3. Define Output Structure:

    "Categorize findings under 'basic_info', 'tech_stack', 'growth_metrics'"
    
  4. Set Priorities:

    "Focus on technical specifications over marketing content"
    
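Putting the four tips together, one combined prompt might look like this (illustrative only):

optimized_prompt = """
Analyze content from CEO interviews for strategic priorities.
Focus on technical specifications over marketing content.
Categorize findings under 'basic_info', 'tech_stack', 'growth_metrics'.
Format output as JSON with 'company_name', 'employees', 'revenue' keys.
"""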

⚠️ Error Handling

ScrapeGen provides specific exception classes for detailed error handling:

  • โ— ScrapeGenError: Base exception class
  • โš™๏ธ ConfigurationError: Errors related to scraper configuration
  • ๐Ÿ•ท๏ธ ScrapingError: Issues encountered during web scraping
  • ๐Ÿ” ExtractionError: Problems with AI-driven data extraction

Example usage:

# The exception classes are assumed importable from the package root;
# complex_prompt and MarketAnalysis are placeholders defined elsewhere.
from scrapegen import ConfigurationError, ExtractionError, ScrapingError

try:
    data = scraper.scrape(
        url=url,
        prompt=complex_prompt,
        base_model=MarketAnalysis
    )
except ConfigurationError as e:
    print(f"⚙️ Configuration error: {e}")
except ExtractionError as e:
    print(f"🔍 Extraction failed with custom prompt: {e}")
    print(f"🧠 Prompt used: {complex_prompt}")
except ScrapingError as e:
    print(f"🌐 Scraping error: {e}")

๐Ÿ—๏ธ Architecture

ScrapeGen follows a modular design for scalability and maintainability:

  1. ๐Ÿ•ท๏ธ WebsiteScraper: Handles core web scraping logic
  2. ๐Ÿ“‘ InfoExtractorAi: Performs AI-driven content extraction
  3. ๐Ÿค– LlmManager: Manages interactions with language models
  4. ๐Ÿ”— UrlParser: Parses and normalizes URLs
  5. ๐Ÿ“ฅ ContentExtractor: Extracts structured data from HTML elements
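
Conceptually these components form a pipeline; a rough sketch of the data flow (the actual interfaces are internal and may differ):

# raw URL        --UrlParser-->        normalized URL
# normalized URL --WebsiteScraper-->   fetched HTML pages
# HTML pages     --ContentExtractor--> structured content blocks
# content blocks --InfoExtractorAi + LlmManager--> populated Pydantic model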

✅ Best Practices

1️⃣ Rate Limiting

  • ⏳ Use delays between requests, as sketched below
  • 📜 Respect robots.txt guidelines
  • ⚖️ Configure max_pages and max_depth responsibly
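
ScrapeGen does not document a built-in request delay, so a thin wrapper can add one; a minimal sketch, assuming the scrape() signature shown earlier:

import time

def polite_scrape(scraper, urls, model, delay_s=2.0):
    """Scrape a list of URLs with a fixed pause between requests."""
    results = []
    for target in urls:
        results.append(scraper.scrape(target, model))
        time.sleep(delay_s)  # simple fixed delay between requests
    return results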

2๏ธโƒฃ Error Handling

  • ๐Ÿ”„ Wrap scraping operations in try-except blocks
  • ๐Ÿ“‹ Implement proper logging for debugging
  • ๐Ÿ” Handle network timeouts and retries effectively
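
Combining the first two points, a minimal logging setup around a scrape call (the logger name is arbitrary; ScrapingError is imported as in the Error Handling section):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scrapegen-demo")

try:
    data = scraper.scrape(url, CustomDataCollection)
except ScrapingError:
    # logger.exception records the full traceback for debugging.
    logger.exception("Scraping %s failed", url)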

3๏ธโƒฃ Resource Management

  • ๐Ÿ–ฅ๏ธ Monitor memory usage for large-scale operations
  • ๐Ÿ“š Implement pagination for large datasets
  • โฑ๏ธ Adjust timeout settings based on expected response times

🤝 Contributing

Contributions are welcome! 🎉 Feel free to submit a Pull Request to improve ScrapeGen.
