AI-driven web scraping framework
ScrapeGen
ScrapeGen is a powerful Python library that combines web scraping with AI-driven data extraction to collect and structure information from websites efficiently. It leverages Google's Gemini models for intelligent data processing and provides a flexible, configurable framework for web scraping operations.
Features
- AI-Powered Data Extraction: Utilizes Google's Gemini models for intelligent parsing
- Configurable Web Scraping: Supports depth control and flexible extraction rules
- Structured Data Modeling: Uses Pydantic for well-defined data structures
- Robust Error Handling: Implements retry mechanisms and detailed error reporting
- Customizable Scraping Configurations: Adjust settings dynamically based on needs
- Comprehensive URL Handling: Supports both relative and absolute URLs
- Modular Architecture: Ensures clear separation of concerns for maintainability
Installation
pip install scrapegen
Requirements
- Python 3.7+
- Google API Key (for Gemini models)
- Required Python packages:
  - requests
  - beautifulsoup4
  - langchain
  - langchain-google-genai
  - pydantic
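pip normally pulls these dependencies in automatically with the package; if you need them explicitly (for example, in a bare environment or when working from a source checkout), they can be installed directly:

pip install requests beautifulsoup4 langchain langchain-google-genai pydantic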
Quick Start with Custom Prompts
from scrapegen import ScrapeGen, CompanyInfo, CompaniesInfo
# Initialize ScrapeGen with your Google API key
scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-pro")
# Define target URL and custom prompt
url = "https://example.com"
custom_prompt = """
Analyze the website content and extract:
- Company name
- Core technologies
- Industry focus areas
- Key product features
"""
# Scrape with custom prompt and model
companies_data = scraper.scrape(
    url=url,
    prompt=custom_prompt,
    base_model=CompaniesInfo
)
# Display extracted data
for company in companies_data.companies:
    print(f"Company: {company.company_name}")
    print(f"Technologies: {', '.join(company.core_technologies)}")
    print(f"Focus Areas: {', '.join(company.industry_focus)}")
Configuration
ScrapeConfig Options
from scrapegen import ScrapeConfig
config = ScrapeConfig(
    max_pages=20,     # Max pages to scrape per depth level
    max_subpages=2,   # Max subpages to scrape per page
    max_depth=1,      # Max depth to follow links
    timeout=30,       # Request timeout in seconds
    retries=3,        # Number of retry attempts
    user_agent="Mozilla/5.0 (compatible; ScrapeGen/1.0)",
    headers=None      # Additional HTTP headers
)
Updating Configuration
scraper = ScrapeGen(api_key="your-api-key", model="gemini-1.5-pro", config=config, verbose=False)
# Dynamically update configuration
scraper.update_config(max_pages=30, timeout=45)
Custom Data Models
Define Pydantic models to structure extracted data:
from pydantic import BaseModel
from typing import Optional, List
class CustomDataModel(BaseModel):
    title: str
    description: Optional[str]
    date: str
    tags: List[str]

class CustomDataCollection(BaseModel):
    items: List[CustomDataModel]
# Scrape using the custom model
data = scraper.scrape(url, CustomDataCollection)
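The returned object is an instance of the collection model, so the extracted items behave like any other Pydantic objects, for example:

for item in data.items:
    print(item.title, item.tags)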
Supported Gemini Models
- gemini-1.5-flash-8b
- gemini-1.5-pro
- gemini-2.0-flash-exp
- gemini-1.5-flash
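Only the model string passed to the constructor changes. For large crawls where cost and latency matter more than extraction quality, a lighter model may be preferable (a sketch, using the constructor shown above):

# Same API key, cheaper/faster model for high-volume scraping
fast_scraper = ScrapeGen(api_key="your-google-api-key", model="gemini-1.5-flash")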
Custom Prompt Engineering Guide
1. Basic Prompt Structure
basic_prompt = """
Extract the following details from the content:
- Company name
- Founding year
- Headquarters location
- Main products/services
"""
2. Tech-Focused Extraction
tech_prompt = """
Identify and categorize technologies mentioned in the content:
1. AI/ML Technologies
2. Cloud Infrastructure
3. Data Analytics Tools
4. Cybersecurity Solutions
Include version numbers and implementation details where available.
"""
3. Multi-Level Extraction
multi_level_prompt = """
Perform hierarchical extraction:
1. Company Overview:
   - Name
   - Mission statement
   - Key executives
2. Technical Capabilities:
   - Core technologies
   - Development stack
   - Infrastructure
3. Market Position:
   - Competitors
   - Market share
   - Growth metrics
"""
Specialized Prompt Examples
Competitive Analysis Prompt
competitor_prompt = """
Identify and compare key competitors:
- Competitor names
- Feature differentiators
- Pricing models
- Market positioning
Output as a comparison table with relative strengths.
"""
Sustainability-Focused Prompt
green_prompt = """
Extract environmental sustainability information:
1. Green initiatives
2. Carbon reduction targets
3. Eco-friendly technologies
4. Sustainability certifications
5. Renewable energy usage
Prioritize quantitative metrics and timelines.
"""
Innovation Tracking Prompt
innovation_prompt = """
Analyze R&D activities and innovations:
- Recent patents filed
- Research partnerships
- New product launches (last 24 months)
- Technology breakthroughs
- Investment in R&D (% of revenue)
"""
Prompt Optimization Tips
- Be Specific: Clearly define required fields and formats, e.g. "Format output as JSON with 'company_name', 'employees', 'revenue' keys"
- Add Context: e.g. "Analyze content from CEO interviews for strategic priorities"
- Define Output Structure: e.g. "Categorize findings under 'basic_info', 'tech_stack', 'growth_metrics'"
- Set Priorities: e.g. "Focus on technical specifications over marketing content"
A prompt combining these tips is sketched below.
Error Handling
ScrapeGen provides specific exception classes for detailed error handling:
- ScrapeGenError: Base exception class
- ConfigurationError: Errors related to scraper configuration
- ScrapingError: Issues encountered during web scraping
- ExtractionError: Problems with AI-driven data extraction
Example usage:
try:
    data = scraper.scrape(
        url=url,
        prompt=complex_prompt,
        base_model=MarketAnalysis
    )
except ExtractionError as e:
    print(f"Extraction failed with custom prompt: {e}")
    print(f"Prompt used: {complex_prompt}")
except ScrapingError as e:
    print(f"Scraping error: {e}")
Architecture
ScrapeGen follows a modular design for scalability and maintainability; a purely illustrative sketch of this separation of concerns follows the list:
- WebsiteScraper: Handles core web scraping logic
- InfoExtractorAi: Performs AI-driven content extraction
- LlmManager: Manages interactions with language models
- UrlParser: Parses and normalizes URLs
- ContentExtractor: Extracts structured data from HTML elements
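The stand-ins below only illustrate this division of responsibilities; the real classes inside ScrapeGen may use different method names and signatures (a hypothetical sketch, built on the libraries from the requirements list):

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

class DemoUrlParser:
    """Toy counterpart of UrlParser: resolve relative links against a base URL."""
    def normalize(self, base: str, href: str) -> str:
        return urljoin(base, href)

class DemoWebsiteScraper:
    """Toy counterpart of WebsiteScraper: fetch raw HTML with a timeout."""
    def fetch(self, url: str) -> str:
        return requests.get(url, timeout=30).text

class DemoContentExtractor:
    """Toy counterpart of ContentExtractor: reduce HTML to plain text."""
    def extract(self, html: str) -> str:
        return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

# An InfoExtractorAi would then pass this text, plus your prompt and Pydantic
# schema, to the Gemini model obtained from LlmManager.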
Best Practices
1. Rate Limiting
- Use delays between requests (a minimal pacing sketch follows this list)
- Respect robots.txt guidelines
- Configure max_pages and max_depth responsibly
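A minimal sketch of polite pacing between requests, using only the standard library and the scraper and model defined earlier:

import time

urls = ["https://example.com/a", "https://example.com/b"]  # illustrative targets

for target in urls:
    data = scraper.scrape(target, CustomDataCollection)
    time.sleep(2)  # pause between requests so the target server is not hammered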
2. Error Handling
- Wrap scraping operations in try-except blocks (see the logging sketch below)
- Implement proper logging for debugging
- Handle network timeouts and retries effectively
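A sketch of logging failures instead of crashing, using the exception classes listed earlier (the import path from scrapegen is assumed):

import logging
from scrapegen import ScrapingError, ExtractionError  # import path assumed

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scrapegen_demo")

try:
    data = scraper.scrape(url, CustomDataCollection)
except ScrapingError as e:
    log.warning("scraping failed for %s: %s", url, e)    # network/page-level failure
except ExtractionError as e:
    log.error("AI extraction failed for %s: %s", url, e)  # model-level failure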
3. Resource Management
- Monitor memory usage for large-scale operations (a batching sketch follows this list)
- Implement pagination for large datasets
- Adjust timeout settings based on expected response times
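One way to keep memory bounded on large crawls is to process URLs in small batches and persist each batch before starting the next. Here, save_batch is a hypothetical helper, not part of ScrapeGen:

def batched(seq, size=10):
    """Yield consecutive slices of `seq` of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

all_urls = ["https://example.com/page1", "https://example.com/page2"]  # illustrative

for batch in batched(all_urls, size=10):
    results = [scraper.scrape(u, CustomDataCollection) for u in batch]
    save_batch(results)  # hypothetical helper: write results to disk, then free them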
Contributing
Contributions are welcome! Feel free to submit a Pull Request to improve ScrapeGen.