SpiderForce4AI Python Wrapper (FireCrawl, Jina AI, Crawl4AI Alternative)
A Python wrapper for the SpiderForce4AI HTML-to-Markdown conversion service: a comprehensive package for web content crawling, HTML-to-Markdown conversion, and AI-powered (LLM) post-processing with real-time webhook notifications. Built for seamless integration with the SpiderForce4AI service.
Prerequisites
Important: To use this wrapper, you must have the SpiderForce4AI service running. For full installation and deployment instructions, visit: https://github.com/petertamai/SpiderForce4AI
Features
- HTML to Markdown conversion
- Advanced content extraction with custom selectors
- Parallel and async crawling support
- Sitemap processing
- Automatic retry mechanism
- Real-time webhook notifications for each processed URL
- AI-powered post-extraction processing
- Post-extraction webhook integration
- Detailed progress tracking
- Customizable reporting
Installation
pip install spiderforce4ai
Quick Start with Webhooks
from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path
# Initialize crawler with your SpiderForce4AI service URL (required)
spider = SpiderForce4AI("http://localhost:3004")
# Configure crawling options with webhook support
config = CrawlConfig(
    target_selector="article",                 # Optional: target element to extract
    remove_selectors=[".ads", ".navigation"],  # Optional: elements to remove
    max_concurrent_requests=5,                 # Optional: default is 1
    save_reports=True,                         # Optional: default is False
    # Webhook configuration for real-time notifications
    webhook_url="https://webhook.site/your-unique-webhook-id",  # Required for webhooks
    webhook_headers={                          # Optional custom headers
        "Authorization": "Bearer your-token",
        "Content-Type": "application/json"
    }
)
# Crawl a sitemap with webhook notifications
results = spider.crawl_sitemap_server_parallel("https://petertam.pro/sitemap.xml", config)
Core Components
Configuration Options
The CrawlConfig class accepts the following parameters:
Mandatory Parameters for Webhook Integration
- webhook_url: (str) Webhook endpoint URL (required only if you want webhook notifications)
Optional Parameters
Content Selection:
- target_selector: (str) CSS selector for the main content to extract
- remove_selectors: (List[str]) List of CSS selectors to remove from content
- remove_selectors_regex: (List[str]) List of regex patterns for element removal
Processing:
- max_concurrent_requests: (int) Number of parallel requests (default: 1)
- request_delay: (float) Delay between requests in seconds (default: 0.5)
- timeout: (int) Request timeout in seconds (default: 30)
Output:
- output_dir: (Path) Directory to save output files (default: "spiderforce_reports")
- save_reports: (bool) Whether to save crawl reports (default: False)
- report_file: (Path) Custom report file location (generated if None)
- combine_to_one_markdown: (str) 'full' or 'metadata_headers' (the latter is handy for SEO) to combine content into a single file
- combined_markdown_file: (Path) Custom combined file location (generated if None)
Webhook:
- webhook_url: (str) Webhook endpoint URL
- webhook_timeout: (int) Webhook timeout in seconds (default: 10)
- webhook_headers: (Dict[str, str]) Custom webhook headers
- webhook_payload_template: (str) Custom webhook payload template
Post-Extraction Processing:
- post_extraction_agent: (Dict) Configuration for LLM post-processing
- post_extraction_agent_save_to_file: (str) Path to save extraction results
- post_agent_transformer_function: (Callable) Custom transformer function for webhook payloads
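For orientation, here is a sketch of a config that touches most of these options. All parameter names come from the list above; the selector, regex, and URL values are placeholders, and the post-extraction parameters are demonstrated in the sections that follow.
from pathlib import Path
from spiderforce4ai import CrawlConfig

config = CrawlConfig(
    # Content selection
    target_selector="article",
    remove_selectors=[".ads", ".navigation"],
    remove_selectors_regex=[r"^ad-slot-\d+$"],  # placeholder pattern
    # Processing
    max_concurrent_requests=3,
    request_delay=0.5,
    timeout=30,
    # Output
    output_dir=Path("spiderforce_reports"),
    save_reports=True,
    report_file=Path("spiderforce_reports/report.json"),
    combine_to_one_markdown="full",
    combined_markdown_file=Path("spiderforce_reports/combined.md"),
    # Webhook
    webhook_url="https://webhook.site/your-unique-id",
    webhook_timeout=10,
    webhook_headers={"Authorization": "Bearer your-token"}
)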
When Are Webhooks Triggered?
SpiderForce4AI provides webhook notifications at multiple points in the crawling process:
- URL Processing Completion: Triggered after each URL is processed (whether successful or failed)
- Post-Extraction Completion: Triggered after LLM post-processing of each URL (if configured)
- Custom Transformation: You can implement your own webhook logic in the transformer function
The webhook payload contains detailed information about the processed URL, including:
- The URL that was processed
- Processing status (success/failed)
- Extracted markdown content (for successful requests)
- Error details (for failed requests)
- Timestamp of processing
- Post-extraction results (if LLM processing is enabled)
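If you want to inspect these payloads during development, a minimal local receiver is easy to stand up. This sketch uses Flask, which is not a dependency of this package (install it separately); the field names follow the default payload documented under "Webhook Payload Structure" below.
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def receive_webhook():
    # Parse the JSON payload sent by SpiderForce4AI
    payload = request.get_json(force=True)
    if payload.get("status") == "success":
        print(f"OK  {payload['url']}: {len(payload.get('markdown') or '')} chars of markdown")
    else:
        print(f"ERR {payload['url']}: {payload.get('error')}")
    return {"received": True}, 200

if __name__ == "__main__":
    app.run(port=8000)  # then set webhook_url="http://localhost:8000/webhook"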
Complete Webhook Integration Examples
Basic Webhook Integration
This example sends a webhook notification after each URL is processed:
from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path
# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")
# Configure with basic webhook support
config = CrawlConfig(
    # Content selection
    target_selector="article",
    remove_selectors=[".ads", ".navigation"],
    # Processing
    max_concurrent_requests=5,
    request_delay=0.5,
    # Output
    output_dir=Path("content_output"),
    save_reports=True,
    # Webhook configuration
    webhook_url="https://webhook.site/your-unique-id",
    webhook_headers={
        "Authorization": "Bearer your-token",
        "Content-Type": "application/json"
    }
)
# Crawl URLs with webhook notifications
urls = ["https://petertam.pro/about", "https://petertam.pro/contact"]
results = spider.crawl_urls_server_parallel(urls, config)
Custom Webhook Payload Template
You can customize the webhook payload format:
from spiderforce4ai import SpiderForce4AI, CrawlConfig
# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")
# Configure with custom webhook payload
config = CrawlConfig(
    target_selector="article",
    max_concurrent_requests=3,
    # Webhook with custom payload template
    webhook_url="https://webhook.site/your-unique-id",
    webhook_payload_template='''{
        "page": {
            "url": "{url}",
            "status": "{status}",
            "processed_at": "{timestamp}"
        },
        "content": {
            "markdown": "{markdown}",
            "error": "{error}"
        },
        "metadata": {
            "service": "SpiderForce4AI",
            "version": "2.6.7",
            "client_id": "your-client-id"
        }
    }'''
)
# Crawl with custom webhook payload
results = spider.crawl_url("https://petertam.pro", config)
Advanced: Post-Extraction Webhooks
This example demonstrates how to use webhooks with the AI post-processing feature:
from spiderforce4ai import SpiderForce4AI, CrawlConfig
import requests
from pathlib import Path
# Define extraction template structure
extraction_template = """
{
"Title": "Extract the main title of the content",
"MetaDescription": "Extract a search-friendly meta description (under 160 characters)",
"KeyPoints": ["Extract 3-5 key points from the content"],
"Categories": ["Extract relevant categories for this content"],
"ReadingTimeMinutes": "Estimate reading time in minutes"
}
"""
# Define a custom webhook function
def post_extraction_webhook(extraction_result):
    """Send extraction results to a webhook and return transformed data."""
    # Add custom fields or transform the data as needed
    payload = {
        "url": extraction_result.get("url", ""),
        "title": extraction_result.get("Title", ""),
        "description": extraction_result.get("MetaDescription", ""),
        "key_points": extraction_result.get("KeyPoints", []),
        "categories": extraction_result.get("Categories", []),
        "reading_time": extraction_result.get("ReadingTimeMinutes", ""),
        "processed_at": extraction_result.get("timestamp", "")
    }
    # Send to webhook (example using a different webhook than the main one)
    try:
        response = requests.post(
            "https://webhook.site/your-extraction-webhook-id",
            json=payload,
            headers={
                "Authorization": "Bearer extraction-token",
                "Content-Type": "application/json"
            },
            timeout=10
        )
        print(f"Extraction webhook sent: Status {response.status_code}")
    except Exception as e:
        print(f"Extraction webhook error: {str(e)}")
    # Return the transformed data (will be stored in result.extraction_result)
    return payload
# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")
# Configure with post-extraction and webhooks
config = CrawlConfig(
    # Basic crawling settings
    target_selector="article",
    remove_selectors=[".ads", ".navigation", ".comments"],
    max_concurrent_requests=5,
    # Regular webhook for crawl results
    webhook_url="https://webhook.site/your-crawl-webhook-id",
    webhook_headers={
        "Authorization": "Bearer crawl-token",
        "Content-Type": "application/json"
    },
    # Post-extraction LLM processing
    post_extraction_agent={
        "model": "gpt-4-turbo",  # Or another compatible model
        "api_key": "your-api-key-here",
        "max_tokens": 1000,
        "temperature": 0.3,
        "response_format": "json_object",  # Request JSON response format
        "messages": [
            {
                "role": "system",
                "content": f"Extract the following information from the content. Return ONLY valid JSON, no explanations:\n\n{extraction_template}"
            },
            {
                "role": "user",
                "content": "{here_markdown_content}"  # Will be replaced with actual content
            }
        ]
    },
    # Save combined extraction results
    post_extraction_agent_save_to_file="extraction_results.json",
    # Custom function to transform and send extraction webhook
    post_agent_transformer_function=post_extraction_webhook
)
# Crawl a sitemap with both regular and post-extraction webhooks
results = spider.crawl_sitemap_server_parallel("https://petertam.pro/sitemap.xml", config)
Real-World Example: SEO Analyzer with Webhooks
This complete example crawls a site and uses Mistral AI to extract SEO-focused data, sending custom webhooks at each step:
from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path
import requests
# Define your extraction template
extraction_template = """
{
"Title": "Extract the main title of the content (ideally under 60 characters for SEO)",
"MetaDescription": "Extract the meta description of the content (ideally under 160 characters for SEO)",
"CanonicalUrl": "Extract the canonical URL of the content",
"Headings": {
"H1": "Extract the main H1 heading",
"H2": ["Extract all H2 headings (for better content structure)"],
"H3": ["Extract all H3 headings (for better content structure)"]
},
"KeyPoints": [
"Point 1 (focus on primary keywords)",
"Point 2 (focus on secondary keywords)",
"Point 3 (focus on user intent)"
],
"CallToAction": "Extract current call to actions (e.g., Sign up now for exclusive offers!)"
}
"""
# Define custom transformer function
def post_webhook(post_extraction_agent_response):
    """Transform and send extracted SEO-focused data to a webhook."""
    payload = {
        "url": post_extraction_agent_response.get("url", ""),
        "Title": post_extraction_agent_response.get("Title", ""),
        "MetaDescription": post_extraction_agent_response.get("MetaDescription", ""),
        "CanonicalUrl": post_extraction_agent_response.get("CanonicalUrl", ""),
        "Headings": post_extraction_agent_response.get("Headings", {}),
        "KeyPoints": post_extraction_agent_response.get("KeyPoints", []),
        "CallToAction": post_extraction_agent_response.get("CallToAction", "")
    }
    # Send webhook with custom headers
    try:
        response = requests.post(
            "https://webhook.site/your-extraction-webhook-id",
            json=payload,
            headers={
                "Authorization": "Bearer token",
                "X-Custom-Header": "value",
                "Content-Type": "application/json"
            },
            timeout=10
        )
        # Log the response for debugging
        if response.status_code == 200:
            print(f"✅ Webhook sent successfully for {payload['url']}")
        else:
            print(f"❌ Failed to send webhook. Status code: {response.status_code}")
    except Exception as e:
        print(f"❌ Webhook error: {str(e)}")
    return payload  # Return transformed data
# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")
# Configure post-extraction agent
post_extraction_agent = {
    "model": "mistral/mistral-large-latest",  # Or any model compatible with litellm
    "messages": [
        {
            "role": "system",
            "content": f"Based on the provided markdown content, extract the following information. Return ONLY valid JSON, no comments or explanations:\n\n{extraction_template}"
        },
        {
            "role": "user",
            "content": "{here_markdown_content}"  # Placeholder for markdown content
        }
    ],
    "api_key": "your-mistral-api-key",  # Replace with your actual API key
    "max_tokens": 8000,
    "response_format": "json_object"  # Request JSON format from the API
}
# Create crawl configuration
config = CrawlConfig(
    # Basic crawl settings
    save_reports=True,
    max_concurrent_requests=5,
    remove_selectors=[
        ".header-bottom",
        "#mainmenu",
        ".headerfix",
        ".header",
        ".menu-item",
        ".wpcf7",
        ".followus-section"
    ],
    output_dir=Path("reports"),
    # Crawl results webhook
    webhook_url="https://webhook.site/your-crawl-webhook-id",
    webhook_headers={
        "Authorization": "Bearer crawl-token",
        "Content-Type": "application/json"
    },
    # Add post-extraction configuration
    post_extraction_agent=post_extraction_agent,
    post_extraction_agent_save_to_file="combined_extraction.json",
    post_agent_transformer_function=post_webhook
)
# Run the crawler with parallel processing
results = spider.crawl_sitemap_parallel(
    "https://petertam.pro/sitemap.xml",
    config
)
# Print summary
successful = len([r for r in results if r.status == "success"])
extracted = len([r for r in results if hasattr(r, 'extraction_result') and r.extraction_result])
print(f"\n📊 Crawling complete:")
print(f" - {len(results)} URLs processed")
print(f" - {successful} successfully crawled")
print(f" - {extracted} with AI extraction")
print(f" - Reports saved to: {config.report_file}")
print(f" - Extraction data saved to: {config.post_extraction_agent_save_to_file}")
AI-Powered Post-Processing with Webhook Integration
The package includes a powerful AI post-processing system through the PostExtractionAgent class with integrated webhook capabilities.
Post-Extraction Configuration
from spiderforce4ai import SpiderForce4AI, CrawlConfig, PostExtractionConfig, PostExtractionAgent
import requests
# Define a custom webhook transformer
def transform_and_notify(result):
    """Process extraction results and send them to an external system."""
    requests.post(
        "https://your-api.com/analyze",
        json=result,
        headers={"Authorization": "Bearer token"}
    )
    return result  # Return data for storage
# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")
# Configure post-extraction processing with webhooks
config = CrawlConfig(
    # Basic crawl settings
    target_selector="article",
    max_concurrent_requests=5,
    # Standard crawl webhook
    webhook_url="https://webhook.site/your-crawl-webhook-id",
    # Post-extraction LLM configuration
    post_extraction_agent={
        "model": "gpt-4-turbo",
        "api_key": "your-api-key-here",
        "max_tokens": 1000,
        "temperature": 0.7,
        "base_url": "https://api.openai.com/v1",
        "messages": [
            {
                "role": "system",
                "content": "Extract key information from the following content."
            },
            {
                "role": "user",
                "content": "Please analyze the following content:\n\n{here_markdown_content}"
            }
        ]
    },
    # Save extraction results to file
    post_extraction_agent_save_to_file="extraction_results.json",
    # Custom webhook transformer
    post_agent_transformer_function=transform_and_notify
)
# Crawl with post-processing and webhooks
results = spider.crawl_urls_server_parallel(["https://petertam.pro"], config)
Crawling Methods
All methods support webhook integration automatically when configured:
1. Single URL Processing
# Synchronous with webhook
result = spider.crawl_url("https://petertam.pro", config)
# Asynchronous with webhook
async def crawl():
    result = await spider.crawl_url_async("https://petertam.pro", config)
2. Multiple URLs
urls = ["https://petertam.pro/about", "https://petertam.pro/contact"]
# Server-side parallel with webhooks (recommended)
results = spider.crawl_urls_server_parallel(urls, config)
# Client-side parallel with webhooks
results = spider.crawl_urls_parallel(urls, config)
# Asynchronous with webhooks
async def crawl():
    results = await spider.crawl_urls_async(urls, config)
3. Sitemap Processing
# Server-side parallel with webhooks (recommended)
results = spider.crawl_sitemap_server_parallel("https://petertam.pro/sitemap.xml", config)
# Client-side parallel with webhooks
results = spider.crawl_sitemap_parallel("https://petertam.pro/sitemap.xml", config)
# Asynchronous with webhooks
async def crawl():
    results = await spider.crawl_sitemap_async("https://petertam.pro/sitemap.xml", config)
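The async variants return coroutines, so they must be driven by an event loop. A minimal driver sketch (method names as documented above):
import asyncio
from spiderforce4ai import SpiderForce4AI, CrawlConfig

spider = SpiderForce4AI("http://localhost:3004")
config = CrawlConfig(webhook_url="https://webhook.site/your-unique-id")

async def main():
    # Any of the *_async methods shown above can be awaited here
    return await spider.crawl_sitemap_async("https://petertam.pro/sitemap.xml", config)

results = asyncio.run(main())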
Webhook Payload Structure
The default webhook payload includes:
{
    "url": "https://petertam.pro/about",
    "status": "success",
    "markdown": "# About\n\nThis is the about page content...",
    "error": null,
    "timestamp": "2025-02-27T12:30:45.123456",
    "config": {
        "target_selector": "article",
        "remove_selectors": [".ads", ".navigation"]
    },
    "extraction_result": {
        "Title": "About Peter Tam",
        "MetaDescription": "Professional developer with expertise in AI and web technologies",
        "KeyPoints": ["Over 10 years experience", "Specializes in AI integration", "Full-stack development"]
    }
}
For failed requests:
{
    "url": "https://petertam.pro/missing-page",
    "status": "failed",
    "markdown": null,
    "error": "HTTP 404: Not Found",
    "timestamp": "2025-02-27T12:31:23.456789",
    "config": {
        "target_selector": "article",
        "remove_selectors": [".ads", ".navigation"]
    },
    "extraction_result": null
}
Custom Webhook Transformation
You can transform webhook data with a custom function:
import requests

def transform_webhook_data(result):
    """Custom transformer for webhook data."""
    # Extract only the needed fields
    transformed = {
        "url": result.get("url"),
        "title": result.get("Title"),
        "description": result.get("MetaDescription"),
        "processed_at": result.get("timestamp")
    }
    # Add custom calculations
    if "raw_content" in result:
        transformed["word_count"] = len(result["raw_content"].split())
        transformed["reading_time_min"] = max(1, transformed["word_count"] // 200)
    # Send to external systems if needed
    requests.post("https://your-analytics-api.com/log", json=transformed)
    return transformed  # This will be stored in result.extraction_result
Smart Retry Mechanism
The package provides a sophisticated retry system with webhook notifications:
import asyncio

# Retry behavior with webhook notifications
config = CrawlConfig(
    max_concurrent_requests=5,
    request_delay=1.0,
    webhook_url="https://webhook.site/your-webhook-id",
    # The webhook is called for both initial attempts and retries
)
results = asyncio.run(spider.crawl_urls_async(urls, config))
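After the crawl finishes you can separate the URLs that still failed after retries. A short sketch: the status attribute is used the same way in the summary example earlier, while the url and error fields are assumed to mirror the webhook payload above.
# Inspect what still failed after retries (url/error assumed to mirror the payload fields)
failed = [r for r in results if r.status != "success"]
for r in failed:
    print(f"Gave up on {r.url}: {r.error}")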
Report Generation
The package can generate detailed reports of crawling operations:
config = CrawlConfig(
    save_reports=True,
    report_file=Path("custom_report.json"),
    output_dir=Path("content"),
    webhook_url="https://webhook.site/your-webhook-id"
)
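The report is written as JSON, so it can be post-processed with the standard library. The exact report schema isn't documented here, so a safe first step is to load the file (named in the config above) and inspect its structure:
import json
from pathlib import Path

report = json.loads(Path("custom_report.json").read_text())
# Peek at the top-level structure before relying on specific keys
print(json.dumps(report, indent=2)[:500])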
Combined Content Output
You can combine multiple pages into a single Markdown file:
config = CrawlConfig(
    combine_to_one_markdown="full",
    combined_markdown_file=Path("all_pages.md"),
    webhook_url="https://webhook.site/your-webhook-id"
)
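Once the crawl finishes, the combined file is ordinary Markdown and can be consumed directly, for example as LLM context (assuming the file name from the config above):
from pathlib import Path

combined = Path("all_pages.md").read_text(encoding="utf-8")
print(f"Combined markdown: {len(combined.splitlines())} lines")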
Progress Tracking
The package provides rich progress tracking with detailed statistics:
Fetching sitemap from https://petertam.pro/sitemap.xml...
Found 23 URLs in sitemap
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 23/23 URLs
Retrying failed URLs: 3 (13.0% failed)
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 3/3 retries
Starting post-extraction processing...
[⠋] Post-extraction processing... • 0/20 0% • 00:00:00
Crawling Summary:
Total URLs processed: 23
Initial failures: 3 (13.0%)
Final results:
✓ Successful: 22
✗ Failed: 1
Retry success rate: 2/3 (66.7%)
Advanced Error Handling with Webhooks
import logging
import requests
from datetime import datetime

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    filename='spider_log.txt'
)

# Define error handling webhook function
def error_handler(extraction_result):
    """Handle errors and send notifications."""
    url = extraction_result.get('url', 'unknown')
    # Check for errors
    if 'error' in extraction_result and extraction_result['error']:
        # Log the error
        logging.error(f"Error processing {url}: {extraction_result['error']}")
        # Send error notification webhook
        try:
            requests.post(
                "https://webhook.site/your-error-webhook-id",
                json={
                    "url": url,
                    "error": extraction_result['error'],
                    "timestamp": datetime.now().isoformat(),
                    "severity": "high" if "404" in extraction_result['error'] else "medium"
                },
                headers={"X-Error-Alert": "true"}
            )
        except Exception as e:
            logging.error(f"Failed to send error webhook: {e}")
    # Always return the original result to ensure data is preserved
    return extraction_result
# Add to config
config = CrawlConfig(
    # Regular webhook for all crawl results
    webhook_url="https://webhook.site/your-regular-webhook-id",
    # Error handling webhook through the transformer function
    post_agent_transformer_function=error_handler
)
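This config then plugs into any of the crawl methods; note that the transformer receives post-extraction results, so the earlier examples pair it with post_extraction_agent. For example:
spider = SpiderForce4AI("http://localhost:3004")
results = spider.crawl_urls_server_parallel(
    ["https://petertam.pro/about", "https://petertam.pro/missing-page"],
    config
)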
Requirements
- Python 3.11+
- Running SpiderForce4AI service
- Internet connection
Dependencies
- aiohttp
- asyncio
- rich
- aiofiles
- httpx
- litellm
- pydantic
- requests
- pandas
- numpy
- openai
License
MIT License
Credits
Created by Piotr Tamulewicz
File details
Details for the file spiderforce4ai-2.7.tar.gz (source distribution).
File metadata
- Size: 28.0 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 4620e412b1ce2715f71d09611764170171513b04448ac8971957bf623706e34e |
| MD5 | 0ef89b9a9d4013958e2138f0891723b5 |
| BLAKE2b-256 | 3b99c71f8c991922517aa57a8959f226b1d05c0b84c6cccda3eadf482b5353d4 |
File details
Details for the file spiderforce4ai-2.7-py3-none-any.whl (built distribution, Python 3).
File metadata
- Size: 20.5 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a3e2c685a6c856f107bc70ade6a21ef58da1fa4f5094afa327376bc818ae1c1 |
| MD5 | 6161e7e80ecf95c8f4701238e865b222 |
| BLAKE2b-256 | 019bb4a62538e372f7a78109b39659e69a871ffbd99787a49751f22284816c15 |