Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service with LLM post-processing
SpiderForce4AI Python Wrapper
A Python package for web content crawling and HTML-to-Markdown conversion, built for seamless integration with the SpiderForce4AI service.
Features
- HTML to Markdown conversion
- Parallel and async crawling support
- Sitemap processing
- Custom content selection
- Automatic retry mechanism
- Detailed progress tracking
- Webhook notifications
- Customizable reporting
Installation
pip install spiderforce4ai
Quick Start
from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path
# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")
# Configure crawling options
config = CrawlConfig(
    target_selector="article",
    remove_selectors=[".ads", ".navigation"],
    max_concurrent_requests=5,
    save_reports=True
)
# Crawl a sitemap
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
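Each crawl method returns a list of per-URL results. As a rough sketch of post-processing (the result fields used here, `url`, `status`, `markdown`, and `error`, are assumed from the webhook template variables shown below, not a documented API):
# Result fields are assumed from the webhook template variables;
# check the result objects of your installed version.
for result in results:
    if result.status == "success":
        print(f"OK   {result.url} ({len(result.markdown)} chars)")
    else:
        print(f"FAIL {result.url}: {result.error}")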
Key Features
1. Smart Retry Mechanism
- Automatically retries failed URLs
- Monitors failure ratio to prevent server overload
- Detailed retry statistics and progress tracking
- Aborts retries if failure rate exceeds 20%
# Retry behavior is automatic
config = CrawlConfig(
    max_concurrent_requests=5,
    request_delay=1.0  # Delay between retries
)
results = spider.crawl_urls_async(urls, config)
2. Custom Webhook Integration
- Flexible payload formatting
- Custom headers support
- Variable substitution in templates
config = CrawlConfig(
    webhook_url="https://your-webhook.com",
    webhook_headers={
        "Authorization": "Bearer token",
        "X-Custom-Header": "value"
    },
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "custom_field": "value"
    }'''
)
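The rendered template is POSTed to webhook_url, presumably once per crawled URL given the per-URL variables. For local testing, a minimal receiver built on Python's standard library can print incoming payloads (this endpoint is illustrative; any HTTP server that accepts POST requests will do):
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON payload sent by the crawler
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        print(payload["url"], payload["status"])
        self.send_response(200)
        self.end_headers()

HTTPServer(("localhost", 8000), WebhookHandler).serve_forever()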
3. Flexible Report Generation
- Optional report saving
- Customizable report location
- Detailed success/failure statistics
config = CrawlConfig(
    save_reports=True,
    report_file=Path("custom_report.json"),
    output_dir=Path("content")
)
Crawling Methods
1. Single URL Processing
# Synchronous
result = spider.crawl_url("https://example.com", config)
# Asynchronous
import asyncio

async def crawl():
    return await spider.crawl_url_async("https://example.com", config)

result = asyncio.run(crawl())
2. Multiple URLs
urls = ["https://example.com/page1", "https://example.com/page2"]
# Server-side parallel (recommended)
results = spider.crawl_urls_server_parallel(urls, config)
# Client-side parallel
results = spider.crawl_urls_parallel(urls, config)
# Asynchronous
async def crawl():
    return await spider.crawl_urls_async(urls, config)

results = asyncio.run(crawl())
3. Sitemap Processing
# Server-side parallel (recommended)
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
# Client-side parallel
results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
# Asynchronous
async def crawl():
    return await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)

results = asyncio.run(crawl())
Configuration Options
config = CrawlConfig(
    # Content Selection
    target_selector="article",              # Target element to extract
    remove_selectors=[".ads", "#popup"],    # Elements to remove
    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal

    # Processing
    max_concurrent_requests=5,  # Parallel processing limit
    request_delay=0.5,          # Delay between requests
    timeout=30,                 # Request timeout

    # Output
    output_dir=Path("content"),       # Output directory
    save_reports=False,               # Enable/disable report saving
    report_file=Path("report.json"),  # Report location

    # Webhook
    webhook_url="https://webhook.com",  # Webhook endpoint
    webhook_timeout=10,                 # Webhook timeout
    webhook_headers={                   # Custom headers
        "Authorization": "Bearer token"
    },
    # Custom payload format
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "error": "{error}",
        "time": "{timestamp}"
    }'''
)
Progress Tracking
The package provides detailed progress information:
Fetching sitemap from https://example.com/sitemap.xml...
Found 156 URLs in sitemap
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 156/156 URLs
Retrying failed URLs: 18 (11.5% failed)
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 18/18 retries
Crawling Summary:
Total URLs processed: 156
Initial failures: 18 (11.5%)
Final results:
✓ Successful: 150
✗ Failed: 6
Retry success rate: 12/18 (66.7%)
Output Structure
1. Directory Layout
content/ # Output directory
├── example-com-page1.md # Markdown files
├── example-com-page2.md
└── report.json # Crawl report
2. Report Format
{
  "timestamp": "2025-02-15T10:30:00",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads"]
  },
  "results": {
    "successful": [...],
    "failed": [...]
  },
  "summary": {
    "total": 156,
    "successful": 150,
    "failed": 6
  }
}
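Since the report is plain JSON, post-crawl analysis is straightforward. A small sketch, with field names taken from the example report above and the default report location from the directory layout:
import json
from pathlib import Path

# Load the crawl report written alongside the Markdown output
report = json.loads(Path("content/report.json").read_text())
summary = report["summary"]
print(f"{summary['successful']}/{summary['total']} URLs succeeded")
for entry in report["results"]["failed"]:
    print("failed:", entry)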
Performance Optimization
1. Server-side Parallel Processing
- Recommended for most cases
- Single HTTP request
- Reduced network overhead
- Built-in load balancing
2. Client-side Parallel Processing
- Better control over processing
- Customizable concurrency
- Progress tracking per URL
- Automatic retry handling
3. Asynchronous Processing
- Ideal for async applications
- Non-blocking operation
- Real-time progress updates
- Efficient resource usage
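A rough rule of thumb for choosing among them, as a sketch (crawl_all is a hypothetical helper, not part of the package):
import asyncio

def crawl_all(spider, urls, config):
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No running event loop: plain synchronous call, letting the
        # service parallelize server-side (the recommended default)
        return spider.crawl_urls_server_parallel(urls, config)
    # Already inside an event loop: return the coroutine for the caller to await
    return spider.crawl_urls_async(urls, config)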
Error Handling
The package provides comprehensive error handling:
- Automatic retry for failed URLs
- Failure ratio monitoring
- Detailed error reporting
- Webhook error notifications
- Progress tracking during retries
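Beyond the automatic retries, URLs that still fail can be collected and re-crawled with a gentler configuration. A sketch, again assuming `status` and `url` result fields as in the report format above:
# Collect the remaining failures and re-crawl them more slowly
failed_urls = [r.url for r in results if r.status == "failed"]
if failed_urls:
    retry_config = CrawlConfig(max_concurrent_requests=2, request_delay=2.0)
    retry_results = spider.crawl_urls_server_parallel(failed_urls, retry_config)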
Requirements
- Python 3.11+
- Running SpiderForce4AI service
- Internet connection
Dependencies
- aiohttp
- asyncio (part of the Python standard library)
- rich
- aiofiles
- httpx
License
MIT License
Credits
Created by Peter Tam
File details
Details for the file spiderforce4ai-2.5.tar.gz.
File metadata
- Download URL: spiderforce4ai-2.5.tar.gz
- Upload date:
- Size: 19.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6d735a5701bcf4818d34bc0e13f6127751808dc4653dba73366c3130b3a48888 |
| MD5 | b4390687eb9fc3b69a063256ab9b4991 |
| BLAKE2b-256 | 8730b0f9190e492efe16336b790ec4a77cd0b292285126fcc7a62ed853911865 |
File details
Details for the file spiderforce4ai-2.5-py3-none-any.whl.
File metadata
- Download URL: spiderforce4ai-2.5-py3-none-any.whl
- Upload date:
- Size: 16.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6a4575e5fadc0b8dd3656b09cfb0ef812e9f026c72b3cfed539ab10da4b72fd |
| MD5 | 0ea083a1dcf44db6b1fe04a82d4c94c8 |
| BLAKE2b-256 | e86fe9ba5be3a2f98e909827c72e22997c61ea542f2734200a6c77b9d3941b3c |