SpiderForce4AI Python Wrapper (a Jina AI Reader / Firecrawl alternative)
A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
Features
- 🔄 Simple synchronous and asynchronous APIs
- 📁 Automatic Markdown file saving with URL-based filenames
- 📊 Real-time progress tracking in console
- 🪝 Webhook support for real-time notifications
- 📝 Detailed crawl reports in JSON format
- ⚡ Concurrent crawling with rate limiting
- 🔍 Support for sitemap.xml crawling
- 🛡️ Comprehensive error handling
Installation
pip install spiderforce4ai
Quick Start
from spiderforce4ai import SpiderForce4AI, CrawlConfig
# Initialize the client
spider = SpiderForce4AI("http://localhost:3004")
# Use default configuration
config = CrawlConfig()
# Crawl a single URL
result = spider.crawl_url("https://example.com", config)
# Crawl multiple URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2"
]
results = spider.crawl_urls(urls, config)
# Crawl from sitemap
results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)
Configuration
The CrawlConfig class provides various configuration options. All parameters are optional with sensible defaults:
config = CrawlConfig(
    # Content Selection (all optional)
    target_selector="article",               # Specific element to target
    remove_selectors=[".ads", "#popup"],     # Elements to remove
    remove_selectors_regex=["modal-\\d+"],   # Regex patterns for removal

    # Processing Settings
    max_concurrent_requests=1,               # Default: 1
    request_delay=0.5,                       # Delay between requests in seconds
    timeout=30,                              # Request timeout in seconds

    # Output Settings
    output_dir="spiderforce_reports",        # Default output directory
    webhook_url="https://your-webhook.com",  # Optional webhook endpoint
    webhook_timeout=10,                      # Webhook timeout in seconds
    report_file=None                         # Optional custom report location
)
Default Directory Structure
./
└── spiderforce_reports/
├── example-com-page1.md
├── example-com-page2.md
└── crawl_report.json
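Markdown filenames are derived from each page's URL. As an illustration of the mapping shown above, a hypothetical slugging helper (not the package's actual internal function) might look like this:

```python
from urllib.parse import urlparse

def url_to_filename(url: str) -> str:
    # Join host and path, then replace dots and slashes with hyphens
    parsed = urlparse(url)
    slug = f"{parsed.netloc}{parsed.path}".rstrip("/")
    slug = slug.replace(".", "-").replace("/", "-")
    return f"{slug}.md"

print(url_to_filename("https://example.com/page1"))  # example-com-page1.md
```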
Webhook Notifications
If webhook_url is configured, the crawler sends POST requests with the following JSON structure:
{
  "url": "https://example.com/page1",
  "status": "success",
  "markdown": "# Page Title\n\nContent...",
  "timestamp": "2025-02-15T10:30:00.123456",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads", "#popup"],
    "remove_selectors_regex": ["modal-\\d+"]
  }
}
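Receiving these notifications only requires an HTTP endpoint that accepts POST requests. A minimal stdlib-only receiver sketch (the handler logic is illustrative; any web framework would work equally well):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize(payload: dict) -> str:
    # One-line summary of a webhook payload with the structure shown above
    return f"{payload['status']}: {payload['url']}"

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body sent by the crawler
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        print(summarize(payload))
        self.send_response(200)
        self.end_headers()

# To listen on port 8000:
# HTTPServer(("0.0.0.0", 8000), WebhookHandler).serve_forever()
```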
Crawl Report
A comprehensive JSON report is automatically generated in the output directory:
{
  "timestamp": "2025-02-15T10:30:00.123456",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads", "#popup"],
    "remove_selectors_regex": ["modal-\\d+"]
  },
  "results": {
    "successful": [
      {
        "url": "https://example.com/page1",
        "status": "success",
        "markdown": "# Page Title\n\nContent...",
        "timestamp": "2025-02-15T10:30:00.123456"
      }
    ],
    "failed": [
      {
        "url": "https://example.com/page2",
        "status": "failed",
        "error": "HTTP 404: Not Found",
        "timestamp": "2025-02-15T10:30:01.123456"
      }
    ]
  },
  "summary": {
    "total": 2,
    "successful": 1,
    "failed": 1
  }
}
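Because the report is plain JSON, it is easy to post-process, for example to collect the URLs that failed so they can be retried. A small sketch against the schema above:

```python
import json

def failed_urls(report: dict) -> list[str]:
    # Pull the URLs of failed crawls out of a parsed crawl report
    return [entry["url"] for entry in report["results"]["failed"]]

# Typical usage:
# with open("spiderforce_reports/crawl_report.json") as f:
#     print(failed_urls(json.load(f)))
```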
Async Usage
import asyncio
from spiderforce4ai import SpiderForce4AI, CrawlConfig

async def main():
    config = CrawlConfig()
    spider = SpiderForce4AI("http://localhost:3004")
    async with spider:
        results = await spider.crawl_urls_async(
            ["https://example.com/page1", "https://example.com/page2"],
            config
        )
    return results

if __name__ == "__main__":
    results = asyncio.run(main())
Error Handling
The crawler is designed to be resilient:
- Continues processing even if some URLs fail
- Records all errors in the crawl report
- Sends error notifications via webhook if configured
- Provides clear error messages in console output
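Since failures are recorded rather than raised, a simple retry loop can be layered on top. This sketch assumes each result object exposes `.url` and `.status` attributes, as used elsewhere in this README; the round count is illustrative:

```python
def retry_failed(spider, urls, config, max_rounds=2):
    # Crawl the given URLs, re-submitting failures up to max_rounds times.
    # Returns (successful results, URLs still failing after all rounds).
    remaining = list(urls)
    collected = []
    for _ in range(max_rounds):
        if not remaining:
            break
        results = spider.crawl_urls(remaining, config)
        collected.extend(r for r in results if r.status == "success")
        remaining = [r.url for r in results if r.status != "success"]
    return collected, remaining
```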
Progress Tracking
The crawler provides real-time progress tracking in the console:
🔄 Crawling URLs... [####################] 100%
✓ Successful: 95
✗ Failed: 5
📊 Report saved to: ./spiderforce_reports/crawl_report.json
Usage with AI Agents
The package is designed to be easily integrated with AI agents and chat systems:
from spiderforce4ai import SpiderForce4AI, CrawlConfig

def fetch_content_for_ai(urls):
    spider = SpiderForce4AI("http://localhost:3004")
    config = CrawlConfig()

    # Crawl content
    results = spider.crawl_urls(urls, config)

    # Return successful results
    return {
        result.url: result.markdown
        for result in results
        if result.status == "success"
    }

# Use with AI agent
urls = ["https://example.com/article1", "https://example.com/article2"]
content = fetch_content_for_ai(urls)
Requirements
- Python 3.11 or later
- Docker (for running SpiderForce4AI service)
License
MIT License
Credits
Created by Peter Tam
File details
Details for the file spiderforce4ai-0.1.2.tar.gz.
File metadata
- Download URL: spiderforce4ai-0.1.2.tar.gz
- Upload date:
- Size: 8.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 898b02fab5f3c870702a62deb90afd5cc5f4db0b935e066a15c3b4f897eb5941 |
| MD5 | 02ba834e274b05f92b5d0734330e54e4 |
| BLAKE2b-256 | f0e001dc1e009953ab7e4eede36b9ff92aec21ca97422b4e33c4fe6818d12e85 |
File details
Details for the file spiderforce4ai-0.1.2-py3-none-any.whl.
File metadata
- Download URL: spiderforce4ai-0.1.2-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 607870cb147539e71b258d046ec81abb6a02fdfdcdfb01413202affa95b7197d |
| MD5 | f339f1c87ad2fc0ec5275e62d943b3c3 |
| BLAKE2b-256 | 66808d960d3f74494379497f06ca7a2ec7ac6a13b2e019fbeea4108482caf921 |