SpiderForce4AI Python Wrapper
A Python wrapper for SpiderForce4AI - a powerful HTML-to-Markdown conversion service. This package provides an easy-to-use interface for crawling websites and converting their content to clean Markdown format.
Installation
```bash
pip install spiderforce4ai
```
Quick Start (Minimal Setup)
```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig

# Initialize with your SpiderForce4AI service URL
spider = SpiderForce4AI("http://localhost:3004")

# Use the default configuration (saves output to ./spiderforce_reports)
config = CrawlConfig()

# Crawl a single URL
result = spider.crawl_url("https://example.com", config)
```
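The call returns a per-page result object. Its attributes aren't spelled out here, but assuming they mirror the per-URL fields of the JSON report shown below (`url`, `status`, `markdown`, `error`), a quick inspection could look like this:

```python
# Sketch only: attribute names are assumed to match the per-URL fields
# ("url", "status", "markdown", "error") in the report format documented below.
if result.status == "success":
    print(f"Converted {result.url}: {len(result.markdown)} characters of Markdown")
else:
    print(f"Failed to crawl {result.url}: {result.error}")
```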
Crawling Methods
1. Single URL Crawling
```python
# Synchronous
result = spider.crawl_url("https://example.com", config)

# Asynchronous
async def crawl():
    result = await spider.crawl_url_async("https://example.com", config)
```
2. Multiple URLs Crawling
```python
# List of URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

# Synchronous
results = spider.crawl_urls(urls, config)

# Asynchronous
async def crawl():
    results = await spider.crawl_urls_async(urls, config)

# Parallel (using multiprocessing)
results = spider.crawl_urls_parallel(urls, config)
```
3. Sitemap Crawling
```python
# Synchronous
results = spider.crawl_sitemap("https://example.com/sitemap.xml", config)

# Asynchronous
async def crawl():
    results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)

# Parallel (using multiprocessing)
results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
```
Configuration Options
All configuration options are optional with sensible defaults:
```python
config = CrawlConfig(
    # Content selection (all optional)
    target_selector="article",               # Specific element to extract
    remove_selectors=[                       # Elements to remove
        ".ads",
        "#popup",
        ".navigation",
        ".footer",
    ],
    remove_selectors_regex=["modal-\\d+"],   # Regex patterns for removal

    # Processing settings
    max_concurrent_requests=1,               # Concurrency for parallel processing (default: 1)
    request_delay=0.5,                       # Delay between requests, in seconds
    timeout=30,                              # Request timeout, in seconds

    # Output settings
    output_dir="custom_output",              # Default: "spiderforce_reports"
    report_file="custom_report.json",        # Default: "crawl_report.json"
    webhook_url="https://your-webhook.com",  # Optional webhook endpoint
    webhook_timeout=10,                      # Webhook timeout, in seconds
)
```
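The `remove_selectors_regex` entries are regular expressions, so `modal-\d+` matches identifiers such as `modal-1` or `modal-42` but not `modal-main`. A standalone illustration of the pattern syntax (how the service matches patterns against the HTML is not specified here):

```python
import re

# The same pattern passed to remove_selectors_regex above.
pattern = re.compile(r"modal-\d+")

print(bool(pattern.search("modal-42")))    # True  - numbered modal, removed
print(bool(pattern.search("modal-main")))  # False - no digits, kept
```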
Real-World Examples
1. Basic Website Crawling
```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path

spider = SpiderForce4AI("http://localhost:3004")
config = CrawlConfig(
    output_dir=Path("blog_content")
)

result = spider.crawl_url("https://example.com/blog", config)
print(f"Crawled {result.url}; Markdown saved in blog_content/")
```
2. Advanced Parallel Sitemap Crawling
```python
config = CrawlConfig(
    max_concurrent_requests=5,
    output_dir=Path("website_content"),
    remove_selectors=[
        ".navigation",
        ".footer",
        ".ads",
        "#cookie-notice",
    ],
    webhook_url="https://your-webhook.com/endpoint",
)

results = spider.crawl_sitemap_parallel(
    "https://example.com/sitemap.xml",
    config,
)
```
3. Async Crawling with Progress
```python
import asyncio

from spiderforce4ai import SpiderForce4AI, CrawlConfig

spider = SpiderForce4AI("http://localhost:3004")

async def main():
    config = CrawlConfig(
        max_concurrent_requests=3,
        request_delay=1.0
    )
    async with spider:
        results = await spider.crawl_urls_async([
            "https://example.com/1",
            "https://example.com/2",
            "https://example.com/3",
        ], config)
    return results

results = asyncio.run(main())
```
Output Structure
1. File Organization
```
output_dir/
├── example-com-page1.md
├── example-com-page2.md
└── crawl_report.json
```
2. Markdown Files
Each markdown file is named using a slugified version of the URL and contains the converted content.
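The exact slugification rules aren't documented, but a minimal sketch of the idea (drop the scheme, lowercase, collapse non-alphanumeric runs into hyphens) reproduces names like the ones in the tree above; the package's actual implementation may differ:

```python
import re

def slugify_url(url: str) -> str:
    # Illustrative only: approximates how a URL could map to a filename.
    bare = re.sub(r"^https?://", "", url.lower())
    return re.sub(r"[^a-z0-9]+", "-", bare).strip("-")

print(slugify_url("https://example.com/page1"))  # example-com-page1
```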
3. Report JSON Structure
```json
{
  "timestamp": "2025-02-15T10:30:00.123456",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads", "#popup"],
    "remove_selectors_regex": ["modal-\\d+"]
  },
  "results": {
    "successful": [
      {
        "url": "https://example.com/page1",
        "status": "success",
        "markdown": "# Page Title\n\nContent...",
        "timestamp": "2025-02-15T10:30:00.123456"
      }
    ],
    "failed": [
      {
        "url": "https://example.com/page2",
        "status": "failed",
        "error": "HTTP 404: Not Found",
        "timestamp": "2025-02-15T10:30:01.123456"
      }
    ]
  },
  "summary": {
    "total": 2,
    "successful": 1,
    "failed": 1
  }
}
```
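Because the report is plain JSON, it is easy to post-process. A short sketch that loads the report from its default location (`spiderforce_reports/crawl_report.json`, assuming the default `output_dir` and `report_file`) and prints any failures:

```python
import json
from pathlib import Path

# Default location; adjust if you changed output_dir or report_file.
report = json.loads((Path("spiderforce_reports") / "crawl_report.json").read_text())

summary = report["summary"]
print(f"{summary['successful']}/{summary['total']} pages converted")

for failure in report["results"]["failed"]:
    print(f"FAILED {failure['url']}: {failure['error']}")
```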
4. Webhook Notifications
If configured, webhooks receive real-time updates in JSON format:
```json
{
  "url": "https://example.com/page1",
  "status": "success",
  "markdown": "# Page Title\n\nContent...",
  "timestamp": "2025-02-15T10:30:00.123456",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads", "#popup"]
  }
}
```
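On the receiving side, any endpoint that accepts an HTTP POST with a JSON body should work. A minimal sketch of such a receiver using Flask (the POST method and payload shape are assumptions based on the example above, not documented behavior):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/endpoint", methods=["POST"])
def spiderforce_webhook():
    # Payload shape follows the notification example above.
    payload = request.get_json(force=True)
    print(f"{payload['status']}: {payload['url']}")
    return "", 204  # Acknowledge with an empty response

if __name__ == "__main__":
    app.run(port=8000)
```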
Error Handling
The package handles various types of errors:
- Network errors
- Timeout errors
- Invalid URLs
- Missing content
- Service errors
All errors are:
- Logged in the console
- Included in the JSON report
- Sent via webhook (if configured)
- Available in the results list
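For example, after a batch crawl you can split the returned results by status (field names assumed to match the per-URL entries in the JSON report above):

```python
results = spider.crawl_urls(urls, config)

# "status" and "error" are assumed to mirror the report's per-URL fields.
succeeded = [r for r in results if r.status == "success"]
failed = [r for r in results if r.status == "failed"]

for r in failed:
    print(f"Retry candidate: {r.url} ({r.error})")
```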
Requirements
- Python 3.11 or later
- Running SpiderForce4AI service
- Internet connection
License
MIT License
Credits
Created by Peter Tam
Download files
File details
Details for the file spiderforce4ai-0.1.6.tar.gz.
File metadata
- Download URL: spiderforce4ai-0.1.6.tar.gz
- Upload date:
- Size: 8.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `4b1a073477acbea8b15c347eedb0df5f502b327959b2f4b45a73f229ca2e0c2c` |
| MD5 | `e444d5b5bfc3e554b123bb9de0a808d3` |
| BLAKE2b-256 | `303ae0069c45de3a6e373da842f9d6e433fae5fe5109d29aa9726ad9cff7e164` |
File details
Details for the file spiderforce4ai-0.1.6-py3-none-any.whl.
File metadata
- Download URL: spiderforce4ai-0.1.6-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `c0900219745e8d18262808e904f26da34f0abdb6ab3b18a1db6f97cd1c8222e3` |
| MD5 | `6001d11e977b8995bc0475a1f48da2af` |
| BLAKE2b-256 | `080b872dc6e0e7b2bcc7fb039e107870fe8ded92d9ffdb8b49ce4f156aae91b9` |