Skip to main content

Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service with LLM post-processing

Project description

SpiderForce4AI Python Wrapper

A Python package for web content crawling and HTML-to-Markdown conversion. Built for seamless integration with SpiderForce4AI service.

Features

  • HTML to Markdown conversion
  • Parallel and async crawling support
  • Sitemap processing
  • Custom content selection
  • Automatic retry mechanism
  • Detailed progress tracking
  • Webhook notifications
  • Customizable reporting

Installation

pip install spiderforce4ai

Quick Start

from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path

# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")

# Configure crawling options
config = CrawlConfig(
    target_selector="article",
    remove_selectors=[".ads", ".navigation"],
    max_concurrent_requests=5,
    save_reports=True
)

# Crawl a sitemap
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)

Key Features

1. Smart Retry Mechanism

  • Automatically retries failed URLs
  • Monitors failure ratio to prevent server overload
  • Detailed retry statistics and progress tracking
  • Aborts retries if failure rate exceeds 20%
# Retry behavior is automatic
config = CrawlConfig(
    max_concurrent_requests=5,
    request_delay=1.0  # Delay between retries
)
results = spider.crawl_urls_async(urls, config)

2. Custom Webhook Integration

  • Flexible payload formatting
  • Custom headers support
  • Variable substitution in templates
config = CrawlConfig(
    webhook_url="https://your-webhook.com",
    webhook_headers={
        "Authorization": "Bearer token",
        "X-Custom-Header": "value"
    },
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "custom_field": "value"
    }'''
)

3. Flexible Report Generation

  • Optional report saving
  • Customizable report location
  • Detailed success/failure statistics
config = CrawlConfig(
    save_reports=True,
    report_file=Path("custom_report.json"),
    output_dir=Path("content")
)

Crawling Methods

1. Single URL Processing

# Synchronous
result = spider.crawl_url("https://example.com", config)

# Asynchronous
async def crawl():
    result = await spider.crawl_url_async("https://example.com", config)

2. Multiple URLs

urls = ["https://example.com/page1", "https://example.com/page2"]

# Server-side parallel (recommended)
results = spider.crawl_urls_server_parallel(urls, config)

# Client-side parallel
results = spider.crawl_urls_parallel(urls, config)

# Asynchronous
async def crawl():
    results = await spider.crawl_urls_async(urls, config)

3. Sitemap Processing

# Server-side parallel (recommended)
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)

# Client-side parallel
results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)

# Asynchronous
async def crawl():
    results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)

Configuration Options

config = CrawlConfig(
    # Content Selection
    target_selector="article",              # Target element to extract
    remove_selectors=[".ads", "#popup"],    # Elements to remove
    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
    
    # Processing
    max_concurrent_requests=5,              # Parallel processing limit
    request_delay=0.5,                      # Delay between requests
    timeout=30,                             # Request timeout
    
    # Output
    output_dir=Path("content"),             # Output directory
    save_reports=False,                     # Enable/disable report saving
    report_file=Path("report.json"),        # Report location
    
    # Webhook
    webhook_url="https://webhook.com",      # Webhook endpoint
    webhook_timeout=10,                     # Webhook timeout
    webhook_headers={                       # Custom headers
        "Authorization": "Bearer token"
    },
    webhook_payload_template='''            # Custom payload format
    {
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "error": "{error}",
        "time": "{timestamp}"
    }'''
)

Progress Tracking

The package provides detailed progress information:

Fetching sitemap from https://example.com/sitemap.xml...
Found 156 URLs in sitemap
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 156/156 URLs

Retrying failed URLs: 18 (11.5% failed)
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 18/18 retries

Crawling Summary:
Total URLs processed: 156
Initial failures: 18 (11.5%)
Final results:
  ✓ Successful: 150
  ✗ Failed: 6
Retry success rate: 12/18 (66.7%)

Output Structure

1. Directory Layout

content/                    # Output directory
├── example-com-page1.md   # Markdown files
├── example-com-page2.md
└── report.json            # Crawl report

2. Report Format

{
  "timestamp": "2025-02-15T10:30:00",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads"]
  },
  "results": {
    "successful": [...],
    "failed": [...]
  },
  "summary": {
    "total": 156,
    "successful": 150,
    "failed": 6
  }
}

Performance Optimization

  1. Server-side Parallel Processing

    • Recommended for most cases
    • Single HTTP request
    • Reduced network overhead
    • Built-in load balancing
  2. Client-side Parallel Processing

    • Better control over processing
    • Customizable concurrency
    • Progress tracking per URL
    • Automatic retry handling
  3. Asynchronous Processing

    • Ideal for async applications
    • Non-blocking operation
    • Real-time progress updates
    • Efficient resource usage

Error Handling

The package provides comprehensive error handling:

  • Automatic retry for failed URLs
  • Failure ratio monitoring
  • Detailed error reporting
  • Webhook error notifications
  • Progress tracking during retries

Requirements

  • Python 3.11+
  • Running SpiderForce4AI service
  • Internet connection

Dependencies

  • aiohttp
  • asyncio
  • rich
  • aiofiles
  • httpx

License

MIT License

Credits

Created by Peter Tam

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spiderforce4ai-2.4.1.tar.gz (18.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

spiderforce4ai-2.4.1-py3-none-any.whl (15.1 kB view details)

Uploaded Python 3

File details

Details for the file spiderforce4ai-2.4.1.tar.gz.

File metadata

  • Download URL: spiderforce4ai-2.4.1.tar.gz
  • Upload date:
  • Size: 18.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for spiderforce4ai-2.4.1.tar.gz
Algorithm Hash digest
SHA256 8eab2ad84cb7a40908e8e0d529223c8a4caaf99863c25fb09eab82033036bb7d
MD5 070a018a78c0095deb6bc0636117c368
BLAKE2b-256 6c10299867c5c2d093e6832562c79e60c14c8623e378727ff0987f850840afc9

See more details on using hashes here.

File details

Details for the file spiderforce4ai-2.4.1-py3-none-any.whl.

File metadata

  • Download URL: spiderforce4ai-2.4.1-py3-none-any.whl
  • Upload date:
  • Size: 15.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for spiderforce4ai-2.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f6002c1a90709ef56208d0f9a448e3e0244b03534576abbbcca4457bb535b9b3
MD5 c261ab04c09b7fed23e7c0846b6d54ff
BLAKE2b-256 c29e9357b406c88c241ffea60c0a6917d5a930c8dae6bea3c7b2ec4f8a69cd56

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page