Python wrapper for SpiderForce4AI HTML-to-Markdown conversion service with LLM post-processing

SpiderForce4AI Python Wrapper

A Python package for web content crawling and HTML-to-Markdown conversion, built to integrate with the SpiderForce4AI service.

Features

HTML to Markdown conversion
Parallel and async crawling support
Sitemap processing
Custom content selection
Automatic retry mechanism
Detailed progress tracking
Webhook notifications
Customizable reporting

Installation

pip install spiderforce4ai

Quick Start

from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path

# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")

# Configure crawling options
config = CrawlConfig(
    target_selector="article",
    remove_selectors=[".ads", ".navigation"],
    max_concurrent_requests=5,
    save_reports=True
)

# Crawl a sitemap
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)

Key Features

1. Smart Retry Mechanism

Automatically retries failed URLs
Monitors failure ratio to prevent server overload
Detailed retry statistics and progress tracking
Aborts retries if failure rate exceeds 20%

# Retry behavior is automatic
config = CrawlConfig(
    max_concurrent_requests=5,
    request_delay=1.0  # Delay between requests
)

async def crawl(urls):
    # crawl_urls_async must be awaited, like the other *_async methods
    return await spider.crawl_urls_async(urls, config)
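The 20% cutoff described above can be illustrated with a small standalone sketch. It is not the package's internal code: `should_retry` and its `max_failure_ratio` parameter are hypothetical names used only to show the idea of gating retries on the failure ratio.

```python
def should_retry(failed: int, total: int, max_failure_ratio: float = 0.20) -> bool:
    """Return True when a retry pass looks worthwhile (hypothetical helper).

    Retries are skipped when nothing failed, or when the share of failed
    URLs exceeds the cutoff, which hints at a server-side problem.
    """
    if total <= 0 or failed <= 0:
        return False
    return failed / total <= max_failure_ratio


print(should_retry(18, 156))  # 18/156 is about 11.5%, under the cutoff
print(should_retry(40, 156))  # 40/156 is about 25.6%, over the cutoff
```

Whether the real threshold is inclusive or exclusive at exactly 20% is not stated in this README; the sketch treats exactly 20% as still retryable.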

2. Custom Webhook Integration

Flexible payload formatting
Custom headers support
Variable substitution in templates

config = CrawlConfig(
    webhook_url="https://your-webhook.com",
    webhook_headers={
        "Authorization": "Bearer token",
        "X-Custom-Header": "value"
    },
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "custom_field": "value"
    }'''
)
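How the package fills in placeholders such as `{url}` is not documented here. The sketch below shows one way such substitution could work, under the assumption that each `{name}` is replaced by a JSON-escaped value. `render_payload` is a hypothetical helper, not part of the package. It avoids `str.format`, which would trip over the literal JSON braces in the template.

```python
import json
import re

TEMPLATE = '''{
    "url": "{url}",
    "content": "{markdown}",
    "status": "{status}"
}'''


def render_payload(template: str, values: dict[str, str]) -> str:
    """Replace each {name} placeholder with a JSON-escaped value (hypothetical)."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            return match.group(0)  # leave unknown placeholders untouched
        # json.dumps adds surrounding quotes; strip them because the
        # template already wraps each placeholder in quotes.
        return json.dumps(values[name])[1:-1]

    # A single pass, so substituted values are never re-scanned.
    return re.sub(r"\{(\w+)\}", substitute, template)


payload = render_payload(
    TEMPLATE,
    {"url": "https://example.com", "markdown": '# Title\n"quoted"', "status": "success"},
)
data = json.loads(payload)  # the rendered text parses as valid JSON
print(data["content"])
```

Escaping matters because Markdown content routinely contains quotes and newlines that would otherwise produce invalid JSON.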

3. Flexible Report Generation

Optional report saving
Customizable report location
Detailed success/failure statistics

config = CrawlConfig(
    save_reports=True,
    report_file=Path("custom_report.json"),
    output_dir=Path("content")
)

Crawling Methods

1. Single URL Processing

# Synchronous
result = spider.crawl_url("https://example.com", config)

# Asynchronous (await from async code, or run with asyncio.run(crawl()))
async def crawl():
    return await spider.crawl_url_async("https://example.com", config)

2. Multiple URLs

urls = ["https://example.com/page1", "https://example.com/page2"]

# Server-side parallel (recommended)
results = spider.crawl_urls_server_parallel(urls, config)

# Client-side parallel
results = spider.crawl_urls_parallel(urls, config)

# Asynchronous (await from async code, or run with asyncio.run(crawl()))
async def crawl():
    return await spider.crawl_urls_async(urls, config)

3. Sitemap Processing

# Server-side parallel (recommended)
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)

# Client-side parallel
results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)

# Asynchronous (await from async code, or run with asyncio.run(crawl()))
async def crawl():
    return await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)

Configuration Options

config = CrawlConfig(
    # Content Selection
    target_selector="article",              # Target element to extract
    remove_selectors=[".ads", "#popup"],    # Elements to remove
    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal
    
    # Processing
    max_concurrent_requests=5,              # Parallel processing limit
    request_delay=0.5,                      # Delay between requests
    timeout=30,                             # Request timeout
    
    # Output
    output_dir=Path("content"),             # Output directory
    save_reports=False,                     # Enable/disable report saving
    report_file=Path("report.json"),        # Report location
    
    # Webhook
    webhook_url="https://webhook.com",      # Webhook endpoint
    webhook_timeout=10,                     # Webhook timeout
    webhook_headers={                       # Custom headers
        "Authorization": "Bearer token"
    },
    # Custom payload format (the comment stays outside the string so it is not sent)
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "error": "{error}",
        "time": "{timestamp}"
    }'''
)

Progress Tracking

The package provides detailed progress information:

Fetching sitemap from https://example.com/sitemap.xml...
Found 156 URLs in sitemap
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 156/156 URLs

Retrying failed URLs: 18 (11.5% failed)
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 18/18 retries

Crawling Summary:
Total URLs processed: 156
Initial failures: 18 (11.5%)
Final results:
  ✓ Successful: 150
  ✗ Failed: 6
Retry success rate: 12/18 (66.7%)
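The figures in the sample summary follow from three numbers: the total, the initial failures, and the URLs recovered on retry. A short worked example, using plain arithmetic rather than any package API:

```python
total = 156
initial_failures = 18
recovered = 12  # URLs that succeeded on retry

final_failed = initial_failures - recovered
final_successful = total - final_failed

print(f"Initial failures: {initial_failures} ({initial_failures / total:.1%})")
print(f"Successful: {final_successful}, failed: {final_failed}")
print(f"Retry success rate: {recovered}/{initial_failures} ({recovered / initial_failures:.1%})")
```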

Output Structure

1. Directory Layout

content/                    # Output directory
├── example-com-page1.md   # Markdown files
├── example-com-page2.md
└── report.json            # Crawl report
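The file names in the layout above look like slugs derived from each URL. The package's exact naming rule is not documented here, so the following is only a sketch of one rule that reproduces `example-com-page1.md`; `url_to_filename` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Build a slug such as 'example-com-page1.md' from a URL (hypothetical)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return f"{slug}.md"


print(url_to_filename("https://example.com/page1"))  # example-com-page1.md
```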

2. Report Format

{
  "timestamp": "2025-02-15T10:30:00",
  "config": {
    "target_selector": "article",
    "remove_selectors": [".ads"]
  },
  "results": {
    "successful": [...],
    "failed": [...]
  },
  "summary": {
    "total": 156,
    "successful": 150,
    "failed": 6
  }
}
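A report in this shape is easy to post-process with the standard library. The sketch below reads only the `summary` block shown above; the sample text is trimmed to the documented keys, and `success_rate` is a hypothetical helper.

```python
import json

# Trimmed sample in the documented shape (only keys shown in the README).
report_text = """
{
  "timestamp": "2025-02-15T10:30:00",
  "summary": {"total": 156, "successful": 150, "failed": 6}
}
"""


def success_rate(report: dict) -> float:
    """Return the share of successfully crawled URLs (hypothetical helper)."""
    summary = report["summary"]
    if summary["total"] == 0:
        return 0.0
    return summary["successful"] / summary["total"]


report = json.loads(report_text)
summary = report["summary"]
# Sanity check: the two outcome counts should add up to the total.
consistent = summary["successful"] + summary["failed"] == summary["total"]
print(f"{success_rate(report):.1%}", consistent)
```

For a real run, the text would come from the saved report file, for example via `Path("content/report.json").read_text()` if the defaults shown in the directory layout apply.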

Performance Optimization

Server-side Parallel Processing
- Recommended for most cases
- Single HTTP request
- Reduced network overhead
- Built-in load balancing

Client-side Parallel Processing
- Better control over processing
- Customizable concurrency
- Progress tracking per URL
- Automatic retry handling

Asynchronous Processing
- Ideal for async applications
- Non-blocking operation
- Real-time progress updates
- Efficient resource usage
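To make "customizable concurrency" concrete, here is a generic sketch of how a client can cap the number of in-flight requests with `asyncio.Semaphore`. It does not call the package: `fake_fetch` is a stand-in for a real conversion request, and `crawl_with_limit` is a hypothetical name. Whether the package implements `max_concurrent_requests` this way is an assumption.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for a real conversion request (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"# Markdown for {url}"


async def crawl_with_limit(urls: list[str], max_concurrent: int = 5) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with semaphore:  # at most max_concurrent fetches run at once
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page{i}" for i in range(1, 11)]
results = asyncio.run(crawl_with_limit(urls, max_concurrent=3))
print(len(results))
```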

Error Handling

The package provides comprehensive error handling:

Automatic retry for failed URLs
Failure ratio monitoring
Detailed error reporting
Webhook error notifications
Progress tracking during retries

Requirements

Python 3.11+
A running SpiderForce4AI service instance
Internet connection

Dependencies

aiohttp
asyncio (part of the Python standard library)
rich
aiofiles
httpx

License

MIT License

Credits

Created by Peter Tam

Release history

2.7

Feb 27, 2025

2.6.9

Feb 27, 2025

2.6.8

Feb 27, 2025

2.6.7

Feb 16, 2025

2.6.6

Feb 16, 2025

2.6.5

Feb 16, 2025

2.6.4

Feb 15, 2025

2.6.3

Feb 15, 2025

2.6

Feb 15, 2025

2.5.9

Feb 15, 2025

2.5.8

Feb 15, 2025

2.5.7

Feb 15, 2025

2.5.6

Feb 15, 2025

2.5.5

Feb 15, 2025

2.5.4

Feb 15, 2025

2.5.3

Feb 15, 2025

2.5.2

Feb 15, 2025

2.5.1

Feb 15, 2025

2.5

Feb 15, 2025

2.4.9

Feb 15, 2025

2.4.8

Feb 15, 2025

2.4.7

Feb 15, 2025

2.4.6

Feb 15, 2025

2.4.5

Feb 15, 2025

2.4.3

Feb 15, 2025

2.4.2

Feb 15, 2025

This version

2.4.1

Feb 15, 2025

2.4

Feb 15, 2025

2.3.1

Feb 15, 2025

2.1

Feb 15, 2025

2.0

Feb 15, 2025

1.9

Feb 15, 2025

1.8

Feb 15, 2025

1.7

Feb 15, 2025

1.6

Feb 15, 2025

1.5

Feb 15, 2025

1.4

Feb 15, 2025

1.3

Feb 15, 2025

1.2

Feb 15, 2025

1.1

Feb 15, 2025

1.0

Feb 15, 2025

0.1.9

Feb 15, 2025

0.1.8

Feb 15, 2025

0.1.7

Feb 15, 2025

0.1.6

Feb 15, 2025

0.1.5

Feb 15, 2025

0.1.4

Feb 15, 2025

0.1.3

Feb 15, 2025

0.1.2

Feb 15, 2025

0.1.0

Feb 15, 2025

Download files

Source Distribution

spiderforce4ai-2.4.1.tar.gz (18.3 kB)

Uploaded Feb 15, 2025 Source

Built Distribution

spiderforce4ai-2.4.1-py3-none-any.whl (15.1 kB)

Uploaded Feb 15, 2025 Python 3

File details

Details for the file spiderforce4ai-2.4.1.tar.gz.

File metadata

Download URL: spiderforce4ai-2.4.1.tar.gz
Upload date: Feb 15, 2025
Size: 18.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for spiderforce4ai-2.4.1.tar.gz
Algorithm	Hash digest
SHA256	`8eab2ad84cb7a40908e8e0d529223c8a4caaf99863c25fb09eab82033036bb7d`
MD5	`070a018a78c0095deb6bc0636117c368`
BLAKE2b-256	`6c10299867c5c2d093e6832562c79e60c14c8623e378727ff0987f850840afc9`
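A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` is a hypothetical helper, and the archive is assumed to sit in the current directory under its published file name.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


EXPECTED = "8eab2ad84cb7a40908e8e0d529223c8a4caaf99863c25fb09eab82033036bb7d"
archive = Path("spiderforce4ai-2.4.1.tar.gz")  # assumed download location

if archive.exists():
    print("OK" if sha256_of(archive) == EXPECTED else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```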

File details

Details for the file spiderforce4ai-2.4.1-py3-none-any.whl.

File metadata

Download URL: spiderforce4ai-2.4.1-py3-none-any.whl
Upload date: Feb 15, 2025
Size: 15.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for spiderforce4ai-2.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f6002c1a90709ef56208d0f9a448e3e0244b03534576abbbcca4457bb535b9b3`
MD5	`c261ab04c09b7fed23e7c0846b6d54ff`
BLAKE2b-256	`c29e9357b406c88c241ffea60c0a6917d5a930c8dae6bea3c7b2ec4f8a69cd56`
