A web crawler implemented in Go with Python bindings

Pathik

A high-performance web crawler implemented in Go with Python and JavaScript bindings. It converts web pages to both HTML and Markdown formats.

Features

  • Fast crawling with Go's concurrency model
  • Clean content extraction
  • Markdown conversion
  • Parallel URL processing
  • Cloudflare R2 integration
  • Kafka streaming support with configurable buffer sizes
  • Enhanced security with URL and IP validation
  • Memory-efficient (uses ~10x less memory than browser automation tools)
  • Automatic binary version management
  • Customizable compression options (Gzip and Snappy support)
  • Session-based message tracking for multi-user environments

Performance Benchmarks

Memory Usage Comparison

Pathik is significantly more memory-efficient than browser automation tools like Playwright:

[Figure: Memory Usage Comparison]

Parallel Crawling Performance

Parallel crawling significantly improves performance when processing multiple URLs. Our benchmarks show:

Python Performance

Testing with 5 URLs:
- Parallel crawling completed in 7.78 seconds
- Sequential crawling completed in 18.52 seconds
- Performance improvement: 2.38x faster with parallel crawling

JavaScript Performance

Testing with 5 URLs:
- Parallel crawling completed in 6.96 seconds
- Sequential crawling completed in 21.07 seconds
- Performance improvement: 3.03x faster with parallel crawling

Parallel crawling is enabled by default when processing multiple URLs, but you can explicitly control it with the parallel parameter.
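The speedup figures quoted above are the sequential time divided by the parallel time. A minimal sketch of that arithmetic, using only the timings listed in this section (the `speedup` helper is a hypothetical name for illustration and is not part of Pathik):

```python
def speedup(sequential_s: float, parallel_s: float) -> float:
    """Return how many times faster the parallel run was than the sequential run."""
    if parallel_s <= 0:
        raise ValueError("parallel time must be positive")
    return sequential_s / parallel_s


# Timings in seconds, copied from the benchmark results above.
python_speedup = speedup(18.52, 7.78)
javascript_speedup = speedup(21.07, 6.96)

print(f"Python: {python_speedup:.2f}x")          # prints "Python: 2.38x"
print(f"JavaScript: {javascript_speedup:.2f}x")  # prints "JavaScript: 3.03x"
```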

Installation

pip install pathik

The package will automatically download the correct binary for your platform from GitHub releases on first use.

Binary Version Management

Pathik now automatically handles binary version checking and updates:

  • When you install or upgrade the Python package, it will check if the binary matches the package version

  • If the versions don't match, it will automatically download the correct binary

  • You can manually check and update the binary with:

    # Force binary update
    import pathik
    from pathik.crawler import get_binary_path
    binary_path = get_binary_path(force_download=True)
    
  • Command line options:

    # Check if binary is up to date
    pathik --check-binary
    
    # Force update of the binary
    pathik --force-update-binary
    

This ensures you always have the correct binary version with all the latest features, especially when using new functionality like Kafka streaming with session IDs.

Usage

Python API

Basic Crawling

import pathik

# Crawl a single URL
result = pathik.crawl("https://example.com")
print(f"HTML saved to: {result['https://example.com']['html']}")
print(f"Markdown saved to: {result['https://example.com']['markdown']}")

# Crawl multiple URLs in parallel (the default for multiple URLs)
urls = [
    "https://example.com",
    "https://httpbin.org/html",
    "https://jsonplaceholder.typicode.com"
]
results = pathik.crawl(urls)

# To disable parallel crawling
results = pathik.crawl(urls, parallel=False)

# To specify output directory
results = pathik.crawl(urls, output_dir="./output")
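The examples above suggest that the crawl result is a dictionary keyed by URL, with `html` and `markdown` entries holding file paths. Assuming that shape, a small helper can collect the Markdown paths for further processing. This is a sketch: `markdown_paths` is a hypothetical helper, and the sample data stands in for a real return value.

```python
from typing import Dict, Mapping


def markdown_paths(results: Mapping[str, Mapping[str, str]]) -> Dict[str, str]:
    """Map each crawled URL to its Markdown file path, skipping entries without one."""
    return {
        url: files["markdown"]
        for url, files in results.items()
        if files.get("markdown")
    }


# Stand-in for the value returned by pathik.crawl(...); the paths are made up.
sample_results = {
    "https://example.com": {
        "html": "./output/example.html",
        "markdown": "./output/example.md",
    },
    "https://httpbin.org/html": {"html": "./output/httpbin.html"},
}

for url, path in markdown_paths(sample_results).items():
    print(f"{url} -> {path}")
```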

R2 Upload

import pathik
import uuid

# Generate a UUID or use your own
my_uuid = str(uuid.uuid4())

# Crawl and upload to R2
results = pathik.crawl_to_r2("https://example.com", uuid_str=my_uuid)
print(f"UUID: {results['https://example.com']['uuid']}")
print(f"R2 HTML key: {results['https://example.com']['r2_html_key']}")
print(f"R2 Markdown key: {results['https://example.com']['r2_markdown_key']}")

# Upload multiple URLs
results = pathik.crawl_to_r2([
    "https://example.com",
    "https://httpbin.org/html"
], uuid_str=my_uuid)

Secure Kafka Streaming with Buffer Customization

import pathik
import uuid

# Generate a session ID for tracking
session_id = str(uuid.uuid4())

# Stream a single URL to Kafka
result = pathik.stream_to_kafka("https://example.com", session=session_id)
print(f"Success: {result['https://example.com']['success']}")

# Stream multiple URLs with custom options
results = pathik.stream_to_kafka(
    urls=["https://example.com", "https://httpbin.org/html"],
    content_type="html",              # Options: "html", "markdown", or "both"
    topic="custom_topic",             # Optional custom topic
    session=session_id,               # Optional session ID
    parallel=True,                    # Process URLs in parallel (default)
    max_message_size=15728640,        # 15MB message size limit
    buffer_memory=157286400           # 150MB buffer memory
)

# Check results
for url, status in results.items():
    if status["success"]:
        print(f"Successfully streamed {url}")
    else:
        print(f"Failed to stream {url}: {status.get('error')}")
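When many URLs are streamed at once, it can help to separate the successes from the failures before deciding what to retry. The sketch below assumes the per-URL status dictionaries shown above (`success` and an optional `error`). `split_by_status` is a hypothetical helper, and the sample data is invented.

```python
from typing import Dict, List, Mapping, Tuple


def split_by_status(
    results: Mapping[str, Mapping[str, object]],
) -> Tuple[List[str], Dict[str, str]]:
    """Return (succeeded URLs, {failed URL: error message})."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for url, status in results.items():
        if status.get("success"):
            succeeded.append(url)
        else:
            failed[url] = str(status.get("error", "unknown error"))
    return succeeded, failed


# Stand-in for the value returned by pathik.stream_to_kafka(...).
sample_status = {
    "https://example.com": {"success": True},
    "https://httpbin.org/html": {"success": False, "error": "timeout"},
}

ok, errors = split_by_status(sample_status)
print(f"{len(ok)} succeeded, {len(errors)} failed")
```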

Command Line

# Crawl a single URL
pathik crawl https://example.com

# Crawl multiple URLs
pathik crawl https://example.com https://httpbin.org/html

# Specify output directory
pathik crawl -o ./output https://example.com

# Use sequential (non-parallel) mode
pathik crawl -s https://example.com https://httpbin.org/html

# Upload to R2 (Cloudflare)
pathik r2 https://example.com

# Stream crawled content to Kafka
pathik kafka https://example.com

# Stream only HTML content to Kafka
pathik kafka -c html https://example.com

# Stream only Markdown content to Kafka
pathik kafka -c markdown https://example.com

# Stream to a specific Kafka topic
pathik kafka -t user1_crawl_data https://example.com

# Add a session ID for multi-user environments
pathik kafka --session user123 https://example.com

# Set custom buffer sizes
pathik kafka --max-message-size 15728640 --buffer-memory 157286400 https://example.com

# Combine options
pathik kafka -c html -t user1_data --session user123 --max-message-size 15728640 https://example.com

Kafka Streaming

Pathik supports streaming crawled content directly to Kafka. This is useful for real-time processing pipelines.

Kafka Configuration

Configure Kafka connection details using environment variables or a .env file:

KAFKA_BROKERS=localhost:9092        # Comma-separated list of brokers
KAFKA_TOPIC=pathik_crawl_data       # Topic to publish to
KAFKA_USERNAME=                     # Optional username for SASL authentication
KAFKA_PASSWORD=                     # Optional password for SASL authentication
KAFKA_CLIENT_ID=pathik-crawler      # Client ID for Kafka
KAFKA_USE_TLS=false                 # Whether to use TLS
KAFKA_MAX_MESSAGE_SIZE=10485760     # 10MB max message size (default)
KAFKA_BUFFER_MEMORY=104857600       # 100MB buffer memory (default)
KAFKA_MAX_REQUEST_SIZE=20971520     # 20MB max request size (default)
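To illustrate how these variables fit together, here is a sketch that reads them into a typed settings object. It is not Pathik's own loader. `KafkaSettings` and `load_kafka_settings` are hypothetical names, the fallback values simply mirror the listing above, and treating "1" and "yes" as true for `KAFKA_USE_TLS` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class KafkaSettings:
    brokers: List[str]
    topic: str
    username: Optional[str]
    password: Optional[str]
    client_id: str
    use_tls: bool
    max_message_size: int
    buffer_memory: int
    max_request_size: int


def load_kafka_settings(env: Mapping[str, str] = os.environ) -> KafkaSettings:
    """Build settings from environment variables, falling back to the values listed above."""
    brokers = env.get("KAFKA_BROKERS", "localhost:9092")
    return KafkaSettings(
        brokers=[b.strip() for b in brokers.split(",") if b.strip()],
        topic=env.get("KAFKA_TOPIC", "pathik_crawl_data"),
        username=env.get("KAFKA_USERNAME") or None,  # empty string means "not set"
        password=env.get("KAFKA_PASSWORD") or None,
        client_id=env.get("KAFKA_CLIENT_ID", "pathik-crawler"),
        # Assumption: accept a few common spellings of "true".
        use_tls=env.get("KAFKA_USE_TLS", "false").strip().lower() in {"1", "true", "yes"},
        max_message_size=int(env.get("KAFKA_MAX_MESSAGE_SIZE", "10485760")),
        buffer_memory=int(env.get("KAFKA_BUFFER_MEMORY", "104857600")),
        max_request_size=int(env.get("KAFKA_MAX_REQUEST_SIZE", "20971520")),
    )


settings = load_kafka_settings({"KAFKA_BROKERS": "kafka1:9092, kafka2:9092"})
print(settings.brokers, settings.topic, settings.use_tls)
```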

Buffer Size Customization

You can customize Kafka producer and consumer buffer sizes for handling large content:

# Custom buffer sizes for producer
pathik.stream_to_kafka(
    "https://example.com",
    max_message_size=15728640,    # 15MB max message size (default: 10MB)
    buffer_memory=157286400       # 150MB buffer memory (default: 100MB)
)

# For consuming large messages (run this from a shell, not from Python)
python kafka_consumer_direct.py --session=YOUR_SESSION_ID --max-bytes=20971520 --max-partition-bytes=10485760

For debugging or handling large web pages, you may need to increase these buffer sizes to prevent message size errors.
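The byte counts used throughout this section are whole multiples of 1 MiB (1024 × 1024 bytes). A small sketch for deriving them instead of hard-coding the numbers (`mib` is a hypothetical helper, not a Pathik function):

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def mib(count: int) -> int:
    """Convert a size in MiB to bytes."""
    return count * MIB


print(mib(10))   # 10485760  (default max message size)
print(mib(15))   # 15728640  (custom max message size used above)
print(mib(20))   # 20971520  (default max request size)
print(mib(100))  # 104857600 (default buffer memory)
print(mib(150))  # 157286400 (custom buffer memory used above)
```

The results can be passed directly, for example `max_message_size=mib(15)`.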

Optional Kafka Dependencies

To use Kafka streaming, install the required dependencies:

# Install pathik with Kafka support
pip install "pathik[kafka]"

# Or install the dependency separately
pip install kafka-python

# For Snappy compression support (recommended)
pip install python-snappy

If Kafka dependencies are not available, Pathik will use a fallback simulation mode that logs the messages locally without actually sending them to Kafka.
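To find out ahead of time whether the optional packages are importable in your environment, you can probe for them with the standard library. This is a sketch of a general pattern, not Pathik's internal check. `has_module` and `kafka_mode` are hypothetical names, while `kafka` and `snappy` are the import names of the `kafka-python` and `python-snappy` packages.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def kafka_mode() -> str:
    """Describe which mode to expect: real streaming or the local simulation fallback."""
    return "kafka" if has_module("kafka") else "simulation"


print(f"Expected mode: {kafka_mode()}")
print(f"Snappy compression available: {has_module('snappy')}")
```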

Kafka Message Format

When streaming both content types (the behavior when no content type is selected), Pathik sends two messages per URL:

  1. HTML Content:

    • Key: URL
    • Value: Raw HTML content
    • Headers:
      • url: The original URL
      • contentType: "text/html"
      • timestamp: ISO 8601 timestamp
      • sessionID: Session ID (if provided)
  2. Markdown Content:

    • Key: URL
    • Value: Markdown content
    • Headers:
      • url: The original URL
      • contentType: "text/markdown"
      • timestamp: ISO 8601 timestamp
      • sessionID: Session ID (if provided)
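The message layout described above can be sketched as plain Python data, which is handy when writing consumer tests. The field names come from the list above. `build_messages` is a hypothetical helper, and representing each message as a dictionary is an assumption made for illustration.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def build_messages(
    url: str, html: str, markdown: str, session_id: Optional[str] = None
) -> List[Dict[str, object]]:
    """Build the HTML and Markdown messages for one URL in the documented layout."""
    timestamp = datetime.now(timezone.utc).isoformat()  # ISO 8601
    messages: List[Dict[str, object]] = []
    for content_type, body in (("text/html", html), ("text/markdown", markdown)):
        headers = {"url": url, "contentType": content_type, "timestamp": timestamp}
        if session_id is not None:
            headers["sessionID"] = session_id  # only present when a session is given
        messages.append({"key": url, "value": body, "headers": headers})
    return messages


for message in build_messages(
    "https://example.com", "<h1>Hi</h1>", "# Hi", session_id="user123"
):
    print(message["headers"]["contentType"], len(str(message["value"])))
```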

Security Enhancements

Pathik includes several security features to ensure safe and reliable crawling:

URL Validation

URLs are validated to prevent security issues:

  • Only HTTP and HTTPS schemes are allowed
  • Private IP addresses and localhost access are restricted
  • Hostnames are resolved and checked against private IP ranges
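The checks listed above can be approximated with the standard library. The following is a sketch of the general technique, not Pathik's Go implementation. `is_url_allowed` is a hypothetical name, and the exact set of blocked address ranges is an assumption.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def _is_public(ip_text: str) -> bool:
    """Return True if the address is not private, loopback, link-local, or otherwise special."""
    ip = ipaddress.ip_address(ip_text.split("%")[0])  # drop any IPv6 scope id
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def is_url_allowed(url: str) -> bool:
    """Allow only http/https URLs whose host resolves exclusively to public addresses."""
    try:
        parts = urlparse(url)
        host = parts.hostname
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        return _is_public(host)  # host is already an IP literal
    except ValueError:
        pass  # not an IP literal, so resolve it
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    return all(_is_public(info[4][0]) for info in infos)


for candidate in ("https://8.8.8.8/", "http://127.0.0.1/", "ftp://example.com/"):
    print(candidate, is_url_allowed(candidate))
```

Note that a check like this only covers the address seen at validation time. A crawler that follows redirects needs to validate each hop as well.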

Input Sanitization

All inputs (file paths, session IDs, etc.) are sanitized to prevent injection attacks:

  • File paths are checked for directory traversal attempts
  • Session IDs are validated against a safe pattern (alphanumeric characters plus a limited set of special characters)
  • Topic names and other parameters are validated for safe characters
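If you want to pre-validate inputs in your own code before handing them to Pathik, the two checks above can be sketched as follows. The allowed character set and the length limit are assumptions (the exact pattern Pathik uses is not documented here), and both function names are hypothetical.

```python
import os
import re

# Assumption: letters, digits, underscore, and hyphen, up to 64 characters.
SESSION_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_safe_session_id(session_id: str) -> bool:
    """Return True if the whole session ID matches the assumed safe pattern."""
    return SESSION_ID_PATTERN.fullmatch(session_id) is not None


def is_inside(base_dir: str, candidate: str) -> bool:
    """Return True if candidate, resolved relative to base_dir, stays within base_dir."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, candidate))
    return os.path.commonpath([base, target]) == base


print(is_safe_session_id("user123"))           # True
print(is_safe_session_id("user 123; rm -rf"))  # False
print(is_inside("output", "pages/a.html"))     # True
print(is_inside("output", "../secrets.txt"))   # False
```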

Rate Limiting

Built-in rate limiting prevents accidentally overloading target servers:

  • Default rate limit of 1 request per second with configurable burst
  • Automatic retries with exponential backoff
  • Delay between retries is configurable
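A rate limit with a configurable burst is commonly implemented as a token bucket. The sketch below shows that technique under the defaults mentioned above (1 request per second). It is an illustration only, not Pathik's Go implementation, and the `TokenBucket` class with its injectable `clock` is hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts of up to `burst`."""

    def __init__(
        self, rate: float = 1.0, burst: int = 1,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=1.0, burst=2)
print([bucket.try_acquire() for _ in range(3)])  # first two pass, third is throttled
```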

Error Handling

Robust error handling ensures graceful failure:

  • Detailed error messages for troubleshooting
  • Automatic retries for transient failures
  • Graceful shutdown with proper cleanup
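Automatic retries for transient failures are usually paired with exponential backoff, where the wait doubles after each failed attempt. The sketch below is a generic version of that pattern. The `retry` helper, its defaults, and the choice of which exceptions count as transient are all assumptions rather than Pathik's actual behavior.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    factor: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient errors with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
            delay *= factor
    raise AssertionError("unreachable")


calls = {"count": 0}


def flaky_fetch() -> str:
    """Fail twice with a transient error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []
print(retry(flaky_fetch, sleep=waits.append), waits)  # prints: ok [1.0, 2.0]
```

Errors outside `retry_on` propagate immediately, which keeps permanent failures (for example, an invalid URL) from being retried.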
