
A web crawler implemented in Go with Python bindings


Pathik


A high-performance web crawler implemented in Go with Python and JavaScript bindings. It converts web pages to both HTML and Markdown formats.

Features

  • Fast crawling with Go's concurrency model
  • Clean content extraction
  • Markdown conversion
  • Parallel URL processing
  • Cloudflare R2 integration
  • Memory-efficient (uses ~10x less memory than browser automation tools)

Performance Benchmarks

Memory Usage Comparison

Pathik is significantly more memory-efficient than browser automation tools like Playwright:

[Figure: Memory Usage Comparison]

Parallel Crawling Performance

Parallel crawling significantly improves performance when processing multiple URLs. Our benchmarks show:

Python Performance

Testing with 5 URLs:
- Parallel crawling completed in 7.78 seconds
- Sequential crawling completed in 18.52 seconds
- Performance improvement: 2.38x faster with parallel crawling

JavaScript Performance

Testing with 5 URLs:
- Parallel crawling completed in 6.96 seconds
- Sequential crawling completed in 21.07 seconds
- Performance improvement: 3.03x faster with parallel crawling
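The improvement figures above are the sequential time divided by the parallel time. A minimal sketch that recomputes them from the reported timings (the `speedup` helper is illustrative and not part of Pathik):

```python
def speedup(sequential_s: float, parallel_s: float) -> float:
    """Return how many times faster the parallel run was."""
    if parallel_s <= 0:
        raise ValueError("parallel time must be positive")
    return sequential_s / parallel_s


# Timings in seconds, copied from the benchmark runs above:
# (sequential, parallel) for each binding.
benchmarks = {
    "Python": (18.52, 7.78),
    "JavaScript": (21.07, 6.96),
}

for name, (sequential_s, parallel_s) in benchmarks.items():
    print(f"{name}: {speedup(sequential_s, parallel_s):.2f}x faster")
# Python: 2.38x faster
# JavaScript: 3.03x faster
```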

Parallel crawling is enabled by default when processing multiple URLs, but you can explicitly control it with the parallel parameter.

Installation

pip install pathik

The package will automatically download the correct binary for your platform from GitHub releases on first use.

Usage

Python API

import pathik

# Crawl a single URL
result = pathik.crawl("https://example.com")
print(f"HTML saved to: {result['https://example.com']['html']}")
print(f"Markdown saved to: {result['https://example.com']['markdown']}")

# Crawl multiple URLs in parallel
urls = [
    "https://example.com",
    "https://httpbin.org/html",
    "https://jsonplaceholder.typicode.com",
]
results = pathik.crawl(urls)

# To disable parallel crawling
results = pathik.crawl(urls, parallel=False)

# To specify output directory
results = pathik.crawl(urls, output_dir="./output")
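The single-URL example indexes the result by URL and reads its 'html' and 'markdown' entries. A minimal sketch for walking a multi-URL result, assuming every entry has that same shape. The `list_outputs` helper and the paths in `sample_results` are hypothetical stand-ins, not output captured from Pathik:

```python
from typing import Dict, List, Tuple


def list_outputs(results: Dict[str, Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Flatten a crawl result into (url, html_path, markdown_path) rows.

    A missing 'html' or 'markdown' key becomes an empty string, so one
    incomplete entry does not abort the loop.
    """
    rows = []
    for url, files in results.items():
        rows.append((url, files.get("html", ""), files.get("markdown", "")))
    return rows


# Hypothetical stand-in for the value returned by pathik.crawl(urls).
sample_results = {
    "https://example.com": {
        "html": "./output/example.com.html",
        "markdown": "./output/example.com.md",
    },
    "https://httpbin.org/html": {
        "html": "./output/httpbin.org_html.html",
        "markdown": "./output/httpbin.org_html.md",
    },
}

for url, html_path, markdown_path in list_outputs(sample_results):
    print(f"{url}\n  HTML: {html_path}\n  Markdown: {markdown_path}")
```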

Command Line

# Crawl a single URL
pathik crawl https://example.com

# Crawl multiple URLs
pathik crawl https://example.com https://httpbin.org/html

# Specify output directory
pathik crawl -o ./output https://example.com

# Use sequential (non-parallel) mode
pathik crawl -s https://example.com https://httpbin.org/html

# Upload to R2 (Cloudflare)
pathik r2 https://example.com

# Stream crawled content to Kafka
pathik kafka https://example.com

# Stream only HTML content to Kafka
pathik kafka -content html https://example.com

# Stream only Markdown content to Kafka
pathik kafka -content markdown https://example.com

# Stream to a specific Kafka topic
pathik kafka -topic user1_crawl_data https://example.com

# Add a session ID for multi-user environments
pathik kafka --session user123 https://example.com

# Combine options
pathik kafka -content html -topic user1_data --session user123 https://example.com
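When the CLI is driven from a script, the `pathik kafka` options shown above can be assembled programmatically. This is a sketch only: `build_kafka_command` is a hypothetical helper, it uses just the flags listed in this section, and rejecting any content value other than html or markdown is an assumption based on those examples.

```python
import shlex
from typing import List, Optional


def build_kafka_command(
    url: str,
    content: Optional[str] = None,
    topic: Optional[str] = None,
    session: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a `pathik kafka` invocation."""
    # Assumption: only the two content types shown in this README are valid.
    if content is not None and content not in ("html", "markdown"):
        raise ValueError(f"unsupported content type: {content!r}")
    cmd = ["pathik", "kafka"]
    if content:
        cmd += ["-content", content]
    if topic:
        cmd += ["-topic", topic]
    if session:
        cmd += ["--session", session]
    cmd.append(url)
    return cmd


cmd = build_kafka_command(
    "https://example.com", content="html", topic="user1_data", session="user123"
)
print(shlex.join(cmd))
# pathik kafka -content html -topic user1_data --session user123 https://example.com
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided the `pathik` command is available on the PATH.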

Kafka Streaming

Pathik supports streaming crawled content directly to Kafka. This is useful for real-time processing pipelines.

Kafka Configuration

Configure Kafka connection details in the .env file:

KAFKA_BROKERS=localhost:9092        # Comma-separated list of brokers
KAFKA_TOPIC=pathik_crawl_data       # Topic to publish to
KAFKA_USERNAME=                     # Optional username for SASL authentication
KAFKA_PASSWORD=                     # Optional password for SASL authentication
KAFKA_CLIENT_ID=pathik-crawler      # Client ID for Kafka
KAFKA_USE_TLS=false                 # Whether to use TLS
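To inspect these settings from Python, the file can be read with a few lines of code. The sketch below is a simplified reader, not the loader Pathik itself uses: it treats everything after `#` as a comment, so a value that contains `#` (a password, for example) would be cut short.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, ignoring blank lines and # comments."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop inline comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


env_text = """
KAFKA_BROKERS=localhost:9092        # Comma-separated list of brokers
KAFKA_TOPIC=pathik_crawl_data       # Topic to publish to
KAFKA_USERNAME=                     # Optional username for SASL authentication
KAFKA_PASSWORD=                     # Optional password for SASL authentication
KAFKA_CLIENT_ID=pathik-crawler      # Client ID for Kafka
KAFKA_USE_TLS=false                 # Whether to use TLS
"""

settings = parse_env(env_text)
brokers = settings["KAFKA_BROKERS"].split(",")       # list of host:port strings
use_tls = settings["KAFKA_USE_TLS"].lower() == "true"
print(brokers, settings["KAFKA_TOPIC"], use_tls)
```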

Kafka Message Format

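When streaming to Kafka, Pathik sends two messages per URL by default (the -content option shown under Command Line limits the stream to HTML or Markdown only):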

  1. HTML Content:

    • Key: URL
    • Value: Raw HTML content
    • Headers:
      • url: The original URL
      • contentType: "text/html"
      • timestamp: ISO 8601 timestamp
  2. Markdown Content:

    • Key: URL
    • Value: Markdown content
    • Headers:
      • url: The original URL
      • contentType: "text/markdown"
      • timestamp: ISO 8601 timestamp
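The layout above can be modelled as plain Python data, which is handy for writing consumer tests without a broker. This is a sketch under assumptions: `build_messages` is a hypothetical helper, keys, values and headers are UTF-8 encoded, and both messages share one timestamp. None of these details are confirmed by the description above.

```python
from datetime import datetime, timezone
from typing import Dict, List


def build_messages(url: str, html: str, markdown: str) -> List[Dict[str, object]]:
    """Return the HTML and Markdown messages for one URL, in that order."""
    # ISO 8601 timestamp in UTC, e.g. 2025-01-01T12:00:00+00:00
    timestamp = datetime.now(timezone.utc).isoformat()

    def message(content_type: str, body: str) -> Dict[str, object]:
        return {
            "key": url.encode("utf-8"),
            "value": body.encode("utf-8"),
            "headers": [
                ("url", url.encode("utf-8")),
                ("contentType", content_type.encode("utf-8")),
                ("timestamp", timestamp.encode("utf-8")),
            ],
        }

    return [message("text/html", html), message("text/markdown", markdown)]


messages = build_messages(
    "https://example.com", "<h1>Example</h1>", "# Example"
)
for m in messages:
    print(m["key"], dict(m["headers"])["contentType"], len(m["value"]))
```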

Usage with Kafka Consumers

Here's a minimal example of consuming Pathik messages with a Kafka consumer:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/segmentio/kafka-go"
)

func main() {
    reader := kafka.NewReader(kafka.ReaderConfig{
        Brokers:   []string{"localhost:9092"},
        Topic:     "pathik_crawl_data",
        Partition: 0,
        MinBytes:  10e3, // 10KB
        MaxBytes:  10e6, // 10MB
    })
    defer reader.Close()

    for {
        m, err := reader.ReadMessage(context.Background())
        if err != nil {
            // Report why reading stopped instead of exiting silently
            log.Printf("stopping consumer: %v", err)
            break
        }

        // The message key carries the crawled URL
        url := string(m.Key)
        contentType := ""

        // Extract content type from headers
        for _, header := range m.Headers {
            if header.Key == "contentType" {
                contentType = string(header.Value)
                break
            }
        }

        fmt.Printf("Received from %s: %s content (%d bytes)\n",
            url, contentType, len(m.Value))
    }
}
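The same per-message handling can be sketched in Python. The example assumes headers arrive as a list of (name, bytes) tuples, which is how kafka-python exposes them on a consumed record. The `describe` helper and the sample header values are hypothetical, and no broker connection is made here.

```python
from typing import Iterable, Tuple


def describe(key: bytes, value: bytes, headers: Iterable[Tuple[str, bytes]]) -> str:
    """Summarise one Pathik message: source URL, content type, payload size."""
    content_type = ""
    # Extract the content type from the headers, as the Go loop above does.
    for name, raw in headers:
        if name == "contentType":
            content_type = raw.decode("utf-8")
            break
    url = key.decode("utf-8")
    return f"Received from {url}: {content_type} content ({len(value)} bytes)"


# Hypothetical message, shaped like the Markdown message described earlier.
sample_headers = [
    ("url", b"https://example.com"),
    ("contentType", b"text/markdown"),
    ("timestamp", b"2025-01-01T00:00:00Z"),
]
print(describe(b"https://example.com", b"# Example", sample_headers))
# Received from https://example.com: text/markdown content (9 bytes)
```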

Using in Docker

When using Pathik in a Docker container, you need to install the required dependencies for Chromium:

FROM python:3.10-slim

# Install Chromium dependencies
RUN apt-get update && apt-get install -y \
    libglib2.0-0 \
    libgtk-3-0 \
    libx11-6 \
    libxcomposite1 \
    libxcursor1 \
    libxdamage1 \
    libxi6 \
    libxtst6 \
    libnss3 \
    libcups2 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libgdk-pixbuf2.0-0 \
    libpango-1.0-0 \
    libcairo2 \
    libdrm2 \
    libgbm1 \
    libasound2 \
    fonts-freefont-ttf \
    && rm -rf /var/lib/apt/lists/*

# Install pathik
RUN pip install pathik

Development

Setup

# Clone the repository
git clone https://github.com/justrach/pathik.git
cd pathik

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e .
