A web crawler implemented in Go with Python bindings

Pathik

A high-performance web crawler implemented in Go with Python and JavaScript bindings. It converts web pages to both HTML and Markdown formats.

Features

  • Fast crawling with Go's concurrency model
  • Clean content extraction
  • Markdown conversion
  • Parallel URL processing
  • Cloudflare R2 integration
  • Memory-efficient (uses ~10x less memory than browser automation tools)

Performance Benchmarks

Memory Usage Comparison

Pathik is significantly more memory-efficient than browser automation tools like Playwright.

[Figure: Memory Usage Comparison]

Parallel Crawling Performance

Parallel crawling significantly improves performance when processing multiple URLs. Our benchmarks show:

Python Performance

Testing with 5 URLs:
- Parallel crawling completed in 7.78 seconds
- Sequential crawling completed in 18.52 seconds
- Performance improvement: 2.38x faster with parallel crawling

JavaScript Performance

Testing with 5 URLs:
- Parallel crawling completed in 6.96 seconds
- Sequential crawling completed in 21.07 seconds
- Performance improvement: 3.03x faster with parallel crawling

Parallel crawling is enabled by default when processing multiple URLs, but you can explicitly control it with the parallel parameter.
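
The speedup figures above are the sequential time divided by the parallel time. The sketch below reproduces them from the reported timings and adds a small timing helper. The names `speedup` and `time_crawl` are illustrative and not part of Pathik; the helper only assumes a crawl function that accepts a list of URLs and a `parallel` keyword, as `pathik.crawl` does in the usage section.

```python
import time
from typing import Callable, Sequence


def speedup(sequential_seconds: float, parallel_seconds: float) -> float:
    """Return how many times faster the parallel run was."""
    return sequential_seconds / parallel_seconds


def time_crawl(crawl: Callable[..., dict], urls: Sequence[str], parallel: bool) -> float:
    """Time a single call to crawl(urls, parallel=...) and return seconds elapsed."""
    start = time.perf_counter()
    crawl(list(urls), parallel=parallel)
    return time.perf_counter() - start


# Recompute the ratios from the timings reported above.
print(f"Python: {speedup(18.52, 7.78):.2f}x")      # Python: 2.38x
print(f"JavaScript: {speedup(21.07, 6.96):.2f}x")  # JavaScript: 3.03x
```

To measure a real run, pass `pathik.crawl` as the `crawl` argument, once with `parallel=True` and once with `parallel=False`, and feed the two results to `speedup`. Network timings vary between runs, so treat a single measurement as indicative only.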

Installation

pip install pathik

The package will automatically download the correct binary for your platform from GitHub releases on first use.
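
How the package chooses a binary is not described here. As a rough illustration only, a platform can be mapped to Go-style OS and architecture names (such as the `linux` and `amd64` values used by the build script) with Python's standard `platform` module. The function names and the alias table below are assumptions made for this sketch, not Pathik's actual selection logic.

```python
import platform

# Hypothetical alias table: values reported by platform.machine() on the left,
# Go-style architecture names on the right.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def go_platform(system: str, machine: str) -> tuple[str, str]:
    """Map a (system, machine) pair to Go-style (os, arch) names."""
    arch = _ARCH_ALIASES.get(machine.lower())
    if arch is None:
        raise ValueError(f"unsupported architecture: {machine}")
    return system.lower(), arch


def current_platform() -> tuple[str, str]:
    """Detect the (os, arch) pair for the machine running this code."""
    return go_platform(platform.system(), platform.machine())


print(go_platform("Linux", "x86_64"))  # ('linux', 'amd64')
print(go_platform("Darwin", "arm64"))  # ('darwin', 'arm64')
```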

Usage

Python API

import pathik

# Crawl a single URL
result = pathik.crawl("https://example.com")
print(f"HTML saved to: {result['https://example.com']['html']}")
print(f"Markdown saved to: {result['https://example.com']['markdown']}")

# Crawl multiple URLs in parallel
urls = [
    "https://example.com",
    "https://httpbin.org/html",
    "https://jsonplaceholder.typicode.com",
]
results = pathik.crawl(urls)

# To disable parallel crawling
results = pathik.crawl(urls, parallel=False)

# To specify output directory
results = pathik.crawl(urls, output_dir="./output")
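
The single-URL example indexes the result by URL and then by `'html'` or `'markdown'`. Assuming a multi-URL result has the same shape, it can be walked as sketched below. The `summarize` helper is illustrative, and treating a missing key as a failed crawl is an assumption rather than documented behavior. The sample data stands in for a real `pathik.crawl` result, and its file paths are made up.

```python
from pathlib import Path


def summarize(results: dict) -> list[str]:
    """Return one line per URL naming its saved HTML and Markdown files."""
    lines = []
    for url, files in results.items():
        html = files.get("html")
        markdown = files.get("markdown")
        if html and markdown:
            lines.append(f"{url}: {Path(html).name}, {Path(markdown).name}")
        else:
            # Assumption: an entry without both paths did not complete.
            lines.append(f"{url}: missing output")
    return lines


# Stand-in data with the assumed result shape.
sample = {
    "https://example.com": {
        "html": "output/example.html",
        "markdown": "output/example.md",
    },
    "https://httpbin.org/html": {},
}

for line in summarize(sample):
    print(line)
```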

Command Line

# Crawl a single URL
pathik crawl https://example.com

# Crawl multiple URLs
pathik crawl https://example.com https://httpbin.org/html

# Specify output directory
pathik crawl -o ./output https://example.com

# Use sequential (non-parallel) mode
pathik crawl -s https://example.com https://httpbin.org/html

# Upload to R2 (Cloudflare)
pathik r2 https://example.com
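
When the CLI is driven from a script, the argument list can be assembled programmatically. The sketch below uses only the `crawl` subcommand and the `-o` and `-s` flags shown above. The `build_crawl_command` helper is hypothetical, and combining both flags in one invocation is an assumption, since the examples show them separately.

```python
import shlex
from typing import Optional, Sequence


def build_crawl_command(
    urls: Sequence[str],
    output_dir: Optional[str] = None,
    sequential: bool = False,
) -> list[str]:
    """Build the argument list for a `pathik crawl` invocation."""
    if not urls:
        raise ValueError("at least one URL is required")
    cmd = ["pathik", "crawl"]
    if output_dir is not None:
        cmd += ["-o", output_dir]  # output directory
    if sequential:
        cmd.append("-s")           # sequential (non-parallel) mode
    cmd.extend(urls)
    return cmd


cmd = build_crawl_command(
    ["https://example.com", "https://httpbin.org/html"],
    output_dir="./output",
    sequential=True,
)
print(shlex.join(cmd))
# pathik crawl -o ./output -s https://example.com https://httpbin.org/html
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with URLs that contain `&` or `?`.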

Using in Docker

When using Pathik in a Docker container, you need to install the required dependencies for Chromium:

FROM python:3.10-slim

# Install Chromium dependencies
RUN apt-get update && apt-get install -y \
    libglib2.0-0 \
    libgtk-3-0 \
    libx11-6 \
    libxcomposite1 \
    libxcursor1 \
    libxdamage1 \
    libxi6 \
    libxtst6 \
    libnss3 \
    libcups2 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libgdk-pixbuf2.0-0 \
    libpango-1.0-0 \
    libcairo2 \
    libdrm2 \
    libgbm1 \
    libasound2 \
    fonts-freefont-ttf \
    && rm -rf /var/lib/apt/lists/*

# Install pathik
RUN pip install pathik

Development

Setup

# Clone the repository
git clone https://github.com/justrach/pathik.git
cd pathik

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e .

Building Binaries Locally

# Build for current platform
python build_binary.py

# Build for all platforms
python build_binary.py --all

# Build for specific platform
python build_binary.py --os linux --arch amd64

Release Process

Pathik uses GitHub Actions to automate the release process:

  1. Create and push a new tag:

    git tag -a v0.2.2 -m "Release v0.2.2"
    git push origin v0.2.2
    
  2. GitHub Actions will:

    • Build binaries for all supported platforms
    • Create a GitHub Release with the binaries
    • Build and publish the Python package to PyPI

The PyPI package will download the appropriate binary from GitHub releases when needed.

License

Apache 2.0
