A web crawler implemented in Go with Python bindings

Pathik

A web crawler implemented in Go with Python bindings. It saves crawled pages locally as HTML and Markdown and can optionally upload them to Cloudflare R2.

INSTALLATION

Prerequisites

  • Go 1.16+
  • Python 3.6+

Install Python Package

pip install pathik

Clone Repository

git clone https://github.com/yourusername/pathik.git
cd pathik

Install in Development Mode

pip install -e .

BUILDING GO BINARY

Navigate to Pathik Directory

cd pathik

Build Binary Using Script

python build_binary.py

Expected Output:

Building Go binary in /path/to/pathik
Build successful!
Binary located at: /path/to/pathik/pathik_bin
Testing binary...
Binary output: [Help text from binary]

USAGE

Python Usage

Basic Crawling

import pathik
import os

output_dir = os.path.abspath("output_data")
os.makedirs(output_dir, exist_ok=True)

urls = ["https://example.com"]
results = pathik.crawl(urls, output_dir)

for url, files in results.items():
    print(f"URL: {url}")
    print(f"HTML: {files['html']}")
    print(f"Markdown: {files['markdown']}")
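
Before using the returned paths, it can help to confirm that the files were actually written. The following is a minimal sketch, assuming `results` has the shape shown above (a dict mapping each URL to a dict with `html` and `markdown` paths). `missing_outputs` is a hypothetical helper for illustration and is not part of pathik.

```python
import os


def missing_outputs(results, keys=("html", "markdown")):
    """Return {url: [keys whose file is absent]} for a crawl result dict.

    Assumes each value in `results` is a dict of output paths, as in the
    basic crawling example. This helper is illustrative, not a pathik API.
    """
    problems = {}
    for url, files in results.items():
        bad = [
            key
            for key in keys
            # A key counts as missing if it is absent, empty, or not a file on disk.
            if not files.get(key) or not os.path.isfile(files[key])
        ]
        if bad:
            problems[url] = bad
    return problems
```

An empty return value means every URL produced all of the expected files; otherwise the dict names the URLs and output types worth re-crawling.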

R2 Upload (Optional)

results = pathik.crawl_to_r2(
    ["https://example.com"],
    uuid_str="my-id"
)

for url, info in results.items():
    print(f"R2 HTML Key: {info['r2_html_key']}")
    print(f"Local File: {info['local_html_file']}")

Direct Go Usage

Local Crawling

./pathik_bin -crawl -outdir ./output https://example.com

R2 Upload

./pathik_bin -r2 -uuid my-id -dir ./output https://example.com
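
The local-crawl command above can also be driven from a script. This is a rough sketch using only the `-crawl` and `-outdir` flags shown in this section; `build_crawl_command` and `run_crawl` are hypothetical helpers, and passing several URLs in one invocation is an assumption rather than documented behaviour.

```python
import subprocess


def build_crawl_command(binary, outdir, urls):
    """Assemble the argument list for the local-crawl invocation.

    Flags come first and URLs last, mirroring the command line shown above.
    """
    if not urls:
        raise ValueError("at least one URL is required")
    return [binary, "-crawl", "-outdir", outdir, *urls]


def run_crawl(binary, outdir, urls, timeout=300):
    """Run the binary and return its stdout.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the crawl exceeds `timeout` seconds.
    """
    completed = subprocess.run(
        build_crawl_command(binary, outdir, urls),
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        timeout=timeout,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as `&` or `?`.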

TROUBLESHOOTING

Missing Binary

cd pathik
python build_binary.py

Path Issues

import os

# Pass an absolute path so the output location does not depend on the working directory
output_dir = os.path.abspath("./output")

Import Errors

pip uninstall -y pathik
cd pathik && pip install -e .

PROJECT STRUCTURE

  • main.go - CLI interface
  • crawler/ - Web crawling logic
  • storage/ - File storage handlers
  • pathik/ - Python bindings
      • __init__.py - Package setup
      • crawler.py - Go integration
      • simple.py - Python fallback

CONFIGURATION

Configure R2 credentials in storage.go or through environment variables.
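
If credentials are supplied through environment variables, a quick pre-flight check can catch an unset variable before an upload is attempted. The sketch below is only illustrative: this document does not list the variable names pathik reads, so the names in `EXAMPLE_R2_VARS` are placeholders and `missing_env` is a hypothetical helper.

```python
import os

# Placeholder names for illustration only; check storage.go for the
# variable names pathik actually reads.
EXAMPLE_R2_VARS = (
    "R2_ACCOUNT_ID",
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_BUCKET_NAME",
)


def missing_env(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    unset = missing_env(EXAMPLE_R2_VARS)
    if unset:
        print("Unset variables:", ", ".join(unset))
```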

LICENSE

MIT License

Download files

Source Distribution

pathik-0.1.1.tar.gz (11.1 MB)

Uploaded Source

Built Distribution

pathik-0.1.1-py3-none-any.whl (10.9 MB)

Uploaded Python 3

File details

Details for the file pathik-0.1.1.tar.gz.

File metadata

  • Download URL: pathik-0.1.1.tar.gz
  • Upload date:
  • Size: 11.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pathik-0.1.1.tar.gz

Algorithm    Hash digest
SHA256       94bad69fa49fd4029e33c7830055821b7091fbcff2b4847900ae4edbfd8e8768
MD5          ef3aa3cdc603d77176dd5d7e23c69d94
BLAKE2b-256  c6e0b96b8ff3d3547cfcde2e1fbc8b69c110b4d6b3cc738c7e7d51e8350b18fe
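
A downloaded file can be checked against the published SHA256 digest with the Python standard library. This is a minimal sketch; `sha256_of` and `matches_sha256` are hypothetical helper names, and the file is read in chunks so large archives do not need to fit in memory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, call `matches_sha256("pathik-0.1.1.tar.gz", expected)` with `expected` set to the SHA256 value listed above, and treat a `False` result as a reason not to install the file.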

File details

Details for the file pathik-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: pathik-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 10.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pathik-0.1.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       a77fd7d69a803b20e0b77e01ea3b552d6dbc51c084099f84df1c67bae6cfcf6c
MD5          420219fb606157e9cf8cc93395faa874
BLAKE2b-256  4f87aa0c943740d1be32c8093bf69e0ce8957d75417313c9938306e59de8d98b
