
Pathik

A high-performance web crawler implemented in Go with Python and JavaScript bindings.

Features

  • Fast crawling with Go's concurrency model
  • Clean content extraction
  • Markdown conversion
  • Parallel URL processing
  • Cloudflare R2 integration
  • Memory-efficient (uses ~10x less memory than browser automation tools)

Python Installation

pip install pathik

JavaScript Installation

npm install pathik

Python Usage

import pathik
import os

# Create an output directory with an absolute path
output_dir = os.path.abspath("output_data")
os.makedirs(output_dir, exist_ok=True)

# Crawl a single URL
result = pathik.crawl('https://example.com', output_dir=output_dir)
print(f"HTML file: {result['https://example.com']['html']}")
print(f"Markdown file: {result['https://example.com']['markdown']}")

# Crawl multiple URLs in parallel (default behavior)
urls = [
    "https://example.com",
    "https://news.ycombinator.com",
    "https://github.com",
    "https://wikipedia.org"
]
results = pathik.crawl(urls, output_dir=output_dir)

# Crawl URLs sequentially (parallel disabled)
results = pathik.crawl(urls, output_dir=output_dir, parallel=False)

# Crawl and upload to R2
r2_results = pathik.crawl_to_r2(urls, uuid_str='my-unique-id', parallel=True)

JavaScript Usage

const pathik = require('pathik');
const path = require('path');
const fs = require('fs');

// Create output directory
const outputDir = path.resolve('./output_data');
fs.mkdirSync(outputDir, { recursive: true });

// Crawl a single URL
pathik.crawl('https://example.com', { outputDir })
  .then(results => {
    console.log(`HTML file: ${results['https://example.com'].html}`);
  });

// Crawl multiple URLs in parallel (default behavior)
const urls = [
  'https://example.com',
  'https://news.ycombinator.com',
  'https://github.com'
];

pathik.crawl(urls, { outputDir })
  .then(results => {
    console.log(`Crawled ${Object.keys(results).length} URLs`);
  });

// Crawl URLs sequentially
pathik.crawl(urls, { outputDir, parallel: false })
  .then(results => {
    console.log(`Crawled ${Object.keys(results).length} URLs sequentially`);
  });

// Upload to R2
pathik.crawlToR2(urls, { uuid: 'my-unique-id' })
  .then(results => {
    console.log('R2 Upload complete');
  });

Python API

pathik.crawl(urls, output_dir=None, parallel=True)

Crawl URLs and save the content locally.

Parameters:

  • urls: A single URL string or a list of URLs to crawl
  • output_dir: Directory to save crawled files (uses a temporary directory if None)
  • parallel: Whether to use parallel crawling (default: True)

Returns:

  • A dictionary mapping URLs to file paths: {url: {"html": html_path, "markdown": markdown_path}}
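For instance, the returned mapping can be consumed like this. This is a minimal sketch of working with the `{url: {"html": ..., "markdown": ...}}` shape described above; the dictionary contents below are illustrative placeholders, not real crawler output:

```python
# Sketch: consuming the {url: {"html": ..., "markdown": ...}} mapping
# that pathik.crawl() returns. The paths here are placeholder data.
results = {
    "https://example.com": {
        "html": "/tmp/output_data/example.com.html",
        "markdown": "/tmp/output_data/example.com.md",
    },
    "https://wikipedia.org": {
        "html": "/tmp/output_data/wikipedia.org.html",
        "markdown": "/tmp/output_data/wikipedia.org.md",
    },
}

# Collect the markdown paths for downstream processing.
markdown_files = [paths["markdown"] for paths in results.values()]
print(markdown_files)
```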

pathik.crawl_to_r2(urls, uuid_str=None, parallel=True)

Crawl URLs and upload the content to Cloudflare R2.

Parameters:

  • urls: A single URL string or a list of URLs to crawl
  • uuid_str: UUID to prefix filenames for uploads (generates one if None)
  • parallel: Whether to use parallel crawling (default: True)

Returns:

  • A dictionary with R2 upload information
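If you want to control the upload prefix rather than rely on the auto-generated one, you can build your own `uuid_str`. A small sketch using the standard library (per the signature above, any unique string should work; the `pathik.crawl_to_r2` call in the comment assumes pathik is installed):

```python
import uuid

# Generate a unique run identifier to pass as uuid_str, so all uploads
# from one crawl share the same key prefix in R2.
run_id = str(uuid.uuid4())
print(run_id)

# Then, per the API above:
# r2_results = pathik.crawl_to_r2(urls, uuid_str=run_id, parallel=True)
```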

JavaScript API

pathik.crawl(urls, options)

Crawl URLs and save content locally.

Parameters:

  • urls: String or array of URLs to crawl
  • options: Object with crawl options
    • outputDir: Directory to save output (uses temp dir if null)
    • parallel: Enable/disable parallel crawling (default: true)

Returns:

  • Promise resolving to an object mapping URLs to file paths

pathik.crawlToR2(urls, options)

Crawl URLs and upload content to R2.

Parameters:

  • urls: String or array of URLs to crawl
  • options: Object with R2 options
    • uuid: UUID to prefix filenames (generates random UUID if null)
    • parallel: Enable/disable parallel crawling (default: true)

Returns:

  • Promise resolving to an object mapping URLs to R2 keys

Requirements

  • Go 1.18+ (for building the binary)
  • Python 3.6+ or Node.js 14+

Building from Source

For Python:

python build_binary.py
pip install -e .

For JavaScript:

npm run build-binary
npm install

License

Apache 2.0

Download files

Download the file for your platform.

Source Distribution

pathik-0.2.0.tar.gz (11.1 MB)

Built Distribution

pathik-0.2.0-py3-none-any.whl (10.9 MB)

File details

Details for the file pathik-0.2.0.tar.gz.

File metadata

  • Download URL: pathik-0.2.0.tar.gz
  • Size: 11.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pathik-0.2.0.tar.gz:

  • SHA256: 015c21dcf96c414b387a448ec575c5936d557720db69de7168416ddde060008d
  • MD5: 90f43e732d19b98fb886bed215ba7458
  • BLAKE2b-256: ac728c763ba8ce1d5abe68234ac8083b076bb8cdbdbdf1f2cde173000a956000
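A downloaded file can be checked against a published digest before installing. A sketch using Python's `hashlib` (the file path in the comment is a placeholder):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 8192) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare the result against the SHA256 listed above, e.g.:
# sha256_of_file("pathik-0.2.0.tar.gz") == "015c21dc..."  # placeholder path
```

Reading in chunks keeps memory use constant even for multi-megabyte archives like this one.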

File details

Details for the file pathik-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: pathik-0.2.0-py3-none-any.whl
  • Size: 10.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pathik-0.2.0-py3-none-any.whl:

  • SHA256: d127c1d8e500130c69782893d961d939dd61e12a02e71bc6a93b1a3b22dcfdc0
  • MD5: ea9e6f4051e4d0663d462071d36cbfa5
  • BLAKE2b-256: 2c4e7ee09dfadc4b8b7402a50528597b6612ce382ab0440093bac397db2ab16e
