# Pathik
A high-performance web crawler implemented in Go with Python and JavaScript bindings.
## Features
- Fast crawling with Go's concurrency model
- Clean content extraction
- Markdown conversion
- Parallel URL processing
- Cloudflare R2 integration
- Memory-efficient (uses ~10x less memory than browser automation tools)
## Performance Benchmarks

### Memory Usage Comparison
Pathik is significantly more memory-efficient than browser automation tools such as Playwright.
### Parallel Crawling Performance
Parallel crawling significantly improves performance when processing multiple URLs. Our benchmarks show:
#### Python Performance
Testing with 5 URLs:
- Parallel crawling completed in 7.78 seconds
- Sequential crawling completed in 18.52 seconds
- Performance improvement: 2.38x faster with parallel crawling
#### JavaScript Performance
Testing with 5 URLs:
- Parallel crawling completed in 6.96 seconds
- Sequential crawling completed in 21.07 seconds
- Performance improvement: 3.03x faster with parallel crawling
Parallel crawling is enabled by default when processing multiple URLs, but you can explicitly control it with the `parallel` parameter.
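The shape of the speedup reported above can be reproduced in miniature with Python's standard library. The sketch below does not use pathik at all; it simulates five fixed-latency fetches (the 0.2 s delay is an arbitrary stand-in for network latency) to show why dispatching URLs concurrently beats crawling them one at a time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

URLS = ["https://example.com"] * 5
DELAY = 0.2  # stand-in for per-request network latency


def fake_fetch(url):
    time.sleep(DELAY)  # simulate waiting on the network
    return url


# Sequential: total time is roughly len(URLS) * DELAY.
start = time.perf_counter()
for url in URLS:
    fake_fetch(url)
sequential = time.perf_counter() - start

# Parallel: the waits overlap, so total time is roughly one DELAY.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(URLS)) as pool:
    list(pool.map(fake_fetch, URLS))
parallel = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```

Because crawling is I/O-bound, the same overlap is what Go's goroutines exploit inside pathik's binary.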
## Python Installation

```bash
pip install pathik
```
## JavaScript Installation

```bash
npm install pathik
```
## Python Usage

```python
import pathik
import os

# Create an output directory with an absolute path
output_dir = os.path.abspath("output_data")
os.makedirs(output_dir, exist_ok=True)

# Crawl a single URL
result = pathik.crawl('https://example.com', output_dir=output_dir)
print(f"HTML file: {result['https://example.com']['html']}")
print(f"Markdown file: {result['https://example.com']['markdown']}")

# Crawl multiple URLs in parallel (default behavior)
urls = [
    "https://example.com",
    "https://news.ycombinator.com",
    "https://github.com",
    "https://wikipedia.org"
]
results = pathik.crawl(urls, output_dir=output_dir)

# Crawl URLs sequentially (parallel disabled)
results = pathik.crawl(urls, output_dir=output_dir, parallel=False)

# Crawl and upload to R2
r2_results = pathik.crawl_to_r2(urls, uuid_str='my-unique-id', parallel=True)
```
## JavaScript Usage

```javascript
const pathik = require('pathik');
const path = require('path');
const fs = require('fs');

// Create output directory
const outputDir = path.resolve('./output_data');
fs.mkdirSync(outputDir, { recursive: true });

// Crawl a single URL
pathik.crawl('https://example.com', { outputDir })
  .then(results => {
    console.log(`HTML file: ${results['https://example.com'].html}`);
  });

// Crawl multiple URLs in parallel (default behavior)
const urls = [
  'https://example.com',
  'https://news.ycombinator.com',
  'https://github.com'
];
pathik.crawl(urls, { outputDir })
  .then(results => {
    console.log(`Crawled ${Object.keys(results).length} URLs`);
  });

// Crawl URLs sequentially
pathik.crawl(urls, { outputDir, parallel: false })
  .then(results => {
    console.log(`Crawled ${Object.keys(results).length} URLs sequentially`);
  });

// Upload to R2
pathik.crawlToR2(urls, { uuid: 'my-unique-id' })
  .then(results => {
    console.log('R2 Upload complete');
  });
```
## Python API

### `pathik.crawl(urls, output_dir=None, parallel=True)`

Crawl URLs and save the content locally.

Parameters:

- `urls`: A single URL string or a list of URLs to crawl
- `output_dir`: Directory to save crawled files (uses a temporary directory if `None`)
- `parallel`: Whether to use parallel crawling (default: `True`)

Returns:

- A dictionary mapping URLs to file paths:

```python
{url: {"html": html_path, "markdown": markdown_path}}
```
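As a sketch of consuming that return value: the dictionary shape is exactly as documented above, but the `sample` paths below are made up for illustration, and `markdown_paths` is a hypothetical helper rather than part of pathik's API:

```python
# pathik.crawl returns {url: {"html": html_path, "markdown": markdown_path}}.
# This helper pulls out just the markdown paths for further processing.
def markdown_paths(results):
    return {url: files["markdown"] for url, files in results.items()}


sample = {
    "https://example.com": {
        "html": "/tmp/output/example.com.html",
        "markdown": "/tmp/output/example.com.md",
    }
}
print(markdown_paths(sample))  # {'https://example.com': '/tmp/output/example.com.md'}
```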
### `pathik.crawl_to_r2(urls, uuid_str=None, parallel=True)`

Crawl URLs and upload the content to Cloudflare R2.

Parameters:

- `urls`: A single URL string or a list of URLs to crawl
- `uuid_str`: UUID to prefix filenames for uploads (generates one if `None`)
- `parallel`: Whether to use parallel crawling (default: `True`)

Returns:

- A dictionary with R2 upload information
## JavaScript API

### `pathik.crawl(urls, options)`

Crawl URLs and save content locally.

Parameters:

- `urls`: String or array of URLs to crawl
- `options`: Object with crawl options:
  - `outputDir`: Directory to save output (uses a temp dir if `null`)
  - `parallel`: Enable/disable parallel crawling (default: `true`)

Returns:

- A Promise resolving to an object mapping URLs to file paths
### `pathik.crawlToR2(urls, options)`

Crawl URLs and upload content to R2.

Parameters:

- `urls`: String or array of URLs to crawl
- `options`: Object with R2 options:
  - `uuid`: UUID to prefix filenames (generates a random UUID if `null`)
  - `parallel`: Enable/disable parallel crawling (default: `true`)

Returns:

- A Promise resolving to an object mapping URLs to R2 keys
## Requirements
- Go 1.18+ (for building the binary)
- Python 3.6+ or Node.js 14+
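A quick way to confirm the Python-side requirement before installing; this sketch is not part of pathik, it only checks the interpreter version stated above:

```python
import sys

# Pathik's Python bindings require Python 3.6 or newer.
if sys.version_info < (3, 6):
    raise RuntimeError("pathik requires Python 3.6+")
print("Python version OK:", sys.version.split()[0])
```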
## Building from Source

For Python:

```bash
python build_binary.py
pip install -e .
```

For JavaScript:

```bash
npm run build-binary
npm install
```
## License
Apache 2.0
## File details: `pathik-0.2.1.tar.gz`

### File metadata

- Download URL: pathik-0.2.1.tar.gz
- Upload date:
- Size: 65.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.5

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `009d744d7f92b1fe9c85c2e44fdc2e8c54631bff7d8a26d7d658f74fb1193ec8` |
| MD5 | `a45689b4da26001dc7e02ea177b936f5` |
| BLAKE2b-256 | `3ce3544ffb5e8f415acf8abef22d45d29f0494eb890f095183ac0d400453ce1d` |
## File details: `pathik-0.2.1-py3-none-any.whl`

### File metadata

- Download URL: pathik-0.2.1-py3-none-any.whl
- Upload date:
- Size: 65.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.5

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `fccc13e14f31adcca796fb545c2d51aa01785d95e45fe7c58a2289300a29de97` |
| MD5 | `cb82b5912c795639d005c5767830c9c6` |
| BLAKE2b-256 | `a1e0775454e3ccc220154ee586dddbd889e31ff8beb1f365d3b7e580044a8dc1` |