# Pathik
A high-performance web crawler implemented in Go with Python and JavaScript bindings.
## Features
- Fast crawling with Go's concurrency model
- Clean content extraction
- Markdown conversion
- Parallel URL processing
- Cloudflare R2 integration
- Memory-efficient (uses ~10x less memory than browser automation tools)
## Performance Benchmarks

### Memory Usage Comparison
Pathik is significantly more memory-efficient than browser automation tools such as Playwright.
### Parallel Crawling Performance
Parallel crawling significantly improves performance when processing multiple URLs. Our benchmarks show:
#### Python Performance
Testing with 5 URLs:
- Parallel crawling completed in 7.78 seconds
- Sequential crawling completed in 18.52 seconds
- Performance improvement: 2.38x faster with parallel crawling
#### JavaScript Performance
Testing with 5 URLs:
- Parallel crawling completed in 6.96 seconds
- Sequential crawling completed in 21.07 seconds
- Performance improvement: 3.03x faster with parallel crawling
Parallel crawling is enabled by default when processing multiple URLs, but you can explicitly control it with the `parallel` parameter.
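The shape of the speedup reported above can be reproduced in miniature with Python's standard library. The sketch below does not use pathik at all; it simulates five fixed-latency fetches (the 0.2 s delay is an arbitrary stand-in for network latency) to show why dispatching URLs concurrently beats crawling them one at a time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

URLS = ["https://example.com"] * 5
DELAY = 0.2  # stand-in for per-request network latency


def fake_fetch(url):
    time.sleep(DELAY)  # simulate waiting on the network
    return url


# Sequential: total time is roughly len(URLS) * DELAY.
start = time.perf_counter()
for url in URLS:
    fake_fetch(url)
sequential = time.perf_counter() - start

# Parallel: the waits overlap, so total time is roughly one DELAY.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(URLS)) as pool:
    list(pool.map(fake_fetch, URLS))
parallel = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```

Because crawling is I/O-bound, the same overlap is what Go's goroutines exploit inside pathik's binary.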
## Python Installation

```bash
pip install pathik
```
## JavaScript Installation

```bash
npm install pathik
```
## Python Usage

```python
import pathik
import os

# Create an output directory with an absolute path
output_dir = os.path.abspath("output_data")
os.makedirs(output_dir, exist_ok=True)

# Crawl a single URL
result = pathik.crawl('https://example.com', output_dir=output_dir)
print(f"HTML file: {result['https://example.com']['html']}")
print(f"Markdown file: {result['https://example.com']['markdown']}")

# Crawl multiple URLs in parallel (default behavior)
urls = [
    "https://example.com",
    "https://news.ycombinator.com",
    "https://github.com",
    "https://wikipedia.org"
]
results = pathik.crawl(urls, output_dir=output_dir)

# Crawl URLs sequentially (parallel disabled)
results = pathik.crawl(urls, output_dir=output_dir, parallel=False)

# Crawl and upload to R2
r2_results = pathik.crawl_to_r2(urls, uuid_str='my-unique-id', parallel=True)
```
## JavaScript Usage

```javascript
const pathik = require('pathik');
const path = require('path');
const fs = require('fs');

// Create output directory
const outputDir = path.resolve('./output_data');
fs.mkdirSync(outputDir, { recursive: true });

// Crawl a single URL
pathik.crawl('https://example.com', { outputDir })
  .then(results => {
    console.log(`HTML file: ${results['https://example.com'].html}`);
  });

// Crawl multiple URLs in parallel (default behavior)
const urls = [
  'https://example.com',
  'https://news.ycombinator.com',
  'https://github.com'
];
pathik.crawl(urls, { outputDir })
  .then(results => {
    console.log(`Crawled ${Object.keys(results).length} URLs`);
  });

// Crawl URLs sequentially
pathik.crawl(urls, { outputDir, parallel: false })
  .then(results => {
    console.log(`Crawled ${Object.keys(results).length} URLs sequentially`);
  });

// Upload to R2
pathik.crawlToR2(urls, { uuid: 'my-unique-id' })
  .then(results => {
    console.log('R2 Upload complete');
  });
```
## Python API

### `pathik.crawl(urls, output_dir=None, parallel=True)`

Crawl URLs and save the content locally.

Parameters:

- `urls`: A single URL string or a list of URLs to crawl
- `output_dir`: Directory to save crawled files (uses a temporary directory if `None`)
- `parallel`: Whether to use parallel crawling (default: `True`)

Returns:

- A dictionary mapping URLs to file paths:

```python
{url: {"html": html_path, "markdown": markdown_path}}
```
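As a sketch of consuming that return value: the dictionary shape is exactly as documented above, but the `sample` paths below are made up for illustration, and `markdown_paths` is a hypothetical helper rather than part of pathik's API:

```python
# pathik.crawl returns {url: {"html": html_path, "markdown": markdown_path}}.
# This helper pulls out just the markdown paths for further processing.
def markdown_paths(results):
    return {url: files["markdown"] for url, files in results.items()}


sample = {
    "https://example.com": {
        "html": "/tmp/output/example.com.html",
        "markdown": "/tmp/output/example.com.md",
    }
}
print(markdown_paths(sample))  # {'https://example.com': '/tmp/output/example.com.md'}
```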
### `pathik.crawl_to_r2(urls, uuid_str=None, parallel=True)`

Crawl URLs and upload the content to Cloudflare R2.

Parameters:

- `urls`: A single URL string or a list of URLs to crawl
- `uuid_str`: UUID to prefix filenames for uploads (generates one if `None`)
- `parallel`: Whether to use parallel crawling (default: `True`)

Returns:

- A dictionary with R2 upload information
## JavaScript API

### `pathik.crawl(urls, options)`

Crawl URLs and save content locally.

Parameters:

- `urls`: String or array of URLs to crawl
- `options`: Object with crawl options:
  - `outputDir`: Directory to save output (uses a temp dir if `null`)
  - `parallel`: Enable/disable parallel crawling (default: `true`)

Returns:

- A Promise resolving to an object mapping URLs to file paths
### `pathik.crawlToR2(urls, options)`

Crawl URLs and upload content to R2.

Parameters:

- `urls`: String or array of URLs to crawl
- `options`: Object with R2 options:
  - `uuid`: UUID to prefix filenames (generates a random UUID if `null`)
  - `parallel`: Enable/disable parallel crawling (default: `true`)

Returns:

- A Promise resolving to an object mapping URLs to R2 keys
## Requirements
- Go 1.18+ (for building the binary)
- Python 3.6+ or Node.js 14+
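A quick way to confirm the Python-side requirement before installing; this sketch is not part of pathik, it only checks the interpreter version stated above:

```python
import sys

# Pathik's Python bindings require Python 3.6 or newer.
if sys.version_info < (3, 6):
    raise RuntimeError("pathik requires Python 3.6+")
print("Python version OK:", sys.version.split()[0])
```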
## Building from Source

For Python:

```bash
python build_binary.py
pip install -e .
```

For JavaScript:

```bash
npm run build-binary
npm install
```
## License
Apache 2.0
## File details: `pathik-0.2.1.tar.gz`

### File metadata

- Download URL: pathik-0.2.1.tar.gz
- Upload date:
- Size: 65.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.5

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `009d744d7f92b1fe9c85c2e44fdc2e8c54631bff7d8a26d7d658f74fb1193ec8` |
| MD5 | `a45689b4da26001dc7e02ea177b936f5` |
| BLAKE2b-256 | `3ce3544ffb5e8f415acf8abef22d45d29f0494eb890f095183ac0d400453ce1d` |
## File details: `pathik-0.2.1-py3-none-any.whl`

### File metadata

- Download URL: pathik-0.2.1-py3-none-any.whl
- Upload date:
- Size: 65.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.5

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `fccc13e14f31adcca796fb545c2d51aa01785d95e45fe7c58a2289300a29de97` |
| MD5 | `cb82b5912c795639d005c5767830c9c6` |
| BLAKE2b-256 | `a1e0775454e3ccc220154ee586dddbd889e31ff8beb1f365d3b7e580044a8dc1` |