A web crawler implemented in Go with Python bindings
Pathik
A web crawling tool implemented in Go with Python bindings. It saves each crawled page locally as HTML and Markdown, and can optionally upload the results to Cloudflare R2.
INSTALLATION
Prerequisites
- Go 1.16+
- Python 3.6+
Install Python Package
pip install pathik
Clone Repository
git clone https://github.com/yourusername/pathik.git
cd pathik
Install in Development Mode
pip install -e .
BUILDING GO BINARY
Navigate to Pathik Directory
cd pathik
Build Binary Using Script
python build_binary.py
Expected Output:
Building Go binary in /path/to/pathik
Build successful!
Binary located at: /path/to/pathik/pathik_bin
Testing binary...
Binary output: [Help text from binary]
USAGE
Python Usage
Basic Crawling
import pathik
import os
output_dir = os.path.abspath("output_data")
os.makedirs(output_dir, exist_ok=True)
urls = ["https://example.com"]
results = pathik.crawl(urls, output_dir)
for url, files in results.items():
    print(f"URL: {url}")
    print(f"HTML: {files['html']}")
    print(f"Markdown: {files['markdown']}")
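To confirm that a crawl actually wrote its files, the returned mapping can be checked against the filesystem. The sketch below assumes only the result shape shown above (each URL maps to a dict with 'html' and 'markdown' paths). find_missing_outputs is a hypothetical helper and not part of pathik.

```python
import os


def find_missing_outputs(results, keys=("html", "markdown")):
    """Return {url: [keys]} for outputs that are unset or not on disk.

    Hypothetical helper: assumes `results` has the shape returned by
    pathik.crawl in the example above.
    """
    missing = {}
    for url, files in results.items():
        # A key counts as missing if it is absent, empty, or not a file.
        absent = [k for k in keys
                  if not files.get(k) or not os.path.isfile(files[k])]
        if absent:
            missing[url] = absent
    return missing
```

An empty return value means every URL produced both files; otherwise the dict names the URLs worth retrying.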
R2 Upload (Optional)
results = pathik.crawl_to_r2(
    ["https://example.com"],
    uuid_str="my-id"
)
for url, info in results.items():
    print(f"R2 HTML Key: {info['r2_html_key']}")
    print(f"Local File: {info['local_html_file']}")
Direct Go Usage
Local Crawling
./pathik_bin -crawl -outdir ./output https://example.com
R2 Upload
./pathik_bin -r2 -uuid my-id -dir ./output https://example.com
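If the binary needs to be driven from a script without the Python bindings, the local crawl command above can be wrapped with subprocess. This is a minimal sketch: build_crawl_command and run_crawl are hypothetical names, the -crawl and -outdir flags are copied from the command above, and it is an assumption that the binary exits with a non-zero status on failure.

```python
import subprocess


def build_crawl_command(binary, outdir, urls):
    """Build the argument list for a local crawl.

    The -crawl and -outdir flags are taken from the command line shown
    in this README; no other flags are assumed.
    """
    if not urls:
        raise ValueError("at least one URL is required")
    return [binary, "-crawl", "-outdir", outdir, *urls]


def run_crawl(binary, outdir, urls):
    """Run the binary and return its stdout as text.

    Raises subprocess.CalledProcessError if the exit status is non-zero.
    """
    cmd = build_crawl_command(binary, outdir, urls)
    completed = subprocess.run(
        cmd,
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.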
TROUBLESHOOTING
Missing Binary
cd pathik
python build_binary.py
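Before rebuilding, it can help to check whether the binary is really missing or is present but not executable. A small sketch follows; binary_is_usable is a hypothetical helper, and the path to pass in is the one printed by the build script ("Binary located at: ...").

```python
import os


def binary_is_usable(path):
    """Return True if `path` is an existing file with execute permission."""
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

On Windows the execute check is not meaningful, so there this function effectively tests only that the file exists.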
Path Issues
# Use absolute paths
output_dir = os.path.abspath("./output")
Import Errors
pip uninstall -y pathik
cd pathik && pip install -e .
PROJECT STRUCTURE
- main.go - CLI interface
- crawler/ - Web crawling logic
- storage/ - File storage handlers
- pathik/ - Python bindings
  - __init__.py - Package setup
  - crawler.py - Go integration
  - simple.py - Python fallback
CONFIGURATION
Configure R2 credentials in storage.go or through environment variables.
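When credentials come from the environment, checking them up front gives a clearer error than a failed upload. The sketch below is generic: read_required_env is a hypothetical helper, and this README does not list the variable names pathik reads, so the caller supplies them (check storage.go for the real names).

```python
import os


def read_required_env(names, environ=None):
    """Return {name: value} for `names`, or raise listing every missing one.

    `environ` defaults to os.environ; pass a dict to test without
    touching the real environment.
    """
    env = os.environ if environ is None else environ
    # Treat unset and empty variables the same way.
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError(
            "missing environment variables: " + ", ".join(missing)
        )
    return {n: env[n] for n in names}
```

Reporting all missing variables at once saves a round of fix-and-rerun for each one.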
LICENSE
MIT License