TezzCrawler
A web crawler that converts web pages to Markdown so they are ready for LLM consumption.
Features
- Single page scraping with markdown conversion
- Full website crawling using sitemap.xml
- Proxy support for web scraping
- Simple CLI interface
- Easy to use as a Python package
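To give a feel for what "markdown conversion" means in practice, here is a minimal sketch built only on the standard library. It is not TezzCrawler's implementation: the `MarkdownSketch` class and `html_to_markdown` function are hypothetical names, and the sketch handles only headings, paragraphs, and links.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-Markdown converter (illustrative only, not TezzCrawler code)."""

    def __init__(self):
        super().__init__()
        self.parts = []   # collected Markdown fragments
        self.href = None  # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # <h2> becomes "## ", and so on
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            # Block elements end with a blank line
            self.parts.append("\n\n")
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html):
    """Convert a small HTML fragment to Markdown text."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


page = '<h1>Title</h1><p>See <a href="https://example.com">the site</a>.</p>'
print(html_to_markdown(page))
```

A real converter also has to deal with lists, tables, code blocks, and stray whitespace, which this sketch ignores.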
Installation
pip install TezzCrawler
Usage
Command Line Interface
- Scrape a single page:
tezzcrawler scrape-page https://example.com --output ./output
- Crawl from sitemap:
tezzcrawler crawl-from-sitemap https://example.com/sitemap.xml --output ./output
- Scrape through a proxy:
tezzcrawler scrape-page https://example.com \
    --proxy-url proxy.example.com \
    --proxy-port 8080 \
    --proxy-username user \
    --proxy-password pass \
    --output ./output
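The four proxy options above describe a host, a port, and optional credentials. HTTP clients commonly expect these combined into a single proxy URL; the sketch below shows one way that could be done. The `build_proxy_url` helper and the `http` scheme default are assumptions for illustration, and nothing here claims to be how TezzCrawler handles the options internally.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None, scheme="http"):
    """Combine proxy settings into one URL (hypothetical helper, not part of TezzCrawler)."""
    if username is not None and password is not None:
        # Percent-encode credentials so characters such as '@' or ':' do not break the URL
        credentials = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    else:
        credentials = ""
    return f"{scheme}://{credentials}{host}:{port}"


print(build_proxy_url("proxy.example.com", 8080, "user", "pass"))
```

Encoding the credentials matters when a password contains reserved characters; without it, a password such as `p:w` would be split at the colon.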
Python Package
from tezzcrawler import Scraper, Crawler
from pathlib import Path
# Scrape a single page
scraper = Scraper()
scraper.scrape_page("https://example.com", Path("./output"))
# Crawl from sitemap
crawler = Crawler()
crawler.crawl_sitemap("https://example.com/sitemap.xml", Path("./output"))
# With proxy configuration
scraper = Scraper(
    proxy_url="proxy.example.com",
    proxy_port=8080,
    proxy_username="user",
    proxy_password="pass",
)
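Crawling from a sitemap boils down to reading the `<loc>` entries of a `sitemap.xml` file and visiting each one. The sketch below shows that extraction step with the standard library, assuming a plain `<urlset>` sitemap in the sitemaps.org namespace. The `extract_sitemap_urls` function is a hypothetical name; it is not part of the TezzCrawler API and may differ from what `Crawler` does internally.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        # Skip empty <loc> elements and trim surrounding whitespace
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

for url in extract_sitemap_urls(sample):
    print(url)
```

Each extracted URL could then be handed to `scraper.scrape_page(url, Path("./output"))`. Sitemap index files, which point to other sitemaps through `<sitemap>` entries, are not handled by this sketch.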
Development
- Clone the repository:
git clone https://github.com/TezzLabs/TezzCrawler.git
cd TezzCrawler
- Install development dependencies:
pip install -e ".[dev]"
License
MIT License. See the LICENSE file for details.