TezzCrawler
A web crawler that converts web pages to Markdown so they are ready for LLM consumption.
Features
- Single page scraping with markdown conversion
- Full website crawling using sitemap.xml
- Proxy support for web scraping
- Simple CLI interface
- Easy to use as a Python package
Installation
pip install TezzCrawler
Usage
Command Line Interface
- Scrape a single page:
tezzcrawler scrape-page https://example.com --output ./output
- Crawl from sitemap:
tezzcrawler crawl-from-sitemap https://example.com/sitemap.xml --output ./output
- Scrape through a proxy:
tezzcrawler scrape-page https://example.com \
  --proxy-url proxy.example.com \
  --proxy-port 8080 \
  --proxy-username user \
  --proxy-password pass \
  --output ./output
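To illustrate what a sitemap-driven crawl starts from, here is a minimal sketch of pulling page URLs out of a `sitemap.xml` document with the standard library. It is not taken from TezzCrawler's source: `extract_urls` and `SAMPLE` are hypothetical names, and the package may parse sitemaps differently.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# A tiny inline sitemap so the example runs without network access.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""

print(extract_urls(SAMPLE))
```

Each extracted URL is a candidate for the same treatment `scrape-page` gives a single page. Note that a sitemap index file also uses `<loc>` elements, but there they point to further sitemaps rather than to pages, so a real crawler would need to tell the two apart.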
Python Package
from tezzcrawler import Scraper, Crawler
from pathlib import Path
# Scrape a single page
scraper = Scraper()
scraper.scrape_page("https://example.com", Path("./output"))
# Crawl from sitemap
crawler = Crawler()
crawler.crawl_sitemap("https://example.com/sitemap.xml", Path("./output"))
# With proxy configuration
scraper = Scraper(
    proxy_url="proxy.example.com",
    proxy_port=8080,
    proxy_username="user",
    proxy_password="pass",
)
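For readers wondering how four separate proxy settings relate to the single proxy URL that most HTTP clients expect, the sketch below shows one common way to combine them. This is an assumption about typical usage, not a description of TezzCrawler's internals: `build_proxy_url` is a hypothetical helper, and the `http` scheme is assumed because the README does not state one.

```python
from typing import Optional
from urllib.parse import quote


def build_proxy_url(host: str, port: int,
                    username: Optional[str] = None,
                    password: Optional[str] = None,
                    scheme: str = "http") -> str:
    """Combine proxy settings into one URL (hypothetical helper)."""
    credentials = ""
    if username:
        # Percent-encode so characters such as "@" or ":" do not break the URL.
        credentials = quote(username, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    return f"{scheme}://{credentials}{host}:{port}"


proxy = build_proxy_url("proxy.example.com", 8080, "user", "pass")
# Mapping in the shape accepted by HTTP clients such as requests (proxies=...).
proxies = {"http": proxy, "https": proxy}
print(proxy)
```

Percent-encoding the credentials matters when a username or password contains reserved characters; without it, a password such as `p:w` would be misread as part of the URL structure.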
Development
- Clone the repository:
git clone https://github.com/TezzLabs/TezzCrawler.git
cd TezzCrawler
- Install development dependencies:
pip install -e ".[dev]"
License
MIT License - see LICENSE file for details.