A web crawler that converts web pages to markdown and prepares them for LLM consumption

TezzCrawler

TezzCrawler is a web crawler that converts web pages to Markdown so that their content is ready for LLM consumption.

Features

  • Single page scraping with markdown conversion
  • Full website crawling using sitemap.xml
  • Proxy support for web scraping
  • Simple CLI interface
  • Easy to use as a Python package
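
Sitemap-based crawling generally comes down to reading the `<loc>` entries of a `sitemap.xml` file and visiting each one in turn. The following is a minimal sketch of that first step using only the Python standard library. It is not TezzCrawler's own implementation; the function name `extract_sitemap_urls` and the sample sitemap are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return the page URLs listed in a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    # Each page is a <url> element whose <loc> child holds the address.
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical sitemap content, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(extract_sitemap_urls(SAMPLE))
```

A real crawler would also need to handle sitemap index files, which list other sitemaps rather than pages; this sketch does not.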

Installation

pip install TezzCrawler

Usage

Command Line Interface

  1. Scrape a single page:
tezzcrawler scrape-page https://example.com --output ./output
  2. Crawl from a sitemap:
tezzcrawler crawl-from-sitemap https://example.com/sitemap.xml --output ./output
  3. Scrape through a proxy:
tezzcrawler scrape-page https://example.com \
    --proxy-url proxy.example.com \
    --proxy-port 8080 \
    --proxy-username user \
    --proxy-password pass \
    --output ./output
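
HTTP clients usually expect proxy settings as a single URL of the form `scheme://user:pass@host:port`. The sketch below shows how the four proxy options above could be combined into that form. How TezzCrawler assembles them internally is not documented here, so the helper name `build_proxy_url` and the default `http` scheme are assumptions.

```python
from typing import Optional
from urllib.parse import quote


def build_proxy_url(host: str, port: int,
                    username: Optional[str] = None,
                    password: Optional[str] = None,
                    scheme: str = "http") -> str:
    """Combine proxy settings into one scheme://user:pass@host:port URL."""
    credentials = ""
    if username:
        # Percent-encode credentials so characters such as '@' or ':'
        # cannot be mistaken for URL delimiters.
        credentials = quote(username, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    return f"{scheme}://{credentials}{host}:{port}"


print(build_proxy_url("proxy.example.com", 8080, "user", "pass"))
# prints: http://user:pass@proxy.example.com:8080
```

Encoding the credentials matters when a password contains reserved characters; without it the resulting URL can be parsed incorrectly.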

Python Package

from tezzcrawler import Scraper, Crawler
from pathlib import Path

# Scrape a single page
scraper = Scraper()
scraper.scrape_page("https://example.com", Path("./output"))

# Crawl from sitemap
crawler = Crawler()
crawler.crawl_sitemap("https://example.com/sitemap.xml", Path("./output"))

# With proxy configuration
scraper = Scraper(
    proxy_url="proxy.example.com",
    proxy_port=8080,
    proxy_username="user",
    proxy_password="pass",
)
scraper.scrape_page("https://example.com", Path("./output"))
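
When scraping several individual pages rather than a whole sitemap, it can help to keep going if one page fails. The following sketch wraps the `scrape_page(url, output_dir)` call shown above in a loop that records errors per URL. The helper `scrape_many` and the `FakeScraper` stand-in are hypothetical and not part of TezzCrawler; the stand-in exists only so the sketch runs without network access.

```python
from pathlib import Path
from typing import Iterable


def scrape_many(scraper, urls: Iterable[str], output_dir: Path) -> dict:
    """Call scraper.scrape_page for each URL; map each URL to None or its error."""
    results = {}
    for url in urls:
        try:
            scraper.scrape_page(url, output_dir)
            results[url] = None
        except Exception as exc:  # keep going if a single page fails
            results[url] = exc
    return results


class FakeScraper:
    """Hypothetical stand-in so the sketch runs without network access."""

    def scrape_page(self, url, output_dir):
        if "broken" in url:
            raise RuntimeError(f"could not fetch {url}")


results = scrape_many(
    FakeScraper(),
    ["https://example.com", "https://example.com/broken"],
    Path("./output"),
)
failed = [url for url, err in results.items() if err is not None]
print(failed)
```

In real use, `FakeScraper()` would be replaced by a `Scraper()` instance from `tezzcrawler`, assuming it raises an exception when a page cannot be fetched.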

Development

  1. Clone the repository:
git clone https://github.com/TezzLabs/TezzCrawler.git
cd TezzCrawler
  2. Install development dependencies:
pip install -e ".[dev]"
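
After the editable install, one way to confirm that the package is visible to the current interpreter is to query its installed metadata. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption, and the distribution name `TezzCrawler` is taken from the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if the install succeeded, otherwise None.
print(installed_version("TezzCrawler"))
```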

License

MIT License - see LICENSE file for details.

Download files

Source Distribution

tezzcrawler-0.3.0.tar.gz (5.1 kB)

Uploaded Source

Built Distribution

TezzCrawler-0.3.0-py3-none-any.whl (7.2 kB)

Uploaded Python 3

File details

Details for the file tezzcrawler-0.3.0.tar.gz.

File metadata

  • Download URL: tezzcrawler-0.3.0.tar.gz
  • Upload date:
  • Size: 5.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.10

File hashes

Hashes for tezzcrawler-0.3.0.tar.gz
Algorithm Hash digest
SHA256 9d0d87460bca5e30e5f67269f72aaa46bb7256fd88df20fa2e4cf4754e976410
MD5 b56d0cf65f1a5be0229f688cf94aea22
BLAKE2b-256 ed625eab7b7cd2fbc05fabc50c3f61a75009652770c86d7d4099a0c282fecd2a

File details

Details for the file TezzCrawler-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: TezzCrawler-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 7.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.10

File hashes

Hashes for TezzCrawler-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8c693352088aaddab76b19740c0b207b6515ea4b13f55fc0932f687cf4ef1761
MD5 f729ea4d23d025a632659d5dcfd7c133
BLAKE2b-256 4b472259d282c08c3be90991860f2d90f81387174e491715fbcd1f1371034986
